Rashid Sayyid: Hello everyone, and thank you for joining us today in this UroToday Journal Club recording, where we'll be discussing the recently published paper on the PI-CAI Challenge, which looks at artificial intelligence and radiologists in prostate cancer detection on MRI, and which was an international, paired, non-inferiority, confirmatory study. I'm Rashid Sayyid, a urologic oncology fellow at the University of Toronto, and I'm joined today by Zach Klaassen, Associate Professor and Program Director at Wellstar MCG Health.
This paper, a very important paper, was recently published in the Lancet Oncology with Dr. Anindo Saha as the first author. We've seen over the last decade or so that pre-biopsy MRI has emerged and is now endorsed by numerous international guidelines. This is for many reasons, including improved detection of clinically significant prostate cancer when it's incorporated in the pre-biopsy setting. It reduces the diagnosis of grade group one disease and can also potentially reduce the number of unnecessary biopsies. But there are issues that arise with the adoption of MRI, and one of them is that it is labor-intensive. This is not only a matter of having more machines available and more personnel to run them: the more scans that are performed, the more that need to be read as well. From a reader, radiologist standpoint, that does significantly increase the workload. There's also the issue of high inter-reader variability. These are some of the issues that we encounter and that we need to address with novel strategies.
One potential strategy is incorporating AI models, which have been shown to match expert clinicians in medical image analysis across numerous specialties, particularly in prostate and breast cancer. AI-assisted image interpretation can potentially address this rising demand in medical imaging worldwide. Before these AI models are adopted, however, their efficacy needs to be tested in order to allow for wide-scale adoption in the prostate cancer diagnostic space.
And so to this end, the authors and study investigators hypothesized that state-of-the-art AI models, trained using thousands of patient exams, could potentially be non-inferior to radiologists for detecting clinically significant prostate cancer using MRI, which remains the ultimate goal. They designed the Prostate Imaging: Cancer Artificial Intelligence challenge, the PI-CAI challenge, whereby they really did a pretty comprehensive job in terms of developing, training, and then externally validating an AI system for detecting clinically significant prostate cancer using a large international multi-center cohort. They then compared the performance of this AI model to, first, radiologists participating in a reader study, and second, the radiology readings from the actual radiologists who read the images when they were performed in the setting of multidisciplinary routine practice. We'll talk about this further in the methods section. I just want to highlight that although the methodology here is quite intense and dense, it's important to go over it in order to understand how this model works and how we as clinicians can potentially incorporate it into our practice in the future.
And so this was an international, paired, non-inferiority, confirmatory study. Essentially, it is making sure that this AI model is, at the very least, as good as a radiologist's reading. As a first step, algorithm developers designed the AI models using a sample of about 10,000 cases from 9,000 patients, with these images collected from four European tertiary care centers over a decade between 2012 and 2021. We'll talk about this further, but among the hundreds of models submitted, the top five performing models were selected and then combined into one algorithm.
At the same time, in parallel to these algorithms being developed, 62 radiologists were invited to participate in a multi-reader, multi-case observer study. This is the study group of radiologists, which is different from the real-world radiologists who read the images at the time the imaging was performed. We'll talk about this distinction later. These algorithm developers and radiologists were invited to participate through referrals, outreach programs, conference presentations, and most importantly, an open call on the grand-challenge.org platform.
In terms of the patient inclusion criteria, these are the patients for whom the images were performed, and these were the selection criteria that were applied. All patients who underwent imaging were adult patients, with a median age of 66, who had a high suspicion for prostate cancer: either an abnormal rectal exam or a PSA of three or greater. These patients were allowed to have had prior biopsies performed, but they could have had no prior treatment for prostate cancer and no known history of grade group two or greater disease. It's important, obviously, that all the patients who were selected for inclusion had MRI images available with complete reporting and with high image quality.
In terms of the MRIs, these were obtained on commercial 1.5 or three Tesla scanners. All these images, when they were performed as clinically indicated in routine practice, were read by at least one of 18 radiologists from the participating centers. These were quite experienced radiologists who had anywhere between one and 21 years of experience reading prostate MRIs, and they reported the findings from this imaging using the PI-RADS classification.
It's important to note that this is the real-world radiology cohort, and so, as you would expect, the patient history from the charts, as well as peer consultations, a second or a third set of eyes, were available to aid in the diagnosis. Patients with positive MRIs, defined as PI-RADS three or greater, underwent biopsies, and the targeted number of cores was two to four per lesion. Among those who had a negative MRI, meaning either no lesion at all or PI-RADS one to two, patients either had no biopsy performed or only a systematic biopsy performed with six to 16 cores. This was essentially based on clinician preference.
In this study, clinically significant prostate cancer, the ultimate outcome, was defined as grade group two to five disease. And how was this defined using the biopsy or the RP? Well, if patients underwent a radical prostatectomy, the study investigators used the whole-mount specimen to assign grade; otherwise, the biopsy specimen was used. In patients who had a negative MRI, in order to make sure that they truly had negative disease, a minimum follow-up period of three years was applied to confirm the absence of clinically significant prostate cancer. Again, this speaks to the fidelity and validity of this study, ensuring that negative is truly negative. It is a very important detail to be aware of in this study.
Now, let's just take a step back and talk about how the AI system was developed. We're not going to go into the exact details. We understand this is quite complicated and very intense from a methodology standpoint, but essentially the first step of the study was to invite the AI algorithm developers to join the study. The way this was done was through the PI-CAI challenge being hosted on the grand-challenge.org platform. It's interesting to note that this challenge will be continuously hosted until May 2027, so this gives the study investigators a chance to further optimize their model. Based on the information on the website, AI developers worldwide could opt in, and they received an annotated public data set of about 1,500 MRI cases. Their AI models were trained for detecting clinically significant prostate cancer using bi-parametric MRI, which is an important detail.
These AI models really had to complete two tasks. First, they had to localize and classify each lesion in terms of its likelihood of harboring clinically significant cancer, from zero to 100, and then classify the overall case using the same zero to 100 likelihood score. So those are two different tasks. It's also important to know that the models could use the imaging data, so the bi-parametric MRI, plus several metadata fields that were made available to the algorithm developers: age, PSA level, prostate volume, and the MRI scanner name.
Next, once the algorithm developers had developed these AI models, at the end of the development cycle, they submitted them, and the study investigators then validated these AI models. The first step is developing the model. The next step is validating it in a set of a thousand cases. This was done in a remote offline center, and everybody was fully masked to the results. Again, they used histopathology and a follow-up period of at least three years to establish the reference standard.
Then, out of all the models submitted, the study investigators independently retrained the five top-performing AI models using over 9,000 cases. Once they were trained, the nice thing is that they combined these five different models into one model using equal weighting. At this point, we have our AI model developed. In addition to the AI model that was developed and then validated by the study investigators, they needed to recruit the radiologists. It's not fair to compare the AI model alone to the real-world radiologists. You need a third sample. We can think about this as a three-arm study. You have the AI model in one arm. The second arm is the radiologists participating in this study just for the purpose of the study. And then you have the third cohort, the radiologists who read the images in a real-world, real-time setting.
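As an illustration of the equal-weighting step described above, the sketch below averages the case-level scores of five models. The discussion does not say how the scores were combined beyond "equal weighting", so the simple arithmetic mean, the function name `ensemble_score`, and the example scores are all assumptions made for illustration, not the investigators' actual code.

```python
from statistics import fmean


def ensemble_score(model_scores):
    """Combine per-model case-level scores (0 to 100) with equal weights.

    Assumption: "equal weighting" is taken to mean a plain arithmetic mean.
    """
    if not model_scores:
        raise ValueError("need at least one model score")
    return fmean(model_scores)


# Hypothetical likelihood scores from five models for a single MRI exam.
scores_for_one_case = [82.0, 75.0, 90.0, 68.0, 85.0]
combined = ensemble_score(scores_for_one_case)
print(combined)
```

With equal weights, no single model dominates the combined score, which is one common reason to ensemble independently developed models.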
These radiologists, the second cohort, were also invited using the grand-challenge.org platform, and they read 400 MRI exams that were randomly sampled from the testing cohort. It's important that the investigators selected radiologists who were experienced, so all of them had extensive experience reading multi-parametric MRI using the PI-RADS scoring system, with a median experience of seven years. It's important to note that none of these radiologists had participated in or were working at one of these seven centers. If we look at their expertise, based on the ESUR/ESUI consensus statements, 74% of them were self-designated as experts. So really, an experienced cohort here.
These radiologists were not asked to read all 400 MRIs. That is obviously labor-intensive and would decrease the chance that these radiologists would be willing to take part in this study. So the investigators used a split-plot design where the readers and cases were randomly distributed into four different blocks of 100 cases each, and each of these radiologists had to read the images in two sequential rounds. First, they looked at the bi-parametric imaging and the same metadata that was available to the AI system, meaning the prostate volume, PSA, and so on. Then, as the second step, these radiologists had to read the full multi-parametric MRI study, so not bi-parametric but multi-parametric, and the readers could use the additional information from the extra sequences to update their findings.
But it's important to note that these readers did not have access to patient history and could not consult with their peers. That is different from the cohort of radiologists who read these images in a real-time setting. The study readers really had less information, and this distinction helps parse the two comparisons. It's also important that, in the context of this study, only the multi-parametric MRI readings were considered for the analysis.
In terms of the statistics, we can really think about this as two pairwise comparisons. We compare the AI system to the 62 radiologists who participated just for the purpose of the study, and then we also compare the AI system to the historical radiology readings that were made during clinical practice.
The primary hypothesis was that the standalone AI system would be non-inferior to both sets of radiologists. If it was non-inferior, then the investigators could also test for the superiority of the AI system, which is pretty standard in these non-inferiority designs.
When we talk about the test statistics for the comparisons, when the investigators compared the AI model to the 62 radiologists, they used the area under the receiver operating characteristic curve. When they were comparing the AI model to the historical radiology readings, they essentially looked at the difference in specificity when the AI system was set to the same sensitivity as the PI-RADS three or greater threshold.
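To make the second comparison concrete, here is a minimal sketch of "specificity at matched sensitivity": compute the radiologists' sensitivity and specificity at PI-RADS three or greater, then pick the AI score threshold that reaches at least the same sensitivity and compare specificities. The function names and the tiny data set are hypothetical, and the threshold search is a simplification assumed for illustration rather than the study's actual statistical procedure.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    return tp / (tp + fn), tn / (tn + fp)


def specificity_at_matched_sensitivity(labels, scores, target_sensitivity):
    """Find the highest score threshold whose sensitivity reaches the target."""
    # Lowering the threshold can only raise sensitivity, so the first
    # threshold that reaches the target also has the best specificity.
    for threshold in sorted(set(scores), reverse=True):
        predictions = [s >= threshold for s in scores]
        sens, spec = sensitivity_specificity(labels, predictions)
        if sens >= target_sensitivity:
            return threshold, sens, spec
    raise ValueError("target sensitivity cannot be reached")


# Hypothetical toy data: 1 = clinically significant cancer, 0 = not.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pirads = [5, 4, 3, 2, 3, 4, 1, 2, 1, 2]            # radiologist PI-RADS scores
ai_scores = [90, 80, 60, 30, 70, 20, 10, 40, 5, 15]  # AI case-level scores, 0 to 100

rad_sens, rad_spec = sensitivity_specificity(labels, [p >= 3 for p in pirads])
threshold, ai_sens, ai_spec = specificity_at_matched_sensitivity(
    labels, ai_scores, rad_sens
)
specificity_difference = ai_spec - rad_spec
print(threshold, ai_sens, ai_spec, specificity_difference)
```

Fixing sensitivity first means both readers miss the same share of cancers, so any remaining difference shows up purely as more or fewer false positives.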
And when we talk about non-inferiority, non-inferiority would be concluded if the test statistic was greater than zero and the lower boundary of the 95% confidence interval was greater than negative 0.05. If non-inferiority was concluded based on these criteria, then the superiority of the AI system over either set of radiologists was assessed.
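The decision rule just described can be sketched as a small function. The non-inferiority conditions follow the wording in this discussion (test statistic above zero, lower 95% confidence bound above negative 0.05). The superiority condition used here, a lower confidence bound above zero, is an assumption based on common convention and is not stated in the discussion; the function name and example numbers are hypothetical.

```python
def assess(test_statistic, ci_lower, margin=0.05):
    """Return (non_inferior, superior) for an AI-minus-radiologist difference.

    Non-inferiority: as described in the talk.
    Superiority: ASSUMED to require the lower CI bound to exceed zero,
    and only tested once non-inferiority has been concluded.
    """
    non_inferior = test_statistic > 0 and ci_lower > -margin
    superior = non_inferior and ci_lower > 0
    return non_inferior, superior


# Hypothetical differences and lower 95% confidence bounds.
print(assess(0.05, 0.01))   # whole interval above zero
print(assess(0.01, -0.03))  # interval crosses zero but stays above the margin
print(assess(0.01, -0.08))  # interval extends below the margin
```

Testing superiority only after non-inferiority is a hierarchical approach, which is why the second flag can never be true when the first is false.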
At this point, thank you for bearing with us through this dense methodology section, but it's critical to understand it in order to contextualize the results. Zach will now go over the results, covering the characteristics of the cohort and then looking at how this AI model compared to the radiologists.
Zach Klaassen: Rashid, thanks so much for that great introduction and overview of the methodology. Before we look at the table on the right, we can highlight that between June 12, 2022 and November 28, 2022, there were 809 individuals from 53 countries who opted into the development of the AI system. This resulted in 293 AI algorithms being submitted. And as Rashid mentioned, the top five models that were used came from the University of Sydney, the University of Science and Technology in China, the Guerbet Research Center in France, the Istanbul Technical University, and Stanford University.
When we look at the patient distribution across the cohorts, there was a total of 9,129 patients with a median age of 66 years. The median PSA at the time of the study was eight, and the median prostate volume was 61 mL. When we look at the MR scanners, the majority of these were Siemens and Philips medical system scanners. And when we look at the field strength from the Tesla standpoint, roughly half were 1.5 and the other half were three. When we look at the cases, 76% of patients had benign findings or indolent prostate cancer. And the important part here is that 24% of patients had clinically significant prostate cancer, defined as greater than or equal to Gleason grade group two.
With regard to the positive MRI lesions, 16% were PI-RADS three, 47% were PI-RADS four, and 37% were PI-RADS five. Finally, when we look at the ISUP-based lesions, we see that 40% of patients were Gleason grade group one, 31% were grade group two, 14% were grade group three, 6% were grade group four, and 8% were Gleason grade group five.
This is the ROC curve of the AI system and the pool of 62 radiologists. As Rashid laid out, this is the reader study, with 400 testing cases, and we see that the teal line here is the AI system, with an area under the ROC curve of 0.91. The radiologists are in purple, with an area under the ROC curve of 0.86. As Rashid nicely laid out in the methods, the key finding in this study is that the AI system in the reader study was non-inferior, was thus tested for superiority, and was deemed superior to the radiologists on this outcome. When we look at this next figure, this is the difference in the ROC metric between the AI system and the pool of 62 radiologists. This is a little bit more visually appealing in terms of understanding non-inferiority and superiority. We can see here that the line is to the right of both the non-inferiority margin and the superiority margin, favoring the AI system.
So what does this really mean in terms of real-world outcomes, and how does this operationalize to the clinic? What this means is that the AI system, compared with radiologists at their PI-RADS three or greater operating point, detected 6.8% more clinically significant prostate cancer at the same specificity, or produced 50.4% fewer false positives and 20% fewer Gleason grade group one cancers at the same sensitivity as the radiologists.
This is the ROC curve of the AI system and the PI-RADS operating points of the radiology reads made during routine multidisciplinary practice, and this was in the thousand testing cases. This is the "real-world experience." We see here that the AI system had an area under the ROC curve of 0.93, and when we compare this to the radiologists, the AI system was non-inferior to radiologists in routine multidisciplinary practice, but was not deemed superior.
And again, when we look at this from almost a forest plot standpoint, we see that the difference in specificity at matched sensitivity was negative 0.2. Again, we see that this line and its confidence interval are to the right of the non-inferiority margin, but not to the right of the superiority margin. So again, the AI system was non-inferior compared to radiologists in routine clinical practice.
What does this all mean? Basically, the PI-CAI challenge showed that a state-of-the-art AI system was superior in discriminating patients with clinically significant prostate cancer on MRI compared to 62 radiologists using PI-RADS version 2.1 in an international reader study. Secondly, it was non-inferior when compared to the standard of routine care in radiology practice. Why was it not superior? There are several possible reasons for this. One is that these radiologists had access to patient history. They were able to consult with their peers if it was a difficult case. They could ask other radiologists, attend conferences, and have case presentations, and they were also potentially more familiar with the protocol. This is perhaps why there was no superiority of the AI system over radiologists in routine radiology practice.
What we see in this study is that the predictive values for the AI system were very high: 89.5% sensitivity at 79.1% specificity, as well as a 93.8% negative predictive value at an estimated 33% prevalence. Although it's somewhat difficult to look at these results and compare them to radiologists in other previous analyses, when we do compare them to multi-parametric MRI in the PROMIS trial, the radiologists in PROMIS had 88% sensitivity at 45% specificity and a 76% negative predictive value at an estimated 53% prevalence.
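The negative predictive values quoted above depend on prevalence, and they can be reproduced from sensitivity, specificity, and prevalence with the standard formula NPV = spec × (1 − prev) / (spec × (1 − prev) + (1 − sens) × prev). The short sketch below applies it to the two sets of figures just mentioned; the function name is hypothetical, and the inputs are the rounded numbers from this discussion.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a negative test is a true negative."""
    true_negatives = specificity * (1 - prevalence)
    false_negatives = (1 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# Figures as quoted in the discussion (rounded values).
npv_ai = negative_predictive_value(0.895, 0.791, 0.33)
npv_promis = negative_predictive_value(0.88, 0.45, 0.53)
print(f"AI system: {npv_ai:.1%}, PROMIS radiologists: {npv_promis:.1%}")
```

Because the two cohorts have different prevalence (33% versus 53%), part of the gap in negative predictive value comes from the populations themselves, which is one reason these cross-study comparisons are difficult.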
When we look at two meta-analyses of 42 studies for radiologists, the sensitivity was 96% at a 29% specificity, with a 90.8% negative predictive value. When we look at the limitations, there are several. This was a well-designed study, and the authors did a good job of highlighting several limitations. The first is that the data set was retrospectively curated over several years and multiple sites, which led to a mixture of consecutive patients and samples.
Secondly, radiologists provided their assessment of retrospective data through an online reading environment, and this may have differed significantly from their day-to-day local workflow.
Third, biopsy planning and histological verification were guided by the original radiology read and not prospectively by either the study radiologists or the artificial intelligence system.
Fourth, this study may have been hampered by differential verification bias. This means that patient examinations are verified, but multiple reference standards, such as biopsies, prostatectomies, and follow-up, are combined to establish the presence or absence of significant cancer.
And finally, there was no data on ethnicity, and 93.4% of all MRI exams were acquired on scanners from one MRI manufacturer. Certainly, reproducibility needs to be confirmed on other MRI systems, as well as in other races across the prostate cancer spectrum.
In conclusion, an AI system was superior to radiologists using PI-RADS version 2.1 at detecting clinically significant prostate cancer and comparable to the standard of care in routine radiology practice. Such a system shows the potential to be a supportive tool within a primary diagnostic setting, with several potential associated benefits for patients and radiologists.
And finally, prospective validation, which is ongoing in the CHANGE trial, is needed to test the clinical applicability of this system.
Thank you very much for your attention. We hope you enjoyed this UroToday Journal Club discussion of the PI-CAI study, published recently in the Lancet Oncology.

