Summary
The CAP Artificial Intelligence Committee is developing a guide for AI implementation in laboratories, covering the entire machine learning life cycle. Laboratories should validate AI tools rigorously, even if purchased from vendors with FDA approval, focusing on clinical impact, biases, and population drift.
Charna Albert
December 2025—As more laboratories explore the use of artificial intelligence, the CAP is working behind the scenes to develop a guide to AI implementation.
The CAP Artificial Intelligence Committee is taking the lead on the guide, which will tackle the entire machine learning life cycle from ideation and model development to validation, deployment, monitoring, and reporting. The guide will be submitted in 2026 to the Archives of Pathology & Laboratory Medicine.
Laboratories need not wait for formal guidance to begin using AI, but they should be aware that a full validation is required, even if the tool is purchased from a vendor and has the Food and Drug Administration’s stamp of approval.
Nicholas Spies, MD, AI Committee member and medical director of applied AI at ARUP Laboratories’ Institute for Research and Innovation in Diagnostic and Precision Medicine, says laboratories should approach AI validation with at least as much rigor as they would a typical laboratory assay, at least “until we have a robust set of best-practice guidelines and a more well-defined regulatory environment.” On the other hand, a typical assay validation won’t necessarily be sufficient for validating an AI-driven tool, he says, “especially when it comes to clinical impact, edge cases, biases, and population drift.”

“My biggest takeaway is that even if something is plug and play, you still have to make sure once you plug it in that it’s not going to cause any downstream consequences before turning it on,” he says.
Early this year, the University of Texas Medical Branch in Galveston implemented the Ibex Medical Analytics prostate and breast cancer algorithms, which provide cancer heat maps and other AI-driven clinical decision support. The pathology department adopted digital primary diagnosis in 2021, says Harshwardhan Thaker, MD, PhD, chair ad interim. “All our pathologists are comfortable viewing digital slides, so the addition of AI assistance was a natural next step,” he says.
To perform the clinical validation for the Ibex Prostate model, “we selected a representative mix of cases from our archives,” Dr. Thaker says. In all, Dr. Thaker and his colleagues selected 318 slides: 214 benign and 104 malignant. “I personally reviewed each of the slides to confirm the original diagnosis and, if malignant, the Gleason score.” In choosing the slides, they sought a mix of high- and low-grade cancers and varying amounts of cancer, including cores with only a small amount of cancer. They also included variant forms of cancer and some benign mimics.
“We subjected this set of slides to AI analysis and noted the readout of the algorithm,” he says. Ninety-one percent of the cancerous slides were read out as having a high probability for cancer, he says, while the remaining cancers were given a medium probability. “None of the slides labeled low probability had cancer. In other words, the sensitivity of the AI to detect cancer, either high or medium, was 100 percent,” he says. They did have two benign cases the algorithm predicted as high probability for cancer, he says, making the specificity of a high readout for cancer 99 percent.
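Those figures follow directly from the reported counts. As a minimal sketch of the arithmetic, assuming the numbers as quoted (104 malignant and 214 benign slides, roughly 91 percent of cancers flagged high probability, the rest medium, and two benign slides flagged high), the sensitivity and specificity can be tallied as follows; the per-slide Ibex outputs themselves are not reproduced here.

```python
# Illustrative only: recomputing the validation metrics from the counts
# reported in the article (104 malignant slides, 214 benign slides, two
# benign slides flagged as high probability). The per-slide Ibex readouts
# are assumptions for illustration, not the laboratory's actual data.

malignant_total = 104
benign_total = 214

# Reported readouts on the malignant slides
malignant_high = round(0.91 * malignant_total)        # ~95 slides read as high probability
malignant_medium = malignant_total - malignant_high   # the remaining cancers read as medium
malignant_low = 0                                      # no cancers labeled low probability

# Reported readouts on the benign slides
benign_flagged_high = 2                                # two benign cases flagged high

# Sensitivity, counting either a high or medium readout as "cancer detected"
sensitivity = (malignant_high + malignant_medium) / malignant_total

# Specificity of a high readout: benign slides not flagged as high probability
specificity_high = (benign_total - benign_flagged_high) / benign_total

print(f"Sensitivity (high or medium readout): {sensitivity:.0%}")        # 100%
print(f"Specificity of a high readout:        {specificity_high:.1%}")   # ~99.1%
```

Counting a medium-or-high readout as a positive call matches how Dr. Thaker quotes the sensitivity figure; the 99 percent specificity applies specifically to the high-probability readout.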
The validation for the breast algorithm, he says, followed the same method and had similar sensitivity and specificity. “These results, using our cases in our own hands, provided us with the confidence to deploy the model in routine clinical practice.”
Dr. Thaker is quick to note the Ibex algorithm is an assistive tool. “This model is solely intended as a ‘second look’ quality assurance tool and does not in any way substitute for the work of our pathologists, who still review every slide and can overrule the AI,” he says, adding, “They maintain full responsibility for the final diagnosis.”

Vidarshi Muthukumarana, MD, GU pathologist at UTMB, uses the tool daily as she signs out prostate cases. “It’s a very sensitive tool, which means it hasn’t missed any cancers, at least for me,” says Dr. Muthukumarana, assistant professor and fellowship director in surgical pathology and cytopathology. “But it will pick up other things that mimic cancer because of its sensitivity, so to me it’s more like a screening tool.”
For Dr. Muthukumarana and others in the pathology department, Ibex has made it easier to prioritize high-risk cases on days when the biopsy service workload is heavy. The tool analyzes the department’s slides overnight as they’re scanned, sorting them by likelihood of cancer before the workday begins, says Mariangela Gomez, MD, breast pathologist and assistant professor. It’s an especially effective point in the workflow at which to save time, Dr. Gomez explains, because the laboratory’s cutoff for immunostain ordering is in late morning, and if the stains aren’t ordered the same day, they risk adding a day’s turnaround time to a high-priority case. Once a high-priority case is opened in Ibex, “it will take you directly to the slide that it thinks needs immunos,” she says.
Dr. Muthukumarana estimates it has cut by half the time she used to spend on case triage. “I only look at slides that have medium-to-high likelihood [for cancer] in Ibex, to see if I agree with it or if I need to order immunostains because I’m not sure,” she says. She can confidently exclude the benign ones at the time of triage, she says, “because the sensitivity and negative predictive value are good.” Every slide gets evaluated after that critical window when immunostains are ordered. “I’m still spending the same time on digitally scanned slides to diagnose the cancer,” she says. “I still have to look at a two-part prostate biopsy, 36 slides. That hasn’t stopped or [been] reduced.”
The prostate algorithm can do Gleason grading, but the pathology department isn’t using that feature now, Dr. Muthukumarana says. She’s interested in using it “once it’s validated and we can do our own institutional study on it and once we have good concordance rates,” she says. “That would reduce my time spent on prostate biopsy, but more than grading, simple things like measuring the length of the cancer, giving the percentage [involvement]—that would be very useful for me.” Even with their digital pathology setup, she says, they’re still using rulers to measure cancer length. “And that takes a lot of time.” Immunostain interpretation, in her view, also could be a good use case for AI, given that “immunostains are not driven by detailed analysis. It’s more of a positive/negative thing.”
Another potential use case: evaluating prostate chip specimens, which come in large numbers of mostly benign fragments. “When we look at chips, our main challenge is to not miss an incidental focus of cancer,” she says. Ibex has been receptive to making updates, she says, including the ability to analyze prostate chip specimens. “That’s something we brought to them, and they’re working on it.”

To the breast algorithm, Ibex recently added the ability to highlight benign mimics of cancer, such as sclerosing and proliferative lesions, Dr. Gomez says. She and her colleagues use the tool to identify these and to further discriminate among usual ductal hyperplasia, atypical ductal hyperplasia, and ductal carcinoma in situ, as well as to detect microcalcifications. Ibex also has a feature that differentiates ductal from lobular origin, she says, which in current practice requires immunostains.
Another upcoming decision support tool from Ibex will delineate HER2 expression into 0, 1+, 2+, or 3+ based on the 2023 ASCO/CAP guidelines. Dr. Gomez and her colleagues are testing the tool, which is pending FDA approval, for the low scores, and she says it’s performing well. They expect to bring it in, she says, with an eye to reducing the number of cases reflexed to FISH.
Dr. Gomez too points to fixes that would improve the product. The way it’s set up at UTMB, for example, rather than being able to view the Ibex heat map in Epic, “you need to open a different window on Ibex to see the case. You see two different screens with the same image, one with the heat map and one without.” They hope to integrate it with Epic, she says.
Residents at UTMB do not get access to Ibex, Dr. Gomez says. They want trainees to learn to detect neoplasms by their histomorphological features rather than rely on Ibex’s heat map. They do show residents interesting cases on Ibex, and they’ll review a case with them using Ibex, “but they don’t get their own user ID.” The department took a similar approach when it went digital, she notes. “We noticed the first years, when we were doing slide sessions, didn’t know how to turn on a microscope, how to focus.” Now, everyone learns on glass slides. “We want them to know what a microscope is and how to use it,” she says, because there’s no way to know where a resident will end up practicing.
Surgical pathology fellows, Dr. Muthukumarana says, are given Ibex access after six months of training. “They’re advanced, they can use it, but not at the beginning because we need to know how much knowledge they have first.” When she works with residents, she uses it as an educational tool. “I open Ibex with them and go through the case with them and say, ‘Let’s see what you thought and let’s see what Ibex thought.’ It’s something like comparing with another resident. They have to learn to use these tools, because this is the future of pathology, and we are one of the institutions pioneering this.”
As the work to develop AI implementation guidance continues, the CAP AI Committee and others are beginning to chart the relatively new waters of model performance evaluation.
“We’re still wrapping our heads around how we validate these systems, how we get them implemented and deployed,” says committee chair Matthew Hanna, MD, vice chair of pathology informatics at the University of Pittsburgh Medical Center. “In the same vein, we all know we need to be monitoring these systems over time.”

Not long ago, Dr. Hanna and a number of coauthors wrote in a CAP concept paper that the evidence available to develop formal guidelines for evaluating model performance was lacking (Hanna MG, et al. Arch Pathol Lab Med. 2024;148[10]:e335–e361). At the time, he recalls, he and his coauthors felt the number of peer-reviewed clinical validation studies on machine learning applications in use in clinical practice wasn’t sufficient.
Now, his thinking has changed. “It has only been a year or so since the paper was published, but I think we’re on the cusp of having a sufficient number of publications and data as the field has been ramping up and the technology has been evolving so rapidly.”
That rapid pace of change, on the other hand, could complicate efforts to develop guidance. “We may find these validated examples are not all using similar processes or systems, and that if we do subgroup analyses, they may be using very different technology stacks,” he says, making it difficult to draw conclusions from the published examples. Still, the AI Committee may soon initiate the process to develop formal guidelines, he says.
In the absence of guidelines, the CAP concept paper provided 15 recommendations for performance evaluation of machine learning models in pathology. One is on the need for ongoing monitoring. “Performance evaluation isn’t a one and done,” Dr. Hanna says. “You have to continue monitoring performance to make sure it doesn’t drift or shift or change, especially as other quality control systems need to be put in place to monitor these models moving forward after deployment.”
Some forms of model shift are drastic and immediate. Say, for example, a model pulls data from the laboratory information system or electronic health record, and the institution implements a new system. “Because that new electronic health record or LIS has a different data format or data architecture,” he says, the model could shift dramatically. “Those are easy to detect, when there’s large, high-scale errors,” especially because the models used in clinical practice typically assist usual workflows. “The pathologist or laboratory will be able to see there’s something wrong here, and that it’s happening in almost every case.” Other forms of drift stem from small changes—minor variations in patient population, staining properties, or reagents, for example—and could potentially affect a single laboratory test, making up only one component of the model. The model may drift incrementally, too, Dr. Hanna says, adding to the detection difficulties.
At the University of Pittsburgh’s year-old Computational Pathology and AI Center of Excellence, where Dr. Hanna is on faculty, the operating procedures for monitoring performance were developed internally because of the lack of formal guidelines. They are running daily controls for a clinically validated AI-based immunohistochemistry quantification tool, and they have instituted monthly monitoring as a check on that and other models to ensure they haven’t drifted compared with their original standard output. The other models include an algorithm that detects acid-fast bacilli and one that assists with prostate cancer diagnosis. “For those, we have a standardized data set and a ground truth,” he says. “Our goal is to check at least every six months to see how the model performs over time in practice.”
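In practical terms, a check like the one Dr. Hanna describes can be as simple as re-running the model on a fixed, ground-truthed reference set and comparing its current calls with the calls recorded at initial validation. The sketch below is illustrative only and not UPMC’s actual procedure; the ReferenceCase structure, the run_model hook, and the 95 percent threshold are hypothetical placeholders.

```python
# A minimal sketch of periodic drift monitoring against a fixed reference set.
# Not an actual laboratory procedure: the data structure, model hook, and
# concordance threshold below are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class ReferenceCase:
    case_id: str
    ground_truth: str         # e.g., "positive" / "negative" for an AFB detector
    baseline_prediction: str  # what the model produced at initial validation

def run_model(case_id: str) -> str:
    """Placeholder for the deployed model's current prediction on a stored case."""
    raise NotImplementedError  # wire up to the actual inference service

def drift_check(reference_set: list[ReferenceCase], threshold: float = 0.95) -> bool:
    """Re-run the model on the reference set; compare with baseline output and ground truth.

    Returns True if concordance with the baseline predictions and accuracy
    against ground truth both stay at or above the chosen threshold.
    """
    current = {c.case_id: run_model(c.case_id) for c in reference_set}
    n = len(reference_set)
    concordance = sum(current[c.case_id] == c.baseline_prediction for c in reference_set) / n
    accuracy = sum(current[c.case_id] == c.ground_truth for c in reference_set) / n
    ok = concordance >= threshold and accuracy >= threshold
    status = "OK" if ok else "ALERT, review the model"
    print(f"{status}: concordance {concordance:.1%}, accuracy {accuracy:.1%} on {n} reference cases")
    return ok
```

Run on a monthly or semiannual schedule, a comparison like this catches abrupt shifts, such as a new LIS data format, as well as the slower, incremental drift Dr. Hanna describes, provided the reference set itself remains representative of current practice.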
The daily control—something quite familiar in the laboratory—is one option for performance monitoring, though in his view there are arguments for and against it.
“Different schools of thought will claim different things, but at the end of the day, we all agree performance monitoring over time has to happen,” he says. “But in terms of the interval and how that happens, we still haven’t figured out all the details, I’d say.”
When an AI model is deployed in a real-world clinical setting, its performance may deteriorate.
For one thing, “models don’t perform as well as they performed in the training data set,” says Judy Wawira Gichoya, MD, MS, codirector of the health care AI innovation and translational informatics laboratory and associate professor in the Department of Radiology and Imaging Sciences at Emory University School of Medicine. The culprit there is model drift, Dr. Gichoya says. The other problem is that the models “tend not to be very valuable” in practice, she says. At Emory, for example, a model that alerts physicians to patients with an intracranial hemorrhage has largely been ignored, owing to high rates of alerts for inpatients. “These are patients already in the hospital, most likely because they just had surgery, and blood in the brain is expected of them,” she says. Another common problem is fragmented workflow implementation: “If you have a model deployed in radiology, and you’re telling someone to log in to another system, even if your model is perfect, no one wants to do that. They already have four systems they have to log in to.”

What’s needed, she says, and not being done now, is real-world surveillance and monitoring. “We don’t understand what things in the real world cause the AI system to drop in its performance. That’s been the gap that is unpredictable, and the reality is humans are unpredictable.” After a tool has been deployed, “the reality of real-world messiness sets in. People are going to be people,” she says, and that means they’re unpredictable and may use it incorrectly or ignore it if they feel it isn’t helpful.
As AI adoption ramps up and performance monitoring becomes a serious need, Dr. Gichoya hopes the human factors will receive more attention. “How does Judy work at 3 am when she’s tired?” for example, after a long shift. The human factors are impossible to account for without a real-world monitoring system. “We need new tooling. We need new types of skills for us to think about this, and we need to move away from snapshots of model evaluations,” she says, to companion monitoring systems that assess how individuals work with AI and AI’s impact on the entire institution and on patient care.
On par with the problem of human factors is the real-world messiness of clinical deployment. One example: At the beginning of her career, Dr. Gichoya developed an algorithm to detect when a patient has an inferior vena cava filter. The development itself was simple; incorporating it into clinical practice was difficult.
“If I’m going to be reviewing 1,000 cases to cut [only] one filter a day, I just can’t do it,” she says. That patient may already have an upcoming appointment with a doctor to discuss the filter, or the patient may be on anticoagulation or unable to be anticoagulated, or they may be in palliative care. “And it’s hard,” she says, “because every site is different, and they all have different workflows.”
Moreover, those developing the models may not have a full grasp of the complexities of clinical workflows. That’s one reason why she advocates greater participation from medical professionals, and from laboratory medicine experts in particular. Laboratory medicine, she notes, is highly regulated, and that could make laboratory professionals well suited to the project of developing better AI governance. “That’s somewhere we can transfer knowledge and put it in a new domain. Number two is that everyone using or building AI is probably using some form of a lab component,” she says. Her advice: “Figure out when your institution is building AI how they’re using data from the lab. Do you think they’re carrying out the right assumptions? And then think about participating.”
“We underestimate how difficult it is to deploy meaningful AI,” she continues. “It’s going to be done by people in the trenches.”
Charna Albert is CAP TODAY senior editor.