
Newsbytes


CAP working group targets machine-learning issues

July 2022—If a machine-learning algorithm is trained to help detect cancer in whole slide images at one health care location, shouldn’t the same algorithm work on digital slides from a similar patient population at another site?

“We expect the answer to be yes, but there is a lot of evidence that the answer is no,” says Michelle Stram, MD, clinical assistant professor, Department of Forensic Medicine, New York University.

Generalizability in machine learning refers to the ability of a model to make accurate predictions on data sources that were not included in its training set, and it can have important implications in pathology. This is especially true in whole slide imaging because multiple factors can negatively affect the performance of a machine-learning algorithm, says Dr. Stram. Understanding generalizability and the variables in pathology that can affect it is the focus of a project undertaken by the CAP Machine Learning Working Group, of which Dr. Stram is a member.
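To make the idea concrete, the sketch below, which is purely illustrative and not part of the working group's project, trains a toy classifier on synthetic "site A" data and then scores it on an internal hold-out set and on synthetic "site B" data whose feature distribution is shifted, a rough stand-in for scanner or staining differences. All names, sizes, and numbers are placeholders.

```python
# Illustrative only: a toy model trained at one "site" and tested at another.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_site(n, shift=0.0):
    """Synthetic two-class data; `shift` mimics a site-specific intensity offset."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 1.5 + shift, scale=1.0, size=(n, 8))
    return X, y

X_train, y_train = make_site(2000)                   # "site A" training data
X_internal, y_internal = make_site(500)              # internal hold-out, same distribution
X_external, y_external = make_site(500, shift=1.0)   # "site B" with a distribution shift

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("internal accuracy:", accuracy_score(y_internal, model.predict(X_internal)))
print("external accuracy:", accuracy_score(y_external, model.predict(X_external)))
# The external score is typically much lower: the same model does not generalize
# to data whose distribution differs from its training set.
```

The gap between the two scores is, in miniature, the generalizability problem the working group is trying to characterize for whole slide images.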

“Pathologists are able to recognize cancer in one slide versus another slide, and if the preparation looks a little different, we overlook that because it’s not important to the diagnosis,” she says. But with whole slide imaging, the variables that could impact how an algorithm performs include not only differences in how a slide was prepared but also variability in the color profile, contrast, and brightness that could result from using different scanners.


“One of the problems is that even when you are a large hospital network, you still have a limited number of sites and labs, and you don’t really get an idea of how much variability there is,” Dr. Stram says.

That’s where the CAP Machine Learning Working Group comes in. The group is using data the College collected from medical sites worldwide, through its various programs, to develop a broad perspective on the factors that can affect a machine-learning algorithm’s ability to generalize information. The working group has also reached out to the HistoQIP Whole Slide Image Quality Improvement Program, a joint undertaking of the CAP and National Society for Histotechnology, to obtain data for assessing machine-learning generalizability. The data are generated from whole slide images from different scanners and slides prepared at a variety of histology labs, Dr. Stram says.

The HQWSI program provides feedback to laboratories that use whole slide imaging for clinical applications. More specifically, an expert panel of pathologists, histotechnicians, and histotechnologists identifies issues in digital whole slide images and corresponding glass slides that are submitted. These issues range from defects in histology preparation, such as knife marks or folds in the tissue, to image imperfections introduced in the scanning process, like blurry patches or incorrect color tones, Dr. Stram says. The program also collects background demographic data from laboratories that submit slides, including the type of hospital from which the specimen was obtained, types of pathology services offered, and even details about the stains used, such as whether H&E staining was automated or performed manually.

The relevance of this is not lost on Matthew Hanna, MD, director of digital pathology informatics at Memorial Sloan Kettering Cancer Center, who leads the CAP Machine Learning Working Group. Dr. Hanna and several of his colleagues at Memorial Sloan Kettering published a paper in Nature Medicine that addressed, as one of its subgoals, the generalizability of machine-learning algorithms in pathology (doi.org/10.1038/s41591-019-0508-1). The study found that a model trained on an uncurated data set of slides, which reflected interlaboratory variation, outperformed a model trained on a curated data set; the latter did not generalize well to slides prepared at other institutions or with interlaboratory variation.

“That suggested that the scanner you are using and the histology preparation—even though they may look the same to us—may not look the same to the algorithm,” Dr. Stram says.

To investigate this further, the working group has procured and begun examining 568 slide images from the HQWSI program. The group has also received the complete list of questions that the HQWSI asks laboratories about their institutions and how their slides are prepared. (It is awaiting answers to these questions.)

During the pilot stage of its investigation, the working group plans to study digital prostate cancer slides because it wants to focus on a smaller subset of images and because prostate slides "would be the most interesting to the most people," Dr. Stram says. She estimates that of the 568 slides the working group had obtained as of CAP TODAY press time, between 20 and 40 were prostate cancer biopsies.

The first step of the project, Dr. Stram explains, will involve quantifying and delineating the factors that could affect a prostate cancer machine-learning algorithm’s performance, including variability in histology preparation and differences in color profile or contrast that could result from the scanning process.

The group will examine the prostate slides using a variety of methods, including the open-source software HistoQC, a quality control tool for digital pathology slides that can quantify areas of blurriness on slide images and report brightness, contrast, and other image characteristics. The HistoQC analyses should help identify issues and differences among slides that could affect machine-learning algorithm performance, Dr. Stram says.
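As a rough illustration of the kinds of per-image measurements such a QC tool reports (this is not HistoQC's own API; the file path and library choices below are assumptions), a small Python sketch using scikit-image might compute brightness, contrast, and a sharpness proxy for a tile or thumbnail:

```python
# Illustrative QC-style metrics for one image tile; not HistoQC's API.
from skimage import io, color, filters

def tile_qc_metrics(path):
    """Return simple brightness, contrast, and sharpness measures for one tile."""
    rgb = io.imread(path)[..., :3]             # e.g., a tile or thumbnail from a WSI; drop alpha
    gray = color.rgb2gray(rgb)                 # grayscale values in [0, 1]
    brightness = float(gray.mean())            # overall luminance
    contrast = float(gray.std())               # simple global contrast measure
    sharpness = float(filters.laplace(gray).var())  # variance of Laplacian; low values suggest blur
    return {"brightness": brightness, "contrast": contrast, "sharpness": sharpness}

# Example with a placeholder file name:
# print(tile_qc_metrics("tile_0001.png"))
```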

The group plans to rescan the prostate biopsy slides using several different scanners to identify potential variances among whole slide images that are specific to the scanning process or scanner used.

“If you have a slide that is prepared at the hospital and you scan it on scanners A, B, and C, what is the variability for that exact same slide?” she says. “Areas of blurriness, contrast, brightness, the color profile, these are all things that can vary. In this case, the differences would be due to the scanner because we would have scanned the exact same slide.”
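One simple, hypothetical way to put a number on scanner-to-scanner color differences for the same physical slide is to summarize each scan's color profile in CIELAB and compare the summaries with a delta-E distance. The scanner names and file paths below are placeholders, and this is a sketch of the general idea rather than the working group's protocol.

```python
# Illustrative comparison of color profiles for one slide rescanned on several scanners.
import itertools
import numpy as np
from skimage import io, color

def mean_lab(path):
    """Mean CIELAB color of a scan's thumbnail; a coarse summary of its color profile."""
    rgb = io.imread(path)[..., :3]      # drop alpha channel if present
    lab = color.rgb2lab(rgb)
    return lab.reshape(-1, 3).mean(axis=0)

scans = {  # the same physical slide rescanned on three scanners (placeholder names)
    "scanner_A": "slide_017_scannerA.png",
    "scanner_B": "slide_017_scannerB.png",
    "scanner_C": "slide_017_scannerC.png",
}

means = {name: mean_lab(path) for name, path in scans.items()}
for a, b in itertools.combinations(scans, 2):
    delta_e = float(np.linalg.norm(means[a] - means[b]))  # CIE76 delta-E between mean colors
    print(f"{a} vs {b}: delta-E = {delta_e:.2f}")
```

Because the tissue and preparation are identical across the three files, any differences measured this way can be attributed to the scanning step.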

While the process of rescanning the prostate slides has not yet begun, the working group, which meets virtually on a monthly basis, intends to eventually share its findings with other CAP committees and the FDA. The latter has long expressed concern about how generalizability in machine learning can impact pathology.

The working group’s long-term plans also include investigating how to establish the reliability of ground truth labeling, or assigning the correct diagnosis to a slide, based on the number of pathologists involved, their level of training, and their years of experience. Having the correct ground truth label is essential for training algorithms to assist with diagnosis, Dr. Stram says.
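A minimal, hypothetical sketch of one way to derive a consensus label from several pathologists' reads and to track how strongly they agree follows; the labels and reader counts are invented for illustration and do not reflect the working group's methodology.

```python
# Illustrative consensus labeling: majority vote plus a simple agreement fraction.
from collections import Counter

def consensus_and_agreement(reads_per_slide):
    """reads_per_slide: list of per-slide label lists, one label per pathologist."""
    results = []
    for reads in reads_per_slide:
        label, votes = Counter(reads).most_common(1)[0]   # majority-vote consensus
        agreement = votes / len(reads)                    # fraction of readers who agree
        results.append((label, agreement))
    return results

# Three hypothetical slides, each read by four pathologists.
reads = [
    ["benign", "benign", "benign", "benign"],
    ["tumor", "tumor", "benign", "tumor"],
    ["tumor", "benign", "tumor", "benign"],   # a split read; more readers may be needed
]
for label, agreement in consensus_and_agreement(reads):
    print(f"consensus={label}, agreement={agreement:.2f}")
```

Low agreement on a slide is a signal that the "correct" label is uncertain, which is exactly the kind of unreliability in ground truth the group wants to understand before using such labels to train algorithms.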
