Editors: Liron Pantanowitz, MD, PhD, MHA, chair of the Department of Pathology and professor of pathology, University of Pittsburgh Medical Center, and Matthew G. Hanna, MD, vice chair of pathology informatics and associate professor, Department of Pathology, University of Pittsburgh Medical Center.
Use of open-source LLMs for extracting cancer-related attributes from pathology reports
October 2025—Large language models (LLMs) are becoming commonplace for personal and business use, and the health care community is leveraging them for various purposes. Researchers at the University Medical Center Hamburg-Eppendorf, Hamburg, Germany, used open-source LLMs to extract critical medical data from pathology reports, demonstrating how such models can transform unstructured clinical text into structured pathology data. While pathology reports are rich in information about tumor type, size, and stage, their narrative format makes automated data extraction difficult. Strict data-protection laws in Germany further complicate the use of cloud-based artificial intelligence (AI) tools in health care. To address such issues, the authors deployed and evaluated five LLMs: Llama 3.3 70B and Mistral Small 24B, which were pretrained on multilingual corpora, and domain-adapted models, including variants fine-tuned on German-language corpora by Vago Solutions (Llama 3.1 8B, Mistral NeMo 12B, and Mixtral 8×7B).
The authors compiled a data set of 522 anonymized pathology reports, each from a unique patient, for the study. They also developed a retrieval-augmented generation (RAG) pipeline that drew on an additional 15,000 reports from previous years to enhance model performance. The five LLMs were evaluated using zero-shot, few-shot, and RAG-enhanced few-shot prompting strategies. All models produced structured JSON outputs and were assessed using precision, recall, accuracy, and macro-averaged F1 scores. The RAG strategy significantly improved the extraction of rare but clinically important features, such as metastasis and staging, especially in smaller models, enabling them to match the accuracy of their larger counterparts. For example, Mistral Small 24B achieved results nearly identical to those of Llama 3.3 70B while using far fewer computational resources.
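To illustrate why a macro-averaged F1 score is a useful companion to accuracy when a feature such as metastasis is rare, the following is a minimal sketch in Python. The labels and predictions are invented for illustration and are not data from the study, and the function names are hypothetical. The example shows a model that never reports the rare class: accuracy stays high while macro-averaged F1 drops sharply.

```python
# Illustrative only: the labels below are made up, not taken from the study.

def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced case: 2 of 20 reports mention metastasis,
# and the model predicts "absent" for every report.
y_true = ["absent"] * 18 + ["present"] * 2
y_pred = ["absent"] * 20

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={macro:.3f}")
```

Because every class contributes equally to the macro average, a model cannot score well on this metric by ignoring the rare class, which is the situation described for metastasis and staging.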
The study also revealed that prompt design and retrieval mechanisms play a crucial role in model performance. Models performed best when given clear instructions and examples, especially for identifying rare but vital features, such as metastasis and cancer staging. The authors’ findings suggest that AI can help streamline cancer documentation, reduce manual workload, and improve the completeness of cancer registries. The study also underscores the feasibility of using mid-sized LLMs in clinical settings, thereby offering a scalable and privacy-compliant solution for medical data extraction. Despite such challenges as class imbalance and occasional misclassifications, integrating RAG with few-shot prompting proved effective in enhancing the accuracy and robustness of LLMs. The authors concluded that this approach can support high-quality, automated cancer documentation and improve data-driven oncology care in German hospitals. As hospitals worldwide consider integrating AI into their workflows, the study offers a blueprint for a secure, scalable, and effective process that converts unstructured data in the laboratory information system into usable structured information.
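The general idea of RAG-enhanced few-shot prompting can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the archive entries, attribute names, and prompt wording are invented, and word-overlap (Jaccard) similarity stands in for whatever retrieval method the study actually used. The sketch retrieves the archived reports most similar to a new report and places them, with their structured labels, ahead of the new report as worked examples.

```python
import json

# Hypothetical archive of previously labeled reports (invented text and labels).
ARCHIVE = [
    {"report": "invasive ductal carcinoma of the breast, tumor size 22 mm, no metastasis",
     "labels": {"tumor_type": "invasive ductal carcinoma", "metastasis": False}},
    {"report": "colon adenocarcinoma with liver metastasis",
     "labels": {"tumor_type": "adenocarcinoma", "metastasis": True}},
    {"report": "benign nevus of the skin, completely excised",
     "labels": {"tumor_type": "none", "metastasis": False}},
]


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for a real retrieval model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_examples(query, archive, k=2):
    """Return the k archived reports most similar to the query report."""
    ranked = sorted(archive, key=lambda ex: jaccard(query, ex["report"]), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt: instruction, retrieved examples, then the new report."""
    parts = ["Extract the attributes from the pathology report and answer with JSON only."]
    for ex in examples:
        parts.append("Report: " + ex["report"])
        parts.append("JSON: " + json.dumps(ex["labels"], sort_keys=True))
    parts.append("Report: " + query)
    parts.append("JSON:")
    return "\n".join(parts)


QUERY = "adenocarcinoma of the colon with lymph node metastasis"
examples = retrieve_examples(QUERY, ARCHIVE, k=2)
prompt = build_prompt(QUERY, examples)
print(prompt)
```

The assembled prompt would then be sent to a locally hosted model, and the returned JSON parsed and validated. Retrieving similar past reports means that examples of rare findings are more likely to appear in the prompt when they are relevant.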
Bartels S, Carus J. From text to data: Open-source large language models in extracting cancer related medical attributes from German pathology reports. Int J Med Inform. 2025. doi.org/10.1016/j.ijmedinf.2025.106022
Correspondence: Dr. Stefan Bartels at [email protected] or Jasmin Carus at [email protected]
Use of a polarized scanner in digital pathology
Whether pathology can go fully digital has been questionable because not every slide can be scanned and reviewed digitally. Until recently, this was the case for glass slides that require examination under polarized light, which is frequently necessary in routine pathology practice. Polarized light microscopy is required for visualizing birefringent structures in tissue that may otherwise be invisible under standard brightfield illumination. For example, slides are examined under polarized light to help interpret Congo red staining for amyloid deposits or to reveal structures, such as crystals or foreign material, that are not easily seen with a standard light microscope. The authors of a study published in Virchows Archiv evaluated the effectiveness of the Glissando POL brightfield and polarized light scanner for use in digital pathology workflows. The goal of their study was to determine whether images captured at 20× magnification using the Glissando scanner (0.25 µm per pixel resolution) were comparable to those obtained via conventional polarized light microscopy (0.5 µm per pixel resolution) using the Olympus DP 26-CU digital camera attached to an Olympus BX53 light microscope. The study focused on 75 archival histological cases: 16 of amyloidosis (Congo red-stained sections), 21 of periprosthetic membranes with wear particles, 17 of foreign body (mainly suture) granulomas, eight of gout, six of pseudo-gout, three of breast tissue with calcium oxalate crystals, and four of nodular sclerosing Hodgkin lymphoma with birefringent collagenous bands. The authors found that the scanner produced images virtually identical to those from conventional polarized light microscopy across all case types. They concluded that although scan times were longer using the Glissando scanner than for conventional polarized light microscopy, the ability to integrate polarized light imaging into digital workflows improves efficiency and reduces the need for physical slide handling.
Al Sheikhyaqoob D, Oliveira A, Fella M, et al. Polarised light scanner for digital pathology. Virchows Archiv. 2025;487(1):209–213.
Correspondence: Dunia Al Sheikhyaqoob at [email protected]