Anne Paxton
June 2024—Since ChatGPT burst onto the scene a year and a half ago, debate and concern about the OpenAI chatbot and how closely it can replicate human abilities have been widespread, as have thoughts on how to translate AI-based applications into clinical practice. Adoption of these applications in medicine has so far been limited.
A recently published study looks to change that story in pathology education (Wang AY, et al. Arch Pathol Lab Med. Published online Jan. 20, 2024. doi:10.5858/arpa.2023-0296-OA). It compares AI with human performance in answering pathology-specific questions that could be seen on licensing (board) examinations.
After asking an international group of pathologists to generate 15 such questions, the authors of this study had those participants compare answers to the questions that were given by two OpenAI large language models—GPT-3.5 and the updated GPT-4—with answers written by a newly practicing staff pathologist who had recently passed the Canadian pathology licensing exam. Participants were asked to score each answer on a five-point scale and to identify which answer was written by ChatGPT.
GPT-3.5 received an average score higher than the pathologist on nine of 15 questions, but there was no statistically significant difference in the aggregated mean scores, with the scores remaining similar after adjusting for difficulty and discipline. GPT-4 scored higher than both GPT-3.5 and the pathologist on 12 of 15 questions.
When provided with a human answer for comparison, participants in the initial survey with GPT-3.5 were able to identify the AI-generated answers, largely because they were unnecessarily long, overly detailed, or awkward, or contained incorrect information.
“Overall, the study highlights that ChatGPT, particularly version 4, has demonstrated notable performance in providing answers to domain-specific pathology questions, even surpassing the performance of a practicing pathologist in some instances,” says study coauthor Luca Cima, MD, practicing staff pathologist, Department of Diagnostics and Public Health, Section of Pathology, University and Hospital Trust of Verona, Italy.

AI applied to pathology has long fascinated Dr. Cima, dating back to his residency at the University of Verona from 2012 to 2017. After conducting AI research and working as a staff member for the European Society of Digital and Integrative Pathology, he moved on to using ChatGPT while he was a practicing staff pathologist. “Starting to use ChatGPT was a natural step for a pathologist like me because of my interest in artificial intelligence and computational pathology.” He met the senior author of the Archives study, Matthew J. Cecchini, MD, PhD, of the Department of Pathology and Laboratory Medicine, Western University and London Health Sciences Centre, London, Ontario, Canada, whom he describes as the “mind behind the study,” over social media, then became involved in the study himself.
Medicine is at a stage now, Dr. Cima says, where the potential benefits and applications of AI technologies, such as improving diagnostic accuracy, managing large data volumes, and supporting medical education, are recognized. “However, medicine is also recognizing that there are many challenges related to the integration of AI tools into clinical practice, including the need for extensive validation, concerns about accuracy, and the development of trust in AI-assisted decision-making,” he says. The field is progressing from theoretical and experimental applications of AI to more practical, clinically relevant implementations, “but there remains a significant gap between the first and second phases.”
As large language models like ChatGPT evolve, Dr. Cima says, with advances in their learning algorithms, training data, and understanding of medical content, they could surpass the initial versions in their ability to provide more accurate, relevant, and context-specific responses in the medical domain.
“This technology is highly advanced and has shown significant progress, particularly in the field of natural language processing that includes deep learning techniques. It trains models on a massive corpus of text data, allowing them to learn complex language patterns and generate responses that can mimic human-like text.”
“Despite their impressive capabilities,” he continues, “this technology also has limitations, particularly in areas requiring high levels of domain-specific knowledge or ethical considerations. They can sometimes generate incorrect or irrelevant information and may struggle with tasks that require deep understanding beyond pattern recognition.”
Dr. Cecchini, a pathologist and assistant professor, says the study had its inception in part in his own experience with ChatGPT.
“These tools have been transformative in my life. They’ve enhanced my productivity in everything else I’ve done over the last year.”
Dr. Cecchini has found ways to use ChatGPT in practicing and teaching pathology. In fact, he says, “I use it all the time. This week I wrote three briefing notes for my hospital organization around digital pathology, and I wrote a bunch of project proposals for student research projects.” Recently, on a long drive back from his parents’ house, he held a voice conversation with ChatGPT to draft the first few chapters of a book.
Dr. Cecchini used ChatGPT to devise a game to help students and others memorize medical vocabulary. “Given a command like ‘Write lyrics about mitosis in the style of a popular song,’ ChatGPT can make up songs about pathology knowledge so students can learn in a fun and interactive way,” he says.

He describes the chatbot as a “workflow genius.” Dr. Cecchini tested the abilities of the system with a publicly available teaching case he used to simulate pathology reporting. While dictating on this simulated case, he asked a relevant question of ChatGPT: “Remind me of the high-grade patterns for lung cancer.” The app complied. “Then I said, ‘Can you summarize all of this into a microscopic description for me?’ Then it took our conversation in natural language and turned it into a microscopic description and a full pathology report.”
Next, Dr. Cecchini could say to ChatGPT, “‘I’m on for tumor boards tomorrow. Can you condense this into two lines for me?’ So it did that. And I said, ‘This patient wants a summary of their pathology. Can you write a summary of this discussion at a level that’s for someone who has a great education but English is not their first language?’ And it did that too.”
He then asked, “‘Can you make multiple-choice questions about this case for a medical student?’” In the same way, ChatGPT produced a string of such questions on command.
Dr. Cecchini emphasizes that commercial large language models should not be used for clinical work involving patient data. His experiments with synthetic data aimed to explore the capabilities of these emerging systems. Some institutions are integrating LLMs into their hospital infrastructure, he notes, ensuring compliance with privacy and data security regulations. “This approach is critical for the safe clinical deployment of these tools,” he says.
As a medical school assistant professor, he says, “All my questions are written with the assistance of large language models such as ChatGPT because it writes better questions than I can write. I check them all to make sure they’re accurate. I can say, ‘Give me 100 different questions on this topic,’ and it will generate 100 different questions. Then I pick the six I like. So instead of spending hours trying to make multiple-choice questions, I get ChatGPT to do it. And the cool thing is you can dump documents into it. So I can take my lecture notes, dump them into ChatGPT, and say, ‘Make questions for me about this content.’ And it does.”
What is important, Dr. Cecchini says, is “to be responsible for everything it makes.”
“People have gotten into trouble. For example, there have been multiple reports of lawyers who got into trouble because they had ChatGPT write their whole legal brief and it hallucinated aspects including legal citations. As long as you review, edit, and be responsible for it all, there’s no problem. You just have to be transparent with how you’ve used it,” he says. It can be challenging to know when AI is being used, he adds, given its integration into products such as Microsoft Office through Copilot and in tools such as Grammarly that use AI to improve documents.
Academic writing is in a gray area now, Dr. Cecchini says. “Some journals have come out and said you absolutely cannot use large language models. Others have said you can use it, no problem; you just have to outline how you’ve used it.” For people with limited English, ChatGPT can bring equity, he says, describing how tools like ChatGPT can help overcome language barriers.
Unlike standard translation tools, he adds, LLMs understand context. “Google Translate can translate words to words. But ChatGPT understands the overall context of what you are talking about and puts in the nuances and figures that all out.”
The study of ChatGPT’s performance in pathology published in Archives differs from similar studies of its performance on other medical and legal exams, Dr. Cecchini says.
“In those cases, the questions come from public resources. Our study is different in that these are not publicly available questions. So they would not have been in the training sets for GPT so that it could have already been trained on the answers. That’s one thing you don’t want when you’re testing performance.” In the Archives study, “Our questions were all brand-new and made up by an international group of pathologists,” recruited through what was then Twitter, now X.
Though the study’s participants were able to identify the ChatGPT-generated answers, it was likely because of their length. “ChatGPT tends to be verbose. The pathologist’s answers were succinct and cursory rather than a prolific, long discursive thing,” Dr. Cecchini says.
For example, with a prompt of “Describe the histologic features of adult-type fibroadenomas of the breast,” GPT-3.5 responded with three paragraphs and a separate summary sentence. GPT-4 responded with a definition of fibroadenoma and a numbered list of seven key histologic features of adult-type fibroadenoma (circumscription, biphasic nature, epithelial component, stromal component, regular arrangement, calcifications, and absence of atypia), each with an explanation, followed by a wrap-up sentence. The pathologist, answering the same prompt, listed only three histologic features in 48 words.
Dr. Cima found the study results surprising in two ways. “First, the high rate of participants accurately identifying AI-generated responses from ChatGPT-3.5, with 95.8 percent of the ChatGPT-generated answers correctly identified, means there are still distinguishable characteristics that set AI-generated text apart from human-generated text.”
Second, he says, “The fact that ChatGPT can sometimes sound confidently incorrect”—one of the participants’ impressions of ChatGPT answers—“is a critical reminder of the challenges in using AI in high-stakes fields like pathology. While AI can generate responses that are technically coherent and linguistically sound, its inability to always understand complex human contexts or verify factual accuracy remains a significant hurdle.”
That is one reason why Dr. Cima has reservations about the use of these large language models in pathology: If they can generate confidently incorrect responses, that should raise concern about accuracy and reliability.
He raises similar reservations about bias and training. “Large language models are trained on vast data sets that may contain biases or outdated information. There is a concern that these biases could be reflected in the AI-generated responses,” he says, “potentially leading to misinformed decisions or reinforcing existing biases in pathology practice.”
Risks to security and privacy should also be taken into account. “The use of AI in medicine raises concerns regarding data security and patient privacy, especially given the sensitive nature of medical records and the potential for misuse of information,” Dr. Cima says. Regulatory and ethical issues pose potential problems as well, because “the integration of large language models into clinical practice involves navigating complex regulatory and ethical frameworks to ensure these tools are used safely and responsibly.”
There is also a risk of overreliance on AI, Dr. Cima says. “Reliance on AI tools like ChatGPT could lead to a degradation of diagnostic skills among pathologists and medical students, as they might defer too readily to the AI’s conclusions without critical evaluation.”
Finally, he cites interpretability and explainability as dangers: “Large language models often operate as black boxes. This lack of explainability can be problematic in medicine, where understanding the reasoning behind diagnoses and treatment decisions is essential.”
Future studies should assess the performance of ChatGPT and similar generative language models against a diverse range of pathology question modalities, including visual inputs, Dr. Cima suggests.
“This direction is crucial because visual interpretation, such as the classic examination of histopathology/cytopathology slides, is the basis of pathology and is a skill that current large language models like ChatGPT are not fully equipped to handle due to their primarily text-based nature. Therefore, integrating visual analysis capabilities would enhance the utility of large language models in pathology and other medical fields where diagnostic imaging and visual data interpretation are critical.”
In addition, he says, it would be beneficial to study the application of large language models in real-world clinical and educational scenarios to assess their practical utility. And “addressing the limitations identified in our study, such as the occasional production of confidently incorrect information, should be another focus for future studies.”
The study’s authors write: “The field of natural language processing and LLMs has made considerable progress and is well poised to reshape the way pathology is learned and performed. Our study leverages a professional network of pathologists on X (Twitter) to evaluate AI applications such as ChatGPT against human observers with impressive performance.” As large language models mature, the authors continue, “there will be rich opportunities for medical AI applications to alleviate the burden of increasing complexity in medicine and offer tremendous educational opportunities for pathology trainees.”
Dr. Cima expands on that forecast. “There is indeed a fast rate of improvement underway in the field of large language models, particularly in their application to medical domains like pathology. The performance of GPT-4, as evidenced by the study, shows measurable improvements over GPT-3.5 in answering domain-specific questions. This trend aligns with broader observations in the computer science field, and it could lead to enhanced applications in medical education, diagnostics, and clinical decision-making, provided these models are properly validated and integrated into clinical practice.”
The most important message the study should convey to pathologists, in his view, is that “large language models like ChatGPT may be transformative tools in the field of pathology in their ability to provide instant access to information, simplify complex pathology concepts, and offer interactive and personalized experiences.” As pathology knowledge becomes more complex and voluminous, he adds, “Based on the conclusions of our study, more advanced iterations of large language models with increased domain-specific knowledge will likely aid pathologists, enhancing the training of residents and assisting the diagnostic routine of staff pathologists.”
Anne Paxton is a writer and attorney in Seattle.