Editor: Deborah Sesok-Pizzini, MD, MBA, chief medical officer, Labcorp Diagnostics, Burlington, NC, and adjunct professor, Department of Clinical Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia.
Assessment of pathology domain-specific knowledge of ChatGPT and comparison to human performance
December 2024—Excitement over the impact of artificial intelligence-based tools in different areas of health care has prompted position papers and research on the application of these new devices. One such tool is ChatGPT, which is publicly available and has demonstrated domain-specific knowledge in numerous areas, including medicine. The vast amount of data generated with current technologies, including digital pathology applications, and in subspecialty areas of pathology may lend itself to interpretation with artificial intelligence-based algorithms. But while AI-based applications can automate routine tasks and enhance diagnostic accuracy, their widespread use has been limited. Further AI research and validation of AI-based applications will increase adoption of such technology and, thereby, the overall efficiency and accuracy of the diagnostic process in pathology. The large language model (LLM) Chat Generative Pretrained Transformer, or ChatGPT, can learn complex language patterns and generate human-sounding and context-specific text based on the prompt entered. ChatGPT has been shown to perform well on standardized tests, but the evidence to support this is limited. More specifically, data on how well this method of text generation can articulate detailed concepts in complex areas, such as pathology, are limited. The authors conducted a study to assess how well LLMs perform when applied to short-answer pathology questions, similar to those found on standardized pathology exams. They focused on the performance of GPT-3.5 and GPT-4 in the pathology domain and their application to pathology education. The authors also examined other LLMs, including Bard, Perplexity, and Claude Instant, as well as the ability of human observers to recognize AI-generated outputs. They recruited an international group of pathologists (n=15) from nine countries to generate pathology-specific questions at a level similar to those found on licensing (board) examinations. 
The questions were constructed at a difficulty level suitable for a senior pathology resident. They were answered by GPT-3.5, GPT-4, and a staff pathologist who had recently passed the Canadian pathology exams. Study participants were instructed to score the answers on a five-point scale and predict which answer was written by ChatGPT. The study was conducted in two phases: Phase one involved evaluating responses generated by GPT-3.5 and the answering pathologist, while phase two, conducted after a two-month washout period, involved evaluating responses generated by GPT-4, which was not available during phase one of the study. The results showed that GPT-3.5 performed at a level similar to the staff pathologist, whereas GPT-4 outperformed both GPT-3.5 and the pathologist. The overall scores for GPT-3.5 and GPT-4 were within the range of meeting expectations for trainees undergoing licensing examinations. Of interest, for all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5. Answers generated by GPT-4 consistently incorporated elements of accuracy relevant to the prompt. They were also concise and used a clear, well-organized answer format and appropriate pathology terminology. The authors concluded that this study demonstrates the potential for LLMs to transform pathology practice. They anticipate that as LLMs continue to develop, there will be many opportunities to use AI applications to help alleviate the burden and complexities of pathology data and enhance educational opportunities for pathology trainees.
Wang AY, Lin S, Tran C, et al. Assessment of pathology domain-specific knowledge of ChatGPT and comparison to human performance. Arch Pathol Lab Med. 2024;148:1152–1158.
Correspondence: Dr. Matthew J. Cecchini at matthew.cecchini@lhsc.on.ca
Fecal immunochemical test screening and risk of colorectal cancer death
Colorectal cancer (CRC) is a major contributor to cancer deaths worldwide. The U.S. Preventive Services Task Force recommends annual fecal immunochemical test (FIT) screening as an option for those at average risk to reduce their chances of dying from CRC. FIT is simple to use and can be completed at home. It is sensitive for CRC and adenomas yet highly specific. FIT screening programs have reported reduced CRC incidence and mortality, but more data are needed to support their effectiveness. Many FIT trials have limited power, or they are not designed to compare those who receive FIT screening with unscreened individuals. Furthermore, the effectiveness of FIT may vary based on colon site and race and ethnicity given differences in social and structural barriers that impact care quality across the screening continuum. The authors conducted a study to evaluate whether FIT screening is associated with a lower risk of dying from CRC overall and based on cancer location and race and ethnicity. They performed a nested case-control study of a cohort of screening-eligible people in two large integrated health systems that had racially, ethnically, and socioeconomically diverse health plan members with a long-term history of returning FIT screening tests. The case patients were adults ages 52 to 85 years who died from colorectal adenocarcinoma between 2011 and 2017. Each case patient was matched, in a 1:8 ratio based on age, sex, health plan duration, and geographic area, to randomly selected people who were alive and did not have CRC. The latter served as the controls. The authors then performed data analysis from January 2002 to December 2017. A total of 10,711 racially and ethnically diverse participants were identified out of a cohort of 2,127,128 people. 
During the 10-year period prior to the reference date, 6,101 (63.5 percent) of the control subjects completed one or more FITs, with a cumulative 12.6 percent positivity rate (768 controls), of whom 610 (79.4 percent) went on to have a colonoscopy within one year. During the five-year period prior to the reference date, 494 (44.8 percent) cases and 5,345 (55.6 percent) controls completed one or more FITs. Regression analysis showed that completing one or more FITs was associated with a 33 percent lower risk of death from CRC overall (adjusted odds ratio, 0.67) and a 42 percent lower risk of death from cancers of the left colon and rectum (adjusted odds ratio, 0.58). No association with right colon cancers was found, but the difference between the estimates for right colon cancers and for left colon and rectal cancers was statistically significant. The authors concluded that FIT screening was associated with a lower risk of colorectal cancer death among non-Hispanic Asian, non-Hispanic Black, and non-Hispanic white people. This study provides community-based evidence that FIT screening lowers the mortality risk of CRC and supports the use of this testing for population-based screening.
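For readers who want to check the arithmetic behind these figures, the following is a minimal sketch, not the authors' analysis. It assumes the common reading of an adjusted odds ratio below 1 as a "percent lower risk" of (1 - odds ratio) x 100, which approximates a risk reduction only when the outcome is rare. The function names are hypothetical and chosen for illustration.

```python
def percent_lower(odds_ratio: float) -> float:
    """Percent reduction implied by an odds ratio below 1 (illustrative reading)."""
    return round((1.0 - odds_ratio) * 100.0, 1)


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Overall adjusted odds ratio reported in the summary.
overall_reduction = percent_lower(0.67)

# Counts reported for control subjects in the 10-year lookback period.
positivity_rate = percent(768, 6101)       # positive FITs among screened controls
colonoscopy_follow_up = percent(610, 768)  # colonoscopy within one year of a positive FIT

print(overall_reduction, positivity_rate, colonoscopy_follow_up)
```

The same conversion can be applied to any of the site-specific odds ratios to confirm that a stated percentage and its odds ratio agree.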
Doubeni CA, Corley DA, Jensen CD, et al. Fecal immunochemical test screening and risk of colorectal cancer death. JAMA Netw Open. 2024;7(7). doi:10.1001/jamanetworkopen.2024.23671
Correspondence: Dr. Chyke A. Doubeni at chyke.doubeni@osumc.edu