Newsbytes


NLP program takes free-text searches from zero to 60

June 2023—The time it takes to read through numerous pathology reports to find nuggets of critical information buried within narrative sections of text is tantamount to the time it takes for carbon atoms to turn into diamonds—or so it may seem to those tasked with digging for medical information.

But at Mount Sinai Health System, New York, a natural language processing program can search for information in the free-text portions of multiple pathology reports instantaneously, according to Aryeh Stock, MD, instructor in the Department of Pathology, Molecular, and Cell-based Medicine, Icahn School of Medicine at Mount Sinai.

Dr. Stock estimates that reading a batch of 1,000 pathology reports in search of specific diagnostic information would take him approximately 22.9 workweeks of 40 hours each. Using the natural language processing program that he wrote, he can complete that search in four seconds, he says.
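To put those two figures on the same scale, the arithmetic can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch using the numbers quoted above; the variable names are illustrative and the per-report time is a derived average, not a figure Dr. Stock reported.

```python
# Figures quoted in the article.
REPORTS = 1000
WORKWEEKS = 22.9
HOURS_PER_WORKWEEK = 40
AUTOMATED_SECONDS = 4

# Convert the manual estimate into hours and seconds.
manual_hours = WORKWEEKS * HOURS_PER_WORKWEEK
manual_seconds = manual_hours * 3600

# Derived values (averages implied by the estimate, not reported figures).
minutes_per_report = manual_hours * 60 / REPORTS
speedup = manual_seconds / AUTOMATED_SECONDS

print(f"Manual effort: about {manual_hours:.0f} hours")
print(f"Average per report: about {minutes_per_report:.0f} minutes")
print(f"Speedup: about {speedup:,.0f} times faster")
```

By this estimate the manual review comes to roughly 916 hours, or a little under an hour per report, against four seconds for the automated search.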

While artificial intelligence-based chatbots like ChatGPT have brought increased attention to natural language processing, creating a pathology-based NLP program presents unique challenges, Dr. Stock says. “In pathology, the vocabulary is much more limited,” he explains. “The order of words and the syntax become much more important, and this presents a significant challenge when you are trying to train an NLP program to focus on these nuances.”


Dr. Stock began to consider the lengthy, labor-intensive process of searching for information within the text of pathology reports when he was in his last year of medical school. While assisting a pathology group at Mount Sinai that was researching Crohn’s disease and inflammatory bowel disease, he was tasked with categorizing foci of inflammation by their locations within the bowel. The work involved interpreting prior biopsy reports and entering information into a spreadsheet. The repetitive nature of the work and amount of time it consumed inspired Dr. Stock to build an automated solution. A spreadsheet program he created in 2018 for the research project became the first prototype for Mount Sinai’s pathology natural language processing program.

Dr. Stock later rewrote the spreadsheet program in Python, which enhanced its flexibility, functionality, and scalability and enabled the program to process an entire database of pathology cases at once rather than one file at a time, he says. The updated program searches plain text, taking comma-separated values (CSV) files from the laboratory information system as input. Because CSV files are widely used by LISs, the program can easily be used with a variety of vendors’ systems. A separate version of the program, also written in Python, uses XML (extensible markup language) files as input, taking advantage of the fact that Mount Sinai’s LIS can output data in XML format. (The XML version was written by Hansen Lam, MD, who was in residency with Dr. Stock and is now a cytopathology fellow at Johns Hopkins University School of Medicine.) The XML version has a more user-friendly interface because XML contains greater context on how to format data, but it cannot function with an LIS that does not use XML files.
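As a rough illustration of what a CSV-based workflow can look like, the sketch below reads an LIS-style export with Python's standard csv module and searches the free-text field of every case in one pass. It is not the Mount Sinai code, which is proprietary; the column names (accession, final_diagnosis), the sample rows, and the function names are all hypothetical.

```python
import csv
import io

# Hypothetical LIS export: one row per case, free text in one column.
SAMPLE_EXPORT = '''accession,final_diagnosis
S23-0001,"A. Colon, ascending, biopsy: Chronic active colitis."
S23-0002,"A. Prostate, left base, biopsy: Benign prostatic tissue."
'''


def load_reports(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def search_reports(reports, term, field="final_diagnosis"):
    """Return accession numbers whose free-text field contains the term."""
    term = term.lower()
    return [row["accession"] for row in reports if term in row[field].lower()]


reports = load_reports(SAMPLE_EXPORT)
print(search_reports(reports, "colitis"))
```

Because the csv module handles quoted fields, commas inside the narrative text do not break the row structure, which is one reason the format travels well between systems.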

Both versions of the program have saved time in pathology research projects by automating searches for information that would be nearly impossible to find in a regular LIS query, according to Alexandros D. Polydorides, MD, PhD, professor and vice chair for clinical research and trial design in the Department of Pathology, Molecular, and Cell-based Medicine, Icahn School of Medicine at Mount Sinai. A key advantage is the clear, well-organized data output that they deliver, he says. “The output of this is an Excel document with one row per specimen, or one row per patient, that can be very easily sorted, mined, and applied to any project.”

Often, even with more basic LIS queries, the data output is a spreadsheet in which data pertaining to one specimen may be spread over multiple columns or rows. It requires a lot of effort to delete and move information so that it is readable and easy to organize for research purposes, Dr. Polydorides notes.

While the two programs are similar, Dr. Lam’s XML version is open source, with the code published on GitHub (github.com/hansenlam/public_PathReporter) and details about that version published in the Journal of Pathology Informatics (Lam H, et al. Published online Nov. 8, 2022. http://dx.doi.org/10.1016/j.jpi.2022.100154). Dr. Stock’s CSV-file version of the program is proprietary intellectual property of Mount Sinai Health System.

The underlying principles of both versions of the NLP program are the same, Dr. Stock says. And both take into account that the text portion of a pathology report is a compilation of observations by multiple people. For example, if a biopsy yields five specimens, there are typically multiple comments about each of those specimens interspersed throughout the report. Therefore, the NLP application uses regular expressions, a pattern-matching syntax supported by a specialized module for the Python programming language, which allow users to specify rules for searching strings of text. Using regular expressions, the program identifies information pertaining to specific specimens in the text of the report and reorganizes it so data on each specimen are on a separate row of the spreadsheet. The algorithm scores words it finds in the report based on how well they match items in a library of pathology words built into the program, he says. Through this process, the program can, for example, identify which words pertain to diagnosis and which address location.
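The general idea of splitting a report by specimen and sorting words into categories can be sketched as follows. This is a simplified illustration, not the published or proprietary code: it assumes specimens are introduced by a letter and a period at the start of a line, it uses Python's standard re module, the two-category word library is a tiny hypothetical stand-in, and exact word lookup replaces whatever scoring the real program applies.

```python
import re

# Assumed convention: each specimen starts a line with a letter and a period.
SPECIMEN_HEADER = re.compile(r"^([A-Z])\.\s+", re.MULTILINE)

# Tiny hypothetical stand-in for the program's library of pathology words.
LIBRARY = {
    "location": {"colon", "ileum", "rectum", "ascending", "sigmoid"},
    "diagnosis": {"colitis", "dysplasia", "adenocarcinoma", "inflammation"},
}


def split_specimens(text):
    """Map each specimen label to the text that follows its header."""
    # With one capturing group, split() returns
    # [preamble, label1, body1, label2, body2, ...].
    parts = SPECIMEN_HEADER.split(text)
    return {label: body.strip() for label, body in zip(parts[1::2], parts[2::2])}


def categorize(body):
    """Collect the words in a specimen's text that appear in each category."""
    words = re.findall(r"[a-z]+", body.lower())
    return {name: [w for w in words if w in vocab] for name, vocab in LIBRARY.items()}


def to_rows(text):
    """Produce one row (dict) per specimen, ready to write to a spreadsheet."""
    return [
        {"specimen": label, **categorize(body)}
        for label, body in split_specimens(text).items()
    ]


REPORT = """A. Terminal ileum, biopsy: Mild active inflammation.
B. Sigmoid colon, biopsy: Chronic colitis with low-grade dysplasia.
"""

for row in to_rows(REPORT):
    print(row)
```

Each resulting row holds one specimen with its location words and diagnosis words in separate fields, which mirrors the one-row-per-specimen output described earlier.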

This pathology-centric approach makes the Mount Sinai NLP programs unique, according to Dr. Lam. Medicine-specific NLP programs typically have “limited pathology-related dictionaries,” and pathology NLP programs are “tailored for narrow sets of cases,” he wrote in the Journal of Pathology Informatics article.

While the program created at Mount Sinai has so far only been used for research, when Dr. Lam decided to publish a paper on the XML version, Dr. Polydorides helped him test it on projects that could demonstrate the program’s clinical usefulness in pathology. “The three things we tested it on were things that pathologists might be interested in, like Gleason scoring in prostatic adenocarcinoma, grading of anal intraepithelial neoplasia, and grade or location of dysplasia in IBD,” Dr. Polydorides says. “These are some of the many pathology-centered projects for which the program might be useful.”

The results of those pathology projects show the program has a 90 to 100 percent concordance with manually reading pathology reports to retrieve information, while saving significant amounts of time, according to the journal article. In most cases, the automated program was more accurate than the manual method. In one project, which identified dysplasia among 72 anal biopsy specimens and then determined the grade of the dysplasia in those specimens, there were just seven discordant results. But in six of the seven cases, the automated system made the correct determination.
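The figures for the anal biopsy project can be checked with a short calculation. The sketch below rests on stated assumptions that go beyond the article: that 72 is the denominator for concordance, that both methods were correct whenever they agreed, and that manual review was correct in the one discordant case the automated system got wrong.

```python
# Figures quoted in the article for the anal biopsy project.
TOTAL_SPECIMENS = 72
DISCORDANT = 7
AUTOMATED_CORRECT_WHEN_DISCORDANT = 6

# Assumption: in the remaining discordant case, manual review was correct.
manual_correct_when_discordant = DISCORDANT - AUTOMATED_CORRECT_WHEN_DISCORDANT

# Share of specimens on which the two methods agreed.
concordance = (TOTAL_SPECIMENS - DISCORDANT) / TOTAL_SPECIMENS

# Assumption: both methods were correct on every concordant specimen.
concordant = TOTAL_SPECIMENS - DISCORDANT
automated_accuracy = (concordant + AUTOMATED_CORRECT_WHEN_DISCORDANT) / TOTAL_SPECIMENS
manual_accuracy = (concordant + manual_correct_when_discordant) / TOTAL_SPECIMENS

print(f"Concordance: {concordance:.1%}")
print(f"Automated accuracy: {automated_accuracy:.1%}")
print(f"Manual accuracy: {manual_accuracy:.1%}")
```

Under those assumptions, concordance for this project sits at the low end of the reported 90 to 100 percent range, and the automated method comes out ahead of manual review.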

“The upshot is that research studies that previously required months of chart review and data assembly can be done in a fraction of the time and with significantly less manpower,” Dr. Stock emphasizes.


Dr. Polydorides suggests that the NLP program could be adapted to automate some quality control and quality assurance processes. A standard quality control process in cytology, for example, involves checking whether there is a surgical specimen for each cytology specimen and, if so, comparing whether the cytology and surgical specimen diagnoses agree. “Rather than have someone go through all cases and see which ones have biopsies, this program might help automate that process,” he says.
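Once diagnoses have been extracted into structured rows, a correlation check of this kind could in principle be expressed in a few lines. The following is a hypothetical sketch of the idea only: the field names (patient_id, accession, diagnosis), the sample cases, and the assumption that diagnoses have already been reduced to comparable categories are all illustrative, and nothing here describes an existing Mount Sinai workflow.

```python
def correlate(cytology_cases, surgical_cases):
    """Label each cytology case by whether a surgical diagnosis agrees with it."""
    # Index surgical cases by patient so each lookup is a dictionary access.
    surgical_by_patient = {}
    for case in surgical_cases:
        surgical_by_patient.setdefault(case["patient_id"], []).append(case)

    results = []
    for cyto in cytology_cases:
        matches = surgical_by_patient.get(cyto["patient_id"], [])
        if not matches:
            status = "no surgical specimen"
        elif any(m["diagnosis"] == cyto["diagnosis"] for m in matches):
            status = "agree"
        else:
            status = "disagree"
        results.append((cyto["accession"], status))
    return results


# Hypothetical, already-categorized cases.
cytology = [
    {"patient_id": "P1", "accession": "C23-10", "diagnosis": "malignant"},
    {"patient_id": "P2", "accession": "C23-11", "diagnosis": "benign"},
    {"patient_id": "P3", "accession": "C23-12", "diagnosis": "benign"},
]
surgical = [
    {"patient_id": "P1", "accession": "S23-20", "diagnosis": "malignant"},
    {"patient_id": "P2", "accession": "S23-21", "diagnosis": "malignant"},
]

for accession, status in correlate(cytology, surgical):
    print(accession, status)
```

Only the cases flagged as disagreeing would then need a person's attention, rather than every cytology case.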

The program is flexible enough that it should be relatively easy to adapt to clinical use, Dr. Stock adds. “The way the code is structured, you can essentially plug it into other Python code. So if you wanted to incorporate parsing into some other project, it could be brought in to do that sorting, and then you could move on with that data for whatever specific purpose you are trying to address,” he explains.

Dr. Stock plans to expand the scope of the CSV version of the program by writing code that would enable the program to identify other information, such as measurements of pathology specimens, he says. “You could imagine very easily that we could add another column that tells us the dimensions of specimen A and the dimensions of specimen B, which makes that information more accessible than reading reports to find it.”
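A measurement extractor of the sort Dr. Stock describes might begin with a regular expression like the one below. This is a minimal sketch under assumptions not stated in the article: that dimensions are written as two or three numbers separated by "x" and followed by "cm" or "mm", and that only the first measurement in a specimen's text is wanted. The pattern and function name are hypothetical.

```python
import re

# Two or three numbers separated by "x", followed by a unit (assumed format).
DIMENSIONS = re.compile(
    r"(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)(?:\s*x\s*(\d+(?:\.\d+)?))?\s*(cm|mm)",
    re.IGNORECASE,
)


def extract_dimensions(text):
    """Return ((values...), unit) for the first measurement found, else None."""
    match = DIMENSIONS.search(text)
    if match is None:
        return None
    # The third number is optional, so drop it when it did not match.
    values = tuple(float(g) for g in match.groups()[:3] if g is not None)
    return values, match.group(4).lower()


print(extract_dimensions("A. Tan tissue fragment measuring 3.2 x 2.1 x 0.4 cm."))
print(extract_dimensions("B. Polyp measuring 5 x 3 mm."))
print(extract_dimensions("C. Received in formalin."))
```

The returned values could fill an additional spreadsheet column per specimen, alongside the location and diagnosis fields.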

In the meantime, Dr. Stock is addressing the common problem of understanding “incorrect” inputs, such as unusual terms, alternative spellings, or semicolons instead of commas, any of which may prevent the program from returning results for the affected file.
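One common way to make a text parser more tolerant of such inputs is to normalize the text before parsing it. The sketch below shows that general technique and is not a description of Dr. Stock's approach: the spelling table, the term list, and the 0.8 similarity cutoff are hypothetical, and the fuzzy matching comes from difflib in Python's standard library.

```python
import difflib
import re

# Hypothetical lookup of alternative spellings and a small list of known terms.
SPELLING_VARIANTS = {"tumour": "tumor", "oesophagus": "esophagus"}
KNOWN_TERMS = ["colitis", "dysplasia", "esophagus", "tumor"]


def normalize_text(text):
    """Unify punctuation, whitespace, and case before any parsing."""
    text = text.replace(";", ",")
    return re.sub(r"\s+", " ", text).strip().lower()


def normalize_word(word, cutoff=0.8):
    """Map a word to a known term when it is a listed variant or a near match."""
    word = SPELLING_VARIANTS.get(word, word)
    if word in KNOWN_TERMS:
        return word
    close = difflib.get_close_matches(word, KNOWN_TERMS, n=1, cutoff=cutoff)
    return close[0] if close else word


def normalize_report(text):
    """Clean the text, then normalize each alphabetic word in place."""
    cleaned = normalize_text(text)
    return re.sub(r"[a-z]+", lambda m: normalize_word(m.group(0)), cleaned)


print(normalize_report("Oesophagus; biopsy:  Tumour with colitus"))
```

The cutoff is a trade-off: a lower value catches more misspellings but raises the risk of mapping an unrelated word onto a pathology term, so any real use would need review against actual reports.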
