Editors: Raymond D. Aller, MD, & Dennis Winsten
Fairness as a focal point in creating ML algorithms
June 2024—Fairness, like so many principles, is subjective. Yet that is not deterring a growing number of medical informaticists and others interested in health care technology from advocating to incorporate fairness into machine learning algorithms to combat bias.
Mark Zaydman, MD, PhD, assistant professor of pathology and immunology, and Vahid Azimi, MD, instructor of pathology and immunology, both at Washington University School of Medicine, St. Louis, have seen the need to address bias firsthand in the form of pronounced inequalities in health outcomes across patient populations in the St. Louis area. “Folks who live equally proximal from where I’m standing are probably going to have a decade or two difference in life expectancy,” says Dr. Zaydman, who stresses that these findings are not unique.
Yet knowing which factors drive health care disparity is not tantamount to solving the problem, and no consensus exists on how to quantify fairness. With this in mind, Drs. Zaydman and Azimi have developed recommendations that pathologists and others can use to incorporate algorithmic fairness into machine learning, which they published in the Journal of Applied Laboratory Medicine (doi:10.1093/jalm/jfac085). Dr. Zaydman also gave a presentation on the topic at the 2023 Pathology Informatics Summit. A synopsis of their suggestions follows.
Defining problems. The first challenge in algorithmic fairness is assessing who is defining the problems that the algorithms will address, Dr. Azimi says. “If you don’t have people of diverse and representative backgrounds in the room to decide what problems are being worked on, then the problems that get addressed are probably not the ones that are going to impact everyone equally,” he explains.

To avoid exacerbating health care disparities by targeting only the issues facing select portions of the population, Drs. Azimi and Zaydman recommend that health care institutions and academia involve teams of pathologists who are diverse in ethnicity, race, and gender in developing machine learning algorithms.
Curating data. Pre-existing data sets that do not reflect target populations can be difficult to correct, Dr. Zaydman says. Marginalized groups historically have not been invited to participate in health studies, and some institutions do not have sufficient cohorts of minority patient groups for sampling. And without sufficient data on subpopulations in a data set, the quality of results may suffer. For example, variants that are relatively common in some subgroups may be overcalled as pathogenic.
Inter-institution data sharing and being culturally sensitive about enrolling diverse groups of patients in studies are two promising approaches to creating more representative data sets, Dr. Zaydman continues. Examples of the latter include communicating with patients using their native language and engaging local community leaders in study design and recruitment.
Data resampling techniques, such as upsampling and downsampling, can also help balance the data set, Dr. Azimi says. Upsampling involves duplicating the records of patients from underrepresented groups, and downsampling involves randomly removing records from overrepresented patient groups. However, laboratories that use data resampling techniques should be aware of the potential risks: Downsampling can discard valuable information, and upsampling can cause models to overfit, hindering their ability to generalize and, consequently, to reliably predict future observations.
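As a rough sketch of the two techniques, assuming a Python workflow with scikit-learn's `resample` utility and made-up, unlabeled records:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical data: 90 records from a majority group, 10 from a minority group.
majority = rng.normal(size=(90, 3))
minority = rng.normal(size=(10, 3))

# Upsampling: duplicate minority records by drawing with replacement.
minority_up = resample(minority, replace=True, n_samples=90, random_state=0)
# Downsampling: randomly discard majority records.
majority_down = resample(majority, replace=False, n_samples=10, random_state=0)

balanced_up = np.vstack([majority, minority_up])      # 180 records, balanced
balanced_down = np.vstack([majority_down, minority])  # 20 records, balanced
```

Because upsampling repeats records verbatim, a model fit to `balanced_up` sees the same minority patients many times, which is the source of the overfitting risk.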
Missing values in patient records, a problem often called “data missingness,” are a common data-curation challenge because incomplete records can be incompatible with some machine learning algorithms. The simplest solution is to discard incomplete records, Dr. Azimi says, but he warns that this practice can amplify health care disparities. (Studies have shown that patients from minority groups are more likely to have incomplete records.) Instead, Dr. Azimi recommends data imputation, which involves filling in missing values in patient records with estimates based on statistics gleaned from observed values in the population.
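A minimal sketch of mean imputation, using scikit-learn's `SimpleImputer` on invented lab values (`np.nan` marks a missing entry):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical lab records with missing values marked as np.nan.
X = np.array([[7.2, 140.0],
              [6.9, np.nan],
              [np.nan, 138.0],
              [7.1, 142.0]])

# Replace each missing value with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```

Each missing value is replaced by the column mean of the observed values; median or most-frequent strategies work the same way.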
Assigning ground truth labels. The labels used to train algorithms are often considered to be ground truths, but these “truths” may themselves be infused with biases, Dr. Zaydman says. For example, the definition of typical angina resembles the presentation commonly seen in men. As a result, a woman’s angina may be labeled atypical even though it is the more common presentation among women. While they do not provide recommendations for assigning ground truth labels, Drs. Zaydman and Azimi caution laboratories against conceptualizing algorithmic training labels as ground truths.
Selecting features. A seemingly straightforward method of selecting features, or inputs, for a machine learning model is “fairness through unawareness,” which refers to excluding sensitive attributes, such as race, gender, or religion, from an algorithm’s inputs in an effort to produce fair outputs. But sensitive attributes may be highly correlated with necessary input data, making them hard to eliminate, Dr. Zaydman says. In the St. Louis area, for example, when race is eliminated as a factor in an algorithm, members of minority groups can still be identified by correlated features, such as the area where they live or other socioeconomic factors, he notes.
An alternative strategy is to select features that optimize fairness by using cross-validation to evaluate how different combinations of features affect the results produced by machine learning algorithms, Dr. Zaydman says. This is already common practice for selecting features related to accuracy and generalizability. Features related to fairness could similarly be included in those cross-validation tests, Dr. Zaydman says. “In cross-validation, we can select what collection of features will produce the fairest models out of the set of features that are available,” he says.
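One way to sketch that idea on synthetic data: score each candidate feature subset by the cross-validated accuracy gap between two hypothetical subgroups and keep the subset with the smallest gap. The data, subgroup labels, and single-criterion selection here are illustrative; in practice the fairness criterion would be weighed alongside overall accuracy.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))          # hypothetical candidate features
group = rng.integers(0, 2, size=n)   # sensitive subgroup label (not a model input)
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

def accuracy_gap(features):
    """Cross-validated accuracy difference between the two subgroups."""
    preds = cross_val_predict(LogisticRegression(), X[:, features], y, cv=5)
    accs = [np.mean(preds[group == g] == y[group == g]) for g in (0, 1)]
    return abs(accs[0] - accs[1])

# Evaluate every nonempty feature subset; keep the one with the smallest gap.
subsets = [list(c) for r in (1, 2, 3) for c in combinations(range(3), r)]
fairest = min(subsets, key=accuracy_gap)
```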

Choosing algorithms. Interpretable models, such as logistic regression and decision trees, may be sufficiently powerful for addressing many problems in laboratory medicine, according to Drs. Zaydman and Azimi. Therefore, they recommend using them in lieu of complex black-box models to avoid hidden bias. With the former, they add, it is easier to understand relationships between inputs and outputs and to determine whether decisions are being made in a fair manner. “An additional benefit is that these models can be translated into simple rule-based algorithms for programming into middleware or the LIS,” according to the Journal of Applied Laboratory Medicine article.
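To illustrate that last point on synthetic data, a shallow decision tree can be printed as plain if/then rules with scikit-learn's `export_text`; the feature names here are invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0.5).astype(int)  # hypothetical simple rule to recover

# A shallow tree stays interpretable: its splits read as simple rules
# that could be reimplemented in middleware or the LIS.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["analyte_a", "analyte_b"])
print(rules)
```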
Determining objective function. The objective function translates a laboratory’s priorities into a mathematical equation that instructs the machine learning algorithm on how to determine which model is best for what the laboratory is trying to achieve. It tells an algorithm which concepts to prioritize in its calculations and how much weight to give each one. Objective functions typically balance accuracy and generalizability, but the fair and equal treatment of all subpopulations can be added as a priority, Dr. Zaydman says.
Adding a fairness term to the objective function may lead to a decrease in overall accuracy, but, despite this, Drs. Zaydman and Azimi recommend including it. “This becomes a moral, ethical, and societal question about our willingness to accept a decrement in accuracy in order to achieve fairness,” Dr. Zaydman acknowledges.
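What a fairness term might look like, as a hypothetical sketch: the objective below adds a weighted demographic-parity penalty, the gap in mean predicted probability between two subgroups, to ordinary log loss. The function and its weighting are illustrative, not the authors' formulation.

```python
import numpy as np

def objective(y_true, y_prob, group, fairness_weight=1.0):
    """Hypothetical objective: log loss plus a penalty on the gap in
    mean predicted probability between two subgroups."""
    eps = 1e-12  # guard against log(0)
    log_loss = -np.mean(
        y_true * np.log(y_prob + eps) + (1 - y_true) * np.log(1 - y_prob + eps)
    )
    gap = abs(y_prob[group == 0].mean() - y_prob[group == 1].mean())
    return log_loss + fairness_weight * gap
```

Raising `fairness_weight` steers model selection toward fairer models at the cost of some accuracy, which is the trade-off Dr. Zaydman describes.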
Tuning and calibrating. Hyperparameters are settings selected before training an algorithm that optimize performance of the model. They can help fine-tune the algorithm to achieve a balance between accuracy and fairness, Dr. Azimi says.
Just as laboratories calibrate lab instruments to ensure their outputs correlate with expected results, developers of machine learning algorithms should calibrate their algorithms, Dr. Zaydman says. To improve a model’s fairness, he recommends calibrating algorithms to produce expected results equally across different patient subgroups. Both pathologists acknowledge, however, that it can be difficult to create models that are fair in terms of accuracy and calibration across groups.
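A minimal sketch of a per-subgroup calibration check, assuming binary outcomes and invented group labels: compare each group's mean predicted probability with its observed event rate.

```python
import numpy as np

def calibration_gap_by_group(y_true, y_prob, group):
    """For each subgroup, compare the mean predicted probability with the
    observed event rate; a well-calibrated model has a small gap in every group."""
    gaps = {}
    for g in np.unique(group):
        mask = group == g
        gaps[g] = abs(y_prob[mask].mean() - y_true[mask].mean())
    return gaps
```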
Testing the model. When testing a model on new data that it was not exposed to during training, it is important to evaluate not only the model’s overall results but also how it performs with patient subgroups, Dr. Zaydman says. “That is how you are going to detect unfairness prior to deployment.”
If only a small amount of data is available from a minority population, it may be necessary to perform stratified random sampling when splitting the data into training and test sets. This ensures that examples from the subpopulation appear in both sets, he says.
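The stratified split can be sketched with scikit-learn's `train_test_split`, stratifying on a hypothetical subgroup label so both splits keep the minority group's 10 percent share:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
# Hypothetical subgroup labels: 90 majority (0), 10 minority (1).
group = np.array([0] * 90 + [1] * 10)

# Stratifying on the subgroup label guarantees minority examples
# appear in both the training and test sets, in proportion.
X_train, X_test, g_train, g_test = train_test_split(
    X, group, test_size=0.3, stratify=group, random_state=0
)
```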
Monitoring performance. Once deployed, algorithms may perform less accurately and less fairly if the patient population is different from those in the data sets used to train and test the algorithm. This may become even more of an issue as patient populations change over time. “If it is detected that fairness has degraded over time, then efforts should be made to go back and figure out why it has degraded and to improve it,” Dr. Azimi says.
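As a sketch of such monitoring, the hypothetical helper below flags when the accuracy gap between subgroups on recent production data exceeds a tolerance; the threshold is arbitrary and would be set by the laboratory.

```python
import numpy as np

def fairness_alert(y_true, y_pred, group, max_gap=0.05):
    """Flag when the accuracy gap between subgroups on recent data
    exceeds a tolerance, signaling the model should be revisited."""
    accs = [np.mean(y_pred[group == g] == y_true[group == g])
            for g in np.unique(group)]
    return bool((max(accs) - min(accs)) > max_gap)
```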
It is also important to be aware of the possibility of algorithms creating a positive feedback loop. An example of this was an algorithm that used a Black patient population’s historical health care expenditures to allocate future patient resources to that group, according to Drs. Zaydman and Azimi. Because those Black patients historically had spent less money on health care, the algorithm allocated fewer resources to them, widening the imbalance and further biasing the algorithm’s predictions.
“Machine learning algorithms are potentially so scalable and so powerful,” Dr. Azimi says, “that they can change the world in a way that reinforces their behavior.”
—Renee Caruthers
Indica Labs gets FDA nod for AP digital pathology platform
Indica Labs has received FDA 510(k) clearance for its Halo AP Dx enterprise digital pathology platform for primary diagnosis in anatomic pathology laboratories. This allows the product to be used in conjunction with the Hamamatsu NanoZoomer S360MD slide scanner for in vitro diagnostic use.
Halo AP Dx can be deployed in the cloud or on premises. It acts as a hub for primary diagnosis, collaboration, tumor boards, and second opinions.
Indica Labs, 505-492-0979
Roche signs deal to distribute sepsis software from Prenosis
Prenosis has entered an agreement that allows Roche to offer Prenosis’ Sepsis ImmunoScore artificial intelligence-enabled software as a medical device, or SaMD, on Roche’s digital platform.
Roche will provide Sepsis ImmunoScore, the first FDA de novo authorized AI diagnostic tool for sepsis, via its Navify algorithm suite. The suite offers evidence-based, certified medical algorithms from Roche and its third-party partners on one platform that spans disease areas.
ImmunoScore leverages biomarkers, clinical data, and AI to evaluate a patient’s risk of having or developing sepsis within 24 hours of assessment. The functionality is integrated into the electronic medical record.
Roche, 800-428-5074
ONC updates version of Common Agreement
The Office of the National Coordinator for Health Information Technology and its recognized coordinating entity, the Sequoia Project, have released Common Agreement version 2.0, which updates version 1.1 of the Trusted Exchange Framework and Common Agreement.
The Common Agreement, which establishes the technical infrastructure, governance, and policy for securely sharing clinical information to achieve nationwide interoperability, now provides enhancements and updates to require support for Fast Healthcare Interoperability Resources application programming interface exchange. This will allow TEFCA participants and subparticipants to more easily exchange information and will allow individuals to more easily access their own health care information using apps of their choice via TEFCA, according to a press release from the Department of Health and Human Services.
Participants and subparticipants can also incorporate the terms of participation for TEFCA into their data use agreements. This revision is intended to reduce the legal costs and other burdens for organizations seeking to connect to TEFCA, HHS reported.
“We have long intended for TEFCA to have the capacity to enable FHIR API exchange,” said Micky Tripathi, PhD, national coordinator for health information technology, in a press statement. “This is in direct response to the health IT industry’s move toward standardized APIs with modern privacy and security safeguards and allows TEFCA to keep pace with the advanced, secure data services approaches used by the tech industry.”
Additional information about Common Agreement version 2.0 is available at https://bit.ly/4bcvaMz.
Dr. Aller practices clinical informatics in Southern California. He can be reached at rayaller@gmail.com. Dennis Winsten is founder of Dennis Winsten & Associates, Healthcare Systems Consultants. He can be reached at dennis.winsten@gmail.com.