Charna Albert
May 2024—In a time of wellness testing and high rates of levothyroxine prescribing for hypothyroidism, it may also be time to rethink TSH test result interpretation.
Laboratory testing is more accessible now than it used to be and patients are more involved in their own care. “You’re trying to give the patient access and ability to take care of their own health. But the double-edged sword is you can start over-ordering things, and the way we’ve designed lab testing was not made for that,” says Joe El-Khoury, PhD, D(ABCC), associate professor of laboratory medicine, Yale University School of Medicine, and director of the clinical chemistry laboratory, Yale New Haven Health.
Commercial providers of thyroid-stimulating hormone assays suggest an upper reference limit of about 4.0 mIU/L or slightly above. But in the absence of clinical symptoms, a patient with a TSH of 4.0 mIU/L doesn’t necessarily require treatment.
“Guidelines based on cardiovascular outcome trials advise that treatment of hypothyroidism may not need to be initiated unless TSH is above 10 mIU/L or significant symptoms of hypothyroidism or lipid profile abnormalities can be improved,” says Laura Boucai, MD, endocrinologist at Memorial Sloan Kettering Cancer Center. “Evaluating for thyroid autoimmunity and understanding the true probability that a minimal elevation of TSH indeed represents disease is critical.”

The clinical symptoms of hypothyroidism are nonspecific, says Joely A. Straseski, PhD, MS, MT(ASCP), D(ABCC), professor in the Department of Pathology at the University of Utah School of Medicine and section chief for clinical chemistry, medical director of endocrinology, and co-director of the automated core laboratory at ARUP Laboratories. “It’s not like there’s one thing that nails this diagnosis, and that’s what makes it difficult. I will probably have five of the 10 symptoms before I go home today,” jokes Dr. Straseski, a member of the CAP Accuracy-Based Programs Committee. In the absence of obvious clinical symptoms, physicians may look to flagged results to make decisions about initiating levothyroxine. “It’s the way our systems work. But if someone is primarily treating to the flag, if they see there are results outside of what we would expect to see in a healthy individual, then they are compelled to follow up with treatment.”
TSH is commonly ordered, “and the more you test, the more you find” and see subclinical versions of disease, she says. “This trigger reaction to something that’s just barely over the upper reference limit is problematic.” Other tests, like the thyroid antibodies, can help confirm the diagnosis. “There are other tests that have to be considered in conjunction.”
Subclinical hypothyroidism is defined as an elevated TSH and a free thyroxine (T4) within the normal reference range. A TSH between the upper reference limit and 10 mIU/L is usually thought to represent subclinical disease (Ross DS. J Intern Med. 2022;291[2]:128–140). But intermethod biases add to the confusion around diagnosis, Dr. Straseski says. “You’ll see quite a bit of variation in reference intervals among labs, primarily in the upper reference limit.” A patient with borderline TSH values may have a result above the upper limit with one test and below it with another. “The differences between the assays as far as what upper limit is used is confusing to clinicians.”
To complicate matters further, Dr. El-Khoury says, TSH is subject to diurnal and seasonal variation, with values peaking in the winter and at their lowest in the summer in healthy individuals, while free T4 remains relatively stable (Yamada S, et al. J Endocr Soc. 2022;6[6]:bvac054). “These seasonal changes . . . are not captured by our reference intervals and may lead to false diagnoses of subclinical hypothyroidism and unnecessary prescriptions of levothyroxine to euthyroid individuals,” he wrote in a 2023 letter to the editor (El-Khoury JM. Clin Chem. 2023;69[5]:537–538). TSH levels also can rise with nonthyroidal illness, he says.
Dr. Boucai notes that TSH is also known to vary with, among other things, race, body mass index, age, and whether the patient is a smoker. “Confirming the diagnosis of subclinical hypothyroidism is critical before initiating lifelong replacement therapy,” she says. “This includes measuring TSH in different seasons, in fasting and fed states, multiple times before prescribing levothyroxine replacement.”
The data suggest patients are being overprescribed levothyroxine, Dr. El-Khoury says. He points to a 2021 study by Juan Brito, MBBS, of Mayo Clinic, and others (Brito JP, et al. JAMA Intern Med. 2021;181[10]:1402–1405). Dr. Brito and his coauthors analyzed insurance claims data linked with laboratory results from patients throughout the U.S. who were prescribed levothyroxine between 2008 and 2018. In a subset of 58,706 patients with thyrotropin and FT4 or T4 levels available, Dr. Brito and his coauthors found that levothyroxine was initiated for overt hypothyroidism in 8.4 percent, for subclinical hypothyroidism in 61 percent, and for patients with normal thyroid levels in 30.5 percent.
“My concern,” Dr. El-Khoury says, “is that the general population is being treated at such low levels just because they’re flagging high when in fact there’s nothing wrong with them.” Another problem: The drug’s side effects, which include heart palpitations and headaches, among others, have been understated, he says. Response to a public service video he posted on YouTube about thyroid testing and levothyroxine has been large: “Many patients say their doctors do not believe them when they say they have these symptoms.”
TSH levels tend to rise with age, says James D. Faix, MD, medical director of immunology at Quest Diagnostics and a member of the CAP Clinical Chemistry Committee. “Many elderly people are being treated for hypothyroidism because people aren’t aware that the reference interval that’s used generally is too low for elderly individuals,” Dr. Faix says. “An elderly person with a TSH of six is probably completely euthyroid” (Biondi B, et al. Lancet Diabetes Endocrinol. 2022;10[2]:129–141). And thyroid hormone replacement can precipitate atrial fibrillation or accelerate osteoporosis, he says.
In his review of hypothyroidism treatment, Douglas Ross, MD, of Massachusetts General Hospital’s Endocrinology Division, says epidemiology studies demonstrate increased cardiovascular mortality in patients with subclinical hypothyroidism. In addition, subclinical hypothyroidism has been associated with measures that correlate with cardiovascular disease, he writes, including higher lipid levels, increased epicardial adipose tissue, increased carotid intima-media thickness, and endothelial dysfunction, and treatment improves total and LDL cholesterol levels. Dr. Ross describes it as “reasonable and necessary,” based on studies, to treat patients with subclinical hypothyroidism if their TSH levels exceed 7–10 mIU/L. “However, treatment of lower levels of TSH has not been shown to provide a clear benefit,” he writes, though “increasingly most patients with TSH values flagged as abnormal, because they exceed 4.2–5.0 mIU/L, are treated with thyroid hormone” (Ross DS. J Intern Med. 2022;291[2]:128–140).
In a randomized, placebo-controlled trial of 737 adults (≥ 65) with subclinical hypothyroidism, Stott, et al., found no apparent change at one year in the Hypothyroid Symptoms score or Tiredness score when TSH levels were lowered from a mean of 6.4 mIU/L at baseline to 3.63 mIU/L with treatment. Half received placebo with mock dose adjustment (Stott DJ, et al. N Engl J Med. 2017;376[26]:2534–2544). And in a meta-analysis of 21 randomized clinical trials that compared treatment of subclinical hypothyroidism with placebo, treatment was not associated with improvements in general quality of life or thyroid-related symptoms (Feller M, et al. JAMA. 2018;320[13]:1349–1359).
“The problem,” Dr. El-Khoury says, “is that subclinical hypothyroidism is a biochemically defined disease, meaning that the definition comes purely from the changes in TSH versus T4.” That’s what is being looked to for the diagnosis, he says. “It’s okay to have biochemical-based definitions, as long as they continue to be relevant and have not been challenged by new data—which is what I feel is happening here.”

At Dr. El-Khoury’s institution, the clinical laboratory and endocrinology department opted to append a comment to all results that fall between 4.2, the assay’s suggested upper reference limit, and 10 mIU/L.
The comment explains that TSH is known to naturally increase in winter, with age, and with certain nonthyroidal illnesses, and recommends retesting patients with mild abnormalities in two to three months. The European Thyroid Association guideline for managing subclinical hypothyroidism also recommends retesting after a two- to three-month interval (Ross DS. J Intern Med. 2022;291[2]:128–140). In a study published this year, van der Spoel, et al., found that in a large proportion of adults 65 and older with mild subclinical hypothyroidism, TSH levels spontaneously normalized in a median follow-up of one year even after two consecutive measurements of elevated levels. A third measurement may be recommended in older adults, the authors said, before treatment is considered (van der Spoel E, et al. J Clin Endocrinol Metab. 2024;109[3]:e1167–e1174).
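As a rough illustration of how such a rule might be expressed in a laboratory information system, the sketch below attaches a comment to results in the mildly elevated band. The function name, the constant names, and the comment wording are hypothetical, and the handling of the exact boundary values is an assumption; the article gives only the 4.2 and 10 mIU/L thresholds.

```python
from typing import Optional

# Thresholds from the article: 4.2 mIU/L (the assay's suggested upper
# reference limit) and 10 mIU/L. Whether each boundary is inclusive is an
# assumption made here for illustration.
UPPER_REFERENCE_LIMIT = 4.2
UPPER_COMMENT_LIMIT = 10.0

# Illustrative wording only; this is not the laboratory's actual comment.
MILD_ELEVATION_COMMENT = (
    "TSH naturally increases in winter, with age, and with certain "
    "nonthyroidal illnesses. Consider retesting in two to three months."
)


def tsh_comment(tsh_miu_per_l: float) -> Optional[str]:
    """Return an interpretive comment for a mildly elevated TSH, else None."""
    if UPPER_REFERENCE_LIMIT < tsh_miu_per_l <= UPPER_COMMENT_LIMIT:
        return MILD_ELEVATION_COMMENT
    return None
```

Under these assumptions, a result of 6.4 mIU/L would carry the comment, while results of 3.0 or 12.0 mIU/L would not.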
In Dr. El-Khoury’s view, retesting patients falls short, in part because physicians may not be aware of the guidance on retesting or may overlook the laboratory’s comment and prescribe levothyroxine at first elevation. Then, too, “We’re only retesting because we don’t have the process set up right from the beginning,” he says. Raising the assay’s upper limit to 7 mIU/L and establishing it as a clinical decision limit, or the universal point at which treatment should be considered, would be a better fix, he says.
But big obstacles stand in the way of adopting a universal clinical decision limit for TSH, and the lack of intermethod agreement is one. It isn’t like glucose or vitamin D or other standardized tests like cholesterol, Dr. Straseski says. “If your cholesterol is above 200, guidelines say you’re likely going to do something. We all know that. We don’t ask, ‘Was it a Roche cholesterol? Was it an Abbott cholesterol?’ Clinicians don’t evaluate results like that, though they likely should if an assay isn’t standardized.”

Katleen Van Uytfanghe, PhD, MSc, postdoctoral fellow in the Ref4U reference laboratory (part of the laboratory of toxicology) at Ghent University, Belgium, explains that for TSH, a true standardization approach isn’t possible. She is chair of the International Federation of Clinical Chemistry and Laboratory Medicine Committee for Standardization of Thyroid Function Tests, which has been chipping away at standardization for TSH and serum total and free thyroid hormone testing since 2005. “For standardization, we need a primary reference material and/or a secondary measurement procedure,” Dr. Van Uytfanghe says. “For TSH, both are missing.”
There is a WHO international standard that’s used as a primary reference material and calibrator for the current assays. “But this material is not commutable” with patient samples; it doesn’t behave quite like a native human sample in an immunoassay (Cowper B, et al. Clin Chem Lab Med. 2024;62[5]:824–829). TSH is a complex molecule that can undergo many modifications after it’s formed, she explains. “It’s very difficult to make just one material that would represent that complex mixture of different forms as you encounter it in a human serum sample.” And there isn’t yet a mass spectrometry-based method for quantification of TSH, she says, because the molecule is so complex.
In lieu of standardization, the committee has pursued a harmonization approach. They’ve developed a clinical serum-based reference panel for TSH, traceable to the WHO reference material. The panel includes 100 serum samples, spanning the full concentration range of TSH. Committee members worked with manufacturers on a comparison study to prove that the reference panel could be used to recalibrate the commercial assays based on statistically inferred targets, leading to improved harmonization, and published its proof-of-concept study (Thienpont LM, et al. Clin Chem. 2017;63[7]:1248–1260). “In an ideal world where there is harmonization, we could use a common reference interval, and with the publication we wanted to prove that this is possible,” Dr. Van Uytfanghe says.
In the committee’s follow-up study, published this year, the reported TSH concentrations of the reference panel samples relative to assay calibrators, across 15 immunoassays, were better harmonized, suggesting that some manufacturers have made improvements to their assays since the committee made the sample reference panel available (Cowper B, et al. Clin Chem Lab Med. 2024;62[5]:824–829). “We can see that the difference between the methods is smaller compared to five years ago,” Dr. Van Uytfanghe says. In Japan, she notes, the regulatory agencies now require TSH assay harmonization, which may have an indirect influence on manufacturers internationally.
Assay harmonization could pave the way toward a more uniform reference interval, or what Dr. Van Uytfanghe calls generalized reference intervals. But “a uniform reference interval does not indicate a one-size-fits-all reference interval,” she says. “What we can do is come up with reference ranges that should be applicable to multiple assays should they be harmonized, but it will never be one reference range.” Using one number for the upper reference limit will not be possible because TSH values depend on too many factors.
Dr. Boucai of Memorial Sloan Kettering notes that “different populations may have their own set points for TSH,” as with older individuals, for example. “It would be ideal to establish a TSH reference range defined by age, race, BMI, in specific populations, but this is hard to practically implement,” she says.
If the assays were harmonized, Dr. Van Uytfanghe says, it would be easier to establish reference intervals specific to these populations. “There are a lot of variables to take into account, which means there’s a lot of work in establishing these kinds of reference intervals.” But with assay harmonization, “the manufacturers could share the work, and it would take less effort to come up with region-specific and age-specific reference intervals and so on.”
Harmonization also would make it easier for laboratories to implement recommendations from the medical literature. Now, it’s an issue when the assay used in the study differs from the assay the laboratory uses. “If that doesn’t matter, it’s way easier to use someone else’s research for yourself.”
The manufacturers have been partners in the committee’s work on standardization, Dr. Van Uytfanghe says. “It’s not just some scientists who say it can be done. They all were coauthors. They all cooperated.” That’s been the case throughout, she says, despite the likelihood that recalibration could be demanding from a regulatory standpoint. The FDA would probably require new 510(k) approval. “But then you have Europe, which has a set of regulations, you have China with different regulations—so for the manufacturers, it’s not easy.”
Even as progress on harmonization marches on, Dr. Van Uytfanghe acknowledges that the TSH problem has no simple fixes.
“There has been a lot of controversy on the upper limit of the TSH reference interval,” she says. “The controversy is still here.”

Understanding the assay’s history may shed light on the current situation. “The first generation of TSH assays couldn’t differentiate normal from low,” says Dr. Faix, a former member of the IFCC Committee for Standardization of Thyroid Function Tests. “Because the focus was on the low end, the upper limit of TSH was not given as much attention.” The first-generation tests used an upper limit of 10 mIU/L. As sensitivity improved, manufacturers lowered upper reference limits, typically to around 5 mIU/L.
Dr. Straseski says there have been long-standing questions about how the earliest reference interval studies for TSH were conducted. The study populations likely included individuals with subclinical hypothyroidism, skewing the upper limit. “That issue is still plaguing us today, and that was many decades ago, at this point.” And package inserts tend to provide limited information about how reference populations are derived, Dr. Straseski says. Take the TSH assay her institution uses. The package insert says it was derived from “a total of 516 healthy test subjects,” she says, but nothing about how those subjects were ruled in or out. “Much more information needs to be provided when we’re looking at who the reference population is made up of.”
In 2002, Dr. Faix says, researchers analyzed a nationally representative sample of thyroid function test results from National Health and Nutrition Examination Survey data, finding that the mean TSH level in the general U.S. population is about 1.5 mIU/L (Hollowell JG, et al. J Clin Endocrinol Metab. 2002;87[2]:489–499). That finding precipitated calls to further lower the upper limit, he says, though physicians were concerned about false-positives and resisted. “You want to detect people who have early hypothyroidism,” he says, “but you don’t want too many false-positives. You don’t want too many people who are euthyroid and have just a slightly elevated TSH.”
Christopher Naugler, MD, observed this problem firsthand in Alberta, where he is professor in the Department of Pathology and Laboratory Medicine at the Cumming School of Medicine, University of Calgary. One of two major laboratories in the province lowered the upper limit of the reference range for TSH from 6 to 4 mIU/L to match that of the other laboratory, following an administrative directive to harmonize reference ranges among laboratories in Alberta (Symonds C, et al. CMAJ. 2020;192[18]:E469–E475). “New abnormal TSH results tripled from 3.3 percent to 9.1 percent,” and new levothyroxine prescriptions increased by about 20 percent, Dr. Naugler says. “Now we have an entire new cohort of individuals overnight who were believed by their clinicians to be hypothyroid when the only thing that happened was a minor change in the reference range.”
The endocrinology community in Alberta was not in favor of the change, he says. “They were worried it was going to create potentially unnecessary endocrinology referrals and an increase in prescription rates and cost to patients, and that’s exactly what happened.”
Sachin Majumdar Jr., MD, associate professor of medicine, Yale University School of Medicine, and director of the endocrine neoplasia clinic, Yale New Haven Health, says he sees a fair number of referrals for patients with subclinical hypothyroidism, but he doesn’t have a large data set with which to determine if the number has risen. Many of these referrals come from primary care and include patients seen by physician assistants and advanced practice registered nurses. “In general, we get more referrals from those providers for things like hypothyroidism,” he says. “In the past, primary doctors would manage that more, but in their absence, other practitioners probably feel less comfortable.” And “as things become protocolized, people may tend to go more by reference ranges now.”
In the bigger picture, Dr. El-Khoury says, the way reference intervals typically are derived in laboratory medicine may not be suited to a test like TSH, which is often ordered in healthy individuals.
With reference intervals based on a central 95 percent of the reference population, “that means two and a half percent of people who get tested are going to flag as high, with no other problems.”
In an opinion piece in Clinical Chemistry, he and his coauthors argue for reevaluating the 95 percent inclusion criteria for defining reference intervals, where appropriate (El-Khoury JM, et al. Clin Chem. Published online March 18, 2024. doi:10.1093/clinchem/hvae026). “What we’re calling it is ‘separate approaches,’” he says. For tests commonly ordered in healthy people, they suggest a central 99 percent spread. This approach could increase false-negatives but would significantly reduce the false-positive rate, they write. Widening the spread, Dr. El-Khoury tells CAP TODAY, would keep the same general model while accounting for more natural variation. For TSH, “I’m not saying widen it to the 99th percentile; I’m saying use the existing studies that tell us where it’s beneficial and then set the limit there,” as was done for glucose and vitamin D. “Those are cutoffs based on clinical studies that assessed the risk of developing disease. With TSH, the risk is not getting benefit and more side effects.”
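The arithmetic behind the central 95 percent and central 99 percent intervals can be sketched in a few lines. This is a simplified illustration, not the authors' method: it assumes the healthy people being tested resemble the reference population and that the interval leaves equal tails on each side. The function names are hypothetical.

```python
def high_flag_fraction(central_fraction: float) -> float:
    """Fraction of healthy individuals expected above the upper limit.

    A central interval excludes equal tails on both sides, so half of the
    excluded fraction lies above the upper reference limit.
    """
    if not 0.0 < central_fraction < 1.0:
        raise ValueError("central_fraction must be between 0 and 1")
    return (1.0 - central_fraction) / 2.0


def expected_high_flags(n_healthy_tested: int, central_fraction: float) -> float:
    """Expected number of healthy people flagged high among those tested."""
    return n_healthy_tested * high_flag_fraction(central_fraction)
```

With these assumptions, a central 95 percent interval flags about 2.5 percent of healthy people as high, or roughly 250 of every 10,000 tested, while a central 99 percent interval flags about 0.5 percent, or roughly 50 of every 10,000.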
Other options involve moving away from the within-the-reference-interval-is-good, outside-is-bad approach. “That’s not how biology works. It’s a spectrum,” Dr. El-Khoury says. “The problem is you have people who may [in the lab report] look ‘out’ but are fine, and TSH is proof of that.” (The opposite can be true also.) Clinical decision limits are one option, for tests that are standardized. Personalized reference intervals are another, though that won’t help a patient without a recorded medical history who presents in a crisis state. Reporting results with z-scores, or how far the result is from the mean in terms of standard deviations, is yet another possibility, though it’s not one with which many physicians are familiar. “Instead of flagging one as in or out, it shows how far you are without giving you one critical indication that something is wrong,” he explains.
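A z-score report could look something like the minimal sketch below, which uses the conventional definition: the distance of a result from the reference mean, in standard deviations. The function names and the report format are hypothetical, and the reference mean and standard deviation would have to come from the laboratory's own reference population; no real values are implied here.

```python
def z_score(result: float, reference_mean: float, reference_sd: float) -> float:
    """Distance of a result from the reference mean, in standard deviations."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (result - reference_mean) / reference_sd


def format_z_score(result: float, reference_mean: float, reference_sd: float) -> str:
    """Render the z-score as a signed distance instead of an in/out flag."""
    return f"{z_score(result, reference_mean, reference_sd):+.1f} SD"
```

For example, with an invented reference mean of 2.0 and standard deviation of 1.0, a result of 5.0 would be reported as "+3.0 SD" rather than simply flagged high. Because TSH distributions are skewed, a laboratory might apply this to transformed values; that choice is outside what the article describes.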
Dr. Straseski believes personalized reference intervals are “where the world will eventually end up. Maybe a four is completely appropriate for me, and a 4.17 is completely appropriate for you.”
“We are absolutely not there,” she says. “But it’s something we’ll be considering in the future.”
For now, Dr. Straseski points to the other tests that support the diagnosis.
“I can name almost a dozen tests that support thyroid function testing,” she says. “TSH is our jumping-off point, but it’s not the end-all, be-all.” Her own hospital system is evaluating the testing formularies for thyroid disease, she says.
The early reference interval studies for TSH likely didn’t employ that supportive testing in recruiting reference populations, she says. “Free T4 hasn’t been a great test for all that long.” Neither has free T3 or the thyroid antibodies. With supportive testing, it would be possible to eliminate from the studies individuals with subclinical disease. “Ruling in or ruling out individuals from these reference populations is the critical part of this, no matter what analyte we’re talking about, but particularly for the thyroid. More careful consideration of who’s included in these populations would be helpful.”
So too would more granularity in results interpretation, says Dr. Naugler of the University of Calgary. He adds, “We need to give guidance to clinicians in terms of what to do with that result.”
“In a time when labs are increasingly automated and lab tests are commoditized, this is a real opportunity for laboratory professionals to value-add by providing interpretive explanations for clinicians and to be available for our clinical colleagues to consult on questions that arise,” he says.
The need is greatest in primary care, Dr. Naugler says, and the laboratory could partner with endocrinology groups to help design and draft the interpretive comments. He cites as an example an initiative his research laboratory spearheaded, in which it partnered with local cardiology groups to provide, for primary care physicians, interpretive comments for lipid tests and prescription recommendations. “They would get the lipid result, a Framingham risk score, and a recommendation as to whether the patient should be prescribed a statin.” A similar approach could be used for TSH testing, he says, in which the laboratory partners with endocrinology groups to generate the comments and algorithms for TSH interpretation and possibly also recommendations in the report about whether levothyroxine should be prescribed.
“The diagnostic interpretive part for general lab tests, for chemistry, hematology, is often overlooked,” he says. “And that’s something we need to be aware of and take every opportunity to show the value in laboratory professionals by providing those interpretive services.”
Charna Albert is CAP TODAY associate contributing editor.