At Futurescape, a look at next-generation IHC

CAP Today

November 2009
Feature Story

David L. Rimm, MD, PhD, spoke in June at the CAP Foundation’s Futurescape conference in Rosemont, Ill. His topic: Beyond IHC: Accurate, Reproducible, and Quantitative Measurement of Protein Analyte Concentrations in Fixed Tissue. Here is an abridged version of his presentation, with references added. Dr. Rimm is a professor in the Department of Pathology, Yale University School of Medicine. He is a consultant, stockholder, and scientific founder of HistoRx, the company that commercialized the AQUA technology he describes. He is an author on the Yale-held patent on the AQUA platform, of which there are 18 placements worldwide.

David L. Rimm, MD, PhD

The true value of conventional immunohistochemistry (IHC) is not the brown chromogen produced by diaminobenzidine (DAB); rather, it’s the hematoxylin counterstain. The blue stain provides the context: it shows, for example, that the thyroglobulin-positive cells (and hence the cells derived from a thyroid tumor) are the epithelioid cells and not the benign lymph node or the stromal regions. This sort of contextually driven, binary decision has historically been the primary use of IHC. More recently, however, IHC has been used in a quantitative mode.

Starting with assessment of estrogen receptor as a replacement for the ligand binding assay, intensity scales have been developed to use the amount of DAB as a quantitative measure of the amount of protein present. This has presented a challenge to the technology since the human eye is not well designed to accurately and reproducibly assess subtle differences in intensity. A nice illustration of this fact can be found on the Web site of Edward Adelson (http://web.mit.edu/persci/people/adelson/checkershadow_illusion.html). The illustration shows that we use not only intensity but also context to judge intensity. Thus, we can’t expect the human eye to judge intensity accurately for the purposes of quantitative assessment for companion diagnostic tests.

To address this issue, we have designed a system that can measure protein expression on slides in a rigorously quantitative manner, similar to what our colleagues in laboratory medicine do. This system, called AQUA for automated quantitative analysis, is used to address some of the key problems in anatomic pathology related to standardization and reproducibility.

When a patient has diabetes, we obtain a blood sample and then do a quantitative test with a defined coefficient of variation to obtain an objective measurement that leads to an appropriate therapy. When a patient has a breast lesion, we do a core biopsy of the tissue and make a histologic diagnosis, which is followed by a measurement of estrogen receptor level: a subjective judgment that’s based on our own ability to judge intensity and that affects treatment just as dramatically. We’ve tried to adopt some of the technologies from cell biology, flow cytometry, and immunohistochemistry to achieve an accurate, nondisruptive assessment of tissue.

To do that, AQUA uses only molecular features to define architectural or subcellular compartments. Molecular interactions can include proteins, DNA, RNA, and anything else, but we exclude morphology-based features, like ‘roundness’ or other architectural definitions. That is, AQUA does not use contrast-generating edges to define features or compartments. Finally, fluorescence was chosen for its broader dynamic range and its ability to easily multiplex because an emissive signal is measured, rather than an absorptive signal.

To understand the AQUA technology, we need to define a few terms. A ‘mask’ is a region of interest. Within a cancer slide there are regions of blank space, regions of non-tumor stroma, and other connective tissue. To use a molecular tool to define a region of interest in epithelial tumors, the mask is most commonly defined by an antibody binding to cytokeratin followed by a hole-filling algorithm. [For details, see references 1–3.]

Ultimately, the goal of the AQUA technology is to generate a protein concentration within a molecularly defined subcellular or architectural compartment. A ‘concentration’ is simply a numerator over a denominator. The denominator is the total area of interest or what you’re trying to measure within, and we call that the ‘compartment.’ The numerator is our ‘target’ of interest.
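The numerator-over-denominator definition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the published AQUA algorithm (see references 1–3 for that). The function name `compartment_concentration`, the use of plain nested lists as images, and the assumption that the mask and compartment are already available as boolean grids are all hypothetical.

```python
def compartment_concentration(target, mask, compartment):
    """Return mean target intensity over pixels inside both mask and compartment.

    target      -- 2-D grid (list of rows) of target signal intensities
    mask        -- 2-D grid of booleans, True inside the region of interest
    compartment -- 2-D grid of booleans, True inside the compartment
    Returns None when no compartment pixel lies inside the mask.
    """
    numerator = 0.0   # summed target signal (the 'target')
    denominator = 0   # pixel count, i.e. the area (the 'compartment')
    for t_row, m_row, c_row in zip(target, mask, compartment):
        for t, in_mask, in_comp in zip(t_row, m_row, c_row):
            if in_mask and in_comp:
                numerator += t
                denominator += 1
    if denominator == 0:
        return None  # nothing to measure within
    return numerator / denominator
```

Returning `None` for an empty compartment is a design choice of this sketch; it keeps a blank field from being reported as a concentration of zero.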

Perhaps this is best illustrated with the example of estrogen receptor measurement in breast cancer. Starting with either tissue microarrays or whole sections, the cytokeratin staining provides a region of interest from which AQUA creates a mask. Then DAPI staining is done to define the pixels that will represent the nuclear compartment (as opposed to finding round objects or spheres by contrast generation, as other methods do). Next, estrogen receptor levels are measured in the pixels that are DAPI positive. Then the estrogen receptor intensity is divided by the area of the compartment to generate an AQUA score. [This is a simplification; for full details on compartmentalization, see references 1–3.] This score can then be standardized directly to absolute amounts of protein. This is achieved by quantitative measurement of absolute recombinant estrogen receptor by Western blot, with simultaneous measurement of ER concentrations in a series of standard cell lines. Those cell lines are then produced in a tissue microarray [as described in reference 4] and read using the AQUA technology. This analysis makes it possible to produce a standard curve of AQUA score versus absolute protein concentration, read in our lab as pg/µg of total protein. Using this method, we have determined that the cutoff for the limit of detection of the device is about 50 pg/µg. This method and its implications will be described in detail in a publication being prepared now.
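The standard-curve step can be illustrated with a short Python sketch. It assumes, purely for illustration, that the AQUA score is linear in protein concentration across the calibration cell lines; the talk does not state the shape of the curve. The function names `fit_standard_curve` and `score_to_concentration` are hypothetical, and any calibration values passed in would come from the laboratory's own Western blot measurements.

```python
def fit_standard_curve(concentrations, scores):
    """Least-squares line through calibration points.

    Fits score = slope * concentration + intercept, where each point is one
    standard cell line (concentration in pg/ug total protein, AQUA score).
    """
    n = len(concentrations)
    if n < 2 or n != len(scores):
        raise ValueError("need at least two paired calibration points")
    mean_c = sum(concentrations) / n
    mean_s = sum(scores) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    if sxx == 0:
        raise ValueError("calibration concentrations must not all be equal")
    sxy = sum((c - mean_c) * (s - mean_s)
              for c, s in zip(concentrations, scores))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_c
    return slope, intercept


def score_to_concentration(score, slope, intercept):
    """Invert the fitted line to estimate concentration from an AQUA score."""
    if slope == 0:
        raise ValueError("flat standard curve cannot be inverted")
    return (score - intercept) / slope
```

A score read from a patient sample would be passed to `score_to_concentration` with the fitted `slope` and `intercept`, and the result compared with the cutoff of about 50 pg/µg.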

A similar approach has been taken for HER2 and was published in 2005 in the Journal of the National Cancer Institute.⁵ In that work, we converted HER2 AQUA scores to protein concentrations by using a similar method and showed the accuracy was comparable to an ELISA. This method has subsequently been commercialized by a company called HistoRx and tested for robustness with different operators, different machines, and different stains on different days. The cumulative coefficient of variation using this technology is under five percent, which is comparable to or better than an ELISA, the standard in much of laboratory medicine (Fig. 1).
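For readers less familiar with the statistic, the coefficient of variation is the standard deviation expressed as a percentage of the mean. The sketch below shows that calculation for repeated readings of one specimen. How the cumulative figure across operators, machines, and days was pooled is not described in the talk, so this hypothetical helper covers only the basic definition.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean.

    values -- repeated measurements of the same specimen (at least two)
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero; the CV is undefined")
    return 100.0 * statistics.stdev(values) / mean
```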

What are the implications of exact measurement of ER? If we accept an exact cutoff at the limit of detection (about 50 pg/µg), we can compare quantitative information with the conventional approach. Initially, this was assessed in a large retrospective cohort used for many previous studies.⁶ Although this was done on a tissue microarray rather than whole slides, the tissue microarray format allowed us to assess 640 cases, about half of them node positive, with various histologies and ages. The nodal status, ER, PR, and HER2 data for this cohort are shown in previous publications.

Comparing the AQUA measurement, with ER positive defined as ER >50 pg/µg total protein, against the pathologist scoring by eye, we found a 16.7 percent misclassification rate between the two methods. Fig. 2 shows that most of the differences cluster around the threshold, but some of the differences are at the high and low ends of the scale, suggesting either false-negative or false-positive results. The implications of this observation are unknown at this time. The gold standard for any false-positive or false-negative test should be patient outcome, and we have not yet collected data to show which test is more accurate and whether the misclassification affects patient care. This technology is being tested now in the TEAM trial (a 9,600-case trial in Great Britain of tamoxifen versus the aromatase inhibitor exemestane) and in other cohorts. We should be able to answer the significance question when these studies have been completed.
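The comparison described here amounts to counting discordant calls. A minimal sketch follows, assuming each case has a measured ER concentration and a positive or negative call from the pathologist. The function name and the input layout are hypothetical; the 50 pg/µg cutoff is the one quoted in the text.

```python
ER_CUTOFF_PG_PER_UG = 50.0  # cutoff quoted in the text (about 50 pg/ug)


def misclassification_rate(er_concentrations, pathologist_positive,
                           cutoff=ER_CUTOFF_PG_PER_UG):
    """Fraction of cases where the quantitative call and the by-eye call differ.

    er_concentrations    -- measured ER per case, pg/ug total protein
    pathologist_positive -- True where the pathologist scored ER positive
    """
    if not er_concentrations:
        raise ValueError("no cases supplied")
    if len(er_concentrations) != len(pathologist_positive):
        raise ValueError("inputs must be paired case by case")
    discordant = sum(
        (conc > cutoff) != bool(called_positive)
        for conc, called_positive in zip(er_concentrations,
                                         pathologist_positive)
    )
    return discordant / len(er_concentrations)
```

Note that this counts disagreement only. It does not say which method is right, which is the question the outcome studies are meant to answer.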

To estimate how common this misclassification is, we reviewed a series of other cohorts previously tested with the AQUA technology as part of other studies. In a cohort from the British Columbia Cancer Agency and in the SWOG 9313 cohort, we found that misclassification was in the same range (10 percent to 20 percent). These cohorts are being examined in greater detail and will be reported elsewhere.

Though we have shown that the AQUA method can measure protein on a slide with good accuracy and reproducibility, the method can measure only protein that is present on the slide. A potentially highly significant problem is the question: How faithfully does measuring what is on the slide represent the status of the biomolecules in the patient? That is, if the tissue sat on the bench for a long time, did some of that target degrade? How were the epitopes affected by fixation conditions or delay in fixation with resultant extended cold ischemic time? In fact, if only a small percentage of the target has survived the preparation process, what evidence will be present to show loss? Even if all of the preanalytical variables are controlled by rapid and complete fixation of thin tissue (like a core biopsy), what controls are necessary to prove that the antibody we are using is recognizing the target protein and only the target protein? These are all critical issues for analyte measurement on slides. First I will discuss dealing with preanalytical variables and then address antibody validation.

As a first approximation, we define well-fixed core biopsies as the best tissue we can obtain that is free of preanalytical variables. Since a core biopsy is immediately fixed, it should have no cold ischemic time, especially compared with resections, which have variable cold ischemic time and some of which may sit on the bench for an hour or more, allowing the tissue to degrade before the critical cells are ultimately fixed in formalin. We have assessed a series of core biopsies and matched resection specimens from the same lesions to begin to address this issue. We are working on a project to assess a series of antibodies, including those against phosphoepitopes, which we expect to be more labile than peptide epitopes. As an example, Fig. 3 shows our assessment of both AKT and phospho-AKT. The core biopsies (shown in light blue) seem to be consistently higher than their matched resections (shown in red). The number of fields examined and measured by AQUA is shown above each bar. The top graph shows that pAKT is highly sensitive to cold ischemic time, while total AKT is less sensitive (not statistically significant). We have assessed a number of other biomarkers in this manner, and not all phosphoepitopes are sensitive to cold ischemic time. Also, an epitope does not need to contain a post-translational modification to be sensitive. The most notable example is estrogen receptor, which shows a significant decrease in resection specimens compared with core biopsies. These data will be presented in a separate publication that is in preparation.

One possible way to fix this problem, or at least adjust for it, could be to normalize using some global control for epitope degradation due to preanalytical variables. A large NCI-sponsored study is underway to assess this problem, but initially we looked at cytokeratin because it was already run on each slide, as part of the AQUA algorithm, to define the mask or the region of interest. We noted that cytokeratin expression is fairly variable, but it goes down in nearly every resection compared with the matched needle biopsy, with a few rare exceptions. We constructed a ratio of cytokeratin in the resection over cytokeratin in the biopsy to create a percent degradation index. Then we normalized by this index. Using this algorithm, the statistically significant difference disappears. The ER scores, once adjusted for preanalytical variables by the cytokeratin normalization factor, were essentially the same between resections and core biopsies.
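The normalization can be written out as a small Python sketch. The ratio is taken directly from the description above (cytokeratin in the resection over cytokeratin in the biopsy). Reading "normalized by this index" as dividing the resection score by the index is an assumption of this sketch, as are the function names.

```python
def degradation_index(ck_resection, ck_biopsy):
    """Ratio of cytokeratin signal in the resection to that in the biopsy."""
    if ck_biopsy <= 0:
        raise ValueError("biopsy cytokeratin signal must be positive")
    return ck_resection / ck_biopsy


def normalize_by_index(resection_score, index):
    """Scale a resection score up by the fraction of signal that survived.

    Assumption: 'normalizing' means dividing by the degradation index, so an
    index of 0.75 (25 percent loss) raises the score by a factor of 1/0.75.
    """
    if index <= 0:
        raise ValueError("degradation index must be positive")
    return resection_score / index
```

Under this reading, a resection with three quarters of the biopsy's cytokeratin signal would have its ER score multiplied by 4/3 before being compared with the biopsy.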

However, these are still early data, and some of the newer data using cytokeratin normalization have been contradictory, suggesting that levels of cytokeratin associated with tumor differentiation may be a confounding factor. Thus, we believe we may be more successful using either housekeeping genes that are less variable with differentiation state or hypoxia-sensitive markers. We are trying hypoxia-sensitive markers since they might be the most sensitive to changes that occur during cold ischemic time. These investigations are funded by an NCI contract and are underway in the laboratory.

Antibody validation is another significant pitfall in protein analyte measurement on slides. The most unfortunate example of this is probably the failure of the EGFR test to predict response to Erbitux. This may have been a function of the fact that there is no required quality control for antibodies. If a company invents a drug, it has to take it through the FDA and through a long set of prospective trials for that drug to be approved. If a company invents an antibody, it makes large quantities of the antibody, aliquots it into tubes, and sells it. There’s no requirement on the part of the company to prove that the antibody in the tube, in fact, binds to what it claims on the label. The best companies provide controls and package inserts with validation information. However, there is no regulatory authority or even a ‘good housekeeping’ seal of approval for antibodies. Thus, it is up to the end user to validate the antibody. When we have done this, we have been shocked at the variability between antibodies claimed to be for the same protein, and even between lots of the same antibody from the same vendor.

Before I discuss how we validate antibodies, I would like to show a few examples of antibodies we purchased that were clearly not validated or at least did not recognize the proteins the vendor claimed. A good antibody should show reproducibility from one section to the next. Thus we can use quantitative analysis of tissue microarrays to look at dozens or hundreds of histospot sections and then do a regression analysis to assess reproducibility. If an antibody is what the vendor claims it is, we find that reproducibility values show correlations (R values) in the 0.6 to 0.9 range. Anything below 0.5 is probably not a reproducible antibody and probably should be discarded, even at expense to the laboratory. Fig. 4A shows an example of HER2, a validated antibody with high reproducibility run on serial sections to show the maximum possible correlation. Figs. 4B and 4C assess two different antibodies against C-Met. Fig. 4B compares two lots of the 3D4 antibody from the same vendor, run through the same process on the same cases. The two lots showed essentially no relationship, with an R value of only 0.19, which suggests one lot of this antibody should be discarded. A second pair of lots of a different antibody against C-Met, called MAb 3729 (Fig. 4C), showed an R value of 0.8, which is acceptable, and this antibody was used for the subsequent study.⁷
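The lot-to-lot check can be sketched as a Pearson correlation between paired scores for the same histospots, followed by the rule of thumb given above. The function names are hypothetical, and the "borderline" label for values between 0.5 and 0.6 is an assumption of this sketch, since the text gives no guidance for that gap.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between paired scores (e.g. lot A vs. lot B)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to exist")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def reproducibility_call(r):
    """Apply the rule of thumb from the text to an R value."""
    if r >= 0.6:
        return "acceptable"
    if r < 0.5:
        return "discard"
    return "borderline"  # gap not addressed in the text; label is assumed
```

With the values reported for C-Met, `reproducibility_call(0.19)` falls in the discard range and `reproducibility_call(0.8)` in the acceptable range.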

This sort of assessment can also be done to compare multiple antibodies that bind different epitopes but are sold as recognizing the same protein. We tested five different clones of EGFR antibody; they are shown in Table 1 with the binding site and other data on each antibody. Each antibody was then run on a large breast cancer tissue microarray series and analyzed by regression analysis. With only one exception, a pair of antibodies that related to each other (31G7 and 2-18C9), the antibodies showed no relation to one another, even though all are antibodies that the vendors claim recognize EGFR. Thus, the results obtained for a given EGFR assay could be a function of which EGFR antibody a given laboratory purchased. This may be one of the reasons why that test has failed. We tested a similar set of antibodies claimed to bind to HER3. Again we found essentially no relationship when assessing tissues or cell lines with antibodies from previous lots. These data are being prepared for publication and will be submitted soon.

Given the magnitude of this problem, it is clear that each laboratory must have a protocol to validate antibodies, and furthermore, if we use an antibody for clinical tests, we are probably obliged to test each lot. Although there are clearly many ways to validate an antibody, Fig. 5 is an example of a validation we did in our laboratory where we used the Western blot approach on an mTOR antibody. Fig. 5 shows the Western blot in frame A and then the scan of that Western blot in frame B. Frame C shows an expanded series of cell lines read by AQUA in a tissue microarray that correlates well with the protein levels measured by Western blot. Frame D shows runs of the antibody on control arrays on different days about one month apart. When the data support a regression as shown, we have good confidence that we have an acceptable degree of accuracy and reproducibility.

Finally, I’d like to go through an example of how we’ve used this technology of multiplex assessment with quantitative controls to do what some companies have done by measuring RNA or DNA, or by RT-PCR: provide prognostic information that might be useful in stratifying patients for therapy or no therapy. The application is for melanoma. Whenever a melanoma is above a certain thickness, 1 mm thick by a Breslow score, a sentinel node biopsy is recommended. However, nearly nine times out of 10 that sentinel node biopsy is negative. That means the patient endured a fairly morbid and expensive surgical procedure and got no actionable information. Furthermore, of those patients who are negative, between 20 percent and 30 percent will have a recurrence. Thus, there is a clinical need for a test that can offer more specific information to the sentinel-node-negative patient. I will describe here a test that can assess risk using quantitative measurement of protein expression. [No figures are included because the complete description of this assay is in press at the Journal of Clinical Oncology (Gould Rothberg B, et al).]

We created a training set array, a discovery set that represented about 500 melanomas from the Yale archives from 1959 to 1995, with followup between five and 30 years. We tested that array with about 70 different markers using AQUA with antibodies that were obtained commercially, using candidate genes that we believed for one reason or another to be associated with metastasis or with progression in melanoma. We then used genetic algorithms to construct a model to assess which of those 70 markers would be most prognostic. The genetic algorithm selected five proteins (ATF2, P21, P16, beta-catenin, and fibronectin), and interestingly, two of the five required spatial information. That is, we needed to know how much was in the nucleus compared with how much was in the cytoplasm. In both cases, the proteins were transcription factors that are present in both the cytoplasm and nucleus but active only in the nucleus. The genetic algorithm model defined five conditions. If four or five of those conditions were met, the patients did very well; if zero to three of those conditions were met, the patients did very poorly. The information was independent of the Breslow score (tumor thickness).
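The decision rule at the end of this description is a simple count of conditions met. The sketch below shows only that counting step. The five conditions themselves (the markers, compartments, and cut points) are not given in this talk, so they appear here as five precomputed booleans, and the function name is hypothetical.

```python
def risk_group(conditions_met):
    """Classify a case from five yes/no conditions.

    conditions_met -- five booleans, one per model condition (how each is
                      evaluated is not specified here and is left to the caller)
    Four or five conditions met gives 'low risk'; zero to three, 'high risk'.
    """
    flags = [bool(flag) for flag in conditions_met]
    if len(flags) != 5:
        raise ValueError("the model defines exactly five conditions")
    return "low risk" if sum(flags) >= 4 else "high risk"
```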

The model was then tested on a validation set. A second cohort was collected from an affiliated laboratory at Yale where a single surgeon did sentinel node biopsies on about 550 cases. We were able to obtain tissue from 270 of those primary melanomas to produce our validation set. When normalized for relevant melanoma variables, multivariate analysis of the genetic algorithm high-risk group showed a 2.7-fold increase in death compared with the low-risk group (P=0.027). More important, if you look at just the node negatives, the target population for benefit from this test, we see a difference between a 10 percent chance of death at eight years in the low-risk group versus a 40 percent chance of death at eight years in the high-risk group. Though we do not yet know how these data will be used, it is likely that patients with a 40 percent risk of recurrence would seek adjuvant therapy, even if they are sentinel-node-negative.

References

1. Camp RL, Chung GG, Rimm DL. Automated subcellular localization and quantification of protein expression in tissue microarrays. Nat Med. 2002;8:1323–1327.

2. Camp RL, Neumeister V, Rimm DL. A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers. J Clin Oncol. 2008;26:5630–5637.

3. Gustavson MD, Bourke-Martin B, Reilly DM, et al. Development of an unsupervised pixel-based clustering algorithm for compartmentalization of immunohistochemical expression using Automated QUantitative Analysis. Appl Immunohistochem Mol Morphol. 2009;17:329–337.

4. Moeder CB, Giltnane JM, Moulis SP, Rimm DL. Quantitative, fluorescence-based in-situ assessment of protein expression. Methods Mol Biol. 2009;520:163–175.

5. McCabe A, Dolled-Filhart M, Camp RL, Rimm DL. Automated quantitative analysis (AQUA) of in situ protein expression, antibody concentration, and prognosis. J Natl Cancer Inst. 2005;97:1808–1815.

6. Dolled-Filhart M, McCabe A, Giltnane J, Cregger M, Camp RL, Rimm DL. Quantitative in situ analysis of β-catenin expression in breast cancer shows decreased expression is associated with poor outcome. Cancer Res. 2006;66:5487–5494.

7. Pozner-Moulis S, Cregger M, Camp RL, Rimm DL. Antibody validation by quantitative analysis of protein expression using expression of Met in breast cancer as a model. Lab Invest. 2007;87:251–260.

Dr. Rimm thanks all the people in his laboratory who did the work he shared with Futurescape attendees, and he acknowledges in particular Robert Camp, MD, PhD, who was the original co-investigator with him and who wrote the original software for the AQUA technology. Their laboratory’s Web site is www.tissuearray.org.