Feature Story

CAP TODAY

DNA deluge

October 2002
Eric Skjei

Finding one small variation in one genetic sequence and developing ways to fight its often lethal consequences is the great biomedical quest of our age. But this quest cannot succeed without better tools—better software, hardware, and information systems, better processes and strategies, better digital DNA.

"This is an exciting time for pathology, trying to assimilate all this additional information that goes beyond what’s under the microscope," says Arul Chinnaiyan, MD, PhD, assistant professor of pathology and urology at the University of Michigan Medical School, Ann Arbor. "We now have the chance to try to encompass the entire expressed genome or proteome, and doing so is going to add so much additional information to our work that it is going to be very powerful."

But as pathologists begin to look at the molecular profile of, say, a tumor, rather than rely solely on the morphology viewable under the microscope, "we will find we have a lot more information to deal with," Dr. Chinnaiyan says.

The sheer volume of information generated may overwhelm our ability to understand it and apply its insights. We will need better standards, faster databases, improved image-viewing tools, more flexible and simpler reporting applications, and more intelligent decision-support software.

"We need tools that allow all of us involved in this work to talk to each other, exchange data in a common format, and deposit data in publicly available sites," says Kenneth Hillan, MD, "so that when new research findings are published, anyone can go to the appropriate site, access that data and analyze it themselves, then draw their own conclusions." Dr. Hillan is senior director of pathology at Genentech and professor of molecular and therapeutic pathology at the University of Leeds, England.

The good news is that we can build a faster and smarter informatics infrastructure, and much of that work is underway.

An LIS for the future

Working toward this end is Molecular Pathology Laboratory Network, Maryville, Tennessee.¹

"We’ve been in operation in east Tennessee for 13 years," says Roger Hubbard, PhD, CEO of the network. "During that period we’ve grown dramatically, to the point that we now have a national presence as well as several international clients."

The network, which focuses on infectious disease testing, molecular oncology, and mutation analysis, determined over time that mainstream laboratory information systems couldn’t meet its needs. "We found that some of the types of information that we need to store could not be handled by a conventional LIS," says Dr. Hubbard, "information such as genetic expression data, DNA sequences, and so on." So the company decided to build an LIS from scratch to retrieve and manage test data from the sophisticated assays that will emerge in a postgenomic era. Work on the new LIS began late last year, and as the company made progress, Dr. Hubbard says, it became increasingly excited about the system’s larger potential.

"We built it to solve our own needs," he explains, "but in the course of doing so, we realized that it might have a broader market than just for use within our own labs, so we are now thinking of commercializing it." That broader market comprises the same entities now targeted by LIS companies—hospital labs, reference labs, private companies, and academic research centers.

The system has been designed to handle all types of laboratory tests. "Our thinking is that this is a tool that can be used in a molecular lab, but it should be able to manage conventional tests as well," Dr. Hubbard says. "We made it sophisticated enough for DNA but flexible enough to store and retrieve chemistry data, so it’s really a multifunctional platform."

From a technology standpoint, says CIO Kenneth Billings, the new system is based on "a very fast, object-oriented Caché database that is easily scalable and interacts easily with other technologies, like XML, Java, and active server pages." It is also completely Web-enabled. "Every user of the program is a Web user, even in-house," he says. "It doesn’t matter if users are on Windows boxes or Apple products or something else. As long as they have a functioning Web browser, they can use the system."

The LIS incorporates strong image-handling functionality. "We’re actually storing results images as part of a patient’s record; a user can pull them up dynamically without a problem," says Billings. "Physicians will be able to view those images, zoom and browse through them, and analyze them." Real-time graphing capabilities, including time-series data depicting patient results over time, are also supported. Adds Dr. Hubbard, "We do a lot of flow cytometry immunophenotyping for leukemia and many molecular assays, including FISH [fluorescence in situ hybridization] tests. This application will enable us to dynamically generate a report that not only incorporates the flow cytometry data but the dot plots or histograms and the FISH assay results images at very high speed, dynamically, to any remote computer."

Those who have seen prototypes of the system are enthusiastic, Dr. Hubbard says, and he predicts a strong reception in the LIS marketplace. The as-yet-unnamed product was slated to go live this fall within Molecular Pathology Laboratory Network’s labs, and the commercial launch is tentatively scheduled for next year.

A focus on clinical ‘actionability’

A different but complementary strategy for guiding the development of these new tools is to focus on the ultimate user: the physician. The first step is to ask which new genetic insights and tests are "clinically actionable"—that is, they can be used immediately by physicians to help their patients. Then you build the systems and processes to make that possible. One company taking this approach is Milwaukee-based PointOne, which plans this year to launch a clinical genetic information service it calls a clinical data-analysis system, or CDAS. The service initially will focus on thrombophilia.

To zero in on clinically actionable genetic and proteomic insights, PointOne applies a "so what" test. "We ask ourselves: If a physician has this information, what can he or she do about it? How will it help the patient?" says CEO Drew Palin, MD. CDAS can test for any number of genetic abnormalities, but those conditions for which little can be done, such as Alzheimer’s disease, are not good candidates.

In this prototype stage, PointOne is working with Wisconsin-based Aurora Healthcare System and Third Wave Technologies, a biotechnology company in Madison.

The process starts with existing data and attempts to identify patients with a genetically influenced risk for thrombophilia. "A group of 50 to 75 physicians have volunteered to participate in this beta stage," Dr. Palin says. "We’re taking existing claim, clinical, pharmacy, and other data, developing clinical criteria to identify an at-risk cohort, and then filtering that data to identify those patients with all the risk factors, those who may have an increased risk of thrombophilia."
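
The filtering step Dr. Palin describes can be pictured as a simple set test. What follows is a minimal sketch, not PointOne's actual system: the record layout, the function name, and the three risk factors are hypothetical placeholders, since the clinical criteria CDAS applies are not spelled out here.

```python
from dataclasses import dataclass, field

# Hypothetical criteria; the real clinical criteria are not given in the article.
REQUIRED_RISK_FACTORS = frozenset({"prior_clot", "family_history", "recent_surgery"})


@dataclass
class PatientRecord:
    patient_id: str
    # Risk factors already extracted from claim, clinical, and pharmacy data.
    risk_factors: set = field(default_factory=set)


def at_risk_cohort(records, required=REQUIRED_RISK_FACTORS):
    """Return IDs of patients whose record shows every required risk factor."""
    return [r.patient_id for r in records if required <= r.risk_factors]


# Made-up records for illustration only.
records = [
    PatientRecord("P001", {"prior_clot"}),
    PatientRecord("P002", {"prior_clot", "family_history", "recent_surgery"}),
    PatientRecord("P003", {"family_history", "recent_surgery", "smoker"}),
]

print(at_risk_cohort(records))
```

In this toy data only P002 carries all three factors, so only that patient would be flagged for the physician report. A production system would presumably weigh factors rather than demand all of them, but the all-factors rule mirrors the quote above.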

Once people who appear to be at high risk for this disorder are identified through data analysis and review, the next step is to issue a report to their physicians, telling them they have patients who may be at risk for a genetically influenced coagulation disorder. The report also informs the physicians that they can order a diagnostic test to determine if those patients have a factor II or V single nucleotide polymorphism, a genetic variation that may predispose them to developing a deep venous thrombosis. The test is then conducted by Third Wave, which, Dr. Palin says, has "a very simple, affordable fluorescence assay that can identify these variations."

At this point, the CDAS process is similar to requesting an esoteric test from a reference lab. "The SNP test and use of the Third Wave technology take place in a reference lab setting," Dr. Palin says. And in this sense, PointOne is calling on some of the basic functions—quality assurance, test validation—provided by laboratorians thousands of times a day in labs nationwide. But in this case, he adds, interpreting the result thus obtained is emblematic of a new reality—the need to call on several sources of information to understand the significance of these results. "It will be increasingly hard for any single clinical specialty to interpret these tests without having access to the rest of the clinical data," he says.

Once at-risk patients are tested and their results reported, physicians can recommend care and treatment. The prototype CDAS project allows PointOne to create an infrastructure and a process that can be applied to other genetic tests as they appear. "We’re going to have hundreds of thousands more data points to work with, and we’re going to have to be able to distill and customize all that data into something that is relevant for a specific patient," says Dr. Palin. "What we are doing now at PointOne is building the information system to support that process."

PointOne plans to develop two other products within the next year or so. "CDAS will evolve into what we call a genetic knowledge management system, which will have the same basic functionality but which will also incorporate links to clinical content, relevant articles, refereed journals, and so on," Dr. Palin says. The third product, a genetic research system, will focus on the needs of researchers.

The power of integration

In the quest for these new digital tools, the power of integration is a recurrent theme. Physicians must be able to bring together data from many sources. A group of physician researchers at the University of Michigan Medical School, which has for several years been studying prostate cancer, is finding that its enhanced information system is proving this point once again and helping to uncover important clinical insights. In this case, the information technology strategy has been to develop tools in an ad hoc manner as resources allow, primarily to meet requirements for higher performance and stronger database functionality.

"We are building a bioinformatics infrastructure that we call Profiler," says the University of Michigan’s Dr. Chinnaiyan. The system includes clinical and pathology information and a growing body of tissue microarray and gene expression data. It also supports image viewing and is accessible via the Web.² Although focused on prostate cancer, it is already being adapted to accommodate work on other cancers.

With the system’s value and functionality having been proven at the University of Michigan, it will soon be replicated at Brigham and Women’s Hospital, Boston, under the direction of Mark Rubin, MD, associate professor of pathology. Dr. Rubin is also director of the Harvard Tissue Microarray Core, a collaboration of the Dana-Farber/Harvard Cancer Center and the Harvard Center for Neurodegeneration and Repair Center for Molecular Pathology.³

"We’ve been developing Profiler for gene expression array and tissue microarray analysis," says Dr. Rubin, "and we’ve made progress in both areas." The tissue microarray components allow an investigator to review slides associated with the data. Earlier versions also supported image-viewing capabilities, but the system has been significantly improved in the past year. "We are now using an Oracle database and we have cleaned up our database structure," Dr. Rubin says. "Our main goal was to make it possible to do what we refer to as clinical queries, which allow us to look for associations between pathology data and other clinical parameters, including outcomes analysis."

The system will soon evolve to a point where a clinician can introduce a patient’s data into it and then compare that person’s gene expression profile to the many profiles already developed, Dr. Chinnaiyan predicts. The system will offer insight into probable outcomes, therapies previous patients were exposed to, and more. It may even help tailor treatment to a patient’s genomic and proteomic profile.
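
One way to picture that kind of comparison is a nearest-profile lookup. The sketch below is an assumption for illustration, not a description of Profiler: it uses cosine similarity, and the case labels and expression values are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, stored_profiles):
    """Return (case_id, similarity) for the stored profile closest to the query."""
    scored = [(case_id, cosine_similarity(query, profile))
              for case_id, profile in stored_profiles.items()]
    return max(scored, key=lambda item: item[1])


# Hypothetical stored profiles: expression levels for the same three genes.
stored_profiles = {
    "case_1": [2.0, 0.5, 1.0],
    "case_2": [0.5, 2.0, 1.5],
}
new_patient = [1.8, 0.6, 1.1]

case_id, similarity = most_similar(new_patient, stored_profiles)
print(case_id, round(similarity, 3))
```

Once the closest earlier cases are found, their recorded outcomes and therapies are what a clinician would actually review; the similarity score only decides which cases to show.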

The Profiler is already delivering on its early promise. "We just completed a study where we evaluated over 20,000 tissue microarray sample images collected over the past two-and-a-half years," Dr. Rubin says. "The system allowed us to do what we refer to as multiplex analysis, where we can look for combinations of markers that we’ve already profiled to see which ones work best in combination." As a result, Dr. Rubin’s group found that two prostate cancer markers, both described previously, are much more predictive in combination than had been recognized. At CAP TODAY press time, an article detailing some of this work was scheduled to be published in Nature.
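
The idea behind multiplex analysis, scoring combinations of markers rather than single markers, can be sketched in a few lines. This is an illustrative assumption, not the method Dr. Rubin's group used: the marker names, the sample data, and the scoring rule (predict progression only when every marker in the combination is positive) are all invented.

```python
from itertools import combinations

MARKERS = ("marker_A", "marker_B", "marker_C")  # hypothetical marker names

# Each sample: (marker positivity, whether the cancer progressed). Made-up data.
SAMPLES = [
    ({"marker_A": 1, "marker_B": 1, "marker_C": 0}, True),
    ({"marker_A": 1, "marker_B": 0, "marker_C": 1}, False),
    ({"marker_A": 0, "marker_B": 1, "marker_C": 1}, False),
    ({"marker_A": 1, "marker_B": 1, "marker_C": 1}, True),
    ({"marker_A": 0, "marker_B": 0, "marker_C": 0}, False),
    ({"marker_A": 1, "marker_B": 0, "marker_C": 0}, False),
]


def accuracy(combo, samples):
    """Fraction of samples where 'all markers in combo positive' matches outcome."""
    hits = sum(
        all(status[name] for name in combo) == progressed
        for status, progressed in samples
    )
    return hits / len(samples)


def rank_combinations(markers, samples, size):
    """Score every combination of the given size, best first."""
    scored = [(combo, accuracy(combo, samples))
              for combo in combinations(markers, size)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for size in (1, 2):
    best_combo, best_score = rank_combinations(MARKERS, SAMPLES, size)[0]
    print(size, best_combo, round(best_score, 2))
```

In this toy data no single marker predicts every outcome, while one pair does, which is the pattern the study reports in general terms. Real analyses would use survival statistics and far more samples than six.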

Making things mesh

Integration presupposes a certain uniformity of data definitions, tables, and types across different sources. This consistency does not exist, however, and will be achieved only by developing and promulgating standards for use by researchers, clinicians, and others in this field.

Working toward this end is the Microarray Gene Expression Data Society, an international organization that was created some time ago to develop and disseminate standards for sharing gene microarray data.⁴ More recently, a group working under the auspices of the Association for Pathology Informatics, or API, has embarked on a similar standards project for tissue microarray data. That effort is proceeding with the support of the National Cancer Institute, which has helped finance some of the group’s convening activities.⁵

"Our objective," says Mary Edgerton, MD, PhD, head of the API research committee, "is to establish standards so that when tissue microarray data is published, it can be placed in a general repository and accessed by other researchers, either to combine in a larger meta-analysis or to re-examine by other statistical means in the same way that gene expression array data is analyzed by different researchers and across different organizations." Dr. Edgerton is director of the Molecular Profiling and Data Mining Shared Resource, Vanderbilt-Ingram Cancer Center, Departments of Pathology and Biomedical Informatics, at the Vanderbilt University Medical Center.

As it stands, when researchers publish new findings in this area, they also typically post their data on a Web site for review by their colleagues. If other researchers want to download the data to conduct their own review and analysis, a significant amount of data remapping is often required to import the data into the new database. Data remapping can be a painstaking process that may require determining what kind of data reside in which fields, how the data are labeled, what annotation was used and what it indicates, and what measurements were employed.

"Right now, for example, when we pull data over from a cancer data registry to use in a clinical followup study, we often find we have to remap the original clinical and pathological staging into whatever staging regimen we’re using," says Dr. Edgerton. "Or we may have to remap ICD-9 codes, which change regularly, or map older histology data into a newer schema."

Remapping can lead to errors and slow the pace of scientific collaboration and communication. Although those who do the remapping are typically proficient in computer science or programming, they may have to rely on clinical colleagues for their biomedical expertise, which slows the process further. And even when the process works smoothly, deciding what data go where often calls for some level of subjective judgment. Moreover, remapping is not always entirely successful, and important data occasionally may be lost or subsumed under more general categories than appropriate.
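
A stripped-down picture of remapping shows where the effort and the losses come from. The sketch below is hypothetical throughout: the field names, the two lookup tables, and the stage labels are invented and do not correspond to any real registry or staging system.

```python
# Hypothetical mapping from a source registry's field names to a local schema.
FIELD_MAP = {"pt_id": "patient_id", "path_stage": "stage"}

# Invented stage labels: two finer source categories collapse into one target.
STAGE_MAP = {"A1": "I", "A2": "I", "B": "II"}


def remap_record(record, field_map=FIELD_MAP, stage_map=STAGE_MAP):
    """Return (remapped record, list of source items that could not be mapped)."""
    remapped = {}
    unmapped = []
    for source_field, value in record.items():
        target_field = field_map.get(source_field)
        if target_field is None:
            # No home for this field in the target schema; report it, don't drop it silently.
            unmapped.append(source_field)
            continue
        if target_field == "stage":
            if value not in stage_map:
                unmapped.append(f"{source_field}={value}")
                continue
            value = stage_map[value]
        remapped[target_field] = value
    return remapped, unmapped


source_record = {"pt_id": "0001", "path_stage": "A2", "grade_note": "7"}
remapped, unmapped = remap_record(source_record)
print(remapped)
print(unmapped)
```

Two of the problems described above are visible even here: the distinction between A1 and A2 disappears once both become stage I, and the grade note has nowhere to go. Someone with clinical knowledge still has to decide whether either loss matters.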

Well-developed, widely implemented standards have the potential to alleviate much of this problem. "Were the data to exist in a standard structure," says Dr. Edgerton, "other researchers could download it knowing that there was sufficient commonality in how fields were defined, labeled, and used, that this high level of remapping effort could be minimized, and that the extensive reprogramming we are often forced to do could be bypassed." It is a step and a structure that is not only necessary in its own right, says Dr. Edgerton, but is part of a larger effort to achieve a greater level of similarity in how these data are recorded and shared. "A lot of this information comes through the pathology laboratory," she adds, and because it does, this work is a "key piece of a roadmap toward a more comprehensive standards framework."

At a deeper level

But even as we build better information technology tools for postgenomics pathology, deeper questions remain. Some reflect the sheer scale of the problem, but others have more to do with subtler issues of interpretation. How will we decipher the meaning of the flood of data rolling our way?

"The volume of information that can be generated is potentially so huge that a human being simply can’t make sense of it," says Walter H. Henricks, MD, director of laboratory information services at the Cleveland Clinic Foundation. "We’re going to have much more information than we can understand the relevance of, and I think it’s clear we will need a new level of decision support, filtering, and automated analysis to do that."

Today, says Dr. Henricks, though the information demands placed on clinicians are large, they are familiar. "Now, when we look at a tumor under a microscope to make an interpretation, the data are on the slide," he says. While that will still be true in the future, there will be much more data that are not on the slide, data from gene expression arrays, tissue microarrays, and other sources. To be useful, these data will have to be interpreted. But will it be possible for any human, no matter how well trained or insightful, to interpret them?

The answer may be no.

"We aren’t going to be able to keep track of and interpret everything ourselves on an individual or semiautomated basis," says Dr. Henricks. "There will be so many data points that they will need to be handled in some sort of automated or semiautomated way, proactively, upstream, before they reach us."

Notes

1. For more information, see www.molpath.net or contact Dr. Hubbard at rhubbard@molpath.net or Ken Billings at kbillings@molpath.net.

2. See http://rubinlab.tch.harvard.edu/.

3. Dr. Rubin can be reached at marubin@partners.org.

4. See http://www.mged.org/Workgroups/MIAME/miame_1.1.html for an overview and examples of standards in development by the Microarray Gene Expression Data Society; the standards under development by the API group are similar in some respects.

5. See http://medicine.osu.edu/Informatics/Api/index.htm for information on the API. The API standards group has received R13 funding from the NCI. R13 is an NCI funding category designed to further research by supporting recipient-sponsored and -directed international, national, or regional meetings, conferences, and workshops.

Eric Skjei is a writer in Stinson Beach, California.