DNA deluge
October 2002 Eric Skjei
Finding one small variation in one genetic sequence and developing
ways to fight its often lethal consequences is the great biomedical quest of
our age. But this quest cannot succeed without better tools—better software,
hardware, and information systems, better processes and strategies, better digital
DNA.
"This is an exciting time for pathology, trying to assimilate all this
additional information that goes beyond what’s under the microscope," says
Arul Chinnaiyan, MD, PhD, assistant professor of pathology and urology at the
University of Michigan Medical School, Ann Arbor. "We now have the chance
to try to encompass the entire expressed genome or proteome, and doing so is
going to add so much additional information to our work that it is going to
be very powerful."
But as pathologists begin to look at the molecular profile of, say, a tumor,
rather than rely solely on the morphology viewable under the microscope, "we
will find we have a lot more information to deal with," Dr. Chinnaiyan
says.
The sheer volume of information generated may overwhelm our ability to understand
it and apply its insights. We will need better standards, faster databases,
improved image-viewing tools, more flexible and simpler reporting applications,
and more intelligent decision-support software.
"We need tools that allow all of us involved in this work to talk to each
other, exchange data in a common format, and deposit data in publicly available
sites," says Kenneth Hillan, MD, "so that when new research findings
are published, anyone can go to the appropriate site, access that data and analyze
it themselves, then draw their own conclusions." Dr. Hillan is senior director
of pathology at Genentech and professor of molecular and therapeutic pathology
at the University of Leeds, England.
The good news is that we can build a faster and smarter informatics infrastructure,
and much of that work is underway.
An LIS for the future
Working toward this end is Molecular Pathology Laboratory Network,
Maryville, Tennessee.1
"We’ve been in operation in east Tennessee for 13 years," says Roger
Hubbard, PhD, CEO of the network. "During that period we’ve grown dramatically,
to the point that we now have a national presence as well as several international
clients."
The network, which focuses on infectious disease testing, molecular oncology,
and mutation analysis, determined over time that mainstream laboratory information
systems couldn’t meet its needs. "We found that some of the types of information
that we need to store could not be handled by a conventional LIS," says
Dr. Hubbard, "information such as genetic expression data, DNA sequences,
and so on." So the company decided to build an LIS from scratch to retrieve
and manage test data from the sophisticated assays that will emerge in a postgenomic
era. Work on the new LIS began late last year, and as the company made progress,
Dr. Hubbard says, it became increasingly excited about the system’s larger potential.
"We built it to solve our own needs," he explains, "but in the
course of doing so, we realized that it might have a broader market than just
for use within our own labs, so we are now thinking of commercializing it."
That broader market comprises the same entities now targeted by LIS companies—hospital
labs, reference labs, private companies, and academic research centers.
The system has been designed to handle all types of laboratory tests. "Our
thinking is that this is a tool that can be used in a molecular lab, but it
should be able to manage conventional tests as well," Dr. Hubbard says.
"We made it sophisticated enough for DNA but flexible enough to store and
retrieve chemistry data, so it’s really a multifunctional platform."
From a technology standpoint, says CIO Kenneth Billings, the new system is based
on "a very fast, object-oriented Caché database that is easily scalable
and interacts easily with other technologies, like XML, Java, and active server
pages." It is also completely Web-enabled. "Every user of the program
is a Web user, even in-house," he says. "It doesn’t matter if users
are on Windows boxes or Apple products or something else. As long as they have
a functioning Web browser, they can use the system."
The LIS incorporates strong image-handling functionality. "We’re actually
storing results images as part of a patient’s record; a user can pull them up
dynamically without a problem," says Billings. "Physicians will be
able to view those images, zoom and browse through them, and analyze them."
Real-time graphing capabilities, including time-series data depicting patient
results over time, are also supported. Adds Dr. Hubbard, "We do a lot of
flow cytometry immunophenotyping for leukemia and many molecular assays, including
FISH [fluorescence in situ hybridization] tests. This application will enable
us to dynamically generate a report that not only incorporates the flow cytometry
data but the dot plots or histograms and the FISH assay results images at very
high speed, dynamically, to any remote computer."
Those who have seen prototypes of the system are enthusiastic, Dr. Hubbard says,
and he predicts a strong reception in the LIS marketplace. The as-yet-unnamed
product was slated to go live this fall within Molecular Pathology Laboratory
Network’s labs, and the commercial launch is tentatively scheduled for next
year.
A focus on clinical ‘actionability’
A different but complementary strategy for guiding the development
of these new tools is to focus on the ultimate user: the physician. The first
step is to ask which new genetic insights and tests are "clinically actionable"—that
is, they can be used immediately by physicians to help their patients. Then
you build the systems and processes to make that possible. One company taking
this approach is Milwaukee-based PointOne, which plans this year to launch a
clinical genetic information service it calls a clinical data-analysis system,
or CDAS. The service initially will focus on thrombophilia.
To zero in on clinically actionable genetic and proteomic insights, PointOne
applies a "so what" test. "We ask ourselves: If a physician has
this information, what can he or she do about it? How will it help the patient?"
says CEO Drew Palin, MD. CDAS can test for any number of genetic abnormalities,
but those conditions for which little can be done, such as Alzheimer’s disease,
are not good candidates.
In this prototype stage, PointOne is working with Wisconsin-based Aurora Healthcare
System and Third Wave Technologies, a biotechnology company in Madison.
The process starts with existing data and attempts to identify patients with
a genetically influenced risk for thrombophilia. "A group of 50 to 75 physicians
have volunteered to participate in this beta stage," Dr. Palin says. "We’re
taking existing claim, clinical, pharmacy, and other data, developing clinical
criteria to identify an at-risk cohort, and then filtering that data to identify
those patients with all the risk factors, those who may have an increased risk
of thrombophilia."
Once people who appear to be at high risk for this disorder are identified through
data analysis and review, the next step is to issue a report to their physicians,
telling them they have patients who may be at risk for a genetically influenced
coagulation disorder. The report also informs the physicians that they can order
a diagnostic test to determine if those patients have a factor II or V single
nucleotide polymorphism, a genetic variation that may predispose them to developing
a deep venous thrombosis. The test is then conducted by Third Wave, which, Dr.
Palin says, has "a very simple, affordable fluorescence assay that can
identify these variations."
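The first two steps described above, filtering existing data for an at-risk cohort and then grouping the flagged patients by physician for reporting, can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the article does not list PointOne's clinical criteria or data layout, so the flag names, the `PatientRecord` type, and both functions are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical risk flags, for illustration only; the article does not list
# the clinical criteria PointOne actually uses.
REQUIRED_FLAGS = frozenset({"flag_a", "flag_b", "flag_c"})


@dataclass(frozen=True)
class PatientRecord:
    patient_id: str
    physician_id: str
    risk_flags: frozenset  # flags derived from claims, clinical, and pharmacy data


def at_risk_cohort(records, required=REQUIRED_FLAGS):
    """Keep only patients whose record carries every required risk flag."""
    return [r for r in records if required <= r.risk_flags]


def reports_by_physician(cohort):
    """Group at-risk patient IDs under the physician who should be notified."""
    reports = {}
    for record in cohort:
        reports.setdefault(record.physician_id, []).append(record.patient_id)
    return reports
```

The subset test (`required <= r.risk_flags`) mirrors the description of selecting "those patients with all the risk factors"; a real system would presumably weigh criteria rather than require every one.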
At this point, the CDAS process is similar to requesting an esoteric test from
a reference lab. "The SNP test and use of the Third Wave technology take
place in a reference lab setting," Dr. Palin says. And in this sense, PointOne
is calling on some of the basic functions—quality assurance, test validation—provided
by laboratorians thousands of times a day in labs nationwide. But in this case,
he adds, interpreting the result thus obtained is emblematic of a new reality—the
need to call on several sources of information to understand the significance
of these results. "It will be increasingly hard for any single clinical
specialty to interpret these tests without having access to the rest of the
clinical data," he says.
Once at-risk patients are tested and their results reported, physicians can
recommend care and treatment. The prototype CDAS project allows PointOne to
create an infrastructure and a process that can be applied to other genetic
tests as they appear. "We’re going to have hundreds of thousands more data
points to work with, and we’re going to have to be able to distill and customize
all that data into something that is relevant for a specific patient,"
says Dr. Palin. "What we are doing now at PointOne is building the information
system to support that process."
PointOne plans to develop two other products within the next year or so. "CDAS
will evolve into what we call a genetic knowledge management system, which will
have the same basic functionality but which will also incorporate links to clinical
content, relevant articles, refereed journals, and so on," Dr. Palin says.
The third product, a genetic research system, will focus on the needs of researchers.
The power of integration
In the quest for these new digital tools, the power of integration
is a recurrent theme. Physicians must be able to bring together data from many
sources. A group of physician researchers at the University of Michigan Medical
School, which has for several years been studying prostate cancer, is finding
that its enhanced information system is proving this point once again and helping
to uncover important clinical insights. In this case, the information technology
strategy has been to develop tools in an ad hoc manner as resources allow, primarily
to meet requirements for higher performance and stronger database functionality.
"We are building a bioinformatics infrastructure that we call Profiler,"
says the University of Michigan’s Dr. Chinnaiyan. The system includes clinical
and pathology information and a growing body of tissue microarray and gene expression
data. It also supports image viewing and is accessible via the Web.2
Although focused on prostate cancer, it is already being adapted to accommodate
work on other cancers.
With the system’s value and functionality having been proven at the University
of Michigan, it will soon be replicated at Brigham and Women’s Hospital, Boston,
under the direction of Mark Rubin, MD, associate professor of pathology. Dr.
Rubin is also director of the Harvard Tissue Microarray Core, a collaboration
of the Dana-Farber/Harvard Cancer Center and the Harvard Center for Neurodegeneration
and Repair Center for Molecular Pathology.3
"We’ve been developing Profiler for gene expression array and tissue microarray
analysis," says Dr. Rubin, "and we’ve made progress in both areas."
The tissue microarray components allow an investigator to review slides associated
with the data. Earlier versions also supported image-viewing capabilities, but
the system has been significantly improved in the past year. "We are now
using an Oracle database and we have cleaned up our database structure,"
Dr. Rubin says. "Our main goal was to make it possible to do what we refer
to as clinical queries, which allow us to look for associations between pathology
data and other clinical parameters, including outcomes analysis."
The system will soon evolve to a point where a clinician can introduce a patient’s
data into it and then compare that person’s gene expression profile to the many
profiles already developed, Dr. Chinnaiyan predicts. The system will offer insight
into probable outcomes, therapies previous patients were exposed to, and more.
It may even help tailor treatment to a patient’s genomic and proteomic profile.
The Profiler is already delivering on its early promise. "We just completed
a study where we evaluated over 20,000 tissue microarray sample images collected
over the past two-and-a-half years," Dr. Rubin says. "The system allowed
us to do what we refer to as multiplex analysis, where we can look for combinations
of markers that we’ve already profiled to see which ones work best in combination."
As a result, Dr. Rubin’s group found that two prostate cancer markers, both
described previously, are much more predictive in combination than had been
recognized. At CAP TODAY press
time, an article detailing some of this work was scheduled to be published in
Nature.
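The idea behind multiplex analysis, checking which already-profiled markers work best together, can be illustrated with a toy Python sketch. This is not the Profiler's method, which the article does not describe: the scoring rule (how often "all markers in the combination are positive" agrees with the recorded outcome), the function names, and the data layout are all assumptions chosen for brevity.

```python
from itertools import combinations


def combo_accuracy(samples, combo):
    """Fraction of samples where 'every marker in combo is positive'
    agrees with the recorded outcome (a deliberately simple score)."""
    hits = sum(
        all(s["markers"][m] for m in combo) == s["outcome"] for s in samples
    )
    return hits / len(samples)


def rank_combinations(samples, marker_names, size=2):
    """Score every marker combination of the given size, best first."""
    scored = [
        (combo_accuracy(samples, combo), combo)
        for combo in combinations(marker_names, size)
    ]
    return sorted(scored, reverse=True)
```

Each sample is assumed to be a dictionary with a `markers` mapping of marker name to a boolean and a boolean `outcome`. A real study would use proper survival statistics and correct for testing many combinations.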
Making things mesh
Integration presupposes a certain uniformity of data definitions, tables, and
types across different sources. This consistency does not exist, however, and
will be achieved only by developing and promulgating standards for use by researchers,
clinicians, and others in this field.
Working toward this end is the Microarray Gene Expression Data Society, an international
organization that was created some time ago to develop and disseminate standards
for sharing gene microarray data.4
More recently, a group working under the auspices of the Association for Pathology
Informatics, or API, has embarked on a similar standards project for tissue
microarray data. That effort is proceeding with the support of the National
Cancer Institute, which has helped finance some of the group’s convening activities.5
"Our objective," says Mary Edgerton, MD, PhD, head of the API research
committee, "is to establish standards so that when tissue microarray data
is published, it can be placed in a general repository and accessed by other
researchers, either to combine in a larger meta-analysis or to re-examine by
other statistical means in the same way that gene expression array data is analyzed
by different researchers and across different organizations." Dr. Edgerton
is director of the Molecular Profiling and Data Mining Shared Resource, Vanderbilt-Ingram
Cancer Center, Departments of Pathology and Biomedical Informatics, at the Vanderbilt
University Medical Center.
As it stands, when researchers publish new findings in this area, they also
typically post their data on a Web site for review by their colleagues. If other
researchers want to download the data to conduct their own review and analysis,
a significant amount of data remapping is often required to import the data
into the new database. Data remapping can be a painstaking process that may
require determining what kind of data reside in which fields, how the data are
labeled, what annotation was used and what it indicates, and what measurements
were employed.
"Right now, for example, when we pull data over from a cancer data registry
to use in a clinical followup study, we often find we have to remap the original
clinical and pathological staging into whatever staging regimen we’re using,"
says Dr. Edgerton. "Or we may have to remap ICD-9 codes, which change regularly,
or map older histology data into a newer schema."
Remapping can lead to errors and slow the pace of scientific collaboration and
communication. Although those who do the remapping are typically proficient
in computer science or programming, they may have to rely on clinical colleagues
for their biomedical expertise, which slows the process further. And even when
the process works smoothly, deciding what data go where often calls for some
level of subjective judgment. Moreover, remapping is not always entirely successful,
and important data occasionally may be lost or subsumed under more general categories
than appropriate.
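A minimal Python sketch shows why remapping is laborious and where data can be lost. The staging codes and the mapping table below are invented placeholders, not real ICD-9 or staging values, and the function names are hypothetical.

```python
# Hypothetical mapping from an older staging scheme to a newer one.
# Two old codes collapse into one new code, so some detail is lost.
OLD_TO_NEW_STAGE = {"old_1": "new_A", "old_2a": "new_B", "old_2b": "new_B"}


def remap_field(records, field, mapping):
    """Translate one field through a mapping table.

    Returns (remapped, unmapped); unmapped records have no entry in the
    table and need review by someone with clinical expertise.
    """
    remapped, unmapped = [], []
    for record in records:
        value = record.get(field)
        if value in mapping:
            remapped.append({**record, field: mapping[value]})
        else:
            unmapped.append(record)
    return remapped, unmapped


def lossy_targets(mapping):
    """Return new codes that absorb more than one old code."""
    sources = {}
    for old, new in mapping.items():
        sources.setdefault(new, []).append(old)
    return {new: olds for new, olds in sources.items() if len(olds) > 1}
```

The `unmapped` list corresponds to the cases that call for subjective judgment, and `lossy_targets` flags the places where distinct values would be subsumed under a more general category.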
Well-developed, widely implemented standards have the potential to alleviate
much of this problem. "Were the data to exist in a standard structure,"
says Dr. Edgerton, "other researchers could download it knowing that there
was sufficient commonality in how fields were defined, labeled, and used, that
this high level of remapping effort could be minimized, and that the extensive
reprogramming we are often forced to do could be bypassed." It is a step
and a structure that is not only necessary in its own right, says Dr. Edgerton,
but is part of a larger effort to achieve a greater level of similarity in how
these data are recorded and shared. "A lot of this information comes through
the pathology laboratory," she adds, and because it does, this work is
a "key piece of a roadmap toward a more comprehensive standards framework."
At a deeper level
But even as we build better information technology tools for postgenomics
pathology, deeper questions remain. Some reflect the sheer scale of the problem,
but others have more to do with subtler issues of interpretation. How will we
decipher the meaning of the flood of data rolling our way?
"The volume of information that can be generated is potentially so huge that
a human being simply can’t make sense of it," says Walter H. Henricks, MD,
director of laboratory information services at the Cleveland Clinic Foundation.
"We’re going to have much more information than we can understand the relevance
of, and I think it’s clear we will need a new level of decision support, filtering,
and automated analysis to do that."
Today, says Dr. Henricks, though the information demands placed on clinicians
are large, they are familiar. "Now, when we look at a tumor under a microscope
to make an interpretation, the data are on the slide," he says. While that will
still be true in the future, there will be much more data that are not on the
slide, data from gene expression arrays, tissue microarrays, and other sources.
To be useful, these data will have to be interpreted. But will it be possible
for any human, no matter how well trained or insightful, to interpret them?
The answer may be no.
"We aren’t going to be able to keep track of and interpret everything ourselves
on an individual or semiautomated basis," says Dr. Henricks. "There will be so
many data points that they will need to be handled in some sort of automated
or semiautomated way, proactively, upstream, before they reach us."
Notes
1. For more information, see www.molpath.net or contact Dr. Hubbard at rhubbard@molpath.net
or Ken Billings at kbillings@molpath.net.
2. See http://rubinlab.tch.harvard.edu/.
3. Dr. Rubin can be reached at marubin@partners.org.
4. See http://www.mged.org/Workgroups/MIAME/miame_1.1.html for an overview and
examples of standards in development by the Microarray Gene Expression Data
Society; the standards under development by the API group are similar in some
respects.
5. See http://medicine.osu.edu/Informatics/Api/index.htm for information on the
API. The API standards group has received R13 funding from the NCI. R13 is an
NCI funding category designed to further research by supporting
recipient-sponsored and -directed international, national, or regional meetings,
conferences, and workshops.
Eric Skjei is a writer in Stinson Beach, California.