The Challenge of Information
Some 265 years ago, the Swedish taxonomist Carolus Linnaeus created a system that revolutionized the study of plants and animals and laid the foundation for much of the work in biology that has been done since. Before Linnaeus weighed in, the living world had seemed a hodge-podge of organisms. Some were clearly related, but it was difficult to see any larger pattern in their separate existences, and many of the details that biologists of the time were accumulating seemed little more than isolated bits of information, unconnected with anything else.
Linnaeus's contribution was a way to organize that information. In his Systema Naturae, first published in 1735, he grouped similar species—all the different types of maple trees, for instance—into a higher category called a genus and lumped similar genera into orders, similar orders into classes, and similar classes into kingdoms. His classification system was rapidly adopted by scientists worldwide and, although it has been modified to reflect changing understandings and interpretations, it remains the basis for classifying all living creatures.
The Linnaean taxonomy transformed biologic science. It provided biologists with a common language for identifying plants and animals. Previously, a species might be designated by a variety of Latin names, and one could not always be sure whether two scientists were describing the same organism or different ones. More important, by arranging biologic knowledge into an orderly system, Linnaeus made it possible for scientists to see patterns, generate hypotheses, and ultimately generate knowledge in a fundamentally novel way. When Charles Darwin published his On the Origin of Species in 1859, a century of Linnaean taxonomy had laid the groundwork that made it possible.
Today, biology faces a situation with many parallels to the one that Linnaeus confronted 2½ centuries ago: biologists are faced with a flood of data that poses as many challenges as it does opportunities, and progress in the biologic sciences will depend in large part on how well that deluge is handled. This time, however, the major issue will not be developing a new taxonomy, although improved ways to organize data would certainly help. Rather, the major issue is that biologists are now accumulating far more data than they have ever had to handle before. That is particularly true in molecular biology, where researchers have been identifying genes, proteins, and related objects at an accelerating pace, and the completion of the human genome sequence will only speed things up. But a number of other fields of biology are experiencing their own data explosions. In neuroscience, for instance, an abundance of novel imaging techniques has given researchers a tremendous amount of new information about brain structure and function.
Normally, one might not expect that having too many data would be considered a problem. After all, data provide the foundation on which scientific knowledge is constructed, and the usual concern voiced by scientists is that they have too few data, not too many. But if data are to be useful, they must be in a form that researchers can work with and make sense of, and this can become harder to do as the amount grows.
Data should be easily accessible, for instance; if there are too many, it can be difficult to maintain access to them. Data should be organized in such a way that a scientist working on a particular problem can pluck the data of interest from a larger body of information, much of it not relevant to the task at hand; the more data there are, the harder it is to organize them. Data should be arranged so that the relationships among them are simple to understand and so that one can readily see how individual details fit into a larger picture; this becomes more demanding as the amount and variety of data grow. Data should be framed in a common language so that there is a minimum of confusion among scientists who deal with them; as information burgeons in a number of fields at once, it is difficult to keep the language consistent among them. Consistency is a particularly difficult problem when a data set is being analyzed, annotated, or curated at multiple sites or institutions, or even by a single well-trained individual working at different times. Even when analyses are automated to produce objective, consistent results, different versions of the software may yield different results. Queries on a data set may then yield different answers on different days, even when they are ostensibly based on the same primary data. In short, how well data are turned into knowledge depends on how they are gathered, organized, managed, and exhibited, and those tasks grow increasingly arduous as the data increase.
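The point about software versions can be made concrete with a small sketch. It is only an illustration, not a method described at the workshop: the record identifiers, labels, version strings, and the `Annotation` and `query` names are all hypothetical. It shows one way a database might store the version of the analysis software alongside each result, so that a query pinned to a version returns the same answer on different days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    record_id: str     # hypothetical identifier of a primary data record
    label: str         # result assigned by an automated analysis
    tool_version: str  # version of the software that produced the label


def query(annotations, record_id, tool_version):
    """Return the labels for one record, restricted to one software version."""
    return sorted(
        a.label
        for a in annotations
        if a.record_id == record_id and a.tool_version == tool_version
    )


# Illustrative data: the same primary record annotated by two software versions.
annotations = [
    Annotation("seq-001", "kinase", "1.0"),
    Annotation("seq-001", "kinase-like", "1.1"),
    Annotation("seq-002", "transporter", "1.0"),
]

print(query(annotations, "seq-001", "1.0"))
print(query(annotations, "seq-001", "1.1"))
```

Without the version field, the two results for `seq-001` would be indistinguishable, and the answer to a query would depend on which analysis happened to run last.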
The form of the data that modern biologists must deal with is dramatically different from what Linnaeus knew. Then—and, indeed, at any point up until the last few decades—most scientific information was kept in “hard” format: written records, articles in scientific journals, books, artifacts, and various sorts of images, eventually including photographs, x-ray pictures, and CT scans. The information content changed with new discoveries and interpretations, but the form of the information was stable and well understood. Today, in biology and a number of other fields, the form is changing. Instead of the traditional ink on paper, an increasingly large percentage of scientific information is generated, stored, and distributed electronically, including data from experiments, analyses and manipulations of the data, a variety of images both real and computer-generated, and even the articles in which researchers describe their findings.
AN EXPLOSION OF DATABASES
Much of this electronic information is warehoused in large, specialized databases maintained by individuals, companies, academic departments in universities, and federal agencies. Some of the databases are available via the Internet to any scientist who wishes to use them; others are proprietary or simply not accessible online. Over the last decade, these databases have grown spectacularly in number, in variety, and in size. A recent database directory listed 500 databases just in molecular biology—and that included only publicly available databases. Many companies maintain proprietary databases for the use of their own researchers.
Most of the databases are specialized: they contain only one type of data. Some are literature databases that make the contents of scientific journals available over the Internet. Others are genome databases, which register the genes of particular species—human, mouse, fruit fly, and so on—as they are discovered, with a variety of information about the genes. Still others contain images of the brain and other body parts, details about the working of various cells, information on specific diseases, and many other subsets of biologic and medical knowledge.
Databases have grown in popularity so quickly in part because they are so much more efficient than the traditional means of recording and propagating scientific information. A biologist can gather more information in 30 minutes of sitting at a computer and logging in to databases than in a day or two of visiting libraries and talking to colleagues. But the more important reason for their popularity is that they provide data in a form that scientists can work with. The information in a scientific paper is intended only for viewing, but the data in a database have the potential to be downloaded, manipulated, analyzed, annotated, and combined with data from other databases. In short, databases can be far more than repositories: they can serve as tools for creating new knowledge.
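As a rough illustration of what "combined with data from other databases" can mean in practice, the sketch below joins entries from two toy collections on a shared identifier. Everything in it is assumed for the example: the gene identifiers, the field names, and the `combine` function are hypothetical and do not correspond to any real database or interface.

```python
def combine(genes, literature):
    """Attach literature references to gene entries that share an identifier."""
    combined = {}
    for gene_id, info in genes.items():
        # Genes with no matching literature entry get an empty list of papers.
        combined[gene_id] = {**info, "papers": literature.get(gene_id, [])}
    return combined


# Illustrative stand-ins for a genome database and a literature database.
genes = {
    "geneA": {"species": "mouse"},
    "geneB": {"species": "fruit fly"},
}
literature = {
    "geneA": ["paper-1", "paper-2"],
}

merged = combine(genes, literature)
print(merged["geneA"])
print(merged["geneB"])
```

The join works only because both collections use the same identifier for the same gene, which is the practical reason a common language across databases matters.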
A WORKSHOP IN BIOINFORMATICS
For that reason, databases hold the key to how well biologists deal with the flood of information in which they now find themselves awash. Getting control of the data and putting them to work will start with getting control of the databases. With that in mind, on February 16, 2000, the National Research Council's Board on Biology held a workshop titled “Bioinformatics: Converting Data to Knowledge.” Bioinformatics is the emerging field that deals with the application of computers to the collection, organization, analysis, manipulation, presentation, and sharing of biologic data. A central component of bioinformatics is the study of the best ways to design and operate biologic databases. This is in contrast with the field of computational biology, where specific research questions are the primary focus.
At the workshop, 15 experts spoke on various aspects of bioinformatics, identifying some of the most important issues raised by the current flood of biologic data. The pages that follow summarize and synthesize the workshop's proceedings, both the presentations of the speakers and the discussions that followed them. Like the workshop itself, this report is not intended to offer answers as much as to pose questions and to point to subjects that deserve more attention.
The stakes are high—and not only for biologic researchers. “Our knowledge is not just of philosophic interest,” said Gio Wiederhold, of the Computer Science department at Stanford University. “A major motivation is that we are able to use this knowledge to help humanity lead healthy lives.” If the data now being accumulated are put to good use, the likely rewards will include improved diagnostic techniques, better treatments, and novel drugs—all generated faster and more economically than would otherwise be possible.
The challenges are correspondingly formidable. Biologists and their bioinformatics colleagues are in terra incognita. On the computer science side, handling the tremendous amount of data and putting them in a form that is useful to researchers will demand new tools and new strategies. On the biology side, making the most of the data will demand new techniques and new ways of thinking. And there is not a lot of time to get it right. In the time it takes to read this sentence, another discovery will have been made and another few million bytes of information will have been poured into biologic databases somewhere, adding to the challenge of converting all those data into knowledge.