Barriers to the Use of Databases
If researchers are to turn the data accumulating in biologic databases into useful knowledge, they must first be able to access the data and work with them, but this is not always as easy as it might seem. The form in which data have been entered into a database is critical, as is the structure of the database itself, yet there are few standards for how databases should be constructed. Most databases have sprung up willy-nilly in response to the special needs of particular groups of scientists, often with little regard to broader issues of access and compatibility. This situation seriously limits the usefulness of the biologic information that is being poured into databases at such a prodigious rate.
The most basic barrier to putting databases to use is that many of them are unavailable to most researchers. Some are proprietary databases assembled by private companies; others are collections that belong to academic researchers or university departments and have never been put online. “The vast majority of databases are not actually accessible through the Internet right now,” said Peter Karp, director of the Bioinformatics Research Group at SRI International in Menlo Park, California. If a database cannot be searched online, few researchers will take advantage of it even if, in theory, the information in it is publicly available. And even the hundreds of databases that can be accessed via the Internet are not necessarily easy to put to work. The barriers come in a number of forms.
One problem is simply finding relevant data in a sea of information, Karp said. “If there are 500 databases out there, at least, how do we know which ones to go to, to answer a question of interest?” Fortunately for biologists, some locator help is available, noted Douglas Brutlag, professor of biochemistry and medicine at Stanford University. A variety of database lists are available, such as the one published in the Nucleic Acids Research supplemental edition each January, and researchers will find the large national and international databases (such as NCBI, EBI, DDBJ, and SWISS-PROT) to be good places to start their search. “They often have pointers to where the databases are,” Brutlag noted. Relevant data will more than likely come from a number of different databases, he added. “To do a complete search, you need to know probably several databases. Just handling one isn't sufficient to answer a biologic question.” The reason lies in the growing integration of biology, Karp said. “Many databases are organized around a single type of experimental data, be it nucleotide-sequence data or protein-structure data, yet many questions of interest can be answered only by integrating across multiple databases, by combining information from many sources.”
The potential of such integration is perhaps the most intriguing thing about the growth of biologic databases. Integration holds the promise of fundamentally transforming how biologic research is done, allowing researchers to synthesize information and make connections among many types of experiments in ways that have never before been possible; but it also poses the most difficult challenge to those who develop and use the databases. “The problem,” Karp explained, “is that interaction with a collection of databases should be as seamless as interaction with any single member of the collection. We would like users to be able to browse a whole collection of databases or to submit complex queries and analytic computations to a whole collection of databases as easily as they can now for a single database.” But integrating databases in this way has proved exceptionally difficult because the databases are so different.
“We have many disciplines, many subfields,” said Gio Wiederhold, of Stanford University's Computer Science Department, “and they are autonomous—and must remain autonomous—to set their own standards of quality and make progress in their own areas. We can't do without that heterogeneity.” At the same time, however, “the heterogeneity that we find in all the sources inhibits integration.” The result is what computer scientists call “the interoperability problem,” which is actually not a single difficulty, but rather a group of related problems that arise when researchers attempt to work with multiple databases. More generally, the problem arises when different kinds of software are to be used in an integrated manner.
The simplest yet most unyielding difficulty is that biologists in different specialties tend to speak somewhat different languages. They use jargon and terminology peculiar to their own subfields, and they have their own particular theories and models underlying the collection of data. “We get major terminologic problems,” Wiederhold said, “because the terms used in one field will have different granularity depending on the level at which the abstractions or concepts in that field work and will have different scope, so a term taken in a different context often has a somewhat different meaning. The simple solution is that we will make everybody speak the same language. That, however, requires a degree of stability that we cannot expect in any technology and certainly not in bioinformatics. The fields are moving rapidly—new terms will develop, meanings of terms will change—so we will have to deal with the difference in terminology and recognize that there are differences and be careful with precision.”
Besides the differing terminologies, someone who wishes to work across many databases must also deal with differences in how the various collections structure their data. “There are many protein databases out there,” Karp said, “and each one chooses to conceptualize or represent proteins in its schema in a different way. So someone who wants to issue a query to 10 protein databases has to examine each database to figure out how it encodes a protein, what information it encodes, what field names it uses, and what units of measurement it uses. There are also different data models: object-data models versus relational-data models versus ad hoc, invented-by-the-database-author data models.” Daniel Gardner, of Cornell University, added, “it is interfaces, not uniformity, that can provide interoperability: interfaces for data exchange and data-format description, interfaces to recognize data-model intersections, to exchange metadata and to parse queries.”
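The kind of interface Gardner describes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two source records, their field names, the units, and the `Protein` class are invented and do not correspond to any real database. Each source keeps its own representation, and a small adapter per source translates it into a shared one.

```python
from dataclasses import dataclass

@dataclass
class Protein:
    """Common representation used by the hypothetical interface layer."""
    name: str
    mass_da: float  # molecular mass in daltons

# Two invented source records describing the same protein with
# different field names and different units of measurement.
record_a = {"protein_name": "protein_x", "mol_weight_kda": 102.0}
record_b = {"id": "protein_x", "mass": 102000.0}  # already in daltons

def from_source_a(rec):
    # Source A reports mass in kilodaltons, so convert to daltons.
    return Protein(name=rec["protein_name"],
                   mass_da=rec["mol_weight_kda"] * 1000.0)

def from_source_b(rec):
    # Source B already uses daltons; only the field names differ.
    return Protein(name=rec["id"], mass_da=rec["mass"])

# One adapter per source; the sources themselves stay unchanged.
ADAPTERS = {"A": from_source_a, "B": from_source_b}

def normalize(source, rec):
    """Translate a source-specific record into the common form."""
    return ADAPTERS[source](rec)
```

Once both records pass through `normalize`, they can be compared or combined directly, and neither source has had to change its own schema.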
Wiederhold continued, “Another very important issue is the heterogeneity in user expertise. Addressing complex queries to large collections of databases requires significant sophistication in the user who is going to create a query of that form. The vast majority of users simply do not have that expertise today.”
None of those issues is new, and for a number of years bioinformatics specialists have been devising ways to improve interoperability. Beginning in 1994, Karp organized a series of workshops on interconnecting molecular-biology databases. Those workshops stimulated the development of a number of practical software tools. “I am pleased to report,” Karp said, “that over the last 5 years there really has been some significant progress in building a software infrastructure for database interoperation, which we can liken to building the Internet. Just as the Internet connects a diverse set of geographically distributed locations, we have seen growth in a software infrastructure for connecting molecular-biology databases.”
Bioinformatics specialists have developed two broad approaches to integrating databases, each with its strengths and weaknesses. The first, which Karp referred to as the warehousing approach, combines a large number of individual databases in a single computer and lets outside users submit queries to that collection of databases. An example is the Sequence Retrieval System (SRS), which contains 133 databases and is available through the European Bioinformatics Institute (EBI). The SRS treats all the files in all the databases as text files and indexes the databases by keywords within the files and by record names in each of the fields within each database. People using the system search for relevant files by keyword and by record name. “The main advantage of the text warehousing approach,” Karp said, “is that users can essentially use point-and-click. You enter a set of keywords and you get back lots and lots of records that match those keywords. Point-and-click is the major advantage of this approach because it is easy for people to use, but it is also the major disadvantage because it can take so long to evaluate complex queries.”
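The keyword indexing behind a text warehouse can be sketched in a few lines of Python. This is an illustration of the general technique (an inverted index over text records), not a description of how the SRS is actually implemented. The record identifiers and contents are invented.

```python
from collections import defaultdict
import re

# Invented records standing in for flat-file entries drawn from
# several different databases held in one warehouse.
RECORDS = {
    "pathdb:P1": "glycolysis pathway enzymes hexokinase",
    "genedb:G1": "gene hxk hexokinase chromosome position",
    "protdb:R1": "hexokinase protein structure",
}

def build_index(records):
    """Map each keyword to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(record_id)
    return index

def search(index, *keywords):
    """Return the ids of records that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*sets)) if sets else []
```

A search returns only a list of matching records. The index does not record how a pathway record relates to a gene record, so a user must follow those links by hand, which is the limitation Karp describes.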
Suppose, for example, that someone wished to find examples of sets of genes that were clustered tightly on a single chromosome and that specified enzymes that worked within a single metabolic pathway. The search would demand the comparison of two types of information: on the location of genes and on the metabolic pathways that particular enzymes play a role in. To perform that search in the SRS, Karp said, “we might enter a keyword like pathway and get back the names of every pathway and every pathway database within the SRS. To answer this query and to find linked genes in a single metabolic pathway, we would have to point-and-click through hundreds of pathway records, follow each pathway to its enzymes, and follow each enzyme to its genes. We would have a case of repetitive-stress injury by the time we were finished.”
The second system for integrating databases is the multidatabase approach, which takes a query from a user, distributes the query via the Internet to a set of many databases, and then collects and displays the results. Examples of that approach are the Kleisli/K2 system developed by Chris Overton and colleagues at the University of Pennsylvania, the OPM system developed by Victor Markowitz at Gene Logic, and the TAMBIS system (which is built on Kleisli) developed by Andy Brass and Carole Goble at the University of Manchester. Because the individual databases maintain their structures instead of being treated as collections of text files, searches can be much more powerful and exact in this type of system than in the warehousing system. After the user formulates a question, a query processor transforms the question into individual queries sent to whichever of the various member databases might have information relevant to the original question. Later, the query engine receives and integrates the results of the individual queries and returns the results to the user. “For instance,” Karp said, “in our pathway-and-gene example, the query processor might farm out individual queries across the Internet, combine the results, formulate more queries for genome databases, and then combine the results. The main advantage of the multidatabase and warehousing approaches is that they are high-throughput approaches. They allow us to process complex queries that might access tens of databases and thousands or tens of thousands of objects to perform interesting system-level analyses of large amounts of data. [For text-based warehousing], the point-and-click approach will never do that.”
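The decompose, distribute, and integrate cycle can be sketched for the pathway-and-gene example. This is a hypothetical Python illustration and does not reproduce Kleisli/K2, OPM, or TAMBIS. The two "member databases" are local functions with invented data, whereas a real system would send these subqueries over the network. The `max_gap` threshold for calling genes "clustered" is also an assumption.

```python
# Hypothetical stand-ins for two remote member databases.
def query_pathway_db(pathway):
    """Return enzyme -> encoding gene for a pathway (invented data)."""
    data = {"pathway_1": {"enzyme_a": "gene_a",
                          "enzyme_b": "gene_b",
                          "enzyme_c": "gene_c"}}
    return data.get(pathway, {})

def query_genome_db(genes):
    """Return gene -> (chromosome, position in base pairs) (invented data)."""
    data = {"gene_a": ("chr1", 1000),
            "gene_b": ("chr1", 2500),
            "gene_c": ("chr2", 400)}
    return {g: data[g] for g in genes if g in data}

def clustered_genes(pathway, max_gap=5000):
    """Decompose the question, query each source, integrate the results."""
    enzyme_to_gene = query_pathway_db(pathway)            # first subquery
    locations = query_genome_db(enzyme_to_gene.values())  # built from its result
    # Group the located genes by chromosome.
    by_chrom = {}
    for gene, (chrom, pos) in locations.items():
        by_chrom.setdefault(chrom, []).append((pos, gene))
    # Within each chromosome, collect runs of neighbouring genes.
    clusters = []
    for chrom, hits in by_chrom.items():
        hits.sort()
        run = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - run[-1][0] <= max_gap:
                run.append(hit)
            else:
                if len(run) > 1:
                    clusters.append((chrom, [g for _, g in run]))
                run = [hit]
        if len(run) > 1:
            clusters.append((chrom, [g for _, g in run]))
    return clusters
```

Because each source returns structured fields rather than text, the second subquery can be built directly from the result of the first. The cost is that whoever writes such a query must know what each source returns.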
In contrast, although anyone can point-and-click with little training, the preparation of the complex queries for a multidatabase system demands much greater expertise. “The majority of the multidatabase systems force their users to learn some complex query language,” Karp said. “They also force their users to learn a lot about the schemas of each database they want to query. Some graphical query interfaces are available, but they tend to be fairly primitive. More work is needed in this direction.”
In short, the good news is that systems do exist to allow researchers to search 100 or more databases simultaneously. The bad news is that it is still difficult for anyone but database experts to perform the sorts of complex searches that are most valuable to researchers. And further bad news, Karp said, is that many of the existing databases cannot be integrated into such interoperation systems, because they do not have the necessary structure. “Many individual databases have not been constructed with any kind of database-management system. They are simply text files created with a text editor. Many have no defined ontology or schema, so it is difficult to tell what data are in them and what the different fields mean. Most are not organized according to any standard data model, and many of these flat files have an irregular structure that is very hard to parse. They often have inconsistent semantics.”
The take-home message, Karp said, is that databases should be constructed with an eye to interoperability, but, so far, most are not. “Unfortunately, database expertise is very much lacking in the vast majority of bioinformatics database projects. In general, these projects have been lacking in the discipline to use database-management systems, to use more standardized data models, and to come up with a more-regular syntax.” The result is that only a minority of all biology databases are available over the Internet for the interoperation engines to use. In the future, advances in tools designed for the World Wide Web, in combination with advances in databases and in other forms of software, are likely to make more biology data more easily available. This will involve progress in multiple components of computer science and attention to the specific interests of biologists as data generators and users. It will also require biologists to present their needs in ways that excite database experts and other computer scientists to overcome the expertise scarcity noted by Karp.