Maintaining the Integrity of Databases
Databases can contain billions of bytes of information, so it is inevitable that some errors will creep in. Nonetheless, the researchers who work with databases would like to keep the errors to a minimum. Error prevention and correction must be integral parts of building and maintaining databases.
The reasons for wanting to minimize errors are straightforward, said Chris Overton, director of the Center for Bioinformatics at the University of Pennsylvania. “I work on genome annotation, which, broadly speaking, is the analysis and management of genomic data to predict and archive various kinds of biologic features, particularly genes, biologic signals, sequence characteristics, and gene products.” Presented with a gene of unknown function, a gene annotator will look for other genes with similar sequences to try to predict what the new gene does. “What we would like to end up with,” Overton explained, “is a report about a genomic sequence that has various kinds of data attached to it, such as experimental data, gene predictions, and similarity to sequences from various databases. To do something like this, you have to have some trusted data source.” Otherwise, the researchers who rely on the genome annotation could pursue false trails or come to false conclusions.
Other fields have equally strong reasons for wanting the data in databases to be as accurate as possible. It is generally impractical or impossible for researchers using the data to check their accuracy themselves: if the data on which they base their studies are wrong, results of the studies will most likely be wrong, too.
To prevent errors, Overton commented, it is necessary first to know how and why they appear. Some errors are entry errors. The experimentalists who generate the data and enter them into a database can make mistakes in their entries, or curators who transfer data into the database from journal articles and other sources can reproduce them incorrectly. It is also possible for the original sources to contain errors that are incorporated into the database. Other errors are analysis errors. Much of what databases include is not original data from experiments, but information that is derived in some way from the original data, such as predictions of a protein's function on the basis of its structure. “The thing that is really going to get us,” Overton said, “is genome annotation, which is built on predictions. We have already seen people taking predictions and running with them in ways that perhaps they shouldn't. They start out with some piece of genomic sequence, and from that they predict a gene with some sort of ab initio gene-prediction program. Then they predict the protein that should be produced by that gene, and then they want to go on and predict the function of that predicted protein.” Errors can be introduced at any of those steps.
Once an error has made it into a database, it can easily be propagated, not only around the original database but into any number of other systems. “Computational analysis will propagate errors, as will transformation and integration of data from various public data resources,” Overton said. “People are starting to worry about this problem. Data can be introduced in one database and then just spread out, like some kind of virus, to all the other databases out there.” And as databases become more closely integrated, the problem will only get worse.
Because Overton's group is involved with database integration, taking information from a number of databases and combining it in useful ways, it has been forced to find ways to detect and fix as many errors as possible in the databases that it accesses. For example, it has developed a method for correcting errors in the data that it retrieves from GenBank, the central repository for gene sequences produced by researchers in the United States and around the world. “Using a rule base that included a set of syntactic rules written as grammar,” Overton said, “we went through all the GenBank entries for eukaryotic genes and came up with a compact representation of the syntactic rules that describe eukaryotic genes.” If a GenBank entry was not “grammatical” according to this set of syntactic rules, the system would recognize that there must be an error and often could fix it.
“That part was easy,” he said. “The part that got hard was when we had to come up with something like 20 pages of expert-system rules to describe all the variations having to deal with the semantic variability that was in GenBank. At the moment, the system, which we have been working on for a relatively long time, can recognize and correct errors in about 60–70% of GenBank entries. Another 30–40% we end up having to repair by hand. Unfortunately, although dealing with feature-table information in GenBank is relatively easy (the information is highly structured, and you can write down rules that capture the relationships between all the features), that is certainly not true for a lot of the other biologic data we are looking at, and we do not have any way to generalize these error-detection protocols for other kinds of data that are out there.” In short, even in the best case, it is not easy to correct errors that appear in databases; and in many cases, there is no good way to cleanse the data of mistakes.
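The syntactic half of the approach Overton describes can be illustrated with a small sketch. The grammar below is a toy assumption made for illustration (an optional 5' UTR, then exons alternating with introns, then an optional 3' UTR), not the rule base his group built, and the names `Feature`, `SYMBOLS`, `GENE_GRAMMAR`, and `check_gene` are hypothetical. It shows only how an entry's feature table can be read as a "sentence" and rejected when it is not grammatical.

```python
import re
from typing import List, NamedTuple


class Feature(NamedTuple):
    kind: str   # toy vocabulary: "5UTR", "exon", "intron", "3UTR"
    start: int  # 1-based inclusive start coordinate
    end: int    # 1-based inclusive end coordinate


# Toy grammar (an assumption, not the real rule base):
#   gene := [5UTR] exon (intron exon)* [3UTR]
SYMBOLS = {"5UTR": "u", "exon": "e", "intron": "i", "3UTR": "t"}
GENE_GRAMMAR = re.compile(r"u?e(ie)*t?")


def check_gene(features: List[Feature]) -> List[str]:
    """Return a list of problems; an empty list means the entry is 'grammatical'."""
    ordered = sorted(features, key=lambda f: f.start)

    unknown = [f.kind for f in ordered if f.kind not in SYMBOLS]
    if unknown:
        return [f"unknown feature kinds: {unknown}"]

    problems = []
    # Encode the feature order as a string and match it against the grammar.
    sentence = "".join(SYMBOLS[f.kind] for f in ordered)
    if not GENE_GRAMMAR.fullmatch(sentence):
        problems.append(f"feature order {sentence!r} does not match the gene grammar")

    # Neighbouring features must be contiguous: no gaps and no overlaps.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start != prev.end + 1:
            problems.append(
                f"{prev.kind} ending at {prev.end} is not adjacent to "
                f"{cur.kind} starting at {cur.start}"
            )
    return problems


# Invented coordinates, for illustration only.
good_entry = [Feature("exon", 1, 100), Feature("intron", 101, 250), Feature("exon", 251, 400)]
bad_entry = [Feature("exon", 1, 100), Feature("exon", 251, 400)]  # intron record missing

print(check_gene(good_entry))
print(check_gene(bad_entry))
```

A check of this kind can only flag structural problems. The semantic variability that required the 20 pages of expert-system rules is outside what a grammar like this can express.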
On the basis of his group's experience with detecting and correcting errors, Overton offered a number of lessons. The first and simplest is that it is best not to let the errors get into the database in the first place. “Quality control and quality assurance should be done at the point of entry, that is, when the data are first entered into a database. We shouldn't have had to run some tool like this. It should have been run at GenBank at the time the information was entered. That would have been a way to clear up a lot of the errors that go in the database.” Supporting this comment was Michael Cherry of Stanford, who stated that “everything we do has to be right. The quality control has to be built into the design.”
THE IMPORTANCE OF TRAINED CURATORS AND ANNOTATORS
A related piece of advice was that the best people to enter the data are not the researchers who have generated them, but rather trained curators. “At GenBank, data are entered by the biologists who determine a sequence. They are not trained annotators; but when they deposit the nucleic acid sequence in GenBank, they are required to add various other information beyond the sequence data. They enter metadata, and they enter features, which are the equivalent of annotations. That is why we get a lot of errors in the database. Most of the people involved in this process come to the same conclusion: that trained annotators give generally higher quality and uniformity of data than do scientists. So one goal would be to just get the biologist out of the loop of entering the data.”
Once errors have crept into a database, Overton said, there is likely to be no easy way to remove them. “Many of the primary databases are not set up to accept feedback. When we find errors in, say, GenBank, there is nobody to tell.” Without a system in place to correct mistakes, those who operate a database have a difficult time learning about errors and making corrections. “Furthermore,” Overton said, “we would get no credit for it even if we did supply that kind of information.” Scientists are rewarded for generating original data but not for cleaning up someone else's data. “The other part of the problem is that most of these databases do a very poor job of notifying you when something has changed. It is extremely difficult to go to GenBank and figure out whether something has changed.” So even if an error is discovered and corrected in a database, anyone who has already downloaded the erroneous data is unlikely to find out about the mistake.
One way to ameliorate many of the problems with errors in databases, Overton said, is to keep detailed records about the data: where they came from, when they were deposited, how one gets more information about them, and any other relevant details concerning their pedigree. The approach is called “data provenance,” and it is particularly applicable to minimizing errors in data that propagate through various databases.
“The idea of data provenance,” Overton said, “is that databases should describe the evidence for each piece of data, whether experimental or predicted, so you should know where the data came from. Furthermore, you should be able to track the source of information. In our own internal databases, we track not only the history of everything as it changes over time, but all the evidence for every piece of data. If they change, we have a historical record of it.”
“This is an important and extremely difficult problem,” Overton said. “There is no general solution for it at the moment.”
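One way to picture the record keeping that data provenance calls for is a data structure that never overwrites a value and always stores the evidence alongside it. The following is a minimal sketch under that assumption. The class and field names (`Evidence`, `Version`, `TrackedDatum`) and the example values are hypothetical, and the sketch does not describe the schema of Overton's internal databases.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Evidence:
    kind: str         # "experimental" or "predicted"
    source: str       # where the value came from (database, paper, program)
    detail: str = ""  # free-text pointer to more information


@dataclass(frozen=True)
class Version:
    value: str
    evidence: Evidence
    recorded_at: datetime


@dataclass
class TrackedDatum:
    name: str
    history: List[Version] = field(default_factory=list)

    def update(self, value: str, evidence: Evidence,
               when: Optional[datetime] = None) -> None:
        """Append a new version; earlier versions are kept, never overwritten."""
        when = when or datetime.now(timezone.utc)
        self.history.append(Version(value, evidence, when))

    @property
    def current(self) -> Version:
        # Raises IndexError if no value has been recorded yet.
        return self.history[-1]

    def is_predicted(self) -> bool:
        """True if the current value rests on a prediction, not an experiment."""
        return self.current.evidence.kind == "predicted"


# Invented example values, for illustration only.
datum = TrackedDatum("gene_X.function")
datum.update("kinase", Evidence("predicted", "similarity search"))
datum.update("hexokinase", Evidence("experimental", "enzyme assay"))

for version in datum.history:
    print(version.value, version.evidence.kind, version.evidence.source)
```

Because each version carries its own evidence, a downstream user can tell a prediction from a measurement and can see what a value used to be. A structure like this does nothing for data that arrive from other databases without provenance, which is the harder problem.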
At Knowledge Bus, Inc., in Hanover, Maryland, Bill Andersen has a different approach to dealing with records. Knowledge Bus is developing databases that incorporate ontologic theories—theories about the nature of and relationships among the various types of objects in a database. In other words, the databases “know” a good deal about the nature of the data that they contain and what to expect from them, so they can identify various errors simply by noting that the data do not perform as postulated.
For example, one of the thousands of axioms that make up the Knowledge Bus ontology describes what to expect from two closely related reactions, one of which is a specialized version of the other, such as glucose phosphorylation and glucose phosphorylation catalyzed by the enzyme hexokinase. Those two reactions are identical except that the second proceeds with the help of an enzyme. “The rule,” Andersen explained, “just explains that if the normal glucose phosphorylation has a certain free energy, then the one catalyzed by hexokinase will have the same free energy.” (Free energy, a concept from thermodynamics, is related to the amount of work performed, or able to be performed, by a system.)
Suppose that the system pulls in, perhaps from one or more databases on the Internet, experimentally determined values for the free energy of glucose phosphorylation and of glucose phosphorylation catalyzed by hexokinase, and suppose further that they do not agree. The system immediately recognizes that something is wrong and, equally important, has a good starting point for figuring out exactly what is wrong and why. “What went wrong had a lot to do with the theory you built the database with,” Andersen said. “Either the constraints are wrong, the ontology is wrong, or the data are wrong. In any case, we can use the violated constraint to provide an explanation. Here is our starting point. This is what went wrong. The idea that I want to get across is that once we have got hold of that proof [that an error has occurred], we can start to look at the information. All the proof told us is that, according to our model, the database is wrong. But how? Was the information input reliable? Was another class of mistake made? What can we tell from examining the provenance of the information? Maybe we believe this piece more than that piece because of where they came from, and we can resolve it that way.”
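The free-energy axiom can be sketched as a simple constraint check. This is an illustration only, not the Knowledge Bus system: the `SPECIALIZES` table, the `Measurement` and `Violation` classes, the tolerance, and the numeric values are all assumptions, and the numbers are placeholders rather than measured free energies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Measurement:
    reaction: str
    free_energy_kj_mol: float
    source: str  # provenance: which database or paper supplied the value


@dataclass(frozen=True)
class Violation:
    constraint: str
    general: Measurement
    specialized: Measurement
    difference: float


# Toy ontology fragment: specialized reaction -> the general reaction it refines.
SPECIALIZES: Dict[str, str] = {
    "glucose phosphorylation (hexokinase)": "glucose phosphorylation",
}


def check_free_energy(measurements: List[Measurement],
                      tolerance: float = 0.5) -> List[Violation]:
    """Flag every pair in which a catalyzed reaction and its uncatalyzed
    parent are reported with different free energies."""
    by_reaction: Dict[str, List[Measurement]] = {}
    for m in measurements:
        by_reaction.setdefault(m.reaction, []).append(m)

    violations = []
    for special, general in SPECIALIZES.items():
        for s in by_reaction.get(special, []):
            for g in by_reaction.get(general, []):
                diff = abs(s.free_energy_kj_mol - g.free_energy_kj_mol)
                if diff > tolerance:
                    violations.append(
                        Violation("catalysis does not change free energy", g, s, diff)
                    )
    return violations


# Placeholder numbers chosen only to trigger the constraint; not real data.
sample = [
    Measurement("glucose phosphorylation", -10.0, "database A"),
    Measurement("glucose phosphorylation (hexokinase)", -14.0, "database B"),
]

for v in check_free_energy(sample):
    print(f"{v.constraint}: {v.general.source} vs {v.specialized.source}, "
          f"difference {v.difference:.1f} kJ/mol")
```

Each violation keeps both measurements and their sources, so a curator has the starting point Andersen describes: the violated constraint plus the provenance needed to decide which value to trust.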
The ontology is combined with extensive metadata (data on the data) so curators can quickly learn where data came from and what their potential weaknesses are. “Combining the annotations and the proof,” Andersen said, “we can start reasoning about the errors that have appeared. We can use these facilities to provide tools to guide human curators to the sources of error so that they can fix them rapidly.” Once an error has been identified, a curator can use the information about it to decide what to do. “You can remove the conflicting data, or you can simply take the formula and put an annotation on it, say that it is in conflict and we don't know what to do with it. Mark it and go on. That is also possible to do. You don't have to eliminate the conflict from the database.”
Either way, Andersen said, by having a database that can identify inconsistencies in data and give information about how and why they are inconsistent, curators should be able to deal with errors much more effectively than is possible with current database configurations.
As databases become increasingly widespread, more and more people will find that data about them appear in databases. The data might have been gathered as part of an experiment or might represent information collected by doctors during normal medical care of patients; they could include genetic information, medical histories, and other personal details. But whatever their form, warned Stanford's Gio Wiederhold, those who work with databases must be careful to respect the privacy and the concerns of the people whose data appear in them.
“You have to be very careful about how people will feel about your knowledge about them,” he said. Detailed medical information is a sensitive subject, but genetic information may well be even touchier. Genetic data can be used for paternity testing, for detecting the presence of genetic diseases, and eventually for predicting a person's physical and psychologic propensities. “Privacy is very hard to formalize and doesn't quite follow the scientific paradigm that we are used to. That doesn't mean that it is not real to people; perceptions count here. I request that scientists be very sensitive to these kinds of perceptions, make every possible effort to recognize the problems that they entail, and avoid the backlash that can easily occur if privacy is violated and science is seen in a negative light.”
There are also a number of practical issues in preserving privacy, Wiederhold noted, such as the possibility of unethical use of genetic information by insurance companies. Methods for protecting privacy have not kept pace with the increasing use of shared databases.
“In our work, we are always collaborating,” Wiederhold said, “but the technical means that we have today for guarding information come from commerce or from the military and are quite inadequate for protecting collaboration.” In those other fields, the first line of defense has been to control access and to keep all but a select few out of a database altogether. That won't work in research: “We have to give our collaborators access.”
Those who run databases that contain sensitive information will therefore need to find different approaches to protecting privacy. “We have to log and monitor what gets taken out. It might also be necessary to ensure that some types of information go out only to those who are properly authorized,” he said, noting the well-reported case of a person who logged onto an Internet music site and, instead of downloading a music track, downloaded the credit-card numbers of hundreds of thousands of the site's customers. “They obviously were not checking what people were taking out. The customer had legitimate access, but he took out what he shouldn't have taken out.”
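The combination of logging and filtering that Wiederhold calls for can be sketched as a thin layer between a query result and the person who requested it. The roles, field names, and policy table below are hypothetical, and the in-memory `AUDIT_TRAIL` stands in for whatever append-only log a real system would use.

```python
import logging
from typing import Dict, List, Set

logger = logging.getLogger("egress_audit")

# Hypothetical policy: the fields each role is authorized to take out.
ALLOWED_FIELDS: Dict[str, Set[str]] = {
    "collaborator": {"sample_id", "sequence"},
    "clinician": {"sample_id", "sequence", "medical_history"},
}

# In-memory stand-in for a persistent audit log.
AUDIT_TRAIL: List[dict] = []


def release(records: List[dict], requested_fields: List[str],
            user: str, role: str) -> List[dict]:
    """Filter outgoing records to the fields the role may see, and log
    what was released and what was withheld."""
    allowed = ALLOWED_FIELDS.get(role, set())  # unknown roles get nothing
    granted = [f for f in requested_fields if f in allowed]
    denied = [f for f in requested_fields if f not in allowed]

    released = [{f: r[f] for f in granted if f in r} for r in records]

    entry = {"user": user, "role": role, "rows": len(released),
             "granted": granted, "denied": denied}
    AUDIT_TRAIL.append(entry)
    logger.info("release %s", entry)
    return released


# Invented records, for illustration only.
RECORDS = [
    {"sample_id": "S1", "sequence": "ACGT", "medical_history": "(withheld example)"},
]

out = release(RECORDS, ["sequence", "medical_history"],
              user="alice", role="collaborator")
print(out)
print(AUDIT_TRAIL[-1]["denied"])
```

The point of the design is that access to the database and permission to remove a given field are checked separately: a legitimate collaborator can query the data, but every release is recorded and unauthorized fields are dropped before anything leaves.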
Wiederhold concluded: “Unless we start logging the information that is taken out, and perhaps also filtering, we will not be fulfilling our responsibilities.”