Incorporating Statistical Expertise into Data Analysis Software
AT&T Bell Laboratories
Nearly 10 years have passed since initial efforts to put statistical expertise into data analysis software were reported. It is fair to say that the ambitious goals articulated then have not yet been realized. The short history of such efforts is reviewed here with a view toward identifying what went wrong. Since the need to encode statistical expertise in software will persist, such a review is a necessary step toward making progress in the future.
Statistical software is guided by statistical practice. The first-generation packages were aimed at data reduction. Batch processing had no competition, and users had little choice. The late 1960s brought an era of challenging assumptions. Robust estimators were introduced “by the thousands” (a slight exaggeration), but these were slow to enter into statistical software, as emphasis was on numerical accuracy. By the late 1970s, diagnostics for and generalizations of classical models were developed, and graphical methods started to gain some legitimacy. Interactive computing was available, but statistical software was slow to capitalize on it. By the late 1980s, computer-intensive methods made huge inroads. Resampling techniques and semi-parametric models freed investigators from the shackles of normal theory and linear models. Dynamic graphical techniques were developed to handle increasingly complex data.
The capabilities and uses of statistical software have progressed smoothly and unerringly from data reduction to data production. Accomplished statisticians argue that alternative model formulations, diagnostics, plots, and so on are necessary for a proper analysis of data. Inexperienced users of statistical software are overwhelmed. Initial efforts to incorporate statistical expertise into software were aimed at helping inexperienced users navigate through the statistical software jungle that had been created. The typical design had the software perform much of the diagnostic checking in the background and report to the user only those checks that had failed, possibly providing direction on what might be done to alleviate the problem.
Not surprisingly, such ideas were not enthusiastically embraced by the statistics community. Few of the criticisms were legitimate; most concerned the supposed impossibility of automating the “art” of data analysis. Statisticians seemed to be drawing a distinction between providing statistical expertise in textbooks and providing it via software. Today the commonly held view is that the latter is no more a threat to one's individual methods and prejudices than is the former.
Given the weak support by peers in the field, and the difficulties inherent with trying to encode expertise into software, some attempts were made to build tools to help those interested in specific statistical topics get started. These tool-building projects were even more ambitious than earlier efforts and hardly got off the ground, in part because existing hardware and software environments were too fragile and unfriendly. But the major factor limiting the number of people using these tools was the recognition that (subject matter) context was hard to ignore and even harder to incorporate into software than the statistical methodology itself. Just how much context is required in an analysis? When is it used? How is it used? The problems in thoughtfully integrating context into software seemed overwhelming.
There was an attempt to finesse the context problem by trying to accommodate rather than integrate context into software. Specifically, the idea was to mimic for the whole analysis what a variable selection procedure does for multiple regression, that is, to provide a multitude of context-free “answers” to choose from. Context guides the ultimate decision about which analysis is appropriate, just as it guides the decision about which variables to use in multiple regression. The separation of the purely algorithmic and the context-dependent aspects of an analysis seems attractive from the point of view of exploiting the relative strengths of computers (brute-force computation) and humans (thinking). Nevertheless, this idea also lacked support and recently died of island fever. (It existed on a workstation that no one used or cared to learn to use.)
So where does smart statistical software stand today? The need for it still exists from the point of view of the naive user, just as it did 10 years ago. But it is doubtful that this need is sufficient to encourage statisticians to get involved; writing books is much easier. There is another need, however, a selfish one, that may be enough to effect increased participation. Specifically, the greatest interest in data analysis has always been in the process itself. The data guide the analysis; they force action and typically change the usual course of an analysis. The effect of this on inferences, the bread and butter of statistics, is hard to characterize but no longer possible to ignore. By encoding into software the statistician's expertise in data analysis, and by directing statisticians' infatuation with resampling methodology, there is now a unique opportunity to study the data analysis process itself. This will allow the operating characteristics of several tests applied in sequence, or even of an entire analysis, to be understood, as opposed to merely the properties of a single test or estimator. This is an exciting prospect.
The time is also right for such endeavors to succeed, as long as initial goals are kept fairly limited in scope. The main advantage now favoring success is the availability of statistical computing environments with the capabilities to support the style of programming required. Previous attempts had all tried to access or otherwise recreate the statistical computing environment from the outside. Keeping within the boundaries of the statistical computing environment eliminates the need to learn a new language or operating system, thereby increasing the chance that developers and potential users will experiment with early prototypes. Both are necessary for the successful incorporation of statistical expertise into data analysis software.
What is Meant by Statistical Expertise?
What do I mean by statistical expertise? Let me recommend How to Solve It by George Polya [Polya, 1957]. It is a beautiful book on applying mathematics to real-world problems. Polya differentiates four steps in the mathematical problem-solving process: (1) understanding the problem, (2) devising a plan to solve the problem, (3) carrying out the plan, and (4) looking back on the method of solution and learning from it.
All four steps are essential in mathematical problem solving. For instance, devising a plan might consist of, say, using induction. Carrying out the plan would be the actual technical steps involved. Having proved a specific fact, one might look back and see it as a special case of something else and then be able to generalize the proof and perhaps solve a broader class of problems.
What kind of expertise do I want to put into software? Many of the steps that Polya outlines are very context dependent. Knowledge of the area in which the work is being done is needed. I am not talking about integrating context into software. That is ultimately going to be important, but it cannot be done yet. The expertise of concern here is that of carrying out the plan, the sequence of steps used once the decision has been made to do, say, a regression analysis or a one-way analysis of variance. Probably the most interesting things statisticians do take place before that.
Statistical expertise needs to be put into software for at least three reasons. The first, of course, is to provide better analyses for non-statisticians: to provide guidance in the use of those techniques that statisticians think are useful. The second is to stimulate the development of better software environments for statisticians. Sometimes statisticians actually have to stoop to analyzing data, and it would be nice to have help in doing some of the things one would like to do but has neither the time nor the graduate students for. The third is to study the data analysis process itself, and that is my motivating interest. Throughout American and even global industry, there is much advocacy of statistical process control and of understanding processes. Yet statisticians have a process they espouse but know almost nothing about: the process of putting together many tiny pieces, the process called data analysis. Encoding these pieces provides a platform from which to study the very process that was invented to tell people what to do, and about which so little is known.
One of the most compelling reasons for resuming efforts to try to infuse guidance into statistical software, and to implement plans, is to better understand the process of analyzing data. But some of this is also motivated by folly. Part of my earlier career dealt with regression diagnostics, which is how to turn a small-sample problem into a large-sample problem. One can increase the size of the data set by orders of magnitude. Just 100 or 200 data points can easily be increased to thousands or tens of thousands. The larger set is highly correlated, of course, and may be reiterating the same information, but it can be produced in great quantity.
In the good old days, there was data reduction. This is what analysis of variance did. What began as a big body of data was reduced to means and standard errors. Today, with all the advances in computing and statistics, the opposite end of the spectrum has been reached. Ever more data is produced, overwhelming the users. Some of that output has to be suppressed.
Who Needs Software with Statistical Expertise?
The audiences for statistical software are many and varied. Infrequent users probably make up the majority of users of statistical software. They want active systems, systems that take control. In other words, they want a black box.
Most professional statisticians are probably frequent users. These users want to be in control, want passive systems that work on command. One might call such a passive system a glass box, indicating that its users can see what it is doing inside and can understand the reasoning that is being used. If such users do not like what they see in the box, they will throw away the answer.
But there is a range of things in between. Problems are inevitable: with users whose needs and wants are so diverse, it is hard to please everyone. When building expertise into software, it must be remembered that good data analysis relies on pattern recognition; consequently, graphics should be heavily integrated into the process. Most of what is seen cannot simply be quantified by a single number. Plots are made in order to see the unexpected.
Limitations to the Incorporation of Statistical Expertise
A lot of the experience that goes into data analysis cannot be easily captured. Moreover, good data analysis relies on the problem context; statistics is not applied in a void. It is applied in biology, as well as in business forecasting. The context is very important, although it is very hard to understand when and where it is important. Related to this is the difficulty of developing the sequence of required steps, the strategy. Finally, implementing that strategy is hard to do. Some hard trade-offs must be made, including engineering decisions that are not really statistically sound: when a simulation is run, crude rules of thumb evolve and get implemented. Such hard engineering decisions are the price of making any progress here.
One guiding principle to follow consistently when incorporating expertise into software, and not merely for statistical software, is this: whenever something is understood
well enough, automate it. Matrix inversion has been automated, because it is believed to be well understood. No one wants to see how a matrix inversion routine works, and so it is automated.
Procedures that are not well understood require user interaction. In all of the systems described below, there are various levels of this automation-interaction trade-off.
Efforts to Build Data Analysis Software
Going back into recent history, there was a system that Bill Gale and I were involved with called REX (circa 1982), an acronym for Regression EXpert [Gale, 1986a]. It was what one might call a front end to a statistics system: the interface called REX sat between the user and the statistical expertise. It was a rule-based interpreter in which the user never talked directly to the statistical expertise, but only through this intermediary. The user would say something such as, “regression Y on X.” That was, in fact, the syntax, and it was approximately all that the user could say. Then, if everything went right (this being, for the most part, textbook data that satisfied a whole battery of tests), REX would come out with a report, including plots, on the laser printer.
If one of the tests failed, REX would say, “I found a problem, and here is how I would suggest you fix it.” After REX offered the suggestion, the user could pick one of five alternatives: have REX implement the suggestion, show a plot, explain why the suggestion was being made, describe alternatives (if any), or, as a last resort, quit.
REX was fairly dogmatic in that respect. If it found a severe problem and if the user refused all the suggestions, it would just say that it was not going to continue. Such intransigence was fine; in an environment where there were no users, it was easy to get away with that. If there had been any users, one can imagine what would have happened: the users would have simply used some other package that would have given them the answers.
What did REX do? It encoded a static plan for simple linear regression. It allowed a non-statistician to get a single good analysis of the regression, and it provided limited advice and explanation. It was an attempt to provide a playground for statisticians to build, study, and calibrate their own statistical plans.
On the subject of building, REX was actually built by a sort of bootstrapping. There was no history available on how to do something like this. Instead, about a half-dozen examples were taken, all fairly small simple regression problems. Your speaker then did many analyses and kept a detailed diary of what was done, and why. By noting when something was done in the analysis sequence, one could study those steps and try to extract what commonality there was. Next, Bill Gale provided an architecture and a language to encode that strategy in the form of a fixed decision tree with if-then rules associated with each node, as shown in Figure 4. Each procedure encoded one strategy based on about a half-dozen examples.
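The flavor of such a plan can be sketched in miniature. The node names, checks, and suggestions below are hypothetical illustrations written in Python; REX's actual rule base was far richer and was written in LISP:

```python
# A minimal sketch of a REX-style plan: a fixed chain of decision-tree nodes,
# each carrying an if-then rule that tests the data and, on failure, offers a
# suggestion. All names, checks, and advice here are invented for illustration.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    name: str
    check: Callable[[list, list], bool]   # returns True if the data pass this test
    suggestion: str = ""                  # advice offered when the check fails
    nxt: Optional["Node"] = None          # the next check in the fixed plan

def run_plan(root, x, y):
    """Walk the plan, recording each check's outcome and any suggestions."""
    report = []
    node = root
    while node is not None:
        status = "ok" if node.check(x, y) else "problem"
        advice = "" if status == "ok" else node.suggestion
        report.append((node.name, status, advice))
        node = node.nxt
    return report

# Two hypothetical rules for a simple regression of y on x.
def enough_data(x, y):
    return len(x) >= 10

def no_constant_x(x, y):
    return max(x) > min(x)

plan = Node("sample size", enough_data, "collect more observations",
            nxt=Node("x varies", no_constant_x,
                     "x is constant; regression is undefined"))

report = run_plan(plan, x=[1, 2, 3], y=[2, 4, 6])
for step, status, advice in report:
    print(step, status, advice)
```

A real system would halt at a severe failure and enter the interaction loop described earlier, rather than simply walking the whole chain.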
It had been hoped that the language provided in REX would be a fertile playground for others to bootstrap their own interests, whether in time series, cluster analysis, or whatever. We knew of no other way to build strategies.
As to “features,” please notice the use of quotation marks. Whenever a software developer mentions a feature, beware. Features are often blunders or inadequacies. In the case of REX, I will even tell you which ones are which.
REX had a variable automate-interact cycle. This was a positive feature: if the data were clean, the user really did not have to interact with the system at all. It was not like a menu system in which one has to walk through the whole menu tree every time. But if the data set was problematic, REX would halt frequently and ask the user for more information so that it could continue.
With REX, the user was insulated from the statistical expertise. That is one of those dubious features. Many users are quite happy never having to see any underlying statistical expertise; others are crying to do something on their own. This is related to a third so-called feature, that REX was in control. If the user wanted a different ordering of the steps--e.g., REX looked at X first and then at Y, and the user wanted to look at Y first and then X--that could not be done. REX was very dogmatic; one could not deviate from the path. Again, a certain class of users would be perfectly happy with that.
Another feature was that the system and language were designed to serve as an
expert system shell that could be adapted for types of analyses other than regression.
Several things were learned from the work on REX. The first was that statisticians wanted more control. There were no actual users, merely statisticians looking over my shoulder to see how it was working. Almost automatically, they reacted negatively: they would not have done it that way. In contrast, the non-statisticians to whom it was shown loved it. They wanted less control. In fact, they did not want a system at all--they wanted answers.
To its credit, some of the people who did like it actually learned from REX. Someone who did not know much statistics or perhaps had had a course 5 or 10 years before could actually learn something. It was almost like an electronic textbook in that once you had an example, it could be an effective learning tool.
The most dismaying discovery of all was that not only did the statisticians around me dislike REX, but they were also not even interested in building their own strategies. This was partly because the environment was a bit deficient, and partly because plan formulation--working through examples and manually extracting the commonality in analyses--is simply hard to do.
REX died shortly thereafter. The underlying operating system kept changing, and it just became too painful to keep it alive. There were bigger and better things to do. It was decided to next attack the second part of the problem, to get more statisticians involved in building expertise into software. As it was known to be a painful task, the desire was to build an expert system building tool, named Student [Gale, 1986b].
I was fairly confident that the only way to end up with strategies and plans for understanding the data analysis process was by recording and analyzing the working of examples. Yet taking the trees of examples, assimilating what was done, and encoding it all into software was still a hard process. So Student was an ambitious attempt to look over the shoulder and be a big brother to the statistician. It was intended to watch the statistician analyzing examples, to capture the commands that were issued, to maintain consistency between what the statistician did for the current example and what had been done on previous examples, and to pinpoint why the statistician was doing something new this time. The statistician would have to say, e.g., “this case was concerned with time series,” or “there was a time component,” and Student would thereby differentiate the cases and then proceed.
The idea was that the statistician would build a system encapsulating his or her expertise simply by working examples, during which this system would encode the sequence of steps used into software. In terms of the architecture, everything resided under the watchful eye of Student, with the statistician analyzing new data within the statistics package. Student watched that person analyze the data and tried to keep him or her honest by saying, “What you are doing now differs from what you did in this previous case; can you distinguish why you did it this way this time and that way that time?” However, for any analysis there is always an analysis path. Student was merely going to fill out the complete tree that may be involved in a more complicated problem.
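The bookkeeping involved can be made concrete with a toy sketch. Everything here (the command names, the matching rule, the class itself) is invented for illustration; Student was a LISP system with a far more elaborate representation:

```python
# A toy sketch of Student-style bookkeeping: store the command sequences from
# previously worked examples, and when the current analysis diverges from every
# stored sequence, ask the analyst to name a distinguishing feature of the new
# case. All command names and the matching rule are hypothetical.

class Student:
    def __init__(self):
        self.cases = []   # list of (distinguishing feature, command sequence)

    def watch(self, feature, commands):
        """Record a worked example together with what made it different."""
        self.cases.append((feature, list(commands)))

    def query(self, commands):
        """Return None if some stored case matches the commands step for step;
        otherwise return the latest step reached before every case diverged."""
        best = None
        for _, past in self.cases:
            k = 0
            while k < min(len(past), len(commands)) and past[k] == commands[k]:
                k += 1
            if k == len(commands):
                return None   # consistent with a previously worked example
            best = k if best is None else max(best, k)
        return best

s = Student()
s.watch("plain cross-sectional data",
        ["plot(x, y)", "regress(y, x)", "residplot()"])

# The analyst now departs from the recorded path at step 1.
step = s.query(["plot(x, y)", "acf(y)"])
if step is not None:
    print(f"Step {step} differs from all previous cases;"
          " what distinguishes this one?")
s.watch("there is a time component", ["plot(x, y)", "acf(y)", "arima(y)"])
```

Once the analyst supplies the distinguishing feature, the new sequence is stored, and the tree of strategies grows one branch at a time.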
The Student system also had features. Statisticians did not have to learn an unfamiliar language or representation, but simply used the statistics package that they ordinarily used, with no intermediate step (in contrast to REX, which had been written in LISP).
“Knowledge engineering” was a buzzword 10 years ago in building such systems. A knowledge engineer was someone who picked the brain of an expert and encoded that knowledge. With Student, the program itself was doing that, and so one only needed to engineer the program once; it would subsequently do all the work for any number of problems.
The system was again designed so that others could modify it, override it, and extend it to other analyses. Basically, that “feature” did not work, and there are two reasons why. One is that the Student system again was written in LISP. It actually ran on a different computer than that which was running the statistical software. With a network between the two, there were numerous logistical problems that got in the way and made things very tedious.
Perhaps the other reason it died was the gradual realization of the problem concerning context. It is simply not sufficient to capture the sequences of commands that a statistician has issued in the analysis of data: context is never captured in keystrokes. There is almost always some “aha!” phase in an analysis, when someone is working at a knot and all of a sudden something pops out. Almost always that “aha!” is related to the context. Beyond the technical troubles in building and debugging a system such as Student was the realization that, in actuality, the wrong problem was being solved. That is what caused the project to be abandoned before progress was ever made.
That did not bring efforts to a complete halt, however. In 1987, a completely different approach was tried next with a system called TESS. The realization that context is important, and that how to incorporate it into strategy was unknown, led to the selection of an end-around approach. Context would be accommodated rather than incorporated. In much the same way that computer chess programs worked, the accommodation would be done by sheer brute-force computation.
Most statisticians have used subset regression or stepwise regression. These programs are simply number crunchers and do not know anything about context. They do not know, for instance, that a variable A is cheaper to collect than variable B or that the two variables are both measuring the same thing. Statisticians think that these regressions are useful, that they help to obtain a qualitative ordering on the variables and thereby perhaps help give clues to which classes of models are interesting. After the subset selection is done and the class of models is considered, context is typically brought in to pick one of the models, after which that chosen model is used to make inferences.
The idea of TESS (Tree-based Environment for developing Statistical Strategy) was to expand that game (of looking at subsets) to the selection of the overall model. For example, in performing a regression with a search-based approach, one defines a space of descriptions D for a class of regression data sets Y, where those data sets are actually ordered pairs, y and x. The goal is to get a hierarchy of the descriptions: since some of the regressions will fit better than others, and some will be more concise than others, one tries to order them. After all possible regression descriptions have been enumerated, the space D is searched for good ones. The user interface of this system was radically different from those of the previous two. The procedure is to tell the computer, “Give me 10 minutes' worth of regressions of y on x.” (Two transformations were involved to re-express variables. Often, samples or data would be split in two, and outliers would be split off.) At the end of 10 minutes, a plot is made for each description in the space. There is a measure of accuracy and also a measure of parsimony, and so a browsing list is started. This can be thought of as a Cp plot: for each element, accuracy is measured by Mallows' Cp, and verbosity is measured by the number of parameters.
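The accuracy-versus-parsimony bookkeeping can be illustrated for ordinary subset regression, using the standard definition Cp = RSS_p / s^2 - n + 2p, where s^2 is the error variance estimated from the full model and p counts parameters including the intercept. The following stdlib-only Python sketch (data and function names invented for illustration) enumerates every subset of candidate predictors and returns a browsing list sorted with the lowest-Cp descriptions first:

```python
# Enumerate all subsets of candidate predictors, fit each by least squares,
# and score each by Mallows' Cp. A browsing list pairs each description's
# accuracy (Cp) with its verbosity (number of parameters).

from itertools import combinations

def solve(a, b):
    """Solve a small linear system a @ beta = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(X, y, cols):
    """Residual sum of squares for the OLS fit of y on an intercept + cols."""
    Z = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(Z[0])
    xtx = [[sum(zi[a] * zi[b] for zi in Z) for b in range(k)] for a in range(k)]
    xty = [sum(zi[a] * yi for zi, yi in zip(Z, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fits = [sum(b * z for b, z in zip(beta, zi)) for zi in Z]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fits))

def cp_browsing_list(X, y):
    n, q = len(y), len(X[0])
    s2 = rss(X, y, list(range(q))) / (n - q - 1)  # full-model error variance
    out = []
    for size in range(q + 1):
        for cols in combinations(range(q), size):
            p = size + 1                          # parameters incl. intercept
            cp = rss(X, y, list(cols)) / s2 - n + 2 * p
            out.append((cols, p, round(cp, 2)))
    return sorted(out, key=lambda t: t[2])        # best (lowest Cp) first

# Tiny synthetic example: y depends on predictor 0 only, plus small noise.
X = [[i, (2 * i) % 5] for i in range(12)]
y = [2.0 * row[0] + 1.0 + (-1) ** i * 0.1 for i, row in enumerate(X)]
browse = cp_browsing_list(X, y)
print(browse)
```

TESS generalized this game well beyond subsets, but the shape of the output is the same: a ranked list that the user, armed with context, can browse.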
For TESS, that concept was generalized. When attempting to instill guidance into software, this overall approach will be important. For TESS, it was necessary to develop a vocabulary of what seemed to be important in describing regression data, i.e., the qualitative features of the data. These were not merely slopes and intercepts, but rather qualitative features closely linked with how data are represented and summarized. To organize all these things, a hierarchy was imposed and then a search procedure devised to traverse this hierarchy. The hope was to find good descriptions first, because this search could in principle continue forever.
TESS had three notable “features”: it was coded from scratch outside of any statistical system; it had a single automate-interact cycle, which permitted context to guide the user in selecting a good description; and once again, there were broad aspirations that the design would stress the environment tools so as to allow others to imitate it.
General Observations on TESS and Student
TESS and Student were very different in how they dealt with the issue of context, but there were several similarities. Both embodied great hopes that they would provide environments in which statisticians could ultimately explore, build, and analyze their own strategies. Ultimately, both used tree-like representations for plans. They both used examples in the plan-building process. And lastly, both are extinct.
Why did others not follow suit? There are a number of reasons. It was asking a lot to expect an individual to develop a strategy for a potentially complex task, and to learn a new language or system in which to implement it. These plans with which Student and TESS were concerned are very different from what is ordinary, usual thinking. A statistician, by training, comes up with a very isolated procedure; for example when one looks for normality, only a test for normality is applied, and it is assumed that everything else is out of the picture. Once all the complications are brought in, it is more than most people can sort out.
The other barrier was always having to learn a new language or system to implement these things. Even if Student had been kept running, there was a problem in how it was designed from the knowledge-engineering point of view: it was always outside the statistical system, so there was always some additional learning needed.
The implication of all of this is the need to aim at smaller and simpler tasks. Also, an overriding issue is to stay within the bounds of the statistical system. There is a great deal of potential here that did not exist 5 years ago. With some modern statistical-computing languages, one can do interesting things within a system and not have to go outside it.
A final type of data-analysis software that possesses statistical expertise is a mini-expert function. These could pervade a statistical package, so that for every statistical function in the language (provided it is a command-type language), many associated mini-expert functions can be defined.
The first example of a mini-expert function is interpret(funcall, options), which provides interpretation of the output returned by the function call; i.e., it post-processes the function call's output. For instance, the regression command regress(x, y) might produce some regression data structure. Applying the interpret function to that output, via the command “interpret(regress(x, y)),” might result in the response “regression is linear but there is evidence of an outlier.” The idea here is that “interpret” is merely a function in the language like any other function, with suitable options.
The next(funcall, options) mini-expert function has much the same syntax. When it is given a function call and options, it will provide a suggestion for the next function call. It could serve as a post-processor to the interpret function, and entering “next(regress(x, y))” might result in the response “try a smooth of y on x since there is curvature in the regression.” Options might include ranking, or memory to store information as to the source of the data that it was given.
Let us give an example of how the user interface would work with mini-expert functions. In languages where the value of the last expression is automatically stored, successively entering the commands “smooth(x, y)” and “interpret(tutorial = T)” would cause the interpreter to interpret the value of that smooth by default; the function call itself would not have to be given explicitly. As a second example, if the value of whatever had been done were assigned to a variable, say “z <- tree(x, y),” a plot of the variable could be made, e.g., “treeplot(z).” Then the variable could be given to the function next via “next(z)” in order to have that mini-expert supply a message suggesting what to do next.
For those who really want a black box, the system can be run in black-box mode. If one enters “regress(x, y)” followed by “next(do.it = T), next(do.it = T), …,” the system will always do whatever was suggested. In this way, the system can accommodate a range of users, from those who want to make choices at every step of the data analysis to those who want a complete black-box expert system.
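One possible shape for such functions can be sketched in Python, purely for illustration: the names regress, interpret, and next_step (standing in for “next,” which collides with a Python builtin), the toy residual diagnostic, and the crude 2-sigma outlier rule are all invented here, not taken from any real package:

```python
# A hypothetical sketch of mini-expert functions: interpret() and next_step()
# are ordinary functions that post-process the record returned by a
# statistical function (here a toy simple regression).

def regress(x, y):
    """Toy simple regression returning a record other functions can read."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return {"call": "regress", "coef": (a, b), "resid": resid}

def interpret(result):
    """Mini-expert: turn a result record into a plain-language message."""
    r = result["resid"]
    scale = (sum(ri ** 2 for ri in r) / len(r)) ** 0.5 or 1.0
    # Crude 2-sigma rule, for illustration only; a real system would do better.
    if any(abs(ri) > 2 * scale for ri in r):
        return "regression is linear but there is evidence of an outlier"
    return "regression looks adequate"

def next_step(result):
    """Mini-expert: suggest the next function call given the last result."""
    if "outlier" in interpret(result):
        return "refit after setting aside the flagged point"
    return "stop"

# The value of the last call is simply handed on to the mini-experts.
z = regress([1, 2, 3, 4, 5, 6], [1.1, 2.0, 9.0, 4.1, 5.0, 6.1])
print(interpret(z))
print(next_step(z))
```

Black-box mode would be a loop that keeps calling next_step and executing whatever it suggests until it returns “stop”; glass-box users would instead read each message and decide for themselves.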
One point to note: these mini-expert functions have not even been designed yet. This is more or less a plan for the future. As to “features,” these functions possess a short automate-interact cycle, the user remains in complete control (but can relinquish it if desired), they are suitable for novices and experts alike, they exist within the statistical system, they are designed for others to imitate, and they force the developer to think hard about what a function returns.
As to the future of efforts to incorporate statistical expertise into software, progress is being made, but it has been very indirect and quite modest. Work in this area may subtly influence designers of statistical packages and languages to change. Effort in this direction parallels the contributions of artificial intelligence research in the areas of interactive computing, windowing, integrated programming environments, new languages, new data structures, and new architectures.
There are some packages available commercially that attempt to incorporate statistical expertise, if one is interested in trying such software. Since the particular systems that were described in this presentation do not exist now, anyone with such an interest would have to look for what is available currently.
Finally, let me mention some interesting work that may stimulate readers who feel that incorporating statistical expertise into data analysis software is important. John Adams, a recent graduate of the University of Minnesota, wrote a thesis directed at understanding the effects of all possible combinations of the things that are done in regression, via a huge designed experiment [Adams, 1990]. The design not only involved crossing all of the various factors, but also involved different types of data configurations, numbers of covariates, correlation structures, and so on. The goal was to learn how all those procedures fit together.
Prior to any incorporation of statistical expertise into data analysis software, that sort of study must first be done. In a way, it amounts to defining a meta-procedure: statistical expertise will not be broadly incorporated into software until that kind of understanding becomes a standard part of the literature.
Adams, J.L., 1990, Evaluating regression strategies, Ph.D. dissertation, Department of Statistics, University of Minnesota, Minneapolis.
Gale, W.A., 1986a, REX review, in Artificial Intelligence and Statistics, W.A. Gale, ed., Addison-Wesley, Menlo Park, Calif.
Gale, W.A., 1986b, Student--Phase 1, in Artificial Intelligence and Statistics, W.A. Gale, ed., Addison-Wesley, Menlo Park, Calif.
Polya, G., 1957, How to Solve It, 2nd edition, Princeton University Press, Princeton, N.J.