TURKAN GARDENIER (Equal Employment Opportunity Commission): Professor Velleman's comments really spoke to the heart. After five or six years of dealing with military simulations for our designed experiments, where I had people run multivariate regression models in two weeks and then analyze the nice multivariate regression model, I'm now in a setting where all of the data is serendipitous.
We held a seminar yesterday in which many attendees asked whether they could assume that a data set from some company was part of the universe of employee records. The statisticians looked at the data and said, “Not really; there are interventions in company records over a period of time.” Because we did not permit that assumption, criticism was expressed that we were not real statisticians: if it could not be assumed that this year's data was a random sample from a population, what were we doing there? Thanks to what Professor Velleman presented today, I am going back and reporting the innovations for serendipitous data that could be applied.
As a statistician working with serendipitous data, let me make another comment. We need lag time, both as data analysts as well as statisticians. I am dealing with a lot of litigation cases as a statistical expert, telling attorneys what data to collect. Some people come to our office with partially collected data; such data is very hard to analyze and have the analyses stand up in court.
Within our organization, we write memoranda of understanding. It's part of a statistician's professional responsibility to work with the right types of data, to make the right assumptions before using a computer, and to do statistical significance tests. Having lag time available ties in with ex post facto collection of the right types of data that could interface with interactive data analysis.
PAUL VELLEMAN: I think you are right. I mentioned that we need to teach about scientific statistics. For years we have taught the mathematical statistics approach as the approach in all introductory statistics courses. As a result, the world is full of people who think that the way one does statistics is to test a hypothesis. Many of our clients know only that much statistics, and this has in effect made our lives more difficult. It will make our future lives easier if we start teaching more about scientific statistics, rather than just hypothesis testing.
CLIFTON BAILEY (Health Care Financing Administration): The HCFA deals with all the Medicare data. I certainly concur regarding serendipitous data. We try to analyze the 6 million persons who have a hospitalization in each year, within a universe of 10 million hospitalizations from the 34 million beneficiaries. The data are provided on bills, and one needs different kinds of tools and techniques to deal with many of the issues in these types of analyses, e.g., the diagnostics.
I like Paul Velleman's example involving the pathway through an analysis. Many times we want to make comparisons, while taking one pathway, about what would have happened if another analysis, another pathway, had been used. When I look at standard statistical packages, I frequently see that they put in a log likelihood, but they do not do it in the same way. Many of them leave out the constants or formulate the likelihood differently and then do not tell how it is done. I want to be able to compare across pathways in that larger perspective, to do exploratory analyses in that context.
SALLY HOWE (National Institute of Standards and Technology): One of the things that you face that other people often don't face is that you have very large data sets. Is that an obstruction to doing your work, that there is not software available for very large data sets?
CLIFTON BAILEY: Yes, the available software is very limited when you get into large data sets, unless you deal with samples. But you want to be able to ask finer questions because you have all of that data, and so you do not want merely to use samples--or you want to use combinations of samples and then look at the subsets, using the baseline against the sample.
There are many complex issues involved in doing that. However, we can put a plot up and look at residuals for our data sets that are generated in the psychology laboratory or in many of the non-serendipitous database contexts. We can look at that on a graph; we can scan down a column in a table. But we need other techniques and ways of doing exploratory analyses with large data sets.
Another agency, the Agency for Health Care Policy Research, is funding interdisciplinary teams to make use of these kinds of data. There are at least a dozen research teams that are focusing on patient outcomes. Every one of those teams is facing this problem, as is our agency, and I am sure that many others are also.
SALLY HOWE: Do you see any additional obstructions that the previous speakers have not yet mentioned?
CLIFTON BAILEY: I recently needed to have a user-provided procedure (or proc) in SAS modified by one of the authors, because the outputs would not handle the large volumes of numbers. When the number of observations went over 150,000, it would not run. Many of the procs get into trouble when there are more than 50,000 observations or some similar constraint.
PAUL TUKEY (Bellcore): This problem, dealing with these very large data sets, is one that more and more people are having to face, and we should be very mindful of it. The very fact that everything is computerized, with network access to other computers and databases available, means that this problem of large data sets is going to arise more and more frequently. Some different statistical and computational tools are needed for it. Random sampling is one approach, but an easy and statistically valid way to do the
random sampling is needed, and that is not always so simple to specify. People like Dan Carr [George Mason University], who is here today, and Wes Nicholson [Battelle Pacific Northwest Laboratory] have explicitly thought about the issue of how to deal with large data sets. There are ways to do it.
When you have a lot of data points, one issue is that you can no longer do convenient interactive computing. It can take five minutes to run through the data set once, doing something absolutely trivial.
There is also a statistical issue involved. When you have an enormous number of observations, you suddenly confront the horrifying fact that statistical significance is not what should be examined, because everything in sight becomes statistically significant. When that is the case, what should replace statistical significance? Other ways of determining what is of practical significance are needed. Perhaps formal calculations of statistical significance can still be used, but interpreted differently, e.g., used only as a benchmark for comparing situations. I completely agree that this issue of large data sets is very important.
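Tukey's point can be made concrete with a hypothetical back-of-the-envelope sketch (the numbers below are invented for illustration, not taken from the discussion): with a million observations, a mean shift of one-hundredth of a standard deviation is overwhelmingly “significant” by a normal-theory test, while the effect size shows it is practically negligible.

```python
import math

# Hypothetical illustration: with enough observations, a negligible
# effect becomes wildly "statistically significant".
n = 1_000_000          # observations
effect = 0.01          # mean shift, in units of the standard deviation
sigma = 1.0

z = effect / (sigma / math.sqrt(n))             # one-sample z statistic
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal-theory p-value

cohens_d = effect / sigma                       # effect size

print(z)            # 10.0 -- enormous test statistic
print(p_two_sided)  # astronomically small p-value
print(cohens_d)     # 0.01 -- practically negligible effect
```

The p-value says “reject”; the effect size says “who cares”. This is the sense in which significance can at best serve as a benchmark for comparing situations, not as the criterion of interest.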
FORREST YOUNG: I too agree with what Paul Velleman is saying and with what Clifton Bailey has brought up. In exploratory methods, the definition of “very large” is a lot smaller than it is in confirmatory methods. To handle 10,000 observations in exploratory methods is really difficult; to handle 10 million in confirmatory methods is really difficult. That is their very nature.
ROBERT HAMER (Virginia Commonwealth University): I agree with Forrest that some exploratory methods do not work well with large data sets. In fact, with sufficiently large data sets, almost any plotting technique will break down in a sense, because the result is a totally black page; every single point is darkened.
PAUL TUKEY: See Dan Carr about that.
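The overplotting problem Hamer describes is what Carr's density-binning work addresses: instead of drawing every point, count points per cell and display the counts. A minimal sketch of the idea, using square bins rather than Carr's hexagons (the data and bin width here are invented for illustration):

```python
import random
from collections import Counter

# Sketch of count-binning for overplotted scatterplots: a hundred
# thousand points would be a solid black page, but the per-cell
# counts are a small summary that preserves the density information.
random.seed(1)
n = 100_000
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

bin_width = 0.5
counts = Counter(
    (int(x // bin_width), int(y // bin_width)) for x, y in points
)

# A few hundred occupied cells now stand in for 100,000 points.
print(len(counts), "occupied cells, summarizing", n, "points")
```

A plotting layer would then map each cell's count to a symbol size or gray level, which is where the hexagon shape earns its keep.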
MIKE GUILFOYLE (University of Pennsylvania): I help people analyze data, and this problem about scale intensiveness is very, very important. Software sophistication has to be considered in light of the specific task in which the users are involved.
Many of these statistical software packages originated in teaching. Later they moved from teaching to research, and now the focus is turning from research to production. This is what Clifton Bailey from HCFA is saying. When large production runs are involved, the clever things performed at the touch of a mouse on small research or educational data sets cannot be done. My large production runs are done with batch processing and involve multiple volumes of tape, a context in which the mouse-click cleverness is lost.
I am surprised that no speakers have yet mentioned Efron's work in jackknifing and bootstrapping [see, e.g., Efron and Tibshirani (1991), or Efron (1988)], which may be useful at least at the research level, to see if the model does or does not work. It is also a computer-intensive process, for which many people do not have the resources. But there
are people thinking about such approaches, and the literature on those approaches might be of interest to statistical software producers.
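The bootstrap idea mentioned here fits in a few lines. The following toy sketch (the data and resample counts are invented for illustration) estimates the standard error of a mean by resampling with replacement, the computer-intensive step the speaker refers to:

```python
import random
import statistics

# Toy bootstrap of the standard error of a mean: resample the data
# with replacement many times and take the spread of the resampled
# statistic as the standard-error estimate.
random.seed(0)
data = [random.gauss(50, 10) for _ in range(200)]

boot_means = [
    statistics.fmean(random.choices(data, k=len(data)))
    for _ in range(2000)
]

se_boot = statistics.stdev(boot_means)
se_formula = statistics.stdev(data) / len(data) ** 0.5

# For the mean the two estimates roughly agree; the bootstrap's value
# is that the same recipe works for statistics with no textbook formula.
print(round(se_boot, 3), round(se_formula, 3))
```

For this statistic the closed-form answer exists, so the example is only a check; the method pays off for medians, ratios, and fitted-model summaries.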
All of the speakers alluded to software packages as black boxes. But beyond that, one needs to think about technology. People who use these clever little packages on workstations become very good at mastering the technology, but they then assume that they understand the underlying processes. However, it is more complicated than that, because those people may or may not understand statistics, and may or may not understand the computing involved. There is a nexus between computing and statistics, but you do not know whether it's a fifty-fifty split. One person may be reasonably competent in statistics and also reasonably competent at mastering the technology, while another is very good at handling the technology but knows nothing about the statistics. I worry that these technological interfaces make things too easy; people can sit there and fake it. It goes beyond the old regime of taking some data, running the regression, and asking if this is the best r² that can be obtained. It is much more complicated.
PAUL TUKEY: Yes, you can fake it with a lot of these packages, and you could fake it with the old packages, too. Part of the salvation might be what Daryl Pregibon will discuss this afternoon, an expert system that does not let you get away with doing something that is grossly inappropriate, or at least forces you to confront it, and will not quietly acquiesce.
In a way, a package cannot force a user to do something responsible. But at least it can responsibly bring things to the user's attention.
RAYMOND JASON (National Public Radio): I am not a degreed statistician; I am a user of statistical software. I have been assigned on occasion to design experiments, carry them out, and analyze them. The focus this morning and for the program this afternoon seems to be on analysis and not on the design of experiments. Of course, a fully designed experiment gives data that are analyzable by any package. In my work, I have found that I was not supported at all by software or its documentation in the design of the experiments. If a goal is to improve the real-world utility of statistical software, experimental design is a necessary responsibility that needs to be addressed.
Two ways of addressing it would be to specify standards for expert systems that do design of experiments, or at the very least to come up with some suitable warning to be applied to documentation. Statistical software is advertised and available to the general public through mail order outlets. You do not have to be a member of the American Statistical Association to be enticed by the software, and you do not have to be wealthy or associated with large companies to buy the software. It is definitely targeted at people such as myself. Yet, I came very close to a major statistical analysis disaster. Only because of work with other experts, certainly not because of the documentation of the
software, was disaster prevented.
BARBARA RYAN: Those comments of Raymond Jason are excellent. I also deal with many people who are not degreed statisticians. There is a real concern here because they are going to do statistical analyses. So unless we help those people, either through education or through software guidelines or other means, they are going to make mistakes.
Regarding his other comment about designing experiments, it is true that most of the software available is for either exploring or analyzing, not to help in designing. There are some attempts with diagnostics to warn when there are problems, but for up-front design, less software has been developed to help.
KEITH MULLER: Design is my favorite topic, but I find it the most fuzzily defined task that I face as a statistician. Therefore, it would be the most difficult to implement in a program package. Experimental design is a collection of heuristics at this stage, rather than an algorithm or a collection of algorithms.
The dimension I identified under the “Audiences” heading in my talk is that of user sophistication. Statistical software ought to try to identify itself with respect to its target audience. That at least would be an attempt to be responsible in distributing and marketing software.
PAUL TUKEY: I also agree that the packages being designed now should have better design modules. People developing the software tend not to be running a lot of designed experiments themselves, and so experimental design tends to be a glaring omission.
Years ago, while I was a graduate student, I did some work with a Russian statistician who was visiting in London. He had just written a book containing an idea that greatly impressed me. His idea was that design and analysis are really one and the same, not two different things, and all should be integrated. He had some very good ideas on the “how to,” very practical ways of designing efficient experiments, not purely abstract mathematical things. We need to build on those kinds of ideas and get those things into our software packages. At least in situations where we have an opportunity to design experiments, we ought to take advantage of it. There are big gains to be had by doing so.
HERBERT EBER (Psychological Resources, Inc.): The answer to the questions of what to do with huge data sets, and with the fact that everything is then significant, has been around for a while. It is called confidence limits, power analysis, and effect size. There are packages available. I am aware of one in progress that will do away with significance
concepts almost completely and talk about confidence intervals. So alternatives do exist.
WILLIAM PAGE (Institute of Medicine): Concerning a previous point, we have not really talked about sampling yet. The way you collect the data has something to do with the way you analyze the data. Either we are putting things in a simple-random-sample box, such as our ANOVAs, or we're putting them in a serendipity box. There ought to be a way to handle something in between. Do you have a complex sample survey? Then you should not be using the ANOVA box.
This may pertain to the guidance issue of this afternoon. The first line produced by the package might ask, “Do you have a simple random sample, yes or no?” If the user punches the help button, then a light flashes on and the output reads, “Do you have money to pay a statistician?”
KEITH MULLER: On the issue of statistical significance, I try to teach my students the distinction between statistical significance and practical or scientific importance, and that the latter is what the user is actually interested in.
RICHARD JONES (Philip Morris Research Center): There is an area of statistical software that is not being addressed here at all. Many scientific instruments in laboratories have built-in software that does regressions, or hypothesis testing. Some researchers, for instance, use these instruments and blindly accept whatever is produced. I recently had an opportunity, fortunately, to catch a chemist taking results from a totally automated analysis. He was using a four-parameter equation that left me incredulous. When I asked why he used that, he said that it was because it gave a correlation coefficient of .998. I said, “Why don't you just do a log transform?” He replied, “Because that correlation coefficient is only .994.” This gentleman was perfectly serious about this. This software built into instruments is in general use. Though not part of the big statistical packages that have been discussed, it is just as important to the scientific community and to their understanding of things.
PAUL TUKEY: We have standards from IEEE for how to do arithmetic computations to ensure that different machines get the same answers. An effort is needed to develop some standard statistical computing algorithms. Some of these things already exist, of course. But specifying the actual code or pseudo-code would allow these algorithms to be rendered in different languages, so that they can certifiably produce the same answers and do the basic building-block statistical kinds of things from which one can build analyses. The major package vendors could adopt these and replace whatever they are doing with software that adheres to the standards, and the people building these instruments could, at least in the future machines, build in some coherence.
PAUL VELLEMAN: I endorse that very strongly. There was a mention of providing test data sets. By comparison, I think that test data sets are inevitably a failure, because the result is programs that get the right answer on the test data sets, but not necessarily
anywhere else. Standard algorithms and implementations of them would compel the same answer across a wide variety of data sets. If a particular implementation or use then fails to give the same answer as other methods, at least the situation is well defined, and so it can be fixed. This is something that should be looked into.
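A textbook example of why pinning down the algorithm matters (this particular example is not one the speakers raised): the two mathematically equivalent formulas for the sample variance can give very different answers in floating point, so two packages both “computing the variance” need not agree.

```python
import statistics

# Two mathematically equivalent sample-variance formulas that differ
# numerically: the naive sum-of-squares form loses everything to
# cancellation when the data sit on a large offset, while Welford's
# one-pass update stays stable.

def var_naive(xs):
    n = len(xs)
    s, ss = sum(xs), sum(x * x for x in xs)
    return (ss - s * s / n) / (n - 1)

def var_welford(xs):
    mean, m2 = 0.0, 0.0
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)

data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]  # large offset, small spread

print(var_naive(data))    # badly wrong: ruined by cancellation
print(var_welford(data))  # 30.0, the correct sample variance
print(var_welford(data) == statistics.variance(data))  # True
```

A standard that specified the stable recurrence (or, per Eddy's point below, the required results) would make such discrepancies well defined and fixable.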
BARBARA RYAN: I think the issue raised by Richard Jones is an issue of education. There are many ways to do statistical analysis. Often the procedures are built into devices, as small packages that were developed in-house. There's a huge amount of such statistical software currently in use that the “professional statistics community” almost ignores. I have a sense that we dismiss it as being so simplistic and superficial that we are not even going to look at it.
The trouble is that there are thousands of people who are using it. Maybe I misunderstood his question, but it is really a matter of what a log transform does to your r². This is fundamentally an issue of education, not of whether you are getting the right or wrong answers. So if a fairly naive user, as far as data analysis is concerned, gets involved in statistics but does not know what he or she is doing, how do you address that misuse of statistics, when such statistical software is so available to everyone?
WILLIAM EDDY: I want to make a comment about standard algorithms versus standard data sets. If you read the IEEE floating-point arithmetic standard 754, you will see that it does not specify how to do the arithmetic. What it specifies is what the results of arithmetic operations will be. Therefore, the standard is actually articulated in terms of standard data sets, rather than in terms of standard algorithms. If we are to emulate such an organization, we have a difficult task ahead of us. Algorithms are the easy way out.
KEITH MULLER: I have a problem with an algorithm standard. Let me give you an example in everyday life with which you are all familiar--headlights on your automobile. In the mid-1930s there were all kinds of bad headlights available. Therefore the United States created an equipment standard that said sealed beam headlights were required. In Europe, the standard used was a performance standard. The equipment standard held back development in the United States of quartz halogen bulbs, which are far superior in performance. It took a revision of the law to permit their use here. Consequently, I would urge us to specify performance standards rather than algorithmic standards.
PAUL GAMES (Department of Human Development and Family Studies, Pennsylvania State University): One of the things that I am most disturbed by in statistics is what some people promote as causal analysis. This is where they take what amounts to correlational data that has been collected in strange ways and, after merely getting regression weights and using what they call “the causal path,” produce fantastic interpretations of the experimental outcomes. Is there anything being done in statistical packages that might induce a little sophistication in those people?
FORREST YOUNG: The recent development along that line is that those kinds of analyses have now been added to the major statistical packages.
KEITH MULLER: This is an issue of philosophy and education. If you don't value such causal analyses, then you should teach your consultees accordingly. That is a statistical issue, and not a statistical computing issue, I would argue.
BARBARA RYAN: There is a question of how much expertise and training we can build into software. People sometimes learn about statistics first from a software package manual. There is a philosophical issue of how much teaching you can provide through software vehicles. Much is being provided by training courses. Many packages offer a lot of training with them, but provided through the software rather than through the university or independent workshops run by institutes for professionals.
It may also be a practical issue. If people keep learning from packages, perhaps the best way will then be to provide more statistical education and guidance through a software vehicle.
PAUL VELLEMAN: I see that more as an opportunity than a problem. The biggest problem in teaching statistics is convincing your students that they really want to know this. When a person already has a statistics package in front of him and wants to understand it, that is the time to teach him. The Minitab student handbook, which was really the first tutorial package-based manual, was therefore an important innovation.
BARBARA RYAN: There is a problem when the people writing the manuals are not statisticians. It takes the educational role away from people who are the real educators. One must find the right balance.
KEITH MULLER: Concerning the issue of large data sets that several members of the audience raised earlier, I presented a paper with some co-authors a few years ago [Muller, et al., 1982] in which we talked about the analysis of “not small” data sets. I suggested there that one needed to take the base 10 logarithm of the number of observations to classify the task that one faced. One could classify small data sets as those involving a hundred observations or fewer, not small as those in the 1,000 to 50,000 or 100,000 range, large as 100,000 to 1 million, and very large as 10 million or more. If we were to classify statistical software according to the size of data set for which it does work, it would help the user because it is obvious that people run into problems and that software does not transport across those soft boundaries.
Also, I neglected to mention a paper by Richard Roistacher on a proposal for an interchange format [Roistacher, 1978]. That appeared a few years ago, and there has been a thundering silence following its appearance.
WILLIAM EDDY: There's been about 15 years of silence following that.
ROBERT TEITEL: At the Interface conference that was held in North Carolina in 1978 or 1979, Roistacher got to the point of having most of the vendors accepting that standard in principle. But then, as I understand it, his agency contract ran out of money, so that didn't go anywhere.
The notion of small or medium or large cannot be done in absolute terms. “Small-medium-large” is relative to the equipment you are using. Many people would consider 10,000 observations on a hundred variables to be enormous, if you are trying to run it on a PC. I like to define “large” as any size data set that you have difficulty handling on the equipment you are using.
CLIFTON BAILEY: If we had analog data on all of the Medicare patients, like that which is collected at bedside, our current data sets would seem very, very small.
DANIEL CARR (George Mason University): I worked on a large data set analysis project years ago. Leo Breiman [University of California at Berkeley] has raised the issue that the complexity of the data is more important than sheer sample size. For instance, if someone said to me, “I have 500 variables, what do I do?”, I would say, “That is not the ‘large’ I like to work with. I like to have hundreds of thousands of observations with only three variables.” Complexity is an issue in the different types of data.
I am very interested in interfaces to large databases. Most of the data that is collected is never seen by humans, and I think that is a tragedy, because a lot of this data is very important. So I think statistical packages need more interfaces to standard government data sets. I went to the extreme of trying to interface GRASS and EST; GRASS is a geographical information system. The neat thing about GRASS is that it already had the tools to read lots of government tapes. Having that as a part of standard statistical packages would be great.
At some point, I believe we are going to have to think differently about how we analyze large data sets. Some things are just too big to keep. In fact, a lot of data is summarized at the census level. So I think statisticians are going to have to think about flow-through analysis at some point. It may take a new generation of people to actually do that.
Another topic brought up was what I call data analysis management. More and more, there is emphasis on quality assurance in analysis. Most of that means proving what you did. It does not mean having to do it right, but at least proving what you did. I think that needs to be built into the software, so that we can keep a record of exactly what we did and can clean up the record. When I do things interactively I keep a log of it, and then I go back and clean out the mistakes and re-run it. But I think we need to have this kind of record, and it needs to be annotated so that it is meaningful later on. A lot of times, I go back to my old analysis, look at it, and ask, “What was I doing?” I often think maybe I made a mistake. Then two days later I realize I really did know what I was doing, so it was right. But I had forgotten in the meantime. So there's a need for this quality assurance feature, data analysis management with annotation, which may include dictation, voice, whatever is easy.
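The kind of annotated analysis log Carr describes might look like the following sketch. The format is hypothetical, not an existing tool, and the steps and notes are invented for illustration:

```python
import datetime
import json

# Hypothetical sketch of data analysis management: each analysis step
# records what was run, when, and a human annotation, so the record
# remains meaningful when revisited months later.

log = []

def record(step: str, note: str = "") -> None:
    log.append({
        "when": datetime.datetime.now().isoformat(timespec="seconds"),
        "step": step,
        "note": note,
    })

# Invented example steps with annotations.
record("read claims file", "raw extract")
record("drop records with missing diagnosis", "small fraction of rows")
record("fit log-linear model", "re-run after fixing a coding error above")

print(json.dumps(log, indent=2))
```

Cleaning up the record then amounts to editing this log before re-running it, which is exactly the interactive-then-replay pattern Carr describes.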
I would also like to see some high-end graphics tools. I am envious of people right now who can make movies readily. It is a very powerful communications medium that ought to be part of statistical software. I would like to see more integration of mathematics tools like Mathematica. Somebody ought to be addressing memory management; eight megabytes on a SPARC workstation may not be enough, depending on the software I am using.
I would like to have more involvement with standards. For example, I am not sure that the standards that are developed are always optimal for statistical analysis. For instance, there are some defaults on the Iris workstations for projection that may not produce exactly the projections I would like for stereo. But at some point they even get built into the hardware, and so I have to program around them. It would be nice if in some areas we could get involved with the standards, both for hardware and software, and that boundary is getting closer all the time.
Efron, Bradley, 1988, Computer-intensive methods in statistical regression, SIAM Review, Vol. 30, No. 3, 421–449.
Efron, Bradley, and Robert Tibshirani, 1991, Statistical data analysis in the computer age, Science, Vol. 253, 390–395.
Muller, K.E., J.C. Smith, and J.S. Bass, 1982, Managing “not small” datasets in a research environment, SUGI '82--Proceedings of the Seventh Annual SAS User's Group International Conference.
Nachtsheim, Christopher J., 1987, Tools for computer-aided design of experiments, Journal of Quality Technology, Vol. 19, No. 3, 132–160.
Roistacher, R.C., 1978, Data interchange file: progress toward design and implementation, in Proceedings of the 11th Symposium on the Interface: Statistics and Computing Science, Interface Foundation, Reston, Va., pp. 274–284.