Executive Summary
Software, a critical core industry that is essential to U.S. interests in science, technology, and defense, is ubiquitous in today's society. Software coexists with hardware in our transportation, communication, financial, and medical systems. As these systems grow in size and complexity and our dependence on them increases, the need to ensure software reliability and safety, fault tolerance, and dependability becomes paramount. Building software is now viewed as an engineering discipline, software engineering, which aims to develop methodologies and procedures to control the whole software development process. Besides the issue of controlling and improving software quality, the issue of improving the productivity of the software development process is also becoming important from the industrial perspective.
PURPOSE AND SCOPE OF THIS STUDY
Although statistical methods have a long history of contributing to improved practices in manufacturing and in traditional areas of science, technology, and medicine, they have up to now had little impact on software development processes. This report attempts to bridge the islands of knowledge and experience between statistics and software engineering by enunciating a new interdisciplinary field: statistical software engineering. It is hoped that the report will help seed the field of statistical software engineering by indicating opportunities for statistical thinking to contribute to increased understanding of software and software production, and thereby enhance the quality and productivity of both.
This report is the result of a study by a panel convened by the Committee on Applied and Theoretical Statistics (CATS), a standing committee of the Board on Mathematical Sciences of the National Research Council, to identify challenges and opportunities in the development and implementation of software involving significant statistical content. In addition to pointing out the relevance of rigorous statistical and probabilistic techniques to pressing software engineering concerns, the panel outlines opportunities for further research in the statistical sciences and their applications to software engineering. The aim is to motivate new researchers from statistics and the mathematical sciences to tackle problems with relevance for software development, as well as to suggest a statistical approach to software engineering concerns that the panel hopes software engineers will find refreshing and stimulating. This report also touches on important issues in training and education for software engineers in the statistical sciences and for statisticians with an interest in software engineering.
Central to this report's theme, and essential to statistical software engineering, is the role of data: wherever data are used or can be generated in the software life cycle, statistical methods can be brought to bear for description, estimation, and prediction. Nevertheless, the major obstacle to applying statistical methods to software engineering is the lack of consistent, high-quality data in the resource-allocation, design, review, implementation, and test stages of software development. Statisticians interested in conducting research in software engineering must play a leadership role in justifying that resources are needed to acquire and maintain high-quality and relevant data.
The panel conjectures that the use of adequate metrics and data of good quality is the primary differentiator between successful, productive software development organizations and those that are struggling. Although the single largest area of overlap between statistics and software engineering currently concerns software development and production, it is the panel's view that the largest contributions of statistics to software engineering will be those affecting the quality and productivity of front-end processes, that is, processes that precede code generation. One of the biggest impacts that the statistical community can make in software engineering is to combine information across software engineering projects as a means of evaluating effects of technology, language, organization, and process.
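Combining information across projects is, in statistical terms, a meta-analysis problem. As an illustrative sketch only, not a method prescribed by this report, the DerSimonian-Laird random-effects estimator pools per-project effect estimates (for example, log defect rates under a new process) while allowing for between-project variation; the function and variable names below are hypothetical.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-project effect estimates using the DerSimonian-Laird
    random-effects model. Returns (pooled estimate, standard error,
    estimated between-project variance tau^2)."""
    w = [1.0 / v for v in variances]                   # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity among the project-level estimates.
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    k = len(effects)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                 # between-project variance
    w_star = [1.0 / (v + tau2) for v in variances]     # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

When the project-level estimates disagree more than their within-project variances can explain, tau^2 rises and the pooled estimate's standard error widens accordingly, which is exactly the caution one wants when generalizing across software projects.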
CONTENTS OF THIS REPORT
Following an introductory opening chapter intended to familiarize readers with basic statistical software engineering concepts and concerns, a case study of the National Aeronautics and Space Administration (NASA) space shuttle flight control software is presented in Chapter 2 to illustrate some of the statistical issues in software engineering. Chapter 3 describes a well-known general software production model and associated statistical issues and approaches. A critique of some current applications of statistics and software engineering is presented in Chapter 4. Chapter 5 discusses a number of statistical challenges arising in software engineering, and the panel's closing summary and conclusions appear in Chapter 6.
STATISTICAL CHALLENGES
In comparison with other engineering disciplines, software engineering is still in the definition stage. Characteristics of established disciplines include having defined, tested, credible methodologies for practice, assessment, and predictability. Software engineering combines application domain knowledge, computer science, statistics, behavioral science, and human factors issues. Statistical challenges in software engineering discussed in this report include the following:
- Generalizing particular statistical software engineering experimental results to other settings and projects,
- Scaling up results obtained in academic studies to industrial settings,
- Combining information across software engineering projects and studies,
- Adopting exploratory data analysis and visualization techniques,
- Educating the software engineering community regarding statistical approaches and data issues,
- Developing methods of analysis to cope with qualitative variables,
- Providing models with the appropriate error distributions for software engineering applications, and
- Enhancing accelerated life testing.
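Several of these challenges, notably choosing appropriate error distributions and extending life-testing methods, begin with a simpler question: is reliability actually improving as testing proceeds? As a minimal, hypothetical sketch (not taken from this report), the classical Laplace trend test answers that question from cumulative failure times alone:

```python
import math

def laplace_trend(failure_times, horizon):
    """Laplace trend test on cumulative failure times observed over
    (0, horizon]. A strongly negative statistic suggests reliability
    growth (failures concentrated early in the interval); a strongly
    positive one suggests deterioration."""
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    # Under a homogeneous Poisson process (no trend), the statistic is
    # approximately standard normal.
    return (mean_time - horizon / 2.0) / (horizon * math.sqrt(1.0 / (12.0 * n)))
```

A statistic near zero is consistent with a constant failure rate; values well outside roughly plus or minus two indicate a trend worth modeling with a nonhomogeneous error distribution.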
SUMMARY AND CONCLUSIONS
In the 1990s, complex hardware-based functionality is being replaced by more flexible, software-based functionality, and massive software systems containing millions of lines of code are being created by many programmers with different backgrounds, training, and skills. The challenge is to build huge, high-quality systems in a cost-effective manner. The panel expects this challenge to preoccupy the field of software engineering for the rest of the decade. Any set of methodologies that can help in this task will be invaluable. More importantly, the use of such methodologies will likely determine the competitive positions of organizations and nations involved in software production. What is needed is a detailed understanding by statisticians of the software engineering process, as well as an appreciation by software engineers of what statisticians can and cannot do.
Catalysts essential for this productive interaction between statisticians and software engineers, and some of the interdisciplinary research opportunities for software engineers and statisticians, include the following:
- A model for statistical research in software engineering that is collaborative in nature. The ideal collaboration partners statisticians, software engineers, and a real software process or product. Barriers to academic reward and recognition, as well as obstacles to the funding of cross-disciplinary research, can be expected to decrease over time; in the interim, industry can play a leadership role in nurturing collaborations between software engineers and statisticians and can reduce its own set of barriers (for instance, those related to proprietary and intellectual property interests).
- A model for data collection and analysis that ensures the availability of high-quality data for statistical approaches to issues in software engineering. Careful attention to data issues, ranging from the definition of metrics to feedback and feed-forward loops, and including exploratory data analysis, statistical modeling, and defect analysis, is essential if statistical methods are to have any appreciable impact on a given software project under study. For this reason it is crucial that the software industry take a lead position in research on statistical software engineering.
- Attention to relevant issues in education. Enormous opportunities and many potential benefits are possible if the software engineering community learns about relevant statistical methods and if statisticians contribute to and cooperate in the education of future software engineers. Some relevant areas include:
  - Designed experiments. Software engineering is inherently experimental, yet relatively few designed experiments have been conducted. Software engineering education programs must stress the desirability, where feasible, of validating new techniques using statistically valid designed experiments.
  - Exploratory data analysis. Exploratory data analysis methods are essentially "model free": the investigator hopes to be surprised by unexpected behavior rather than having thinking constrained to what is expected.
  - Modeling. Advances in the statistical community over the past decade have effectively relaxed the linearity assumptions of nearly all classical techniques. There should be an emphasis on educational information exchange leading to more and wider use of these recently developed techniques.
  - Risk analysis. A paradigm for managing risk for the space shuttle program, discussed in Chapter 2 of this report, and the corresponding statistical methods can play a crucial role in identifying risk-prone parts of software systems and of combined hardware and software systems.
  - Attitude toward assumptions. Software engineers should be aware that the violation of assumptions matters less than a thorough understanding of the violation's effects on conclusions. Statistics textbooks, courses, and consulting activities should convey the statistician's understanding of, and perspective on, the importance and implications of assumptions for statistical inference methods.
  - Visualization. Graphics are important in the exploratory stage, in helping to ascertain how complex a model the data ought to support; in the analysis stage, in which residuals are displayed to examine what the currently entertained model has failed to account for; and in the presentation stage, in which graphics can provide succinct and convincing summaries of the statistical analysis and the associated uncertainty. Visualization can help software engineers cope with, and understand, the huge quantities of data collected as part of the software development process.
  - Tools. It is important to identify good statistical computing tools for software engineers. An overview of statistical computing languages, systems, and packages should be prepared that is focused specifically on the needs of software engineers.
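The emphasis above on designed experiments can be made concrete with a small example. The sketch below is hypothetical and not drawn from this report: it computes main and interaction effects for a 2x2 factorial experiment, say inspection technique crossed with programming language, from the mean response observed at each factor combination.

```python
def factorial_effects(y):
    """Effect estimates for a 2^2 designed experiment.

    y maps (factor_a_level, factor_b_level), each level coded -1 or +1,
    to the mean response under that condition (e.g., defects found per
    KLOC). Each main effect is the average response at the factor's
    high level minus the average at its low level."""
    main_a = sum(a * y[(a, b)] for a in (-1, 1) for b in (-1, 1)) / 2.0
    main_b = sum(b * y[(a, b)] for a in (-1, 1) for b in (-1, 1)) / 2.0
    interaction = sum(a * b * y[(a, b)] for a in (-1, 1) for b in (-1, 1)) / 2.0
    return main_a, main_b, interaction
```

Even this toy analysis illustrates the point of the education bullets: with a designed layout, the effect of each factor, and whether the factors interact, falls out of a few contrasts rather than an uncontrolled comparison of convenience samples.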