I. INTRODUCTION

One of the first problems of national importance that was considered by the Committee on Applied and Theoretical Statistics (CATS) was posed to it by staff members of the Internal Revenue Service (IRS). They were concerned with the lack of appropriate statistical methodologies for certain nonstandard situations that arise in auditing where the distributions appropriate for modeling the data are markedly different from those for which most statistical analyses were designed. The quality of the procedures used in a statistical analysis depends heavily on the probability model or distributions assumed. Because of this, considerable effort over the years has been expended in the development of large classes of standard distributions, along with relevant statistical methodologies, designed to serve as models for a wide range of phenomena. However, there still remain many important problems where the data do not follow any of these more "standard" models. The problem raised by the IRS provides a strikingly simple example of data from a nonstandard distribution for which statistical methodologies have only recently begun to be developed, and for which much additional research is needed. The example is of such national importance, both for government agencies and for business and industry, that it is the primary focus of this report. The potential monetary losses associated with poor statistical practice in this auditing context are exceedingly high. It is the purpose of this report to give a survey of the available statistical methods, to provide an annotated bibliography of the literature on which the survey is based, to summarize important open questions, to present recommendations designed to improve the level and direction of research on these matters and to encourage greater interaction between statisticians and accountants.
This report is primarily directed towards researchers, both in statistics and accounting, and students who wish to become familiar with the important problems and literature associated with statistical auditing. It is hoped that this report will stimulate the needed collaborative research in statistical auditing involving both statisticians and accountants. It is also hoped that practitioners will benefit from the collection of methodologies presented here, and possibly will be able to incorporate some of these ideas into their own work. Although this report is centered upon a particular nonstandard distribution that arises in auditing, the original proposal for this study recognized that this same type of nonstandard model arises in many quite different applications covering almost all other disciplines. Three general areas of application (accounting, medicine and engineering) were initially chosen for consideration by the Panel. Later in this Introduction we list
several examples in order to illustrate the widespread occurrence of similar nonstandard models throughout most areas of knowledge. These examples will, however, primarily reflect the initial areas of emphasis of the Panel. Before describing these examples, however, we briefly discuss the general concept of a mixture of distributions since it appears in the name of the Panel.

Nonstandard Mixtures: The phrase "mixture of distributions" usually refers to a situation in which the j-th of k (taken here to be finite) underlying distributions is chosen with probability p_j, j = 1, ..., k. The selection probabilities are usually unknown and the number of underlying distributions k may be fixed or random. The special case of two underlying distributions is an important classical problem which encompasses this report's particular problem in which, with probability p, a specified constant is observed while, with probability 1 - p, one observes a random measurement whose distribution has a density function. That is, it is a mixture of a degenerate distribution and an absolutely continuous one. There are many examples of probability models that are best described as mixtures of two or more other models in the above sense. For example, a probability model for the heights of 16 year olds would probably best be described as the mixture of two unimodal distributions, one representing the model for the heights of girls and one for the boys. Karl Pearson in 1894 was possibly the first to study formally the case of a mixture of two distributions; in this case they were two normal distributions, thereby providing one possible mixture model for the above example of heights. Following this, there were few if any notable studies until the paper of Robbins and Pitman (1949) in which general mixtures of chi-square distributions were derived as probability models for quadratic forms of normal random variables. Since then, there have been many other papers dealing with particular mixture models.
The published research primarily deals with mixtures of distributions of similar types, such as mixtures of normal distributions, mixtures of chi-square distributions, mixtures of exponential distributions, mixtures of binomial distributions, and so on. However, the literature contains very few papers that provide and deal with special "nonstandard" mixtures that mix discrete (degenerate, even) and continuous distributions as emphasized in this Report. In general, the word mixture refers to a convex combination of distributions or random variables. To illustrate, suppose X and Y are random variables with distribution functions F and G respectively. Let 0 < p < 1. Then H = pF + (1 - p)G is a distribution function that may be called a mixture of F and G. The interpretation of H is that it represents a
model in which the distribution F is used with probability p while G is used with probability 1 - p. In terms of random variables, one may say that H models an observation Z that is obtained as follows: With probability p observe X having distribution F, and with probability 1 - p observe Y having distribution G. Such mixtures may then be viewed as models for data that may be interpreted as the outcomes of a two-stage experiment: In the first stage, a population is randomly chosen and then in the second stage an observation is made from the chosen population. It is not necessary to limit oneself to mixtures of just two or even a finite number of distributions. In general, one may have an arbitrarily indexed family of distributions, for which an index is randomly chosen from a given mixing distribution. It should also be emphasized that there is considerable ambiguity associated with mixtures; every distribution may be expressed as a mixture in infinitely many ways. Nevertheless, when mixture models are formulated reasonably, they can provide useful tools for statistical analysis. There is by now a large literature pertaining to statistical analyses of mixtures of distributions; for a source of references, see Titterington, Smith and Makov (1985). Problems and applications of mixtures also appear in the literature associated with the term heterogeneity; see Keyfitz (1984).

Applications Involving Nonstandard Mixtures

The interpretation of the nonstandard mixtures emphasized in this report is quite simple. If F, the degenerate distribution, is chosen in the first stage, the observed value of the outcome is zero; otherwise the observed value is drawn from the other distribution. In what follows we illustrate several situations in which this type of nonstandard mixture may arise, and indicate thereby its wide range of applications. There are of course fundamental differences among many of these applications.
For example, in some of these applications, the mixtures are distinguishable in the sense that one can tell from which population an observation has come, while in others the mixtures are indistinguishable. In many applications it is necessary to form restrictive parametric models for the non-degenerate distribution G; in at least one example, G is itself seen to arise from a mixture. In some cases G admits only positive values of X; in other cases G permits positive, negative, or even zero values. Of course, if G also permits zero values with positive probability, then the mixture is clearly indistinguishable. The descriptions of the following applications are brief and somewhat simplified. They should suffice, however, to indicate the broad diversity of important situations in which these nonstandard mixtures arise. We begin with the auditing application that is the focus of this report.
1. In auditing, some population elements contain no errors while other population elements contain errors of varying amounts. The distribution of errors can, therefore, be viewed as a mixture of two distinguishable distributions, one with a discrete probability mass at zero and the other a continuous distribution of non-zero positive and/or negative error amounts. The main statistical objective in this auditing problem is to provide a statistical bound for the total error amount in the population. The difficulty inherent in this problem is the typical presence of only a few or no errors in a given sample. This application will be the main focus of this report; it is studied at length in Chapter II. Independent public accountants often use samples to estimate the amount of monetary error in an account balance or class of transactions. Their interest usually centers on obtaining a statistical upper bound for the true monetary error, a bound that is most likely going to be greater than the error. A major concern is that the estimated upper bound of monetary error may in fact be less than the true amount more often than desired. Governmental auditors are also interested in monetary error - the difference between the costs reported and what should have been reported, for example. Because the government may not wish to overestimate the adjustment that the auditee owes the government, interest often centers on the lower confidence limit of monetary error at a specified confidence level allowed by the policy. The mixture problem affects both groups of auditors as well as internal auditors who may be concerned with both upper and lower limits. In all cases there is a serious tendency for the use of standard statistical techniques, that are based upon the approximate normality of the estimator of total monetary error, to provide erroneous results. Specifically, as will be reviewed in the following chapter, both confidence limits tend to be too small.
Upper limits being too small means that the frequency of upper limits exceeding the true monetary error is less than the nominal confidence level. Lower limits being too small means that the frequency of lower limits being smaller than the true monetary error is greater than the nominal confidence level. To the auditors these deficiencies have important practical consequences.
Most of the research to date has been directed toward the independent public accountants' concern with the upper limit. For example, the research outlined in the next chapter that is concerned with sampling of dollar units represents a major thrust in this direction. By contrast, very little research has been done on the problem of the lower confidence bound. This represents an area of considerable importance where research is needed.
2. In a community a particular service, such as a specific medical care, may not be utilized by all families in the community. There may be a substantial portion of non-takers of such a service. Those families who subscribe to it do so in varying amounts. Thus the distribution of the consumption of the service may be represented by a mixture of zeros and positive values.
3. In the mass production of technological components of hardware, intended to function over a period of time, some components may fail on installation and therefore have zero life lengths. A component that does not fail on installation will have a life length which is a positive random variable whose distribution may take different forms. Thus, the overall distribution of lifetimes which includes the duds is a nonstandard mixture.
4. In measuring precipitation amounts for specified time periods, one must deal with the problem that a proportion of these amounts will be zero (i.e., measured as zero). The remaining proportion is characterized by some positive random variable. The distribution of this positive random variable usually looks reasonably smooth, but in fact is itself a complex mixture arising from many different types of events.
5. In the study of human smoking behavior, two variables of interest are smoking status - Ever Smoked and Never Smoked - and score on a "Pharmacological Scale" of people who have smoked. This also is a bivariate problem with a discrete variate - 0 (Never Smoked), 1 (Ever Smoked) - and a continuous variate "Pharmacological Score."
A nontrivial conditional distribution of the second variate can be defined only in association with the 1 outcome of the first variate. This problem can be further complicated by nonresponse on either of the first or second variates.
6. In the study of tumor characteristics, two variates may be recorded. The first is the absence (0) or presence (1) of a tumor and the second is tumor size measured on a continuous scale. In this problem, it is sometimes of interest to consider a marginal tumor measurement which is 0 with nonzero probability, an example of a mixture of unrelated distributions. The problem can be further complicated by recognizing that the absence of a tumor is an operational definition and that in fact patients with non-detectable tumors will be included in this category.
7. In studies of genetic birth defects, children can be characterized by two variates, a discrete or categorical variable to indicate if one is not affected, affected and born dead, or affected and born alive, and a continuous variable measuring the survival time of affected children born alive. The conditional distribution of survival time given this first variable is undefined for children who are not affected, a mass point at 0 for children who are affected and born dead, and nontrivial for children who are born alive. In some cases it may be necessary to consider the conditional survival time distribution for affected children as a mixture of a mass point (at 0) and a nontrivial continuous distribution.
8. Consider measurements of physical performance scores of patients with a debilitating disease such as multiple sclerosis. There will be frequent zero measurements from those giving no performance and many observations with graded positive performance.
9. In a study of tooth decay, the number of surfaces in a mouth which are filled, missing, or decayed are scored to produce a decay index. Healthy teeth are scored 0 for no evidence of decay. The distribution is a mixture of a mass point at 0 and a nontrivial continuous distribution of decay score. The problem could be further complicated if the decay score is expressed as a percent of damage to measured teeth.
The distribution should then be a mixture of a discrete random variable (0 - healthy teeth, 1 - all teeth missing) with nonzero probability of both outcomes and a continuous random variable (amount of decay in the (0,1) interval).
10. In studies of methods for removing certain behaviors (e.g., predatory behavior, or salt consumption), the amount of
the behavior which is exhibited at a certain point in time may be measured. In this context, complete absence of the target behavior may represent a different result than would a reduction from a baseline level of the behavior. Thus, one would model the distribution of activity levels as a mixture of a discrete value of zero and a continuous random level.
11. Time until remission is of interest in studies of drug effectiveness for treatment of certain diseases. Some patients respond and some do not. The distribution is a mixture of a mass point at 0 and a nontrivial continuous distribution of positive remission times.
12. In a quite different context, important problems exist in time-series analysis in which there are mixed spectra containing both discrete and continuous components.

In some of the above examples, the value zero is a natural extension of the possible measurements, and in other examples it is not. For example, in measuring behavioral activity (Example 10), a zero measurement can occur because the subject has totally ceased the behavior, or because the subject has reduced the behavior to such a low average level that the time of observation is insufficient to observe the behavior. This indecision might also occur in the example concerning tumor measurement or in rainfall measurement. In other examples, however, it is possible to determine the source of the observation. The very fact that the service lifetime of a component in Example 3 is zero identifies that component as a dud, and in Example 7 there is a clear distinction between stillborn and liveborn children. These two kinds of examples represent applications of indistinguishable and distinguishable mixtures, respectively.
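
Two ideas from this Introduction lend themselves to a small simulation, offered here only as an illustrative sketch and not as a method from the report. Part 1 draws from the nonstandard mixture by the two-stage experiment and contrasts a distinguishable mixture (a zero arises only from the degenerate population) with an indistinguishable one (small positive values are also recorded as zero). Part 2 estimates how often normal-theory one-sided limits fall on the correct side of the true mean error when errors are rare, as in Example 1. Everything specific is an assumption made for illustration: the exponential form of G, the parameter values, the detection limit, the use of an infinite-population model in place of a finite account, and the function names `draw`, `normal_limits`, and `coverage`.

```python
import math
import random
import statistics

def draw(rng, p_zero, mean_positive):
    """Two-stage draw from the nonstandard mixture; returns (source, value)."""
    if rng.random() < p_zero:           # stage 1: degenerate population chosen
        return "degenerate", 0.0        # stage 2: the constant zero is observed
    # Stage 2 otherwise: a draw from G (exponential purely as an assumption).
    return "continuous", rng.expovariate(1.0 / mean_positive)

# Part 1: distinguishable versus indistinguishable zeros.
rng = random.Random(0)
draws = [draw(rng, p_zero=0.3, mean_positive=2.0) for _ in range(10_000)]
n_degenerate = sum(source == "degenerate" for source, _ in draws)
n_exact_zero = sum(value == 0.0 for _, value in draws)
# With a hypothetical detection limit, small positive values are recorded as
# zero, so a recorded zero no longer identifies its source.
detection_limit = 0.1
n_recorded_zero = sum(value < detection_limit for _, value in draws)
print(n_degenerate, n_exact_zero, n_recorded_zero)

# Part 2: coverage of normal-theory one-sided limits when errors are rare.
def normal_limits(sample, z=1.645):
    """Lower and upper one-sided 95% limits for the mean, by normal theory."""
    centre = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - half_width, centre + half_width

def coverage(n=100, error_rate=0.02, mean_error=50.0, reps=2000, seed=1):
    """Fraction of samples whose upper (lower) limit is above (below) the truth."""
    rng = random.Random(seed)
    true_mean = error_rate * mean_error   # mean error per item under the model
    upper_ok = lower_ok = 0
    for _ in range(reps):
        sample = [draw(rng, 1.0 - error_rate, mean_error)[1] for _ in range(n)]
        lower, upper = normal_limits(sample)
        upper_ok += upper >= true_mean
        lower_ok += lower <= true_mean
    return upper_ok / reps, lower_ok / reps

upper_cov, lower_cov = coverage()
print(f"upper-limit coverage {upper_cov:.3f}, lower-limit coverage {lower_cov:.3f}")
```

Under these assumed settings a sample of 100 items contains no errors at all with probability 0.98^100, roughly 0.13. In that case the normal-theory upper limit is zero and necessarily falls below the true mean error, so the upper limit cannot attain its nominal 95% coverage. This is one concrete way in which "only a few or no errors in a given sample" undermines methods based on approximate normality.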