Appendix F
Computerized Scoring of Polygraph Data
INTRODUCTION
A critical part of polygraph examination is the analysis and interpretation of the physiological data recorded on polygraph charts. Currently, polygraph examiners rely on their subjective global evaluation of the charts, various partly objective numerical scoring methods, computerized algorithms for chart scoring, or some combination of the three. Computerized systems have the potential to reduce bias in the reading of charts and to eliminate the problems of imperfect interrater reliability that exist with human scoring. The extent to which they can improve accuracy depends on how one views the appropriateness of using other knowledge available to examiners, such as demographic information, historical background of the subject, and behavioral observations.^{1}
Computerized systems have the potential to perform such tasks as polygraph scoring better and more consistently than human scorers. This appendix summarizes the committee’s review of existing approaches to such scoring systems. Specifically, it focuses on two systems: the Computerized Polygraph System (CPS) developed by Scientific Assessment Technologies based on research conducted at the psychology laboratory at the University of Utah, and the PolyScore® algorithms developed at Johns Hopkins University Applied Physics Laboratory. We also comment on the Axciton™ and Lafayette™ polygraph instruments that use the PolyScore algorithms.
The statistical methods used in classification models are well developed. Based on a set of data with predictor variables (features in the polygraph test) of known deceptive and nondeceptive subjects, one attempts to find a function of the predictor variables with high values for deceptive and low values for nondeceptive subjects. The conversion of continuous polygraph readings into a set of numeric predictor variables requires many steps and detailed decisions, which we outline below. In particular, we discuss aspects of choosing a small number of these predictors that together do the best job of predicting deception, and we consider the dangers of attempting to use too many variables when the test data set is relatively small.
We examined the two scoring systems with sufficient documentation to allow evaluation. The CPS system has been designed with the goal of automating what careful human scorers currently do and has focused from the outset on a relatively small set of data features; PolyScore has been developed from a much larger set of features, and it is more difficult to evaluate because details of development are lacking. Updates to these systems exist, but their details are proprietary and were not shared with us. The description here focuses on the PolyScore and CPS scoring algorithms since no information is publicly available on statistical methods utilized by these more recently developed algorithms, although the penultimate section includes a summary of the performance of five algorithms, based on Dollins, Krapohl, and Dutton (2000).^{2}
Since the 1970s, papers in the polygraph literature have proffered evidence claiming to show that automated classification algorithms could accomplish the objective of minimizing both false positive and false negative error rates. Our own analyses, based on a set of several hundred actual polygraphs from criminal cases provided by the U.S. Department of Defense Polygraph Institute (DoDPI), suggest that it is easy to develop algorithms that appear to achieve perfect separation of deceptive and nondeceptive individuals by using a large number of features or classifying variables selected by discriminant analysis, logistic regression, or a more complex data-mining technique. Statisticians have long recognized that such a process often leads to “overfitting” of the data, however, and to classifiers whose performance deteriorates badly under proper cross-validation assessment (see Hastie, Tibshirani, and Friedman [2001] for a general discussion of feature selection). Such overestimation still occurs whenever the same data are used both for fitting and for estimating accuracy, even when the appropriate set of features is predetermined (see Copas and Corbett, 2002). Thus, on a new set of data, these complex algorithms often perform less effectively than alternatives based on a small set of simple features.
In a recent comparison, various computer scoring systems performed similarly and with only modest accuracy on a common data set used for validation (see Dollins, Krapohl, and Dutton, 2000). The committee believes that substantial improvements to current numerical scoring may be possible, but the ultimate potential of computerized scoring systems depends on the quality of the data available for system development and application and the uniformity of the examination formats with which the systems are designed to deal.
STATISTICAL MODELS FOR CLASSIFICATION AND PREDICTION
Before turning to the computer algorithms themselves, we provide some background on the statistical models that one might naturally use in settings such as automated polygraph scoring. The statistical methods for classification and prediction most often involve structures of the form:
response variable = g(predictor variables, parameters, random noise). (1)
For prediction, the response variable can be continuous or discrete; for classification, it is customary to represent it as an indicator variable, y, such that, in the polygraph setting, y = 1 if a subject is deceptive, and y = 0 if the subject is not. Some modern statistical approaches, such as discriminant analysis, can be viewed as predicting the classification variable y directly, while others, such as logistic regression, focus on estimating functions of it, such as Pr(y = 1). Typically, such estimation occurs conditional on the predictor variables, x, and the functional form, g.
Thus, for linear logistic regression models, with k predictor variables, x = ( x_{1}, x_{2}, x_{3}, x_{4}, . . . , x_{k}), the function g is estimated in equation (1) using a linear combination of the k predictors:
score(x) = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{3} + β_{4} x_{4} + ... + β_{k} x_{k}, (2)
and the “response” of interest is
Pr(y = 1 | x) = e^{score(x)} / (1 + e^{score(x)}). (3)
(This is technically similar to choosing g = score(x), except that the random noise in equation (1) is now associated with the probability distribution for y in equation (3), which is usually taken to be Bernoulli.) The observations on the predictor variables here lie in a k-dimensional space and, in essence, we are using an estimate of the score equation (2) as a hyperplane to separate the observations into two groups, deceptives and nondeceptives. The basic idea of separating the observations remains the same for nonlinear approaches as well. Model estimates do well (e.g., have low errors of misclassification) if there is real separation between the two groups.
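As a minimal illustration of equation (2), and assuming the standard logistic form for the response in equation (3), the following Python sketch computes a linear score and converts it into an estimated probability of deception. The coefficients, feature values, function names, and the 0.5 cutoff are hypothetical and chosen only for illustration; they are not taken from CPS or PolyScore.

```python
import math

def linear_score(x, beta0, betas):
    """Equation (2): the intercept plus a weighted sum of the k predictors."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def prob_deceptive(score):
    """Logistic transform of the score, an estimate of Pr(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients and standardized features (illustration only).
beta0, betas = -0.5, [1.2, 0.8, -0.6]
x = [0.9, 0.4, -0.3]

s = linear_score(x, beta0, betas)
p = prob_deceptive(s)
label = 1 if p >= 0.5 else 0  # 0.5 is an arbitrary illustrative cutoff
```

The set of points where the score equals zero is the separating hyperplane: cases on one side receive estimated probabilities above 0.5 and cases on the other side receive probabilities below it.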
Model development and estimation for such prediction/classification models involve a number of steps:

1. Specifying the list of possible predictor variables and features of the data to be used to assist in the classification model (1). Individual variables can often be used to construct multiple prediction terms or features.
2. Choosing the functional form g in model (1) and the link function to the classification variable, y, as in equation (3).
3. Selecting the actual features from the feature space to be used for classification.
4. Fitting the model to data to estimate empirically the prediction equation to be used in practice.
5. Validating the fitted model through classification of observations in a separate dataset or through some form of cross-validation.
Hastie, Tibshirani, and Friedman (2001) is a good source on classification/prediction models, cross-validation, and related statistical methodologies and discussions that could be applied to the polygraph problem. Recently, another algorithmic approach to prediction and classification problems, often called data mining, has emerged from computer science. It focuses less on the specification of formal models and treats the function g in equation (1) more as a black box that produces predictions. Among the tools used to specify the black box are regression and classification trees, neural networks, and support vector machines. These still involve finding separators for the observations, and for any method one chooses to use, step 1 and algorithmically oriented analogues of steps 2 through 5 listed above still require considerable care.
Different methods of fitting and specification emphasize different features of the data. The standard linear discriminant analysis is developed under the assumption that the distributions of the predictors for both the deceptive group and the nondeceptive group are multivariate normal, with equal covariance matrices (an assumption that can be relaxed), which gives substantial weight to observations far from the region of concern for separating the observations into two groups. Logistic regression models, in contrast, make no assumptions about the distribution of the predictors, and the maximum likelihood methods typically used for their estimation put heavy emphasis on observations close to the boundary between the two sets of observations. Common experience with all prediction models of the form (1) is that with a large number of predictor variables, one can fit a model to the data (using steps 1 through 4) that completely separates the two groups of observations. However, implementation of step 5 often shows that the achieved separation is illusory. Thus, many empirical approaches build cross-validation directly into the fitting process and set aside a separate part of the data for final testing.
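The illusory separation described above can be reproduced with pure noise. The Python sketch below is a stripped-down analogue rather than a full model fit: instead of fitting many predictors jointly, it searches a large pool of random features for the single one that best matches randomly assigned labels, then applies the chosen rule to fresh cases. The sample sizes, the seed, and the function name are arbitrary assumptions for illustration.

```python
import random

def best_noise_feature(n_train=30, n_features=2000, n_test=2000, seed=0):
    """Select the noise feature that best 'predicts' random labels."""
    rng = random.Random(seed)
    y_train = [rng.randint(0, 1) for _ in range(n_train)]
    best_acc, best_sign = 0.0, 1
    for _ in range(n_features):
        feat = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
        # Rule: call a case deceptive (1) when the feature is positive.
        hits = sum((f > 0) == (y == 1) for f, y in zip(feat, y_train))
        acc = hits / n_train
        # Allow the rule to be flipped, as a fitted coefficient's sign could be.
        for sign, a in ((1, acc), (-1, 1.0 - acc)):
            if a > best_acc:
                best_acc, best_sign = a, sign
    # The selected feature is noise, so on new cases it is fresh noise again.
    y_test = [rng.randint(0, 1) for _ in range(n_test)]
    f_test = [rng.gauss(0.0, 1.0) for _ in range(n_test)]
    hits = sum((best_sign * f > 0) == (y == 1) for f, y in zip(f_test, y_test))
    return best_acc, hits / n_test

apparent, holdout = best_noise_feature()
```

The apparent accuracy on the data used for selection is well above chance even though no feature carries any information, while the accuracy on the held-out cases stays near one half. This is the gap that step 5 is meant to expose.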
The methods used to develop the two computer-based scoring algorithms, CPS and PolyScore, both fit within this general statistical framework. The CPS developers have relied on discriminant function models, and the PolyScore developers have largely used logistic regression models. But the biggest differences that we can discern between them are the data they use as input, their approaches to feature development and selection, and the efforts that they have made at model validation and assessment. The remainder of this appendix describes the methodologies associated with these algorithms and their theoretical and empirical basis.
DEVELOPMENT OF THE ALGORITHMS
A common goal for the development of computer-based algorithms for evaluating polygraph exams is accuracy in classification, but the devil is in the details. A proper evaluation requires an understanding of the statistical basis of classification methods used, the physiological data collected for assessment, and the data on which the methods have been developed and tested.
CPS builds heuristically on the Utah numerical manual scoring, which is similar in spirit to the Seven-Position Numerical Analysis Scale, a manual scoring system currently taught by DoDPI. PolyScore, in contrast, does not attempt to recreate the manual scoring process that the examiners use. Neither appears to rely on more fundamental research on information in the psychophysiological processes underlying the signals being recorded, except in a heuristic fashion.
CPS was developed by Scientific Assessment Technologies based on research conducted at the psychology laboratory at the University of Utah by John Kircher and David Raskin (1988) and their Computer Assisted Polygraph System developed in the 1980s. While the latter system was developed on data gathered in the laboratory using mock crime scenarios, the newer CPS versions have been developed using polygraph data from criminal cases provided by U.S. Secret Service Criminal Investigations (Kircher and Raskin, 2002). The CPS scoring algorithm is based on standard multivariate linear discriminant function analysis followed by a calculation that produces an estimate of the probability of truthfulness or, equivalently, deception (Kircher and Raskin, 1988, 2002). The most recent version utilizes three features in calculating a discriminant score: skin conductance amplitude, the amplitude of increase in the baseline of the cardiograph, and combined upper and lower respiration line-length (excursion) measurement (Kircher and Raskin, 2002).
PolyScore was developed by Johns Hopkins University Applied Physics Laboratory (JHUAPL), and version 5.1 is currently in use with the Axciton and Lafayette polygraph instruments. The algorithm has been developed on polygraph tests for actual criminal cases provided by the DoDPI. The input to PolyScore is the digitized polygraph signal, and the output is a probability of deception based either on a logistic regression or a neural network model. The PolyScore algorithm transforms these signals on galvanic skin response, blood pressure (cardio), and upper respiration into what its developers call “more fundamental” signals that they claim isolate portions of the signals that contain information relevant to deception. It is from these signals that the PolyScore developers extracted features for use, based on empirical performance rather than a priori psychophysiological assumptions.
The next sections describe how the two algorithms treat data used, signal processing, feature extraction, statistical analysis, and algorithm evaluation. These descriptions provide the basis for a discussion of possible future efforts at algorithm development and assessment. Since virtually all of the development and testing of algorithms has been done on specific-incident data, with highly varying formats and structures, some of the observations and comments on the algorithms may not always have as much relevance to highly structured screening polygraph tests, like the Test for Espionage and Sabotage (TES), but other problems, such as low base rates, do have salience for the TES. The final sections of this appendix on algorithm evaluation and summary describe some of these issues.
Data Used
Current polygraph machines typically record four signals during a polygraph examination: thoracic and abdominal respirations, a cardiovascular signal, and an electrodermal signal. Differences between specific analog and digital machines exist in the recording of the physiological measurements. Sampling rates may vary between different systems. Analog-to-digital conversion, filtering, and pen adjustments may also vary. One crucial difference lies in the recording of the electrodermal channel, which is believed by many polygraph researchers to be the most diagnostic (Kircher and Raskin, 2002). Stoelting (and CPS) records skin conductance; Lafayette appears to record skin resistance, a signal that requires further filtering in order to stabilize the baseline of the response; Axciton actually uses a hybrid of skin resistance and skin conductance (Dollins, Krapohl, and Dutton, 2000) (see the discussion of the advantages and disadvantages of these two measures in Appendix D). Kircher and Raskin (2002) provide more details on the physiological recordings and conversion of analog to digital signal, although they focus mainly on the procedures used by CPS. These matters are, in effect, precursors to the development of automated scoring algorithms, which presume that the analyzed signals “accurately” reflect the psychophysiological phenomena that are capable of distinguishing deception and nondeception.
PolyScore® 3.0 was developed by analyzing polygraph data from 301 presumed nondeceptive and 323 presumed deceptive criminal incident polygraph examinations, with six Axciton instruments. The apparatus specifications for these cases are not available. “Truth” for these cases was obtained in three ways:

1. confession or guilty plea,
2. consensus on truthful subjects by two or more different examiners, or
3. confirmed truthful.
Version 5.1 of PolyScore used Zone Comparison Test (ZCT) and Modified General Question Test (MGQT) data from 1,411 real cases (J. Harris, personal communication, Johns Hopkins University Applied Physics Laboratory, 2001).
Chapters 2 and 4 of this report describe many of the biases that can result from the use of field cases selected from a larger population on the basis of truth and point out that consensus among multiple examiners is not acceptable as a criterion of deceptive/nondeceptive status. In effect, the use of such data can be expected to produce exaggerated estimates of polygraph accuracy. Nonetheless, most of the discussion that follows sets these concerns aside. Using field data, especially from criminal settings, to develop algorithms poses other difficulties. Actual criminal case polygraphs exhibit enormous variability in the subject of investigation, format, structure, administration, and so on. These data are hard to standardize for an individual and across individuals in order to develop generalizable statistical procedures.
We analyzed polygraph data from 149 criminal cases using the ZCT and MGQT test formats, data that overlapped with those used in the development of PolyScore. Besides differences in the nature of the crime under investigation, our analyses revealed diverse test structures, even for the same test format, such as ZCT. The questions varied greatly from test to test and were clearly semantically different from person to person, even within the same crime. The order of questions varied across charts for the same person. In our analyses, we found at least 15 different sequences for relevant and control questions, ignoring the positioning of the irrelevant questions. The number of relevant questions asked varied. Typically, there were three relevant questions. Accounting for irrelevant/control questions substantially increases the number of possible sequences. These types of differences across cases pose major problems for both within- and between-subject analyses, unless all the responses are averaged. Finally, in the cases we examined there was little or no information available to control for differences among examiners, examiner-examinee interactions, delays in the timing of questions, etc. Some of these problems can be overcome by careful systematic collection of polygraph field data, especially in a screening setting, and others cannot. Controlling for all possible dimensions of variation in a computer-scoring algorithm, however, is a daunting task unless one has a large database of cases.
The laboratory or mock crime studies so commonly found in the polygraph literature typically remedy many of these problems, but they have low stakes, lack realism, and do not replicate the intensity of the stimulus of the real situations. Laboratory test formats are more structured. The same sequence of questions is asked of all the subjects, making these exams more suitable for statistical analysis. For laboratory data, the experimental setup predetermines a person’s deceptive and nondeceptive status, thus removing the problem of contaminated truth. Laboratory studies can have more control over the actual recording of the measurements and running of the examinations, as well as information on examiners, examinees, and their interactions. A major shortcoming of laboratory polygraph data for developing computer-based algorithms, however, is that they do not represent the formats that will be ultimately used in actual investigations or screening settings. Similarly, laboratory subject populations differ in important ways from those to whom the algorithms will be applied.
Signal Processing
With modern digital polygraphs and computerized systems, the analog signals are digitized, and the raw digitized electrodermal (skin conductance), cardiovascular, and respiratory (abdominal and thoracic) signals are used in the algorithm development. The analog-to-digital conversion process may vary across different polygraph instruments. We were unable to determine Axciton instrument specifications. Kircher and Raskin (1988) provide some procedures used by Stoelting’s polygraph instruments for CPS. Once the signals have been converted, the primary objective of signal processing is to reduce the noise-to-information ratio. This traditionally involves editing of the data, e.g., to detect artifacts and outliers, some signal transformation, and standardization.
Artifact Detection and Removal
Artifacts indicate distortions in the signal that can be due to the movement of the examinee or some other unpredicted reactions that can modify the signal. Outliers account for both extreme relevant and control responses. The PolyScore algorithms include components for detecting artifacts and deciding if a signal is good or not. Kircher and Raskin (2002) report that they developed algorithms for artifact removal and detection in the 1980s, but they were not satisfied with their performance and did not use them as a part of CPS. Thus, examiners using CPS need to manually edit artifacts before the data are processed any further.
PolyScore tests each component of each question for artifacts and outliers. If any are detected, the algorithms remove those portions of the record from scoring, but examiners can review the charts and change the labeled artifacts, if they find it appropriate. Olsen et al. (1997) report that PolyScore labels a portion of a record as an extreme reaction (outlier) if it accounts for more than 89 percent of the variability among all the responses on the entire polygraph exam for a person; although the precise meaning of this is not totally clear, a portion of the individual’s data would probably need to be totally off the scale to account for so much of the variation.
The committee was told that the PolyScore algorithms are proprietary and not available for evaluation. Thus, we were unable to examine the appropriateness of the procedures used in connection with artifact adjustment and the accuracy of any of the related claims.
Signal Transformation
A second step in data editing is signal transformation. Both CPS and PolyScore algorithms transform the raw digitized signals in different ways, but with a common goal of further signal enhancement.
PolyScore detrends the galvanic skin response and cardio signals by removing the “local mean,” based on 30-second intervals both before and after the point, from each point in the signal, thus removing long-term or gradual changes unrelated to a particular question. This removes pen adjustments caused by the examiner. After detrending, PolyScore separates the cardio signal through a digital filter into the high-frequency portion representing pulse and the low-frequency component corresponding to overall blood volume. The derivative of the detrended blood volume then measures the rate of change and uncovers the remnants of the pulse in the blood volume signal, which are further eliminated by a second filter. The respiration signal, like the cardio signal, has two frequency components: a high frequency corresponding to each breath and a low frequency representing the residual lung volume. Baselining, achieved by matching each low point of exhalation between breaths to a common level, separates these frequencies and makes it easier to compare the relative heights of breaths (Harris et al., 1994).
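The local-mean detrending step can be sketched in a few lines of Python. This is an illustration of the general idea only, not the PolyScore code, which was not available for review: the sampling rate, the inclusion of the point itself in the window, the truncation of the window at the ends of the record, and the function name are all assumptions.

```python
def detrend_local_mean(signal, half_window):
    """Subtract from each sample the mean of the samples lying within
    half_window points on either side (window truncated at the ends)."""
    n = len(signal)
    detrended = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        local_mean = sum(signal[lo:hi]) / (hi - lo)
        detrended.append(signal[i] - local_mean)
    return detrended

# Hypothetical sampling rate; 30 seconds on each side of the point.
SAMPLES_PER_SECOND = 10
half_window = 30 * SAMPLES_PER_SECOND
```

A constant offset or a steady linear drift is removed entirely in the interior of the record, because the symmetric window mean then equals the value at the point, while a brief reaction that is short relative to the window is largely preserved.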
CPS creates response curves (waveforms) for the digitized signals of skin conductance, thoracic respiration, and abdominal respiration from the sequence of stored poststimulus samples for a 20-second period following the onset of each question (Kircher and Raskin, 1988). To produce the blood pressure response waveform, CPS averages the systolic and diastolic levels for each second. Finger pulse amplitude is a second-by-second waveform like the blood pressure. However, this waveform is the difference of diastolic and systolic levels, not the average. Diastolic levels at 2 seconds prestimulus and 20 seconds poststimulus are subtracted from the corresponding systolic levels. Twenty poststimulus ratios are calculated by dividing each poststimulus amplitude by the average of the two prestimulus values. Each proportion is then subtracted from unity, yielding a finger pulse amplitude waveform that rises as the amplitude of the finger pulse decreases. Features are extracted from the times and levels of inflection points.
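The finger pulse amplitude calculation described above can be written out directly. The following Python sketch follows the verbal description only; the function name, the argument layout (prestimulus seconds first, then poststimulus seconds), and the example values are assumptions made for illustration.

```python
def finger_pulse_waveform(systolic, diastolic, n_pre=2):
    """Second-by-second levels, with the first n_pre seconds prestimulus
    and the remaining seconds poststimulus."""
    # Pulse amplitude per second: systolic level minus diastolic level.
    amplitude = [s - d for s, d in zip(systolic, diastolic)]
    # Baseline: average of the prestimulus amplitudes.
    baseline = sum(amplitude[:n_pre]) / n_pre
    # One minus the ratio, so the waveform rises as the pulse shrinks.
    return [1.0 - a / baseline for a in amplitude[n_pre:]]

# Made-up levels: two prestimulus seconds, then two poststimulus seconds.
wave = finger_pulse_waveform([10.0, 10.0, 9.0, 8.0], [0.0, 0.0, 1.0, 2.0])
```

In this made-up example the pulse amplitude falls from 10 to 8 and then to 6, so the waveform rises from about 0.2 to about 0.4. In the CPS description there would be 20 poststimulus values rather than two.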
Signal Standardization
PolyScore performs signal standardization so that the extracted features are on a common scale; CPS does not. Harris et al. (1994) stress the importance of this step in the development of PolyScore. The goal of this step is to allow amplitude measurements across different charts or individuals to be scored by a common algorithm. Typically, standardization is performed by subtracting the mean of the signal from each data point and dividing this difference by the standard deviation. JHUAPL points out that because the data contain outliers, this method is inaccurate; PolyScore therefore standardizes by subtracting the median from each data point and dividing the difference by the interquartile range (the distance between the 1st and 3rd quartiles, that is, the 25th and 75th percentiles).
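The contrast between the two standardizations can be seen in a short Python sketch. It uses the standard library's quartile estimator, which is only one of several conventions; the estimator, the use of the sample standard deviation, the function names, and the data are assumptions for illustration and do not describe the PolyScore implementation.

```python
import statistics

def z_standardize(values):
    """Conventional standardization: subtract the mean, divide by the SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def robust_standardize(values):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return [(v - med) / (q3 - q1) for v in values]

clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
spiky = clean[:-1] + [1000.0]  # same record with one extreme artifact

robust_clean = robust_standardize(clean)
robust_spiky = robust_standardize(spiky)
z_clean = z_standardize(clean)
z_spiky = z_standardize(spiky)
```

With these made-up values, the single extreme point leaves the median and quartiles unchanged, so the robust scores of the other eight points are identical in the two records, whereas their conventional z-scores shift substantially.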
Feature Extraction
The discussion of general statistical methodology for prediction and classification at the beginning of this appendix noted the importance of feature development and selection. The goal is to obtain a set of features from the raw data that can have some relevance in modeling and classification of internal psychological states, such as deception. For polygraph data, a feature can be anything measured or computed that represents an emotional signal. The mapping between psychological and physiological states remains a substantial area of investigation in psychophysiology. Some commonly used features in the manual scoring are changes in amplitude in respiration, galvanic skin response and cardiovascular response, changes in baseline of respiration, duration of a galvanic skin response, and change in rate of cardiovascular activity. Computerized analysis of digitized signals offers a much larger pool of features, some of them not easily observable by visual inspection.
The general psychophysiological literature suggests describing the skin conductance response using such features as level, changes in the level, frequency of nonspecific responses, event-related response amplitude, latency, rise time, half recovery time, number of trials before habituation, and rate of change of event-related amplitude. Dawson, Schell, and Filion (2000) note that the rise time and half recovery time might be redundant measures and not as well understood as amplitude in association with psychophysiological responses. Similarly, cardiovascular activity is typically analyzed using heart rate and its derivatives, such as the heart rate variability or the difference of the maximum and minimum amplitudes. Brownley, Hurwitz, and Schneiderman (2000), however, state that reliability of heart rate variability as a measure is controversial, and they suggest the use of respiratory sinus arrhythmia, which represents the covariance between the respiratory and heart rate activity. This approach implies a need for frequency-domain analysis in addition to time-domain analysis of the biological signals. Harver and Lorig (2000) suggest looking at respiratory rate and breathing amplitude as possible features that describe respiratory responses. They also point out that recording changes only of upper or only of lower respiration is not adequate to estimate relative breathing amplitude. In general, area measures (integrated activity over time) are less susceptible to high-frequency noise than peak measures, but amplitude measurements are more reliable than latency (Gratton, 2000).
Early research focusing specifically on the detection of deception suggested that the area under the curve and amplitudes of both skin conductance and cardiovascular response can discriminate between deceptive and truthful subjects. Other features investigated included duration of rise to peak amplitude, recovery of the baseline, and the overall duration of the response. Kircher and Raskin (1988) report that line length, the sum of absolute differences between adjacent sample points, which captures some combination of rate and amplitude, is a good measure of respiration suppression.
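Line length as defined here, the sum of absolute differences between adjacent sample points, is simple to compute. The Python sketch below uses made-up respiration samples; the function name and data are illustrative assumptions.

```python
def line_length(samples):
    """Sum of absolute differences between adjacent sample points."""
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))

# Made-up traces: normal breathing versus suppressed (shallower) breathing.
normal = [0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]
suppressed = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

normal_length = line_length(normal)
suppressed_length = line_length(suppressed)
```

Because the measure accumulates every up and down movement of the trace, it shrinks when breaths become shallower or slower, which is why it can serve as an index of respiration suppression.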
Harris (1996, personal communication) reports that the initial feature space for PolyScore 3.0 had 4,488 features and that about 10,000 features were considered for the 5.1 version. PolyScore’s main focus for feature development and selection appears to have been on reaction time (i.e., where the reaction starts, peaks, ends) and the reaction’s magnitude (i.e., amplitude), described by four numerical characteristics: percentile, derivative, line length, and latency period. JHUAPL evaluated the features using different window sizes (response intervals) for different signals.
PolyScore 3.2 uses a logistic regression model incorporating ten features: three each that describe galvanic skin response and blood volume and two each that describe pulse and respiration (Olsen et al., 1997). PolyScore 5.1 uses a neural network incorporating 22 features. JHUAPL declined to provide the committee with the specific features used by either program or detailed information on their selection.
Kircher and Raskin (1988, 2002) report that CPS initially considered 12 features describing the response waveforms for its discriminant analysis:

skin conductance amplitude,

blood pressure amplitude,

finger pulse amplitude,

skin conductance rise time,

skin conductance full recovery time,

blood pressure duration of half recovery time,

finger pulse amplitude duration of half recovery time,

skin conductance rise rate,

blood pressure half recovery rate,

skin conductance full recovery rate,

electrodermal burst frequency, and

respiration line length.
The most recent version of the CPS algorithm, however, uses only three features: skin conductance amplitude, the amplitude of increases in the baseline of the cardiograph, and a line length composite measure of thoracic and abdominal respiration excursion (Kircher and Raskin, 2002). These features differ from those selected for use in PolyScore and appear to resemble more closely those that polygraph examiners attempt to identify in practice than do the vast majority of features incorporated into PolyScore feature selection spaces. In numerical scoring of polygraph charts, examiners typically combine upper and lower respiration scores into one score as well. Respiration line length is a more sophisticated measurement, however, which an examiner cannot easily calculate from the paper chart.
Feature Standardization
To score a polygraph exam, one needs to be able to compare the examinee’s responses on relevant questions to those on the control questions. These comparisons need to be done for one person, but the statistical models also need to be able to account for between-subject variability. Both algorithms attempt to standardize the extracted features for relevant and control questions, thereby calibrating all subjects to the same scale (Olsen et al., 1997), but they do not do it quite the same way.
PolyScore standardizes relevant responses from subject i’ to the control responses from subject i’ as follows:
(4)
(5)
where R_{i} is the ith relevant question feature, C_{i} is the ith control question feature, µ_{C} is the mean of the control features, µ_{R} is the mean of the relevant features, and s_{CR} is the pooled standard deviation, all determined within subject i’.
Unlike traditional manual scoring, where each relevant question is compared to its “closest” control question, PolyScore computes the 80th percentile of each relevant standardized feature, thus reducing the information from an entire examination to a single value for each feature.
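Because the exact forms of equations (4) and (5) are not reproduced here, the following Python sketch rests on an assumed form: each relevant feature is centered on the control mean and divided by a pooled standard deviation computed from the relevant and control deviations about their own means. The pooled-variance denominator, the percentile interpolation rule, the function names, and the data are all assumptions for illustration and should not be read as the PolyScore implementation.

```python
import math

def pooled_sd(relevant, control):
    """Pooled within-subject SD (assumed form, n_R + n_C - 2 denominator)."""
    mu_r = sum(relevant) / len(relevant)
    mu_c = sum(control) / len(control)
    ss = (sum((r - mu_r) ** 2 for r in relevant)
          + sum((c - mu_c) ** 2 for c in control))
    return math.sqrt(ss / (len(relevant) + len(control) - 2))

def standardize_relevant(relevant, control):
    """Center relevant features on the control mean, scale by the pooled SD."""
    mu_c = sum(control) / len(control)
    s_cr = pooled_sd(relevant, control)
    return [(r - mu_c) / s_cr for r in relevant]

def percentile(values, q):
    """Percentile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

# Made-up feature values for one subject.
relevant = [3.0, 5.0, 7.0]
control = [1.0, 2.0, 3.0]
z_relevant = standardize_relevant(relevant, control)
summary = percentile(z_relevant, 80)  # one value per feature per exam
```

Taking a high percentile rather than the mean emphasizes the strongest relevant responses while remaining less sensitive than the maximum to a single extreme value.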
CPS calculates a standardized response, a z-score, for each relevant and comparison question by subtracting the common within-subject mean from the calculated response and dividing by the common within-subject standard deviation. Podlesny and Kircher (1999) claim that the difference between the PolyScore and CPS methods of computing standard errors is small and not significant. If there are three relevant and three control questions per chart, then the common mean and standard deviation are calculated using all repeated measurements (typically 18 if there are three charts). CPS uses the z-score for multiple comparisons. Each standardized relevant question is compared with the averaged standardized control questions across all charts for a particular measure. These values are used to assess the strength of the different responses on the different relevant questions. However, CPS uses the difference of the averaged standardized control and averaged standardized relevant responses for its discriminant analysis.
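The within-subject standardization described for CPS can be sketched as follows in Python. This is a reading of the verbal description, not the CPS code: the use of the population standard deviation, the function name, and the example values are assumptions.

```python
import statistics

def cps_style_difference(relevant, comparison):
    """relevant/comparison: one feature's responses pooled over all charts
    for a single subject (typically 9 + 9 = 18 values for three charts)."""
    pooled = list(relevant) + list(comparison)
    mean = statistics.mean(pooled)
    sd = statistics.pstdev(pooled)  # population SD is an assumption
    z_rel = [(r - mean) / sd for r in relevant]
    z_cmp = [(c - mean) / sd for c in comparison]
    # Averaged standardized comparison minus averaged standardized relevant.
    return statistics.mean(z_cmp) - statistics.mean(z_rel)

# Made-up responses: stronger reactions to the relevant questions.
diff = cps_style_difference([3.0, 3.0, 3.0], [1.0, 1.0, 1.0])
```

With these made-up values the difference is negative, indicating larger responses to relevant than to comparison questions; a difference of this kind, computed for each feature, would then enter the discriminant function.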
Both algorithms combine the data from all three charts. In field uses of automated algorithms, standardization and comparison across charts for an individual and across individuals are problematic because the questions can be semantically different. For example, for the same person, the first relevant question on the first chart may not be the same as the first relevant question on the third chart, since the question sequence may vary across charts. Laboratory experiments typically eliminate this problem: they ask the same number and type of questions in the same sequence, repeated three times for all the subjects. This is not the case in actual specific-incident polygraphs using the MGQT or ZCT test formats. The Test of Espionage and Sabotage (TES) is more standardized in this respect and hence more suitable for statistical analysis that accounts for within- and between-subject variability. Our preliminary analyses of a set of polygraph tests from widely varying criminal cases suggest that similar features work for each chart, that the first chart alone is a relatively good but far from perfect discriminator, and that the information from the following charts improves the classification of nondeceptive people.
Statistical Analysis
Statistical analysis involves feature evaluation and selection in the context of specific forms of scoring and methods of translating scores into an actual classification rule. The latter problem is the focus of much discussion elsewhere in this report. This section reviews aspects of feature selection and other aspects of statistical modeling involving the development of scoring rules.
While the availability of the digitized signal and computerized analyses creates a large number of possible features, this does not solve the problem of discovering all the variables actually relevant to distinguishing between deception and nondeception, nor does it answer the question of how they are related to one another. The statistical classification modeling problem involves extracting a subset of relevant features that can be used to minimize some function of the two types of classification error, false positives and false negatives, when applied to inputs more general than the training dataset from which the features are selected.
Feature Selection
If the feature space is initially small, some analysts believe that the surest method of finding the best subset of features is an exhaustive search of all possible subsets. Ideally, for each subset, one designs a classifier, tests the resulting model on independent data, and estimates its associated error rates. One can then choose the model with the smallest combination of error rates. While this strategy may be feasible when the number of features is small, even the preliminary list of 12 features used in the development of the CPS algorithm poses problems. According to Kircher and Raskin (2002), they performed all-possible-subsets regression analysis, but they do not provide details on possible transformations considered or how they did cross-validation.
When the number of features is larger, the exhaustive approach is clearly not feasible. If one has a small training set of test data (and repeatedly uses the same test data), one can obtain features that are well suited for that particular training or test data but that do not constitute the best feature set in general. One also needs to be careful about the number of selected features. The larger the number of features or variables, the more likely they will overfit the particular training data and perform poorly on new data. The statistical and data-mining literatures are rife with descriptions of stepwise and other feature selection procedures (e.g., forward selection and backward elimination), but the multiplicity of models to be considered grows as one considers transformations of features (every transformation is like another feature) and interactions among features. All of these aspects are intertwined: the methodological literature fails to provide a simple and unique way to achieve the empirical objective of identifying a subset of features, in the context of a specific scoring model, that behaves well when used on a new data set. What most statisticians argue is that fewer relevant variables do better on cross-validation, but even this claim comes under challenge by those who argue for model-free, black-box approaches to prediction models (e.g., see Breiman, 2001). For the polygraph, the number of cases used to develop and test models for the algorithms under review was sufficiently small that the apparent advantages of these data-mining approaches are difficult to realize.
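The scale of the search problem is easy to check with a few lines of arithmetic. The sketch below only counts candidate models; the helper names are hypothetical, and the subset size of five used for the larger feature space is an illustrative choice rather than a figure taken from either algorithm's documentation.

```python
from math import comb


def count_subsets(n_features):
    """Number of non-empty feature subsets an exhaustive search must fit."""
    return 2 ** n_features - 1


def count_subsets_of_size(n_features, k):
    """Number of subsets containing exactly k of the n features."""
    return comb(n_features, k)


print(count_subsets(12))             # 4095 candidate models from 12 features
print(count_subsets_of_size(12, 5))  # 792 of them use exactly five features
# Five-feature models drawn from a 10,000-feature space:
print(f"{count_subsets_of_size(10_000, 5):.2e}")
```

Each candidate must also be refit once per cross-validation fold, and every transformation or interaction that is considered acts as one more feature, so these counts are lower bounds on the work and on the number of chances to find a model that fits the training cases by accident.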
For the development of PolyScore, JHUAPL’s primary method of feature selection was a linear logistic regression model in which “statistical significance” of the features was a primary aspect of the selection process. Harris (personal communication) claims that he and his colleagues primarily chose those features with a higher occurrence rate across different iterations of model fitting (e.g., galvanic skin response). We were unable to determine the detailed algorithmic differences between the 3.0 and 5.1 logistic regression versions of PolyScore. For version 5.1, JHUAPL extracted a set of features from its feature space of 10,000 based on statistical significance and then checked their ability to classify by applying the estimated model to a random holdout test set involving 25 percent of the 1,488 cases in its database. This procedure yielded several good models with varying numbers of features, some subsets of others, some overfitting, and some underfitting the data. Ultimately, JHUAPL claims to have chosen a model based on overall performance and not on the individual features themselves. There are natural concerns about claims for model selection and specification from 10,000 features using a database of only 1,488 cases, concerns that are only partially addressed by the random holdout validation strategy used by JHUAPL.
None of the JHUAPL claims or statements has been directly verifiable because JHUAPL refused to make any details or documentation available to the committee, including the variables it ultimately chose for its algorithm. The only way one could evaluate the performance of the algorithm is to apply it to a fresh set of data not used in any way in the model development and validation process and for which truth regarding deception is available from independent information.
Further Details on Statistical Modeling
In polygraph testing, the ultimate goal of classification is to assign individuals (cases) to classes in a way that minimizes the classification error (i.e., some combination of false positives and false negatives). As we noted above, CPS uses discriminant function analysis and PolyScore has algorithms based on logistic regression and neural networks.
PolyScore’s logistic regression procedure can be thought of as having two parts (although the two are actually intertwined). First, the score is calculated as a linear combination of weighted features using maximum likelihood estimation, for example:
(6)
Table F1 reports the values of the estimated logistic regression coefficients, or weights, for the five features presented by Harris et al. (1994). A positive sign for a weight indicates an increase in the probability of deception, while a negative sign denotes a decrease. The absolute value of a weight suggests something about the strength of the linear association with deception. These results agree with the general results for CPS, whose developers also report that skin conductance is the strongest measure and assign it the most weight, and that the respiration measure is negatively correlated with deception.
TABLE F1 Features Implemented in Version 3.0 of PolyScore with Their Estimated Coefficients

Feature                                           Weight
x_{1}  GSR Range                                  +5.5095
x_{2}  Blood Volume Derivative 75th Percentile    +3.0643
x_{3}  Upper Respiration 80th Percentile          –2.5954
x_{4}  Pulse Line Length                          –2.0866
x_{5}  Pulse 55th Percentile                      +2.1633

Second, one can estimate the probability of deception from the logistic regression:

$$P(\text{deception}) = \frac{1}{1 + e^{-\text{Score}}} \qquad (7)$$

and then choose the cutoffs for the estimated probabilities (7), with values above the upper cutoff being labeled as deceptive and those below the lower cutoff as nondeceptive. The currently used cutoffs are 0.95 and 0.05, respectively. Different methods can be used to produce the scoring equation (6), and there is a lack of clarity as to precisely what method was used for the final PolyScore algorithm.
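The two-part procedure can be illustrated with the Table F1 weights. This is a sketch under stated assumptions, not the PolyScore implementation: the intercept is unknown and is set to zero by default, the feature values are hypothetical standardized inputs, and the label "inconclusive" for probabilities between the two cutoffs is a naming choice made here.

```python
import math

# Weights from Table F1 (PolyScore 3.0, as presented by Harris et al., 1994).
WEIGHTS = {
    "gsr_range": 5.5095,
    "blood_volume_derivative_p75": 3.0643,
    "upper_respiration_p80": -2.5954,
    "pulse_line_length": -2.0866,
    "pulse_p55": 2.1633,
}


def logistic_score(features, intercept=0.0):
    """Linear combination of weighted features (intercept is an assumption)."""
    return intercept + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def probability_of_deception(score):
    """Logistic transform of the score into an estimated probability."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(p, upper=0.95, lower=0.05):
    """Apply the 0.95 / 0.05 cutoffs quoted in the text."""
    if p > upper:
        return "deceptive"
    if p < lower:
        return "nondeceptive"
    return "inconclusive"


# Hypothetical standardized feature values for one examination.
exam = {
    "gsr_range": 0.9,
    "blood_volume_derivative_p75": 0.2,
    "upper_respiration_p80": 0.4,
    "pulse_line_length": 0.1,
    "pulse_p55": -0.3,
}
p = probability_of_deception(logistic_score(exam))
print(round(p, 3), classify(p))
```

With cutoffs this far apart, a score must exceed roughly 2.94 in absolute value (the log-odds of 0.95) before the examination is classified at all, which is one reason a sizable share of examinations can end up without a decision.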
The CPS algorithm relies on the result of a multivariate discriminant analysis, which is known to be less robust than logistic regression with respect to departures from assumptions and which gives more weight to extreme cases in building a classifier. Kircher and Raskin (1988) report that they used all-possible-subsets regression analysis on the 12 feature differences of scores to choose the best model and retained the five features listed in Table F2. However, Kircher and Raskin’s (2002) most recent model relies on only three features: skin conductance amplitude, the amplitude of increases in the baseline of the cardiograph, and the respiration length.
TABLE F2 Features Implemented in CPS (reported by Kircher and Raskin, 1988) and Their Estimated Coefficients

Feature                         Weight
x_{1}  SC Amplitude             +0.77
x_{2}  SC Full Recovery Time    +0.27
x_{3}  EBF                      +0.28
x_{4}  BP Amplitude             +0.22
x_{5}  Respiration Length       –0.40

Kircher and Raskin’s discriminant analysis provided “optimal” maximum likelihood weights for these variables to be used in a classification equation of the form (6) to produce a score for each subject in the two groups. Note that these coefficients are essentially on a different scale than those of the PolyScore logistic regression model. They need to be converted into estimates for the probabilities of observing the scores given deception and nondeception by means of the normal probability density function. Kircher and Raskin allow these probability functions to have different variances:
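$$P(\text{Score} \mid ND) = \frac{1}{\sigma_{ND}\sqrt{2\pi}} \exp\left[-\frac{(\text{Score} - \mu_{ND})^{2}}{2\sigma_{ND}^{2}}\right] \qquad (8)$$

$$P(\text{Score} \mid D) = \frac{1}{\sigma_{D}\sqrt{2\pi}} \exp\left[-\frac{(\text{Score} - \mu_{D})^{2}}{2\sigma_{D}^{2}}\right] \qquad (9)$$

where µ_{ND} and σ_{ND} are the estimates of the mean and standard deviation, respectively, of the discriminant scores from the nondeceptive subjects, and µ_{D} and σ_{D} are the estimates of the mean and standard deviation, respectively, of the discriminant scores from the deceptive subjects.^{3} Finally, one can convert these estimated values into estimated probabilities of deception through Bayes’ theorem:

$$P(ND \mid \text{Score}) = \frac{P(\text{Score} \mid ND)\,P(ND)}{P(\text{Score} \mid ND)\,P(ND) + P(\text{Score} \mid D)\,P(D)} \qquad (10)$$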
where P(ND) and P(D) are the prior probabilities of being nondeceptive (ND) and deceptive (D), respectively. Kircher and Raskin take these prior probabilities to be equal to 0.5. Despite the use of Bayes’ theorem in this final step, this is not a proper Bayesian approach to producing a classification rule.
Kircher and Raskin (1988) report that if P(ND | Score) based on three charts is greater than 0.70, they classify that person as nondeceptive, and if P(ND | Score) is less than 0.30, the person is classified as deceptive. For those whose estimated probability is between these two cutoff points, they calculate a new discriminant score based on five charts and then recalculate P(ND | Score) and use the same cutoff points. At that point, they label the test for subjects whose scores fall between 0.30 and 0.70 as inconclusive.
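The conversion from a discriminant score to a decision can be sketched as follows. The group means and standard deviations below are hypothetical placeholders (the published estimates are not given in this appendix), as are the function names and the direction of the score scale; the normal densities with unequal variances, the prior of 0.5, and the 0.70 and 0.30 cutoffs follow the description in the text.

```python
import math


def normal_pdf(x, mean, sd):
    """Normal probability density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def prob_nondeceptive(score, mu_nd, sd_nd, mu_d, sd_d, prior_nd=0.5):
    """Posterior probability of nondeception given a score, via Bayes' theorem."""
    weight_nd = normal_pdf(score, mu_nd, sd_nd) * prior_nd
    weight_d = normal_pdf(score, mu_d, sd_d) * (1.0 - prior_nd)
    return weight_nd / (weight_nd + weight_d)


def cps_decision(p_nd, upper=0.70, lower=0.30):
    """Cutoffs reported by Kircher and Raskin (1988)."""
    if p_nd > upper:
        return "nondeceptive"
    if p_nd < lower:
        return "deceptive"
    return "undecided"  # rescored on five charts, then inconclusive


# Hypothetical group parameters, for illustration only.
params = dict(mu_nd=1.0, sd_nd=1.0, mu_d=-1.0, sd_d=1.5)

for score in (-2.0, 0.0, 2.0):
    p = prob_nondeceptive(score, **params)
    print(score, round(p, 3), cps_decision(p))

# The same score under a much lower prior probability of nondeception.
print(round(prob_nondeceptive(2.0, prior_nd=0.1, **params), 3))
```

Passing the prior as a parameter makes the dependence on base rates explicit: the same discriminant score yields a different posterior probability once the assumed proportion of nondeceptive examinees moves away from one-half.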
Both PolyScore and CPS seem to rely on the presumption of equal base rates for deceptive and nondeceptive cases, and they have been “evaluated” on databases with roughly equal-sized groups. The performance of the algorithms in new instances or with differently structured “populations” of examinees is conjectural, and appropriate prior probabilities and operational cutoff points for algorithms for use in security screening are unclear.
Algorithm Evaluation
We lack detailed information from the developers on independent evaluations of the PolyScore and CPS algorithms. We do have limited information on a type of cross-validation and a jackknife procedure to evaluate PolyScore^{®} 3.0, neither of which provides a truly independent assessment of algorithm performance in light of the repeated reanalyses of the same limited sets of cases.
Kircher and Raskin (2002) report the results of eight selected studies of the CPS algorithm, none involving more than 100 cases, and most of which are deeply flawed according to the criteria articulated in Chapter 4. Moreover, only one of the two field studies described includes comparative data for deceptive and nondeceptive individuals. They report false negative rates ranging from 0 to 14 percent, based on exclusion of inconclusives. If inconclusives are included as errors, the false negative rates range from 10 to 36 percent. Similarly, they report false positive rates ranging from 0 to 19 percent, based on exclusion of inconclusives. If inconclusives are included in the calculation of error rates, as for example in the calculation of ROC (receiver operating characteristic) curves, then the false positive rates range from 8 to 37 percent. It would be a mistake to treat these values as illustrative of the validity of the CPS computer scoring algorithm. Kircher and Raskin also list a ninth study (Dollins, Krapohl, and Dutton, 2000) that, as best we have been able to determine, is the only one that attempts independent algorithm evaluation. The values for false positive and false negative error rates that it reports appear to be highly exaggerated, however, because of the selection bias associated with the cases used.
Dollins and colleagues (Dollins, Krapohl, and Dutton, 2000) compared the performance of five different computer-based classification algorithms in late 1997: CPS, PolyScore, AXCON, Chart Analysis, and Identifi. Each developer was sent a set of 97 charts collected with Axciton instruments for “confirmed” criminal cases and used the versions of their software available at the time. Test formats included both ZCT and MGQT. None of the developers at the time of scoring knew the truth, confirmed by a confession or from indisputable corroborating evidence. An examination was labeled as nondeceptive if someone else confessed to the crime. The data contained 56 deceptive and 41 nondeceptive cases and came from a mix of federal and nonfederal agencies. All of the computer programs were able to read the Axciton proprietary format except the CPS program, and Axciton Systems, Inc., provided the CPS developers with a text-formatted version of the data (see below).
Dollins and associates (Dollins, Krapohl, and Dutton, 2000) report that there were no statistically significant differences in the classification powers of the algorithms. All programs agreed in correctly classifying 36 deceptive and 16 nondeceptive cases. All programs also incorrectly classified the same three nondeceptive cases, but there was not a single case that all algorithms scored as inconclusive. CPS had the greatest number of inconclusive cases and the least difference between the false positive and false negative rates. The other four algorithms all showed tendencies toward misclassifying a greater number of innocent subjects. The results, summarized in Table F3, show false negative rates ranging from 10 to 27 percent and false positive rates of 31 to 46 percent (if inconclusives are included as incorrect decisions).
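The error rates quoted above can be recomputed directly from the counts in Table F3. The sketch below is a plain tabulation; the dictionary layout and function name are choices made here, while the counts are those reported by Dollins, Krapohl, and Dutton (2000).

```python
# (correct, incorrect, inconclusive) decisions from Table F3.
TABLE_F3 = {
    "CPS":            {"deceptive": (41, 4, 11), "nondeceptive": (28, 3, 10)},
    "PolyScore":      {"deceptive": (49, 1, 6),  "nondeceptive": (26, 7, 8)},
    "AXCON":          {"deceptive": (50, 1, 5),  "nondeceptive": (24, 9, 8)},
    "Chart Analysis": {"deceptive": (49, 2, 5),  "nondeceptive": (22, 8, 11)},
    "Identifi":       {"deceptive": (49, 1, 6),  "nondeceptive": (22, 8, 11)},
}


def error_rate(counts, include_inconclusive=True):
    """Share of wrong decisions, with inconclusives counted as errors or dropped."""
    correct, incorrect, inconclusive = counts
    if include_inconclusive:
        return (incorrect + inconclusive) / (correct + incorrect + inconclusive)
    return incorrect / (correct + incorrect)


for name, row in TABLE_F3.items():
    fn_rate = error_rate(row["deceptive"])     # deceptive cases not detected
    fp_rate = error_rate(row["nondeceptive"])  # nondeceptive cases not cleared
    print(f"{name:15s} false negative {fn_rate:6.1%}   false positive {fp_rate:6.1%}")
```

Calling `error_rate(..., include_inconclusive=False)` on the same rows gives the much lower rates obtained when inconclusive outcomes are excluded, which shows how strongly the reported accuracy depends on that convention.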
TABLE F3 Number of Correct, Incorrect, and Inconclusive Decisions by Subject’s Truth

                  Deceptive (n = 56)                     Nondeceptive (n = 41)
Algorithm         Correct   Incorrect   Inconclusive     Correct   Incorrect   Inconclusive
CPS               41        4           11               28        3           10
PolyScore         49        1           6                26        7           8
AXCON             50        1           5                24        9           8
Chart Analysis    49        2           5                22        8           11
Identifi          49        1           6                22        8           11

SOURCE: Dollins, Krapohl, and Dutton (2000:239).

As Dollins and colleagues (Dollins, Krapohl, and Dutton, 2000) point out, there are a number of problems with their study. The most obvious is a sampling or selection bias associated with the cases chosen for evaluation. The data were submitted by various federal and nonfederal agencies to the DoDPI, and most of these cases were correctly classified by the original examiner and are supported by confessions. This database is therefore not representative of any standard population of interest. If the analyzed cases correspond to the easily classifiable tests, as one might hypothesize given that they were “correctly” classified by the original examiner, then one should expect all algorithms to do better on the test cases than in uncontrolled settings. Because all algorithms produce relatively high rates of inconclusive tests even in such favorable circumstances, performance with more difficult cases is likely to degrade. There was no control over the procedures that the algorithm developers used to classify these cases, and they might have used additional editing and manual examination of the data, as well as modifications to the software for classification cutoffs. The instrumentation used was also a possible problem in this study, particularly for the CPS algorithm. Data were collected with the Axciton instrument, which records a hybrid of skin conductance and skin resistance. The CPS algorithm relies on true skin conductance and on data recorded with the Stoelting instrument. The CPS algorithm was unable to process the Axciton proprietary data and was provided with the text format, in which there was also a possibility of error in rounding the onsets of the questions, with a further negative effect on CPS performance. The other algorithms performed very similarly, which is not surprising because they were developed on data collected with Axciton instruments and in most cases with very similar databases.
IMPLICATIONS FOR TES
JHUAPL is currently working on a beta-test version of PolyScore 5.2 that has prototype algorithms for scoring screening test formats such as TES and relevant/irrelevant formats. The current version of the TES-format algorithm uses the same features as the ZCT/MGQT-format algorithm, but this may change. Polygraph examiners review each chart in a TES separately; PolyScore analyzes them together. We are not aware of other scoring algorithms for the TES format.
Table F4 reports very preliminary results of the TES algorithm provided to us by JHUAPL. The current difficulty in developing this algorithm is the overall small number of deceptive cases. As a result, the developers are giving up the power to detect (that is, keeping the sensitivity of the test at lower levels) in order to keep the false positive rates lower, in effect changing the base rate assumptions. These data indicate that sensitivity of 70 percent may be attained in conjunction with 99 percent specificity (1 percent false positive rate). JHUAPL believes these numbers can be improved. Of about 2,100 cases, one-third have been used strictly for training, one-third for training and testing, and one-third have been withheld for independent validation, a step that has not yet occurred. A major problem with this database is independent determination of truth.

TABLE F4 Preliminary TES Results

Type of Analysis   Total Number   Inc   Corr   TN    FP   FN   TP
Binary^{a}         716            0     707    692   4    5    15
Ternary            524            192   520    510   3    1    10

NOTES: Inc, inconclusive; Corr, correct; TN, true negative; FP, false positive; FN, false negative; TP, true positive.
^{a}Inconclusives forced to deceptive, nondeceptive.
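For readers who want to work with the counts in Table F4, the short sketch below derives the usual summary rates from them. It is only arithmetic on the published counts; the function and variable names are choices made here, and the ternary rates are computed over conclusive decisions only.

```python
def sensitivity(tp, fn):
    """Share of deceptive cases classified as deceptive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of nondeceptive cases classified as nondeceptive."""
    return tn / (tn + fp)


def positive_predictive_value(tp, fp):
    """Share of 'deceptive' calls that are actually deceptive."""
    return tp / (tp + fp)


# Counts from Table F4.
binary = {"tn": 692, "fp": 4, "fn": 5, "tp": 15}
ternary = {"tn": 510, "fp": 3, "fn": 1, "tp": 10}  # 192 inconclusives excluded

for label, c in (("binary", binary), ("ternary", ternary)):
    print(
        f"{label:8s}"
        f" sensitivity {sensitivity(c['tp'], c['fn']):.3f}"
        f" specificity {specificity(c['tn'], c['fp']):.3f}"
        f" PPV {positive_predictive_value(c['tp'], c['fp']):.3f}"
    )

# Only 20 of the 716 binary cases are deceptive, so every rate that
# depends on deceptive cases rests on a very small count.
print(binary["tp"] + binary["fn"], sum(binary.values()))
```

Because so few deceptive cases are available, moving a single case between the true positive and false negative cells changes the binary sensitivity by five percentage points, which is one way to see why these results are described as very preliminary.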
SUMMARY
The PolyScore and CPS computerized scoring algorithms take the digitized polygraph signals as inputs and produce estimated probabilities of deception as outputs. They both assume, a priori, equal probabilities of being truthful and deceptive. PolyScore was developed on real criminal cases, and the Computer Assisted Polygraph System (CAPS) (the precursor to CPS) was developed on mock crimes. Truth for CAPS came solely from independent blind evaluations, while PolyScore relied on a mix of blind evaluations and confessions. The more recent CPS versions seem to rely on actual criminal cases as well, although we have no details.
Both algorithms do some initial data transformation of the raw signals. CPS keeps these to a minimum and tries to retain as much of the raw signal as possible. PolyScore uses more initial data editing tools such as detrending, filtering, and baselining. PolyScore and CPS standardize signals, using different procedures and on different levels. They extract different features, and they seem to use different criteria to find where the maximal amounts of discriminatory information lie. Both, however, give the most weight to the electrodermal channel.
PolyScore combines all three charts into one single examination record and considers reactivities across all possible pairs of control and relevant questions. CAPS compares adjacent control and relevant questions, as is done in manual scoring, but it also uses the difference of averaged standardized responses on the control and relevant questions to discriminate between guilty and nonguilty people. CPS does not have an automatic procedure for the detection of artifacts, but it allows examiners to edit the charts themselves before the algorithm calculates the probability of truthfulness. PolyScore has algorithms for artifact and outlier detection and removal, but JHUAPL treats the specific details as proprietary and will not reveal them. While PolyScore uses logistic regression or neural networks to estimate the probability of deception from an examination, CPS uses standard discriminant analysis and a naïve Bayesian probability calculation to estimate the probability of deception.^{4}
Overall, PolyScore claims to do as well as experienced examiners on detecting deceptives and better on detecting truthful subjects. CPS claims to perform as well as experienced evaluators and equally well on detection of both deceptive and nondeceptive people. Computerized systems clearly have the potential to reduce the variability that comes from bias and inexperience of examiners and chart interpreters, but the evidence that they have achieved this potential is meager. Porges and colleagues (1996) evaluated PolyScore and critiqued the methodology it used as unscientific and flawed. Notwithstanding the adversarial tone taken by Porges and colleagues, many of the flaws they identified apply equally to CPS, such as the lack of adequate evaluation.^{5}
Dollins and associates (Dollins, Krapohl, and Dutton, 2000) compared the performance of these two algorithms with three other algorithms on an independent set of 97 selected confirmed criminal cases. CPS performed equally well on detection of both innocent and guilty subjects, while the other algorithms were better at detecting deceptive examinees than clearing nondeceptive ones. Unfortunately, the method of selecting these cases makes it difficult to interpret the reported rates of misclassification.
One could argue that computerized algorithms should be able to analyze the data better than human scorers because they incorporate potentially useful analytic steps that are difficult even for trained human scorers to perform (e.g., filtering and other transformations, calculation of signal derivatives), look at more information, and do not restrict comparisons to adjacent questions. Moreover, computer systems never get careless or tired. The success of both numerical and computerized systems, however, still depends heavily on the pretest phase of the examination. How well examiners formulate the questions inevitably affects the quality of information recorded.
PolyScore is currently working on algorithms for scoring the screening data coming from TES and relevant/irrelevant tests. An a priori base rate might be introduced in these algorithms to increase accuracy and to account for the low number of deceptive cases.
There has yet to be a proper independent evaluation of computer scoring algorithms on a suitably selected set of cases, for either specific incidents or security screening, which would allow one to accurately assess the validity and accuracy of these algorithms.
NOTES
1. Some computerized systems store biographical information such as examinee’s name, social security number, age, sex, education, ethnicity, marital status, subject’s health, use of drugs, alcohol, and prior polygraph history (e.g., see www.stoelting.com), but it is unclear how this type of information would be appropriately used to improve the diagnostic accuracy of a computer scoring system.
2. Matte (1996) and Kircher and Raskin (2002) provide more details on the actual polygraph instruments and hardware issues and some of the history of the development of computerized algorithms.
3. Under the assumption of unequal variance for the two groups, which Kircher and
REFERENCES
Breiman, L. 2001 Statistical modeling: The two cultures (with discussion). Statistical Science 16:199–231.
Brownley, K.A., B.E. Hurwitz, and N. Schneiderman 2000 Cardiovascular psychophysiology. Chapter 9, pp. 224–264, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.
Copas, J.B., and P. Corbett 2002 Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika 89:315–331.
Dawson, M., A.M. Schell, and D.L. Filion 2000 The electrodermal system. Chapter 8, pp. 200–223, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.
Dollins, A.B., D.J. Krapohl, and D.W. Dutton 2000 A comparison of computer programs designed to evaluate psychophysiological detection of deception examinations: Bakeoff 1. Polygraph 29(3):237–257.
Gratton, G. 2000 Biosignal processing. Chapter 33, pp. 900–923, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.
Harris, J. 1996 Real Crime Validation of the PolyScore® 3.0 Zone Comparison Scoring Algorithm. Unpublished paper. The Johns Hopkins University Applied Physics Laboratory.
Harris, J., et al. 1994 Polygraph Automated Scoring System. U.S. Patent Document. Patent Number: 5,327,899.
Harver, A., and T.S. Lorig 2000 Respiration. Chapter 10, pp. 265–293, in Handbook of Psychophysiology, 2nd ed., J.T. Cacioppo, L.G. Tassinary, and G.G. Berntson, eds. New York: Cambridge University Press.
Hastie, T., R. Tibshirani, and J. Friedman 2001 The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag.
Kircher, J.C., and D.C. Raskin 1988 Human versus computerized evaluations of polygraph data in a laboratory setting. Journal of Applied Psychology 73:291–302.
2002 Computer methods for the psychophysiological detection of deception. Chapter 11, pp. 287–326, in Handbook of Polygraph Testing, M. Kleiner, ed. London: Academic Press.
Matte, J.A. 1996 Forensic Psychophysiology Using Polygraph–Scientific Truth Verification Lie Detection. Williamsville, NY: J.A.M. Publications.
Olsen, D.E., J.C. Harris, M.H. Capps, and N. Ansley 1997 Computerized polygraph scoring system. Journal of Forensic Sciences 42(1):61–71.
Podlesny, J.A., and J.C. Kircher 1999 The Finapres (volume clamp) recording method in psychophysiological detection of deception examinations: Experimental comparison with the cardiograph method. Forensic Science Communications 1(3):1–17.
Porges, S.W., R.A. Johnson, J.C. Kircher, and R.A. Stern 1996 Unpublished Report of Peer Review of Johns Hopkins University/Applied Physics Laboratory to the Central Intelligence Agency.