Appendix H: Data Mining and Information Fusion
Pages 185-217

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 185...
... While technical and procedural measures offer some opportunities for reducing the negative impacts, there is a real tension between the use of data mining for this purpose and the resulting impact on personal privacy, as well as other consequences from false positive identification. These privacy implications are primarily addressed in other parts of this report.
From page 186...
... These include technologies for dealing with various nonstandard data structures, including representing networks between units of interest and tools for handling the newer forms of information touched on above. A question not addressed here -- but of considerable importance and a difficult challenge for the agencies responsible for counterterrorism in the United States -- is how best to represent massive amounts of very disparate kinds of data in linked databases so that all relevant data elements that relate to a specific query can be easily and simultaneously accessed, contrasted, and compared.
From page 187...
... • In the area of Internet search, data mining tools have been used to improve search tools that assist in locating items of interest based on a user profile. Under their broadest definitions, data mining techniques include a ...
From page 188...
... Pattern recognition refers to a class of data mining approaches that are often applied to sensor data, such as digital photographs, radiological images, sonar data, etc. Finally, data and information fusion are data mining methods that combine information from disparate sources (often so much so that it is difficult to define a formal probabilistic model to assist in summarizing the information)
From page 189...
... The goal here is not to provide a comprehensive list of the issues that arise in these efforts, but simply to mention some of the common hurdles that arise prior to the use of data mining techniques so that the entire process is better understood.
From page 190...
... • Appropriate database structure. The use of appropriate database management tools can greatly expedite various data mining methods.
From page 191...
... Many data mining techniques either require or greatly benefit from the use of data sets with no missing values. To create a data file with the missing values filled in, imputation techniques are used, which collectively provide the resulting database with reasonable properties, with the assumption that the missing data are missing at random.
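As a minimal sketch of the idea of filling in missing values, the snippet below applies mean imputation, one of the simplest imputation techniques. It is not taken from the report; the function name `impute_mean` and the sample values are assumptions for illustration only. Like the text, it is reasonable only if the data are missing at random.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries.

    Hypothetical helper for illustration; real imputation methods are
    usually more elaborate (e.g., model-based or multiple imputation).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Illustrative data only: two of five entries are missing.
ages = [34, None, 29, 41, None]
print(impute_mean(ages))
```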
From page 192...
... For example, data on current news magazine subscriptions might be extremely accurate, but they might also provide little help in discriminating those engaged in terrorist activities. H.3 SUBJECT-BASED DATA MINING AS AN EXTENSION OF STANDARD INVESTIGATIVE TECHNIQUES This appendix primarily concerns the extent to which state-of-the-art data mining techniques, by combining information in relatively sophisticated ways, may be capable of helping police and intelligence officers reduce the threat from terrorism.
From page 193...
... H.4 PATTERN-BASED DATA MINING TECHNIQUES AS ILLUSTRATIONS OF MORE SOPHISTICATED APPROACHES Originating in various subdisciplines of computer science, statistics, and operations research, a class of relevant data mining techniques for counterterrorist application includes (1) those that might be used to identify combinations of variables that are associated with terrorist activities and (2)
From page 194...
... Here, supervised learning techniques can provide an improvement over rule-based expert systems by making use of feedback loops using training sets to refine algorithms through continued use and evaluation. Machines that use various types of sensing to "look" inside baggage for weapons and explosives can be trained over time to discriminate between suspicious bags and nonsuspicious ones.
From page 195...
... A more recent class of data mining techniques, which are still under development, uses relational databases as input. Relational databases represent linkages between units of analysis, and in a counterterrorism context the key example is social networks. Social networks are people who regularly communicate with each other, for example, by telephone or e-mail, and who might be acting in concert.
From page 196...
... Without more empirical experience, it is difficult to make strong assertions, but some things are relatively clear. When training sets are available, as in the case of baggage inspection, pattern-based data mining techniques are very likely to provide substantial benefits.
From page 197...
... As an illustration, consider a suite of data mining tools that facilitates the detection of aliases, record linkages concerning a given individual and his or her network of associates, identification of clusters of related events by certain patterns of interest, and indexed audio/images/video from surveillance monitors. Add to this suite data mining tools that performed as well as a very good analyst in identifying patterns of interest but did so more quickly.
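One ingredient of alias detection and record linkage is approximate string matching. The sketch below uses Python's standard `difflib.SequenceMatcher` to score how similar two names are. The function name `name_similarity`, the sample names, and the 0.9 threshold are assumptions for illustration, not a description of any tool discussed in the report.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Return a similarity score in [0, 1] for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical threshold; a real system would tune this against reviewed cases.
THRESHOLD = 0.9

pairs = [("Jon Smith", "John Smith"), ("Jon Smith", "Maria Lopez")]
for left, right in pairs:
    score = name_similarity(left, right)
    label = "possible match" if score >= THRESHOLD else "no match"
    print(f"{left!r} vs {right!r}: {score:.2f} ({label})")
```

A high score only suggests that two records may refer to the same person; it still requires review by an analyst.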
From page 198...
... Resistance to gaming indicates whether an adversary can take countermeasures to reduce the effectiveness of the method. H.5 THE EVALUATION OF DATA MINING TECHNIQUES It is crucially important that analysts planning to use a data mining algorithm for counterterrorism have some objective understanding of its performance, both prior to use and continually updated while in use.
From page 199...
... Also, since the development of procedures used to discriminate between two populations is greatly facilitated when there are substantial numbers of both types represented in the training set, the rarity of terrorist events, and more broadly the rarity of people of interest, complicates both the development and the evaluation of data mining techniques for counterterrorism. Even if a procedure could be evaluated on a current training set, there is always the possibility that terrorists could adjust (game)
From page 200...
... H.5.2 Evaluation Considerations Some progress in the evaluation of data mining techniques for counterterrorism can be made without the use of training sets. In the dichotomous supervised learning case, in which one is using data mining to discriminate between terrorist activities and nonterrorist activities, two types of errors that can be made are false positives and false negatives.
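The two error types can be made concrete with a small calculation. The sketch below computes the false positive rate and the false negative rate from paired predictions and known labels; the function name `error_rates` and the toy data are assumptions for illustration.

```python
def error_rates(predicted, actual):
    """Return (false_positive_rate, false_negative_rate).

    predicted, actual: equal-length sequences of booleans, where True
    means "flagged as of interest" and "truly of interest" respectively.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    if negatives == 0 or positives == 0:
        raise ValueError("both classes must be present to compute both rates")
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return false_pos / negatives, false_neg / positives

# Toy data: one false positive and one false negative among four cases.
predicted = [True, True, False, False]
actual = [True, False, True, False]
print(error_rates(predicted, actual))
```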
From page 201...
... Therefore, for evaluation purposes, it makes sense to proceed as if there are false positives and false negatives that are the direct result of the application of data mining methods. Even without a training set, the assessment of the false positive rate for a procedure is in some sense straightforward, because if a procedure identifies a number of people as being of interest, one can further investigate (a sample of)
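The sampling idea described here can be sketched as follows: draw a random sample of the flagged cases, investigate each one, and report the share that turn out not to be of interest. Strictly speaking, this estimates the proportion of flagged cases that are false positives. The names `estimate_false_share` and `is_true_hit` are hypothetical, and the investigation step is represented by a caller-supplied function.

```python
import random

def estimate_false_share(flagged, is_true_hit, sample_size, seed=0):
    """Estimate the share of flagged cases that are false positives.

    flagged: list of cases the procedure identified as of interest.
    is_true_hit: function standing in for a follow-up investigation.
    sample_size: number of flagged cases to investigate.
    """
    if not flagged:
        raise ValueError("no flagged cases to sample from")
    rng = random.Random(seed)
    sample = rng.sample(flagged, min(sample_size, len(flagged)))
    false_hits = sum(1 for case in sample if not is_true_hit(case))
    return false_hits / len(sample)

# Illustrative data: 10 flagged cases, of which 2 are genuine.
flagged = [{"id": i, "genuine": i < 2} for i in range(10)]
print(estimate_false_share(flagged, lambda c: c["genuine"], sample_size=10))
```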
From page 202...
... First, as mentioned above, for any supervised learning technique, since a training set is typically not representative of time dynamics, cross-validation does not evaluate a procedure's value for future data sets. Second, using this technique, one is evaluating each procedure as a single entity.
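The first limitation can be seen in how the data are split. In the sketch below, k-fold cross-validation mixes early and late records in every fold, whereas a time-ordered split holds out only the most recent records and therefore comes closer to testing performance on future data. It assumes records are indexed in chronological order; the function names are illustrative.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Records are assigned to folds round-robin, so every fold mixes
    early and late records and time order is ignored.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def temporal_split(n, train_fraction=0.8):
    """Train on the earliest records and test on the most recent ones."""
    cut = int(n * train_fraction)
    return list(range(cut)), list(range(cut, n))

for train, test in k_fold_indices(6, 3):
    print("k-fold   train:", sorted(train), "test:", test)
print("temporal train/test:", temporal_split(10))
```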
From page 203...
... To support this, not only should experts examine cases identified as of interest to discover false positives, but also a sample of those identified as not of interest should be reviewed in order to have some possibility, admittedly remote, of discovering false negatives. The evaluation and improvement of data mining procedures for counterterrorism needs to be an iterative process.
From page 204...
... will acquire data mining algorithms for use in counterterrorism in two ways: from outside developers (contractors) and from
From page 205...
... Possible mechanisms to support such contributions include interagency professional agreements, sabbatical arrangements for academics, consulting agreements, and external advisory groups. H.6 EXPERT JUDGMENT AND ITS ROLE IN DATA MINING The importance of responsible expert judgment in various aspects of data mining, from research and development to field deployment, cannot be overstated.
From page 206...
... Therefore, there is a need to consider the operator and the data mining algorithms as a sociotechnical system, as well as a need to determine how operators and the data mining technology can best work together. As an example of a sociotechnical issue, consider a frequently held belief in the infallibility of a computer.
From page 207...
... H.7 ISSUES CONCERNING THE DATA AVAILABLE FOR USE WITH DATA MINING AND THE IMPLICATIONS FOR COUNTERTERRORISM AND PRIVACY It is generally the case that the effectiveness of a data mining algorithm is much more dependent on the predictive power of the data collected for use than on the precise form of the algorithm. For example, it typically does not matter that much, in discriminating between two populations, whether one uses logistic regression, a classification tree, a neural net, a support vector machine, or discriminant analysis.
From page 208...
... A search engine, at the user level, is not a data mining system, but instead a database with a natural query language. However, the component processes of populating this database, ranking the results, and making the query language more robust are all carried out through the essential use of data mining algorithms.
From page 209...
... What is needed is not simply the modification of commercial off-the-shelf techniques developed for various business applications, but a dedicated collaborative research effort involving both data miners and intelligence analysts with the goal of developing what are currently nonexistent techniques and tools. H.9 INFORMATION FUSION Another class of data mining techniques, referred to as "information fusion," might be useful in counterterrorism.
From page 210...
... Regarding the broader application, consider the problem of identifying whether there is a terrorist threat from the following disparate sources of information: recent meetings of known terrorists, greater than usual movement of funds from countries known to harbor terrorists, and greater than usual purchases of explosives in the United States. Information fusion uses such techniques as the Kalman filter and Bayesian networks to learn how to optimally join disparate pieces of information at different levels of the decision process, by either combining individual data elements or combining higher level assessments for the decision at hand, in order to make improved decisions in comparison to more informal use of the disparate information.
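As a minimal sketch of how a Kalman-style update joins two pieces of information, the function below fuses two noisy estimates of the same quantity by weighting each according to its variance. This is only the scalar measurement-update step, not a full Kalman filter or a Bayesian network, and the function name `fuse` and the numbers are assumptions for illustration.

```python
def fuse(mean_a, var_a, mean_b, var_b):
    """Fuse two independent noisy estimates of one quantity.

    Scalar Kalman measurement update: estimate A is treated as the
    prior and estimate B as a new measurement.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    gain = var_a / (var_a + var_b)          # weight placed on the new estimate
    mean = mean_a + gain * (mean_b - mean_a)
    var = (1 - gain) * var_a                # fused variance is below both inputs
    return mean, var

# Two equally uncertain sources: the fused mean is halfway between them.
print(fuse(10.0, 4.0, 14.0, 4.0))
```

The fused variance is smaller than either input variance, which is the sense in which combining sources can improve on using either one alone.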
From page 211...
... It is possible that this conditional probability could be expressed as an arithmetic function of simpler conditional probabilities under some conditional independence assumptions, but then there is the problem of validating those assumptions to link those more primitive conditional probabilities to the desired conditional probability. More fundamentally, information fusion for the broader problem of counterterrorism requires a structure that expresses the forms in which information is received and how it should be combined.
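One common form of such a decomposition, shown here only as an illustration, assumes that the pieces of evidence are conditionally independent given whether a threat exists. The symbols are assumptions introduced for this sketch: T denotes the presence of a threat and E_1, ..., E_n denote the individual pieces of evidence.

```latex
P(T \mid E_1, \dots, E_n)
  = \frac{P(T)\,\prod_{i=1}^{n} P(E_i \mid T)}
         {P(T)\,\prod_{i=1}^{n} P(E_i \mid T)
          + P(\neg T)\,\prod_{i=1}^{n} P(E_i \mid \neg T)}
```

Each factor P(E_i | T) is simpler to estimate than the joint probability, but the result is only as trustworthy as the independence assumption, which is the validation problem noted above.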
From page 212...
... Thus, those intending to carry out relatively small-scale attacks might in principle leave a relevant database track, but the difficult (and for practical purposes, probably insoluble) problem would be the ability to identify that track and infer terrorist actions against a much larger background of innocuous activity.
From page 213...
... Training sets in this application can be used to develop very predictive models that discriminate well between those for whom additional loans would be a good decision and those for whom they would be a bad decision on the part of the credit granting institution. The utility of training sets in this application benefits from the prevalence of the failure to repay loans.
From page 214...
... (For example, it is known that having a name similar to that of a person on a terrorist watch list is cause for suspicion and additional screening.) Labeled training sets for supervised learning methods cannot be developed because the number of people that have attempted to initiate attacks on aircraft and other terrorist activity is extremely small.
From page 215...
... (Such searches could also result in a large number of false positives that would require human judgment to dispose of.) Such searches are within the purview of law enforcement and intelligence analysts today, and it would be surprising if
From page 216...
... For example, being a close associate of someone suspected of terrorist activity and having similar connections to persons or groups of interest are strong predictors that a given person will also be of interest for further investigation. By contrast, pattern-based techniques, in the absence of a training set, are likely to have substantially less predictive power than the subject-based patterns chosen by counterintelligence experts based on their experience -- and consequently a very large false positive rate.
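The link between rarity and false positives can be illustrated with Bayes' rule. The sketch below computes the share of flagged cases that are genuinely of interest, given a prevalence, a detection rate, and a false positive rate. All numbers are hypothetical and chosen only to show the effect; none comes from the report.

```python
def share_of_flags_that_are_genuine(prevalence, detection_rate, false_positive_rate):
    """Return P(genuinely of interest | flagged) by Bayes' rule."""
    true_flags = prevalence * detection_rate
    false_flags = (1 - prevalence) * false_positive_rate
    if true_flags + false_flags == 0:
        raise ValueError("procedure never flags anyone under these inputs")
    return true_flags / (true_flags + false_flags)

# Hypothetical inputs: 1 in 1,000 cases is of interest, 99% of those are
# flagged, and 1% of the remaining cases are flagged in error.
share = share_of_flags_that_are_genuine(0.001, 0.99, 0.01)
print(f"{share:.1%} of flagged cases are genuine")
```

Even with a seemingly accurate procedure, most flagged cases are false positives when the cases of interest are rare.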
From page 217...
... identify three factors that are likely to have a bearing on the utility of data mining for counterterrorist purposes: • The ability to identify subtle and complex data patterns indicating likely terrorist activity, • The construction of training sets that facilitate the discovery of indicative patterns not previously recognized by intelligence analysts, and • The high false positive rates that are likely to result from the problems in the first two bullets. A number of approaches can be taken to possibly address this argument.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.