Management Issues In The Analysis Of Large-Scale Crime Data Sets
Charles R. Kindermann
Marshall M. DeBerry, Jr.
Bureau of Justice Statistics U.S. Department of Justice
1 The Information Glut
The Bureau of Justice Statistics (BJS), a component agency in the Department of Justice, has the responsibility for collecting, analyzing, publishing, and disseminating information on crime, criminal offenders, victims of crime, and the operation of justice systems at all levels of government. Two very large data sets, the National Incident-Based Reporting System (NIBRS) and the National Crime Victimization Survey (NCVS), are part of the analytic activities of the Bureau. A brief overview of the two programs is presented below.
2 NIBRS
NIBRS, which will eventually replace the traditional Uniform Crime Reporting (UCR) Program as the source of official FBI counts of crimes reported to law enforcement agencies, is designed to go far beyond the summary-based UCR in terms of information about crime. The summary-based UCR program counts incidents and arrests, with some expanded data on incidents of murder and nonnegligent manslaughter.
In incidents where more than one offense occurs, the traditional UCR counts only the most serious of the offenses. NIBRS includes information about each of the different offenses (up to a maximum of ten) that may occur within a single incident. As a result, the NIBRS data can be used to study how often and under what circumstances certain offenses, such as burglary and rape, occur together.
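The kind of co-occurrence question described above can be illustrated with a minimal sketch. The incident identifiers and offense lists below are invented for illustration and do not follow the actual NIBRS record layout; the function name `count_offense_pairs` is likewise hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical incidents: incident ID -> offenses recorded within that incident.
# These records are invented; they are not real NIBRS data.
incidents = {
    "A001": ["burglary", "rape"],
    "A002": ["burglary"],
    "A003": ["burglary", "larceny/theft", "rape"],
    "A004": ["robbery", "aggravated assault"],
}

def count_offense_pairs(incidents):
    """Count how many incidents contain each unordered pair of distinct offenses."""
    pair_counts = Counter()
    for offenses in incidents.values():
        # sorted(set(...)) drops repeats and gives every pair one canonical order
        for pair in combinations(sorted(set(offenses)), 2):
            pair_counts[pair] += 1
    return pair_counts

pair_counts = count_offense_pairs(incidents)
print(pair_counts[("burglary", "rape")])  # incidents in which both offenses occur
```

A summary system that keeps only the most serious offense per incident could not answer this question, because the second offense in incidents A001 and A003 would never be recorded.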
The ability to link information about many aspects of a crime to the crime incident marks the most important difference between NIBRS and the traditional UCR. These various aspects of the crime incident are represented in NIBRS by a series of more than fifty data elements. The NIBRS data elements are categorized into six segments: administrative, offenses, property, victim, offender, and arrestee. NIBRS enables analysts to study how these data elements relate to each other for each type of offense.
3 NCVS
The Bureau of Justice Statistics also sponsors and analyzes the National Crime Victimization Survey (NCVS), an ongoing national household survey that was begun in 1972 to collect data on personal and household victimization experiences. All persons 12 years of age and older are interviewed in approximately 50,000 households every six months throughout the Nation. There are approximately 650 variables on the NCVS data file, ranging from the type of crime committed and the time and place of occurrence to whether or not the crime was reported to law enforcement authorities. The average size of the data file for all crimes reported for a particular calendar year is 120 megabytes.
The NCVS utilizes a hierarchical file structure for its data records. In the NCVS there are four types of records: a household link record, followed by the household, personal, and incident records. The household record contains information about the household as reported by the respondent and characteristics of the surrounding area as computed by the Bureau of the Census. The person record contains information about each household member 12 years of age and older as reported by that person or a proxy, with one record for each qualifying individual. Finally, the incident record contains information drawn from the incident report, completed for each household or person incident mentioned during the interview. The NCVS is a somewhat smaller data set than NIBRS, but may be considered analytically more complex because 1) there is more information available for each incident and 2) it is a panel design, i.e., the persons in each housing unit are interviewed every six months for a period of three years, thereby allowing for a limited degree of longitudinal comparison of households over time.
4 Data Utilization
An example of how those interested in the study of crime can tap the potentially rich source of new information represented by NIBRS is seen in the current Supplementary Homicide Reports data published annually by the FBI in its Crime in
the United States series. Crosstabulations of various incident-based data elements are presented, including the age, sex, and race of victims and offenders, the types of weapon(s) used, the relationship of the victim to the offender, and the circumstances surrounding the incident (for example, whether the murder resulted from a robbery, rape, or argument). The NIBRS data will offer a variable set similar in scope.
Currently, portions of eight states are reporting NIBRS data to the FBI. In 1991, three small states reported 500,000 crime incidents that required approximately one gigabyte of storage. If current NIBRS storage demands were extrapolated to full nationwide participation, 40 gigabytes of storage would be needed each year.
Although full nationwide participation in NIBRS is not a realistic short-term expectation, it is realistic to expect that a fourth of the U.S. could be represented in NIBRS within the next several years. The corresponding volume of data, 10 gigabytes each year, could still be problematic for storage and analysis.
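The arithmetic behind these storage figures can be laid out as a short sketch. The three constants come from the text; the function names are hypothetical, and two assumptions are made that the text does not state: a gigabyte is taken as 10^9 bytes, and storage is taken to scale linearly with the share of the country covered.

```python
# Figures quoted in the text
INCIDENTS_1991 = 500_000        # incidents reported by three small states in 1991
STORAGE_1991_GB = 1.0           # approximate storage those incidents required
FULL_PARTICIPATION_GB = 40.0    # extrapolated annual storage at full participation

def bytes_per_incident(incidents, storage_gb, bytes_per_gb=10**9):
    """Average storage per incident (assumes a decimal gigabyte)."""
    return storage_gb * bytes_per_gb / incidents

def projected_storage_gb(full_gb, coverage):
    """Annual storage for a given share of national coverage (assumes linear scaling)."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return full_gb * coverage

per_incident = bytes_per_incident(INCIDENTS_1991, STORAGE_1991_GB)
quarter_coverage_gb = projected_storage_gb(FULL_PARTICIPATION_GB, 0.25)
print(per_incident)         # 2000.0 bytes, i.e. roughly 2 kilobytes per incident
print(quarter_coverage_gb)  # 10.0 gigabytes per year
```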
Certain strategies may be chosen to reduce the size of the corresponding NIBRS data files. For example, most users of NIBRS data may not need or desire a data file that contains all twenty-two types of Group A offenses, which include crimes such as sports tampering, impersonation, and gambling equipment violations. If a user is interested in a much smaller file, only the more common offenses, such as aggravated assault, motor vehicle theft, burglary, or larceny/theft, could be included in the data set. Another area in which data reduction can be achieved is the NIBRS record layout itself. Although the multiple-record format may aid law enforcement agencies in entering the data, it can create difficulties in analyzing the files. For example, in the current NIBRS format, each record, regardless of type, begins with 300 bytes reserved for originating agency identifier (ORI) information. Currently, nearly a third of each ORI header is filler space reserved for future use. Moreover, the records for the different incident types have been padded with filler so that they can be stored as fixed-length records instead of variable-length records. The space wasted on repeated ORI headers and filler can be eliminated by restructuring the current files into a format better suited to current statistical software packages.
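The restructuring idea can be sketched on toy data. Only the 300-byte header length is taken from the text. The total record length, the blank filler byte, the ORI value, the payload contents, and the helper names `make_record` and `compact` are all assumptions made for illustration, and the layout shown is not the actual NIBRS format.

```python
HEADER_LEN = 300   # from the text: each record begins with 300 bytes of ORI information
RECORD_LEN = 400   # hypothetical fixed record length, chosen only for this sketch
FILLER = b" "      # assumption: filler is blank padding

def make_record(ori, payload):
    """Build one toy fixed-length record: padded ORI header plus padded payload."""
    header = ori.ljust(HEADER_LEN, FILLER)
    body = payload.ljust(RECORD_LEN - HEADER_LEN, FILLER)
    return header + body

def compact(records):
    """Keep each distinct ORI header once and store only the unpadded payloads."""
    compacted = {}
    for rec in records:
        ori = rec[:HEADER_LEN].rstrip(FILLER)
        payload = rec[HEADER_LEN:].rstrip(FILLER)
        compacted.setdefault(ori, []).append(payload)
    return compacted

# Three toy records from one (invented) agency
records = [
    make_record(b"XX0000001", b"ADMIN|incident-1"),
    make_record(b"XX0000001", b"OFFENSE|burglary"),
    make_record(b"XX0000001", b"VICTIM|household"),
]

compacted = compact(records)
original_bytes = sum(len(r) for r in records)
compact_bytes = sum(len(ori) + sum(len(p) for p in payloads)
                    for ori, payloads in compacted.items())
print(original_bytes, compact_bytes)  # 1200 57
```

The savings here are exaggerated by the toy payloads. A real conversion would also need a record-type tag and field delimiters, or a relational layout, so that variable-length records remain parseable, and it could not strip trailing blanks from fields where they are meaningful.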
Even with the restructuring of the current record formats, the annual collection of NIBRS data will still result in a large volume of data to be organized, stored, and analyzed. One strategy BJS is considering is to sample the NIBRS data in order to better manage the volume of data expected. Since the NIBRS program can be viewed as a potentially complete enumeration of incidents obtained by law enforcement agencies, simple random sampling could be employed, thereby avoiding the complications of developing a complex sample design strategy and facilitating the use of "off the shelf" statistical software packages.
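A minimal sketch of the simple random sampling approach follows, using a synthetic incident file in place of NIBRS data. The population, the 30 percent burglary share, the sample size, and the function names are all invented for illustration. The standard error uses the usual textbook estimator for a proportion under simple random sampling without replacement, including the finite population correction.

```python
import math
import random

def simple_random_sample(records, n, seed=None):
    """Draw n records without replacement, each subset equally likely."""
    rng = random.Random(seed)
    return rng.sample(records, n)

def estimate_proportion(sample, predicate, population_size):
    """Estimate a population proportion and its standard error under SRS."""
    n = len(sample)
    p = sum(1 for rec in sample if predicate(rec)) / n
    fpc = (population_size - n) / population_size  # finite population correction
    se = math.sqrt(fpc * p * (1 - p) / (n - 1))
    return p, se

# Synthetic "complete enumeration": 10,000 incidents, 30% of them burglaries
population = [
    {"incident_id": i, "offense": "burglary" if i % 10 < 3 else "other"}
    for i in range(10_000)
]

sample = simple_random_sample(population, 1_000, seed=1991)
p_hat, se = estimate_proportion(
    sample, lambda rec: rec["offense"] == "burglary", len(population)
)
print(f"estimated burglary share: {p_hat:.3f} (standard error {se:.3f})")
```

Because every incident has the same selection probability, no weights or design information are needed, which is what makes off-the-shelf packages usable on such a file.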
Using the sample design of the NCVS, BJS has produced a 100 megabyte longitudinal file of household units that covers a period of four and one half years. This file contains information on both interviewed and noninterviewed households in a selected portion of the sample over the seven interviews. The NCVS longitudinal file can facilitate the examination of patterns of victimization over time, the response of the police to victimizations, the effect of life events on the likelihood of victimization, and the long-term effects of criminal victimization on victims and the criminal justice system. However, current analysis of this particular data file has been hampered by issues relating to the sample design and to the use of popular statistical software packages. Since the NCVS utilizes a complex sample design, standard statistical techniques that assume a simple random sample cannot be utilized. Although there are software packages that can deal with complex sample designs, the NCVS data are collected by the Bureau of the Census under Title 13 of the U.S. Code. As a result, selected information that would identify primary sampling units and clusters is suppressed to preserve confidentiality. Researchers, therefore, cannot compute variances and standard errors for their analyses of this particular data set. BJS is currently working with the Bureau of the Census to develop modified sample information for inclusion on future public use tapes, which will allow computation of the appropriate sample variances.
Most of the current statistical software packages are geared to processing data on a case by case basis. The NCVS longitudinal file is structured in a nested hierarchical manner. When trying to examine events over a selected time period, it becomes difficult to rearrange the data in a way that will facilitate understanding the time or longitudinal aspects of the data. For example, the concept of what constitutes a "case record" depends on the perspective of the current question. Is a case all households that remain in sample over all seven interviews,
or is it those households that are replaced at every interview period? Moving the appropriate incident data from the lower levels of the nested file to the upper level of the household can complicate obtaining a "true" count of the number of households experiencing a victimization event, since many statistical software packages copy the upper-level information down to each lower-level record.
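The counting problem can be seen in a small sketch. The nested structure below is a simplified, hypothetical stand-in for the NCVS hierarchy: it omits the household link record and attaches every incident to a person, and all identifiers and incident types are invented.

```python
# Hypothetical nested file: households -> persons -> incidents
households = [
    {"hh_id": 1, "persons": [
        {"person_id": 1, "incidents": [{"type": "assault"}, {"type": "theft"}]},
        {"person_id": 2, "incidents": [{"type": "theft"}]},
    ]},
    {"hh_id": 2, "persons": [
        {"person_id": 1, "incidents": []},
    ]},
    {"hh_id": 3, "persons": [
        {"person_id": 1, "incidents": [{"type": "burglary"}]},
    ]},
]

def flatten(households):
    """Produce one row per incident, copying household and person IDs down."""
    rows = []
    for hh in households:
        for person in hh["persons"]:
            for incident in person["incidents"]:
                rows.append({
                    "hh_id": hh["hh_id"],
                    "person_id": person["person_id"],
                    "type": incident["type"],
                })
    return rows

rows = flatten(households)
naive_count = len(rows)                                   # counts incident rows
victimized_households = len({row["hh_id"] for row in rows})  # counts distinct households
print(naive_count, victimized_households)  # 4 2
```

Counting rows in the flattened file gives four, although only two households experienced any victimization. Household 2 also disappears from the flattened file entirely, so the denominator for a victimization rate has to come from the household level rather than from the incident rows.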
5 Future Issues
Local law enforcement agencies will be participating on a voluntary basis. NIBRS data collection and aggregation at the agency level will be far more labor- and resource-intensive than the current UCR system. What are the implications for coverage and data accuracy?
Criminal justice data have a short shelf life, because detection of current trends is important for planning and interdiction effectiveness. Can new methods be found to process massive data files and produce information in a time frame that is useful to the criminal justice community? Numerous offenses such as sports tampering are not of great national interest. A subset of the full NIBRS file based on scientific sampling procedures could facilitate many types of analyses.
How easy is it to integrate change into such a data system as evaluations of NIBRS identify new information needs that it will be required to address? Do the sheer volume of data and the number of reporting agencies make this any more difficult than for smaller ongoing data collections? As data storage technology continues to evolve, it is important to weigh both cost and future compatibility needs, particularly with regard to distributing the data to law enforcement agencies and the public. BJS will continue to monitor these technological changes so that we can use such advances to enhance our analytic capabilities with these large-scale data sets.