7 Matching and Cleaning Administrative Data
Pages 195-219

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 195...
... Part II Administrative Data
From page 196...
... We will discuss this issue below when we discuss record linkage in detail (Goerge, 1997). Creating Longitudinal Files: As mentioned earlier, the pull files provided by government agencies are often not cumulative files and most often only span a limited time period.
From page 197...
... When we refer to administrative data to be used for research and evaluation of social programs, we are referring primarily to data from management information systems designed to assist in the administration of participant benefits, including income maintenance, food stamps, Medicaid, nutritional programs, child support, child protective services, childcare subsidies, Social Security programs, and an array of social services and public health programs. Because the focus of the research is often on individual well-being, most government social programs aimed at individuals could be included.
From page 198...
... Without accurate record linkage, it is likely that the data on an individual will be incomplete or contain data that do not belong to that individual. And, to do accurate record linkage, the data fields necessary to perform the linkage, typically individual identifiers, must be accurate and in a standardized format across all the data sets to be linked.
From page 199...
... Because the data record for an individual or case is likely viewed often by the program staff, opportunities exist for correcting and updating the data fields. The value of this is even greater when the old information is maintained in addition to the updates.
From page 200...
... grants. However, because they rely on the reporting of grantees for employment information and there are often incentives for providing inaccurate information, addressing questions about the employment of TANF recipients using income maintenance program data is not ideal.
From page 201...
... Assessing Data Quality: Initially, the researcher would want to assess whether the data entry was reliable, which would include knowing whether the individual collecting the data had the skill or opportunity to collect reliable information. The questions that should be asked are as follows: · What is the motivation for collecting the data?
From page 202...
... We were certain that we had made some error in our record linkage. When we conferred with the welfare agency staff, they also were stymied at first.
From page 203...
... It is certainly possible that two administrative databases will label an individual as participating in two programs that should be mutually exclusive. For example, in our work in examining the overlap of AFDC or TANF and foster care, we find that children are identified as living with their parents in an income maintenance case when they are actually living with foster parents.
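One way to look for such conflicts is to compare the date ranges recorded in each system for the same child. The following is a minimal sketch in Python, assuming each spell is stored as an inclusive (start, end) pair of dates; the function name `overlapping_days` and the example dates are hypothetical and are not taken from the chapter.

```python
from datetime import date

def overlapping_days(spell_a, spell_b):
    """Count the days two inclusive (start, end) spells have in common."""
    start = max(spell_a[0], spell_b[0])   # later of the two start dates
    end = min(spell_a[1], spell_b[1])     # earlier of the two end dates
    # A negative span means the spells do not overlap at all.
    return max(0, (end - start).days + 1)

# Hypothetical records for one child in two systems that should be
# mutually exclusive: an income maintenance case and a foster care placement.
income_case = (date(2000, 1, 1), date(2000, 6, 30))
foster_placement = (date(2000, 5, 1), date(2000, 12, 31))

conflict_days = overlapping_days(income_case, foster_placement)
```

A nonzero result flags a period that at least one system has recorded incorrectly; deciding which system is right still requires knowledge of how each one is updated.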
From page 204...
... We will discuss this issue below when we discuss record linkage in detail (Goerge, 1997). Creating Longitudinal Files: As mentioned earlier, the pull files provided by government agencies are often not cumulative files and most often only span a limited time period.
From page 205...
... ADMINISTRATIVE DATA RECORD LINKAGE: A characteristic of administrative data that offers unique opportunities for researchers is the ability to link data sets in order to address research questions that have otherwise been difficult to pursue because of lack of suitable data. For example, studying the incidence of foster care placement, or any low-incidence event, among children who are receiving cash assistance requires a large sample of children receiving cash assistance given that foster care placement is a rare event. The resources and time required to gather such data using survey methods can be prohibitive.
From page 206...
... Given the large number of cases that need to be processed during record linkage, the idea of working with data from the entire population could overwhelm the researcher. However, because most data processing now is done using computers, the sheer size of the data files to be linked is typically not a major factor in the time and resources needed.
From page 207...
... In some instances, information systems even in a single agency do not share a common ID. For example, many child welfare agencies maintain two separate legacy information systems; one tracks foster care placement and payments and the other records child mal
From page 208...
... In fact, reliable record linkage performed on a regular basis between two information systems that contain a common ID could provide a means to "correct" such incorrect IDs. For example, when the data files from the two systems are properly linked by using data fields other than the common ID, such as names and birth dates, the results of such a link could be compared to the common IDs in the information systems to identify incorrectly entered IDs.
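The comparison described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' procedure: the pairs are assumed to have already been linked on names and birth dates, and the field name `common_id`, the helper `flag_id_conflicts`, and the sample records are all hypothetical.

```python
def flag_id_conflicts(linked_pairs, id_field="common_id"):
    """Return (id_a, id_b) for linked pairs whose recorded common IDs disagree.

    `linked_pairs` holds (record_a, record_b) tuples already linked on
    fields other than the common ID, such as names and birth dates.
    """
    conflicts = []
    for rec_a, rec_b in linked_pairs:
        id_a, id_b = rec_a.get(id_field), rec_b.get(id_field)
        # Only flag pairs where both IDs are present and differ;
        # a missing ID is a separate data quality problem.
        if id_a and id_b and id_a != id_b:
            conflicts.append((id_a, id_b))
    return conflicts

# Hypothetical pairs linked on last name and birth date.
linked_pairs = [
    ({"last_name": "LOPEZ", "birth_date": "1992-03-09", "common_id": "A1001"},
     {"last_name": "LOPEZ", "birth_date": "1992-03-09", "common_id": "A1001"}),
    ({"last_name": "SMITH", "birth_date": "1985-11-30", "common_id": "A2002"},
     {"last_name": "SMITH", "birth_date": "1985-11-30", "common_id": "A2020"}),
]

conflicts = flag_id_conflicts(linked_pairs)
```

Each flagged pair is a candidate for a mistyped ID, to be confirmed by agency staff rather than corrected automatically.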
From page 209...
... Our discussion focuses on two methods of record linkage that are possible in automated computer systems: deterministic and probabilistic record linking. Deterministic Record Linkage Deterministic linkage compares an identifier or a group of identifiers across databases; a link is made if they all agree.
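A minimal sketch of deterministic linkage in Python, assuming each data file is a list of dictionaries. The field names (`ssn`, `birth_date`), the helper `deterministic_link`, and the sample values are hypothetical; the only rule taken from the text is that a link is made when every chosen identifier agrees.

```python
from collections import defaultdict

def deterministic_link(file_a, file_b, keys):
    """Return index pairs (i, j) where all identifiers in `keys` agree exactly."""
    # Index file_b by its identifier tuple so each file_a record is one lookup.
    index = defaultdict(list)
    for j, rec in enumerate(file_b):
        key = tuple(rec.get(k) for k in keys)
        if None not in key and "" not in key:   # skip records with a missing identifier
            index[key].append(j)

    links = []
    for i, rec in enumerate(file_a):
        key = tuple(rec.get(k) for k in keys)
        if None in key or "" in key:
            continue                             # a missing identifier can never "agree"
        for j in index.get(key, []):
            links.append((i, j))
    return links

file_a = [
    {"ssn": "111-22-3333", "birth_date": "1990-01-15"},
    {"ssn": "", "birth_date": "1988-07-04"},      # missing SSN: cannot be linked
]
file_b = [
    {"ssn": "111-22-3333", "birth_date": "1990-01-15"},
    {"ssn": "111-22-3333", "birth_date": "1991-01-15"},  # birth year differs
]

links = deterministic_link(file_a, file_b, ["ssn", "birth_date"])
```

The example shows both weaknesses of an all-or-nothing rule: the record with a missing SSN is never linked, and a single mistyped digit in one field is enough to block a link.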
From page 210...
... Such identifying data may include last and first name, SSN, birth date, gender, race and ethnicity, and county of residence. The process of record linkage can be conceptualized as identifying matched pairs among all possible pairs of observations from two data files.
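The idea of searching among all possible pairs can be made concrete with a short Python sketch. It assumes two small files of dictionaries and records, for every pair, which fields agree; the field list, the helper `agreement_pattern`, and the sample records are hypothetical.

```python
from itertools import product

FIELDS = ("last_name", "birth_date", "gender")   # illustrative subset of identifiers

def agreement_pattern(rec_a, rec_b, fields=FIELDS):
    """Return a tuple of 1/0 flags: 1 where the two records agree on a field."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)

file_a = [
    {"last_name": "LOPEZ", "birth_date": "1992-03-09", "gender": "F"},
    {"last_name": "SMITH", "birth_date": "1985-11-30", "gender": "M"},
]
file_b = [
    {"last_name": "LOPEZ", "birth_date": "1992-03-09", "gender": "F"},
    {"last_name": "SMYTH", "birth_date": "1985-11-30", "gender": "M"},
    {"last_name": "JONES", "birth_date": "1970-05-21", "gender": "F"},
]

# Every record in file_a is compared with every record in file_b,
# so 2 x 3 = 6 candidate pairs are examined.
patterns = {
    (i, j): agreement_pattern(a, b)
    for (i, a), (j, b) in product(enumerate(file_a), enumerate(file_b))
}
```

Because the number of pairs is the product of the two file sizes, production systems usually compare only pairs that share some blocking value, such as the same birth year; that refinement is omitted from this sketch.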
From page 211...
... Recent developments in improving record linkage allow us to take advantage of the speed and cost that computerized and automated linkage confer, such as deterministic matching, while allowing a researcher to identify at which "level" a match would be considered to be a true one (see, for example, Jaro, 1989; Winkler, 1993, 1994, 1999)
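The notion of a "level" at which a match is accepted can be illustrated with a weighted agreement score and cutoffs. The Python sketch below uses the common log-ratio weighting of agreement probabilities; the excerpt does not spell out a formula, so the weighting scheme, the m- and u-probabilities, and both cutoffs are assumptions made for illustration only.

```python
import math

# Hypothetical probabilities per field:
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "gender": (0.99, 0.5),
}

def match_weight(pattern):
    """Sum log2 weights: positive for agreement, negative for disagreement."""
    total = 0.0
    for field, agrees in pattern.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

def classify(weight, upper=8.0, lower=0.0):
    """Apply two hypothetical cutoffs chosen by the researcher."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"        # in between: send to clerical review

all_agree = {"last_name": True, "birth_date": True, "gender": True}
name_differs = {"last_name": False, "birth_date": True, "gender": True}
none_agree = {"last_name": False, "birth_date": False, "gender": False}

decisions = [classify(match_weight(p)) for p in (all_agree, name_differs, none_agree)]
```

Raising the upper cutoff trades false-positive links for more pairs sent to review or left unlinked; where to set it is the researcher's judgment call.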
From page 212...
... As a result, some complex string comparator algorithms also have been developed to determine how close two strings of letters or numbers are to each other that account for insertions, deletions, and transpositions (Jaro, 1985, 1989; Winkler, 1990; Winkler and Thibaudeau, 1991). In the record linkage process, one critical data cleaning process is to "unduplicate" each source data set before any two data sets are linked.
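To show what a comparator that accounts for insertions, deletions, and transpositions does, here is a small Python sketch of the optimal string alignment distance (a restricted form of Damerau-Levenshtein distance). This is a stand-in chosen for brevity, not the Jaro or Winkler comparators cited above; the function names and the normalization into a 0-to-1 similarity are assumptions.

```python
def osa_distance(s, t):
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn s into t."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            # Two adjacent characters swapped count as a single edit.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def similarity(s, t):
    """Scale the distance to a 0-1 score; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - osa_distance(s, t) / longest

transposed = osa_distance("MARTHA", "MARHTA")    # adjacent letters swapped
substituted = osa_distance("SMITH", "SMYTH")     # one letter replaced
dropped = osa_distance("JOHNSON", "JONSON")      # one letter missing
```

A score like this lets a near-miss on a name count as partial agreement instead of a flat disagreement, which is what an exact comparison would report.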
From page 213...
... (see Scheuren and Winkler, 1993). In the case of deterministic record linkage, an audit check on the matched pairs could provide an estimate of false-positive errors.
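An audit check of this kind can be sketched as drawing a random sample of matched pairs, having a reviewer judge each one, and computing the share judged to be false matches. The Python below is a sketch under stated assumptions, not a procedure described in the chapter: the function names, the sample size, and the review outcomes are hypothetical, and the standard error uses the simple binomial formula.

```python
import math
import random

def draw_audit_sample(matched_pairs, n, seed=0):
    """Draw a simple random sample of matched pairs for manual review."""
    rng = random.Random(seed)            # fixed seed so the audit can be repeated
    return rng.sample(matched_pairs, min(n, len(matched_pairs)))

def estimate_false_positive_rate(review_results):
    """Estimate the false-positive rate from reviewer judgments.

    `review_results` is a list of booleans, True where the reviewer
    decided the linked pair does not refer to the same person.
    Returns (estimated rate, binomial standard error).
    """
    n = len(review_results)
    if n == 0:
        raise ValueError("at least one reviewed pair is required")
    p = sum(review_results) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical audit: 100 of 1,000 matched pairs reviewed, 3 judged false matches.
matched_pairs = list(range(1000))
sample = draw_audit_sample(matched_pairs, 100)
review_results = [True] * 3 + [False] * 97
rate, std_err = estimate_false_positive_rate(review_results)
```

The audit covers only pairs that were linked, so it says nothing about false-negative errors, that is, true matches that the linkage missed.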
From page 214...
... Table 7-1 compares the number of matched and unmatched Cornerstone data records to the Client Database records comparing the deterministic match using SSN and ...
TABLE 7-1 Comparison of SSN Match (Deterministic) Versus Probabilistic Match (Without Missing SSN)
From page 215...
... Although the findings might be somewhat different when applied to different data systems, our findings suggest that employing a probabilistic record-linkage method helps to reduce both false-negative and false-positive errors. The findings also show that the benefit of employing probabilistic record linkage is greater in reducing false-negative errors (Type II errors)
From page 216...
... and the benefit comes largely from identifying false-negative errors. As the number of records with missing SSN increases, the benefit of employing a probabilistic record-linkage method increases.
From page 217...
... , processed, and maintained be ... We also recommend using probabilistic record linkage and not relying on any one identifier for linking records. We believe our analysis above makes this case.
From page 218...
... Rubin 1995 A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association 90(June)
From page 219...
... Alexandria, VA: American Statistical Association. Matching and record linkage.

