
Currently Skimming:

Chapter 6 Invited Session on Business and Miscellaneous Record Linkage Applications
Pages 169-200

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 169...
... Record Linkage Techniques - 1997 Invited Session on Business and Miscellaneous Record Linkage Applications Chair: Richard Allen, National Agricultural Statistics Service Authors: Jenny B
From page 171...
... Fortunately for the rest of us, Federal estate tax data offer a rare opportunity to observe the total wealth, portfolios, and bequest behavior of certain individuals. Not only that, these data can be linked across generations, providing testing grounds for hypotheses about motives for intergenerational transfers, tradeoffs of family size and bequest amount, and the like.
From page 172...
... Answering them requires a sufficiently large, intergenerationally linked data set that contains comprehensive demographic and socioeconomic information. The Original Estate Tax Data: Saved in the Nick of Time Estate tax records contain a wealth of data on a nation's citizens.
From page 173...
... Linking the Data: Overlapping Estate Tax Returns Linking data from one set of records to another requires much information and, frequently, creative computer programming (Fellegi and Sunter, 1969). The AUTOMATCH software written by Matt Jaro provides a solid foundation (Jaro, 1997); variations on his programs coupled with SAS programming produced the linked estate tax records.
From page 174...
... For example, suppose the initial matching process paired Joseph McCarthy from the decedent file to Joseph McCarthy from the beneficiary file. The beneficiary file in
From page 175...
... Initially, the beneficiary file contained identifiers that pointed back to the estate tax record, but it did not have unique identifiers. Because my original files were so large, I excluded some variables while performing the match.
From page 176...
... (1963). Lifecycle Hypothesis of Savings: Aggregate Implications and Tests, American Economic Review, 53.
From page 177...
... . Matchware Product Overview, Record Linkage Techniques - 1997, eds.
From page 178...
... . Bequest and Asset Distribution: Human Capital Investment and Intergenerational Wealth Transfers, Savings and Bequests, ed.
From page 179...
... Introduction This paper describes a matching process which improves the linkage between sole proprietorship income tax return records from the Internal Revenue Service (IRS)
From page 180...
... Because partnership and corporation income tax returns are filed under an EIN, the linkage between receipts from annual tax returns and payroll records for these businesses is readily available. However, for sole proprietorships, if the EIN is missing or incorrect on the 1040-C, we obviously can't rely on the EIN to update the appropriate SSEL payroll record with 1040-C receipts.
From page 181...
... as candidates for matching fields, but appeared to contribute very little to establishing new linkages. During testing, the census name field update was incomplete, and it may yet be shown to be useful for future matchings. To summarize, on the EIN file we have a name field that may or may not contain a personal name; on the 1040-C file we have a name field that may contain compound names, with either of the components a candidate for matching.
From page 182...
... Software Used for the Matching For the matching software, we used Winkler's mf3 matcher, with match-specific modifications. We used both character-by-character comparisons and one of the native string comparators.
From page 183...
... A file of randomly joined payrolls from a known sole proprietors file and a sample file of 1040-Cs was created (random set)
From page 184...
... In fact, the distribution centralizes faster on the truth set than it does on the random set. The criterion for selection was to select the model that produced the most ratios near 1 and the fewest ratios at the extremes on the truth set, and simultaneously produced the fewest ratios near 1 and the most at the extremes on the random set.
From page 185...
... This frequently occurred where husband and wife appeared in the name field of both records. For duplicates fitting this pattern, both candidates having the same EIN and the same SSN, the pair with the highest match strength.
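One way to read the duplicate rule above is that, among candidate pairs sharing the same EIN and the same SSN, only the pair with the highest match strength is retained. The following is a minimal Python sketch under that reading. The record layout, field names, and sample values are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class CandidatePair(NamedTuple):
    ein: str          # employer identification number (hypothetical field name)
    ssn: str          # social security number (hypothetical field name)
    name: str         # name field on the candidate record
    strength: float   # match strength assigned by the matcher


def keep_strongest(pairs):
    """For each (EIN, SSN) key, keep only the pair with the highest strength."""
    best = {}
    for pair in pairs:
        key = (pair.ein, pair.ssn)
        if key not in best or pair.strength > best[key].strength:
            best[key] = pair
    return list(best.values())


# Illustrative values only; none of these come from the source data.
pairs = [
    CandidatePair("12-0000001", "000-00-0001", "JOHN AND MARY DOE", 7.2),
    CandidatePair("12-0000001", "000-00-0001", "MARY DOE", 9.5),
    CandidatePair("12-0000002", "000-00-0002", "A SMITH", 4.0),
]
kept = keep_strongest(pairs)
for pair in kept:
    print(pair.ein, pair.ssn, pair.name, pair.strength)
```

A dictionary keyed on (EIN, SSN) keeps the pass linear in the number of candidate pairs. Ties on strength keep the first pair seen, which is an arbitrary choice of this sketch.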
From page 186...
... Table 4.-Matches of Linked Records

Condition               No. of Records
True matches                    16,364
Type A false matches               100
Type B false matches                 1
False nonmatches                 2,130
Total                           18,595

The results of the match ... The type A false matches involved a correct linkage between EIN and SSN, but with the incorrect schedule number.
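The counts in Table 4 can be expressed as shares of the 18,595 records reviewed. The short Python sketch below does only that arithmetic. The choice of the table total as the denominator is an assumption made for illustration, and the paper may define its error rates differently.

```python
# Counts as given in Table 4.
counts = {
    "True matches": 16364,
    "Type A false matches": 100,
    "Type B false matches": 1,
    "False nonmatches": 2130,
}

total = sum(counts.values())

# Share of each condition, using the table total as the denominator
# (an assumption of this sketch, not a definition from the paper).
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<22}{n:>8,}{shares[label]:>9.2%}")
print(f"{'Total':<22}{total:>8,}")
```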
From page 187...
... across all 25.5 million records, we would have less than a 5 percent chance of getting 1 occurrence in our sample. The question then arises how to distribute the additional 215 estimated false matches between converted true matches and converted false nonmatches.
From page 188...
... For the 1992 tax year, about 1.37 million sole proprietors had their 1040-C tax return linked to their payroll records on the SSEL. The breakdown by source of linkage is given in Table 6.
From page 189...
... . Comparative Analysis of Record Linkage Decision Rules, Proceedings of the Section on Survey Research Methods, American Statistical Association, 829-833.
From page 190...
... While approximate string comparison has been a subject of research in computer science for many years (see survey article by Hall and Dowling, 1980), some of the most effective ideas in the record linkage context were introduced by Jaro (1989)
From page 191...
... The final section consists of a summary and conclusion. Approximate String Comparison Dealing with typographical error can be vitally important in a record linkage context.
From page 192...
... The return value of zero is justified because if each of the strings has three or fewer characters, then they necessarily disagree on at least one. In record linkage situations, the string comparator value is used in adjusting the matching weight associated with the comparison downward from the agreement weight toward the disagreement weight.
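The excerpt does not give the adjustment function, so the following Python sketch is only an illustration of the idea: a comparator value of 1 yields the full agreement weight, and lower values move the weight toward the disagreement weight. The linear interpolation and the m and u probabilities used here are assumptions. The agreement and disagreement weights follow the usual Fellegi-Sunter log-ratio form.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter agreement weight: log2 of m over u."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter disagreement weight: log2 of (1 - m) over (1 - u)."""
    return math.log2((1.0 - m) / (1.0 - u))


def adjusted_weight(comparator: float, m: float, u: float) -> float:
    """Interpolate linearly between the two weights.

    Linear interpolation is an assumption of this sketch; the source
    only says the weight is adjusted downward from the agreement
    weight toward the disagreement weight.
    """
    agree = agreement_weight(m, u)
    disagree = disagreement_weight(m, u)
    return disagree + comparator * (agree - disagree)


# Hypothetical m and u probabilities for a name field.
m, u = 0.9, 0.1
full = adjusted_weight(1.0, m, u)       # exact agreement
partial = adjusted_weight(0.970, m, u)  # near agreement
none = adjusted_weight(0.0, m, u)       # complete disagreement
print(full, partial, none)
```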
From page 193...
... -- Comparison of String Comparators Using Last Names, First Names, and Street Names

Two Strings                       String Comparator Values
String 1        String 2          Jaro    Wink    McLa    Lynch   Bigram
SHACKLEFORD     SHACKELFORD       0.970   0.982   0.982   0.989   0.925
DUNNINGHAM      CUNNIGHAM         0.896   0.896   0.896   0.931   0.917
NICHLESON       NICHULSON         0.926   0.956   0.969   0.977   0.906
JONES           JOHNSON           0.790   0.832   0.860   0.874   0.000
MASSEY          MASSIE            0.889   0.933   0.953   0.953   0.845
ABROMS          ABRAMS            0.889   0.922   0.946   0.952   0.906
HARDIN          MARTINEZ          0.000   0.000   0.000   0.000   0.000
INDIAN          SMITH             0.000   0.000   0.000   0.000   0.000
JERALDINE       GERALDINE         0.926   0.926   0.948   0.966   0.972
MARHTA          MARTHA            0.944   0.961   0.961   0.971   0.845
MICHELLE        MICHAEL           0.869   0.921   0.938   0.944   0.845
JULIES          JULIUS            0.889   0.933   0.953   0.953   0.906
TANYA           TONYA             0.867   0.880   0.916   0.933   0.883
DWAYNE          DUANE             0.822   0.840   0.873   0.896   0.000
SEAN            SUSAN             0.783   0.805   0.845   0.845   0.800
JON             JOHN              0.917   0.933   0.933   0.933   0.847
JON             JAN               0.000   0.000   0.860   0.860   0.000
BROOKHAVEN      BRROKHAVEN        0.933   0.947   0.947   0.964   0.975
BROOK HALLOW    BROOK HLLW        0.944   0.967   0.967   0.977   0.906
DECATUR         DECATIR           0.905   0.943   0.960   0.965   0.921
FITZRUREITER    FITZENREITER      0.856   0.913   0.923   0.945   0.932
HIGBEE          HIGHEE            0.889   0.922   0.922   0.932   0.906
HIGBEE          HIGVEE            0.889   0.922   0.946   0.952   0.906
LACURA          LOCURA            0.889   0.900   0.930   0.947   0.845
IOWA            IONA              0.833   0.867   0.867   0.867   0.906
1ST             IST               0.000   0.000   0.844   0.844   0.947

Data and Matching Weights-Parameters
In this section, we describe the fields and the associated matching weights that are used in the record linkage decision rule.
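To make the Jaro and Wink columns of the comparator table above concrete, here is a minimal Python sketch of the textbook Jaro similarity and the commonly cited Winkler prefix adjustment. It is not the authors' code. The prefix scale of 0.1 and the four-character prefix cap are the usual published defaults and are assumptions here, and the function names are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0 to 1."""
    if not s1 or not s2:
        return 0.0
    # Characters may match only if they lie within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the matched characters that are out of order.
    seq1 = [c for c, flag in zip(s1, used1) if flag]
    seq2 = [c for c, flag in zip(s2, used2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity raised in proportion to the shared prefix length."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


for a, b in [("SHACKLEFORD", "SHACKELFORD"),
             ("JONES", "JOHNSON"),
             ("DWAYNE", "DUANE")]:
    print(a, b, round(jaro(a, b), 3), round(jaro_winkler(a, b), 3))
```

For these three pairs the sketch agrees with the Jaro and Wink columns of the table (0.970 and 0.982, 0.790 and 0.832, 0.822 and 0.840). It does not reproduce every entry: the table reports 0.000 for some pairs, such as JON and JAN, where the textbook formula gives a positive value, so the authors' routine evidently applies additional rules that this excerpt does not spell out.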
From page 194...
... Results Results are presented in two parts. In each part, the different string comparators are substituted in the string comparison subroutine of an overall matching system.
From page 195...
... We see that, if matching is adjusted for bigrams and the string comparators, then error rates are much lower than those obtained when exact matching is used. Since exact matching is not competitive, remaining results are only presented when string comparators are used.
From page 196...
... -- Matching Results at Different Error Rates: Second Pair of Files with 5,022 and 5,212 Records
37,327 Pairs Agreeing on Block and First Character of Last Name

Link Error Rate   Comparator   Link Match/Nonm   Clerical Match/Nonm
0.002             base         3475/7            63/65
0.002             s_c          3414/7            127/65
0.002             as           3414/7            127/65
0.002             os_l         3477/7            63/65
0.002             bigram       3090/7            461/66
0.005             base         3503/18           35/54
0.005             s_c          3493/18           48/54
0.005             as           3493/18           48/54
0.005             os_l         3505/18           36/54
0.005             bigram       3509/18           42/55
0.010             base         3525/36           13/36
0.010             s_c          3526/36           15/36
0.010             as           3526/36           15/36
0.010             os_l         3527/36           14/36
0.010             bigram       3543/36           8/73
0.020             base         3538/72           0/0
0.020             s_c          3541/72           0/0
0.020             as           3541/72           0/0
0.020             os_l         3541/72           0/0
0.020             bigram       3551/73           0/0
From page 197...
... -- Matching Results at Different Error Rates: Third Pair of Files with 15,048 and 12,072 Records
116,305 Pairs Agreeing on Block and First Character of Last Name

Link Error Rate   Comparator   Link Match/Nonm   Clerical Match/Nonm
0.002             base         9696/19           155/182
0.002             s_c          9434/19           407/182
0.002             as           9436/19           406/182
0.002             os_l         9692/19           157/182
0.002             bigram       9515/19           335/182
0.005             base         9792/49           59/152
0.005             s_c          9781/49           60/152
0.005             as           9783/49           57/152
0.005             os_l         9791/49           58/152
0.005             bigram       9784/49           66/152
0.010             base         9833/99           18/102
0.010             s_c          9822/99           19/102
0.010             as           9823/99           17/102
0.010             os_l         9831/99           18/102
0.010             bigram       9823/99           27/102
0.020             base         9851/201          0/0
0.020             s_c          9841/201          0/0
0.020             as           9842/201          0/0
0.020             os_l         9849/201          0/0
0.020             bigram       9850/201          0/0

The results generally show that the different string comparators improve matching efficacy. In all of the best situations, error levels are very low. The new string comparator produces worse results than the previous one (see e.g., Winkler, 1990)
From page 198...
... -- First Pass -- Housing Unit Identifier Match: Matching Results of a Pair of Files with 226,713 and 153,644 Records, Respectively

                              Jaro String Comparator    Bigram
                              Links      Clerical       Links      Clerical
Count                         78814      5091           78652      5888
Estimated false match rate    0.~%       30%            0.1%       35%

Second Pass -- House Number and First Character of First Name: Matching Results of a Pair of Files with 132,100 and 64,121 Records, Respectively

                              Links      Clerical
Count                         16893      7207
Estimated false match rate    0.3%       40%

Summary and Conclusion Application of new string comparator functions can improve matching efficacy in the files having large amounts of typographical error. Since many of the files typically have high typographical error rates, the string comparators can yield increased accuracy and reduced costs in matching of administrative lists and census.
From page 199...
... (1985). Preprocessing of Lists and String Comparison, Record Linkage Techniques - 1985, W


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.