Currently Skimming:

Chapter 4 Invited Session on Record Linkage Methodology
Pages 79-138


From page 79...
... Record Linkage Techniques-1997 Invited Session on Record Linkage Methodology Chair: Nancy Kirkendall, Office of Management and Budget Authors: Thomas R. Belin, University of California, Los Angeles, and Donald B.
From page 81...
... that is, for accurately estimating false-match rates for each possible cutoff weight. The strategy uses a model where the distribution of observed weights is viewed as a mixture of weights for true matches and weights for false matches. An EM algorithm for fitting mixtures of transformed-normal distributions is used to find posterior modes; associated posterior variability is due to uncertainty about specific normalizing transformations as well as uncertainty in the parameters of the mixture model.
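The mixture idea in this abstract can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not in the source: the two components are plain normal distributions (no normalizing transformation), no training data are used, and the function names, starting values, and simulated weights are all hypothetical. It fits the mixture by EM and then averages the posterior false-match probabilities above a cutoff.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def posterior_false(w, lam, means, sds):
    """Posterior probability that each weight comes from the false-match component."""
    dens_false = (1.0 - lam) * normal_pdf(w, means[0], sds[0])
    dens_true = lam * normal_pdf(w, means[1], sds[1])
    return dens_false / (dens_false + dens_true)


def fit_two_normal_mixture(weights, n_iter=200):
    """EM for a two-component normal mixture.

    Component 0 is taken to be the false matches (low weights) and
    component 1 the true matches (high weights). This is an assumption
    of the sketch, enforced only through the starting values.
    """
    w = np.asarray(weights, dtype=float)
    means = np.array([w.min(), w.max()])   # crude starting values
    sds = np.array([w.std(), w.std()])
    lam = 0.5                              # mixing proportion of true matches
    for _ in range(n_iter):
        # E-step: posterior component memberships
        p_false = posterior_false(w, lam, means, sds)
        p_true = 1.0 - p_false
        # M-step: weighted proportion, means, and standard deviations
        lam = p_true.mean()
        means = np.array([np.sum(p_false * w) / p_false.sum(),
                          np.sum(p_true * w) / p_true.sum()])
        sds = np.sqrt(np.array([
            np.sum(p_false * (w - means[0]) ** 2) / p_false.sum(),
            np.sum(p_true * (w - means[1]) ** 2) / p_true.sum()]))
    return lam, means, sds


def false_match_rate_above(weights, p_false, cutoff):
    """Average posterior false-match probability among pairs at or above the cutoff."""
    above = np.asarray(weights) >= cutoff
    return float(p_false[above].mean()) if above.any() else 0.0


# Simulated weights (hypothetical): 300 false matches, 700 true matches.
rng = np.random.default_rng(0)
sim_weights = np.concatenate([rng.normal(-5.0, 2.0, 300),
                              rng.normal(10.0, 3.0, 700)])
lam_hat, means_hat, sds_hat = fit_two_normal_mixture(sim_weights)
p_false_hat = posterior_false(sim_weights, lam_hat, means_hat, sds_hat)
fmr_at_zero = false_match_rate_above(sim_weights, p_false_hat, cutoff=0.0)
```

Evaluating `false_match_rate_above` over a grid of cutoffs gives an estimated false-match curve; the sketch reports point estimates only and says nothing about the posterior variability the abstract refers to.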
From page 82...
... review is common practice in operations conducted by the Census Bureau. Each such training study provides a data set in which each candidate pair has its weight and an outcome defined as true match or false match, and thus provides in [...] To illustrate this [...] conclusion, Table [...] displays empirical findings from Belin (1990) with test-census data on the performance of the procedure of Fellegi and Sunter (1969)
From page 83...
... [...] weights from each of the three sites. [...] in the current site by fitting a normal mixture model, which would estimate the means and variances of the two normal components (i.e., [...]) The bimodality in the true-match distribution for the Washington site appears to be due to some record pairs [...]
From page 84...
... 2. CALIBRATING FALSE-MATCH RATES IN RECORD LINKAGE USING TRANSFORMED-NORMAL MIXTURE MODELS 2.1 Strategy Based on Viewing Distribution of Weights as Mixture We assume that a univariate composite weight has been calculated for each candidate pair in the record-linkage problem at hand, so that the distribution of observed weights is a mixture of the distribution of weights for true matches and the distribution of weights for false matches. We also assume the availability of at least one training sample in which match status (i.e., whether a pair of records is a true match or a false match)
From page 85...
... ., n) are normally distributed. Although the sample geometric mean is determined by the data, we will soon turn to a setting involving a mixture of two components with different transformations to normality, in which even the sample geometric means of the two components are unknown; consequently, we treat this quantity as an unknown parameter, the population geometric mean.
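As an illustration of a transformation to normality that involves a geometric mean, the sketch below assumes the Box-Cox power family rescaled by the geometric mean. The snippet above does not show the transformation itself, so the family, the parameter names `gamma` (power) and `omega` (geometric mean), and the function names are assumptions made here. The transform requires positive inputs.

```python
import numpy as np


def geometric_mean(w):
    """Geometric mean of positive values."""
    w = np.asarray(w, dtype=float)
    return float(np.exp(np.mean(np.log(w))))


def scaled_box_cox(w, gamma, omega):
    """Box-Cox power transform rescaled by a geometric mean omega.

    gamma == 0 is the logarithmic limit of the power family.
    Both gamma and omega are treated as free parameters here, which
    mirrors the idea of a geometric mean that is not fixed by the data.
    """
    w = np.asarray(w, dtype=float)
    if abs(gamma) < 1e-12:
        return omega * np.log(w)
    return (w ** gamma - 1.0) / (gamma * omega ** (gamma - 1.0))


example = np.array([1.0, 4.0])
gm = geometric_mean(example)                       # geometric mean of 1 and 4
identity_like = scaled_box_cox(example, 1.0, gm)   # gamma = 1 gives w - 1
log_like = scaled_box_cox(example, 0.0, gm)        # gamma = 0 gives omega * log(w)
```

With `gamma = 1` the transform is a shift of the data, and as `gamma` approaches 0 it approaches `omega * log(w)`, so the family moves continuously between no transformation and a log transformation.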
From page 86...
... . 2.3.3 Transformed-Normal Mixture Model for Record-Linkage Weights Let the weights associated with record pairs in a current data set be denoted by W_i, i = 1, .
From page 87...
... [...] for obtaining standard errors of parameters in models that are fit by the EM algorithm [...] fraction of missing information [...] provide an appropriate [...] covariance matrix. Details on the implementation of the SEM algorithm in our mixture-model [...] In terms of the model parameters, the false-match rate among record pairs with weights [...] is given by
From page 88...
... we" simply fused as the geometric mam declared matches ate needed to make up for the ~ ost of a of the component geometric means Han the two perilous sputa If the methods had eDot ·~dC6d as well as, th did th false matcher' If subject matter expert who are eddy a Scow ted ant Is band data, then we Would have also reflected it kayo p~wode'Re can amvC at as anew e · e uncertainty ID tbeee puametas then a prwedurc for scaling Cabot ~ could be aaamlne~ ~ · ~ · e ~ ~ ~Due to thc structure Of ok problem, ~n which the role of selecting a cutoff weight where the estimator He - oomooa .
From page 89...
... 3. PERFORMANCE OF CALIBRATION PROCEDURE ON CENSUS COMPUTER-MATCHING DATA 3.1 Results from Test-Census Data We use the test-census data described in Section 1.3 to illustrate the performance of the proposed calibration procedure, where determinations by clerks are the best measures available for true-match and false-match status. With three separate sites available, we were able to apply our strategy three times, with two sites serving as training data and the mixture-model procedure applied to the data from the third site.
From page 90...
... [equation garbled in source], where n_0 is the number of record pairs with weights in the target interval and [symbol illegible] is the predicted probability of false match for candidate pair i.
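The quantities named here can be combined in a small sketch. The equation itself is not legible in the source, so this is not a reconstruction of it: the sketch assumes the estimated false-match rate in an interval is the average of the predicted probabilities, and it computes only a Bernoulli sampling component of the standard error, leaving out any term for parameter uncertainty. The names `p_false` and `interval_false_match_summary` are hypothetical.

```python
import math


def interval_false_match_summary(weights, p_false, low, high):
    """Summarize predicted false-match probabilities for pairs with low <= weight <= high.

    Returns (n0, estimated_rate, bernoulli_se), where bernoulli_se reflects
    only pair-level Bernoulli variation given the predicted probabilities.
    """
    selected = [p for w, p in zip(weights, p_false) if low <= w <= high]
    n0 = len(selected)
    if n0 == 0:
        return 0, float("nan"), float("nan")
    rate = sum(selected) / n0
    bernoulli_se = math.sqrt(sum(p * (1.0 - p) for p in selected)) / n0
    return n0, rate, bernoulli_se


# Hypothetical weights and predicted false-match probabilities.
n0, rate, se = interval_false_match_summary(
    weights=[1.0, 2.0, 3.0, 10.0],
    p_false=[0.5, 0.5, 0.1, 0.0],
    low=0.0,
    high=5.0,
)
```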
From page 91...
... [Table caption, garbled in source: performance of the calibration procedure]
From page 92...
... [...] where 99% or more of the records were [...] Previous attempts at estimating false-match rates in record linkage were either unreliable or too cumbersome for practical [...] matched above the point where the procedure predicted a false-match rate of .005, there is no evidence that sample size [...] had an impact on the accuracy of estimated probabilities of false match, implying that breakdowns of the calibration procedure appear to be a threshold phenomenon.
From page 93...
... It also should be pointed out that the SEM algorithm [...] calculation of MLEs. Although it may only be necessary [...] to run EM for 10 or 20 iterations to obtain accuracy to [...] decimal places in MLEs, it might take 100 or more iterations to obtain [...] The SEM algorithm is founded on the identity that the observed-data [...] can be expressed in terms of the conditional expectation of the "complete-data" observed information matrix [...]
From page 94...
... . "Using Mixture Models to Calibrate Error Rates in Record Linkage Procedures, with Application to Computer Matching for Census Undercount Estimation," Ph.D.
From page 95...
... Automated linkage involves using computers to perform matching operations quickly and accurately. Mixture models can be used when the population is composed of underlying and possibly unidentified subpopulations.
From page 96...
... Mixture Models An observation $y_i$ (possibly multivariate) arising from a finite mixture distribution with G classes has probability density $p(y_i \mid \Phi) = \sum_{g=1}^{G} \pi_g \, p_g(y_i \mid \theta_g)$, where $\pi_g$ ($\sum_{g=1}^{G} \pi_g = 1$)
From page 97...
... Before the match and nonmatch status is determined by clerks, tentative declarations as probable match and probable nonmatch can be made using mixture models. It is necessary to choose a class or classes to be used as probable matches and probable nonmatches, which usually can be done by looking at probabilities of agreement on fields in the mixture classes.
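A minimal sketch of this step, assuming binary agree/disagree comparisons on each field and conditional independence of fields within a class, is given below. The function name, starting values, field probabilities, and simulated data are hypothetical and are not taken from the paper. After fitting, the class with the highest agreement probabilities is picked as the probable-match class.

```python
import numpy as np


def fit_latent_class(patterns, n_classes=2, n_iter=200):
    """EM for a conditional-independence mixture on 0/1 agreement patterns.

    patterns: array of shape (n_pairs, n_fields) with 1 = agree, 0 = disagree.
    Returns class proportions pi, agreement probabilities theta
    (n_classes x n_fields), and posterior class memberships resp.
    """
    x = np.asarray(patterns, dtype=float)
    n_fields = x.shape[1]
    pi = np.full(n_classes, 1.0 / n_classes)
    # Deterministic start: classes ordered from low to high agreement.
    levels = np.linspace(0.25, 0.75, n_classes)
    theta = np.repeat(levels[:, None], n_fields, axis=1)
    for _ in range(n_iter):
        # E-step: log joint density of each pattern with each class
        log_joint = (np.log(pi)[None, :]
                     + x @ np.log(theta).T
                     + (1.0 - x) @ np.log(1.0 - theta).T)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update proportions and per-field agreement probabilities
        pi = np.clip(resp.mean(axis=0), 1e-12, None)
        theta = (resp.T @ x) / (resp.sum(axis=0)[:, None] + 1e-12)
        theta = np.clip(theta, 1e-6, 1.0 - 1e-6)
    return pi, theta, resp


# Simulated comparison data (hypothetical): 20% matches, four fields.
rng = np.random.default_rng(1)
n_pairs = 5000
is_match = rng.random(n_pairs) < 0.2
m_probs = np.array([0.90, 0.90, 0.85, 0.80])   # P(agree | match)
u_probs = np.array([0.10, 0.15, 0.20, 0.05])   # P(agree | nonmatch)
agree_probs = np.where(is_match[:, None], m_probs, u_probs)
patterns = (rng.random((n_pairs, 4)) < agree_probs).astype(int)

pi_hat, theta_hat, resp_hat = fit_latent_class(patterns, n_classes=2)
match_class = int(theta_hat.mean(axis=1).argmax())   # class with highest agreement
probable_match = resp_hat[:, match_class] > 0.5      # tentative declarations
```

Choosing the match class by inspecting `theta_hat` corresponds to "looking at probabilities of agreement on fields in the mixture classes"; with more than two classes, more than one class could be assigned to either side.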
From page 98...
... for matches and nonmatches are well separated. The new approach of this paper does not require training data, but could use it as classified observations, and provides its own estimates of error rates as described for mixture models.
From page 99...
... Different mixture models give slightly different estimates of the likelihood ratio just as different estimation methods currently used in practice lead to different orderings of pairs. Application In 1988, a trial census and postenumeration survey (PES)
From page 100...
... At an error rate of .005, using the estimated false-match curve, 7462 matches and 3 nonmatches are declared matches, giving an actual error rate of .0004. At an estimated error rate of .01, 8596 matches and 23 nonmatches are declared matches, giving an actual error rate of .0027. Figure 1. False-Match and False-Nonmatch Rates From Fitting a Three-Class Conditional Independence Mixture to D88a (The solid lines are actual and the dashed lines are estimated error rates)
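The actual error rates quoted above follow directly from the counts, taking the rate to be the share of declared matches that are really nonmatches. A short check (the function name is chosen here for illustration):

```python
def actual_error_rate(matches_declared, nonmatches_declared):
    """Share of declared matches that are actually nonmatches."""
    return nonmatches_declared / (matches_declared + nonmatches_declared)


rate_at_005 = actual_error_rate(7462, 3)    # 3 / 7465
rate_at_01 = actual_error_rate(8596, 23)    # 23 / 8619
print(round(rate_at_005, 4), round(rate_at_01, 4))  # prints 0.0004 0.0027
```

In both cases the actual rate is below the estimated rate used to set the cutoff (.005 and .01).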
From page 101...
... The models not chosen had rapidly rising estimated error rates right away. Pairs were identified to be reviewed by clerks.
From page 102...
... The procedure identifies matches and nonmatches, directs clerks in their work, and provides cut-offs and estimates of error rates on five Census data sets. Acknowledgments The author wishes to thank William E
From page 103...
... Figure 2. False-Match (FMR)
From page 104...
... of Record Linkage, Proceedings of the Survey Research Methods Section, American Statistical Association, 778-783. Winkler, William E
From page 105...
... (1992), Comparative Analysis of Record Linkage Decision Rules, Proceedings of the Survey Research Methods Section, American Statistical Association, 829-834.
From page 106...
... Our results are preliminary and intended largely to stimulate further work. KEY WORDS: Record linkage; Matching error; Regression analysis.
From page 107...
... We call the ratio R or any monotonely increasing transformation of it (such as given by a logarithm) a matching weight or total agreement weight.
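To make the idea of a matching weight concrete, the sketch below assumes R is the Fellegi-Sunter likelihood ratio, that is, the probability of an agreement pattern among true links divided by its probability among nonlinks, and that fields are compared independently. The snippet above does not show the definition of R, and the m- and u-probabilities below are hypothetical values chosen for illustration.

```python
import math


def total_agreement_weight(agreements, m_probs, u_probs):
    """Log base 2 of the likelihood ratio R under field-wise independence.

    agreements: sequence of booleans, one per field (True = fields agree).
    m_probs[k]: P(field k agrees | pair is a true link).
    u_probs[k]: P(field k agrees | pair is a nonlink).
    """
    weight = 0.0
    for agrees, m, u in zip(agreements, m_probs, u_probs):
        if agrees:
            weight += math.log2(m / u)                  # agreement weight
        else:
            weight += math.log2((1.0 - m) / (1.0 - u))  # disagreement weight
    return weight


m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]
weight_both_agree = total_agreement_weight([True, True], m_probs, u_probs)
weight_both_differ = total_agreement_weight([False, False], m_probs, u_probs)
```

Because the logarithm is monotonically increasing, ranking pairs by this weight gives the same ordering as ranking them by R.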
From page 108...
... Scheuren and Winkler. Figure 1. [vertical axis: log frequency]
From page 109...
... 2.2 Handling Potential Links Even when a computer matching system uses the Fellegi-Sunter decision rule to designate some pairs as almost certain true links or true nonlinks, it could leave a large subset of pairs that are only potential links. One way to address potentially linked pairs is to clerically review them in an attempt to delineate true links correctly.
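The three-way designation described here can be sketched as a rule with two cutoffs on the matching weight. The cutoff values and the function name are hypothetical; in practice the cutoffs would be chosen from target error rates.

```python
def classify_pair(weight, lower_cutoff, upper_cutoff):
    """Three-way decision: link, nonlink, or potential link held for review."""
    if lower_cutoff > upper_cutoff:
        raise ValueError("lower_cutoff must not exceed upper_cutoff")
    if weight >= upper_cutoff:
        return "link"
    if weight <= lower_cutoff:
        return "nonlink"
    return "potential link"


# Hypothetical weights and cutoffs.
decisions = [classify_pair(w, lower_cutoff=-2.0, upper_cutoff=6.0)
             for w in (-5.0, 1.5, 9.0)]
```

Pairs falling between the cutoffs are the potential links that would go to clerical review.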
From page 110...
... With the adjusted slope coefficient $\hat{b}$, the proper intercept can be obtained from the usual expression $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$, where $\hat{b}$ has been adjusted. Methods for estimating regression standard errors can also be devised in the presence of matching errors.
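The effect of matching error on a regression slope, and the way the intercept follows from an adjusted slope, can be illustrated with a simulation. This sketch swaps in a deliberately simplified error model that is not the authors' pair-specific adjustment: every pair is correctly linked with the same known probability `p_correct`, and a mislinked pair receives a y-value drawn at random from the file. Under that assumption the observed slope is attenuated by roughly the factor `p_correct`, so dividing by it recovers the slope. All names and parameter values are hypothetical.

```python
import numpy as np


def ols_slope_intercept(x, y):
    """Ordinary least squares slope and intercept for a simple regression."""
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


rng = np.random.default_rng(1997)
n = 20000
true_slope, true_intercept, p_correct = 2.0, 5.0, 0.8

x = rng.uniform(1.0, 101.0, size=n)                  # independent variable
y = true_intercept + true_slope * x + rng.normal(0.0, 10.0, size=n)

# Observed dependent variable z: the true y when the link is correct,
# otherwise a y-value taken from a randomly chosen record.
mislinked = rng.random(n) > p_correct
z = y.copy()
z[mislinked] = rng.choice(y, size=int(mislinked.sum()))

naive_slope, naive_intercept = ols_slope_intercept(x, z)

# Adjust the slope for attenuation, then get the intercept from the
# usual expression: intercept = mean(z) - adjusted_slope * mean(x).
adjusted_slope = naive_slope / p_correct
adjusted_intercept = z.mean() - adjusted_slope * x.mean()
```

The naive slope comes out near `p_correct * true_slope`, while the adjusted slope is close to the true value; the intercept is recomputed only after the slope has been adjusted.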
From page 111...
... We neither assume that the bias terms have expectation zero nor that they are uncorrelated with the observed data. With the different representations, we can adjust the regression coefficients $\beta_{ZX}$ and their associated standard errors back to the true values $\beta_{YX}$ and their related standard errors.
From page 112...
... Figure 2. Log of Frequency vs. Weight, Good Matching Scenario, Links and Nonlinks
From page 113...
... Log of Frequency vs. Weight, Mediocre Matching Scenario, Links and Nonlinks
From page 114...
... Figure 4. Log Frequency vs. Weight, Poor Matching Scenario, Links and Nonlinks
From page 115...
... Each data base contained a computer matching weight, true and estimated matching probabilities, the independent x-variable for the regression, the true dependent y-variable, the observed y-variable in the record having the highest match weight, and the observed y-variable from the record having the second highest matching weight. The independent x-variables for the regression were constructed using the SAS RANUNI procedure, so as to be uniformly distributed between 1 and 101.
From page 116...
... For each class, the bias is squared, added to the square of the standard error, and square roots taken. Observations on the results we obtained are fairly straightforward and about what we expected.
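Written out, the summary measure described in that sentence is a root mean square error. The notation below is introduced here for illustration and is not taken from the paper; the numbers in the worked line are hypothetical.

```latex
% Root mean square error for weight class c (notation assumed here)
\mathrm{RMSE}_c = \sqrt{\mathrm{bias}_c^{2} + \mathrm{SE}_c^{2}}

% Worked example with hypothetical values bias = 0.03 and SE = 0.04:
\sqrt{0.03^{2} + 0.04^{2}} = \sqrt{0.0009 + 0.0016} = \sqrt{0.0025} = 0.05
```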
From page 117...
... Figure 5. Relative Bias for Adjusted Estimators, Estimated Probabilities [vertical axis: Relative Bias; horizontal axis: Cumulative Weight Classes]
From page 118...
... [Figure residue; horizontal axis: Cumulative Weight Classes]
From page 119...
... [Figure residue; horizontal axis: Cumulative Weight Classes]
From page 120...
... Matching scenario: Good | Mediocre | Poor
True probabilities: Adjustment was not helpful because it was not needed | Good results like those in Section 4.1 | Good results like those in Section 4.1
Estimated probabilities: Same as above | Same as above | Poor results because Rubin-Belin could not estimate the probabilities
Any statistical estimation procedure will have difficulty with the poor matching scenario because of the extreme overlap of the curves.
From page 121...
... In the present paper we have proposed that the links, nonlinks, and potential links be provided to the analyst - not just links. We strongly recommend this, even if a clerical review step has been undertaken.
From page 122...
... As intermediate steps in estimating regression coefficients and their standard errors, we need to find [...] E(Z)
From page 123...
... involves the usual assumption that the error terms are independent with identical variance. In the numeric examples of this paper we assumed that the true independent value X_i associated with each Y_i was from the record with the highest matching weight and the false independent value was taken from the record with the second highest matching weight.
From page 124...
... . Using Mixture Models to Calibrate Error Rates in Record Linkage Procedures, with Application to Computer Matching for Census Undercount Estimation.
From page 125...
... . Recent developments in calibrating error rates for computer matching, Proceedings of the 1991 Annual Research Conference, U.S.
From page 126...
... Relation To Earlier Work In earlier work (Scheuren and Winkler, 1993), we provided theory showing that elementary regression analyses could be accurately adjusted for matching error, employing knowledge of the quality of the matching.
From page 127...
... . The intent of these simulations is to use matching scenarios that are more difficult than what most linkers typically encounter.
From page 128...
... We call the ratio R or any monotonely increasing transformation of it (typically a logarithm) a matching weight or total agreement weight.
From page 129...
... The first poor matching scenario consisted of using last name, first name, one address variation, and age. Minor typographical errors were introduced independently into one fifth of the last names and one third of the first names in one of the files.
From page 130...
... 1st Poor Matching Scenario [figure; legend: Matches, Nonmatches; horizontal axis: Matching Weight] The second poor matching scenario consisted of using last name, first name, and one address variation.
From page 131...
... In the poor matching scenario of that paper (first poor scenario of this paper), the Belin-Rubin procedure was unable to provide accurate estimates of error rates but our theoretical adjustment procedure still worked well.
From page 132...
... Figure 1b. 1st Poor Scenario, 1st Pass, All False & 5% True Matches, Observed Data, High Overlap, 1104 Points, beta = 2.47.
From page 133...
... Matches, Outlier-Adjusted Data, 1104 Points, beta = [illegible].78, R-square = 0.40 [horizontal axis: x-variable] Second True Reference Regression Figure 3a displays a scatterplot of X and Y as they would appear if they could be true matches based on a second RL step.
From page 134...
... 2nd Poor Scenario, 2nd Pass, All False & 5% True Matches, Outlier-Adjusted Data, 650 Points, beta = 5.26, R-square = 0.47 [horizontal axis: x-variable]
From page 135...
... In fact, with each true match that is associated with an outlier Y-value, there may be many false matches that have Y-values that are closer to the predicted Y-value than the true match. Comments and Future Study Overall Summary In this paper, we have looked at a very restricted analysis setting: a simple regression of one quantitative dependent variable from one file matched to a single quantitative independent variable from another file.
From page 136...
... . A Method for Calibrating False-Match Rates in Record Linkage, Journal of the American Statistical Association, 90, 694-707. Fellegi, I
From page 137...
... . Fiddling Around with Mismatches and Nonmatches, Proceedings of the Section on Social Statistics, American Statistical Association.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.