Home

User's Guide

1. a e e e To lo lo T To l EEE D 1 0 1 1 0 1 0 1 1 1 0 1 I Ileo lIoeil Iieoeloi Ii loi2rioi 1rco 89 0 0 0 0 01013 0 0 4466329 1 0 Figure 28 table MU The meaning of the table columns is described below columns f m ef u report the estimates of the frequency distributions respectively for links and non links for each pattern of the comparison vector resulting from the selected matching variables for instance the table shown in Figure 28 reports no links for the first pattern the first row of the table while all the pairs of this pattern are estimated as non links columns m e u visualize approx to the third decimal estimate values of the link and non link probabilities respectively as shown in the table of Figure 28 usually non link distribution is concentrated in the upper side of the table that is corresponding to comparison pattern coming from inequality in the matching variables column r visualizes approx to the third decimal the values of the ratio between the link and non link probabilities that is the matching composite weight It is crucial as explained in Section in assigning each pair to the set M or U column p post visualizes approx to the third decimal the values of the posterior probabilities that a pair is a link and it is given by p_post f_m f_m f_u The model estimates are considered not reliable when the condi
2. o S S S Figure 22 Deterministic_true_table 9 4 Probabilistic Model The probabilistic decision model available in RELAIS 2 1 follows the Fellegi and Sunter approach Below we report the main aspects of this approach with the only aim of introducing and formalizing the problem In Appendix A more methodological aspects are provided Let A and B be two lists of size nA and nB The goal of record linkage is to find all the pairs of units a b a in A b in B such that a and b refer to the same unit a b Starting from the set Q la b ac A be B containing all possible pairs of records from the lists A and B with size Q N n X n a record linkage procedure is a decision rule based on the comparison of k matching variables that for each single pair of For the definition of the weights see paragraph 10 34 records can take one of the following decisions link possible link and non link The comparison between the matching variables of the two units a b is made by means of a suitable comparison function depending on the kind of variables and their accuracy For each pair of the set 2 the result of the comparison of the matching variables is summarized in the vector y called comparison vector or comparison pattern For instance when the comparison function applied to the k matching variables is the equality the resulting comparison pattern is a k dimensional vector composed by 1 or 0 depending on
3. iii 12 3 3 Linux Environment Requirements ener enne nnne nnne tenen tnn r nien nennen nnne nennen 13 3 4 Linux Environment Installation and Execution eene 13 Z Ehe RBEATSMBU iii 15 suni EE 16 6 Dataset zx EU OmUERERU DIRE RE UE NEED IUNII EU B di dl 17 Data Pronin oan a e uet ML t Exo E TM M ED en 20 FAM Methodological Aspects e er ee eed reti nre e de E RT Re ER go eere co eter ue ope e EE rr Ed 20 7 2 Diagnostics for Selecting Blocking Variables e 21 7 3 Diagnostics for Selecting Matching Variables enne nennen enne nnne nnns 22 8 Creation and Reduction of the Search Space nnne nnne nennen 23 8 1 Methodological aspect i t retire t I e Pe EE ORE C FERRE ERE GET I gees Leto e M e EE R 23 8 2 Scarch Space Er 24 8 3 Search Space Reduct0On visi nene n RE ARIA ii 24 iS 1 Blocking a creare H 25 BB Istat 8 3 2 Sorted Neighbourhood Method iaia EN RUE alan 26 9 Decisional models 2 5 5 oou Rte bodie ROO Od p pii 27 9L Methodological aspects tiiran sei e rante totos sted dafs de etu pp tee ho Tod pfe loeo iind 28 9 2 Selection of Comparison Functions nennen ennt enne nre tn innen tene nnne teen nennen nnn 29 9 3 Deterministic Decision Models enr ien e eee eror e he E d ai 31 9 4 Probabilistit Model eiu ene eil eil 34
4. Fellegi Sunter method done INFO relais_MU_TABLE view is enabled on menu Utility Figure 29 Linkage Result menu The 1 1 Result menu is disabled if no Linkage 1 1 phase has been executed The Choose Threshold menu as shown in Figure 30 allows to insert the two thresholds Unmatch threshold and Match Threshold The two thresholds take two default values Unmatch threshold 0 90 Match threshold 0 95 It is possible to change these values with the most appropriate depending on the data EA Choose thresholds X Unmatch threshold 0 90 Match threshold 0 95 Figure 30 Choose Thresholds window B RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Computing Fellegi Sunter methad Choose Threshold Fellegi Sunter method done 1 1 Result Match INFO relais MU TABLE view is enabled on menu Utility Reduction to one to one linkage Computing One to One Linkage Residual Dataset Invoking R Unmatch Reduction to 1 1 matching task succeeded Reduction done Cluster Result Possible Match Figure 31 1 1 Result menu As shown in Figure 31 the 1 1 result menu allows to create four different results Match creates a table named Matchtable containing the pairs of record one of data set A and the other of data set B that are considered as match Being a 1 1 result
5. Reduction to one to one linkage g Computing One to One Linkage Exit Invoking R Reduction to 1 1 matching task succeeded Reduction done Screen Clear Figure 40 Utility menu As shown in Figure 40 the utility function are Table Display this functionality allows to display the content of a chosen table In Figure 41 the window to select the table that will be displayed is shown In Figure 42 the content of the selected Contingencytable table is shown xi Select table matching variables M Show Cancel Figure 41 Select table window EA MATCHING_YARIABLES i xj VARIABLE NAME METRIC THRESHOLD BUSINESS NAME JaroWinkler CITY Equality CLASSIFICATION Equality YEAR BEGIN Equality Figure 42 Example of a table displayed Delete Backup this menu allows to delete a database containing a copy of the tables created in a previous execution of RELAIS 47 Change Output Order Key allows to change the order of the data set A and data set B used in the results By default the first data set is the data set A Screen Clear this functionality allows to clear the output window Exit this menu allows to close the application and freeze the current repository Bibliography 1 Belin T R Rubin D B A Method for Calibrating False Match Rates in Record Linkage Journal of the American Statistical Association Vol 90 pp 694 707 1985 2 D
6. Choose Output Folder INFO relais CONTINGENCY TABLE table view is enabled on menu Utility E Computing Fellegi Sunter method Backup Change Field Separator Fellegi Sunter method done Match File Possible Match File INFO relais MU TABLE view is enabled on menu Utility Reduction to one to one linkage Computing One to One Linkage Residual Dataset A Invoking R Residual Dataset B Reduction to 1 1 matching task succeeded Table Selection Reduction done Figure 34 To File menu In Figure 35 the menu Save To file Choose Output Folder is shown 44 Choose Output folder CD LJ Figure 35 Choose Output Folder menu In Figure 36 the menu Save To file Table Selection is shown F4 Table content xi Select table Export Cancel Figure 36 Table Selection menu The Internal backup menu consists of the following menu Create result backup this menu allows creating a database backup with a user chosen name with only the tables containing the final results that is the Matchtable the Possiblemachtable the Unmatchtable the Residual dsa table and the Residual dsb table Thus it will be possible to consult the result tables only Create full backup this menu allows to create a backup with a user chosen name of the entire database that is all the created tables are copied into the new database After restoring from this intern
7. available are Optimal 1 1 linkage on Fellegi Sunter output Greedy 1 1 linkage on Fellegi Sunter output Optimal 1 1 linkage on Rule based Matching Greedy 1 1 linkage on Rule based Matching Details on the selection of unique links phase are in Section 10 2 3 Examples of Record Linkage Work flows RELAIS is based on the consideration that every record linkage process is application dependent As seen in the previous section we can consider the whole process decomposed in its constituting different phases for each phase we can choose between the available techniques or on the basis of a suitable decision model For instance choosing the decision model to apply is not immediate the probabilistic one can be more appropriate for some applications but less appropriate for others for which an empirical decision model could prove more successful Indeed the available tools do not provide a satisfying answer to the various requirements that different applications can exhibit So the RELAIS toolkit is composed by a collection of techniques for each of the singled phase that can be dynamically combined in order to build the best record linkage work flow given the application constraints and data features provided as input see Figure 2 Blocking Jaro JaroWinkler Fellegi Sunter Rule based Matching Workflow 1 Workflow 2 Figure 2 Examples of record linkage workflows 3 Installation In this section we prese
8. 9 4 1 Contingency Table inr LIRA Li 37 9 4 2 Parameter Estimation of the Probabilistic Model and the Table MU seen 38 IO Reduction to matchimg 1 1 iunio ie ee EE EHE ERE AETHERE iaia 40 10 1 Methodological Aspects siseosas eioen iaeia araa aa iaaea ea odna a a AE AEE EE aE E Eaa oaa 40 10 2 Optimized Solution denm rar e ier ttr ir e WR RET E EREE E E ESEE EVE HAS ERR C sexs 40 10 3 Greedy Solution asili alan 41 Li Emkage Result panini mee e ls E e EUH EUH RET RR C IER piani 41 WTA Choice of the Thresholds 5 2 enia rione EDD ERR a ii 41 11 2 The Linkage Result menu er menit i e P er area REIHE eU RR 41 DDS O ew T 43 PRODI n 47 Bibliography cacao A RR a 48 Appendix Parameter Estimation of the Probabilistic Model via the EM Algorithm een 49 1 Introduction RELAIS REcord Linkage At IStat 1s a toolkit providing a set of techniques for dealing with record linkage projects The purpose of record linkage is to identify the same real world entity that can be differently represented in data sources 3 even if unique identifiers are not available or are affected by errors In statistics record linkage is needed for several applications including enriching the information stored in different data sets de duplicating data sets improving the data qu
9. Tm no decision is made and the pair is assigned to a clerical review analysis According to the Fellegi and Sunter theory a decision on the threshold levels has to be made in order to properly manage the trade off between the need of a small number of expected no decisions and small misclassified error rates for the pairs In the following sections the main steps of the probabilistic procedure are shown the details on methodologies are given in Appendix A When blocking method is performed to reduce the search space of pairs RELAIS 2 1 allows to the users two different ways of applying the probabilistic model it can be applied in a one shot way to all the blocks or a specific block can be selected Anyway the common 35 preliminary operation to do for executing the linkage procedure is the selection of the matching variables and the choice of the related comparison functions and thresholds as shown in the following figures PA Select matching variables X Variables Chosen variables CITY BUSINESS NAME ADDRESS CLASSIFICATION YEAR BEGIN Remove RANGE CLASS MATCHREF Figure 23 Variables selection It is important to underline that at least 3 matching variables need to be chosen in the current implementation of RELAIS Set metric and threshold Match Variable BUSINESS NAME d Similarity Metric Threshold 0 9 Figure 24 Metrics and thresholds setting 36 ES CONTINGENCY TABLE BUSINESS NA
10. non matches and possible matches are those in the cross product 4 x B If a de duplication problem is considered the space is A x A 1 2 When dealing with large data sets comparing all the pairs a b a belonging to A and b belonging to B in the cross product is almost impracticable in fact while the number of possible matches increases linearly the computational problem raises quadratic being the complexity O n 15 To reduce this complexity it is necessary to reduce the number of pairs a b to be compared There are many different techniques that can be applied to reduce the search space blocking and sorted neighbourhood are the two main methods Blocking consists of partitioning the two sets into blocks and of considering linkable only records within each block The partition is made through blocking keys two records belong to the same block if all the blocking keys are equal or if a hash function applied to the blocking keys of the two records gives the same result Sorted neighbourhood sorts the two input files on a blocking key and searches possible matching records only inside a window of a fixed dimension which slides on the two ordered record sets Starting from the reduced search space we can apply different decision models that define the rules used to determine whether a pair of records a b is a match a non match or a possible match The core of a record linkage process is the choice of a decision model that enables t
11. specified by the user or selected among those listed 17 Load dataset from input files xj Select dataset A input file Dataset A Field separator MA Select dataset B input file DatasetB T Field separator X r o Reset Lead Camel Figure 7 Selection of input data sets Each data set must be a text file The new line character defines each single record in the data set the separator character defines unambiguously the limits of each variable in a record thus the separator character cannot appear as part of a variable value Moreover the separator character has to be unique in all the data set Note The use of a space as field separator is allowed It 1s necessary to insert the space character in the Field separator combo box The first record must specify the data set schema The two data set schema can be different but must contain a set of common variables that are used in the following phases of a record linkage process The common variables must have the same name it is important to notice that the whole data set is case sensitive The common variables are detected in a phase called schema reconciliation In the reading phase unique identifier are added for each data set named respectively key dsa for the data set A and key dsb for the data set B Choosing the Read from DB Oracle Menu or the Read from DB MySQL Menu the window sh
12. to one record of data set A corresponds only one record of data set B 42 Possible Match creates a table named Possiblematchtable containing the pairs of record that could be a match or a non match and that need a clerical review Data set residual creates two tables named Residual dsa and Residual dsb that contain respectively the non matched including the possible match records of data set A first data set and the non matched including the possible match records of data set B second data set Unmatch creates a table named Unmatchtable that contains the pairs of records that do not correspond to a match B RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Computing Fellegi Sunter method Choose Threshold Fellegi Sunter method done 1 1 Result gt INFO relais_MU_TABLE view is enabled on menu Utility F Reduction to ane to one linkage ees Match Computing One to One Linkage Possible Match Invoking R Residual Dataset Reduction to 1 1 matching task succeeded lnmatch Reduction done Figure 32 Cluster Result menu As shown in Figure 32 the Cluster result menu allows to create four different results Match creates a table named Matchtable containing the pairs of record one of data set A and the other of data set B that are considered as match Being a N M result to one record of data set A
13. Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Connection to database Connection started Lasttransaction executed Load Dataset from inputfiles at 2010 04 30 11 04 00 Successfully terminated Figure 4 Connection to database Choosing the item Open backup the window shown in Figure 5 will be open This window allows to choose and to restore a process previously saved as internal backup see Section 12 By choosing this functionality the content of the repository is removed and a connection starts to the repository initialized with the content of the chosen process 16 x Select backup name demo8000 i Restore Cancel Figure 5 Selection of a backup 6 Dataset The Dataset menu lists the different methods available to load the two input data sets As shown in Figure 6 these methods are Read from input files Read from DB Oracle Read from DB MySQL Read from backup Read from residual Dataset l Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Read from Input Files Read from DB Oracle Read from DB MySQL Read from Backup Read from Residual Figure 6 Data set menu Choosing Read from input file menu the window shown in Figure 7 will be open This window allows to insert the input file paths and to specify for each of them a field separator character This character can be
14. each of the record linkage phases so that the resulting work flow is actually built on the basis of application and data specific requirements It has been developed as an open source project so several solutions already available for record linkage in the scientific community can be easily re used It is released under the EUPL license European Union Public License It has been implemented by using two languages based on different paradigms Java an object oriented language and R a functional language This choice depends on our belief that a record linkage process is composed of techniques for manipulating data for which Java is more appropriate and of calculation oriented techniques for which R is a preferable choice The choice of Java and R is also in line with the open source philosophy of the RELAIS project It has been implemented using a relational database architecture in particular it is based on a mySql environment that is also in line with the open source philosophy of the RELAIS project The RELAIS project aims to provide record linkage techniques easily accessible to not expert users Indeed the developed system has a GUI Graphical User Interface that on the one hand permits to build record linkage work flows with a good flexibility On the other hand it checks the execution order among the different provided techniques whereas precedence rules must be controlled The current version of RELAIS provi
15. files in progress Dataset A GiDocumenti utenteitoscoiDCMTTYIRELAISYDAThdata 3001cens300 bdt Dataset B GADocumenti utenteitosco DCMT1 RELAISIDATKdata_300lide300 tet Data set loaded Search Space Reduction gt Figure 13 Selection of the cross product 8 3 Search Space Reduction In some linkage applications the search space reduction can be useful for two different reasons First of all when managing huge amount of data it can be useful to reduce the execution time and the used memory space by means of a suitable partition of the entire space of pairs coming from the cross product of the input files Second if a probabilistic record linkage approach is used there can be statistical problems in dealing with huge amount of data In fact the probabilistic models generally used in order to estimate the conditional 24 probabilities of being link or being non link do not allow to correctly identify such probabilities when the number of possible links is too small with respect to the whole set of candidate pairs The statistical problem can be overcome by means of the creation of suitable groups or partitions of the whole cross product set of pairs so as in each sub group the number of expected links is not much smaller than the number of candidate pairs In particular when the probabilistic model assumes that the overall candidate pairs are a mixture of the two unknown distributions of the true links and of the true
16. non links as in the current version of the probabilistic decision model implemented by RELAIS if one of the two unknown populations the matches is really too small with respect to the other the non matches it is possible that the estimation mechanism is not able to identify it in fact the estimation algorithm could still converge but it could estimate another latent phenomenon different from the linkage In this situations some authors 10 suggest to apply a reduction of the pairs space so that the expected number of links is not below 5 of the overall compared pairs The 5 is just a suggestion actually a prudential suggestion in practice if the matching variables have a high identification power good results can be achieved even if the expected number of links was around the 196o of the overall compared pairs Among the several reduction techniques the current version RELAIS provides the Blocking method and the Sorted Neighbourhood Method The first step required in RELAIS is the selection of the variables for the reduction This task can be supported by the data profiling activity See Section 7 2 8 3 1 Blocking An easy way to reduce the search space is to restrict the comparisons only to the pairs that report the same values for the selected blocking variables The blocking step requires the selection of the blocking key one or more variables after which the creation of blocks is automatically performed To select t
17. not used If two or more adjacent not separated by a vowel letters have the same numeric value only one is used This also applies if the first letter and the second one have the same value the second letter would not be used to generate a digit If there are less then six consonants in the string the code is filled out with zeros This approach is very promising for disambiguation of transliterated misspell names 9 3 Deterministic Decision Models The deterministic record linkage is associated with the use of formal decision rules and it can be adopted instead of probabilistic method in presence of error free unique identifiers such a fiscal code or when suitable matching variables with high quality and discriminating power are available and can be combined so as to establish the pairs link status in this case the deterministic approach is very fast and effective and its adoption is appropriate The Decision Model menu lists the available models to solve a record linkage problem It contains two menus Deterministic Probabilistic In particular the Deterministic menu contains two empirical models based on deterministic rules to classify pairs of record as match or non match Deterministic models in this implementation do not admit the state of possiblematch RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Database will be initialized Dete
18. the R environment and browse the menu Packages Install packages After choosing a CRAN a list of packages will be displayed In this list click on the IpSolve package and on the RODBC package 3 3 Linux Environment Requirements To run RELAIS 2 1 Java R and mySql environments must have been installed More precisely Java 2 Runtime environment J2SE 6 0 with at least JDK 6 0 x mySql server environment www mysql com mySql Odbc 5 x Driver or higher R 2 9 or higher R package IpSolve R package RODBC 3 4 Linux Environment Installation and Execution To install RELAIS2 1 starting from the RELAIS2 1 zip file it is necessary to unzip this file in a directory for example C RELAIS This directory will contain the following files Tf there are problems to reach a CRAN click with the right button of the mouse on the R icon and select the Properties menu in the Connection menu write internet2 at the end of the destination For example C Programs R R 2 8 1 bin Rgui exe internet2 Relais bat Relais jar Mu gen embedded R LPR and the lib directory which contains the files my sql connecotr java 5 1 10 bin jar and ojdbc14 jar that are necessary for the connection to the databases MySQL and Oracle During the installation of mySql environment it is necessary to create the user root specifying a password 6699 Moreover it is necessary to create an anonymous user givi
19. the Service Pack version currently installed If an update is needed just click on the Windows Update icon on the Start menu and follow the instructions 3 2 Windows Environment Installation and Execution To install RELAIS 2 1 starting from the setup RELAIS exe file just execute this file and follow the instruction The directory in which Relais 2 1 has been installed will contain the following files Relais bat Relais2 1 jar mu gen embedded R LPR and the lib directory which contains the files my sql connector java 5 1 10 bin jar and ojdbc14 jar that are necessary for the connection to the databases MySQL and Oracle To run the program just double click on the Relais bat file While installing mySql environment it is necessary to create an anonymous account besides the system account After the installation it is necessary for using RODBC package to create a new data source After creating relais data base in mySql environment browse the menu in the XP Windows Operating System Start Settings Control Panel gt Administration Tools Data Sources ODBC In the window that will be opened the User DSN menu must be chosen in this menu clicking on the Add button a new window will be open in this window the mySql ODBC driver must be chosen In the new window that will be opened the following settings must be done data source relais server localhost database relais To install R packages run
20. 7 Research Conference Arlington VA USA 2007 10 Yancey W E Improving EM Algorithm Estimates for Record Linkage Parameters RESEARCH REPORT SERIES Statistics 2004 01 48 11 Cibella N Fortini M Scannapieco M Tosco L Tuoto T Theory and practice of developing a record linkage software In Proc Of the Combination of surveys and administrative data Workshop of the CENEX Statistical Methodology Project Area Integration of survey and administrative data Vienna Austria 2008 12 Cibella N Fernandez G L Fortini M Guig M Hernandez F Scannapieco M Tosco L Tuoto T Sharing Solutions for Record Linkage the RELAIS Software and the Italian and Spanish Experiences In Proc Of the NTTS New Techniques and Technologies for Statistics Conference Bruxelles Belgium 2009 13 Newcombe H Kennedy J Axford S and James A Automatic Linkage of Vital Records Science Vol 130 pp 954 959 14 Gill L Methods for Automatic Record Matching and Linkage and their Use in National Statistics National Statistics Methodological Series no 25 HMSO Norwich UK 2001 15 Christen P amp Goiser K 2005 Assessing deduplication and data linkage quality What to measure in Proceedings of the fourth Australasian Data Mining Conference AusDM 2005 Sydney 16 Jaro M A 1995 Probabilistic linkage of large public health data file Statistics in Medicine 14 491 498 17 Winkler W E 1999 The state of record l
21. BB Istat Editors Monica Scannapieco DCMT SVS C Laura Tosco DCMT SVS C Luca Valentino DCMT SVS C User s Guide Version 2 1 Nicoletta Cibella DCMT U Tiziana Tuoto DCMT 1 Marco Fortini DCCG MTO A BB Istat Index TAGE eC M ean dare nite eee 2 I Introduction ic erac dasa Saeco eva e e e eer th eas CR ERE ER an 4 1 Addons Of RELAIS 2 1 nene AIA ALOE eas PEU e BECHER EHE EEQE RE ee EQ PER er Per eie en 5 2 Record Linkage Processes in RELAIS 2 1 reinsa enee n ae iper nnne Se enne EE ENEE R inneren nnns 2 PARE RM 5 2 2 RBEAIS 2 1 ECW Ques o5 eerdoheenebuttudideum sub LIN iii ILLA ES 8 2 1 1 Phase 1 Choice of Matching Variables enne nennen nnne nnne 8 2 1 2 Phase 2 Choice of Comparison FUnctions c cscceccscssesseneesecseeseeseeseesesseeeceeceeseeeeeceecessececaeeaesaesaeeaesaeeaeeeaeen 8 2 1 3 Phase 3 Creation and Reduction of the Search Space of Link Candidate Pairs n 9 2 1 4 Phase 4 Choice of Decision Model ee dede tet aa tete le e Rea iue HE Haan ke E ord 10 2 1 5 Phase 5 Selection of Unique Links nennen enne nnne tertie nennen nnne 10 2 3 Examples of Record Linkage Work fl0WS i 10 3 Installations sfila Ela ERROR RU DINERS 11 3 1 Windows Environment Requirements i 11 3 2 Windows Environment Installation and Execution
22. ME CLASSIFICATION YEAR BEGIM FREQUENCY 131660 2615 e eap 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 nm nm ele X Toloi Tolol o lo jo 2joj2jo 2joj 2 o 2 o 2 o 2 o Figure 25 Contingency table 9 4 1 Contingency Table The first step of the probabilistic procedure consists of computing the comparison vector P y pores joel given by the result of the function on the k matching variables for all the pairs in the space Q In fact starting from the vector distribution among the pairs reported in the contingency table the goal 1s the estimation of the probability distribution of the unknown random variable link status which assigns each pair to the set M or to the set U The comparison vector considered in RELAIS 2 1 is a binary one i e for each matching variable it reports the equality corresponding to value 1 or the inequality corresponding to value 0 between the units As already reported in Section 9 2 the comparison vector 1s binary even if a continuous comparison function has been applied actually based on the selected threshold RELAIS 2 1 converts the results in binary elements treating all the results above the threshold as 1 and all results below the threshold as 0 This choice is a very common simplification in order to design the latent class model as simple as possible The evaluation of the contingency table is available by bro
23. S aims to join the statistical and computational essence of the linkage problem 2 1 Phases The idea of decomposing the record linkage process in its phases is the core of the RELAIS toolkit and makes the whole process easier to manage each phase has its own windows Now a general overview on the main phases is given while more details are added later in the specific paragraph devoted to the considered phase The main phases shown in Figure 1 are o Data cleaning preparation of the input files pre processing o Choice of the common identifying attributes matching variables o Choice of comparison functions o Search space creation reduction o Choice of a decision model o Record linkage procedures evaluation Data cleaning a Choice of Search preparation of matching space Decision the input files variables and reduction model Possible comparison AxB matches function Unmatch Figure 1 Phases of record linkage Generally speaking the preparation of input files is the first phase which according to 14 requires 75 of the whole effort to implement a record linkage procedure in fact data can be recorded in different formats some items may be missing or with inconsistencies or errors The key job of this phase is to convert the input data in a defined format resolving the inconsistencies in order to reduce errors deriving from an incorrect reported data In this phase null strings are canceled
24. TABLE CITY 12 CONTINGENCY TABLE CITY 13 MU TABLE CITY 12 Roma Sassari CONTINGENCY TABLE CITY 14 CONTINGENCY TABLE CITY 15 MU TABLE CITY 13 MU TABLE CITY 14 MU TABLE CITY 15 Savona CONTINGENCY TABLE CITY 16 MU TABLE CITY 18 Torino Venezia CONTINGENCY_TABLE_CITY_17 CONTINGENCY_TABLE_CITY_18 MU_TABLE_CITY_17 Verona CONTINGENCY_TABLE_CITY_19 Figure 15 Block Modality table 8 3 2 Sorted Neighbourhood Method MU TABLE CITY 18 MU TABLE CITY 19 The Sorted Neighbourhood Method SNM consists of ordering the two data sets to link according to a sorting variable Then a fix size w window runs on the unified sorted list and all the pairs falling into the window are considered as candidate pairs The sorted neighbourhood also requires the selection of the sorting variable in the same way as the blocking method However it also requires the specification of the size of the comparison window The size will be 26 selected by taking into account the risk of missing true links for instance 1f the number of units with the same value of the sorted variable is larger than the fixed size Figure 16 shows the window for the selection of the sorting key one or more variables Figure 17 shows the window where the neighbourhood window size can be specified This latter is automatically opened after the selection of the sorting variable according to Search Space Creation Searc
25. Winkler ADDRESS 0 9 is defined A pair of records verifies this sub rule 1f and only if the two records have the same attribute CITY and the attribute ADDRESS similar in the sense that applying the JaroWinkler metric the result is greater than 0 9 Moreover in the example shown in Figure 20 an other sub rule Equality CITY And Levenshtein BUSINESS_NAME gt 0 9 is defined the two sub rules are separated by the operator Or A pair of records verifies the rule if and only if at least one of the two sub rules is satisfied The outputs of the Deterministic Rule based model are contained in the following tables CONTINGENCY TABLE DETERMINISTIC TRUE TABLE Each condition is applied to pairs in the search space The result of the condition applied to a pair is 1 if the condition is verified 0 otherwise The concatenation of all the results of the conditions is named 33 comparison pattern The frequency of each comparison pattern in the search space is saved in the CONTINGENCY TABLE The patterns verifying the match rule are saved in the DETERMINISTIC TRUE TABLE Referring to the rule defined in Figure 20 the output tables are shown in Figure 21 and in Figure 22 Note that each condition is identified by the variable name BUSINESS NAME ADDRESS FREQUENCY IoIi 1 o Figure 21 Contingency Table FA DETERMINISTIC_TRUE_TABLE BUSINESS_NAME ADDRESS WEIGHT
26. abbreviations punctuation marks upper lower cases etc are cleaned and any necessary transformation is carried out to standardize variables Furthermore the spelling variations are replaced with standard spelling for the common words After the previous phase it is important to choose matching variables that are as suitable as possible for the considered linking process The matching attributes are generally chosen by a domain expert in Relais 2 1 a set of meta data support the users in the choice of matching attributes If unique identifiers are available in the linkable data sources the easiest and most efficient way is to use these ones as link variables but very strict controls need to be made in case of using numeric identifiers alone Variables like name surname address date of birth can be used jointly instead of using each of them separately in such a way one can overcome problems like the wide variations of the name spelling or the changes in surname depending on the variability of the marital status It is evident that the more heterogeneous are the items of a variable the higher is its identification power moreover if missing cases are relevant in a field it is not useful to choose it as a matching variable Comparison functions are used to compute the distance between records on the values of the chosen matching variables In a linking process of two data sets say A and B the pairs needed to be classified as matches
27. able containing record of data set A that are not classified as match RESIDUAL DSB table containing record of data set B that are not classified as match Choosing the Rule based menu the window shown in Figure 20 will be open This window allows to define a complex rule for classify pairs as match 32 Bre ___T x MATCH RULE Equality CL T Y And Levenshtein BUSINESS_NAME gt 0 9 Or Equality CITY And JaroWinkler ADDRE SS z0 9 Variable ADDRESS lv Add Condition Metric JaroWinkler M Threshold 0 9 Canc Clear rule Figure 20 Input deterministic match rule This complex rule is named match rule and is organized in sub rules Each sub rule may consist of conditions that must be checked at once these conditions are separated by an AND operator The different sub rules are separated by an OR operator that can be inserted in the rule using the Or button A single condition can be added using the Add Condition button the variable the metric and the threshold to be used must be defined in advance As shown in Figure 20 the variable ADDRESS the metric JaroWinkler and the Threshold 0 9 are chosen Using the Add Condition button the condition JaroWinkler ADDRESS 0 9 is added to the rule This condition is separated by the operator And from the Equality CITY condition already existing Therefore the sub rule Equality CITY And Jaro
28. ables with respect to the matching status In general the correct identification of the links depends on the number of matching variables but at the same time if strongly correlated variables or variables with correlated errors are included in the model the estimates could be not reliable thus increasing the values of the matching weights without improving the identification of the links For this reason also the correlation indicator are calculated The correlation measures the relationship between the selected variable and another one picked up by the users among all the remaining variables The formula for the index is 2 k h n Cana Er qs minl k 1 4 1 where i and j are the items of the first and the second chosen variables i 1 2 k and j 1 2 h nj is the joined observed frequency of the i th and j th items and n nj is the observed marginal frequency of the i th j item 8 Creation and Reduction of the Search Space The search space of the candidate pairs 1s naturally formed by the cross product of the records stored in each input file The functionality that implements the cross product is described in Section 8 2 Section 8 3 introduces the problem of reducing the search space In Section 8 3 1 the blocking strategy is described while Section 8 3 2 introduces the sorted neighbourhood method 23 8 1 Methodological aspects In a linking process of two data sets say A and B the pairs need
29. al backup all the results output and intermediate results will be available to the user In Figure 37 the Backup menu is shown 45 A RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility J Ue au d U INFO relais_CONTINGENCY_TABLE table view is enabled on menu Utility quod Computing Fellegi Sunter method Backup Create Result Backup Fellegi Sunter method done Create Full Backup INFO relais MU TABLE view is enabled on menu Utility Reduction to one to one linkage Computing One to One Linkage Invoking R Reduction to 1 1 matching task succeeded Reduction done Figure 37 Backup menu In Figure 38 the menu Save gt Internal backup Create result backup is shown ITITTTREREEEEEER RM 4 Choose backup name Create Cancel Figure 38 Create result backup menu In Figure 39 the menu Save gt Internal backup gt Create full backup is shown ETITTCEEEEN Choose backup name Create Cancel Figure 39 Create full backup menu 46 13 Utility The Utility menu lists some available menus of general utility DE INFO relaie CONTINGENCY TABLE table view is enabled on menu Utility UIS Computing Fellegi Sunter method Delete Backup Fellegi Sunter method done Change Output Order Key INFO relais MU TABLE view is enabled on menu Utility
30. ality of a source measuring a population amount by capture recapture method checking the confidentiality of public use micro data Starting from the earliest contributions dated back to 1959 13 there has been a proliferation of different approaches based on statistics databases machine learning knowledge representation However despite this proliferation no particular record linkage technique has emerged as the best solution for all cases We believe that such a solution does not actually exist and that an alternative strategy should be adopted 5 In fact record linkage can be seen as a complex process consisting of several phases involving different knowledge areas moreover several different techniques can be adopted for each phase We believe that the choice of the most appropriate technique not only depends on the practitioner s skill but most of all it is application specific Moreover in some applications there is no evidence to prefer a given method to others or of the fact that different choices at some linkage stage could bring to the same results This is why it could be reasonable to dynamically select the most appropriate technique for each phase and to combine the selected techniques for building a record linkage work flow of a given application RELAIS is a toolkit relying on these ideas The principal features of RELAIS are 5 9 11 12 Itis designed and developed to allow the combination of different techniques for
31. an executable file chmod 777 Relais bat just double click on the Relais bat file icon To install R packages run the R environment and write the instruction install packages name of package dependencies TRUE For example to install IpSolve package write the following instruction install packages IpSolve dependencies TRUE 4 The RELAIS Menu As shown in Figure 3 the RELAIS menu is composed by the following items Project Dataset Data Profiling Search Space Creation 9 9 9 Decision Model Linkage 1 1 Linkage Result Save Utility In following sections each of this menu will be detailed 5 Project Connection items are listed in the Project menu as shown in Figure 3 RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility New Project Open Last Project Open Backup Figure 3 Project menu Choosing the New Project item the internal database is initialized removing the content of the current repository Thus choosing this option the new connection refers to an empty database This operation is required in first run of the software Choosing the item Open last project a new connection to the database is performed without removing the content of the repository which 1s up to date to the last run exit and the last transaction is returned on the output window see Figure 4 RELAIS Project Dataset Data Profiling Search
32. arch space See FSFail Rout for more details The estimated conditional probabilities m Y and u y can be verified in file FSFail Rout in the RELAIS2 1 folder A warning message is given when the estimates conditional probabilities of at least one of the matching variables are nearly the boundary m y gt 0 99999 or u y gt 0 99999 In such a situations the following message is shown WARNING one or more nearly boundary parameters See FSFail Rout for more details the estimation results are saved in the MU table and the estimated conditional probabilities can be verified in the file FSFail Rout in the RELAIS2 1 folder When the probabilistic model is applied on several blocks a FSFail Rout file is created for each block giving erroneous or warning results 51
33. arols t where is the length of common prefix at the beginning of the string p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes In Relais the value of constant p is the standard used in Winkler s work 17 p 0 1 and the value of can not exceed 6 This adjustment gives more favorable ratings to strings that match from the beginning for a set prefix length 6 3 Grams 18 Q grams are generally used in approximate string matching by sliding a window of length q over the characters of a string s to create a number of q length grams in the case of Relais 2 1 we considered q equal to 3 for matching A match is then rated as number of q gram matches within the second string t over possible q grams When two strings s and t have a small edit distance they also have a large number of q grams in common 7 Soundex comparison function The Soundex function is a rude phonetic indexing scheme that generally focuses on individuals names with this metric the errors in phonetic are easily recognized e g the names John Johne and Jon referred to the same person This is a term based evaluation where each term is given a Soundex code each Soundex code consists of a letter the first one of the string and five numbers between 0 and 6 The numbers are based on the consonants as in the following table 30 1 B B F V 2 C S K G J Q X Z 3 D T 4L 5 M N 6 R The vowels are
34. ate task can take some minutes Fellegi Sunter One Block gt EM Estimation Contingency table evaluation Contingency table created INFO relais CONTINGENCY TABLE table view is enabled on menu Utility Figure 27 EM estimation menu After the execution of the parameter estimates the result must be visualized by means of loading the table containing the computation The table is loaded by browsing the menu Utility Table Display Figure 28 shows the loaded table 38 business name i classification year begin fm fu m u r p post 0 3492 55459 128177 44 0 39638 0 84771 0 46759 0 02645 e 588 45339 8781 54661 0 06698 0 05808 1 15323 0 0628 313 26504 2301 73496 0 03566 0 01522 2 34224 0 1198 2405 43876 11017 56124 0 27378 0 07287 3 75737 0 1792 41 71772 124 28228 0 00475 8 2E 4 5 77678 0 25131 360 15479 668 84521 0 04099 0 00442 9 26699 0 35 136 33745 12466255 0 01552 8 2E 4 18 8215 0 52237 24 07464 8 92536 0 00274 6 0E 5 46 42042 0 72953 675 99984 1 6E 4 0 07694 0 0 7 5008830 1 0 58 99999 1 0E 5 0 00672 0 0 1 8489804 1 0 36 0 0 0 0 0041 0 0 3 7573590 1 0 446 99999 1 0E 5 0 05088 0 0 6 0274678 1 0 30 0 0 0 0 00341 0 0 9 2669531 1 0 67 0 0 0 0 00763 0 0 1 4865846 1 0 30 0 0 0 0 00341 0 0 3 0192926 1 0 gt gt ESSEN
35. c Comparison 3Grams e Dice e Jaro e JaroWinkler Levenshtein e Soundex In Section 9 2 the details of such functions are introduced 2 1 3 Phase 3 Creation and Reduction of the Search Space of Link Candidate Pairs A first functionality that RELAIS 2 1 makes available is the creation of the search space of the pairs to be linked by means of Cartesian Product A reduction of the search space can be performed by means of two available techniques namely Blocking Sorted Neighborhood Both techniques need the setting of blocking variables Such a choice is supported by a set of meta data partially overlapping with those available for the choice of the matching variables These are Completeness Accuracy Consistencies Categories Frequency Distributions Blocking Adequacy Entropy In Section 7 3 a detailed discussion on such meta data is available 2 1 4 Phase 4 Choice of Decision Model Two kind of decision models are currently available in RELAIS namely deterministic and probabilistic The deterministic model about which details are discussed in section 9 3 includes Exact matching Rule based matching The probabilistic model consists of an implementation of the Fellegi Sunter decision model 4 The details of this model are in Section 9 4 2 1 5 Phase 5 Selection of Unique Links In this phase a reduction from a matching M N to a matching 1 1 can be performed The techniques currently
36. could correspond one or more records of data set B and to one record of data set B could correspond one or more records of data set A Possible Match creates a table named Possiblematchtable containing the pairs of record that could be a match or a non match therefore need a clerical review Data set residual creates two tables named Residual dsa and Residual dsb thatcontain respectively the non matched records of data set A first data set and the non matched records of data set B second data set Unmatch creates a table named Unmatchtable that contains the pairs of records that do not correspond to a match 12 Save As shown in Figure 33 starting from the Save menu it is possible to choose one of the following menu To file allows to save final and partial results to file Backup allows to produce an internal backup from which restarting the execution RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Computing Fellegi Sunter method Tofile Fellegi Sunter method done Backup gt INFO relais_MU_TABLE view is enabled on menu Utility Reduction to one to one linkage Computing One to One Linkage Invoking R Reduction to 1 1 matching task succeeded Reduction done The To File menu consists of the following menu Choose output folder this menu allows to choose the folder in which the result
37. des a set of techniques to execute record linkage applications according to the decomposition described in Section 2 1 1 Add ons of RELAIS 2 1 With respect to RELAIS version 2 0 RELAIS 2 1 has the following new features o Input of data from database Oracle or MySQL o New default input values for the parameter estimation of the probabilistic model o New definition of the candidate pairs for the optimal reduction to 1 1 matching results o More than one variable selection in use of sorted neighborhood method for search space reduction Minor bugs have been solved 2 Record Linkage Processes in RELAIS 2 1 The complexity of the whole linking process relies on several aspects of different nature If unique identifiers are available in the considered data sources the problem can be quite easily treated because its complexity is reduced only to computational constraints But generally unique identifiers are not available and more sophisticated statistical procedures relying on matching variables chosen for linking data are requested However data sources are often hard to combine since errors or lacking information in the record identifiers may complicate the integrated use of the information in order to overcome such obstacles record linkage techniques provide multidisciplinary set of methods and practices whose purpose is to identify the same real world entity which can be differently represented in one or more data sources RELAI
38. details 7 3 Diagnostics for Selecting Matching Variables Generally speaking the matching variables determine if a pair of records identifies or not the same unit As for the blocking variables also for the matching ones it is suitable to select those with a high identification power and a low error and missing rates The identification power of a variable increases according to its different values and depends on the distribution of these values among the units when a variable has a large number of categories but few of these are much more frequent than others it would be useless to select them as matching variables e g the surname Rossi can be more frequent than some other surnames The larger the number of categories of a variable is the higher is its discriminative power The indicators that can be calculated in order to have a suggestion useful to the selection are those indicated in Figure 12 and described in Section 7 1 22 Settings MetaData Variable MMNAS Completeness Accuracy Consistency Entropy _ Correlation CODPRB _ Frequency Distributions a Figure 12 Matching variable indicators In case of the probabilistic model implemented in the current version of RELAIS in order to identify the model parameters at least three matching variables must be selected Furthermore the adopted model assumes the conditional independence of the matching vari
39. e interval between 0 the minimum and the worst values and 1 the highest and the best for the selected indicator The indicators are calculated separately for the two data sets considered and for more than one variable depending on the choices adopted by the user For each of the selected variables the toolkit outputs the values of the picked out indicators It s important to underline that we consider a set of indicators both suitable for selecting the blocking and the matching variables but as the aims of the selection are different there are also some measures that differ In particular the indicators common in the selection of blocking and matching variable are completeness accuracy 1 2 3 consistency 4 categories 5 frequency distribution 20 6 entropy In detail the completeness 1 is the proportion of non missing records on the overall Na or Na for the considered variable in data set A and B respectively A completeness equal to 1 means no missing value in the variables blocking or matching one The accuracy 2 implies the comparison of the recorded value of a variable with a dictionary or a set of reference values that are known to be correct for the variable this measure provides the number of correct not out of range values on the overall The consistency 3 gives information on how well each item of the selected variable relates to the items of a selected variable e g province and r
40. ed to be classified as matches non matches and possible matches are those in the cross product A x B In case we re considering the de duplication problem the space is A x A 1 2 Many problems arise when dealing with large data sets connected with both computational and statistical aspects see Section 8 3 To reduce this complexity it is necessary to reduce the number of pairs a b a belonging to A and b belonging to B so as to have a set of pairs of manageable size Starting from this reduced search space we can apply different decision models that define the rules used to determine whether a pair of records a b is a match a non match or a possible match 8 2 Search Space Creation The cross product is a functionality that can be selected after reading the input data sets It is important to note that when the search space size is huge e g as a general indication formed by more than 25 000 millions of pairs with original data sets each of 5 000 records it is not suitable to create the overall search space via the cross product while one of the reduction methods described in Sections 8 3 1 and 8 3 2 is suggested Figure 13 shows how to select the cross product option browsing the menu Search Space Creation gt Cross Product RELAIS Project Dataset Data Profiling Search Space Creation Decision Model linkage 1 1 Linkage Result Save Utility Database will be initialized Cross Product System is ready Loading
41. egion It is necessary to indicate the variable that you want to compare the item with and to give the set of values that the associated variable can assume The categories 4 report the number of different categories in the selected variable for both data sets A and B The frequency distribution 5 returns in output the tables for A and B related to the frequency distribution of the variable sorted by frequency The entropy 6 calculates the Gini index for the selected variables both for A and B data sets An index equal to 0 means that all the frequencies are concentrated in a single item of the variables instead a value of the index equal to 1 means a complete heterogeneity in the variable all the i items have the same relative frequencies f 1 K The formula adopted is K G L f log f i 1 where i i 1 2 K are the items of the selected variable 7 2 Diagnostics for Selecting Blocking Variables In order to reduce the search space of the candidate pairs the most suitable variables are generally those most discriminating and accurate i e not affected by errors or missing values Usually variables as zip code municipality geographic area year of birth can be chosen as blocking variables when dealing with individual records Then links are searched only within the blocks assuming that there are no matches out of them therefore if the blocking variable is error affected some true links could be missed Further
42. empster A P Laird N M Rubin D B Maximum Likelihood from Incomplete Data via EM Algorithm Journal of the Royal Statistical Society Series A Vol 153 pp 287 320 1977 3 Elmagarmid A K Ipeirotis P G Verykios V S Duplicate Record Detection A Survey IEEE Transaction on Knowledge and Data Engineering Vol 19 N 1 pp 1 16 2007 4 Fellegi I P and A B Sunter A Theory for Record Linkage Journal of the American Statistical Association Vol 64 pp 1183 1210 1969 5 Fortini M Scannapieco M Tosco L and Tuoto T Towards an Open Source Toolkit for Building Record Linkage Workflows In Proc of SIGMOD 2006 Workshop on Information Quality in Information Systems IQIS 06 Chicago USA 2006 6 Hernandez M and Stolfo S Real world Data is Dirty Data Cleansing and the Merge Purge Problem Journal of Data Mining and Knowledge Discovery Vol 1 N 2 1998 7 Jaro M A Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa Florida Journal of the American Statistical Association Vol 84 pp 414 420 1989 8 Koudas N and Srivastava D Approximate Joins Concepts and Techniques In Proc of International Conference on Very Large Data Bases VLDB Trondheim Norway 2005 9 Tuoto T Cibella N Fortini M Scannapieco M and Tosco L RELAIS Don t Get Lost in a Record Linkage Project In Proc of the Federal Committee on Statistical Methodologies FCSM 200
43. files will be written Change field separator allows to change the default value of the field separator that will be used in the write output phase Match File allows to write the content of the Matchtable already created to a file named Match txt The records of a pair are written one below the other and only the common variables are reported Possible match file allows to write the content of the Possiblematchtable already created to a file named PossibleMatch txt The records of a pair are written one below the other and only the common variables are reported Residual data set A allows to write the content of the Residual dsa table already created to a file named ResidualDS A txt This file contains records of the first dataset named dsA that are not match that is are possible match or non match Residual data set B allows to write the content of the Residual dsb table already created to a file named ResidualDSB txt This file contains records of the second data set named dsB that are not match that 1s are possible match or non match Table selection this menu open a window from which it is possible to choose one of the table created and write it in a file having the same name of the table and extension txt In Figure 34 the menu Save To file is shown RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility AANER bree Tofile
44. g variable quality in the deterministic procedure some links can be missed due to presence of errors or missing values in the matching variables so the choice between the deterministic and probabilistic methods must take into account the availability the stability and the uniqueness of the variables in the files 14 It is important also to underline that in a deterministic context the linkage quality can be assessed only by means of re linkage procedures or accurate and expensive clerical reviews The probabilistic approach is more complex and formal abut can solve problems caused by bad quality data In particular it can be helpful when differently spelled swapped or misreported variables are stored in the two data files In addition the probabilistic procedure allows to evaluate the linkage errors calculating the likelihood of the correct match Generally speaking the deterministic and the probabilistic approaches can be combined in a two step process firstly the deterministic method can be performed on the high quality variables then the probabilistic 28 approach can be adopted on the residuals the units not linked in the first step however the joint use of the two techniques depends on the aims of the whole linkage project 9 2 Selection of Comparison Functions Comparison functions measure the similarity between two fields Many of them are proposed in literature and provided in RELAIS 2 1 as described below Generally
45. gorithm has been modified in order to exclude all pairs with linkage probabilities less than 0 5 in such a way the number of constraints is reduced as well as the complexity of the problem when the number of units belonging to pairs with linkage 40 probability less than 0 5 is large If such a condition is not verified huge amount of data can required large data structure that can reach the maximum allocation of memory permitted by R Experiments show that the algorithm can solve problems with input data of some thousands of records When linkage process is based on deterministic approach the optimized reduction to 1 1 links can be obtained usign a system of weights which can be assigned by the user giving the maximum positive weight for the most important rule and the subsequent ones in descending order according to the relevance of the rule Nevertheless the system gives its own default weights the maximum is given to the first rule defined by the user 10 3 Greedy Solution If the optimal solution is not able to reach a result due to computational limitations we can apply a greedy solution in order to select from the N M cluster the one to one linkage Also in this context we consider firstly the r weight for deterministic approach see par 10 2 and we sort all the record pairs by r Then we consider as one to one pairs those that have the higher r With this strategy it is not guaranteed that we reach an optimal soluti
46. h Space Reduction gt Sorted Neighborhood PA Select sorting key E xl Variables Sorting key BUSINESS NAME CITY ADDRESS CLASSIFICATION Remove RANGE CLASS MATCHREF Figure 16 Selection of the sorting variables x Window size Figure 17 Selection of the window size in the sorted neighborhood method 9 Decisional models As reported in the introduction RELAIS has the objective to provide different approaches and techniques to deal with the various record linkage problems The RELAIS 2 1 version implements both a method for probabilistic record linkage according to the Fellegi and Sunter theory 4 and two methods for deterministic record linkage based on comparisons of matching variable values The principal steps to apply the different 27 methods are treated in the next sections while methodological details on the probabilistic implementation are given in appendix A Some general aspects related to the advantages and disadvantages of each method are given in the next section 9 1 Methodological aspects A distinction between deterministic and probabilistic approaches is often made in research literature where the former is associated with the use of formal decision rules while the latter makes an explicit use of probabilities for deciding when a given pair of records is actually a match Actually it is difficult to make a clear distinction between the two approaches especially
47. he blocking variables browse the menu Search Space Creation Search Space Reduction Blocking Figure 14 shows the window that is opened to select the blocking variables After the execution of the blocking step a Block Modality table is created that reports information on the number of created blocks and on their sizes as shown in Figure 15 The table can be visualized by selecting Utility Table Display 25 PA Select blocking key X Variables BUSINESS NAME ADDRESS CLASSIFICATION AR BEGIN RANGE CLASS MATCHREF Blocking key Figure 14 Selection of the blocking variables EA BLOCK_MODALITY BLOCK_NUMBER BLOCK_MODALITY FREQUENCY CONTINGENCY_TABLE MU_TABLE Ascoli 210 CONTINGENCY_TABLE_CITY_1 MU_TABLE_CITY_1 Bologna Bolzano 289 CONTINGENCY_TABLE_CITY_2 MU_TABLE_CITY_2 a CONTINGENCY_TABLE_CITY_3 MU_TABLE_CITY_3 Brindisi Catanzaro CONTINGENCY_TABLE_CITY_4 CONTINGENCY_TABLE_CITY_5 MU TABLE CITY 4 MU TABLE CITY 5 Firenze CONTINGENCY TABLE CITY 8 MU TABLE CITY B Gorizia Milana CONTINGENCY TABLE CITY 7 MU TABLE CITY 7 CONTINGENCY TABLE CITY 8 MU TABLE CITY 8 Napoli Padova CONTINGENCY TABLE CITY 8 CONTINGENCY TABLE CITY 1 MU TABLE CITY 8 MU TABLE CITY 10 Palermo CONTINGENCY TABLE CITY 11 MU TABLE CITY 11 Perugia Pescara CONTINGENCY
48. ich more than a record of A B is matched with more than one record in B A That is the match composite weight is higher than the fixed match thresholds for more than one record in A B or in the deterministic approach more than one record satisfies the adopted rules However in several applications the record linkage target is to recognize exactly and univocally the same units and establish only 1 1 links that is each record of A with at most one of B and viceversa This kind of application requires several constraints and it is a difficult problem of optimization for which different algorithms have been proposed In the current version of the toolkit we consider two possible solutions to the problem of reduction from N M linkage to one to one the optimized solution and the greedy solution 10 2 Optimized Solution In this first alternative we consider an optimal solution for the reduction 1 to 1 Once the matching weight r is assigned to each pair the identification of 1 to 1 links can be solved as a linear programming problem where the objective function to maximize is the sum of weights for the linked pairs under the constraints given by the fact that each unit of A must be linked with only one unit in B In the current version of RELAIS the solution of such a problem is obtained by means of the simplex algorithm available in the R package IpSolve With respect to linkage process using probabilistic model the original al
49. inkage and current research problems Statistics of Income Division Internal Revenue Service Publication R99 04 http www census gov srd papers pdf rr99 04 pdf 18 Using q grams in a DBMS for Approximate String Processing by Luis Gravano Panagiotis G Ipeirotis H V Jagadish Nick Koudas S Muthukrishnan Lauri Pietarinen and Divesh Srivastava 19 Jaro M A Advances in record linkage methodology as applied to matching the 1985 Census of Tampa Florida Journal of the American Statistical Association 989 84 414 420 Appendix Parameter Estimation of the Probabilistic Model via the EM Algorithm To estimate m Y and u y Jaro 9 defines a latent vector g 49 L0 if ab e M Se Vol if abb e U and the augmented log likelihood for the observed vector x of the k matching variables and the vector g In p T mpy 1 mo Ifsgmwpl Y gun y Bu In 1 p abk Q T by 1 em a b e Q k where p represents the probability that a randomly chosen pair a b belong to the subset M Moreover a conditional independence assumption is often made so that K K m ll pp u ll up k 1 k 1 where m Prly I a b e M ug Prly I a b e U 5 5 Since the vector g and the subsets M and U cannot be directly observed the probabilities m y and u y are estimated via the EM procedure 2 providing initial values for m y u y and p and estimating expected values for the vector g lt g
50. ins the list of the common variables between the two data sets 7 Data Profiling To give the opportunity to the user of designing the record linkage work flow more appropriate for the application at hand RELAIS 2 1 toolkit supplies a data profiling phase in which a set of quality meta data are calculated starting from real data provided as input these meta data help the user in the critical phase of choosing the best blocking or matching variables among those available and common to the two data sets Moreover in order to come towards needs of non skilled users RELAIS proposes also a default set of parameters coming from communities and manuals to help the decision making stages RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Database will be ini Blocking Variables System is ready Matching Variables Loading files in pro Dataset A GiDocumenti utenteitoscolDCMT1 RELAISIDATI data_300 cens300 tt Dataset B GADocumenti utenteitosco DCMT1 RELAISIDATKdata_300tide300 tet Data set loaded Figure 10 The data profiling menu 7 1 Methodological Aspects The choice of the indicators for helping not expert users in selecting the variables most suitable for the record linkage process is not easy As stated above the process is very complex and we identify some essential indicators which could be helpful in the choice All the measures vary in th
51. m Z gt STEP E K f mite ab rtl all me l 7 m k 1 sl ge 1 a tl Lage k 1 2 P dem A 1 a EESAN After this step the g values can be placed into the log likelihood 2 and a maximum likelihood estimate for m y u y and p STEP M can be obtained from BAT Yar he T _ a byeQ _ a b Q a a b Q a b Q i N The Expectation and the Maximization steps are then iterated until the convergence of the parameters of interest 1s achieved 50 In the current version of RELAIS the routine mu gen embedded R applies the R optimizer for loglinear model during the M step equivalent to the conditional independence model defined above The loglinear model parametrisation allows for a more flexible model definition in which the independence hypothesis can be relaxed in some extent The initial values of the parameters are m y 0 8 u y 0 2 and p 0 1 the maximum number of iteration is 5 000 and the stop criterion is achieved when the difference between the estimates of two iteration is 0 000001 The model estimates are considered not reliable when the estimated conditional probabilities m Y and u y of at least one of the matching variables have the same trend both for matches and for non matches populations In such conditions the system stops and the following message is shown ERROR one or more variables give inconsistent estimates Please check the variables in the model or try to reduce the se
52. more it is useful to avoid blocking variables which create too small groups i e blocking variable with a large amount of values in order to reduce risk of errors in the blocking variable in addition also blocking variables which create too large groups must be avoided generally because they do not allow to reduce enough the search space Generally speaking it is suitable to create blocks of the same size selecting one or more variables 21 which present a consistent number of values uniformly distributed among the units for instance the day and month of birth Settings MetaData Variable MMNAS Completeness Accuracy Consistency Categories _ Frequency Distributions C Blocking Adequacy Block Dimension 1000000 Entropy Figure 11 Blocking variable indicators Beside the above mentioned indicators in case of the blocking variable selection we can also consider the blocking adequacy measure see Figure 11 The indicator of the blocking adequacy gives in case of the selection of a certain variable the proportion of blocks with size number of pairs under a fixed threshold the default one is 1 000 000 on the overall A blocking adequacy indicator equal to 1 means that all the modalities of the selected variable create blocks below the fixed threshold The default value of the block dimension is related to the stochastic model estimableness see Section 8 3 for more
53. ng him all the grants on the following schema information schema mysql relais Thus the relais schema must have been created using the instruction CREATE DATABASE relais It is important to notice that the data base name relais must be written in lower cases being the Linux operating system case sensitive If no specific knowledge about the Linux operating system is held it could be useful to install the mySql Gui tool that gives a graphical interface to the data base and make easier to perform all the operations described below Moreover it is important to check that the libmyODBC library has been installed this library contains the header necessary for a correct running of mySql Moreover the libraries unixODBC dev unixODBC must have been installed these are necessary to connect to the data base starting from a R program Finally it is necessary to modify the hidden file ODBC ini that can be found in the own home folder writing the following instructions ODBC Data Sources relais Connector ODBC 3 51 Driver DSN relais Driver usr lib odbc libmyodbc so Description Connector ODBC 3 51 Driver DSN Server localhost DSN relais Port 3306 User Password Database relais ServerType MySql Option TraceFile var log mysql_test_trace log Trace 0 To run the program just use the following command java jar relais jar or after changing the Relais bat file making
54. nt the details to install Relais 2 1 both in Windows environment and in Linux environment 3 1 Windows Environment Requirements To run RELAIS 2 1 Java R and mySql environments must have been installed More precisely Java 2 Runtime Environment J2SE 6 0 with at least JDK 6 0 x mySql server environment www mysql com mySql Odbc 5 x Driver or higher R2 9 or higher R package lpSolve Rpackage RODBC It is important to verify that the system variable PATH contains the Java exe R exe and mysql exe paths The PATH variable can be found browsing the menu in the XP Windows Operating System Start gt Settings Control Panel gt System Advanced gt Environment variables To modify the system variable PATH it is necessary to be PC administrator If different version of Java and or R are installed on the PC the new paths of Java and or R must be written in the PATH variable string before the paths of the previous version we recommend to insert the new paths at the beginning of the PATH variable An example of how to set the PATH variable follows PATH C Programs Java jre1 6 0_03 bin C Programs R R 2 5 1 bin C Programs mysql MySql Server 5 0 bin Finally it is important to check that the operating system is updated at least to Service Pack 2 To verify this requirement click with the right button of the mouse on the Computer Resources icon and select the Properties menu in the General menu is described
55. o classify pairs into M the set of matches and U the set of non matches The decision rule can be empirical or probabilistic In the deterministic approach a pair is a match if it agrees completely on all the matching variables chosen or satisfies a defined rule base system that is if it reaches a score that is beyond a threshold when applying the comparison function The probabilistic approach based on the Fellegi and Sunter model 4 requires an estimation of the model parameters that can be performed via the EM algorithm Bayesian methods etc A linkage process can be also classified as 1 one to one problem if one record in the set A links to only one record in B and also the other way around ii one to many problem if a record in a set can be matched with more than one of the compared file iii many to many problem if more than one record in each file match with more than one record in the other The latter two problems may imply the existence of duplicate records in the linkable data sources Finally as not every record matched in the linkage process refers to the same identity in the record linkage procedure evaluation it s important to establish whether a match is a true one or not In other words during a linkage project is necessary to classify records as true link or true non link minimizing the two types of possible errors false matches and false non matches The first type of error refers to matched reco
56. on because we perform local choices 11 Linkage Result The last phase of the record linkage process 1s the selection of the desired output data In the following paragraphs the choices to be adopted are described and also all the available outputs in order not only to better end the whole process but also to have the residual to begin the new process with 11 1 Choice of the Thresholds At this point the unmatch threshold T and the match threshold Tm are selected If the weight value p post is higher than T and lower than T the pair is assigned to the set of the Possible Match if the value p post is higher than T the pair is assigned to the set of Match When the same value is chosen for T and T the resulting set of Possible Match is empty If the value assigned to T is higher than T an error message is given and new threshold values are requested 11 2 The Linkage Result menu As shown in Figure 29 the Linkage Result menu contains three menus Choose Threshold 1 1 Result 41 Cluster Result Bina as Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Setting matching variables done Choose Threshold INFO Pattern update task can take some minutes 1 1 Result gt Contingency table evaluation Contingency table created TET INFO relais_CONTINGENCY_TABLE table view is enabled on menu Utility Computing Fellegi Sunter method
57. own in Figure 8 will be open In this window the following information can be input data base name of the source name host name of the server hosting the data base port number of the port to use to connect to the server user name of the user password password of the user DSA table name name of the first table containing the data 18 DSB table name name of the second table containing the data Data Base Host User Password DSA Table Name DSB Table Name Reset Cancel Figure 8 Read from DB Oracle window or Read from DB MySQL window Choosing the Read from backup menu the window shown in Figure 9 will be open In this window is possible to choose an internal backup previously saved see Section 12 to be used to restore the input data set Read from backup X Select backup name forum x Copy Cancel Figure 9 Selection of internal backup Choosing the Read from residual menu a window similar to the one shown in Figure 9 will be open In this window is possible to choose the residual results of a process previously saved to be used as current input data sets After the Dataset reading phase the following tables are created in the database DSA contains the data set A with the generated variable key dsa DSB contains the data set B with the generated variable key dsb RECONCILED SCHEMA conta
58. rds which do not represent the same entity while the latter indicates unmatched records not correctly classified that imply truly matched entities were not linked 2 2 RELAIS 2 1 Techniques Each of the phases described in the previous section can be performed according to different techniques depending on specific applications and features of the data at hand it can be suitable to iterate and or omit some phases as well as it could be better to choose some techniques rather than others In the current version RELAIS provides some of the most widespread methods and techniques for the following phases Choice of matching variables Choice of comparison functions Creation and reduction of the search space of link candidate pairs Choice of the decision model Selection of unique links In the following for each of the implemented phase we briefly detail the available techniques 2 1 1 Phase 1 Choice of Matching Variables The choice of identifying attributes or matching variables is supported by a set of meta data that can assist Relais s users in this task The set of meta data currently available are Completeness Accuracy Consistency Entropy Correlation Frequency Distributions In Section 7 the meaning and usage of each meta data are detailed 2 1 2 Phase 2 Choice of Comparison Functions In the current version of RELAIS several comparison functions are available namely Equality e Numeri
59. rministic Equality Match System is ready Probabilistic gt Rule Based Loading files in progress Dataset A GiDocumenti utenteitoscoiDCMTTYRELAISYDAThdata 3001cens300 txt Dataset B GADocumenti utentettosco DCMT1 RELAISIDATIdata_300 ide300 te Data set loaded Search space creation via cross product Search space creation done Figure 18 Deterministic model menu RELAIS 2 1 as shown in Figure 18 implements two deterministic model Equality match 31 Rule based Choosing the Equality match menu the window shown in Figure 19 will be open This window allows to select variables that will compose the matching key Applying the Equality match a pair of record is classified as a match if all the selected matching keys are equals otherwise the pair is classified as a non match To evaluate an exact match it is not required the creation of a search space Moreover in this model it is not possible to choose a comparison function different from equality PA Select matching key x Variables Add CITY ADDRESS CLASSIFICATION Remove YEAR BEGIN RANGE Matching exact key Lok Cancel Figure 19 Selection of matching key The outputs of the Deterministic Exact model are contained in the tables MATCHTABLE table of match pairs with all the common variables of the two datasets and the generated variables key dsa and key dsb RESIDUAL DSA t
60. speaking the results of a comparison function can be also composed of categorical or continuous values In RELAIS 2 1 each comparison function is normalized and its results are in the range 0 1 Moreover it is requested to the user to choose a threshold between 0 and 1 consequently RELAIS 2 1 converts the results in binary elements treating all the results above the threshold as 1 and all results below the threshold as 0 The higher distance for two strings is the more similar the strings are A part the equality function hereafter we list of the comparison function available in RELAIS 2 1 with a short description The included functions are part of the Java package StringMetrics http www dcs shef ac uk sam stringmetrics html 1 Numeric Comparison This metric compare two strings by their numeric value Thus named Nx and Ny the numeric value of the two string Sx and Sy the numeric comparison is min N LIN max N N NC S S If the strings are not numeric or the two numbers have different signs the comparison s result is 0 2 Levenshtein comparison function This is the basic edit distance function whereby the distance is given simply as the minimum edit operations which transforms string into string2 Edit operations are copy a character from string over to string2 delete a character in stringl insert a character in string2 substitute one character for another Some other comparison f
61. the agreement or disagreement of the variables Viset josol k y 1 0 A The probability models for linkage assume that the probability distribution of the comparison pattern comes from a mixture of two probability distributions the first one comes from the pairs a b that actually are the same unit called distribution m the other one comes from the pairs a b that actually represent different units called distribution u Starting from the estimations of the two distribution m Y and u y it is possible to define the composite matching weight given by the likelihood ratio _mQ _ Prly M u y Prly U where M is the set of the pairs that actually are links and U is the set of the pairs corresponding to non links with MNU and MUU Fellegi and Sunter proposed an equation system to achieve the explicit formulas for the estimates of m Y and u y when the matching variables are at most three In more general situations the conditional distribution estimates can be obtained via the EM algorithm 19 assuming a latent class model in which the latent variable is just the link status According to the Fellegi and Sunter theory once the composite weight r is estimated it is possible to classify a pair as a link if the corresponding weight r is above a certain threshold Tm and as a non link if the weight lays below the threshold T finally for the pairs corresponding to weights falling into the range I T
62. tional probabilities of at most one of the matching variables result m y 70 or u Y 1 in such conditions the system stops and the following message is shown Estimation of parameters failed for this model The same error appears when the number of units in the model is smaller than the number of parameters to estimate 39 At this point the user can change the decision model with the purpose of modifying the performed choices that have conducted to a not reliable estimation of the model parameters 10 Reduction to matching 1 1 As reported in Section 2 the output of a record linkage procedure could be different depending on the aim of the matching process We can distinguish between 1 a one to one linkage ii a one to many linkage and iii a many to many linkage In the first case 1 we consider a problem in which a record in A can be matched to only one record in B and also the way around in ii a record in A set can be matched with more than one record of the compared file 111 allows more than one record in each file to be matched with more than one record in the other The latter two problems may imply the existence of duplicate records in the linkable data sources In Section 10 2 and 10 3 the solutions adopted in Relais 2 1 to the one to one problem are described 10 1 Methodological Aspects Both in the deterministic and in the probabilistic approach adopted in Relais 2 1 it is allowed the situation in wh
63. unctions reported below are extensions of the Levenshtein distance function and typically they alter the cost of the edit operation while in the Levenshtein function all the operations have the same cost 3 Dice comparison function Dice comparison function is a term based similarity measure and is defined as twice the number of terms common to compared entities divided by the total number of terms in both tested entities Dice 2 Common Terms Number of terms in Stringl Number of terms in String2 4 Jaro comparison function 16 29 The Jaro comparison function takes into account typical spelling deviations Briefly for two strings s and t let s be the characters in s that are common with t and let t be the characters in t that are common with s roughly speaking a character a in s is in common with t if the same character a appears in about the place in t Let Ty measure the number of transpositions of characters in s relative to t The Jaro similarity metric for s and t is 1 ls t t t 3l Ji 2 s Tay Jaro s t S 5 Jaro Winkler comparison function 17 This metric an extension of the Jaro comparison function 4 tends to modify the weights of the pairs s t with a common prefix the Jaro Winkler distance metric is particularly suitable in case of short strings such as person names The Jaro Winkler function for s and t is Jaro Winkler s t Jarols t Ip 1 J
64. with respect to proposals coming from the computer science area According to some authors e g Statistic Canada deterministic record linkage is defined just as the method that individuates links if and only if there is a full agreement of unique identifiers or a set of common identifiers the matching variables Other authors backed up that in deterministic record linkage a pair is a link also if it satisfied some specific criteria a priori defined actually not only the matching variables must be chosen and combined but also a threshold has to be fixed in order to establish whether a pair should be considered a link or not that 1s this kind of linkage is almost exact but not exact in the strict sense 14 In the deterministic approach both exact and almost exact the uncertainty in the match between two different databases is minimized but the linkage rate could be very low Deterministic record linkage can be adopted instead of probabilistic method in presence of error free unique identifiers such a social security number or fiscal code or when matching variables with high quality and discriminating power are available and can be combined so as to establish the pairs link status in this case the deterministic approach is very fast and effective and its adoption 1s appropriate From the other side the rule definition is strictly dependent on the data and on the knowledge of the practitioners Moreover due to the strong importance of the matchin
65. wsing the menu Utility Table Display 37 x Select table contingency_table Figure 26 Contingency table selection 9 4 2 Parameter Estimation of the Probabilistic Model and the Table MU In the probabilistic approach the distribution of the comparison vector is supposed to come from two different unobserved distributions according to fact that the pair is a match or not The estimation of these two distributions can be obtained by means of the maximization of the likelihood function Such operation involving a latent variable requires the use of iterative methods generally the EM algorithm or its generalizations Details on the method applied in RELAIS 2 1 to estimate the parameters of the probabilistic model are given in Appendix A including the initial values of the parameters the maximum number of iteration allowed and the stop criterion The estimation of the model parameters is achieved by browsing the menu see also Figure 27 B RELAIS Project Dataset Data Profiling Search Space Creation Decision Model Linkage 1 1 Linkage Result Save Utility Matching variable GGNAS chosen distance Equality and threst Deterministic gt Matching variable MMNAS chosen distance Equality and Mn Probabilistic gt Matching Variables gt Matching variable AANAS chosen distance Equality and thresh ire RENT Setting matching variables done Ealleptsuntan INFO Pattern upd

User's Guide

Contents

Download Pdf Manuals

Related Search

Related Contents