ICS provides ad hoc record linkage (also known as matching, fuzzy matching or probabilistic matching) between person records when unique identifiers are not available. Names and dates of birth are not unique, but they can be used to link records of the same person across data systems or data files. Record linkage of administrative data is rarely 100% accurate, but some data practices can increase both its accuracy and its completeness. Where possible, ICS will report record linkage metrics (e.g., precision, recall, F-score) that can help researchers assess whether a record linkage project is accurate enough to support research needs. Researchers should recognize that high rates of false matches or missed matches (incomplete linkage) can lead to incorrect or biased conclusions. Researchers are also responsible for correctly interpreting the potential effects of inadequate or incorrect record linkage.
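As an illustration of the metrics named above, the sketch below computes precision, recall and F-score from counts of true matches, false matches and missed matches. The function name and the example counts are hypothetical and are not taken from any ICS project; it assumes the true match status of each pair is known (for example, from a manually reviewed sample).

```python
def linkage_metrics(true_matches, false_matches, missed_matches):
    """Return (precision, recall, f_score) for a set of linked record pairs.

    true_matches:   linked pairs that really are the same person
    false_matches:  linked pairs that are different people
    missed_matches: same-person pairs that were not linked
    """
    linked = true_matches + false_matches    # everything the linkage reported
    actual = true_matches + missed_matches   # everything it should have found
    precision = true_matches / linked if linked else 0.0
    recall = true_matches / actual if actual else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical counts, for illustration only
precision, recall, f_score = linkage_metrics(
    true_matches=90, false_matches=10, missed_matches=30
)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
# prints: precision=0.90 recall=0.75 F=0.82
```

Low precision signals many false matches, and low recall signals many missed matches; the F-score summarizes both in a single number.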
Minimal fields for record linkage:
- Full first name
- Full middle name (or middle initial if the full middle name is not available)
- Full last name (hyphenated, compound or multiple last names should remain in same column)
- First, middle and last names must be in separate columns
- Non-alphanumeric characters (e.g., accent marks, symbols) should be removed/cleansed from data when possible
- Date of birth
- Dates can be in any standard format, including month, day and year in separate columns, but the format must be consistent for the entire field
- Gender/sex
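The name cleansing and date formatting guidance above can be sketched roughly as follows. This is a minimal example rather than ICS's actual procedure: the function names are hypothetical, and replacing punctuation with a space and upper-casing names are assumptions made for illustration.

```python
import re
import unicodedata
from datetime import date, datetime


def clean_name(name):
    """Strip accent marks and symbols from one name column."""
    # Split accented letters into base letter + combining mark, then drop the marks
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: any run of remaining non-alphanumeric characters becomes one
    # space, so a hyphenated last name stays in a single column as two words
    return re.sub(r"[^A-Za-z0-9]+", " ", no_accents).strip().upper()


def standardize_dob(value, fmt):
    """Parse a date string in one known format and return ISO YYYY-MM-DD."""
    return datetime.strptime(value, fmt).date().isoformat()


def dob_from_parts(year, month, day):
    """Build an ISO date when year, month and day arrive in separate columns."""
    return date(int(year), int(month), int(day)).isoformat()


print(clean_name("García-López"))                  # GARCIA LOPEZ
print(standardize_dob("03/07/1985", "%m/%d/%Y"))   # 1985-03-07
print(dob_from_parts(1985, 3, 7))                  # 1985-03-07
```

Applying one format string to the whole date column makes inconsistent entries fail loudly (`strptime` raises `ValueError`) instead of being silently misread.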
Program records should be de-duplicated so that each unique person appears only once. If programs are unable to de-duplicate records, notify ICS.
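A minimal sketch of de-duplication on the linkage fields is shown below. It assumes records are dictionaries with hypothetical column names (`first_name`, `middle_name`, `last_name`, `dob`, `sex`) and removes only exact duplicates after trimming whitespace and ignoring case; near-duplicates such as misspelled names would still need review.

```python
def deduplicate(records,
                key_fields=("first_name", "middle_name", "last_name", "dob", "sex")):
    """Keep the first record seen for each combination of key field values."""
    seen = {}
    for rec in records:
        # Normalize each key value so case and stray spaces do not hide duplicates
        key = tuple(str(rec.get(f) or "").strip().upper() for f in key_fields)
        seen.setdefault(key, rec)  # later duplicates are ignored
    return list(seen.values())


# Hypothetical program records, for illustration only
records = [
    {"first_name": "Ana", "middle_name": "M", "last_name": "Lopez",
     "dob": "1985-03-07", "sex": "F"},
    {"first_name": "ANA ", "middle_name": "m", "last_name": "lopez",
     "dob": "1985-03-07", "sex": "f"},
    {"first_name": "Ben", "middle_name": "", "last_name": "Ortiz",
     "dob": "1990-11-02", "sex": "M"},
]
unique_records = deduplicate(records)
print(len(unique_records))  # 2
```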
Additional fields, including SSN, county of residence, zip code, race and ethnicity, may increase the match rate. Consult with ICS to determine whether program-specific fields may also increase match rates. Ideally, record linkage projects will include additional fields to improve linkage accuracy, particularly when a large number of records require linkage.
Issues to consider when determining whether to match data sources, including a data source external to ICS:
- The proportion of records with missing identifiers
- How data sources have been validated
- Variables available for linkage
- Previous match metrics or evaluation of match outcomes
- How data owners and ICS will handle false matches or low rates of overall matches
- How issues with linked data will be communicated to end users
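The first consideration in the list above, the proportion of records with missing identifiers, can be checked before a linkage request is made. The sketch below is one way to do so; the function name, field names and sample records are hypothetical, and treating `None` and blank strings as missing is an assumption.

```python
def missing_rates(records, fields):
    """Return the share of records with a missing value for each field."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(
            1 for rec in records
            if rec.get(field) is None or not str(rec.get(field)).strip()
        )
        rates[field] = missing / total if total else 0.0
    return rates


# Hypothetical records, for illustration only
sample = [
    {"last_name": "Lopez", "dob": "1985-03-07", "ssn": None},
    {"last_name": "Ortiz", "dob": "", "ssn": None},
    {"last_name": "Chen", "dob": "1972-06-15", "ssn": "   "},
    {"last_name": "Diaz", "dob": "2001-01-30", "ssn": "000-00-0000"},
]
rates = missing_rates(sample, ["last_name", "dob", "ssn"])
print(rates)  # {'last_name': 0.0, 'dob': 0.25, 'ssn': 0.75}
```

Fields with high missing rates contribute little to a match and may point to a need for additional linkage variables.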