ICS Record Linkage Standards
ICS provides ad hoc record linkage (also known as “matching”, “fuzzy matching”, or “probabilistic matching”) between person records when unique identifiers are not available. Names and dates of birth are not unique identifiers, but they can be used to link the records of the same person across data systems or data files. Record linkage of administrative data is rarely, if ever, 100% accurate, but some data practices can increase both the accuracy and completeness of record linkage. Where possible, ICS will report record linkage metrics (e.g., precision, recall, F-score) that can help researchers assess whether a record linkage project is accurate enough to support research needs. Researchers should recognize that high rates of false matches and/or missed matches (incomplete linkage) can result in incorrect or biased research conclusions. Researchers are also responsible for correctly interpreting the possible effects of inadequate or incorrect record linkage.
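As an illustration of the metrics named above, the following is a minimal sketch of how precision, recall, and F-score (here the balanced F1 form) relate to counts of true matches, false matches, and missed matches. The function name `linkage_metrics` and the example counts are hypothetical; this is not a description of how ICS computes its reported metrics.

```python
def linkage_metrics(true_matches: int, false_matches: int, missed_matches: int) -> dict:
    """Compute precision, recall, and F1 from counts of record pairs.

    true_matches:   pairs that were linked and truly belong to the same person
    false_matches:  pairs that were linked but belong to different people
    missed_matches: pairs that belong to the same person but were not linked
    """
    linked = true_matches + false_matches        # every pair the linkage produced
    actual = true_matches + missed_matches       # every pair that should have been linked
    precision = true_matches / linked if linked else 0.0
    recall = true_matches / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = linkage_metrics(true_matches=900, false_matches=50, missed_matches=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Low precision signals many false matches, while low recall signals incomplete linkage; either can bias downstream analyses.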
Minimal fields for record linkage:
- full first name
- full last name
- full middle name (or middle initial when the full middle name is not available)
- date of birth
- Additional fields may increase the match rate, including Social Security number, county of residence, ZIP code, and race/ethnicity. Consult with ICS to determine whether any program-specific fields may help increase match rates or whether specific fields are available. Ideally, record linkage projects will include fields beyond the minimal fields to improve linkage accuracy, especially when a very large number of records require linkage.
- Names: first, last, and middle names should be separated into their own columns.
- Last names with multiple parts (e.g., hyphenated names, compound surnames, multiple surnames) should remain in a single column.
- Non-alphanumeric characters (e.g., accent marks, symbols) should be removed (“cleaned”) from the data whenever possible.
- Dates: dates can be in any standard format, including separation of date components into different fields (i.e., day, month, and year of birth in separate fields). The format should be consistent throughout any given field (e.g., avoid mixing date formats such as “12/25/2001” and “2003-12-25”).
- Unique persons should be de-duplicated in program records: each individual person should occupy one row within a data set, not multiple rows. If a program cannot de-duplicate its records, let ICS know that the records have not been de-duplicated.
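The formatting practices above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an ICS-provided tool: the helper names (`clean_name`, `normalize_date`, `deduplicate`), the record field names, the choice to keep hyphens and spaces so that compound surnames stay in one readable value, and the two accepted date formats are all hypothetical. The de-duplication shown catches only rows that become identical after cleaning; real program data may need more careful review.

```python
import re
import unicodedata
from datetime import datetime

# Assumed input formats; extend this tuple to match the formats in your data.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def clean_name(name: str) -> str:
    """Strip accent marks and symbols, keeping letters, digits, spaces, hyphens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = re.sub(r"[^A-Za-z0-9 \-]", "", no_accents)
    return re.sub(r"\s+", " ", kept).strip()


def normalize_date(text: str) -> str:
    """Convert a date in any of DATE_FORMATS to a single ISO format (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def deduplicate(records: list) -> list:
    """Keep the first row for each cleaned (first, middle, last, dob) combination."""
    seen = set()
    unique = []
    for rec in records:
        key = (
            clean_name(rec["first"]).casefold(),
            clean_name(rec["middle"]).casefold(),
            clean_name(rec["last"]).casefold(),
            normalize_date(rec["dob"]),
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical records: the first two describe the same person in different formats.
records = [
    {"first": "José", "middle": "", "last": "García-López", "dob": "12/25/2001"},
    {"first": "Jose", "middle": "", "last": "Garcia-Lopez", "dob": "2001-12-25"},
    {"first": "Ann", "middle": "M", "last": "O'Brien", "dob": "2003-12-25"},
]
unique = deduplicate(records)
print(len(records), "rows in,", len(unique), "rows out")
```

Raising an error on an unrecognized date, rather than guessing, makes mixed or unexpected formats visible before the file is submitted for linkage.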
Some issues to consider when determining whether to match two data sources, or to match an external data source to ICS:
- The proportion of records with missing identifiers (e.g., % of records with missing last names).
- How data sources have been validated (e.g., are any variables estimated or derived from an unreliable source?).
- The variables available for linkage (e.g., data sources without a middle name or SSN can lead to high rates of false matches).
- Previous match metrics or evaluation of match outcomes.
- How data owners and ICS want to handle false matches or low rates of overall matches.
- How issues with linked data (e.g., bias, incomplete matches, low match rate) will be communicated to data end-users.
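The first consideration in the list above, the proportion of records with missing identifiers, can be checked before a linkage request is made. The following is a minimal sketch; the function name `missing_rates`, the field names, and the sample records are hypothetical, and a value is treated as missing when it is absent, `None`, or blank.

```python
def missing_rates(records: list, fields: list) -> dict:
    """Return, for each field, the share of records where the value is missing."""
    total = len(records)
    rates = {}
    for field in fields:
        # Count values that are absent, None, empty, or whitespace only.
        missing = sum(1 for rec in records if not str(rec.get(field) or "").strip())
        rates[field] = missing / total if total else 0.0
    return rates


# Hypothetical records, for illustration only.
sample = [
    {"first": "Ann", "middle": "M", "last": "Lee", "dob": "2001-12-25"},
    {"first": "Raj", "middle": "", "last": "Patel", "dob": "1999-03-02"},
    {"first": "Kim", "middle": None, "last": " ", "dob": "2003-12-25"},
    {"first": "Eva", "middle": "R", "last": "Stone", "dob": "1987-07-14"},
]
rates = missing_rates(sample, ["first", "middle", "last", "dob"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Sharing figures like these with ICS alongside a request gives both parties an early sense of how complete a linkage is likely to be.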