index.Rmd

---
title: "Record Linkage & Deduplication"
subtitle: "Identify and Remove duplicate with {fastlink}"
date: "`r format(Sys.Date(),  '%d %B %Y')`"
output:
  unhcrdown::html_slides:
    self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: true
      ratio: "16:9"
---

```{r echo=FALSE, message=FALSE, warning=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  fig.showtext = TRUE
)
library(tidyverse)
library(unhcrthemes)
library(fontawesome)
# install.packages("RecordLinkage")
# library(RecordLinkage)
# install.packages("fastLink")
library(fastLink) 
```


##  Introduction

Imagine you need to lnk your registration database with a list of beneficiaries created by a partner....

> Record linkage (_also called de-duplication or Entity resolution_) is the process of joining multiple data sets removes duplicate entities often in the absence of a unique identifier.

Needed for instance, when reconciling multiple registration list to build a single sampling universe when preparing for a survey in the absence of a common registration database...

Exact matching is a method that says two records are a match if they agree on every feature. Exact matching ONLY works well if the linking data are perfect and present in all the databases you want to link works best when there is a single unique identifier (key/uuid). 

Performing exact matching is often not do-able when reconciling registration list with individual data coming from different partners as many issues can arise when trying to match only on names.  In such cases Matching that allows fields to __only be similar__ rather than exact duplicates will likely be more robust.

---

## Challenges with individuals matching 


.pull-left[  

Can be numerous ones: ... 

* Spelling of names and first names can be different

* Occurrence of special characters and space

* Usage of multiple names compounded together (father name + mother name)

* Usage of data of Birth Vs Age..

* not the same exact data available on each list

* multiple lists to merge together (more than 2!)

]
.pull-right[
![](img/record-linkage.png)
]

 
---

##  Workflow


.pull-left[  

The recommended workflow includes to go through specific steps:

 * Data Pre-processing
 
 * Identify blocking variables: Reduction of search space: Blocking

 * Perform Comparison
 
 * Review results based on acceptation threshold

]
.pull-right[
![](img/concept.png)
]

???
To minimize the number of computation, it is important to perform Record linkage only within the blocks, i.e. to block partitions similar records into bins or blocks.


---

## Pre-processing: Cleaning & Harmonisation

.pull-left[  

Keep only people with phone number

Common fields in the data to be identified and harmonized (Clean Gender or Nationality...) 


]
.pull-right[ 

```
data.prep <- data |>
  
  ## Filter when phone number is not available  
  dplyr::filter( telefono != "NO REFIERE") |>
  
  ## Clean the gender variable
  # data |> dplyr::select(genero) |> 
  # dplyr::distinct() |> dplyr::pull()
  
  dplyr::mutate(gender = dplyr::case_when(
    genero %in% c("F" ,"FEMENINO" ,
                   "f",  "Femenino") ~ "F",
                   
    genero %in% c("M" , "MASCULINO" , 
                   "Masculino") ~ "M",
                   
    genero %in% c("X",  "Otro") ~ "Ot",
    TRUE ~  NA  ))  
    
    
```
]


---

## Pre-processing: String  

.pull-left[  

String treatment

* Remove Punctuation

* Convert to Upper Case

* Address special characters and encoding
]
.pull-right[ 

```
  #  Convert to uppercase
  names_column_new <- toupper(names_column)
  
  # Remove punctuation, digits, and all spaces
  names_column_new <- gsub(
      "[[:punct:][:digit:]][[:space:]]", 
      "", 
      names_column_new)

  # Convert special char to ASCII equivalents
  names_column_new <- iconv(
             names_column_new, 
             "latin1", 
             "ASCII//TRANSLIT",
              sub = "")
  
  # Keep only alphabetic characters
  names_column_new <- gsub(
            "[^[:alpha:]]",
            "",
            names_column_new)
```
]

---

## Pre-processing: Common words & Stop words

.pull-left[  

 Remove user defined list of common words.

 Common words within names like  __"JR", " SR", " IV", " III", " II"__ may prevent good matching and can be removed.
 
 Needs to be contextualized to the cultural context.
]
.pull-right[ 

```
## Define list of words to remove
toRemove = c(" JR"," JUNIOR", " SR", 
                " IV", " III", " II")
                
# Remove specified name suffixes
for (tR in toRemove) {
    names_column_new <- gsub(
          tR, 
          "", 
          names_column_new)
  }
  
```
]


---

## Pre-processing: name Decomposition


.left-column[  


Build a function to decompose name 

From Full name to First name, Father and Mother name 

Enhance significantly record linkage 


]
.right-column[ 

```
## Counting the number of space 
 dplyr::mutate( numspace = stringr::str_count(fullname, ' ')) |>
## separate into multiple column based on spaces 
 tidyr::separate_wider_delim(fullname,
                 delim = " ", names_sep = "", 
                 too_few = "align_start") |>
    
### Now reconstruct the first name, father and mother name  
# Let summarize the logic...!
# Based on the number of componnent in the full name

## Case A: "firstname_fathername_mother_name" - numspace
# 0 - only firstname = fullname1  
# 1 - firstname = fullname1 & fathername = fullname2
# 2 - firstname = fullname1 & fathername  = fullname2 & 
#                  mothername  = fullname3
# 3 - firstname = paste0(fullname1, " ", fullname2) & 
#    fathername  = fullname3 & mothername  = fullname4
# 4  - firstname = paste0(fullname1, " ", fullname2, " ", fullname3) &
#    fathername  = fullname4 &  mothername  = fullname5
```
]


---

## Pre-processing: Age & Date of Birth

.left-column[  

Creating a dedicated function....

If the date of birth is present, it will parse it, extract year and month, and recalculate both age and age range
 
If the date of birth is not mentioned, then it will use age and date_record to reconstruct an estimated date of birth - and recalculate the rest

]
.right-column[ 

```
 date_birth =     dplyr::case_when(
    # Case we age but no DOB and date registration  
    is.na(date_birth) & !(is.na(age)) & !(is.na(date_record))  ~
   lubridate::as_date(date_record) - lubridate::dyears(age) ,
   # Case we age but no DOB and date registration  
   is.na(date_birth) & !(is.na(age)) & is.na(date_record) ~
   lubridate::today() - lubridate::dyears(age) ,  
   # Case take what we have...  
   TRUE ~   lubridate::as_date(date_birth) ) ,
   
   ## make sure it is in the correct format and extract day, month year,
   date_birth =  lubridate::as_date(date_birth),
   day_birth = as.numeric( lubridate::day(date_birth)),
   month_birth = as.numeric(  lubridate::month(date_birth)),
   year_birth= as.numeric( lubridate::year(date_birth)),
   
   age = round( as.numeric(lubridate::today() - date_birth) / 365),     
   age_range = dplyr::case_when(
     age  <5    ~ "0-4", age >=5 & age <= 11 ~ "4-11",
     age >= 12 & age <= 17 ~ "12-17", age >= 18 & age <= 59 ~ "18-59",
     age >= 60     ~ "60+",  TRUE ~  age_range   )   )
    
```
]

 
---

## Pre-processing: Missing gender

.left-column[  

Missing genders prevents good matching 

Possibility to use a package to get the gender based on the first name

[genderize: A simple API to predict the gender of a person given their name](https://genderize.io/)

]
.right-column[ 

```
## identify the mising..
missing_gender  <- data.prep |>
        dplyr::filter( is.na(gender)) |>
        dplyr::select( firstname1, gender) |>
        dplyr::distinct() |>
        dplyr::mutate(firstname1 = gsub(
               "[[:punct:][:digit:]][[:space:]]",
               "-", 
               firstname1)) |>
        dplyr::mutate(firstname1 = gsub(" ", "-", firstname1)) |> 
        dplyr::pull(firstname1)
      

## let's us an API to identify gender based on firstname
# devtools::install_github("coccopuffs/GenderGuesser")  
missing_gender_result <-    GenderGuesser::guessGender(missing_gender)

## bring this back into the data..
```
]

 
---

## Pipe the treatment


.pull-left[ 

Each steps described above to be:

 * organised within functions
 
 * piped using the tidyverse approach
 
Revise the [demo script](FastLink.html) and [applied real use case notebook](https://github.com/unhcr-americas/record_linkage/blob/main/deduplicate.Rmd)  to use the functions
 
 ]
.pull-right[ 

![](img/pipe.png)
]
 

---

## Reduction of search space: Blocking

To minimize the number of computation, it is important to perform Record linkage only within the blocks, i.e. __to partition similar records into bins__
 
A deterministic partition can be formed based upon the data. This imply to choose the blocking features (variables or keys) containing the fewest errors or missing values should be chosen as blocking variables. A partition is then created by treating certain fields that are thought to be nearly error-free as fixed, 

> for instance a Partition of date of birth year or age range. 

An alternative is to use multiple keys to consider typographical or measurement errors that would exclude true matches. For instance, Blocking by last name initial or zip code. This produces overlapping blocks of the data: records match on field A or field B
 

---

## Comparison: String Matching

Deterministic, i.e Automatic comparisons where either everything needs to match, or specific data specific rules are programmed. Though this Needs new rule for every variation in data, which therefore can not be generalizable to other datasets

 * Levenshtein distance is the minimum number of substitutions required to transform one string into another.
 
 * The Jaro distance considers common characters and character transpositions. 
 
 *  The Soundex algorithm generates a code representing the phonetic pronunciation of a word. This is typicall more useful on non-English names or longer names. The Soundex code for a name consists of a letter followed by three numerical digits.

---

## Compare linkage with {fastlink}


``` 
matches.out <- fastLink::fastLink(
             dfA = dfA, 
             dfB = dfB, 
                 
         # Specify the vector of variable names to be used for matching.
         # These variable names should exist in both dfA and dfB
         varnames = c('FirstName', 'LastName', 'dob_day', 'dob_month', 'dob_year'),
        
        # Specify which variables among varnames should be compared using string distance
        stringdist.match = c('FirstName', 'LastName'),
        
        # Specify which variables present in  stringdist.match can be partially matched
        partial.match = c('FirstName', 'LastName'),
        
        # Specify which variables should be matched numerically
        # Must be a subset of 'varnames' and must not be present in 'stringdist.match'.
        numeric.match = c('dob_day', 'dob_month', 'dob_year'),
        
        # Specify the number of CPU cores to utilize (parallel processing). 
        ## Get the number of detected cores minus 1,  Reserve one core for 
        #non-computational tasks to help prevent system slowdowns or unresponsiveness
        n.cores = parallel::detectCores() - 1,
        return.df = TRUE)
```
???
The default threshold for matches is 0.94 (threshold.match =) but you can adjust it higher or lower. If you define the threshold, consider that higher thresholds could yield more false-negatives (rows that do not match which actually should match) and likewise a lower threshold could yield more false-positive matches.

---

## Results on dummy data 

```{r}
#source("fastlink.R")
 
```


---

## Evaluate results


.pull-left[ 

Last stage involves a manual __clerical review__ of the results, specifically the  __Possible Matches__
 
 ]
.pull-right[ 

![](evaluation.png)
]

---

## Evaluation metrics
 

> The posterior probability refers to the  updated probability of an event occurring given some observed data using Bayesian inference. It represents the "strength of the matches"  and is a combination of the initial belief and the new evidence. 

List of parameter estimates for different fields:

 *  __patterns.w__: Counts of the agreement patterns observed (2 = match, 1 = partial match, 0 = non-match), along with the Felligi-Sunter Weights.

 *  __iter.converge__: The number of iterations it took the EM algorithm to converge.

 *  __zeta.j__: The posterior match probabilities for each unique pattern.

 *  __p.m__ & __p.u__ : The posterior probability of a pair matching (m) & pair not matching (u).

 *  __p.gamma.k.m__ & __p.gamma.k.u__: The posterior of the matching probability (m) & non-matching probability (u) for a specific matching field.

 *  __p.gamma.j.m__ & __p.gamma.j.u__: The posterior probability that a pair is in the matched set (m) or in the unmatched set (u)  given a particular agreement pattern.


???
 *  __nobs.a__: The number of observations in dataset A.

 *  __nobs.b__: The number of observations in dataset B.

---

##  Function to get the metrics


```
# Confusion Matrice
fastLink::confusion(matches.out, threshold = 0.98)

# Examine the EM object:
matches.out$EM

# Summarize the accuracy of the match:
# each column gives the match count, match rate,
# false discovery rate (FDR) and false negative rate (FNR)
# under different cutoffs for matches based on the posterior 
# probability of a match. 
summary(matches.out)

## Get the output...
matched_dfs <- fastLink::getMatches(
                      dfA = dfA, 
                      dfB = dfB, 
                      fl.out = matches.out, 
                      threshold.match = 0.85
)


```

---

## Join based on the probabilistic matches

 
```
# convert cases rownames to a column 
cases_clean <- cases %>% rownames_to_column()

# convert test_results rownames to a column
results_clean <- results %>% rownames_to_column()  

# convert all columns in matches dataset to character, 
#so they can be joined to the rownames
matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))


# Join matches to dfA, then add dfB
# column "inds.b" is added to dfA
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))

# column(s) from dfB are added 
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))
```

---

## Iterate over multiple sources comparionsr

Often, we run several different matches for a single data set - for instance when there are multiple list or blocking options

```
#Preprocessing Matches via Blocking
blockgender_out <- blockData(dfA, dfB, varnames = "gender")

## Subset dfA into blocks
dfA_block1 <- dfA[blockgender_out$block.1$dfA.inds,]
dfA_block2 <- dfA[blockgender_out$block.2$dfA.inds,]

## Subset dfB into blocks
dfB_block1 <- dfB[blockgender_out$block.1$dfB.inds,]
dfB_block2 <- dfB[blockgender_out$block.2$dfB.inds,]

```

---

## Aggregating Multiple Matches Together

 
```
## Run fastLink on each
link.1 <- fastLink(
  dfA_block1, dfB_block1,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
) 

link.2 <- fastLink(
  dfA_block2, dfB_block2,
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear")
)

## aggregate multiple matches into a single summary with aggregateEM() 
agg.out <- aggregateEM(em.list = list(link.1, link.2))
```

---

## Questions

Please [post here your questions](https://github.com/unhcr-americas/record_linkage/issues/new)

## Reference: 


[Fastlink Documentation](https://github.com/kosukeimai/fastLink)

[Presentation from UN Stat Commission](https://www.youtube.com/watch?v=S7boX8X4uXU) - a practical example from DANE in Colombia - matching a survey - - Gran encuesta integrada de hogares (GEIH) - with a registry - Registro Estadístico de Relaciones Laborales (RELAB) - 

[A tutorial on Probabilistic matching from The Epidemiologist R Handbook](https://epirhandbook.com/en/joining-data.html#probabalistic-matching)

[Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records](http://imai.fas.harvard.edu/research/linkage.html)