-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
593 lines (370 loc) · 16 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
---
title: "Record Linkage & Deduplication"
subtitle: "Identify and Remove duplicate with {fastlink}"
date: "`r format(Sys.Date(), '%d %B %Y')`"
output:
unhcrdown::html_slides:
self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: true
ratio: "16:9"
---
```{r echo=FALSE, message=FALSE, warning=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
warning = FALSE,
message = FALSE,
fig.showtext = TRUE
)
library(tidyverse)
library(unhcrthemes)
library(fontawesome)
# install.packages("RecordLinkage")
# library(RecordLinkage)
# install.packages("fastLink")
library(fastLink)
```
## Introduction
Imagine you need to lnk your registration database with a list of beneficiaries created by a partner....
> Record linkage (_also called de-duplication or Entity resolution_) is the process of joining multiple data sets removes duplicate entities often in the absence of a unique identifier.
Needed for instance, when reconciling multiple registration list to build a single sampling universe when preparing for a survey in the absence of a common registration database...
Exact matching is a method that says two records are a match if they agree on every feature. Exact matching ONLY works well if the linking data are perfect and present in all the databases you want to link works best when there is a single unique identifier (key/uuid).
Performing exact matching is often not do-able when reconciling registration list with individual data coming from different partners as many issues can arise when trying to match only on names. In such cases Matching that allows fields to __only be similar__ rather than exact duplicates will likely be more robust.
---
## Challenges with individuals matching
.pull-left[
Can be numerous ones: ...
* Spelling of names and first names can be different
* Occurrence of special characters and space
* Usage of multiple names compounded together (father name + mother name)
* Usage of data of Birth Vs Age..
* not the same exact data available on each list
* multiple lists to merge together (more than 2!)
]
.pull-right[
![](img/record-linkage.png)
]
---
## Workflow
.pull-left[
The recommended workflow includes to go through specific steps:
* Data Pre-processing
* Identify blocking variables: Reduction of search space: Blocking
* Perform Comparison
* Review results based on acceptation threshold
]
.pull-right[
![](img/concept.png)
]
???
To minimize the number of computation, it is important to perform Record linkage only within the blocks, i.e. to block partitions similar records into bins or blocks.
---
## Pre-processing: Cleaning & Harmonisation
.pull-left[
Keep only people with phone number
Common fields in the data to be identified and harmonized (Clean Gender or Nationality...)
]
.pull-right[
```
data.prep <- data |>
## Filter when phone number is not available
dplyr::filter( telefono != "NO REFIERE") |>
## Clean the gender variable
# data |> dplyr::select(genero) |>
# dplyr::distinct() |> dplyr::pull()
dplyr::mutate(gender = dplyr::case_when(
genero %in% c("F" ,"FEMENINO" ,
"f", "Femenino") ~ "F",
genero %in% c("M" , "MASCULINO" ,
"Masculino") ~ "M",
genero %in% c("X", "Otro") ~ "Ot",
TRUE ~ NA ))
```
]
---
## Pre-processing: String
.pull-left[
String treatment
* Remove Punctuation
* Convert to Upper Case
* Address special characters and encoding
]
.pull-right[
```
# Convert to uppercase
names_column_new <- toupper(names_column)
# Remove punctuation, digits, and all spaces
names_column_new <- gsub(
"[[:punct:][:digit:]][[:space:]]",
"",
names_column_new)
# Convert special char to ASCII equivalents
names_column_new <- iconv(
names_column_new,
"latin1",
"ASCII//TRANSLIT",
sub = "")
# Keep only alphabetic characters
names_column_new <- gsub(
"[^[:alpha:]]",
"",
names_column_new)
```
]
---
## Pre-processing: Common words & Stop words
.pull-left[
Remove user defined list of common words.
Common words within names like __"JR", " SR", " IV", " III", " II"__ may prevent good matching and can be removed.
Needs to be contextualized to the cultural context.
]
.pull-right[
```
## Define list of words to remove
toRemove = c(" JR"," JUNIOR", " SR",
" IV", " III", " II")
# Remove specified name suffixes
for (tR in toRemove) {
names_column_new <- gsub(
tR,
"",
names_column_new)
}
```
]
---
## Pre-processing: name Decomposition
.left-column[
Build a function to decompose name
From Full name to First name, Father and Mother name
Enhance significantly record linkage
]
.right-column[
```
## Counting the number of space
dplyr::mutate( numspace = stringr::str_count(fullname, ' ')) |>
## separate into multiple column based on spaces
tidyr::separate_wider_delim(fullname,
delim = " ", names_sep = "",
too_few = "align_start") |>
### Now reconstruct the first name, father and mother name
# Let summarize the logic...!
# Based on the number of componnent in the full name
## Case A: "firstname_fathername_mother_name" - numspace
# 0 - only firstname = fullname1
# 1 - firstname = fullname1 & fathername = fullname2
# 2 - firstname = fullname1 & fathername = fullname2 &
# mothername = fullname3
# 3 - firstname = paste0(fullname1, " ", fullname2) &
# fathername = fullname3 & mothername = fullname4
# 4 - firstname = paste0(fullname1, " ", fullname2, " ", fullname3) &
# fathername = fullname4 & mothername = fullname5
```
]
---
## Pre-processing: Age & Date of Birth
.left-column[
Creating a dedicated function....
If the date of birth is present, it will parse it, extract year and month, and recalculate both age and age range
If the date of birth is not mentioned, then it will use age and date_record to reconstruct an estimated date of birth - and recalculate the rest
]
.right-column[
```
date_birth = dplyr::case_when(
# Case we age but no DOB and date registration
is.na(date_birth) & !(is.na(age)) & !(is.na(date_record)) ~
lubridate::as_date(date_record) - lubridate::dyears(age) ,
# Case we age but no DOB and date registration
is.na(date_birth) & !(is.na(age)) & is.na(date_record) ~
lubridate::today() - lubridate::dyears(age) ,
# Case take what we have...
TRUE ~ lubridate::as_date(date_birth) ) ,
## make sure it is in the correct format and extract day, month year,
date_birth = lubridate::as_date(date_birth),
day_birth = as.numeric( lubridate::day(date_birth)),
month_birth = as.numeric( lubridate::month(date_birth)),
year_birth= as.numeric( lubridate::year(date_birth)),
age = round( as.numeric(lubridate::today() - date_birth) / 365),
age_range = dplyr::case_when(
age <5 ~ "0-4", age >=5 & age <= 11 ~ "4-11",
age >= 12 & age <= 17 ~ "12-17", age >= 18 & age <= 59 ~ "18-59",
age >= 60 ~ "60+", TRUE ~ age_range ) )
```
]
---
## Pre-processing: Missing gender
.left-column[
Missing genders prevents good matching
Possibility to use a package to get the gender based on the first name
[genderize: A simple API to predict the gender of a person given their name](https://genderize.io/)
]
.right-column[
```
## identify the mising..
missing_gender <- data.prep |>
dplyr::filter( is.na(gender)) |>
dplyr::select( firstname1, gender) |>
dplyr::distinct() |>
dplyr::mutate(firstname1 = gsub(
"[[:punct:][:digit:]][[:space:]]",
"-",
firstname1)) |>
dplyr::mutate(firstname1 = gsub(" ", "-", firstname1)) |>
dplyr::pull(firstname1)
## let's us an API to identify gender based on firstname
# devtools::install_github("coccopuffs/GenderGuesser")
missing_gender_result <- GenderGuesser::guessGender(missing_gender)
## bring this back into the data..
```
]
---
## Pipe the treatment
.pull-left[
Each steps described above to be:
* organised within functions
* piped using the tidyverse approach
Revise the [demo script](FastLink.html) and [applied real use case notebook](https://github.com/unhcr-americas/record_linkage/blob/main/deduplicate.Rmd) to use the functions
]
.pull-right[
![](img/pipe.png)
]
---
## Reduction of search space: Blocking
To minimize the number of computation, it is important to perform Record linkage only within the blocks, i.e. __to partition similar records into bins__
A deterministic partition can be formed based upon the data. This imply to choose the blocking features (variables or keys) containing the fewest errors or missing values should be chosen as blocking variables. A partition is then created by treating certain fields that are thought to be nearly error-free as fixed,
> for instance a Partition of date of birth year or age range.
An alternative is to use multiple keys to consider typographical or measurement errors that would exclude true matches. For instance, Blocking by last name initial or zip code. This produces overlapping blocks of the data: records match on field A or field B
---
## Comparison: String Matching
Deterministic, i.e Automatic comparisons where either everything needs to match, or specific data specific rules are programmed. Though this Needs new rule for every variation in data, which therefore can not be generalizable to other datasets
* Levenshtein distance is the minimum number of substitutions required to transform one string into another.
* The Jaro distance considers common characters and character transpositions.
* The Soundex algorithm generates a code representing the phonetic pronunciation of a word. This is typicall more useful on non-English names or longer names. The Soundex code for a name consists of a letter followed by three numerical digits.
---
## Compare linkage with {fastlink}
```
matches.out <- fastLink::fastLink(
dfA = dfA,
dfB = dfB,
# Specify the vector of variable names to be used for matching.
# These variable names should exist in both dfA and dfB
varnames = c('FirstName', 'LastName', 'dob_day', 'dob_month', 'dob_year'),
# Specify which variables among varnames should be compared using string distance
stringdist.match = c('FirstName', 'LastName'),
# Specify which variables present in stringdist.match can be partially matched
partial.match = c('FirstName', 'LastName'),
# Specify which variables should be matched numerically
# Must be a subset of 'varnames' and must not be present in 'stringdist.match'.
numeric.match = c('dob_day', 'dob_month', 'dob_year'),
# Specify the number of CPU cores to utilize (parallel processing).
## Get the number of detected cores minus 1, Reserve one core for
#non-computational tasks to help prevent system slowdowns or unresponsiveness
n.cores = parallel::detectCores() - 1,
return.df = TRUE)
```
???
The default threshold for matches is 0.94 (threshold.match =) but you can adjust it higher or lower. If you define the threshold, consider that higher thresholds could yield more false-negatives (rows that do not match which actually should match) and likewise a lower threshold could yield more false-positive matches.
---
## Results on dummy data
```{r}
#source("fastlink.R")
```
---
## Evaluate results
.pull-left[
Last stage involves a manual __clerical review__ of the results, specifically the __Possible Matches__
]
.pull-right[
![](evaluation.png)
]
---
## Evaluation metrics
> The posterior probability refers to the updated probability of an event occurring given some observed data using Bayesian inference. It represents the "strength of the matches" and is a combination of the initial belief and the new evidence.
List of parameter estimates for different fields:
* __patterns.w__: Counts of the agreement patterns observed (2 = match, 1 = partial match, 0 = non-match), along with the Felligi-Sunter Weights.
* __iter.converge__: The number of iterations it took the EM algorithm to converge.
* __zeta.j__: The posterior match probabilities for each unique pattern.
* __p.m__ & __p.u__ : The posterior probability of a pair matching (m) & pair not matching (u).
* __p.gamma.k.m__ & __p.gamma.k.u__: The posterior of the matching probability (m) & non-matching probability (u) for a specific matching field.
* __p.gamma.j.m__ & __p.gamma.j.u__: The posterior probability that a pair is in the matched set (m) or in the unmatched set (u) given a particular agreement pattern.
???
* __nobs.a__: The number of observations in dataset A.
* __nobs.b__: The number of observations in dataset B.
---
## Function to get the metrics
```
# Confusion Matrice
fastLink::confusion(matches.out, threshold = 0.98)
# Examine the EM object:
matches.out$EM
# Summarize the accuracy of the match:
# each column gives the match count, match rate,
# false discovery rate (FDR) and false negative rate (FNR)
# under different cutoffs for matches based on the posterior
# probability of a match.
summary(matches.out)
## Get the output...
matched_dfs <- fastLink::getMatches(
dfA = dfA,
dfB = dfB,
fl.out = matches.out,
threshold.match = 0.85
)
```
---
## Join based on the probabilistic matches
```
# convert cases rownames to a column
cases_clean <- cases %>% rownames_to_column()
# convert test_results rownames to a column
results_clean <- results %>% rownames_to_column()
# convert all columns in matches dataset to character,
#so they can be joined to the rownames
matches_clean <- my_matches %>%
mutate(across(everything(), as.character))
# Join matches to dfA, then add dfB
# column "inds.b" is added to dfA
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))
# column(s) from dfB are added
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))
```
---
## Iterate over multiple sources comparionsr
Often, we run several different matches for a single data set - for instance when there are multiple list or blocking options
```
#Preprocessing Matches via Blocking
blockgender_out <- blockData(dfA, dfB, varnames = "gender")
## Subset dfA into blocks
dfA_block1 <- dfA[blockgender_out$block.1$dfA.inds,]
dfA_block2 <- dfA[blockgender_out$block.2$dfA.inds,]
## Subset dfB into blocks
dfB_block1 <- dfB[blockgender_out$block.1$dfB.inds,]
dfB_block2 <- dfB[blockgender_out$block.2$dfB.inds,]
```
---
## Aggregating Multiple Matches Together
```
## Run fastLink on each
link.1 <- fastLink(
dfA_block1, dfB_block1,
varnames = c("firstname", "lastname", "housenum",
"streetname", "city", "birthyear")
)
link.2 <- fastLink(
dfA_block2, dfB_block2,
varnames = c("firstname", "lastname", "housenum",
"streetname", "city", "birthyear")
)
## aggregate multiple matches into a single summary with aggregateEM()
agg.out <- aggregateEM(em.list = list(link.1, link.2))
```
---
## Questions
Please [post here your questions](https://github.com/unhcr-americas/record_linkage/issues/new)
## Reference:
[Fastlink Documentation](https://github.com/kosukeimai/fastLink)
[Presentation from UN Stat Commission](https://www.youtube.com/watch?v=S7boX8X4uXU) - a practical example from DANE in Colombia - matching a survey - - Gran encuesta integrada de hogares (GEIH) - with a registry - Registro Estadístico de Relaciones Laborales (RELAB) -
[A tutorial on Probabilistic matching from The Epidemiologist R Handbook](https://epirhandbook.com/en/joining-data.html#probabalistic-matching)
[Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records](http://imai.fas.harvard.edu/research/linkage.html)