
Commit 3d9b906

New duplicate algorithm to check for similar entries (asreview#52)
Add similarity based deduplication algorithm
1 parent c5109a5 commit 3d9b906

9 files changed: +492 −39 lines changed

README.md

+44 −3
@@ -178,6 +178,47 @@ asreview data dedup synergy:van_de_schoot_2018 -o van_de_schoot_2018_dedup.csv
Removed 104 records from dataset with 6189 records.
```

We can also choose to deduplicate based on the similarity of the title and abstract instead of checking for an exact match. This way we can find duplicates that have small differences but are actually the same record (for example, an extra comma or a fixed typo). This is done with the `--drop_similar` flag; the process takes about 4 seconds on a dataset of roughly 2068 entries.

```bash
asreview data dedup neurips_2020.tsv --drop_similar
```
```
Not using doi for deduplication because there is no such data.
Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:03<00:00, 531.93it/s]
Found 2 duplicates in dataset with 2068 records.
```

If we want to check which entries were flagged as duplicates, we can use the `--verbose` flag. This prints the lines of the dataset that were found to be duplicates, together with the difference between them. Text that has to be removed from the first entry to obtain the second is shown in red with a strikethrough, text that has to be added is shown in green, and text that is identical in both entries is dimmed.

```bash
asreview data dedup neurips_2020.tsv --drop_similar --verbose
```

![Verbose drop similar](./dedup_similar.png)

The similarity threshold can be set with the `--similarity` flag; the default is `0.98`. We can also choose to use only the title for deduplication by passing the `--skip_abstract` flag.

```bash
asreview data dedup neurips_2020.tsv --drop_similar --similarity 0.98 --skip_abstract
```
```
Not using doi for deduplication because there is no such data.
Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:02<00:00, 770.74it/s]
Found 4 duplicates in dataset with 2068 records.
```

Note that you might have to adjust the similarity score if you choose to use only the title for deduplication. The similarity score is calculated with the [SequenceMatcher](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher) class from the `difflib` package. It equals twice the number of matching characters divided by the total number of characters in the two strings; for example, the similarity score between "hello" and "hello world" is 0.625. By default, we use the [real_quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.real_quick_ratio) and [quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.quick_ratio) methods, which are faster and usually good enough, but less accurate. If you also want the stricter `ratio` check, use the `--strict_similarity` flag.
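
As a quick illustration (a minimal sketch using only the standard library, not part of this commit), the snippet below reproduces the 0.625 score and shows the three ratio variants:

```python
from difflib import SequenceMatcher

# Compare two strings the way the deduplication step does: seq2 is set once,
# seq1 is swapped in for each candidate record.
matcher = SequenceMatcher()
matcher.set_seq2('hello')
matcher.set_seq1('hello world')

# ratio() is 2*M/T, where M is the number of matching characters and
# T is the combined length of both strings.
print(matcher.ratio())             # 0.625  (2 * 5 / 16)
print(matcher.quick_ratio())       # cheap upper bound on ratio()
print(matcher.real_quick_ratio())  # even cheaper upper bound
```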

Now, if we want to discard stopwords during deduplication (for a stricter check on the important words), we can use the `--discard_stopwords` flag. The default stopwords language is `english`, but it can be set with the `--stopwords_language` flag. The supported languages are those supported by the [nltk](https://www.nltk.org/index.html) package. To check the list of available languages, run the following in your Python environment:

```python
from nltk.corpus import stopwords
print(stopwords.fileids())
```
```
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
```
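
For example, to run the similarity check on the same example dataset while discarding English stopwords (a sketch combining the flags described above):

```bash
asreview data dedup neurips_2020.tsv --drop_similar --discard_stopwords
```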

### Data Vstack (Experimental)

@@ -186,7 +227,7 @@ Vertical stacking: combine as many datasets in the same file format as you want
❗ Vstack is an experimental feature. We would love to hear your feedback.
Please keep in mind that this feature can change in the future.

Stack several datasets on top of each other:
```
asreview data vstack output.csv MY_DATASET_1.csv MY_DATASET_2.csv MY_DATASET_3.csv
```
@@ -206,7 +247,7 @@ Compose is where datasets containing records with different labels (or no
labels) can be assembled into a single dataset.

❗ Compose is an experimental feature. We would love to hear your feedback.
Please keep in mind that this feature can change in the future.

Overview of possible input files and corresponding properties, use at least
one of the following arguments:
@@ -231,7 +272,7 @@ case of conflicts, use the `--conflict_resolve`/`-c` flag. This is set to
| Resolve method | Action in case of conflict |
|----------------|-----------------------------------------------------------------------------------------|
| `keep_one` | Keep one label, using `--hierarchy` to determine which label to keep |
| `keep_all` | Keep conflicting records as duplicates in the composed dataset (ignoring `--hierarchy`) |
| `abort` | Abort |
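
For illustration, a hypothetical compose call that keeps conflicting records as duplicates (the file names here are placeholders):

```bash
asreview data compose composed_output.csv -l labeled.csv -u new_search.csv -c keep_all
```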


Tutorials.md

+17 −11
@@ -1,6 +1,6 @@
# Tutorials

---
Below are several examples to illustrate how to use `ASReview-datatools`. Make
sure to have installed
[asreview-datatools](https://github.com/asreview/asreview-datatools) and
@@ -18,17 +18,17 @@ ASReview converts the labeling decisions in [RIS files](https://asreview.readthe
irrelevant as `0` and relevant as `1`. Records marked as unseen or with
missing labeling decisions are converted to `-1`.

---

## Update Systematic Review

Assume you are working on a systematic review and you want to update the
review with newly available records. The original data is stored in
`MY_LABELED_DATASET.csv` and the file contains a
[column](https://asreview.readthedocs.io/en/latest/data_labeled.html#label-format)
containing the labeling decisions. In order to update the systematic review,
you run the original search query again but with a new date. You save the
newly found records in `SEARCH_UPDATE.ris`.


In the command line interface (CLI), navigate to the directory where the
@@ -52,12 +52,18 @@ asreview data convert SEARCH_UPDATE.ris SEARCH_UPDATE.csv

Duplicate records can be removed with the `dedup` script. The algorithm
removes duplicates using the Digital Object Identifier
([DOI](https://www.doi.org/)) and the title plus abstract.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv
```

This can also be done using a similarity threshold between the titles and abstracts.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv --drop_similar
```

### Describe input

If you want to see descriptive info on your input datasets, run these commands:
@@ -78,12 +84,12 @@ asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPD
The flag `-l` means the labels in `MY_LABELED_DATASET.csv` will be kept.

The flag `-u` means all records from `SEARCH_UPDATE_DEDUP.csv` will be
added as unlabeled to the composed dataset.

If a record exists in both datasets, the record containing a
label is kept; see the default [conflict resolving
strategy](https://github.com/asreview/asreview-datatools#resolving-conflicting-labels).
To keep both records (with and without label), use

```bash
asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPDATE_DEDUP.csv -c keep_all
```
@@ -154,14 +160,14 @@ added as unlabeled.

If any duplicate records exist across the datasets, by default the order of
keeping labels is:
1. relevant
2. irrelevant
3. unlabeled

You can configure the behavior in resolving conflicting labels by setting the
hierarchy differently. To do so, pass the letters r (relevant), i
(irrelevant), and u (unlabeled) in any order to, for example, `--hierarchy
uir`.
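
For illustration (a sketch with placeholder input file names), the hierarchy could be passed to the compose call like this:

```bash
asreview data compose search_with_priors.ris -l PRIOR_RELEVANT.csv -u ALL_RECORDS.csv --hierarchy uir
```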

The composed dataset will be exported to `search_with_priors.ris`.
@@ -193,12 +199,12 @@ new search.
Assume you want to use the [simulation
mode](https://asreview.readthedocs.io/en/latest/simulation_overview.html) of
ASReview but the data is not stored in one single file containing the meta-data
and labeling decisions as required by ASReview.

Suppose the following files are available:

- `SCREENED.ris`: all records that were screened
- `RELEVANT.ris`: the subset of relevant records after manually screening all the records.

You need to compose the files into a single file where all records from
`RELEVANT.csv` are relevant and all other records are irrelevant.

asreviewcontrib/datatools/dedup.py

+211
@@ -0,0 +1,211 @@
import re
from argparse import Namespace
from difflib import SequenceMatcher

import ftfy
import pandas as pd
from asreview import ASReviewData
from pandas.api.types import is_object_dtype
from pandas.api.types import is_string_dtype
from rich.console import Console
from rich.text import Text
from tqdm import tqdm


def _print_similar_list(
    similar_list: list[tuple[int, int]],
    data: pd.Series,
    pid: str,
    pids: pd.Series = None
) -> None:

    print_seq_matcher = SequenceMatcher()
    console = Console()

    if pids is not None:
        print(f'Found similar titles or same {pid} at lines:')
    else:
        print('Found similar titles at lines:')

    for i, j in similar_list:
        print_seq_matcher.set_seq1(data.iloc[i])
        print_seq_matcher.set_seq2(data.iloc[j])
        text = Text()

        if pids is not None:
            text.append(f'\nLines {i+1} and {j+1} ', style='bold')
            if pids.iloc[i] == pids.iloc[j]:
                text.append(f'(same {pid} "{pids.iloc[i]}"):\n', style='dim')
            else:
                text.append(f'({pid} "{pids.iloc[i]}" and "{pids.iloc[j]}"):\n',
                            style='dim')

        else:
            text.append(f'\nLines {i+1} and {j+1}:\n', style='bold')

        for tag, i1, i2, j1, j2 in print_seq_matcher.get_opcodes():
            if tag == 'replace':
                # add rich strikethrough
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'delete':
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
            if tag == 'insert':
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'equal':
                text.append(f'{data.iloc[i][i1:i2]}', style='dim')

        console.print(text)

    print('')


def _drop_duplicates_by_similarity(
    asdata: ASReviewData,
    pid: str,
    similarity: float = 0.98,
    skip_abstract: bool = False,
    discard_stopwords: bool = False,
    stopwords_language: str = 'english',
    strict_similarity: bool = False,
    verbose: bool = False,
) -> None:

    if skip_abstract:
        data = asdata.df['title']
    else:
        data = pd.Series(asdata.texts)

    symbols_regex = re.compile(r'[^ \w\d\-_]')
    spaces_regex = re.compile(r'\s+')

    # clean the data
    s = (
        data
        .apply(ftfy.fix_text)
        .str.replace(symbols_regex, '', regex=True)
        .str.replace(spaces_regex, ' ', regex=True)
        .str.lower()
        .str.strip()
        .replace('', None)
    )

    if discard_stopwords:
        try:
            from nltk.corpus import stopwords
            stopwords_set = set(stopwords.words(stopwords_language))
        except LookupError:
            import nltk
            nltk.download('stopwords')
            stopwords_set = set(stopwords.words(stopwords_language))

        stopwords_regex = re.compile(rf'\b{"\\b|\\b".join(stopwords_set)}\b')
        s = s.str.replace(stopwords_regex, '', regex=True)

    seq_matcher = SequenceMatcher()
    duplicated = [False] * len(s)

    if verbose:
        similar_list = []
    else:
        similar_list = None

    if pid in asdata.df.columns:
        if is_string_dtype(asdata.df[pid]) or is_object_dtype(asdata.df[pid]):
            pids = asdata.df[pid].str.strip().replace("", None)
            if pid == "doi":
                pids = pids.str.lower().str.replace(
                    r"^https?://(www\.)?doi\.org/", "", regex=True
                )

        else:
            pids = asdata.df[pid]

        for i, text in tqdm(s.items(), total=len(s), desc='Deduplicating'):
            seq_matcher.set_seq2(text)

            # loop through the rest of the data if it has the same pid or similar length
            for j, t in s.iloc[i+1:][(asdata.df[pid] == asdata.df.iloc[i][pid]) |
                                     (abs(s.str.len() - len(text)) < 5)].items():
                seq_matcher.set_seq1(t)

                # if the texts have the same pid or are similar enough,
                # mark the second one as duplicate
                if pids.iloc[i] == pids.iloc[j] or \
                        (seq_matcher.real_quick_ratio() > similarity and \
                         seq_matcher.quick_ratio() > similarity and \
                         (not strict_similarity or seq_matcher.ratio() > similarity)):

                    if verbose and not duplicated[j]:
                        similar_list.append((i, j))

                    duplicated[j] = True

        if verbose:
            _print_similar_list(similar_list, data, pid, pids)

    else:
        print(f'Not using {pid} for deduplication because there is no such data.')

        for i, text in tqdm(s.items(), total=len(s), desc='Deduplicating'):
            seq_matcher.set_seq2(text)

            # loop through the rest of the data if it has similar length
            for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
                seq_matcher.set_seq1(t)

                # if the texts are similar enough, mark the second one as duplicate
                if seq_matcher.real_quick_ratio() > similarity and \
                        seq_matcher.quick_ratio() > similarity and \
                        (not strict_similarity or seq_matcher.ratio() > similarity):

                    if verbose and not duplicated[j]:
                        similar_list.append((i, j))

                    duplicated[j] = True

        if verbose:
            _print_similar_list(similar_list, data, pid)

    asdata.df = asdata.df[~pd.Series(duplicated)].reset_index(drop=True)


def deduplicate_data(asdata: ASReviewData, args: Namespace) -> None:
    initial_length = len(asdata.df)

    if not args.similar:
        if args.pid not in asdata.df.columns:
            print(
                f'Not using {args.pid} for deduplication '
                'because there is no such data.'
            )

        # retrieve deduplicated ASReview data object
        asdata.drop_duplicates(pid=args.pid, inplace=True)

    else:
        _drop_duplicates_by_similarity(
            asdata,
            args.pid,
            args.threshold,
            args.title_only,
            args.stopwords,
            args.stopwords_language,
            args.strict,
            args.verbose,
        )

    # count duplicates
    n_dup = initial_length - len(asdata.df)

    if args.output_path:
        asdata.to_file(args.output_path)
        print(
            f'Removed {n_dup} duplicates from dataset with'
            f' {initial_length} records.'
        )
    else:
        print(
            f'Found {n_dup} duplicates in dataset with'
            f' {initial_length} records.'
        )
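
To see how these pieces fit together outside the CLI, here is a minimal sketch (not part of the commit) that calls the new entry point directly with an argparse-style `Namespace` mirroring the fields read in `deduplicate_data`; the dataset path is a placeholder:

```python
from argparse import Namespace

from asreview import ASReviewData

from asreviewcontrib.datatools.dedup import deduplicate_data

# Load a dataset (placeholder path) into an ASReviewData object.
asdata = ASReviewData.from_file('neurips_2020.tsv')

args = Namespace(
    similar=True,                  # use the similarity-based algorithm
    pid='doi',                     # persistent identifier column
    threshold=0.98,                # similarity threshold
    title_only=False,              # compare title plus abstract
    stopwords=False,               # keep stopwords
    stopwords_language='english',
    strict=False,                  # skip the slower SequenceMatcher.ratio() check
    verbose=False,                 # do not print the similar pairs
    output_path=None,              # None: only report, do not write a file
)

deduplicate_data(asdata, args)
# Prints e.g. "Found N duplicates in dataset with M records."
```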
