Lessons learned after retrieving individual and firm "ids" from git repositories for many years:
It's a semi-automatic process.
The more that can be automated, the better.
Reprocessing is time consuming, and errors propagate downstream: a wrong visualization, a wrong network, a wrong figure on the website.
* There are many algorithms for string similarity (https://www.geeksforgeeks.org/python-similarity-metrics-of-strings/); they can be used to identify similar names and similar emails
* The GitHub REST and GraphQL APIs are great for resolving emails and organizations when the email domain is users.noreply.github.com
* IBM and Intel subdomains are problematic and require manual grouping
* Bots are easy to spot as they do not look human
* LinkedIn often resolves affiliation problems
* Increasing trend to use anonymous emails
to avoid privacy and spam issues
* Danger with .cn .fi .pt
* Danger with alumni accounts
* Danger with a change from non-anonymous to anonymous
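A minimal sketch of how the noreply addresses and bot accounts mentioned above might be pre-screened before any API call. It assumes the two noreply forms GitHub documents (ID+username@users.noreply.github.com and the older username@users.noreply.github.com) and the "[bot]" suffix carried by GitHub App accounts. The function names are hypothetical and not part of any library.

```python
import re

# Assumed noreply forms:
#   12345+login@users.noreply.github.com  (newer, with numeric user id)
#   login@users.noreply.github.com        (older, login only)
NOREPLY_RE = re.compile(
    r"^(?:(?P<id>\d+)\+)?"
    r"(?P<login>[A-Za-z0-9-]+(?:\[bot\])?)"
    r"@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def parse_noreply(email):
    """Return (user_id, login) for a GitHub noreply address, else None."""
    match = NOREPLY_RE.match(email.strip())
    if match is None:
        return None
    user_id = int(match.group("id")) if match.group("id") else None
    return user_id, match.group("login")


def looks_like_bot(login):
    """Heuristic only: GitHub App accounts carry a "[bot]" suffix."""
    return login.lower().endswith("[bot]")
```

The recovered login can then be looked up through the REST endpoint GET /users/{username}, whose response includes a "company" field when the user has filled it in. An anonymous address that cannot be parsed this way still needs manual review.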
.com .org .edu .in help
when the domain is a.b.c.com
success is higher if the affiliation is attributed to c.
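The a.b.c.com rule above can be sketched as keeping only the last labels of the email domain. This is a naive illustration: the function name and the "depth" parameter are hypothetical, and the rule is not reliable for domains with second-level registries (for example something.com.cn), where a larger depth or a public suffix list would be needed.

```python
def org_label(email, depth=2):
    """Collapse a.b.c.com to c.com by keeping the last `depth` domain labels."""
    # Take everything after the last "@"; a bare domain also works.
    domain = email.rsplit("@", 1)[-1].lower().strip(".")
    labels = domain.split(".")
    return ".".join(labels[-depth:])
```

With this, dev@a.b.c.com and ops@x.c.com both map to c.com and can be grouped under one affiliation. Subdomains that do not follow the pattern still need the manual grouping noted earlier.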
Robustness tests
: String similarity for names
: String similarity for emails
: Triangulate with names announced in releases
: Affiliation triangulation with GitHub API
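The two string-similarity checks above could be sketched with the standard library's difflib. This is one possible metric among the many available, not necessarily the one used here. The 0.85 threshold, the function names and the (name, email) tuple layout are assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_pairs(identities, threshold=0.85):
    """Return pairs of (name, email) identities that look alike.

    Compares names and email local parts; the threshold is an assumed
    starting point and should be tuned against manually checked data.
    """
    pairs = []
    for i, (name_a, email_a) in enumerate(identities):
        for name_b, email_b in identities[i + 1:]:
            name_score = similarity(name_a, name_b)
            local_score = similarity(email_a.split("@")[0], email_b.split("@")[0])
            score = max(name_score, local_score)
            if score >= threshold:
                pairs.append(((name_a, email_a), (name_b, email_b), score))
    return pairs
```

The output is a list of candidates for manual review, not a final merge decision, which keeps the process semi-automatic.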
* Account for commits made on behalf of organizations using the 'on-behalf-of: @ORG NAME[AT]ORGANIZATION.COM' commit trailer
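A minimal sketch of pulling that trailer out of a commit message. It assumes the trailer sits on its own line with a real "@" in the address (the [AT] above is read as obfuscation), and it also tolerates an angle-bracketed email, which is an assumption. The function name is hypothetical.

```python
import re

# Matches lines such as:
#   on-behalf-of: @ORG NAME@ORGANIZATION.COM
#   on-behalf-of: @org <name@organization.com>   (assumed variant)
TRAILER_RE = re.compile(
    r"^on-behalf-of:\s*@(?P<org>\S+)\s+"
    r"<?(?P<email>[^\s<>@]+@[^\s<>@]+)>?\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def on_behalf_of(message):
    """Return a list of (org, email) pairs found in a commit message."""
    return [
        (m.group("org"), m.group("email").lower())
        for m in TRAILER_RE.finditer(message)
    ]
```

When a trailer is present, the organization it names is probably a stronger affiliation signal for that commit than the author's email domain.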