lessons-learned.txt

Lessons learned after many years of retrieving individual and firm "ids" from git repositories:

It's a semi-automatic process.
The more that can be automated, the better.
Reprocessing is time consuming. A mistake means a wrong visualization, a wrong network, a wrong figure on the website.


* There are many algorithms for string similarity (https://www.geeksforgeeks.org/python-similarity-metrics-of-strings/). They can be used to identify similar names and similar emails

* The GitHub REST and GraphQL APIs are great for resolving emails and organizations when the email domain is users.noreply.github.com

* IBM and Intel subdomains are problematic and require manual grouping 

* Bots are easy to spot as they do not look human

* LinkedIn often resolves affiliation problems

* Increasing trend to use anonymous emails
to avoid privacy and spam issues

* Danger with .cn .fi .pt

* Danger with alumni accounts
* Danger with changes from non-anonymous to anonymous emails
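
A minimal sketch of how the anonymous GitHub addresses mentioned above could be turned back into an account name for an API lookup. It assumes the two common address shapes (login@users.noreply.github.com and ID+login@users.noreply.github.com). The function names parse_noreply and user_api_url are hypothetical. The REST "users" endpoint returns a JSON profile whose "company" field may or may not be filled in, so this only narrows the manual work.

```python
import re
from urllib.parse import quote

# Assumed address shapes:
#   login@users.noreply.github.com
#   12345678+login@users.noreply.github.com
NOREPLY_RE = re.compile(
    r"^(?:(?P<user_id>\d+)\+)?"
    r"(?P<login>[A-Za-z0-9-]+(?:\[bot\])?)"
    r"@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def parse_noreply(email):
    """Return (login, user_id) for a GitHub noreply address, else None."""
    match = NOREPLY_RE.match(email.strip())
    if match is None:
        return None
    user_id = match.group("user_id")
    return (match.group("login"), int(user_id) if user_id else None)


def user_api_url(login):
    """Build the REST URL for the account profile (hypothetical helper)."""
    return "https://api.github.com/users/" + quote(login)
```

The returned URL can be fetched with any HTTP client. Unauthenticated requests are rate limited, so a token is advisable for large author lists.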

.com .org .edu .in help

When the domain is a.b.c.com,
success is higher if the affiliation is attributed to c.
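
The a.b.c.com rule can be sketched as below. This is a naive version that simply keeps the last two labels of the domain. The names affiliation_domain and MANUAL_GROUPS are hypothetical, and the override table is left empty on purpose: it is where manually grouped subdomains would go. The two-label rule is an assumption and will be wrong for multi-part suffixes such as co.uk.

```python
# Hypothetical table for domains that need manual grouping.
MANUAL_GROUPS = {
    # "some.odd.subdomain.example": "example.com",
}


def affiliation_domain(email, labels=2):
    """Collapse the email's domain to its last `labels` labels."""
    domain = email.rsplit("@", 1)[-1].strip().lower().rstrip(".")
    # Manual decisions take priority over the naive rule.
    if domain in MANUAL_GROUPS:
        return MANUAL_GROUPS[domain]
    return ".".join(domain.split(".")[-labels:])
```

For example, affiliation_domain("dev@a.b.c.com") gives "c.com", which can then be mapped to the firm c.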

Robustness tests
: String similarity for names
: String similarity for emails
: Triangulate with names announced in releases
: Affiliation triangulation with GitHub API
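
A minimal sketch of the string-similarity tests, using SequenceMatcher from the Python standard library as one possible metric (the notes do not say which metric was used). The function names and the 0.85 threshold are assumptions. The output is a list of candidate pairs for manual review, not an automatic merge.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def similar_pairs(values, threshold=0.85):
    """Return (a, b, score) for every distinct pair at or above threshold."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs
```

The same function works for names and for emails. For emails it may help to compare only the part before the @ sign, since shared domains inflate the score.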
* Account for commits on behalf of organizations using the 'on-behalf-of: @ORG NAME[AT]ORGANIZATION.COM' commit trailer
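
A minimal sketch of picking that trailer out of a commit message. It assumes one trailer per line, an organization handle starting with @, and a real @ sign in the email (the [AT] above is only obfuscation in these notes). Angle brackets around the email are accepted but not required, which is an assumption. The name on_behalf_of is hypothetical.

```python
import re

TRAILER_RE = re.compile(
    r"^on-behalf-of:[ \t]*@(?P<org>[\w.-]+)[ \t]+"
    r"<?(?P<email>[^\s<>]+@[^\s<>]+)>?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def on_behalf_of(message):
    """Return a list of (org, email) found in on-behalf-of trailers."""
    return [
        (m.group("org"), m.group("email").lower())
        for m in TRAILER_RE.finditer(message)
    ]
```

Commit messages can be taken from git log output. A commit with this trailer can be attributed to the organization even when the author email is personal or anonymous.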