Pubtator vs user annotation offset differences #183
Comments
Using the same example that @gtsueng used above (Pubtator/BioC offset examples for document 17494702), I still don't see why there is a difference; these offsets should be the same. I also checked the offset and exact location that the first concept in each section (title and abstract) should have, and it actually seems that the user annotation .json file is correct (the offset should be 125), but I need to look into this more.
http://mark2cure.org/admin/document/pubtator/9903/?_changelist_filters=q%3D17494702
The issue is mostly solved, but there are certain docs in which the offsets start out fine and then drift further apart towards the middle and end of the doc. Looking at what these docs have in common, the source of the problem appears to be the '%' sign: every time there's a percent sign, the user offsets increase by 2, while the Pubtator offsets do not.
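To make the '%' drift concrete, here is a minimal sketch that predicts the user-side offset from a Pubtator offset and converts back again. It assumes only what is described above, namely that each '%' before a span adds 2 to the user offset. The function names and the sample sentence are made up for illustration and are not part of the mark2cure code. A two-character shift per '%' is what you would see if '%' were stored as a three-character sequence such as '%25', but that is a guess and has not been confirmed in this thread.

```python
def expected_user_offset(text: str, pubtator_offset: int) -> int:
    """Predict the user offset, assuming each '%' before the span adds 2."""
    percent_count = text[:pubtator_offset].count("%")
    return pubtator_offset + 2 * percent_count


def user_to_pubtator_offset(text: str, user_offset: int) -> int:
    """Invert the shift by walking the text and tracking the shifted offset."""
    shifted = 0
    for position, char in enumerate(text):
        if shifted >= user_offset:
            return position
        # Under the assumption above, a '%' counts as 3 characters on the user side.
        shifted += 3 if char == "%" else 1
    return len(text)


# Hypothetical sample text, not taken from any real document.
sample = "Up 5% in A and 10% in B"
print(expected_user_offset(sample, sample.index("A")))  # 11 (one '%' before "A")
print(expected_user_offset(sample, sample.index("B")))  # 26 (two '%' before "B")
print(user_to_pubtator_offset(sample, 26))              # 22
```

If the real cause turns out to be something else, the correction would need to change, but the same "count the shifting characters before the span" approach should still be usable for checking suspect docs.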
Just saving @gtsueng's notes in the issue itself: Still problematic.xlsx. Current status: waiting to hear back from the Pubtator team.
This is an example of how to compare documents that may be problematic.
The as_writer function must be expanded to support documents that don't have an existing Pubtator result. This is required for "new", never-seen-before documents and for documents whose Pubtator files were purposefully deleted.
This is required to do a smart lookup for documents that don't have Pubtators available in the DB for the bulk creation method. It finds the ones that aren't available and pulls those from the document table if necessary.
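The lookup described above could be sketched roughly like this. Plain dicts stand in for the Pubtator and document tables, and every name here (`split_by_pubtator_availability`, `collect_content`, the sample IDs and strings) is hypothetical; the real code goes through the DB models, which are not shown in this thread.

```python
def split_by_pubtator_availability(document_ids, pubtator_by_doc):
    """Return (ids with a stored Pubtator result, ids without one), in input order."""
    available = [doc_id for doc_id in document_ids if doc_id in pubtator_by_doc]
    missing = [doc_id for doc_id in document_ids if doc_id not in pubtator_by_doc]
    return available, missing


def collect_content(document_ids, pubtator_by_doc, document_table):
    """Prefer the stored Pubtator content; fall back to the document table."""
    available, missing = split_by_pubtator_availability(document_ids, pubtator_by_doc)
    content = {doc_id: pubtator_by_doc[doc_id] for doc_id in available}
    for doc_id in missing:
        # Only fall back when the document table actually has the doc.
        if doc_id in document_table:
            content[doc_id] = document_table[doc_id]
    return content


# Hypothetical stand-ins for the two tables.
pubtator_rows = {17494702: "pubtator content"}
document_rows = {17494702: "raw document", 123: "raw document for a new doc"}

print(collect_content([17494702, 123, 999], pubtator_rows, document_rows))
# {17494702: 'pubtator content', 123: 'raw document for a new doc'}
```

Docs found in neither table (999 in the example) are simply left out, so the caller can decide whether to request them separately.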
- separation into 2 models, with the new one exclusively for tracking requests
- new validation approach to prevent return type confusion
- migrations for production to use an enum of API status
- new regex for getting session_ids
- ditching the response text and using the status code only, because they keep changing it
I'm waiting to close this until there has been further testing, now that all docs have Pubtator results and the updates are scheduled.
This may be contributing to issue #133. Basically, the user annotations are usually in line with the pubtator annotations in terms of the offset only for the first sentence worth of characters. Thereafter, there is usually a difference in the offsets of 1.
E.g., in group 5, doc 17494702, the Pubtator annotations are:
offset | text
68 | Fbx2
270 | Fbx2
463 | Fbx2
630 | Fbxo2
But user annotations are:
offset | text
68 | Fbx2
269 | Fbx2
462 | Fbx2
629 | Fbxo2
Notice that the offsets match in the first annotation but are off by one thereafter. This is a consistent pattern observed in the Pubtator vs. user annotation data, so comparisons between user annotations and Pubtator annotations will give poor matches. Also, since the user annotations are off by 1, you can imagine why the feedback highlight is always missing the single-character number in examples like 'N-glycanase 1'.
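A quick way to check a doc for this pattern is to pair the two annotation lists in order and print the offset difference for each pair. This is only a sketch using the offsets listed above for doc 17494702; the function name is made up, and it assumes both lists contain the same concepts in the same order.

```python
# Offsets copied from the two tables above for doc 17494702.
pubtator_annotations = [(68, "Fbx2"), (270, "Fbx2"), (463, "Fbx2"), (630, "Fbxo2")]
user_annotations = [(68, "Fbx2"), (269, "Fbx2"), (462, "Fbx2"), (629, "Fbxo2")]


def offset_deltas(pubtator, user):
    """Pair annotations in order and return (text, pubtator_offset - user_offset)."""
    deltas = []
    for (p_offset, p_text), (u_offset, u_text) in zip(pubtator, user):
        if p_text != u_text:
            raise ValueError(f"text mismatch: {p_text!r} vs {u_text!r}")
        deltas.append((p_text, p_offset - u_offset))
    return deltas


deltas = offset_deltas(pubtator_annotations, user_annotations)
print(deltas)
# [('Fbx2', 0), ('Fbx2', 1), ('Fbx2', 1), ('Fbxo2', 1)]
```

A delta that is 0 for the first annotation and a constant 1 afterwards matches the pattern described here, whereas a delta that keeps growing through the doc would point to something different.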