Pubtator vs user annotation offset differences #183

Open
gtsueng opened this issue Jun 13, 2016 · 8 comments

@gtsueng
Collaborator

gtsueng commented Jun 13, 2016

This may be contributing to issue #133. The user annotation offsets usually agree with the Pubtator annotation offsets only for the first sentence's worth of characters. After that, the offsets usually differ by 1.

E.g., in group 5, doc 17494702, the Pubtator annotations are:
offset | text
68 | Fbx2
270 | Fbx2
463 | Fbx2
630 | Fbxo2

But user annotations are:
offset | text
68 | Fbx2
269 | Fbx2
462 | Fbx2
629 | Fbxo2

Notice that the offsets match for the first annotation but are off by one thereafter. This pattern is consistent across the Pubtator vs. user annotation data, so comparisons between user annotations and Pubtator annotations will give poor matches. Also, since the user annotations are off by 1, you can imagine why the feedback highlight is always missing the single-character number in examples like 'N-glycanase 1'.
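
To make the pattern easy to check on other documents, here is a minimal sketch that pairs the two annotation lists in order and reports the per-annotation difference. The offsets are the ones listed above for doc 17494702; the function name `offset_deltas` is hypothetical and not part of the mark2cure codebase.

```python
# Offsets copied from the two tables above for doc 17494702.
pubtator = [(68, "Fbx2"), (270, "Fbx2"), (463, "Fbx2"), (630, "Fbxo2")]
user = [(68, "Fbx2"), (269, "Fbx2"), (462, "Fbx2"), (629, "Fbxo2")]


def offset_deltas(reference, candidate):
    """Pair annotations in order; return (text, reference offset - candidate offset)."""
    deltas = []
    for (ref_off, ref_text), (cand_off, cand_text) in zip(reference, candidate):
        # Only compare annotations that cover the same text.
        if ref_text != cand_text:
            raise ValueError(f"text mismatch: {ref_text!r} vs {cand_text!r}")
        deltas.append((ref_text, ref_off - cand_off))
    return deltas


deltas = offset_deltas(pubtator, user)
print(deltas)  # [('Fbx2', 0), ('Fbx2', 1), ('Fbx2', 1), ('Fbxo2', 1)]
```

A delta that is 0 for the first annotation and a constant 1 afterwards points at a single extra character being counted on one side, rather than at random tokenization differences.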

@gtsueng gtsueng added the bug label Jun 13, 2016
@JTFouquier
Collaborator

Using the same example as @gtsueng used above,
User annotation offsets for document 17494702:
http://mark2cure.org/task/entity-recognition/m2c/17494702.json
screen shot 2016-06-15 at 10 06 51 am

Pubtator/BioC offset examples for document 17494702:
http://mark2cure.org/admin/document/pubtator/9902/?_changelist_filters=q%3D17494702
screen shot 2016-06-15 at 10 07 10 am

I'm still not really seeing why there is a difference, but these offsets should be the same. I also checked the offset and exact location that the first concept in each section (title and abstract) should have, and it actually seems like the user annotation .json file is correct (offset should be 125), but I need to look into this more.

@JTFouquier
Collaborator

http://mark2cure.org/admin/document/pubtator/9903/?_changelist_filters=q%3D17494702
Looking here at the title of this doc in the Pubtator admin, if you use a character counter, the length of the title is 125 but the offset is captured as 126.
screen shot 2016-06-15 at 10 27 56 am

http://www.lettercount.com/
screen shot 2016-06-15 at 10 28 26 am
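
A 125-character title followed by an offset of 126 is what you would see if one side counts a single separator character between the title and the abstract and the other does not. The sketch below illustrates that arithmetic only. It assumes sections are laid out end to end with a fixed-length separator, which is not confirmed anywhere in this thread, and `passage_offsets` and the sample strings are hypothetical.

```python
def passage_offsets(sections, separator_len):
    """Return the start offset of each section when sections are laid out
    end to end with `separator_len` characters between them."""
    offsets = []
    position = 0
    for text in sections:
        offsets.append(position)
        position += len(text) + separator_len
    return offsets


title = "T" * 125            # stand-in for the 125-character title
abstract = "Abstract text"   # placeholder, the real abstract is not needed here

no_separator = passage_offsets([title, abstract], separator_len=0)
one_separator = passage_offsets([title, abstract], separator_len=1)
print(no_separator)   # [0, 125]
print(one_separator)  # [0, 126]
```

Under this assumption, every annotation in the title has the same offset on both sides, and every annotation in the abstract differs by exactly the separator length.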

x0xMaximus added a commit that referenced this issue Jun 22, 2016
@gtsueng
Collaborator Author

gtsueng commented Sep 30, 2016

The issue is mostly solved, but there are certain docs in which the offsets start out fine and then drift further apart towards the middle and end of the doc. Looking at what these docs have in common, the source of the problem appears to be the '%' sign: every time there is a percent sign, the user offsets increase by 2, while the Pubtator offsets do not.
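
If the drift really is 2 characters per percent sign, the expected difference at any position can be predicted by counting the '%' characters that precede it. A minimal sketch of that check follows; `expected_drift` and the sample sentence are hypothetical and only illustrate the counting.

```python
def expected_drift(text, offset, per_percent=2):
    """Predicted user-minus-Pubtator drift at `offset`, assuming each '%'
    before that position adds `per_percent` characters to the user offset."""
    return per_percent * text[:offset].count("%")


# Hypothetical sample text with two percent signs before the last term.
sample = "Growth fell by 40% in mutants and 15% in controls, unlike Fbx2."

drift_at_mutants = expected_drift(sample, sample.index("mutants"))
drift_at_fbx2 = expected_drift(sample, sample.index("Fbx2"))
print(drift_at_mutants, drift_at_fbx2)  # 2 4
```

Comparing this prediction with the observed delta for each annotation would show whether '%' accounts for all of the remaining drift or whether other characters are involved as well.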

@x0xMaximus x0xMaximus self-assigned this Oct 19, 2016
@x0xMaximus
Member

Just saving @gtsueng's notes in the issue itself: Still problematic.xlsx

Current status is waiting to hear back from Pubtator team

@x0xMaximus
Member

This is an example of how to compare documents that may be problematic:

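# Run in a Django shell with the Document model already imported.
doc = Document.objects.get(document_id=9140401)
writer = Document.objects.as_writer(documents=[doc])

# Text of the last section as stored in our database
doc.section_set.last().text
# Text of the second passage in the Pubtator-based writer (presumably the same section)
writer.collection.documents[0].passages[1].text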

The as_writer function must be expanded to support documents that don't have an existing pubtator result, required for "new" never seen before documents, or documents whose pubtator files were purposefully deleted.
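
For the comparison example above, eyeballing two long strings is error-prone, so it can help to locate the first position where they diverge. A minimal sketch, using plain strings as stand-ins for `doc.section_set.last().text` and `writer.collection.documents[0].passages[1].text`; `first_difference` and the sample strings are hypothetical.

```python
def first_difference(a, b):
    """Return the index of the first position where a and b differ,
    or None if the strings are identical."""
    for i, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return i
    # One string is a prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical stand-ins for the two text values being compared.
local_text = "Levels rose by 40% in mutant cells."
pubtator_text = "Levels rose by 40 % in mutant cells."

idx = first_difference(local_text, pubtator_text)
print(idx, repr(local_text[idx:idx + 10]), repr(pubtator_text[idx:idx + 10]))
```

Printing a short window around the returned index shows which character is responsible for the offsets drifting from that point on.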

x0xMaximus added a commit that referenced this issue Oct 25, 2016
This is required to do a smart lookup for documents that don't have pubtators available in the DB for the bulk creation method. It finds the ones that aren't available and pulls those from the document table if necessary.
x0xMaximus added a commit that referenced this issue Oct 27, 2016
- separation into 2 models, new one exclusively for tracking requests
- new validation approach to prevent return type confusion
@x0xMaximus
Member

e9d490a and 42c4a32 moved this to the production server but we're currently on hold waiting to hear back from them. Authentication (or some mechanism to serve 403 Forbidden) was added to the server without notice.

x0xMaximus added a commit that referenced this issue Nov 15, 2016
- migrations for production to use enum of api status
- new regex for getting session_ids
- ditching response text and using status code only b/c they keep changing
x0xMaximus added a commit that referenced this issue Nov 16, 2016
x0xMaximus added a commit that referenced this issue Nov 16, 2016
@x0xMaximus
Member

I'm holding off on closing this until we've done further testing, now that all docs have pubtator results and updates are scheduled.

@x0xMaximus
Member

Current analysis of broken Pubtator responses we've stored in the database
