Pubtator vs user annotation offset differences #183

gtsueng · 2016-06-13T21:03:30Z

This may be contributing to issue #133. The user annotation offsets usually match the pubtator annotation offsets only for roughly the first sentence's worth of characters. After that, the offsets usually differ by 1.

E.g., in group 5, doc 17494702, the pubtator annotations are:
offset | text
68 | Fbx2
270 | Fbx2
463 | Fbx2
630 | Fbxo2

But user annotations are:
offset | text
68 | Fbx2
269 | Fbx2
462 | Fbx2
629 | Fbxo2

Notice that the offsets match in the first annotation, but are off by one thereafter. This is a consistent pattern observed in the pubtator vs user annotation data, so comparisons between user annotations and pubtator annotations will give poor matches. Also, since the user annotations are off by 1, you can imagine why the feedback highlight is always missing the single-character number in examples like 'N-glycanase 1'.
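For reference, a minimal sketch of the comparison described above, using only the offsets listed for doc 17494702. The helper name `offset_deltas` is made up for illustration and is not part of the mark2cure codebase.

```python
def offset_deltas(pubtator, user):
    """Pair annotations in order and return (text, pubtator_offset, user_offset, delta)."""
    rows = []
    for (p_off, p_text), (u_off, u_text) in zip(pubtator, user):
        if p_text != u_text:
            raise ValueError(f"text mismatch: {p_text!r} vs {u_text!r}")
        rows.append((p_text, p_off, u_off, p_off - u_off))
    return rows


# Offsets copied from the tables above (doc 17494702).
pubtator = [(68, "Fbx2"), (270, "Fbx2"), (463, "Fbx2"), (630, "Fbxo2")]
user = [(68, "Fbx2"), (269, "Fbx2"), (462, "Fbx2"), (629, "Fbxo2")]

deltas = [row[3] for row in offset_deltas(pubtator, user)]
print(deltas)  # [0, 1, 1, 1]
```

A delta of 0 for the first annotation followed by a constant 1 is the pattern reported here.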

JTFouquier · 2016-06-15T17:12:18Z

Using the same example as @gtsueng used above,
User annotation offsets for document 17494702:
http://mark2cure.org/task/entity-recognition/m2c/17494702.json

Pubtator/BioC offset examples for document 17494702:
http://mark2cure.org/admin/document/pubtator/9902/?_changelist_filters=q%3D17494702

I'm still not really seeing why there is a difference, but these offsets should be the same. I also checked the offset and exact location that the first concept in each section (title and abstract) should have, and it actually seems like the user annotation .json file is correct (offset should be 125), but I need to look into this more.

JTFouquier · 2016-06-15T17:30:09Z

http://mark2cure.org/admin/document/pubtator/9903/?_changelist_filters=q%3D17494702
Looking here at the title of this doc in the Pubtator admin, if you use a character counter, the length of the title is 125 but the offset is captured as 126.

http://www.lettercount.com/
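One way to picture the 125 vs 126 difference is a sketch that computes where each passage starts when the passages are joined with or without a one-character separator. This is only an assumption about the cause; nothing in this thread confirms which convention either side uses. `passage_offsets` is a hypothetical helper and the strings are placeholders with the right lengths, not the real title or abstract.

```python
def passage_offsets(passages, separator_len):
    """Return the starting document offset of each passage."""
    offsets = []
    position = 0
    for text in passages:
        offsets.append(position)
        # Each passage is followed by separator_len extra characters.
        position += len(text) + separator_len
    return offsets


title = "T" * 125    # placeholder with the title's length (125 characters)
abstract = "A" * 40  # placeholder; its length does not matter here

no_separator = passage_offsets([title, abstract], separator_len=0)
one_separator = passage_offsets([title, abstract], separator_len=1)
print(no_separator)   # [0, 125]
print(one_separator)  # [0, 126]
```

Under that assumption, every annotation in the title would agree between the two sources and every annotation in the abstract would differ by exactly 1, which matches the tables at the top of this issue.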

gtsueng · 2016-09-30T14:54:26Z

Issue is mostly solved, but it looks like there are certain docs in which the offsets start out fine but get worse toward the middle and end of the doc. Looking at what these docs have in common, the source of the problem appears to be the '%' sign: every time there's a percent sign, user offsets increase by 2, while pubtator offsets do not.
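A shift of exactly 2 per '%' is what you would see if each '%' occupied three characters (for example the percent-encoded form `%25`) on one side and one character on the other. That is a guess and has not been confirmed in this thread. The sketch below shows the arithmetic under that assumption; the function names and the sample sentence are made up for illustration.

```python
from urllib.parse import quote


def plain_to_shifted(text, offset):
    """Offset as it would appear if every '%' before it took three characters."""
    return offset + 2 * text[:offset].count("%")


def shifted_to_plain(text, shifted):
    """Invert plain_to_shifted by walking the plain text."""
    position = 0
    for index, char in enumerate(text):
        if position >= shifted:
            return index
        position += 3 if char == "%" else 1
    return len(text)


# Percent-encoding turns one character into three.
encoded_percent = quote("%")
print(encoded_percent)  # %25

text = "Levels rose 5% in A and 12% in B cells"  # invented sample sentence
plain = text.index("B cells")
shifted = plain_to_shifted(text, plain)
print(shifted - plain)  # 4, i.e. 2 per preceding '%'
```

If the assumption holds, `shifted_to_plain` could be used to map the drifting offsets back onto the plain text before comparing them.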

x0xMaximus · 2016-10-19T19:07:10Z

Just saving @gtsueng notes in the Issue itself: Still problematic.xlsx

Current status is waiting to hear back from Pubtator team

x0xMaximus · 2016-10-24T23:10:42Z

This is an example of how to compare documents that may be problematic:

doc = Document.objects.get(document_id=9140401)
writer = Document.objects.as_writer(documents=[doc])

doc.section_set.last().text
writer.collection.documents[0].passages[1].text
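Once both strings are in hand, a small helper can point at the first position where they differ. This is a sketch in plain Python, independent of the Django models; `first_divergence` is a hypothetical name, and the two sample strings are invented stand-ins for `doc.section_set.last().text` and the writer's passage text.

```python
def first_divergence(a, b):
    """Return the index of the first differing character, or None if equal."""
    for index, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return index
    if len(a) != len(b):
        # One string is a prefix of the other.
        return min(len(a), len(b))
    return None


# Invented sample strings, not taken from document 9140401.
section_text = "Expression fell by 5% in treated cells."
passage_text = "Expression fell by 5%25 in treated cells."

index = first_divergence(section_text, passage_text)
if index is not None:
    # Show a little context on each side of the first mismatch.
    print(index)
    print(repr(section_text[max(0, index - 5):index + 5]))
    print(repr(passage_text[max(0, index - 5):index + 5]))
```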

The as_writer function must be expanded to support documents that don't have an existing pubtator result, required for "new" never seen before documents, or documents whose pubtator files were purposefully deleted.

This is required to do a smart lookup for documents that don’t have pubtators in the DB available for the bulk creation method. It finds the ones that aren’t available and pulls those from the document table if necessary.
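The lookup described above can be pictured as a simple partition of document IDs. This is a sketch in plain Python rather than the actual `as_writer` implementation; `split_by_stored_pubtator` and the sample ID sets are assumptions for illustration only.

```python
def split_by_stored_pubtator(document_ids, stored_ids):
    """Split IDs into those with a stored pubtator result and those without."""
    stored = set(stored_ids)
    have = [doc_id for doc_id in document_ids if doc_id in stored]
    missing = [doc_id for doc_id in document_ids if doc_id not in stored]
    return have, missing


# Hypothetical data: pretend only 17494702 has a stored pubtator result.
requested = [9140401, 17494702]
have, missing = split_by_stored_pubtator(requested, stored_ids=[17494702])
print(have)     # [17494702]
print(missing)  # [9140401]
```

The `missing` IDs are the ones whose text would have to come from the document table instead.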

- separation into 2 models, new one exclusively for tracking requests
- new validation approach to prevent return type confusion

x0xMaximus · 2016-11-02T17:35:40Z

e9d490a and 42c4a32 moved this to the production server but we're currently on hold waiting to hear back from them. Authentication (or some mechanism to serve 403 Forbidden) was added to the server without notice.

- migrations for production to use enum of api status
- new regex for getting session_ids
- ditching response text and using status code only b/c they keep changing

x0xMaximus · 2016-11-16T21:04:41Z

I'm waiting to close this until we've done further testing, now that all docs have a pubtator and updates are scheduled.

x0xMaximus · 2016-12-07T08:52:46Z

Current analysis of broken Pubtator responses we've stored in the database

gtsueng added the bug label Jun 13, 2016

x0xMaximus added a commit that referenced this issue Jun 22, 2016

Addresses (Solves?) #183

08c07ad

x0xMaximus self-assigned this Oct 19, 2016

x0xMaximus added the in progress label Oct 19, 2016

x0xMaximus added a commit that referenced this issue Oct 26, 2016

[doc] more pub debugging softlimt (addresses #183)

6df4c11

x0xMaximus added a commit that referenced this issue Oct 27, 2016

[doc.pubtator] validity + request tracking: #183

073eb0f

- separation into 2 models, new one exclusively for tracking requests
- new validation approach to prevent return type confusion

x0xMaximus added a commit that referenced this issue Nov 1, 2016

Addresses #183 - more model+task development

212445a

x0xMaximus added a commit that referenced this issue Nov 15, 2016

[#183] tasks, rabbitMQ fallback, signals and api

db584f7

- migrations for production to use enum of api status
- new regex for getting session_ids
- ditching response text and using status code only b/c they keep changing

x0xMaximus added a commit that referenced this issue Nov 16, 2016

Addresses #183 and #177

a04023b

x0xMaximus added a commit that referenced this issue Nov 16, 2016

Addresses #183 and #177

77fd56b

x0xMaximus added a commit that referenced this issue Nov 16, 2016

Account orphaned+duplicate requests (#183 & #177)

08c211d

x0xMaximus added a commit that referenced this issue Dec 7, 2016

Text based Pubtator analysis. Addresses #183

2a03f67

x0xMaximus added a commit that referenced this issue Dec 7, 2016

More DB santizing of Pubtator Entries. #183

401330f

x0xMaximus added a commit that referenced this issue Dec 7, 2016

[edit] only empty pubtators. #183

e2acdce

x0xMaximus added a commit that referenced this issue Dec 7, 2016

[edit] account for excess pubtators. #183

7ecab9c

x0xMaximus added a commit that referenced this issue Dec 7, 2016

pubtator migrations + update schedule. #183

9445524

x0xMaximus added a commit that referenced this issue Dec 7, 2016

prevent backlogging of unnecessary updates. #183

d5ddc54
