fries-data-representation-spec.txt
Version: 0.7
Authors: Hans, Ed, Mihai, Pradeep
General
-------
Goal:
- create a light-weight, flexible, explicit yet readable JSON-based representation for textual objects and annotations
- use uniform syntactic representations as much as possible for offsets, mentions, arguments, but allow variation
on type and argument names as well as any additional frame slots somebody needs to capture relevant semantic info
- the result should be human readable and understandable without tool support, that is, pulling a couple of files
into a text editor should be enough to understand what they represent and not require too much jumping around
- the representation can be redundant to increase readability, for example, we might add a text string or sentence
pointer even though those could be computed from offset information
Conventions:
- key names are hyphenated (not underscored)
- type names, similarly, are hyphenated (e.g., "complex-assembly"), unless they correspond to some ontology
- booleans start with "is-"
- objects are of two types:
  - frames with unique IDs for things that need to be pointed to by other frames (basically, every
    annotation object becomes a frame)
  - simple embedded structured objects
- each object has an object-type for easier translation and to allow type heterogeneity
- each frame has a meta-info (which can be (partially) inherited from the file meta object)
- annotation files can contain any combination of frames, but:
  - we'll try to modularize as well as possible, keeping only closely related things in the same files
  - frames that need to be referenced by others (e.g., passages, sentences) will generally go into their own file
  - files should follow some logical naming convention describing the types of frames in them
- we will use pseudo-globally unique IDs for frames, so files can easily be combined without having to worry
about clashing IDs
- there is no mandatory convention for unique IDs, but the suggestion is to use a scheme like this:
    <type-prefix>-<doc-id>-<org>-<run-id>-<frame-id>, for example: "pass-PMC3847091-uaz-r13-11"
  how run and frame IDs are constructed is up to the data producer; the only suggestion is to keep things
  short, so that the resulting files remain readable while the IDs stay unique
- one can of course use real GUIDs; however, those conflict with the readability goal
- there is no requirement to keep objects in a particular order in an annotation file, however, if one has control
over the order in which JSON objects and slots are generated, a logical order for easy readability is preferred
- indexes in the various objects indicate some relative ordering, e.g., the sentences in a passage, the passages in
a document, the mentions in a sentence, etc.; they should be monotonically increasing, but they are not necessarily
contiguous; they are generally 0-based; they are not always mandatory (e.g., for mentions, events, etc.); and what
they are relative to depends on the frame type and possibly the data producer (except for sentences and passages).
They are primarily useful for generating readable IDs
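The suggested ID scheme can be sketched in Python as follows. The helper name make_frame_id, its parameters, the lower-casing of the organization, and the way indexes are joined into the frame-id part are all assumptions made for illustration; the spec leaves these choices to the data producer.

```python
def make_frame_id(type_prefix, doc_id, org, run_id, *indexes):
    """Build an ID of the form <type-prefix>-<doc-id>-<org>-<run-id>-<frame-id>.

    The frame-id part is formed here by joining index values with "-",
    which is only one possible choice.
    """
    if not indexes:
        raise ValueError("at least one index is needed for the frame-id part")
    frame_part = "-".join(str(i) for i in indexes)
    return "-".join([type_prefix, doc_id, org.lower(), run_id, frame_part])


# Passage 11 of the document, and sentence 2 within that passage.
passage_id = make_frame_id("pass", "PMC3847091", "UAZ", "r13", 11)
sentence_id = make_frame_id("sent", "PMC3847091", "UAZ", "r13", 11, 2)
print(passage_id)
print(sentence_id)
```

Because the document, organization and run are all part of the ID, frames from different producers or runs can be merged into one file without renaming.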
TO DO, Issues:
+ can JSON strings contain literal newlines? Answer: no, they have to be encoded as \n
+ define a compound-mention frame to handle the special args variant Mihai uses to deal with complexes
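A quick way to see the newline encoding in practice, using Python's standard json module (the sample text is made up):

```python
import json

text = "first line\nsecond line"

# Serializing escapes the line break as the two characters "\" and "n".
encoded = json.dumps({"text": text})

# Parsing restores the original string, including the real line break.
decoded = json.loads(encoded)
print(encoded)
```

Standard JSON writers take care of this escaping, so passage texts with line breaks survive a round trip unchanged.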
Object Representation
---------------------
Annotation files:
{ "object-type": "frame-collection",
"object-meta": {"object-type": "meta-info", "component": "REACH", "organization": "UAZ", "doc-id": "PMC3847091", "processing-start": "...", "processing-end": "...", ....},
"frames": [ .... ] }
Passages:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "nxml2fries"},
"frame-id": "pass-PMC3847091-uaz-r13-11",
"frame-type": "passage",
"index": 11,
"section-id": "s1",
"section-name": null,
"is-title": false,
"text": "Here we show that ASPP2 is a novel substrate of RAS/MAPK. Phosphorylation of ASPP2 by MAPK is required for the RAS-induced translocation of ASPP2, which results in the increased binding to p53. Consequently, the pro-apoptotic activity of ASPP2 is increased by the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation mutant fails to do so. Thus phosphorylation of ASPP2 by RAS/MAPK pathway provides a novel link between RAS and p53 in regulating apoptosis. " }
Sentences:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "CoreNLP"},
"frame-id": "sent-PMC3847091-uaz-r13-11-2",
"frame-type": "sentence",
// passage: mandatory, id of passage this sentence is part of
"passage": "pass-PMC3847091-uaz-r13-11",
// index: optional, zero-based, passage-local sentence number for this sentence;
// useful for generating ID postfixes, e.g., ...<pass-index>-<sent-index>...
"index": 2,
// start/end-pos: mandatory, absolute or relative text positions to delineate the text string
// of this sentence in the passage or document text - more details on offset fields below
"start-pos": {"object-type": "relative-pos", "reference": "pass-PMC3847091-uaz-r13-11", "offset": 194},
"end-pos": {"object-type": "relative-pos", "reference": "pass-PMC3847091-uaz-r13-11", "offset": 343, "is-closed": false},
// text: mandatory, surface text string of this sentence, possibly normalized;
// the texts of entity and other mentions must be exact substrings of sentence texts
"text": "Consequently, the pro-apoptotic activity of ASPP2 is increased by the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation mutant fails to do so." }
Entity Mentions:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"},
"frame-id": "ment-PMC3847091-uaz-r13-11-2-4",
"frame-type": "entity-mention",
// index: optional, sentence-local number for this mention from this component,
// useful for generating ID postfixes, e.g., ...<sent-index>-<ment-index>
"index": 4,
// sentence: optional, sentence containing this mention, can also be determined from offsets
"sentence": "sent-PMC3847091-uaz-r13-11-2",
// start/end-pos: mandatory, absolute or relative text positions to delineate the text string
// of this mention in the sentence or document text - more details on offsets below
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 105, "context-start": "ASPP2/2"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 110, "context-end": "ASPP2/2", "is-closed": false},
// text: mandatory, surface text string of this mention
"text": "ASPP2",
// type: mandatory - if at all possible, primary NER or ontology type extracted for this mention,
// types from an agreed upon type vocabulary are preferred, but not a requirement
"type": "protein",
// subtype: optional - secondary NER or ontology type extracted for this mention
"subtype": null,
// xrefs: optional - as applicable, cross-references to other information relevant to this mention,
// the meaning of the cross-reference depends on the type of the cross-reference object, here we
// point to an external database ID for this protein (similar to the BioPAX representation)
"xrefs": [{"object-type": "db-reference", "namespace": "UniProt", "id": "Q13625"}]
}
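Since mention texts must be exact substrings of sentence texts, a consumer can sanity-check a mention against its sentence. The following Python sketch does that for the mention above; the function name is hypothetical, and it assumes both positions are relative-pos objects whose reference is the sentence.

```python
def mention_matches_sentence(mention, sentence_text):
    """Check that a mention's offsets select its text from the sentence text."""
    start = mention["start-pos"]["offset"]
    end = mention["end-pos"]["offset"]
    if mention["end-pos"].get("is-closed", False):
        end += 1  # a closed end offset points at the last character itself
    return sentence_text[start:end] == mention["text"]


sentence_text = ("Consequently, the pro-apoptotic activity of ASPP2 is increased "
                 "by the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation "
                 "mutant fails to do so.")

mention = {
    "frame-id": "ment-PMC3847091-uaz-r13-11-2-4",
    "start-pos": {"object-type": "relative-pos",
                  "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 105},
    "end-pos": {"object-type": "relative-pos",
                "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 110,
                "is-closed": False},
    "text": "ASPP2",
}

ok = mention_matches_sentence(mention, sentence_text)
print(ok)
```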
Event Mentions:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"},
"frame-id": "evem-PMC3847091-uaz-r13-11-2-1",
"frame-type": "event-mention",
// index: optional, sentence-local number for this mention from this component,
// useful for generating ID postfixes, e.g., ...<sent-index>-<ment-index>
"index": 1,
// sentence: optional, sentence containing this mention, can also be determined from offsets
"sentence": "sent-PMC3847091-uaz-r13-11-2",
// start/end-pos: mandatory, absolute or relative text positions to delineate the text string
// of this mention in the sentence or document text - more details on offsets below
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 105, "context-start": "ASPP2/2"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 126, "context-end": "phosphorylation/1", "is-closed": false},
// text: mandatory, narrowest surface text string of this mention that captures all aspects of this event -
// generally the text span including the textually first and last arguments
"text": "ASPP2 phosphorylation",
// type: mandatory - primary type describing the kind of this event;
// types from an agreed upon type vocabulary are preferred, but not a requirement
// NOTE: the type/subtype representation picked here is a possible choice,
// but the "type" could have simply been "phosphorylation"
"type": "protein-modification",
// subtype: optional - secondary type extracted for this event
"subtype": "phosphorylation",
// arguments: optional - list of textual arguments for this event
"arguments": [{"object-type": "argument",
// argument type: mandatory, describes the syntactic or semantic role of this argument,
// e.g., subj, obj, arg1, arg2, participant, controller, controlled, from-location, to-location, at-location, etc.
// multiple arguments with the same type are possible, e.g., all "participant"s
"type": "participant",
// index: optional, an argument number in case some argument ordering needs to be conveyed
"index": 0,
// text: optional, the text of the argument mention, for readability
"text": "ASPP2",
// arg: mandatory, pointer to the frame describing this argument, generally a text object
// such as an entity-mention, event mention or relation mention
"arg": "ment-PMC3847091-uaz-r13-11-2-4"}],
// is-negated: optional, can be used to represent negated information, if absent the default is false.
"is-negated": false
}
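Because "arg" is only a pointer, a consumer needs an index of frames by ID to follow it. A minimal Python sketch, with a hypothetical helper name and trimmed-down copies of the frames above:

```python
def resolve_arguments(event, frames_by_id):
    """Return (role, frame) pairs for an event mention's arguments.

    Arguments are ordered by their optional "index" (missing indexes sort
    as 0); a pointer to an unknown frame yields None.
    """
    args = sorted(event.get("arguments", []), key=lambda a: a.get("index", 0))
    return [(a["type"], frames_by_id.get(a["arg"])) for a in args]


entity_mention = {"object-type": "frame",
                  "frame-id": "ment-PMC3847091-uaz-r13-11-2-4",
                  "frame-type": "entity-mention",
                  "text": "ASPP2", "type": "protein"}

event_mention = {"object-type": "frame",
                 "frame-id": "evem-PMC3847091-uaz-r13-11-2-1",
                 "frame-type": "event-mention",
                 "text": "ASPP2 phosphorylation",
                 "type": "protein-modification", "subtype": "phosphorylation",
                 "arguments": [{"object-type": "argument", "type": "participant",
                                "index": 0, "text": "ASPP2",
                                "arg": "ment-PMC3847091-uaz-r13-11-2-4"}]}

# Frames may come from several files; unique IDs make one shared index possible.
frames_by_id = {f["frame-id"]: f for f in [entity_mention, event_mention]}
resolved = resolve_arguments(event_mention, frames_by_id)
for role, frame in resolved:
    print(role, frame["text"] if frame else None)
```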
Relation Mentions:
{ "frame-type": "relation-mention",
otherwise very similar to event mentions }
Entities:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Jun-System", "organization": "CMU"},
"frame-id": "ent-PMC3847091-cmu-r4-1",
"frame-type": "entity",
"index": 1,
"members": ["ment-PMC3847091-uaz-r13-11-2-1", "ment-PMC3847091-uaz-r13-11-2-4", ...] }
Events:
{ "frame-type": "event",
otherwise very similar to entities }
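Entity and event frames group mentions through their "members" list. A consumer that starts from a mention and wants the entity it belongs to can build a reverse index, as in this Python sketch. The helper name is hypothetical, and the sketch assumes each mention belongs to at most one entity (a later entity would overwrite an earlier one).

```python
def mention_to_entity(entity_frames):
    """Map each member mention ID to the ID of the entity frame listing it."""
    index = {}
    for entity in entity_frames:
        for mention_id in entity.get("members", []):
            index[mention_id] = entity["frame-id"]
    return index


entity = {"object-type": "frame",
          "frame-id": "ent-PMC3847091-cmu-r4-1",
          "frame-type": "entity",
          "index": 1,
          "members": ["ment-PMC3847091-uaz-r13-11-2-1",
                      "ment-PMC3847091-uaz-r13-11-2-4"]}

owner = mention_to_entity([entity])
print(owner["ment-PMC3847091-uaz-r13-11-2-4"])
```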
Epistemics:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Nicolas-System", "organization": "CMU"},
"frame-id": "epi-PMC3847091-cmu-r4-11-2",
"frame-type": "epistemics",
// argument: mandatory, the sentence, relation or event mention this epistemic valuation is about
"argument": "sent-PMC3847091-uaz-r13-11-2",
// value: optional, numeric representation of this epistemic valuation
"value": 0.6,
// status: optional, symbolic representation of this epistemic valuation; at least one of
// value or status needs to be represented, but preferably both
"status": "hypothesis" }
Absolute Text Positions (or Offsets):
{ "object-type": "absolute-pos",
// offset: mandatory, an absolute character offset from the beginning of the document
// relative to the document's character encoding, e.g., a UTF8 character offset
"offset": 1234}
Relative, Contextualized Text Positions:
{ "object-type": "relative-pos",
// reference: mandatory, reference frame to which this position is relative, for example,
// if the reference is a sentence object, the position is relative to the start of that sentence
"reference": "sent-PMC3847091-uaz-r13-11-2",
// offset: mandatory, a character offset from the beginning of the reference object
// relative to the document's character encoding, e.g., a UTF8 character offset
"offset": 105,
// context-start: optional, a contextualized offset specified as "<text>/<n>" which denotes
// the start position of the <n>-th occurrence of <text> in the text of the reference object.
// For example, "ASPP2/2" denotes the start offset of the second match for "ASPP2" in the sentence
// text of "sent-PMC3847091-uaz-r13-11-2" relative to the document's character encoding.
// <text> doesn't necessarily have to be a single token and can contain "/", but it always has
// a final "/<n>" suffix; matches are performed case-sensitively and do not have to begin or
// end at word or token boundaries.
"context-start": "ASPP2/2",
// context-end: optional, similar to context-start but denotes the end position of the <n>-th match
// generally, only one of "context-start" or "context-end" should be specified, but if they are
// both provided, they should identify the same position
"context-end": "phosphorylation/1",
// is-closed: optional, default is false, indicates whether this is a closed pointer interval
// where the index points at the character as opposed to outside/behind it. Standard C, Java
// strings have half-open semantics where the start points at the first character and end points
// behind the last one. LDC annotations have closed semantics where end points at the last character.
// So, this is primarily here to allow the representation of LDC end offsets.
"is-closed": false}
Offset Mapping
--------------
Assumptions
- we have two or more independent sets of sentence frames for a particular document, for example,
the list of passages and associated sentences extracted via nxml2fries and the list
of sentences extracted by MedScan.
- we have annotations such as entity mentions whose start/end positions are relative
to sentence list 1, and we have another set whose positions are relative to list 2
- for each set of mentions, we have relative, contextualized start/end positions available
(as sketched out above)
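The contextualized positions assumed here can be resolved against a sentence text with a simple string search. The Python sketch below uses hypothetical function names; it splits "<text>/<n>" at the final "/" and counts case-sensitive matches from left to right. Allowing overlapping matches is an assumption, since the spec does not say how they are counted.

```python
def parse_context(context):
    """Split "<text>/<n>" at the final "/" into (text, n)."""
    text, _, n = context.rpartition("/")
    return text, int(n)


def find_nth(haystack, needle, n):
    """Start offset of the n-th (1-based) match of needle, or -1 if absent."""
    pos = -1
    for _ in range(n):
        pos = haystack.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos


def resolve_context_start(context, ref_text):
    """Offset at which the n-th match begins, or -1."""
    text, n = parse_context(context)
    return find_nth(ref_text, text, n)


def resolve_context_end(context, ref_text, is_closed=False):
    """Offset at which the n-th match ends, or -1.

    Half-open by default (points behind the last character); with
    is_closed=True it points at the last character.
    """
    text, n = parse_context(context)
    start = find_nth(ref_text, text, n)
    if start == -1:
        return -1
    end = start + len(text)
    return end - 1 if is_closed else end


sentence_text = ("Consequently, the pro-apoptotic activity of ASPP2 is increased "
                 "by the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation "
                 "mutant fails to do so.")

print(resolve_context_start("ASPP2/2", sentence_text))
print(resolve_context_end("phosphorylation/1", sentence_text))
```

Because the context token travels with the annotation, the same lookup works on any quasi-identical copy of the sentence, even when its character offsets have shifted.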
Sentence Map
- the first step to support the mapping process is to construct a sentence map between
sentence list 1 and 2
- this is a somewhat tricky/fuzzy/heuristic alignment job (on the shoulders of Pradeep :-)
that uses the relative order of sentences and the tokens they contain to come up with
a reasonable mapping; this can use various constraints to help in the search, for example,
if we have a good match from s1_10 to s2_12, then s1_11 should only consider sentences
close to and following s2_12, etc.
- the map should look like this:
- for each sentence s1_i in list 1 and s2_k in list 2 we either have:
mapsTo(s1_i, null) // no corresponding sentence for s1_i
mapsTo(s1_i, s2_k) // s1_i is quasi-identical with s2_k
mapsTo(null, s2_k) // no corresponding sentence for s2_k
- these are more complex, general cases in case sentences fully or partially overlap:
mapsTo(s1_i, s2_k[s, e]) // s1_i is quasi-identical with the [s, e] region in s2_k (i.e., s2_k subsumes s1_i)
mapsTo(s1_i[s, e], s2_k) // s2_k is quasi-identical with the [s, e] region in s1_i (i.e., s1_i subsumes s2_k)
mapsTo(s1_i[s1, e1], s2_k[s2, e2]) // the s1_i[s1, e1] region is quasi-identical with region s2_k[s2, e2] (i.e., s1_i overlaps s2_k)
- each sentence maps to at most one sentence in the other list
- sentences might not have a reasonable match at all
- for a large number of cases we hope for a simple 1-1 or no match
- if there are multiple possible mappings for a sentence, a single best one has to be chosen
- since sentence segmentation might fail in different ways in the two systems, it is possible
that one combines sentences that are broken apart in the other, or both fail in different ways
which would lead to overlapping sentences. We might ignore these cases initially and simply
map them to null for now; but the region mappings sketched above could capture these situations
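To make the ordering constraint concrete, here is a deliberately simple greedy aligner in Python. It is not the actual matcher: the similarity measure (difflib's character-level ratio), the threshold, and the look-ahead window are all assumptions chosen for illustration, and it only produces whole-sentence matches or null.

```python
from difflib import SequenceMatcher


def align_sentences(list1, list2, threshold=0.8, window=3):
    """Greedy, order-preserving alignment of two lists of sentence texts.

    Returns (i, k) pairs, one per sentence of list1; k is None when s1_i has
    no sufficiently similar candidate. Sentences of list2 that never appear
    as k are unmatched too.
    """
    pairs = []
    next_k = 0  # candidates must follow the previous match
    for i, s1 in enumerate(list1):
        best_k, best_score = None, threshold
        for k in range(next_k, min(next_k + window, len(list2))):
            score = SequenceMatcher(None, s1, list2[k]).ratio()
            if score >= best_score:
                best_k, best_score = k, score
        pairs.append((i, best_k))
        if best_k is not None:
            next_k = best_k + 1
    return pairs


list1 = ["A binds B.", "C phosphorylates D.", "Unrelated text here."]
list2 = ["Title line", "A binds B .", "C phosphorylates D."]
pairs = align_sentences(list1, list2)
print(pairs)
```

A greedy pass like this cannot recover from an early wrong match and ignores split or merged sentences, which is why the real alignment job is described above as fuzzy and heuristic.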
Map Representation
PROBLEM: it is possible for a sentence to overlap with multiple others, which in turn could overlap
with other sentences. To account for this case, each entry in the map would have to be multi-valued,
allowing multiple target sentence ranges. But let's wait on that for now...
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Pradeep-Matcher", "organization": "CMU"},
"frame-id": "smap-PMC3847091-cmu-r4-1",
"frame-type": "sentence-map",
// list1-meta, list2-meta: mandatory, contain meta info for the systems that generated the sentence lists;
// this is necessary, so we can identify a specific map if there is more than one, e.g., nxml-to-medscan
"list1-meta": {"object-type": "meta-info", "component": "nxml2fries", "organization": "UAZ"},
"list2-meta": {"object-type": "meta-info", "component": "MedScan", "organization": "CMU"},
// map: mandatory, maps each sentence s1_i in list 1 onto a corresponding sentence s2_k in list 2
// - if s1_i doesn't have a corresponding sentence, there won't be an entry in this map with key s1_i
// - if s2_k doesn't have a corresponding sentence, there won't be an entry in this map whose "to" value is s2_k
// - if only subregions of the two sentences map, s1, e1, s2, and e2 can be used to delineate them
// - if any of s1, e1, s2, and e2 are absent, they default to the corresponding endpoint of the sentence
"map": [{"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-1", "to": "sent-PMC3847091-cmu-r4-1"},
{"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-2", "to": "sent-PMC3847091-cmu-r4-2"},
// more complex cases for illustration, we might want to ignore them for now
{"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-3", "to": "sent-PMC3847091-cmu-r4-3", "s1": 10, "e1": 100},
{"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-4", "to": "sent-PMC3847091-cmu-r4-4", "s2": 15, "e2": 95},
{"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-5", "to": "sent-PMC3847091-cmu-r4-6", "s1": 5, "e1": 50, "s2": 20, "e2": 68},
.....] }
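A consumer will usually turn the "map" list into a lookup keyed by the "from" sentence and fill in the region defaults. The Python sketch below shows one way to do that. The function name and the short sentence IDs are hypothetical, and treating e1/e2 as half-open end offsets (defaulting to the sentence length) is an assumption, since the spec does not fix the interval semantics for map regions.

```python
def build_sentence_lookup(sentence_map, sentence_texts):
    """Index a sentence-map frame by its "from" sentence ID.

    sentence_texts maps sentence IDs to their text; it is needed so that
    absent s1/e1/s2/e2 can default to the endpoints of the sentence.
    """
    lookup = {}
    for m in sentence_map["map"]:
        src, dst = m["from"], m["to"]
        lookup[src] = {
            "to": dst,
            "s1": m.get("s1", 0),
            "e1": m.get("e1", len(sentence_texts[src])),
            "s2": m.get("s2", 0),
            "e2": m.get("e2", len(sentence_texts[dst])),
        }
    return lookup


sentence_map = {
    "object-type": "frame",
    "frame-type": "sentence-map",
    "map": [
        {"object-type": "mapping", "from": "a1", "to": "b1"},
        {"object-type": "mapping", "from": "a2", "to": "b2", "s1": 9, "e1": 17},
    ],
}
sentence_texts = {"a1": "Short one.", "b1": "Short one .",
                  "a2": "A longer sentence here.", "b2": "sentence"}

lookup = build_sentence_lookup(sentence_map, sentence_texts)
print(lookup["a2"])
```

Sentences without a counterpart simply have no key in the lookup, which matches the convention that unmapped sentences have no entry in the map.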
Mapping Operations
All we need to represent in the data is the following
- contextualized offsets relative to their respective source sentences as sketched in the data representation above
- the sentence list-to-list map(s) described above
- no actual translation of offsets needs to be done on the data
A data consumer can then perform the following operations (assuming only full quasi-identity, no ranges for now):
- simple case: full identity between text spans span1 and span2
mapsTo(reference(start-pos(span1)), reference(start-pos(span2))) and // establishes that they are from the same, quasi-identical sentence
context-start(start-pos(span1)) = context-start(start-pos(span2)) and // establishes that the start context token and occurrence numbers are the same
context-end(end-pos(span1)) = context-end(end-pos(span2)) // establishes that the end context token and occurrence numbers are the same
- check overlap between text spans span1 and span2; findStart and findEnd are simple
string search functions that find the position of the nth occurrence of a context token
mapsTo(reference(start-pos(span1)), reference(start-pos(span2))) // establishes that they are from the same, quasi-identical sentence
mapped-s1 = findStart(context-start(start-pos(span1)), reference(start-pos(span2))) // translates span1 start into span2 land
mapped-e1 = findEnd(context-end(end-pos(span1)), reference(end-pos(span2))) // translates span1 end into span2 land
check overlap between span2 start/end offsets and [mapped-s1, mapped-e1]
- subrange to subrange mappings are similar, but now the n's in the context-start/end
tokens need to be recomputed first relative to the given subranges; after that,
everything else remains the same
- mapping failures occur if:
- there is no corresponding mapped sentence
- a token in a context-start/end does not occur in the mapped sentence (e.g., due to character set differences,
Greek character normalizations, etc.)
- nevertheless, we can still project a source start/end interval onto a target string purely based on length
transformations and then see if the projected interval overlaps with the target
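The two consumer operations above (full identity and overlap) can be sketched in Python as follows. This covers only the full quasi-identity case without subranges. The function names, the plain dict standing in for mapsTo, the make_span helper and the sample sentences are all assumptions for illustration; offsets are treated as half-open.

```python
def find_nth(text, needle, n):
    """Start offset of the n-th (1-based) match of needle in text, or -1."""
    pos = -1
    for _ in range(n):
        pos = text.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos


def find_start(context, text):
    needle, _, n = context.rpartition("/")
    return find_nth(text, needle, int(n))


def find_end(context, text):
    needle, _, n = context.rpartition("/")
    start = find_nth(text, needle, int(n))
    return -1 if start == -1 else start + len(needle)


def same_span(span1, span2, maps_to):
    """Simple case: mapped sentences plus identical start and end contexts."""
    cs1 = span1["start-pos"].get("context-start")
    ce1 = span1["end-pos"].get("context-end")
    return (maps_to.get(span1["start-pos"]["reference"]) == span2["start-pos"]["reference"]
            and cs1 is not None and cs1 == span2["start-pos"].get("context-start")
            and ce1 is not None and ce1 == span2["end-pos"].get("context-end"))


def spans_overlap(span1, span2, maps_to, target_text):
    """True or False; None signals a mapping failure."""
    if maps_to.get(span1["start-pos"]["reference"]) != span2["start-pos"]["reference"]:
        return None  # no corresponding mapped sentence
    mapped_s1 = find_start(span1["start-pos"]["context-start"], target_text)
    mapped_e1 = find_end(span1["end-pos"]["context-end"], target_text)
    if mapped_s1 == -1 or mapped_e1 == -1:
        return None  # context token does not occur in the mapped sentence
    return (mapped_s1 < span2["end-pos"]["offset"]
            and span2["start-pos"]["offset"] < mapped_e1)


def make_span(ref, text, start_token, end_token):
    """Example-only helper: span from the first start_token to the first end_token."""
    return {"start-pos": {"object-type": "relative-pos", "reference": ref,
                          "offset": text.index(start_token),
                          "context-start": start_token + "/1"},
            "end-pos": {"object-type": "relative-pos", "reference": ref,
                        "offset": text.index(end_token) + len(end_token),
                        "context-end": end_token + "/1"}}


s1_text = "We observed phosphorylation of HuR by JAK3."
s2_text = "  We observed phosphorylation of HuR by JAK3."  # shifted by two characters
maps_to = {"s1": "s2"}

hur_1 = make_span("s1", s1_text, "HuR", "HuR")
hur_2 = make_span("s2", s2_text, "HuR", "HuR")
event_2 = make_span("s2", s2_text, "phosphorylation", "HuR")
jak3_2 = make_span("s2", s2_text, "JAK3", "JAK3")

print(same_span(hur_1, hur_2, maps_to))
print(spans_overlap(hur_1, event_2, maps_to, s2_text))
```

Note that the raw offsets of hur_1 and hur_2 differ, yet the spans are recognized as identical because only the sentence map and the context tokens are compared.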
UAZ Old and New Representations
-------------------------------
OLD: no passages and sentences
NEW: passage and sentence objects as described above, combined in a separate, single annotation file
OLD events:
{ "submitter":"UAZ",
"type":"positive_regulation",
"doc_id":"PMC3902907",
"reading_ended":"2015-05-13 04:58:48",
"controlled":"2",
"negative_information":false,
"event_id":"1",
"reader_type":"machine",
"reading_started":"2015-05-13 04:58:48",
"controller":{"namespace":"uniprotkb", "text":"JAK3", "type":"protein", "id":"P52333"},
"passage_id":"0",
"evidence":"phosphorylation of HuR by JAK3",
"offsets":[9, 39]},
{"submitter":"UAZ",
"participants":[{"namespace":"uniprotkb", "text":"HuR", "type":"protein", "id":"Q15717"}],
"type":"phosphorylation",
"reading_ended":"2015-05-13 04:58:48",
"doc_id":"PMC3902907",
"negative_information":false,
"event_id":"2",
"reader_type":"machine",
"reading_started":"2015-05-13 04:58:48",
"passage_id":"0",
"offsets":[9, 31],
"evidence":"phosphorylation of HuR"}
NEW events:
- meta-info goes into annotation-file meta info
- collect reading time on a per-document basis if possible and stick it into annotation-file meta info,
for example:
File PMC3847091.uaz.events.json:
{ "object-type": "frame-collection",
"object-meta": {"object-type": "meta-info",
"component": "REACH",
"component-type": "machine",
"organization": "UAZ",
"doc-id": "PMC3847091",
"processing-start": "2015-05-13 04:58:48",
"processing-end": "2015-05-13 04:59:30",
<anything else you deem important or interesting> },
"frames": [ <entity mention and event frames as shown below> ] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3847091-uaz-r1-0-0-0",
"frame-type": "entity-mention",
"index": 0, // optional, mention number in this sentence
"sentence": "sent-PMC3847091-uaz-r1-0-0", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3847091-uaz-r1-0-0",
"offset": 28,
"context-start": "HuR/1"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3847091-uaz-r1-0-0",
"offset": 31,
"context-end": "HuR/1"},
"text": "HuR",
"type": "protein",
"xrefs": [{"object-type": "db-reference", "namespace": "uniprotkb", "id": "Q15717"}] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3847091-uaz-r1-0-0-1",
"frame-type": "entity-mention",
"index": 1, // optional, mention number in this sentence
"sentence": "sent-PMC3847091-uaz-r1-0-0", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3847091-uaz-r1-0-0",
"offset": 35, "context-start": "JAK3/1"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3847091-uaz-r1-0-0",
"offset": 39,
"context-end": "JAK3/1"},
"text": "JAK3",
"type": "protein",
"xrefs": [{"object-type": "db-reference", "namespace": "uniprotkb", "id": "P52333"}] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
"frame-id": "evem-PMC3847091-uaz-r1-0-0-0",
"frame-type": "event-mention",
"index": 0, // optional
"sentence": "sent-PMC3847091-uaz-r1-0-0", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 9, "context-start": "phosphorylation/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 31, "context-end": "HuR/1"},
"text": "phosphorylation of HuR",
"type": "phosphorylation",
"arguments": [{"object-type": "argument",
"type": "participant",
"index": 0, // optional
"text": "HuR", // optional
"arg": "ment-PMC3847091-uaz-r1-0-0-0"}],
"polarity": "positive" }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
"frame-id": "evem-PMC3847091-uaz-r1-0-0-1",
"frame-type": "event-mention",
"index": 0, // optional
"sentence": "sent-PMC3847091-uaz-r1-0-0", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 9, "context-start": "phosphorylation/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 39, "context-end": "JAK3/1"},
"text": "phosphorylation of HuR by JAK3",
"type": "positive-regulation",
"arguments": [{"object-type": "argument",
"type": "controller",
"index": 0, // optional
"text": "JAK3", // optional
"arg": "ment-PMC3847091-uaz-r1-0-0-1"},
{"object-type": "argument",
"type": "controlled",
"index": 1, // optional
"text": "phosphorylation of HuR", // optional
"arg": "evem-PMC3847091-uaz-r1-0-0-0"}],
"polarity": "positive" }
This is what a similar binding event would look like:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
"frame-id": "evem-PMC3847091-uaz-r1-0-10-1",
"frame-type": "event-mention",
"index": 0, // optional
"sentence": "sent-PMC3847091-uaz-r1-0-10", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-10", "offset": 9, "context-start": "binding/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-10", "offset": 39, "context-end": "JAK3/1"},
"text": "binding of HuR to JAK3",
"type": "complex-assembly",
"arguments": [{"object-type": "argument",
"type": "participant",
"index": 0, // optional
"text": "HuR", // optional
"arg": "ment-PMC3847091-uaz-r1-0-0-1"},
{"object-type": "argument",
"type": "participant",
"index": 1, // optional
"text": "JAK3", // optional
"arg": "ment-PMC3847091-uaz-r1-0-0-0"}],
"polarity": "positive" }
OLD Phosphorylation at site:
{ "submitter":"UAZ",
"participants":[{"namespace":"uniprotkb", "text":"HuR", "type":"protein", "id":"Q15717"}],
"subfields":{"site":"tyrosine residues"},
"type":"phosphorylation",
"doc_id":"PMC3902907",
"reading_ended":"2015-05-13 04:58:48",
"negative_information":false,
"event_id":"65",
"reader_type":"machine",
"reading_started":"2015-05-13 04:58:48",
"passage_id":"47",
"evidence":"phosphorylates HuR at tyrosine residues",
"offsets":[1554, 1593]}
NEW Phosphorylation at site (we promote the site to a mention, so it can become an event argument):
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3902907-uaz-r1-47-1-2",
"frame-type": "entity-mention",
"index": 1, // optional, mention number in this sentence
"sentence": "sent-PMC3902907-uaz-r1-47-1", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-uaz-r1-47-1",
"offset": 154, "context-start": "tyrosine/1"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-uaz-r1-47-1",
"offset": 193,
"context-end": "residues/1"},
"text": "tyrosine residues",
"type": "site" }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
"frame-id": "evem-PMC3902907-uaz-r1-47-1-1",
"frame-type": "event-mention",
"index": 0, // optional
"sentence": "sent-PMC3902907-uaz-r1-47-1", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-uaz-r1-47-1", "offset": 132, "context-start": "phosphorylates/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-uaz-r1-47-1", "offset": 193, "context-end": "residues/1"},
"text": "phosphorylates HuR at tyrosine residues",
"type": "phosphorylation",
"arguments": [{"object-type": "argument",
"type": "participant",
"index": 0, // optional
"text": "HuR", // optional
"arg": "ment-PMC3902907-uaz-r1-47-1-1"}, // not shown
{"object-type": "argument",
"type": "at-location",
"index": 1, // optional
"text": "tyrosine residues", // optional
"arg": "ment-PMC3902907-uaz-r1-47-1-2"}],
"polarity": "positive" }
OLD translocation:
{ "submitter":"UAZ",
"participants":[{"namespace":"uniprotkb", "text":"ASPP2", "type":"protein", "id":"Q13625"}],
"subfields":{
"from":{"namespace":"go", "text":"plasma membrane", "type":"cellular_component", "id":"GO:0005886"},
"to":{"namespace":"go", "text":"nucleus", "type":"cellular_component", "id":"GO:0005634"}},
"type":"translocation",
"doc_id":"PMC3847091",
"reading_ended":"2015-05-13 04:57:54",
"negative_information":false,
"event_id":"80",
"reader_type":"machine",
"reading_started":"2015-05-13 04:57:54",
"passage_id":"37",
"evidence":"ASPP2 translocation from the plasma membrane to the cytosol and nucleus",
"offsets":[61, 132]}
NEW translocation (requires introduction of argument entity mentions - not shown here):
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
"frame-id": "evem-PMC3847091-uaz-r1-37-1-1",
"frame-type": "event-mention",
"index": 0, // optional
"sentence:" "sent-PMC3847091-uaz-r1-37-1", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-37-1", "offset": 61, "context-start": "ASPP2/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-37-1", "offset": 132, "context-end": "nucleus/1"},
"text": "ASPP2 translocation from the plasma membrane to the cytosol and nucleus",
"type": "translocation",
"arguments": [{"object-type": "argument",
"type": "participant",
"index": 0, // optional
"text": "ASPP2", // optional
"arg": "ment-PMC3847091-uaz-r1-37-1-1"}, // not shown
{"object-type": "argument",
"type": "from-location",
"index": 1, // optional
"text": "plasma membrane", // optional
"arg": "ment-PMC3847091-uaz-r1-37-1-2"} // not shown
{"object-type": "argument",
"type": "to-location",
"index": 2, // optional
"text": "cytosol", // optional
"arg": "ment-PMC3847091-uaz-r1-37-1-3"} // not shown
{"object-type": "argument",
"type": "to-location",
"index": 3, // optional
"text": "nucleus", // optional
"arg": "ment-PMC3847091-uaz-r1-37-1-4"}], // not shown
"polarity": "positive" }
MedScan Examples of Old and New Representations
-----------------------------------------------
OLD: no sentences
NEW: sentence objects as described above combined in a separate, single annotation file
OLD entity mentions and relations:
{ "bookkeeping": {"CMU-offsets": [4488, 4497], "object-type": "bookkeeping"},
"db-xrefs": [{"id": "8536", "namespace": "pubchem", "object-type": "xref"}],
"entity-type": "MedScan_DRUG",
"frame-id": 77,
"frame-type": "entity-mention",
"object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "MedScan"},
"object-type": "frame",
"ref-sentence": 798,
"text": "menadione" },
{ "bookkeeping": {"CMU-offsets": [4525, 4562], "object-type": "bookkeeping"},
"db-xrefs": [{"id": "3718", "namespace": "MEDSCAN:urn:agi-llid", "object-type": "xref"}],
"entity-type": "MedScan_GENE_PROTEIN",
"frame-id": 79,
"frame-type": "entity-mention",
"object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "MedScan"},
"object-type": "frame",
"ref-sentence": 798,
"text": "tyrosine kinase Janus kinase 3 (JAK3)" }
{ "context": "menadione, a drug that activated the tyrosine kinase Janus kinase 3 (JAK3)",
"frame-id": 623,
"frame-type": "relation-mention",
"obj": 79,
"object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "MedScan"},
"object-type": "frame",
"relation-type": "MedScan_Activation",
"subj": 77 }
NEW entity mentions and relations:
- meta-info goes into annotation-file meta info
- collect reading time on a per-document basis if possible and stick it into annotation-file meta info,
for example:
File PMC3902907.cmu.medscan.json:
{ "object-type": "frame-collection",
"object-meta": {"object-type": "meta-info",
"component": "MedScan",
"component-type": "machine",
"organization": "CMU",
"doc-id": "PMC3902907",
"processing-start": "2015-05-13 04:58:48",
"processing-end": "2015-05-13 04:59:30",
<anything else you deem important or interesting> },
"frames": [ <entity mention and relation frames as shown below> ] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "MedScan"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3902907-cmu-medscan-r1-98-1",
"frame-type": "entity-mention",
"index": 1, // optional, mention number in this sentence
"sentence:" "sent-PMC3902907-cmu-medscan-r1-98", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 22,
"context-start": "menadione/1"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 31,
"context-end": "menadione/1"},
"text": "menadione",
"type": "DRUG",
"xrefs": [{"object-type": "db-reference", "namespace": "pubchem", "id": "8536"}] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "MedScan"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3902907-cmu-medscan-r1-98-2",
"frame-type": "entity-mention",
"index": 2, // optional, mention number in this sentence
"sentence:" "sent-PMC3902907-cmu-medscan-r1-98", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 59,
"context-start": "tyrosine/1"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 96,
"context-end": "(JAK3)/1"},
"text": "tyrosine kinase Janus kinase 3 (JAK3)",
"type": "GENE_PROTEIN",
"xrefs": [{"object-type": "db-reference", "namespace": "MEDSCAN:urn:agi-llid", "id": "3718"}] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "MedScan"}, // optional
"frame-id": "relm-PMC3902907-cmu-medscan-r1-98-0",
"frame-type": "relation-mention",
"index": 0, // optional
"sentence:" "sent-PMC3902907-cmu-medscan-r1-98", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 22, "context-start": "menadione/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 96, "context-end": "(JAK3)/1"},
"text": "menadione, a drug that activated the tyrosine kinase Janus kinase 3 (JAK3)",
"type": "Activation",
"arguments": [{"object-type": "argument",
"type": "subj",
"index": 0, // optional
"text": "menadione", // optional
"arg": "ment-PMC3902907-cmu-medscan-r1-98-1"},
{"object-type": "argument",
"type": "obj",
"index": 1, // optional
"text": "tyrosine kinase Janus kinase 3 (JAK3)", // optional
"arg": "ment-PMC3902907-cmu-medscan-r1-98-2"}],
"polarity": "positive" }
Jun's Events, Old and New
-------------------------
OLD:
{ "bookkeeping": {"CMU-offsets": [4506, 4510], "object-type": "bookkeeping"},
"db-xrefs": [],
"frame-id": 78,
"frame-type": "entity-mention",
"object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "Jun-System"},
"object-type": "frame",
"ref-sentence": 798,
"text": "that" },
{ "bookkeeping": {"CMU-offsets": [4547, 4553], "object-type": "bookkeeping"},
"db-xrefs": [],
"frame-id": 80,
"frame-type": "entity-mention",
"object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "Jun-System"},
"object-type": "frame",
"ref-sentence": 798,
"text": "kinase" },
{ "arg0": [78],
"arg1": [80],
"bookkeeping": {"CMU-offsets": [4511, 4520], "object-type": "bookkeeping"},
"context": "that activated the tyrosine kinase Janus kinase",
"event-type": "activate",
"frame-id": 567,
"frame-type": "event-mention",
"object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "Jun-System"},
"object-type": "frame",
"ref-sentence": 798,
"text": "activated" }
NEW:
File PMC3902907.cmu.junsys.json:
{ "object-type": "frame-collection",
"object-meta": {"object-type": "meta-info",
"component": "Jun-System",
"component-type": "machine",
"organization": "CMU",
"doc-id": "PMC3902907",
"processing-start": "2015-05-13 04:58:48",
"processing-end": "2015-05-13 04:59:30",
<anything else you deem important or interesting> },
"frames": [ <entity and event frames as shown below> ] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3902907-cmu-junsys-r1-98-1",
"frame-type": "entity-mention",
"index": 1, // optional, mention number in this sentence
"sentence:" "sent-PMC3902907-cmu-medscan-r1-98", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 22,
"context-start": "that/1"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 26,
"context-end": "that/1"},
"text": "that",
"type": "OTHER" }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits fields from frame collection
"frame-id": "ment-PMC3902907-cmu-junsys-r1-98-2",
"frame-type": "entity-mention",
"index": 2, // optional, mention number in this sentence
"sentence:" "sent-PMC3902907-cmu-medscan-r1-98", // optional, from passage/sentence file
"start-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 63,
"context-start": "kinase/2"},
"end-pos": {"object-type": "relative-pos",
"reference": "sent-PMC3902907-cmu-medscan-r1-98",
"offset": 69,
"context-end": "kinase/2"},
"text": "kinase",
"type": "OTHER",
// Jun doesn't seem to have these for his mentions, but that's how they would be represented:
"xrefs": [{"object-type": "db-reference", "namespace": "MEDSCAN:urn:agi-llid", "id": "3718"}] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional
"frame-id": "evem-PMC3902907-cmu-junsys-r1-98-0",
"frame-type": "event-mention",
"index": 0, // optional
"sentence:" "sent-PMC3902907-cmu-medscan-r1-98", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 22, "context-start": "that/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 69, "context-end": "kinase/2"},
"text": "that activated the tyrosine kinase Janus kinase",
"type": "activate",
"arguments": [{"object-type": "argument",
"type": "arg0",
"index": 0, // optional
"text": "that", // optional
"arg": "ment-PMC3902907-cmu-junsys-r1-98-1"},
{"object-type": "argument",
"type": "arg1",
"index": 1, // optional
"text": "kinase", // optional
"arg": "ment-PMC3902907-cmu-junsys-r1-98-2"}],
"polarity": "positive" }
NEW entities and events:
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits from frame collection
"frame-id": "ent-PMC3847091-cmu-junsys-r1-4",
"frame-type": "entity",
"index": 4,
"members": ["ment-PMC3902907-cmu-junsys-r1-98-1", ...] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits from frame collection
"frame-id": "eve-PMC3847091-cmu-junsys-r1-3",
"frame-type": "event",
"index": 3,
"members": ["evem-PMC3902907-cmu-junsys-r1-98-0", ...] }
Pradeep's Measurements, Old and New
-----------------------------------
OLD:
{ "assay-molecule": ["mRNA"],
"binds-constituent1": ["GAPDH"],
"binds-constituent2": ["SIRT1 mRNA", "IP"],
"cell-type": ["HeLa cells"],
"frame-type": "implication",
"implication-type": "binds",
"object-meta": {"object-type": "meta-info", "organization": "CMU"},
"object-type": "frame",
"process": ["RIP ( IP"],
"ref-sentence": 871,
"transfection-molecule": ["HuR IP"] }
NEW - Suggested:
Pradeep: I wasn't sure whether you want to introduce an entity mention for each of your arguments, so I didn't do this
for now and instead added a text-argument object that simply points to the text string. Eventually, we might want to
point to full mentions, because that would allow us to store additional offset/type/db/etc. information.
File PMC3902907.cmu.pradsys.json:
{ "object-type": "frame-collection",
"object-meta": {"object-type": "meta-info",
"component": "Pradeep-System",
"component-type": "machine",
"organization": "CMU",
"doc-id": "PMC3902907",
"processing-start": "2015-05-13 04:58:48",
"processing-end": "2015-05-13 04:59:30",
<anything else you deem important or interesting> },
"frames": [ <implication frames as shown below> ] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": “Pradeep-System"}, // optional, inherits from frame-collection
"frame-id": “imp-PMC3902907-cmu-pradsys-r1-71-0",
"frame-type": “implication", // or measurement?
"index": 0, // optional
"sentence:" "sent-PMC3902907-cmu-medscan-r1-71", // optional
"start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-71", "offset": 0, "context-start": “(/1"},
"end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-71", "offset": 380, "context-end": “IP/4"},
"text": "( A , B ) After treatment of HeLa cells with arsenite and/or menadione as explained in Figure 1 , RIP ( IP followed by RT-qPCR ) analysis was used to measure the levels of enrichment of SIRT1 mRNA ( A ) and VHL mRNA ( B ) associated with HuR ; the samples were normalized using GAPDH mRNA , and the data represented as enrichment of each mRNA in HuR IP were compared with IgG IP",
"type": "binds",
"arguments": [{"object-type": "text-argument", "type": "assay-molecule", "arg": “mRNA"},
{"object-type": "text-argument", "type": "binds-constituent1", "arg": “GAPDH"},
{"object-type": "text-argument", "type": "binds-constituent2", "arg": “SIRT1 mRNA"},
{"object-type": "text-argument", "type": "binds-constituent2", "arg": “IP"},
{"object-type": "text-argument", "type": "cell-type", "arg": “HeLa cells"},
{"object-type": "text-argument", "type": "process", "arg": “RIP ( IP"},
{"object-type": "text-argument", "type": "transfection-molecule", "arg": “HuR IP"}]}
Nicolas' Epistemics, Old and New
--------------------------------
OLD:
{ "frame-id": 1012,
"frame-type": "epistemics",
"object-meta": {
"object-type": "meta-info",
"organization": "CMU"
},
"object-type": "frame",
"ref-sentence": 886,
"value": 0.6 }
NEW:
File PMC3902907.cmu.nicosys.json:
{ "object-type": "frame-collection",
"object-meta": {"object-type": "meta-info",
"component": "Nicolas-System",
"component-type": "machine",
"organization": "CMU",
"doc-id": "PMC3902907",
"processing-start": "2015-05-13 04:58:48",
"processing-end": "2015-05-13 04:59:30",
<anything else you deem important or interesting> },
"frames": [ <epistemics frames as shown below> ] }
{ "object-type": "frame",
"object-meta": {"object-type": "meta-info", "component": "Nicolas-System"}, // optional, inherits from frame-collection
"frame-id": "epi-PMC3847091-cmu-nicosys-r1-11-0",
"frame-type": "epistemics",
"index": 0, // optional
"argument": "sent-PMC3847091-cmu-medscan-r1-11",
"text": "In sum , these results indicate that HuR tyrosine phosphorylation at Y200 , which excludes HuR from SGs , also promotes the dissociation of HuR from target transcripts ( SIRT1 mRNA and VHL mRNA ) , or perhaps mobilizes HuR-SIRT1 mRNA and HuR-VHL mRNA complexes away from SGs , accelerating their degradation ( Figure 7 ) .", // optional
"value": 0.6,
"status": "hypothesis" }