doc/fries-data-representation-spec.txt

Version: 0.7
Authors: Hans, Ed, Mihai, Pradeep


General
-------

Goal:
- create a light-weight, flexible, explicit yet readable JSON-based representation for textual objects and annotations
- use uniform syntactic representations as much as possible for offsets, mentions, arguments, but allow variation
  on type and argument names as well as any additional frame slots somebody needs to capture relevant semantic info
- the result should be human readable and understandable without tool support, that is, pulling a couple of files
  into a text editor should be enough to understand what they represent and not require too much jumping around 
- the representation can be redundant to increase readability, for example, we might add a text string or sentence
  pointer even though those could be computed from offset information

Conventions:
- key names are hyphenated (not underscored)
- type names, similarly, are hyphenated (e.g., "complex-assembly"), unless they correspond to some ontology
- booleans start with "is-"
- objects are of two types:
  - frames with unique IDs for things that need to be pointed to by other frames (basically, every
    annotation object becomes a frame)
  - simple embedded structured objects
- each object has an object-type for easier translation and to allow type heterogeneity
- each frame has a meta-info (which can be (partially) inherited from the file meta object)
- annotation files can contain any combination of frames, but:
  - we'll try to modularize as well as possible, keeping only closely related things in the same files
  - frames that need to be referenced by others (e.g., passages, sentences) will generally go into their own file
  - files should follow some logical naming convention describing the types of frames in them
- we will use pseudo-globally unique IDs for frames, so files can easily be combined without having to worry
  about clashing IDs
  - there is no mandatory convention for unique IDs, but the suggestion is to use some scheme like this:
       <type-prefix>-<doc-id>-<org>-<run-id>-<frame-id>, for example: "pass-PMC3847091-uaz-r13-11"
    how run and frame IDs are constructed is up to the data producer, the only suggestion is to keep things
    short, so that the resulting files are still readable, yet unique
  - one can of course use real GUIDs, however, those conflict with the readability goal
- there is no requirement to keep objects in a particular order in an annotation file, however, if one has control
  over the order in which JSON objects and slots are generated, a logical order for easy readability is preferred
- indexes in the various objects indicate some relative ordering, e.g., the sentences in a passage, the passages in
  a document, the mentions in a sentence, etc.; they should be monotonically increasing but they are not necessarily
  contiguous; they are generally 0-based, they are not always mandatory (e.g., for mentions, events, etc.) and what
  they are relative to depends on the frame type and possibly data producer (except for sentences and passages).
  They are primarily useful for generating readable IDs
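
The suggested ID scheme above can be sketched as follows. This is only an illustration of the
<type-prefix>-<doc-id>-<org>-<run-id>-<frame-id> suggestion; the helper name make_frame_id is
hypothetical, and treating the frame-id part as a hyphenated list of indexes is an assumption
based on the example IDs used in this document.

```python
def make_frame_id(type_prefix, doc_id, org, run_id, *indexes):
    """Join the ID components with hyphens (hypothetical helper, not part of the spec)."""
    parts = [type_prefix, doc_id, org, run_id] + [str(i) for i in indexes]
    return "-".join(parts)


# Passage 11 of the document, and sentence 2 within that passage.
passage_id = make_frame_id("pass", "PMC3847091", "uaz", "r13", 11)
sentence_id = make_frame_id("sent", "PMC3847091", "uaz", "r13", 11, 2)
print(passage_id)
print(sentence_id)
```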

TO DO, Issues: 
+ not sure if JSON strings can contain newlines, answer: no, they have to be encoded via \n 
+ define a compound-mention frame to handle the special args variant Mihai uses to deal with complexes


Object Representation
---------------------

Annotation files:

{ "object-type": "frame-collection",
  "object-meta": {"object-type": "meta-info", "component": "REACH", "organization": "UAZ", "doc-id": "PMC3847091", "processing-start": "...", "processing-end": "...", ....},
  "frames": [ .... ] }
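
A consumer might resolve the (partially) inherited meta-info of a frame roughly as sketched
below. The function name effective_meta is hypothetical, and the rule that a frame's own
fields override the file-level fields is an assumption; the spec only says that meta-info
can be inherited from the file meta object.

```python
def effective_meta(collection, frame):
    """Merge file-level meta-info with the frame's own meta-info.

    Assumption: fields given on the frame win over inherited ones.
    """
    merged = dict(collection.get("object-meta", {}))
    merged.update(frame.get("object-meta", {}))
    return merged


collection = {
    "object-type": "frame-collection",
    "object-meta": {"object-type": "meta-info", "component": "REACH",
                    "organization": "UAZ", "doc-id": "PMC3847091"},
    "frames": [
        {"object-type": "frame",
         "object-meta": {"object-type": "meta-info", "component": "nxml2fries"},
         "frame-id": "pass-PMC3847091-uaz-r13-11",
         "frame-type": "passage"},
    ],
}

meta = effective_meta(collection, collection["frames"][0])
print(meta)
```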


Passages:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "nxml2fries"},
  "frame-id":    "pass-PMC3847091-uaz-r13-11",
  "frame-type": "passage",
  "index": 11,
  "section-id": "s1",
  "section-name": null,
  "is-title": false,
  "text": "Here we show that ASPP2 is a novel substrate of RAS/MAPK. Phosphorylation of ASPP2 by MAPK is required for the RAS-induced translocation of ASPP2, which results in the increased binding to p53. Consequently, the pro-apoptotic activity of ASPP2 is increased by the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation mutant fails to do so. Thus phosphorylation of ASPP2 by RAS/MAPK pathway provides a novel link between RAS and p53 in regulating apoptosis. " }


Sentences:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "CoreNLP"},
  "frame-id":    "sent-PMC3847091-uaz-r13-11-2",
  "frame-type": "sentence",

  // passage: mandatory, id of passage this sentence is part of
  "passage": "pass-PMC3847091-uaz-r13-11",

  // index: optional, zero-based, passage-local sentence number for this sentence;
  // useful for generating ID postfixes, e.g., ...<pass-index>-<sent-index>...
  "index": 2,

  // start/end-pos: mandatory, absolute or relative text positions to delineate the text string
  // of this sentence in the passage or document text - more details on offset fields below
  "start-pos": {"object-type": "relative-pos", "reference": "pass-PMC3847091-uaz-r13-11", "offset": 194},
  "end-pos": {"object-type": "relative-pos", "reference": "pass-PMC3847091-uaz-r13-11", "offset": 343, "is-closed": false},

  // text: mandatory, surface text string of this sentence, possibly normalized; 
  // the texts of entity and other mentions must be exact substrings of sentence texts
  "text": "Consequently, the pro-apoptotic activity of ASPP2 is increased by the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation mutant fails to do so." }


Entity Mentions:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"},
  "frame-id":    "ment-PMC3847091-uaz-r13-11-2-4",
  "frame-type": "entity-mention",

  // index: optional, sentence-local number for this mention from this component,
  // useful for generating ID postfixes, e.g., ...<sent-index>-<ment-index>
  "index": 4,

  // sentence: optional, sentence containing this mention, can also be determined from offsets
  "sentence": "sent-PMC3847091-uaz-r13-11-2",

  // start/end-pos: mandatory, absolute or relative text positions to delineate the text string
  // of this mention in the sentence or document text - more details on offsets below
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 105, "context-start": "ASPP2/2"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 110, "context-end": "ASPP2/2", "is-closed": false},

  // text: mandatory, surface text string of this mention
  "text": "ASPP2",

  // type: mandatory - if at all possible, primary NER or ontology type extracted for this mention,
  // types from an agreed upon type vocabulary are preferred, but not a requirement
  "type": "protein",

  // subtype: optional - secondary NER or ontology type extracted for this mention
  "subtype": null,

  // xrefs: optional - as applicable, cross-references to other information relevant to this mention,
  // the meaning of the cross-reference depends on the type of the cross-reference object, here we
  // point to an external database ID for this protein (similar to the BioPAX representation)
  "xrefs": [{"object-type": "db-reference", "namespace": "UniProt", "id": "Q13625"}]
}
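
Since mention texts must be exact substrings of sentence texts, a consumer can sanity-check
a mention against its sentence. The sketch below assumes that offsets index Python string
characters and that only the end position can be closed; span_text is a hypothetical helper
name.

```python
def span_text(reference_text, start_pos, end_pos):
    """Return the text delimited by two position objects.

    Assumption: a closed end position points at the last character,
    so it is converted to half-open form by adding one.
    """
    start = start_pos["offset"]
    end = end_pos["offset"]
    if end_pos.get("is-closed", False):
        end += 1
    return reference_text[start:end]


sentence_text = ("Consequently, the pro-apoptotic activity of ASPP2 is increased by "
                 "the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation mutant "
                 "fails to do so.")

mention = {
    "start-pos": {"object-type": "relative-pos", "offset": 105},
    "end-pos": {"object-type": "relative-pos", "offset": 110, "is-closed": False},
    "text": "ASPP2",
}

extracted = span_text(sentence_text, mention["start-pos"], mention["end-pos"])
print(extracted == mention["text"])
```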


Event Mentions:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"},
  "frame-id":    "evem-PMC3847091-uaz-r13-11-2-1",
  "frame-type": "event-mention",
  
  // index: optional, sentence-local number for this mention from this component,
  // useful for generating ID postfixes, e.g., ...<sent-index>-<ment-index>
  "index": 1,

  // sentence: optional, sentence containing this mention, can also be determined from offsets
  "sentence": "sent-PMC3847091-uaz-r13-11-2",

  // start/end-pos: mandatory, absolute or relative text positions to delineate the text string
  // of this mention in the sentence or document text - more details on offsets below
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 105, "context-start": "ASPP2/2"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r13-11-2", "offset": 126, "context-end": "phosphorylation/1", "is-closed": false},

  // text: mandatory, narrowest surface text string of this mention that captures all aspects of this event -
  // generally the text span including the textually first and last arguments
  "text": "ASPP2 phosphorylation",

  // type: mandatory - primary type describing the kind of this event;
  // types from an agreed upon type vocabulary are preferred, but not a requirement
  // NOTE: the type/subtype representation picked here is a possible choice,
  // but the "type" could have simply been "phosphorylation"
  "type": "protein-modification",

  // subtype: optional - secondary type extracted for this event
  "subtype": "phosphorylation",

  // arguments: optional - list of textual arguments for this event
  "arguments": [{"object-type": "argument", 

                 // argument type: mandatory, describes the syntactic or semantic role of this argument,
                 // e.g., subj, obj, arg1, arg2, participant, controller, controlled, from-location, to-location, at-location, etc.
                 // multiple arguments with the same type are possible, e.g., all "participant"s
                 "type": "participant", 

                 // index: optional, an argument number in case some argument ordering needs to be conveyed
                 "index": 0, 

                 // text: optional, the text of the argument mention, for readability
                 "text": "ASPP2",

                 // arg: mandatory, pointer to the frame describing this argument, generally a text object
                 // such as an entity-mention, event mention or relation mention
                 "arg": "ment-PMC3847091-uaz-r13-11-2-4"}],

  // is-negated: optional, can be used to represent negated information, if absent the default is false.
  "is-negated": false
}


Relation Mentions:

{ "frame-type": "relation-mention",
  otherwise very similar to event mentions }


Entities:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Jun-System", "organization": "CMU"},
  "frame-id":    "ent-PMC3847091-cmu-r4-1",
  "frame-type": "entity",
  "index": 1,
  "members": ["ment-PMC3847091-uaz-r13-11-2-1", "ment-PMC3847091-uaz-r13-11-2-4", ...] }


Events:

{  "frame-type": "event",
   otherwise very similar to entities }


Epistemics:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Nicolas-System", "organization": "CMU"},
  "frame-id":   "epi-PMC3847091-cmu-r4-11-2",
  "frame-type": "epistemics",

  // argument: mandatory, the sentence, relation or event mention this epistemic valuation is about
  "argument":   "sent-PMC3847091-uaz-r13-11-2",

  // value: optional, numeric representation of this epistemic valuation
  "value":      0.6,

  // status: optional, symbolic representation of this epistemic valuation; at least one of
  // value or status need to be represented, but preferably both
  "status":     "hypothesis" }
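
The constraints stated in the comments above (argument is mandatory, at least one of value
or status must be present) could be checked along these lines. The function name
check_epistemics and the wording of the messages are hypothetical.

```python
def check_epistemics(frame):
    """Return a list of problems found in an epistemics frame (empty if none)."""
    problems = []
    if not frame.get("argument"):
        problems.append("missing mandatory 'argument'")
    if frame.get("value") is None and frame.get("status") is None:
        problems.append("need at least one of 'value' or 'status'")
    return problems


frame = {
    "object-type": "frame",
    "frame-id": "epi-PMC3847091-cmu-r4-11-2",
    "frame-type": "epistemics",
    "argument": "sent-PMC3847091-uaz-r13-11-2",
    "value": 0.6,
    "status": "hypothesis",
}
print(check_epistemics(frame))
```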


Absolute Text Positions (or Offsets):

{ "object-type": "absolute-pos", 

  // offset: mandatory, an absolute character offset from the beginning of the document
  // relative to the document's character encoding, e.g., a UTF8 character offset
  "offset": 1234}


Relative, Contextualized Text Positions:

{ "object-type": "relative-pos", 

  // reference: mandatory, the reference frame that this position is relative to, for example,
  // if the reference is a sentence object, the position is relative to the start of that sentence
  "reference": "sent-PMC3847091-uaz-r13-11-2", 

  // offset: mandatory, a character offset from the beginning of the reference object
  // relative to the document's character encoding, e.g., a UTF8 character offset
  "offset": 105, 

  // context-start: optional, a contextualized offset specified as "<text>/<n>" which denotes
  // the start position of the <n>-th occurrence of <text> in the text of the reference object.
  // For example, "ASPP2/2" denotes the start offset of the second match for "ASPP2" in the sentence
  // text of "sent-PMC3847091-uaz-r13-11-2" relative to the document's character encoding.
  // <text> doesn't necessarily have to be a single token and can contain "/", but it always has
  // a final "/<n>" suffix; matches are performed case-sensitively and do not have to begin or
  // end at word or token boundaries.
  "context-start": "ASPP2/2",

  // context-end: optional, similar to context-start but denotes the end position of the <n>-th match
  // generally, only one of "context-start" or "context-end" should be specified, but if they are
  // both provided, they should identify the same position
  "context-end": "phosphorylation/1", 

  // is-closed: optional, default is false, indicates whether this is a closed pointer interval
  // where the index points at the character as opposed to outside/behind it.  Standard C, Java
  // strings have half-open semantics where the start points at the first character and end points
  // behind the last one.  LDC annotations have closed semantics where end points at the last character.
  // So, this is primarily here to allow the representation of LDC end offsets.
  "is-closed": false}
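
A contextualized offset "<text>/<n>" can be resolved with plain string search. The sketch
below is one possible reading of the description above: <n> is 1-based, matching is
case-sensitive, and the end position is half-open. Whether overlapping matches count
separately is not specified; this sketch counts them. The function names are hypothetical.

```python
def parse_context(context):
    """Split "<text>/<n>" at the final "/" so that <text> may itself contain "/"."""
    text, _, n = context.rpartition("/")
    return text, int(n)


def find_start(context, reference_text):
    """Return the start offset of the n-th (1-based) match, or None if there is none."""
    text, n = parse_context(context)
    if n < 1:
        return None
    pos = -1
    for _ in range(n):
        pos = reference_text.find(text, pos + 1)
        if pos < 0:
            return None
    return pos


def find_end(context, reference_text):
    """Return the half-open end offset of the n-th match, or None if there is none."""
    start = find_start(context, reference_text)
    if start is None:
        return None
    text, _ = parse_context(context)
    return start + len(text)


sentence_text = ("Consequently, the pro-apoptotic activity of ASPP2 is increased by "
                 "the RAS/Raf/MAPK signalling cascade as ASPP2 phosphorylation mutant "
                 "fails to do so.")

print(find_start("ASPP2/2", sentence_text))
print(find_end("phosphorylation/1", sentence_text))
```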


Offset Mapping
--------------

Assumptions
- we have two or more independent sets of sentence frames for a particular document, for example,
  the list of passages and associated sentences extracted via nxml2fries and the list
  of sentences extracted by MedScan.
- we have annotations such as entity mentions whose start/end positions are relative
  to sentence list 1, and we have another set whose positions are relative to list 2
- for each set of mentions, we have relative, contextualized start/end positions available
  (as sketched out above)

Sentence Map
- the first step to support the mapping process is to construct a sentence map between
  sentence list 1 and 2
- this is a somewhat tricky/fuzzy/heuristic alignment job (on the shoulders of Pradeep :-)
  that uses the relative order of sentences and the tokens they contain to come up with
  a reasonable mapping; this can use various constraints to help in the search, for example,
  if we have a good match from s1_10 to s2_12, then s1_11 should only consider sentences
  close to and following s2_12, etc.
- the map should look like this:
  - for each sentence s1_i in list 1 and s2_k in list 2 we either have:
      mapsTo(s1_i, null)                // no corresponding sentence for s1_i
      mapsTo(s1_i, s2_k)                // s1_i is quasi-identical with s2_k
      mapsTo(null, s2_k)                // no corresponding sentence for s2_k
  - these are more complex, general cases in case sentences fully or partially overlap:
      mapsTo(s1_i, s2_k[s, e])          // s1_i is quasi-identical with the [s, e] region in s2_k (i.e., s2_k subsumes s1_i)
      mapsTo(s1_i[s, e], s2_k)          // s2_k is quasi-identical with the [s, e] region in s1_i (i.e., s1_i subsumes s2_k)
      mapsTo(s1_i[s1, e1], s2_k[s2,e2]) // the s1_i[s1,e1] region is quasi-identical with region s2_k[s2,e2] (i.e., s1_i overlaps s2_k)
  - each sentence maps to at most one sentence in the other list
  - sentences might not have a reasonable match at all
  - for a large number of cases we hope for a simple 1-1 or no match
  - if there are multiple possible mappings for a sentence, a single best one has to be chosen
  - since sentence segmentation might fail in different ways in the two systems, it is possible
    that one combines sentences that are broken apart in the other, or both fail in different ways
    which would lead to overlapping sentences.  We might ignore these cases initially and simply
    map them to null for now; but the region mappings sketched above could  capture these situations
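
To make the ordering constraint above concrete, here is a toy aligner. It is NOT the
matcher referred to above, only a minimal sketch under strong assumptions: sentences are
compared by Jaccard similarity of whitespace tokens, only simple 1-1 matches are produced,
and each search is restricted to a small window following the previous match. All names,
the threshold and the window size are hypothetical.

```python
def jaccard(a, b):
    """Token-set similarity of two sentence strings (whitespace tokenization)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def align(list1, list2, threshold=0.8, window=3):
    """Map each index in list1 to an index in list2, or to None if no match.

    After a match at position k, later sentences only consider
    candidates in list2 that follow k (within the given window).
    """
    mapping = {}
    next_k = 0
    for i, s1 in enumerate(list1):
        best_k, best_score = None, 0.0
        for k in range(next_k, min(len(list2), next_k + window)):
            score = jaccard(s1, list2[k])
            if score >= threshold and score > best_score:
                best_k, best_score = k, score
        mapping[i] = best_k
        if best_k is not None:
            next_k = best_k + 1
    return mapping


list1 = ["A binds B .", "C phosphorylates D .", "Extra sentence here ."]
list2 = ["Header", "A binds B .", "C phosphorylates D ."]
sentence_alignment = align(list1, list2)
print(sentence_alignment)
```

Sentences of list 2 that never appear as a value correspond to the mapsTo(null, s2_k) case.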


Map Representation

PROBLEM: it is possible for a sentence to overlap with multiple others, which in turn could overlap
with other sentences.  To account for this case, each entry in the map would have to be multi-valued,
allowing multiple target sentence ranges.  But let's wait on that for now...

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Pradeep-Matcher", "organization": "CMU"},
  "frame-id":    "smap-PMC3847091-cmu-r4-1",
  "frame-type": "sentence-map",

  // list1-meta, list2-meta: mandatory, contain meta info for the systems that generated the sentence list;
  // this is necessary, so we can identify a specific map if there are more than one, e.g., nxml-to-medscan
  "list1-meta": {"object-type": "meta-info", "component": "nxml2fries", "organization": "UAZ"},
  "list2-meta": {"object-type": "meta-info", "component": "MedScan", "organization": "CMU"},

  // map: mandatory, maps each sentence s1_i in list 1 onto a corresponding sentence s2_k in list 2
  // - if s1_i doesn't have a corresponding sentence, there won't be an entry in this map with key s1_i
  // - if s2_k doesn't have a corresponding sentence, there won't be an entry in this map whose "to" value is s2_k
  // - if only subregions of the two sentences map, s1, e1, s2, and e2 can be used to delineate them
  // - if any of s1, e1, s2, and e2 are absent, they default to the corresponding endpoint of the sentence

  "map": [{"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-1", "to": "sent-PMC3847091-cmu-r4-1"},
          {"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-2", "to": "sent-PMC3847091-cmu-r4-2"},
          // more complex cases for illustration, we might want to ignore them for now
          {"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-3", "to": "sent-PMC3847091-cmu-r4-3", "s1": 10, "e1": 100},
          {"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-4", "to": "sent-PMC3847091-cmu-r4-4", "s2": 15, "e2": 95},
          {"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-5", "to": "sent-PMC3847091-cmu-r4-6", "s1": 5, "e1": 50, "s2": 20, "e2": 68},
          .....] }
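
A consumer could index such a map by its "from" sentence and fill in the region defaults as
sketched below. The helper names are hypothetical, and reading "the corresponding endpoint
of the sentence" as 0 for starts and the sentence length for ends (half-open) is an
assumption.

```python
def build_lookup(sentence_map):
    """Index the mapping entries of a sentence-map frame by their "from" sentence ID."""
    return {entry["from"]: entry for entry in sentence_map["map"]}


def mapped_regions(entry, len1, len2):
    """Return (s1, e1, s2, e2), defaulting absent values to the sentence endpoints."""
    return (entry.get("s1", 0), entry.get("e1", len1),
            entry.get("s2", 0), entry.get("e2", len2))


sentence_map = {
    "frame-type": "sentence-map",
    "map": [
        {"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-1",
         "to": "sent-PMC3847091-cmu-r4-1"},
        {"object-type": "mapping", "from": "sent-PMC3847091-uaz-r13-1-3",
         "to": "sent-PMC3847091-cmu-r4-3", "s1": 10, "e1": 100},
    ],
}

lookup = build_lookup(sentence_map)
entry = lookup.get("sent-PMC3847091-uaz-r13-1-3")
# The sentence lengths 120 and 90 are made-up values for illustration.
print(entry["to"], mapped_regions(entry, 120, 90))
```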


Mapping Operations

All we need to represent in the data is the following:

- contextualized offsets relative to their respective source sentences as sketched in the data representation above
- the sentence list-to-list map(s) described above
- no actual translation of offsets needs to be done on the data

A data consumer can then perform the following operations (assuming only full quasi-identity, no ranges for now):

- simple case: full identity between text spans span1 and span2
    mapsTo(reference(start-pos(span1)), reference(start-pos(span2))) and   // establishes that they are from the same, quasi-identical sentence
    context-start(start-pos(span1)) = context-start(start-pos(span2)) and  // establishes that the start context token and occurrence numbers are the same
    context-end(end-pos(span1)) = context-end(end-pos(span2))              // establishes that the end context token and occurrence numbers are the same

- check overlap between text spans span1 and span2; findStart and findEnd are simple
  string search functions that find the position of the nth occurrence of a context token
    mapsTo(reference(start-pos(span1)), reference(start-pos(span2)))                    // establishes that they are from the same, quasi-identical sentence
    mapped-s1 = findStart(context-start(start-pos(span1)), reference(start-pos(span2))) // translates span1 start into span2 land
    mapped-e1 = findEnd(context-end(end-pos(span1)), reference(end-pos(span2)))         // translates span1 end into span2 land
    check overlap between span2 start/end offsets and [mapped-s1, mapped-e1]

- subrange to subrange mappings are similar, but now the n's in the context-start/end
  tokens need to be recomputed first relative to the given subranges; after that,
  everything else remains the same

- mapping failures occur if:
  - there is no corresponding mapped sentence
  - a token in a context-start/end does not occur in the mapped sentence (e.g., due to character set differences,
    Greek character normalizations, etc.)
  - nevertheless, we can still project a source start/end interval onto a target string purely based on length
    transformations and then see if the projected interval overlaps with the target


UAZ Old and New Representations
-------------------------------

OLD: no passages and sentences

NEW: passage and sentence objects as described above combined in a separate, single annotation file


OLD events:

    { "submitter":"UAZ",
      "type":"positive_regulation",
      "doc_id":"PMC3902907",
      "reading_ended":"2015-05-13 04:58:48",
      "controlled":"2",
      "negative_information":false,
      "event_id":"1",
      "reader_type":"machine",
      "reading_started":"2015-05-13 04:58:48",
      "controller":{"namespace":"uniprotkb", "text":"JAK3", "type":"protein", "id":"P52333"},
      "passage_id":"0",
      "evidence":"phosphorylation of HuR by JAK3",
      "offsets":[9, 39]},

    {"submitter":"UAZ",
      "participants":[{"namespace":"uniprotkb", "text":"HuR", "type":"protein", "id":"Q15717"}],
      "type":"phosphorylation",
      "reading_ended":"2015-05-13 04:58:48",
      "doc_id":"PMC3902907",
      "negative_information":false,
      "event_id":"2",
      "reader_type":"machine",
      "reading_started":"2015-05-13 04:58:48",
      "passage_id":"0",
      "offsets":[9, 31],
      "evidence":"phosphorylation of HuR"}

NEW events:
- meta-info goes into annotation-file meta info
- collect reading time on a per-document basis if possible and stick it into annotation-file meta info,
  for example:

File PMC3847091.uaz.events.json:

{ "object-type": "frame-collection",
  "object-meta": {"object-type": "meta-info", 
                  "component": "REACH", 
                  "component-type": "machine",
                  "organization": "UAZ", 
                  "doc-id": "PMC3847091", 
                  "processing-start": "2015-05-13 04:58:48", 
                  "processing-end": "2015-05-13 04:59:30", 
                  <anything else you deem important or interesting> },
  "frames": [ <entity mention and event frames as shown below> ] }
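
Moving the per-event meta fields of the OLD format into the annotation-file meta info could
be sketched as follows. The field correspondence is inferred from the OLD and NEW examples
in this section and is an assumption, as is the helper name; the component name is not
present in the OLD data and has to be supplied by the caller.

```python
# Assumed correspondence between OLD event fields and NEW meta-info fields.
OLD_TO_NEW_META = {
    "submitter": "organization",
    "reader_type": "component-type",
    "doc_id": "doc-id",
    "reading_started": "processing-start",
    "reading_ended": "processing-end",
}


def collection_meta_from_old_event(old_event, component):
    """Build a file-level meta-info object from the meta fields of one OLD event."""
    meta = {"object-type": "meta-info", "component": component}
    for old_key, new_key in OLD_TO_NEW_META.items():
        if old_key in old_event:
            meta[new_key] = old_event[old_key]
    return meta


old_event = {
    "submitter": "UAZ",
    "type": "phosphorylation",
    "doc_id": "PMC3902907",
    "reading_started": "2015-05-13 04:58:48",
    "reading_ended": "2015-05-13 04:58:48",
    "reader_type": "machine",
    "event_id": "2",
}

new_meta = collection_meta_from_old_event(old_event, "REACH")
print(new_meta)
```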


{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3847091-uaz-r1-0-0-0",
  "frame-type": "entity-mention",
  "index": 0,                                                       // optional, mention number in this sentence
  "sentence": "sent-PMC3847091-uaz-r1-0-0",                         // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3847091-uaz-r1-0-0",
                "offset": 28, 
                "context-start": "HuR/1"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3847091-uaz-r1-0-0",
              "offset": 31, 
              "context-end": "HuR/1"},
  "text": "HuR",
  "type": "protein",
  "xrefs": [{"object-type": "db-reference", "namespace": "uniprotkb", "id": "Q15717"}] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3847091-uaz-r1-0-0-1",
  "frame-type": "entity-mention",
  "index": 1,                                                        // optional, mention number in this sentence
  "sentence": "sent-PMC3847091-uaz-r1-0-0",                          // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3847091-uaz-r1-0-0",
                "offset": 35, "context-start": "JAK3/1"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3847091-uaz-r1-0-0",
              "offset": 39, 
              "context-end": "JAK3/1"},
  "text": "JAK3",
  "type": "protein",
  "xrefs": [{"object-type": "db-reference", "namespace": "uniprotkb", "id": "P52333"}] }


{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
  "frame-id":    "evem-PMC3847091-uaz-r1-0-0-0",
  "frame-type": "event-mention",
  "index": 0,                                                        // optional
  "sentence": "sent-PMC3847091-uaz-r1-0-0",                          // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 9, "context-start": "phosphorylation/1"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 31, "context-end": "HuR/1"},
  "text": "phosphorylation of HuR",
  "type": "phosphorylation",
  "arguments": [{"object-type": "argument", 
                 "type": "participant", 
                 "index": 0,                                         // optional
                 "text": "HuR",                                      // optional
                 "arg": "ment-PMC3847091-uaz-r1-0-0-0"}],
  "polarity": "positive" }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
  "frame-id":    "evem-PMC3847091-uaz-r1-0-0-1",
  "frame-type": "event-mention",
  "index": 1,                                                        // optional
  "sentence": "sent-PMC3847091-uaz-r1-0-0",                          // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 9, "context-start": "phosphorylation/1"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-0", "offset": 39, "context-end": "JAK3/1"},
  "text": "phosphorylation of HuR by JAK3",
  "type": "positive-regulation",
  "arguments": [{"object-type": "argument", 
                 "type": "controller", 
                 "index": 0,                                         // optional
                 "text": "JAK3",                                     // optional
                 "arg": "ment-PMC3847091-uaz-r1-0-0-1"},
                {"object-type": "argument", 
                 "type": "controlled", 
                 "index": 1,                                         // optional
                 "text": "phosphorylation of HuR",                   // optional
                 "arg": "evem-PMC3847091-uaz-r1-0-0-0"}],
  "polarity": "positive" }

This is what a similar binding event would look like:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
  "frame-id":    "evem-PMC3847091-uaz-r1-0-10-1",
  "frame-type": "event-mention",
  "index": 0,                                                        // optional
  "sentence": "sent-PMC3847091-uaz-r1-0-10",                         // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-10", "offset": 9, "context-start": "binding/1"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-0-10", "offset": 39, "context-end": "JAK3/1"},
  "text": "binding of HuR to JAK3",
  "type": "complex-assembly",
  "arguments": [{"object-type": "argument", 
                 "type": "participant", 
                 "index": 0,                                         // optional
                 "text": "HuR",                                      // optional
                 "arg": "ment-PMC3847091-uaz-r1-0-0-1"},
                {"object-type": "argument", 
                 "type": "participant", 
                 "index": 1,                                         // optional
                 "text": "JAK3",                                     // optional
                 "arg": "ment-PMC3847091-uaz-r1-0-0-0"}],
  "polarity": "positive" }


OLD Phosphorylation at site:

    { "submitter":"UAZ",
      "participants":[{"namespace":"uniprotkb", "text":"HuR", "type":"protein", "id":"Q15717"}],
      "subfields":{"site":"tyrosine residues"},
      "type":"phosphorylation",
      "doc_id":"PMC3902907",
      "reading_ended":"2015-05-13 04:58:48",
      "negative_information":false,
      "event_id":"65",
      "reader_type":"machine",
      "reading_started":"2015-05-13 04:58:48",
      "passage_id":"47",
      "evidence":"phosphorylates HuR at tyrosine residues",
      "offsets":[1554, 1593]}

NEW Phosphorylation at site (we promote the site to a mention, so it can become an event argument):

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3902907-uaz-r1-47-1-2",
  "frame-type": "entity-mention",
  "index": 1,                                                        // optional, mention number in this sentence
  "sentence": "sent-PMC3902907-uaz-r1-47-1",                         // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3902907-uaz-r1-47-1",
                "offset": 176, "context-start": "tyrosine/1"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3902907-uaz-r1-47-1",
              "offset": 193, 
              "context-end": "residues/1"},
  "text": "tyrosine residues",
  "type": "site" }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
  "frame-id":    "evem-PMC3902907-uaz-r1-47-1-1",
  "frame-type": "event-mention",
  "index": 0,                                                        // optional
  "sentence": "sent-PMC3902907-uaz-r1-47-1",                         // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-uaz-r1-47-1", "offset": 154, "context-start": "phosphorylates/1"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-uaz-r1-47-1", "offset": 193, "context-end": "residues/1"},
  "text": "phosphorylates HuR at tyrosine residues",
  "type": "phosphorylation",
  "arguments": [{"object-type": "argument", 
                 "type": "participant", 
                 "index": 0,                                         // optional
                 "text": "HuR",                                      // optional
                 "arg": "ment-PMC3902907-uaz-r1-47-1-1"},            // not shown
                {"object-type": "argument", 
                 "type": "at-location", 
                 "index": 1,                                         // optional
                 "text": "tyrosine residues",                        // optional
                 "arg": "ment-PMC3902907-uaz-r1-47-1-2"}],
  "polarity": "positive" }


OLD translocation:

    { "submitter":"UAZ",
      "participants":[{"namespace":"uniprotkb", "text":"ASPP2", "type":"protein", "id":"Q13625"}],
      "subfields":{
        "from":{"namespace":"go", "text":"plasma membrane", "type":"cellular_component", "id":"GO:0005886"},
        "to":{"namespace":"go", "text":"nucleus", "type":"cellular_component", "id":"GO:0005634"}},
      "type":"translocation",
      "doc_id":"PMC3847091",
      "reading_ended":"2015-05-13 04:57:54",
      "negative_information":false,
      "event_id":"80",
      "reader_type":"machine",
      "reading_started":"2015-05-13 04:57:54",
      "passage_id":"37",
      "evidence":"ASPP2 translocation from the plasma membrane to the cytosol and nucleus",
      "offsets":[61, 132]} 

NEW translocation (requires introduction of argument entity mentions - not shown here):

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "REACH"}, // optional
  "frame-id":    "evem-PMC3847091-uaz-r1-37-1-1",
  "frame-type": "event-mention",
  "index": 0,                                                        // optional
  "sentence:" "sent-PMC3847091-uaz-r1-37-1",                         // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-37-1", "offset": 61, "context-start": "ASPP2/1"},
  "end-pos": {"object-type": "relative-pos", "reference": "sent-PMC3847091-uaz-r1-37-1", "offset": 132, "context-end": "nucleus/1"},
  "text": "ASPP2 translocation from the plasma membrane to the cytosol and nucleus",
  "type": "translocation",
  "arguments": [{"object-type": "argument", 
                 "type": "participant", 
                 "index": 0,                                         // optional
                 "text": "ASPP2",                                    // optional
                 "arg": "ment-PMC3847091-uaz-r1-37-1-1"},            // not shown
                {"object-type": "argument", 
                 "type": "from-location", 
                 "index": 1,                                         // optional
                 "text": "plasma membrane",                          // optional
                 "arg": "ment-PMC3847091-uaz-r1-37-1-2"}             // not shown
                {"object-type": "argument", 
                 "type": "to-location", 
                 "index": 2,                                         // optional
                 "text": "cytosol",                                  // optional
                 "arg": "ment-PMC3847091-uaz-r1-37-1-3"}             // not shown
                {"object-type": "argument", 
                 "type": "to-location", 
                 "index": 3,                                         // optional
                 "text": "nucleus",                                  // optional
                 "arg": "ment-PMC3847091-uaz-r1-37-1-4"}],           // not shown
  "polarity": "positive" }


MedScan Examples of Old and New Representations
-----------------------------------------------

OLD: no sentences

NEW: sentence objects as described above combined in a separate, single annotation file


OLD entity mentions and relations:

 { "bookkeeping": {"CMU-offsets": [4488, 4497], "object-type": "bookkeeping"}, 
   "db-xrefs": [{"id": "8536", "namespace": "pubchem", "object-type": "xref"}], 
   "entity-type": "MedScan_DRUG", 
   "frame-id": 77, 
   "frame-type": "entity-mention", 
   "object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "MedScan"}, 
   "object-type": "frame", 
   "ref-sentence": 798, 
   "text": "menadione" }, 

 { "bookkeeping": {"CMU-offsets": [4525, 4562], "object-type": "bookkeeping"}, 
   "db-xrefs": [{"id": "3718", "namespace": "MEDSCAN:urn:agi-llid", "object-type": "xref"}], 
   "entity-type": "MedScan_GENE_PROTEIN", 
   "frame-id": 79, 
   "frame-type": "entity-mention", 
   "object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "MedScan"}, 
   "object-type": "frame", 
   "ref-sentence": 798, 
   "text": "tyrosine kinase Janus kinase 3 (JAK3)" }

 { "context": "menadione, a drug that activated the tyrosine kinase Janus kinase 3 (JAK3)", 
   "frame-id": 623, 
   "frame-type": "relation-mention", 
   "obj": 79, 
   "object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "MedScan"}, 
   "object-type": "frame", 
   "relation-type": "MedScan_Activation", 
   "subj": 77 }


NEW entity mentions and relations:

- meta-info goes into annotation-file meta info
- collect reading time on a per-document basis if possible and stick it into annotation-file meta info,
  for example:

File PMC3902907.cmu.medscan.json:

{ "object-type": "frame-collection",
  "object-meta": {"object-type": "meta-info", 
                  "component": "MedScan", 
                  "component-type": "machine",
                  "organization": "CMU", 
                  "doc-id": "PMC3902907", 
                  "processing-start": "2015-05-13 04:58:48", 
                  "processing-end": "2015-05-13 04:59:30", 
                  <anything else you deem important or interesting> },
  "frames": [ <entity mention and relation frames as shown below> ] }


{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "MedScan"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3902907-cmu-medscan-r1-98-1",
  "frame-type": "entity-mention",
  "index": 1,                                                          // optional, mention number in this sentence
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-98",                         // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3902907-cmu-medscan-r1-98",
                "offset": 22, 
                "context-start": "menadione/1"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3902907-cmu-medscan-r1-98",
              "offset": 31, 
              "context-end": "menadione/1"},
  "text": "menadione",
  "type": "DRUG",
  "xrefs": [{"object-type": "db-reference", "namespace": "pubchem", "id": "8536"}] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "MedScan"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3902907-cmu-medscan-r1-98-2",
  "frame-type":  "entity-mention",
  "index": 2,                                                          // optional, mention number in this sentence
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-98",                         // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3902907-cmu-medscan-r1-98",
                "offset": 59, 
                "context-start": "tyrosine/1"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3902907-cmu-medscan-r1-98",
              "offset": 96, 
              "context-end": "(JAK3)/1"},
  "text": "tyrosine kinase Janus kinase 3 (JAK3)",
  "type": "GENE_PROTEIN",
  "xrefs": [{"object-type": "db-reference", "namespace": "MEDSCAN:urn:agi-llid", "id": "3718"}] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "MedScan"}, // optional
  "frame-id":    "relm-PMC3902907-cmu-medscan-r1-98-0",
  "frame-type":  "relation-mention",
  "index": 0,                                                        // optional
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-98",                       // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 22, "context-start": "menadione/1"},
  "end-pos":   {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 96, "context-end":   "(JAK3)/1"},
  "text": "menadione, a drug that activated the tyrosine kinase Janus kinase 3 (JAK3)",
  "type": "Activation",
  "arguments": [{"object-type": "argument", 
                 "type": "subj", 
                 "index": 0,                                         // optional
                 "text": "menadione",                                // optional
                 "arg": "ment-PMC3902907-cmu-medscan-r1-98-1"},
                {"object-type": "argument", 
                 "type": "obj", 
                 "index": 1,                                         // optional
                 "text": "tyrosine kinase Janus kinase 3 (JAK3)",    // optional
                 "arg": "ment-PMC3902907-cmu-medscan-r1-98-2"}],
  "polarity": "positive" }


Jun's Events, Old and New
-------------------------

OLD:

 { "bookkeeping": {"CMU-offsets": [4506, 4510], "object-type": "bookkeeping"}, 
   "db-xrefs": [], 
   "frame-id": 78, 
   "frame-type": "entity-mention", 
   "object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "Jun-System"}, 
   "object-type": "frame", 
   "ref-sentence": 798, 
   "text": "that" }, 

 { "bookkeeping": {"CMU-offsets": [4547, 4553], "object-type": "bookkeeping"}, 
   "db-xrefs": [], 
   "frame-id": 80, 
   "frame-type": "entity-mention", 
   "object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "Jun-System"}, 
   "object-type": "frame", 
   "ref-sentence": 798, 
   "text": "kinase" },

 { "arg0": [78],
   "arg1": [80], 
   "bookkeeping": {"CMU-offsets": [4511, 4520], "object-type": "bookkeeping"}, 
   "context": "that activated the tyrosine kinase Janus kinase", 
   "event-type": "activate", 
   "frame-id": 567, 
   "frame-type": "event-mention", 
   "object-meta": {"object-type": "meta-info", "organization": "CMU", "source-system": "Jun-System"}, 
   "object-type": "frame", 
   "ref-sentence": 798, 
   "text": "activated" }

NEW:

File PMC3902907.cmu.junsys.json:

{ "object-type": "frame-collection",
  "object-meta": {"object-type": "meta-info", 
                  "component": "Jun-System", 
                  "component-type": "machine",
                  "organization": "CMU", 
                  "doc-id": "PMC3902907", 
                  "processing-start": "2015-05-13 04:58:48", 
                  "processing-end": "2015-05-13 04:59:30", 
                  <anything else you deem important or interesting> },
  "frames": [ <entity and event frames as shown below> ] }


{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3902907-cmu-junsys-r1-98-1",
  "frame-type": "entity-mention",
  "index": 1,                                                             // optional, mention number in this sentence
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-98",                        // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3902907-cmu-medscan-r1-98",
                "offset": 22, 
                "context-start": "that/1"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3902907-cmu-medscan-r1-98",
              "offset": 26, 
              "context-end": "that/1"},
  "text": "that",
  "type": "OTHER" }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits fields from frame collection
  "frame-id":    "ment-PMC3902907-cmu-junsys-r1-98-2",
  "frame-type":  "entity-mention",
  "index": 2,                                                             // optional, mention number in this sentence
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-98",                        // optional, from passage/sentence file
  "start-pos": {"object-type": "relative-pos", 
                "reference": "sent-PMC3902907-cmu-medscan-r1-98",
                "offset": 63, 
                "context-start": "kinase/2"},
  "end-pos": {"object-type": "relative-pos", 
              "reference": "sent-PMC3902907-cmu-medscan-r1-98",
              "offset": 69, 
              "context-end": "kinase/2"},
  "text": "kinase",
  "type": "OTHER",
  // Jun doesn't seem to have these for his mentions, but that's how they would be represented:
  "xrefs": [{"object-type": "db-reference", "namespace": "MEDSCAN:urn:agi-llid", "id": "3718"}] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional
  "frame-id":    "evem-PMC3902907-cmu-junsys-r1-98-0",
  "frame-type":  "event-mention",
  "index": 0,                                                             // optional
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-98",                        // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 22, "context-start": "that/1"},
  "end-pos":   {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-98", "offset": 69, "context-end":   "kinase/2"},
  "text": "that activated the tyrosine kinase Janus kinase",
  "type": "activate",
  "arguments": [{"object-type": "argument", 
                 "type": "arg0", 
                 "index": 0,                                         // optional
                 "text": "that",                                     // optional
                 "arg": "ment-PMC3902907-cmu-junsys-r1-98-1"},
                {"object-type": "argument", 
                 "type": "arg1", 
                 "index": 1,                                         // optional
                 "text": "kinase",                                   // optional
                 "arg": "ment-PMC3902907-cmu-junsys-r1-98-2"}],
  "polarity": "positive" }


NEW entities and events:

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits from frame collection
  "frame-id":    "ent-PMC3847091-cmu-junsys-r1-4",
  "frame-type": "entity",
  "index": 4,
  "members": ["ment-PMC3902907-cmu-junsys-r1-98-1", ...] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Jun-System"}, // optional, inherits from frame collection
  "frame-id":    "eve-PMC3847091-cmu-junsys-r1-3",
  "frame-type": "event",
  "index": 3,
  "members": ["evem-PMC3902907-cmu-junsys-r1-98-0", ...] }


Pradeep's Measurements, Old and New
-----------------------------------

OLD:

 { "assay-molecule": ["mRNA"], 
   "binds-constituent1": ["GAPDH"], 
   "binds-constituent2": ["SIRT1 mRNA", "IP"], 
   "cell-type": ["HeLa cells"], 

   "frame-type": "implication", 
   "implication-type": "binds", 
   "object-meta": {"object-type": "meta-info", "organization": "CMU"}, 
   "object-type": "frame", 
   "process": ["RIP ( IP"], 
   "ref-sentence": 871, 
   "transfection-molecule": ["HuR IP"] }

NEW - Suggested:

Pradeep: not sure if you want to introduce an entity mention for each of your arguments, so I didn't do this
for now and instead added a text-argument object that simply carries the text string.  Eventually, we might want to
point to full mentions, because that would allow us to store additional offset/type/db/etc. information.

File PMC3902907.cmu.pradsys.json:

{ "object-type": "frame-collection",
  "object-meta": {"object-type": "meta-info", 
                  "component": "Pradeep-System", 
                  "component-type": "machine",
                  "organization": "CMU", 
                  "doc-id": "PMC3902907", 
                  "processing-start": "2015-05-13 04:58:48", 
                  "processing-end": "2015-05-13 04:59:30", 
                  <anything else you deem important or interesting> },
  "frames": [ <implication frames as shown below> ] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": “Pradeep-System"},  // optional, inherits from frame-collection 
  "frame-id":    “imp-PMC3902907-cmu-pradsys-r1-71-0",
  "frame-type": “implication",                                                 // or measurement?
  "index": 0,                                                                  // optional
  "sentence:" "sent-PMC3902907-cmu-medscan-r1-71",                             // optional
  "start-pos": {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-71", "offset": 0,   "context-start": “(/1"},
  "end-pos":   {"object-type": "relative-pos", "reference": "sent-PMC3902907-cmu-medscan-r1-71", "offset": 380, "context-end": “IP/4"},
  "text": "( A , B ) After treatment of HeLa cells with arsenite and/or menadione as explained in Figure 1 , RIP ( IP followed by RT-qPCR ) analysis was used to measure the levels of enrichment of SIRT1 mRNA ( A ) and VHL mRNA ( B ) associated with HuR ; the samples were normalized using GAPDH mRNA , and the data represented as enrichment of each mRNA in HuR IP were compared with IgG IP",
  "type": "binds", 
  "arguments": [{"object-type": "text-argument", "type": "assay-molecule",        "arg": “mRNA"},
                {"object-type": "text-argument", "type": "binds-constituent1",    "arg": “GAPDH"},
                {"object-type": "text-argument", "type": "binds-constituent2",    "arg": “SIRT1 mRNA"},
                {"object-type": "text-argument", "type": "binds-constituent2",    "arg": “IP"},
                {"object-type": "text-argument", "type": "cell-type",             "arg": “HeLa cells"},
                {"object-type": "text-argument", "type": "process",               "arg": “RIP ( IP"},
                {"object-type": "text-argument", "type": "transfection-molecule", "arg": “HuR IP"}]}


Nicolas' Epistemics, Old and New
--------------------------------

OLD:

 { "frame-id": 1012, 
   "frame-type": "epistemics", 
   "object-meta": {
       "object-type": "meta-info", 
       "organization": "CMU"
   }, 
   "object-type": "frame", 
   "ref-sentence": 886, 
   "value": 0.6 }

NEW:

File PMC3902907.cmu.nicosys.json:

{ "object-type": "frame-collection",
  "object-meta": {"object-type": "meta-info", 
                  "component": "Nicolas-System", 
                  "component-type": "machine",
                  "organization": "CMU", 
                  "doc-id": "PMC3902907", 
                  "processing-start": "2015-05-13 04:58:48", 
                  "processing-end": "2015-05-13 04:59:30", 
                  <anything else you deem important or interesting> },
  "frames": [ <epistemics frames as shown below> ] }

{ "object-type": "frame",
  "object-meta": {"object-type": "meta-info", "component": "Nicolas-System"},  // optional, inherits from frame-collection
  "frame-id":   "epi-PMC3847091-cmu-nicosys-r1-11-0",
  "frame-type": "epistemics",
  "index": 0,                                                                  // optional
  "argument":   "sent-PMC3847091-cmu-medscan-r1-11",
  "text":       "In sum , these results indicate that HuR tyrosine phosphorylation at Y200 , which excludes HuR from SGs , also promotes the dissociation of HuR from target transcripts ( SIRT1 mRNA and VHL mRNA ) , or perhaps mobilizes HuR-SIRT1 mRNA and HuR-VHL mRNA complexes away from SGs , accelerating their degradation ( Figure 7 ) .",       // optional
  "value":      0.6,
  "status":     "hypothesis" }