Multiple sources per dictionary entry #61

eddieantonio · 2018-09-17T15:16:30Z

NDS should have the ability to list multiple data sources per entry. In the bespoke XML dictionary format, an <e> entry can only have one source:

<e src="Cree : Words / nehiýawewin : itwēwina">
   <l pos="N">-ohkom</l>
</e>

Proposal

For each entry, be able to specify multiple sources using the sources= attribute:

<e sources="cw md">
   <l pos="N">-ohkom</l>
</e>

The sources= attribute is a space-separated list of one or more source references, kind of like bibliography keys. Each source is defined in its own <source> element:

<source id="cw">
   <title>Cree : Words / nehiýawewin : itwēwina</title>
</source>
<source id="md">
   <title>Maskwacîs Dictionary</title>
</source>

The id= attribute is the key that the sources= attribute links to.
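
A minimal sketch of how a consumer might resolve these references, using Python's standard xml.etree.ElementTree. The <r> wrapper element and the function name resolve_sources are assumptions made for illustration only; this is not how neahtta does it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <r> wrapper is only here so the snippet parses.
XML = """\
<r>
  <source id="cw">
    <title>Cree : Words / nehiýawewin : itwēwina</title>
  </source>
  <source id="md">
    <title>Maskwacîs Dictionary</title>
  </source>
  <e sources="cw md">
    <l pos="N">-ohkom</l>
  </e>
</r>
"""


def resolve_sources(root):
    """Map each entry's lemma to the titles of the sources it cites."""
    # Build the id -> title lookup from the <source> elements.
    titles = {s.get("id"): s.findtext("title") for s in root.iter("source")}
    resolved = {}
    for entry in root.iter("e"):
        lemma = entry.findtext("l")
        # str.split() with no argument splits on any run of whitespace.
        keys = entry.get("sources", "").split()
        resolved[lemma] = [titles[key] for key in keys]
    return resolved


root = ET.fromstring(XML)
print(resolve_sources(root))
```

An unknown key raises KeyError here, which is probably what one wants while the format is still being prototyped.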

TODOs for this issue:

Create tiny prototype based on above XML (@eddieantonio)
Document the XML format (Documenting the new XML format #87) (@eddieantonio/@aarppe)
Implement basic support in neahtta (Display multiple dictionary sources #88) (@eddieantonio)
Regenerate dictionaries using new format (Regenerate Cree->English dictionary using revised dictionary XML format #89) (@aarppe)
Write integration tests for new dictionary (@eddieantonio)
Different presentation for multiple dictionary sources (Concise representations: translation groups, meaning groups, and sources #91) (@eddieantonio)
Fix dictionary generation bugs (Definitions missing for some entries #93, Entries that should have two translation groups have two meaning groups instead #102) (@aarppe)

aarppe · 2018-09-20T07:02:15Z

Yes, this sounds like a good way to proceed, where we can make use of the existing XML structure without drastic changes. This is better than simply noting multiple sources in the field string.

But we should note that:

1. the Maskwacîs dictionary does not mark vowel length, uses the non-standard <ch> digraph for <c>, and separates SRO preverbs from the stem with a space (sometimes resulting in a joiner -h, e.g. CW 'ê-' corresponds to 'eh ' in MD). In our discussions with Rose Makinaw, we have been told that they (MWE and subsequently MESC) would be OK with marking vowel length, as well as using <c>, and following the SRO standard for hyphenated preverbs.
2. even if the MD and CW lexical entries are reconcilable to be one and the same word form/lemma, the English glosses in MD and CW can be divergent to the extent that we would want to present both, ideally under the same lexical entry.

As to (1), we started two years ago to use the descriptive and normative FSTs and a bit of scripting as a basis for normatizing the MD lexical entries as described above. However, in some 1889 cases, there are multiple possible analyses resulting in multiple alternative forms. Katie is currently reconciling those forms and selecting the single correct normative form.

As to (2), we have been scrutinizing the MD forms that we have been able to analyze with the descriptive FST (n=7363 out of 9k), in order to assess whether the English translations are effectively the same or not. In addition, there are some 1800 MD lemmas that are not morphologically traceable/analyzable to a CW-based lemma. These may represent genuine new lemmas, that would eventually need to be added to the core FST, or genuine typos, or morphological derivations that have not yet been implemented in the core FST.

In effect, the current plan is that we will at this point be including in itwêwina only Wolvengrey's Cree Words and the Maskwacîs Cree Dictionary (because of issues with the Elders' Dictionary noted below).

In this case, we'd in effect have three types of lexical entries:

1. Lexical entry (after normatization) can be found in both CW and MD.
2. Lexical entry can be found in either CW or MD.
3. MD lexical entry is an inflected form of a lexical entry in CW (or vice versa).

In the case of (1), we have three subtypes:

1a. The English translations/glosses are exactly/effectively the same, in which case we would present the gloss only once, and indicate both CW and MD as the joint source. E.g. acâhkos - MD: 'A star'; CW: 'star'. In such a case, the presentation could be the following:

acâhkos (Noun, NA)

star [MD, CW]

1b. The English glosses are not exactly the same, but nevertheless obviously semantically related or polysemous, e.g. acosis - MD: 'An arrow'; CW: 'arrow, little arrow'. A possible presentation would be:

acosis (Noun, NA)

an arrow [MD]
arrow, little arrow [CW]

1c. The English glosses are different and not clearly related, i.e. cases of homonymy. E.g. mimikopitam - MD: 'He shakes it'; CW: 's/he rubs s.t.' The paradigm generation will be the same. In such case, the presentation could be the following:

mimikopitam (Verb, VTI)

He shakes it [MD]
s/he rubs s.t. [CW]

OR

mimikopitam (Verb, VTI)

He shakes it [MD]

mimikopitam (Verb, VTI)

s/he rubs s.t. [CW]
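
To make the 1a versus 1b split concrete, here is a rough sketch in Python. The rule for "effectively the same" (equal after lowercasing, dropping a leading English article, and trimming a final period) is purely an assumption for illustration, as are the function names; the actual comparison is being done by hand and may use quite different criteria.

```python
import re

# Assumption: glosses count as "effectively the same" if they match after
# lowercasing, dropping a leading English article, and trimming final periods.
ARTICLE = re.compile(r"^(a|an|the)\s+", re.IGNORECASE)


def normalize(gloss):
    gloss = gloss.strip().rstrip(".").strip()
    return ARTICLE.sub("", gloss).lower()


def present(glosses):
    """glosses: list of (source_id, gloss) pairs for one lexical entry.

    Returns display lines, merging sources whose glosses normalize alike.
    """
    merged = {}  # normalized gloss -> (source ids, original glosses)
    for source, gloss in glosses:
        sources, originals = merged.setdefault(normalize(gloss), ([], []))
        sources.append(source)
        originals.append(gloss)
    lines = []
    for key, (sources, originals) in merged.items():
        # A gloss shared by several sources is shown in its normalized form;
        # a gloss from a single source is shown exactly as that source has it.
        text = originals[0] if len(sources) == 1 else key
        lines.append(f"{text} [{', '.join(sources)}]")
    return lines


print(present([("MD", "A star"), ("CW", "star")]))
print(present([("MD", "An arrow"), ("CW", "arrow, little arrow")]))
```

A string comparison like this cannot tell 1b from 1c; deciding whether two different glosses are related or homonymous still needs a human.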

Cases of type (3) could be treated similarly to the lexicalized preverbs in CW, thus providing (a) a lexical entry and translation for the inflected form (from MD or CW), with the linguistic analysis indicating the underlying lemma, as well as (b) the lexical entry and translation for the underlying lemma, with the full, productive linguistic analysis to the right, e.g. âcimostawêwak (< âcimostawêw) - MD: 'They are telling stories to them.'; CW: 's/he tells a story to s.o., s/he tells news to s.o.; s/he tells s.o. about (it/him), s/he gives s.o. an account'. The presentation could then be:

âcimostawêwak (--> âcimostawêwak <-- √âcimostawêw V+TA+Ind+Prs+3Pl+4Sg/PlO)

They are telling stories to them. [MD];

âcimostawêw (--> âcimostawêwak <-- √âcimostawêw V+TA+Ind+Prs+3Pl+4Sg/PlO)

s/he tells a story to s.o., s/he tells news to s.o.; s/he tells s.o. about (it/him), s/he gives s.o. an account. [CW]

I'm ignoring for the moment that CW and MD use different ways of marking actors and goals (and preceding nouns with articles), which some would want to standardize, and others might want to retain as an indicator of two distinct dictionary works with different authors. This would be a matter for community feedback.

As historical background, when we first started with NDS/itwêwina, it was possible to have multiple sources, each in their own XML file, where the tradition/convention was to separate the content into separate files for each part-of-speech, e.g. for nouns in AECD, i.e. N_elders.xml

<e src="Elders">
   <lg>
      <l pos="n">acahkos</l>
      <lc>NA</lc>
   </lg>
   <mg>
   <tg xml:lang="eng">
       <t pos="n">a star</t>
   </tg>
   </mg>
</e>
...

When one then searched for 'acahkos', one got a match in both CW and AECD (and MD for that matter), and with a linguistic analysis for both, since the AECD and CW spelling of the word is such that it could be analyzed with the then crk FST.

But what we then decided was that this would result in a large amount of multiple matches in the three dictionaries, when often the English gloss was the same, but not always. This repetition of glosses was a feature of the on-line Cree dictionary that Arok wanted to avoid at all cost, understandably. We wanted to keep open the option of amalgamating MD and CW within itwêwina, without waiting for Arok to go over MD content and add those to CW (he's done that a bit, but is far from through).

So we decided to focus on including first Wolvengrey's CW, then do an evaluation (which Katie has gotten close to finishing) of where MD in effect entirely overlaps with CW, so that it could be presented as a single entry with two sources, and where it doesn't, in which case one would be presented with two entries, with the different glosses and with the sources indicated (as discussed above). For MD, this would require the standardization of the lexical entries as discussed above. AECD was left for later, since it is said to have orthographical inconsistencies, and its extent is much larger than MD, requiring much more time to assess overlap with CW (and MD).

Not to forget, as Arok mentioned, CW has more detailed source information which could also be provided (I think there's a CSV field, but it probably isn't implemented entirely systematically).

Also, a thing to remember is that the English glosses of MD have not been processed in the same way as CW (lemmatized and POS-tagged), so that would need to be implemented at least for the MD lexical entries with non-matching glosses in relation to CW (there is a process for this).

Anyhow, the really old NDS was able to aggregate multiple sources represented by multiple XML files, so it is good to be aware that NDS has some code for this (in order to avoid/block it, or make use of it, probably the former).

And finally, ideally we'd have a database which would incorporate all the dictionary sources in their entirety that we have access to, creating the XML or other dictionary-internal representation automatically. But that's a different project.

aarppe · 2018-11-09T04:04:08Z

For reference, the following is the basic structure for NDS-style XML (from: http://giellatekno.uit.no/doc/dicts/dictionarywork.html):

The Saami-to-English equivalent of the original Saami-to-Norwegian entry for sudja would be the following:

<e src="nj" usage="vd">
      <lg>
         <l pos="N">sudja</l>
         <lc>sujat</lc>
      </lg>
      <mg>
         <tg>
            <t pos="N">reason</t>
            <t pos="N">ground</t>
         </tg>
      </mg>
      <mg>
         <tg>
            <t pos="N">fault</t>
         </tg>
      </mg>
   </e>

So, for any source language lexical entry one can have one or more meaning groups <mg>...</mg>. Within each of the meaning groups, one can have one or more translation groups <tg>...</tg>. The translations <t>...</t> within each translation group would be near-synonyms, while the meaning groups represent clearly distinct senses, though still under a single lexical entry.
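
As a quick sanity check of that nesting, here is a small Python sketch (standard library only) that walks the sudja entry above and returns its translations grouped by meaning group and translation group. The function name senses is made up for this example.

```python
import xml.etree.ElementTree as ET

ENTRY = """\
<e src="nj" usage="vd">
  <lg>
    <l pos="N">sudja</l>
    <lc>sujat</lc>
  </lg>
  <mg>
    <tg>
      <t pos="N">reason</t>
      <t pos="N">ground</t>
    </tg>
  </mg>
  <mg>
    <tg>
      <t pos="N">fault</t>
    </tg>
  </mg>
</e>
"""


def senses(entry):
    """Return one list per <mg>; each holds one list of translations per <tg>."""
    return [
        [[t.text for t in tg.findall("t")] for tg in mg.findall("tg")]
        for mg in entry.findall("mg")
    ]


entry = ET.fromstring(ENTRY)
print(entry.findtext("lg/l"), senses(entry))
```

The outer list has two items (two distinct senses), and the first sense has a single translation group holding the near-synonyms 'reason' and 'ground'.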

If we want to represent multiple sources under the same <e>...</e> field within the original XML structure, we actually ought to be able to insert the source code within each translation (group) <t>...</t>, as the English translation is particular to the source that it comes from, and if we have multiple translations from multiple sources, then we'd want to indicate the source for each translation, rather than collectively for the entire lexical entry.

This completes an analysis of the presumed thinking behind the original XML structure (for the Saami NDS dictionaries).

aarppe · 2018-11-09T04:27:25Z

A possible solution for the Cree-to-English XML source code, incorporating two or more dictionary sources, and presenting the dictionary source in conjunction with the English translation:

Case 1a above:

acâhkos (Noun, NA)

star [MD, CW]

<e>
   <lg>
      <l pos="N">acâhkos</l>
      <lc>NA-1</lc>
      <stem>acâhkos-</stem>
   </lg>
   <mg>
       <tg xml:lang="eng">
           <t pos="N" sources="MD CW">star</t>
       </tg>
   </mg>
</e>

Case 1b above:

acosis (Noun, NA)

an arrow [MD]
arrow, little arrow [CW]

<e>
   <lg>
      <l pos="N">acosis</l>
      <lc>NA-1</lc>
      <stem>acosis-</stem>
   </lg>
   <mg>
       <tg xml:lang="eng">
           <t pos="N" sources="MD">an arrow</t>
       </tg>
       <tg xml:lang="eng">
           <t pos="N" sources="CW">arrow, little arrow</t>
       </tg>
   </mg>
</e>

Case 1c above:

mimikopitam (Verb, VTI)

He shakes it [MD]
s/he rubs s.t. [CW]

<e>
   <lg>
      <l pos="V">mimikopitam</l>
      <lc>VTI-1</lc>
      <stem>mimikopit-</stem>
   </lg>
   <mg>
       <tg xml:lang="eng">
           <t pos="V" sources="MD">He shakes it.</t>
       </tg>
   </mg>
   <mg>
       <tg xml:lang="eng">
           <t pos="V" sources="CW">s/he rubs s.t.</t>
       </tg>
   </mg>
</e>

The alternative to case 1c above would be to have two distinct lexical entry <e>...</e> fields for the two distinct meanings.

How all this renders on the paradigm presentation page is something I'm not sure will turn out to be as nice, or as straightforward, as we'd want it to be.
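
For the plain entry listing at least, a rough Python sketch of turning the case 1b XML into the text presentation could look like the following. The function name render is hypothetical, and the header simply echoes the raw <lc> value rather than mapping it to a label such as "Noun, NA".

```python
import xml.etree.ElementTree as ET

ENTRY = """\
<e>
  <lg>
    <l pos="N">acosis</l>
    <lc>NA-1</lc>
    <stem>acosis-</stem>
  </lg>
  <mg>
    <tg xml:lang="eng">
      <t pos="N" sources="MD">an arrow</t>
    </tg>
    <tg xml:lang="eng">
      <t pos="N" sources="CW">arrow, little arrow</t>
    </tg>
  </mg>
</e>
"""


def render(entry):
    """Return display lines: a header, then one line per <t> with its sources."""
    lemma = entry.findtext("lg/l")
    inflection_class = entry.findtext("lg/lc")
    lines = [f"{lemma} ({inflection_class})"]
    for t in entry.iter("t"):
        # sources= is space-separated in the XML, comma-separated on display.
        sources = ", ".join(t.get("sources", "").split())
        lines.append(f"{t.text} [{sources}]")
    return lines


print("\n".join(render(ET.fromstring(ENTRY))))
```

This flattens meaning groups and translation groups into one list of lines, so it says nothing yet about how distinct senses should be set apart visually.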

eddieantonio · 2018-11-09T16:25:29Z

I prefer the first option, where a single entry has multiple meaning groups, tagged with their sources.

So!

We're adding the <source> element, which will describe a particular source, tagged with an id=.
Each <t> element has an obligatory sources= attribute which lists the dictionary source IDs this translation came from, separated by spaces.

Here's what a "complete" dictionary file would look like:

<?xml version="1.0" encoding="UTF-8"?>
<r>
   <!-- The dictionary sources -->
   <source id="CW">
      <title>Cree : Words / nehiýawewin : itwēwina</title>
   </source>
   <source id="MD">
      <title>Maskwacîs Dictionary</title>
   </source>

   <!-- The dictionary entries -->
   <e>
      <lg>
         <l pos="N">acâhkos</l>
         <lc>NA-1</lc>
         <stem>acâhkos-</stem>
      </lg>
      <mg>
          <tg xml:lang="eng">
              <t pos="N" sources="MD CW">star</t>
          </tg>
      </mg>
   </e>

   <e>
      <lg>
         <l pos="N">acosis</l>
         <lc>NA-1</lc>
         <stem>acosis-</stem>
      </lg>
      <mg>
          <tg xml:lang="eng">
              <t pos="N" sources="MD">an arrow</t>
          </tg>
          <tg xml:lang="eng">
              <t pos="N" sources="CW">arrow, little arrow</t>
          </tg>
      </mg>
   </e>

   <e>
      <lg>
         <l pos="V">mimikopitam</l>
         <lc>VTI-1</lc>
         <stem>mimikopit-</stem>
      </lg>
      <mg>
          <tg xml:lang="eng">
              <t pos="V" sources="MD">He shakes it.</t>
          </tg>
      </mg>
      <mg>
          <tg xml:lang="eng">
              <t pos="V" sources="CW">s/he rubs s.t.</t>
          </tg>
      </mg>
   </e>
</r>
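
Since sources= is obligatory on <t> and every ID has to match a <source>, a small checker is easy to sketch in Python. This is only an illustration: check_sources and the deliberately broken SAMPLE below are made up for this comment and are not part of neahtta.

```python
import xml.etree.ElementTree as ET


def check_sources(root):
    """Return a list of problems: <t> without sources=, or undefined IDs."""
    defined = {s.get("id") for s in root.findall("source")}
    problems = []
    for entry in root.findall("e"):
        lemma = entry.findtext("lg/l")
        for t in entry.iter("t"):
            ids = t.get("sources", "").split()
            if not ids:
                problems.append(f"{lemma}: <t>{t.text}</t> has no sources=")
            for source_id in ids:
                if source_id not in defined:
                    problems.append(f"{lemma}: unknown source {source_id!r}")
    return problems


# Deliberately broken sample: "MD" is never defined, and one <t> lacks sources=.
SAMPLE = """\
<r>
  <source id="CW"><title>Cree : Words / nehiýawewin : itwēwina</title></source>
  <e>
    <lg><l pos="N">acâhkos</l></lg>
    <mg>
      <tg xml:lang="eng">
        <t pos="N" sources="MD CW">star</t>
        <t pos="N">a star</t>
      </tg>
    </mg>
  </e>
</r>
"""

for problem in check_sources(ET.fromstring(SAMPLE)):
    print(problem)
```

Run against the complete file above, the same function should return an empty list, since both CW and MD are defined and every <t> carries sources=.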

aarppe · 2018-12-07T21:27:43Z

Corrected copy-paste erroneous stem for 'acosis', which is 'acosis-'.

aarppe · 2018-12-20T03:23:49Z

Note that 'acosis' is currently probably not the best example of case 1b above. âcihtin may be better, but this is based on the current dictionary comparison coding, which may always change.

eddieantonio · 2018-12-20T16:18:49Z

Huzzah! I think this one is finally done! Seems like there are still a few presentational tweaks to do, but that seems like a different issue.

eddieantonio added the enhancement label Sep 17, 2018

eddieantonio assigned eddieantonio and aarppe Nov 9, 2018

This was referenced Dec 4, 2018

Documenting the new XML format #87

Merged

Display multiple dictionary sources #88

Merged

Regenerate Cree->English dictionary using revised dictionary XML format #89

Closed

eddieantonio closed this as completed Dec 20, 2018
