Multiple sources per dictionary entry #61
Comments
Yes, this sounds like a good way to proceed, where we can make use of the existing XML structure without drastic changes - this is better than simply noting multiple sources in the field string. But we should note that:
As to (1), we started two years ago to use the descriptive and normative FSTs and a bit of scripting as a basis for normatizing the MD lexical entries as described above. However, in some 1889 cases, there are multiple possible analyses resulting in multiple alternative forms. Katie is currently reconciling those forms and selecting the single correct normative form.
As to (2), we have been scrutinizing the MD forms that we have been able to analyze with the descriptive FST (n=7363 out of 9k), in order to assess whether the English translations are effectively the same or not. In addition, there are some 1800 MD lemmas that are not morphologically traceable/analyzable to a CW-based lemma. These may represent genuinely new lemmas that would eventually need to be added to the core FST, or genuine typos, or morphological derivations that have not yet been implemented in the core FST.
In effect, the current plan is that we will at this point be including in itwêwina only Wolvengrey's Cree Words and the Maskwacîs Cree Dictionary (because of issues with the Elders' Dictionary noted below). In this case, we'd in effect have two types of lexical entries:
In the case of (1), we have three subtypes:
1a. The English translations/glosses are exactly/effectively the same, in which case we would present the gloss only once, and indicate both CW and MD as the joint source. E.g. acâhkos - MD: 'A star'; CW: 'star'. In such a case, the presentation could be the following:
acâhkos (Noun, NA)
1b. The English glosses are not exactly the same, but nevertheless obviously semantically related or polysemous, e.g. acosis - MD: 'An arrow'; CW: 'arrow, little arrow'. A possible presentation would be:
acosis (Noun, NA)
1c. The English glosses are different and not clearly related, i.e. cases of homonymy. E.g. mimikopitam - MD: 'He shakes it'; CW: 's/he rubs s.t.' The paradigm generation will be the same. In such a case, the presentation could be the following:
mimikopitam (Verb, VTI)
OR mimikopitam (Verb, VTI)
mimikopitam (Verb, VTI)
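As a rough illustration of how case 1a candidates could be pre-sorted before manual review, here is a minimal Python sketch. It is not the comparison procedure actually used for MD and CW: the function names and the normalization rules (lowercasing, dropping a leading English article and a trailing period) are assumptions made for the example. Anything it does not flag as identical would still need a human decision between 1b and 1c.

```python
import re

# Hypothetical first-pass heuristic; the rules below are assumptions,
# not the project's actual dictionary comparison coding.
_LEADING_ARTICLE = re.compile(r"^(?:a|an|the)\s+", re.IGNORECASE)


def normalize_gloss(gloss: str) -> str:
    """Lowercase, trim, drop a trailing period and a leading English article."""
    text = gloss.strip().rstrip(".").strip().lower()
    return _LEADING_ARTICLE.sub("", text)


def same_gloss(md_gloss: str, cw_gloss: str) -> bool:
    """True when both glosses are identical after normalization (a case 1a candidate)."""
    return normalize_gloss(md_gloss) == normalize_gloss(cw_gloss)
```

With the examples above, `same_gloss("A star", "star")` holds, while the acosis and mimikopitam pairs do not match and would be left for manual classification.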
Cases of type (3) could be treated similarly to the lexicalized preverbs in CW, thus providing (a) a lexical entry and translation for the inflected form (from MD or CW), with the linguistic analysis indicating the underlying lemma, as well as (b) the lexical entry and translation for the underlying lemma, with the full, productive linguistic analysis to the right, e.g. âcimostawêwak (< âcimostawêw) - MD: 'They are telling stories to them.'; CW: 's/he tells a story to s.o., s/he tells news to s.o.; s/he tells s.o. about (it/him), s/he gives s.o. an account'. The presentation could then be: âcimostawêwak (--> âcimostawêwak <-- √âcimostawêw V+TA+Ind+Prs+3Pl+4Sg/PlO)
âcimostawêw (--> âcimostawêwak <-- √âcimostawêw V+TA+Ind+Prs+3Pl+4Sg/PlO)
I'm ignoring for the moment that CW and MD use different ways of marking actors and goals (and preceding nouns with articles), which some would want to standardize, and others might want to retain as an indicator of two distinct dictionary works with different authors. This would be a matter for community feedback. As historical background, when we first started with NDS/itwêwina, it was possible to have multiple sources, each in their own XML file, where the tradition/convention was to separate the content into separate files for each part-of-speech, e.g. for nouns in AECD, i.e. N_elders.xml
When one then searched for 'acahkos', one got a match in both CW and AECD (and MD for that matter), and with a linguistic analysis for both, since the AECD and CW spelling of the word is such that it could be analyzed with the then crk FST. But what we then decided was that this would result in a large amount of multiple matches in the three dictionaries, when often the English gloss was the same, but not always. This repetition of glosses was a feature of the on-line Cree dictionary that Arok wanted to avoid at all cost, understandably.
We wanted to keep open the option of amalgamating MD and CW within itwêwina, without waiting for Arok to go over MD content and add those to CW (he's done that a bit, but is far from through). So we decided to focus on including first Wolvengrey's CW, then do an evaluation that Katie has gotten close to finishing of where MD in effect entirely overlaps with CW (so could be presented as a single entry, but with just two sources), and when it doesn't, in which case one would be presented with two entries, with the different glosses and with the sources indicated (as discussed above). For MD, this would require the standardization of the lexical entries as discussed above. AECD was left for later, since it is said to have orthographical inconsistencies, and its extent is much larger than MD, requiring much more time to assess overlap with CW (and MD).
Not to forget, as Arok mentioned, CW has more detailed source information which could also be provided (I think there's a CSV field, but it probably isn't implemented entirely systematically). Also, a thing to remember is that the English glosses of MD have not been processed in the same way as CW (lemmatized and POS-tagged), so that would need to be implemented at least for the MD lexical entries with non-matching glosses in relation to CW (there is a process for this).
Anyhow, the really old NDS was able to aggregate multiple sources represented by multiple XML files, so it is good to be aware that NDS has some code for this (in order to avoid/block it, or make use of it, probably the former). And finally, ideally we'd have a database which would incorporate all the dictionary sources in their entirety that we have access to, creating the XML or other dictionary-internal representation automatically. But that's a different project.
For reference, the following is the basic structure for NDS-style XML (from: http://giellatekno.uit.no/doc/dicts/dictionarywork.html): The Saami-to-English equivalent of the original Saami-to-Norwegian entry for
So, for any source language lexical entry one can have one or more meaning groups
If we want to represent within the original XML structure multiple sources under the same
This completes an analysis of the presumed thinking under the original XML structure (for the Saami NDS dictionaries).
A possible solution for the Cree-to-English XML source code, incorporating two or more dictionary sources, and presenting the dictionary source in conjunction with the English translation: Case 1a above: acâhkos (Noun, NA)
Case 1b above: acosis (Noun, NA)
Case 1c above: mimikopitam (Verb, VTI)
The alternative to case 1c above would be to have two distinct lexical entry
How all this renders itself on the paradigm presentation page is something I'm not sure will turn out to be as nice, or straightforward, as we'd want it to be.
I prefer the first option, where a single entry has multiple meaning groups, tagged with their sources. So!
Here's what a "complete" dictionary file would look like:
<?xml version="1.0" encoding="UTF-8"?>
<r>
  <!-- The dictionary sources -->
  <source id="CW">
    <title>Cree : Words / nehiýawewin : itwēwina</title>
  </source>
  <source id="MD">
    <title>Maskwacîs Dictionary</title>
  </source>
  <!-- The dictionary entries -->
  <e>
    <lg>
      <l pos="N">acâhkos</l>
      <lc>NA-1</lc>
      <stem>acâhkos-</stem>
    </lg>
    <mg>
      <tg xml:lang="eng">
        <t pos="N" sources="MD CW">star</t>
      </tg>
    </mg>
  </e>
  <e>
    <lg>
      <l pos="N">acosis</l>
      <lc>NA-1</lc>
      <stem>acosis-</stem>
    </lg>
    <mg>
      <tg xml:lang="eng">
        <t pos="N" sources="MD">an arrow</t>
      </tg>
      <tg xml:lang="eng">
        <t pos="N" sources="CW">arrow, little arrow</t>
      </tg>
    </mg>
  </e>
  <e>
    <lg>
      <l pos="V">mimikopitam</l>
      <lc>VTI-1</lc>
      <stem>mimikopit-</stem>
    </lg>
    <mg>
      <tg xml:lang="eng">
        <t pos="V" sources="MD">He shakes it.</t>
      </tg>
    </mg>
    <mg>
      <tg xml:lang="eng">
        <t pos="V" sources="CW">s/he rubs s.t.</t>
      </tg>
    </mg>
  </e>
</r>
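To make the linkage between `sources=` and `<source id=...>` concrete, here is a small Python sketch of how a consumer might read such a file. It is only an illustration using the standard library's `xml.etree.ElementTree`, not NDS code. The function names and the shortened `SAMPLE` document are made up for the example, and raising an error on an undefined source id is an assumed policy.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the proposed format (two of the entries above).
SAMPLE = """<r>
  <source id="CW"><title>Cree : Words / nehiýawewin : itwēwina</title></source>
  <source id="MD"><title>Maskwacîs Dictionary</title></source>
  <e>
    <lg><l pos="N">acâhkos</l></lg>
    <mg><tg xml:lang="eng"><t pos="N" sources="MD CW">star</t></tg></mg>
  </e>
  <e>
    <lg><l pos="V">mimikopitam</l></lg>
    <mg><tg xml:lang="eng"><t pos="V" sources="MD">He shakes it.</t></tg></mg>
    <mg><tg xml:lang="eng"><t pos="V" sources="CW">s/he rubs s.t.</t></tg></mg>
  </e>
</r>"""


def load_sources(root):
    """Map each <source id="..."> to its <title> text."""
    return {s.get("id"): s.findtext("title") for s in root.findall("source")}


def glosses_with_sources(root):
    """Return (lemma, gloss, [source ids]) for every <t>, checking each id is defined."""
    titles = load_sources(root)
    rows = []
    for entry in root.findall("e"):
        lemma = entry.findtext("lg/l")
        for t in entry.iter("t"):
            # sources= is a space-separated list of source ids
            ids = (t.get("sources") or "").split()
            unknown = [i for i in ids if i not in titles]
            if unknown:
                raise ValueError(f"{lemma}: undefined source id(s) {unknown}")
            rows.append((lemma, t.text, ids))
    return rows
```

Calling `glosses_with_sources(ET.fromstring(SAMPLE))` yields one row per `<t>` element, so a lemma with two meaning groups (case 1c) shows up twice, each time with its own source list.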
Corrected the erroneous copy-pasted stem for 'acosis', which is 'acosis-'.
Note that currently 'acosis' is probably not the best example of case 1b above - âcihtin may be better, but this is based on the current dictionary comparison coding, which may always change.
Huzzah! I think this one is finally done! Seems like there are still a few presentational tweaks to do, but that seems like a different issue.
NDS should have the ability to list multiple data sources per entry. In the bespoke XML dictionary format, an <e> entry can only have one source:
Proposal
For each entry, be able to specify multiple sources using the sources= attribute:
The sources= attribute is a space-separated list of one or more source references, kind of like a bibliography key. The sources are defined in separate <source> elements:
The id= attribute is the key that the sources= attribute links to.
TODOs for this issue: