Problems in Serbian #11
Comments
BTW: checking with German, we have the same problems for DEER.
If one checks DIACL, it becomes clear that they have mapped a huge number of partly related terms to one master concept (see https://diacl.ht.lu.se/WordList/Index). The same problem is present in the Swadesh collection, though it is less severe there. DIACL did, in some sense, do a kind of Concepticon mapping, but to their own internal concept lists, which are often much broader than what we'd do in Concepticon. Since all words in the database have meaning strings, one could circumvent this by making a master list of all meaning glosses we find in the data. In the current form, however, it is unclear whether the data is well aggregated into CLICS.
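The master list of meaning glosses suggested above could be collected roughly as follows. This is only a sketch: the column names `Concept` and `Meaning` and the toy rows are assumptions made for illustration, not the actual DIACL export format.

```python
import csv
import io
from collections import defaultdict


def gloss_inventory(rows, concept_col="Concept", gloss_col="Meaning"):
    """Map each master concept to the distinct glosses filed under it, with counts."""
    inventory = defaultdict(lambda: defaultdict(int))
    for row in rows:
        concept = row[concept_col].strip()
        # Normalize case so "Butterfly" and "butterfly" count as one gloss.
        gloss = row[gloss_col].strip().lower()
        if gloss:
            inventory[concept][gloss] += 1
    return {concept: dict(glosses) for concept, glosses in inventory.items()}


def master_gloss_list(inventory):
    """Flat, sorted list of every distinct gloss: the candidates for re-mapping."""
    return sorted({g for glosses in inventory.values() for g in glosses})


# Toy data, invented for illustration only (hypothetical column layout).
sample = io.StringIO(
    "Language,Concept,Form,Meaning\n"
    "A,INSECT,aaa,butterfly\n"
    "A,INSECT,bbb,insect\n"
    "B,INSECT,ccc,Butterfly\n"
    "B,DEER,ddd,roe deer\n"
)
inventory = gloss_inventory(csv.DictReader(sample))
glosses = master_gloss_list(inventory)
```

Concepts whose inventory holds several unrelated glosses (here `INSECT` covering both "butterfly" and "insect") are the ones where a gloss-level re-mapping to Concepticon would matter most.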
Good catch and thanks for relaying the issue. Given the relatively specific relations I would hope that there isn't too much of an effect on CLICS-based analyses (i.e. most of the mappings will be very rare), but I fully agree: in this state it's not something that should be used in CLICS & Co. I think your suggestion (i.e. list all meaning glosses, then map them) sounds good!
So for CLICS4, we would either fix this issue by doing a re-mapping or not include the dataset, since this kind of mapping upsets people who know the languages, and we would like to avoid that. DIACL has the meaning glosses, but they use the concepts differently than we do in CLICS, so we do well to aggregate from DIACL only when we know that it corresponds to our models.
In addition to the concept problems, we also have no segmentation because there are no orthography profiles. Since it is unlikely that we can do a full remapping before the LB 2.0 release and no student assistants are available at the moment, I'd vote to retire the dataset from Lexibank. @LinguList @chrzyki Would you agree with this?
I didn't realize that
We skipped it after we found too many problems in CLICS3. They just link any concept to any gloss. So they may end up having a term "butterfly" and link it to "insect" in their internal concepticon (!).
There are some mismappings, as they have like 6 words for DEER in the data. We were informed by somebody who wrote to Joshua Jackson, who then wrote to me:
I suggest we manually correct these cases via Lexemes. I would also inform the DIACL editors about this.
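To find candidates for manual correction, one could flag every language and concept pair that carries unusually many forms. The sketch below assumes rows of `(language, concept, form)` and an arbitrary threshold `max_forms`; both are hypothetical, and the forms are placeholders rather than real data. Several forms per concept can also be legitimate synonyms, so the output is a list to inspect, not a list of errors.

```python
from collections import defaultdict


def overloaded_concepts(rows, max_forms=2):
    """Return (language, concept) pairs with more than max_forms distinct forms."""
    forms = defaultdict(set)
    for language, concept, form in rows:
        forms[(language, concept)].add(form)
    return {
        key: sorted(values)
        for key, values in forms.items()
        if len(values) > max_forms
    }


# Placeholder rows, invented for illustration only.
rows = [
    ("X", "DEER", "f1"),
    ("X", "DEER", "f2"),
    ("X", "DEER", "f3"),
    ("X", "DOG", "f4"),
    ("Y", "DEER", "f5"),
]
flagged = overloaded_concepts(rows, max_forms=2)
```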
Or, @chrzyki, @xrotwang, is it possible that the error (something swapped here) is on the side of the pylexibank script?