Searches result in duplicate dictionary entries #374

Open
aarppe opened this issue Mar 28, 2020 · 7 comments

@aarppe
Contributor

aarppe commented Mar 28, 2020

I'm seeing at least a few cases where a search presents the same Cree dictionary entry twice, e.g. searching for tan'si returns the entry for tânisi twice:

image

Of course, the dictionary entry for tânisi should be shown only once.

However, searching with tânisi gives only one result, as expected:

@aarppe aarppe added the bug Something isn't working, as it should and/or according to specs label Mar 28, 2020
@aarppe
Contributor Author

aarppe commented Mar 28, 2020

@Madoshakalaka This is one more for you.

@Madoshakalaka
Collaborator

@aarppe sure thing 🤣

@eddieantonio
Member

@Madoshakalaka, I think this is because results from the FST aren't being de-duplicated. See here:

$ echo "tan'si" | hfst-optimized-lookup crk-descriptive-analyzer.hfstol
tan'si	tânisi+Ipc+Err/Orth
tan'si	tânisi+Ipc

This one is tricky, because according to the descriptive analyzer, there are TWO valid analyses with different tags. @aarppe, how should we handle this situation? As far as the FST is concerned, returning two results for tan'si is correct: it yields two distinct analyses!
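
For illustration only, here is a minimal sketch of one way the de-duplication could work: collapse analyses that differ only by an error tag such as +Err/Orth. The names `ERROR_TAGS`, `parse_analysis` and `dedupe_analyses` are hypothetical and not taken from the project's code, and the sketch assumes the lemma always comes first in the analysis string, which may not hold for every analysis the FST produces.

```python
# Tags assumed to flag a relaxed spelling rather than a distinct entry.
ERROR_TAGS = frozenset({"Err/Orth"})


def parse_analysis(line):
    """Split one tab-separated lookup line into (lemma, tags).

    Expects "wordform<TAB>lemma+Tag+Tag"; any further column
    (such as a weight) is ignored.
    """
    fields = line.rstrip("\n").split("\t")
    lemma, *tags = fields[1].split("+")
    return lemma, tuple(tags)


def dedupe_analyses(lines):
    """Return unique (lemma, tags) pairs in first-seen order, ignoring error tags."""
    unique = {}  # dicts keep insertion order, so this acts as an ordered set
    for line in lines:
        if not line.strip():
            continue
        lemma, tags = parse_analysis(line)
        kept = tuple(t for t in tags if t not in ERROR_TAGS)
        unique.setdefault((lemma, kept), None)
    return list(unique)


lookup_output = [
    "tan'si\ttânisi+Ipc+Err/Orth",
    "tan'si\ttânisi+Ipc",
]
print(dedupe_analyses(lookup_output))
```

With the two analyses above, both lines collapse to a single (lemma, tags) pair, so only one entry would be looked up. Whether dropping +Err/Orth is the right policy is exactly the open question here.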

@aarppe
Contributor Author

aarppe commented Mar 30, 2020

The latest FST in fact gives three analyses:

echo "tan'si" | hfst-lookup -q src/analyser-gt-desc.hfst
tan'si	tânisi+Ipc+Err/Orth	0.000000
tan'si	tânisi+Ipc	0.000000
tan'si	tânisi+Ipc+Interj	0.000000

This is particularly tricky since the spell-relax corrections are tagged with +Err/Orth, and swapping an apostrophe for a short-i is one of the spell-relax rules.

I had previously revised the LEXC file listing non-standard forms to include only those spelling deviations that cannot be handled by spell-relax rules, to avoid double analyses. I'll comment out the tan'si form in src/morphology/stems/non_standard.lexc.

What is even trickier is that there are two legitimate lemmas for tânisi: one is an interrogative/adverbial particle (the first one below), and the other is an interjection (the second one below):

tânisi	IPC	how, in what way
tânisi	IPJ	hello, how are you

So, we'd have to use a feature pair to match the second one, and the lack of any additional features (exact match) to match the first one.
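
A rough sketch of that matching, for illustration only: an analysis with no additional features selects the IPC entry, and the Ipc/Interj feature pair selects the IPJ entry. The table `TAGS_TO_WORD_CLASS`, the `ENTRIES` list and the function `match_entries` are hypothetical, and the equivalence IPJ == +Ipc+Interj is an assumption rather than a settled convention.

```python
# Hypothetical mapping from the exact FST tag sequence to a dictionary word class.
# Treating +Ipc+Interj as IPJ is an assumption made for this sketch.
TAGS_TO_WORD_CLASS = {
    ("Ipc",): "IPC",
    ("Ipc", "Interj"): "IPJ",
}

# The two dictionary entries listed above, as (lemma, word class, definition).
ENTRIES = [
    ("tânisi", "IPC", "how, in what way"),
    ("tânisi", "IPJ", "hello, how are you"),
]


def match_entries(analysis, entries=ENTRIES):
    """Return the entries whose lemma and word class fit one FST analysis."""
    lemma, *tags = analysis.split("+")
    # The spelling-error tag says nothing about the word class, so drop it.
    features = tuple(t for t in tags if t != "Err/Orth")
    word_class = TAGS_TO_WORD_CLASS.get(features)
    return [e for e in entries if e[0] == lemma and e[1] == word_class]


for analysis in ("tânisi+Ipc+Err/Orth", "tânisi+Ipc", "tânisi+Ipc+Interj"):
    print(analysis, "->", match_entries(analysis))
```

Under these assumptions the first two analyses both select the IPC entry and the third selects the IPJ entry, so after de-duplication a search for tan'si would show each entry once.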

@dwhieb dwhieb removed the sorry-matt label Oct 21, 2021
@nienna73
Contributor

This behaviour is no longer the same. In fact, the above definitions for tan'si are no longer in the dictionary.

@fbanados fbanados self-assigned this Jul 8, 2024
@fbanados
Member

fbanados commented Jul 8, 2024

As part of recent fixes in crk-db, definitions are returning, so the issue still needs to be addressed. If I am reading the ongoing discussion in UAlbertaALTLab/crk-db#119 and the last comment by @aarppe correctly, there should be no repeated definitions as shown in the issue image, but there should still be two entries: one for IPC and one for IPJ. That would leave the CW sense "how, in what way" in its own entry (IPC) and "how are you" in a separate entry (IPJ). As the data in the MD dictionary stands, the MD definition would be merged into the IPC entry, as follows:

Screenshot 2024-07-08 at 2 45 34 PM

My gut feeling tells me that this is not the expected match for the MD entry, and that the MD entry should go in the other one (IPJ) instead. To achieve that, it would be sufficient to change the FST Analysis for the entry in Maskwacis.tsv to add the +Interj tag, if the discussion in UAlbertaALTLab/crk-db#119 is resolved as IPJ == +Ipc+Interj.

@fbanados
Member

fbanados commented Jul 9, 2024

Updates to merging using the actual analysis help fix the sense matching issue:
Screenshot 2024-07-09 at 12 31 20 PM

@fbanados fbanados moved this to To do in Third release Aug 2, 2024