Working on a couple of indigenous language dictionaries, we've encountered an intriguing problem. It's perfectly legitimate for a user/learner of the language to want to search for the other-language equivalent of a common English word that might be in the stopword list. If you're learning prepositions of location, you would obviously want to search for "at", "in", "on" etc.
However, if we just nuke these items from the stopword list, we'll end up with a massive index, and most of the hits will not be relevant to the search.
I think the solution here is to have a config file component which allows you to specify, through XPath, elements where the stopword list will be ignored when indexing. For example, a `<gloss>` element inside a dictionary entry can be assumed to contain the English gloss for a term, and could be indexed without the stopword list being invoked, generating an index entry for "in" if it contains that word; instances of the stopwords would still be ignored in all other contexts as normal.
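To make the idea concrete, here is a minimal sketch in Python of context-sensitive stopword handling. It is not the project's indexing code: the `STOPWORD_EXEMPT_PATHS` setting, the `build_index` function, the tiny stopword list, and the sample entry are all hypothetical, and the standard library's `ElementTree` only supports a limited subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"at", "in", "on", "the", "a", "of"}

# Hypothetical config setting: paths to elements whose text is
# indexed WITHOUT consulting the stopword list.
STOPWORD_EXEMPT_PATHS = [".//gloss"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", (text or "").lower())


def build_index(xml_string, doc_id):
    """Return {token: {doc_id}} for one document, honouring exempt contexts."""
    root = ET.fromstring(xml_string)
    exempt = set()
    for path in STOPWORD_EXEMPT_PATHS:
        exempt.update(root.findall(path))

    index = {}

    def add(text, in_exempt):
        for token in tokenize(text):
            # Stopwords are dropped unless we are inside an exempt context.
            if in_exempt or token not in STOPWORDS:
                index.setdefault(token, set()).add(doc_id)

    def walk(el, in_exempt):
        in_exempt = in_exempt or el in exempt
        add(el.text, in_exempt)
        for child in el:
            walk(child, in_exempt)
            # Tail text follows the child but belongs to el's context.
            add(child.tail, in_exempt)

    walk(root, False)
    return index


SAMPLE = (
    "<entry><form>headword</form><gloss>in, at</gloss>"
    "<note>Used in the phrase on the left.</note></entry>"
)
index = build_index(SAMPLE, "entry1")
```

In this sketch, "in" and "at" are indexed because they occur inside `<gloss>`, while "on" and "the", which occur only in `<note>`, are dropped as usual.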
This doesn't seem like it would be too difficult. The only bit I haven't figured out is how to carry this functionality over to the JavaScript. Maybe all we need to do in a case like this is not use the stopword list at all on the search side, on the assumption that there's no penalty when a common word is searched for: if there's a stem file for it, then good -- it will have been constructed only from the specially-defined contexts, and shouldn't be too large -- and if there isn't, then the search just fails.
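The search-side behaviour suggested here can be sketched as follows. The real search code is JavaScript; this is only an illustration of the logic in Python, and the `lookup` function, the one-JSON-file-per-stem layout, and the file naming are assumptions rather than the project's actual format. The term is assumed to have been stemmed already.

```python
import json
import re
from pathlib import Path


def lookup(term, stem_dir):
    """Return the stored hits for term, or [] when no stem file exists.

    No stopword list is consulted: a common word such as "in" simply
    succeeds if the indexer produced a file for it and fails otherwise.
    """
    term = term.lower()
    # Only plain alphabetic terms are mapped to file names.
    if not re.fullmatch(r"[a-z]+", term):
        return []
    stem_file = Path(stem_dir) / f"{term}.json"
    if not stem_file.is_file():
        return []
    return json.loads(stem_file.read_text(encoding="utf-8"))
```

With this approach the search side needs no knowledge of the exempt contexts at all; the presence or absence of a stem file carries that information.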
After discussion, we will wait until we actually have a project for which simply using an empty stopword list does not solve this problem. If we do implement it, we should do it through contexts.
@joeytakeda Any thoughts?