Allow specification by XPath of elements where the stopword list should be ignored when indexing #273

martindholmes · 2023-08-22T23:28:27Z

Working on a couple of indigenous language dictionaries, we've encountered an intriguing problem. It's perfectly legitimate for a user or learner of the language to want to search for the word that corresponds to a common English word, even though that English word might be in the stopword list. If you're learning prepositions of location, you would obviously want to search for "at", "in", "on", etc.

However, if we just nuke these items from the stopword list, we'll end up with a massive index, and most of the hits will not be relevant to the search.

I think the solution here is to have a config file component which allows you to specify, through XPath, elements where the stopword list will be ignored when indexing. For example, a <gloss> element inside a dictionary entry can be assumed to contain the English gloss for a term, so it could be indexed without the stopword list being invoked, generating an index entry for "in" if it contains that word. Instances of the stopwords would still be ignored in all other contexts as normal.
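To make the proposal concrete, here is a minimal sketch of the indexing rule, not an actual implementation. Everything in it is an assumption for illustration: the names `STOPWORDS`, `STOPWORD_EXEMPT` and `buildIndex` are hypothetical, a set of element names stands in for the XPath expressions a real config would use, the input is assumed to be already split into per-element text segments, and stemming is left out.

```javascript
// Hypothetical sketch: none of these names come from the real codebase.
const STOPWORDS = new Set(["at", "in", "on", "the", "a", "of"]);

// Element names standing in for the XPath-selected contexts in which
// the stopword list is ignored during indexing.
const STOPWORD_EXEMPT = new Set(["gloss"]);

// Lowercase the text and split it into runs of letters.
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || [];
}

// segments: [{ element: "gloss", text: "in" }, ...] in document order.
// Returns a Map from token to the Set of document ids containing it.
function buildIndex(docId, segments, index = new Map()) {
  for (const { element, text } of segments) {
    const exempt = STOPWORD_EXEMPT.has(element);
    for (const token of tokenize(text)) {
      // Stopwords are dropped everywhere except in exempt contexts.
      if (!exempt && STOPWORDS.has(token)) continue;
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(docId);
    }
  }
  return index;
}

const index = buildIndex("entry1", [
  { element: "gloss", text: "in" },
  { element: "note", text: "Used in the sense of location" },
]);
```

Here "in" is indexed only because it occurs inside the exempt `gloss` segment; the "in", "the" and "of" in the `note` segment are skipped as usual, so the extra index entries come only from the specially defined contexts.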

This doesn't seem like it would be too difficult. The only bit I haven't figured out is how to carry this functionality over to the JavaScript. Maybe all we need to do for a case like this is not use the stopword list at all in the search, on the assumption that there's no penalty when a common word is searched for. If there's a stem file for it, then good: it will have been constructed only from the specially-defined contexts, and shouldn't be too large. If there isn't, then the search just fails.
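A rough sketch of that search-side behaviour, under stated assumptions: `search` and `lookupStem` are hypothetical names, the `{ docs: [...] }` shape of a stem file is invented for the example, and the lookup is synchronous and in-memory for brevity, whereas in the browser it would presumably be an asynchronous request for a per-stem file.

```javascript
// Hypothetical sketch: names and data shapes are illustrative only.
// lookupStem(term) returns the stem file's data, or null if no file exists.
function search(query, lookupStem) {
  const terms = query.toLowerCase().match(/\p{L}+/gu) || [];
  const results = new Map();
  for (const term of terms) {
    // No stopword check here: every term is looked up. A common word
    // simply yields no hits when the indexer produced no stem file for it.
    const stemData = lookupStem(term);
    results.set(term, stemData ? stemData.docs : []);
  }
  return results;
}

// In-memory stand-in for the stem files written by the indexer.
const stemFiles = new Map([["in", { docs: ["entry1"] }]]);
const lookupStem = (term) => stemFiles.get(term) ?? null;

const results = search("in the", lookupStem);
```

With this approach the client needs no knowledge of the exempt contexts: "in" finds the entry indexed from a gloss, while "the" has no stem file and returns nothing.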

@joeytakeda Any thoughts?

martindholmes · 2023-09-13T17:57:59Z

After discussion, we will wait until we actually have a project for which simply using an empty stopword list doesn't solve this problem. If we do implement it, we should do it through contexts.

martindholmes added the enhancement New feature or request label Aug 22, 2023

martindholmes mentioned this issue Sep 21, 2023

do we need default files for a dictionary and stop words list #271

Closed

martindholmes added this to the Blue sky milestone May 10, 2024
