CzechStemmerLight: Remove one char for -es/-ém/-ím
This case was inconsistent with all the other cases where we call
palatalise: here we remove the whole suffix, but in every other case
we leave its first character.

Checking the vocabulary list, this means palatalise will almost never
match one of the suffixes, as the only words with this as an ending in
the list are these, which look like they're actually English words
(except "abies"):

abies
cookies
hippies
series
studies

This means palatalise will just remove the last character, which seems
odd.

This change alters a lot of stems but seems to be an improvement in
pretty much every instance I checked in Google Translate.
ojwb committed Sep 5, 2024
1 parent 7b092ac commit 6cab87c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion CzechStemmerLight.java
@@ -184,7 +184,7 @@ private void removeCase(StringBuffer buffer) {
buffer.substring( len-2,len).equals("\u00e9m")|| //-ém
buffer.substring( len-2,len).equals("\u00edm")){ //-ím

-buffer.delete( len- 2 , len);
+buffer.delete( len- 1 , len);
palatalise(buffer);
return;
}
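
The effect described in the commit message can be sketched in isolation. This is a minimal illustration, not the code from CzechStemmerLight.java: `RemoveCaseSketch`, `palataliseSketch`, `stem`, and the `charsToRemove` parameter are hypothetical names. The single `-ci`/`-ce` to `k` rule stands in for whatever suffixes the real `palatalise` matches, and the "otherwise drop the last character" fallback is taken from the commit message. The length guard is also an assumption.

```java
class RemoveCaseSketch {
    // Hypothetical, simplified stand-in for palatalise(): one assumed
    // rewrite rule, then the fallback of deleting the last character.
    static void palataliseSketch(StringBuffer buffer) {
        int len = buffer.length();
        if (len >= 2) {
            String tail = buffer.substring(len - 2, len);
            if (tail.equals("ci") || tail.equals("ce")) {
                buffer.replace(len - 2, len, "k");
                return;
            }
        }
        if (len >= 1) {
            buffer.delete(len - 1, len);
        }
    }

    // Applies only the -es/-ém/-ím branch. charsToRemove = 2 models the
    // behaviour before this commit, charsToRemove = 1 the behaviour after.
    static String stem(String word, int charsToRemove) {
        StringBuffer buffer = new StringBuffer(word);
        int len = buffer.length();
        if (len > 2) { // assumed guard; the real condition is not shown in the diff
            String suffix = buffer.substring(len - 2, len);
            if (suffix.equals("es")
                    || suffix.equals("\u00e9m")    // -ém
                    || suffix.equals("\u00edm")) { // -ím
                buffer.delete(len - charsToRemove, len);
                palataliseSketch(buffer);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"velk\u00e9m", "cookies"}) {
            System.out.println(word
                    + ": before=" + stem(word, 2)
                    + " after=" + stem(word, 1));
        }
    }
}
```

Under these assumptions, removing both suffix characters means the fallback in `palataliseSketch` then eats a character of the stem itself, while removing one character leaves the suffix's first letter for `palataliseSketch` to inspect or drop.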