Hyphenators

This page first documents the two approaches to hyphenation (the two tools), then their integration in word processing software. Warning: this is work in progress, so the page also documents a cumbersome workaround to use while waiting for working solutions.

The hyphenation tools

For each language, there are (or should be) two hyphenators, the pattern hyphenator and the fst-based hyphenator.

Pattern hyphenation

For compilation: ./compile --enable-pattern-hyphenators

The pattern hyphenator consists of patterns generated by patgen, which takes a large list of pre-hyphenated words as input. The resulting pattern files are used in TeX and LibreOffice.
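To make the idea of hyphenation patterns concrete, here is a minimal Python sketch of how Liang-style patterns (the kind patgen produces) are applied to a word. It is an illustration only, not the code used by TeX or LibreOffice. The function names, the `left_min`/`right_min` defaults and the small `TOY_PATTERNS` list are assumptions made for this example; the patterns are not taken from any pattern file built in this repository.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters ('hyph') and one
    value per gap between letters ([0, 0, 3, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)   # value for the gap before letters[pos]
        else:
            pos += 1
    return letters, values

def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert '-' at every gap whose highest matching pattern value is odd."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."       # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)        # scores[k]: gap before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = parsed.get(padded[start:end])
            if values:
                for offset, value in enumerate(values):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        # The gap before word[i] is scores[i + 1] because of the leading dot.
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)

# Hypothetical toy patterns, for illustration only.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"]
print(hyphenate("hyphenation", TOY_PATTERNS))
```

Odd values allow a break and even values forbid one; where several patterns cover the same gap, the highest value wins. This is why a larger and more representative word list lets patgen find better patterns.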

The hyphenated word list is generated from the lexical hyphenation fst. To adjust the size of the generated word list, change the variable PATTERN_WORD_LIST in tools/hyphenators/Makefile.modification-pattern.am (the default is 15 000 words). A larger list gives better hyphenation patterns, but takes longer to build.

More details here.

FST hyphenation

For compilation: ./compile --enable-fst-hyphenator

The fst-based hyphenator is in lang-xxx/tools/hyphenators/.

The compiled fst-based hyphenator itself is hyphenator-gt-desc.hfst. It contains both lexicon-based hyphenation (full morphology) and generic, syllable-based rules (the pattern hyphenation above, used for unknown words).

The file is composed of these files:

hyphenator-gt-desc-no_fallback.hfst
hyphenator-rules-desc-weighted.hfst

where the former is a full analyser and the latter contains syllable-based rules with added weights.

The linguistic source code for the syllabification rules is in lang-xxx/src/hyphenation. The script is hyphenation.xfscript, written in the xfst formalism.

Usage (where -b 0 outputs only the analyses with the best weight):

... |\
hfst-tokenise tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst |\
hfst-lookup -b 0 tools/hyphenators/hyphenator-gt-desc.hfstol
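The effect of the -b option in the command above can be pictured with a small Python sketch. It assumes that lower weights are better and that the option keeps every analysis whose weight is within the given distance of the best one. The function name, the candidate strings, their hyphenation markup and their weights are all hypothetical; they are not real output from the hyphenator.

```python
def within_beam(analyses, beam=0.0):
    """Keep the analyses whose weight is at most `beam` above the lowest weight."""
    if not analyses:
        return []
    best = min(weight for _, weight in analyses)
    return [(form, weight) for form, weight in analyses if weight - best <= beam]

# Hypothetical candidates: one with a low weight and two with higher weights,
# as one might expect when lexicon-based and rule-based results compete.
candidates = [("ex-am-ple", 0.0), ("exam-ple", 10.0), ("e-xam-ple", 25.0)]

print(within_beam(candidates))             # beam 0: only the best-weighted analysis
print(within_beam(candidates, beam=10.0))  # wider beam: also the second candidate
```

With a beam of 0, weighted fallback results only surface when no better-weighted analysis exists for the word.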

Integrating hyphenators in software

LibreOffice/TeX hyphenation

Solutions while waiting for hyphenation in word processors


Very old (2007) meetings