Fix Unicode 15.0 line breaking (#4389)
The current implementation was attempting the LB25 tailoring recommended in Example 7 of [Section 8.2](https://www.unicode.org/reports/tr14/tr14-49.html#Examples) in UAX14 version 15.0. However, because of `(PR | PO) × ( OP | HY )? NU`, that tailoring requires more than one code point of lookahead\*, which the current implementation of the line segmenter cannot do. Instead, this pull request goes back to the untailored LB25 from Unicode 15.0.

The implementation was tested with two million test cases; the last failure I encountered was somewhere in the nine thousands. I should probably do an overnight run. Only 200 test cases are included here; as usual, anyone working on the rules should try very long monkey test runs.

This fixes #4146.

\* This will be needed for 15.1 line segmentation too. We have that capability in the other segmenters, where it is used in the sentence segmenter (the relevant rules are called intermediate match rules or interm(ediate) break states in this implementation). However, straightforwardly reusing that code would run into issues, as we have so many states in line breaking that we cannot dedicate a whole bit to that property of the state. This can probably be worked around (as far as I can tell we use the sign bit for a property of two special states, so we could probably be a bit more sparing), but that will come later.
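To illustrate why `(PR | PO) × ( OP | HY )? NU` needs more than one code point of lookahead, here is a minimal sketch. It is not the ICU4X implementation: the `Class` enum and the `tailored_lb25_no_break` function are hypothetical names invented for this example, and the input is assumed to be an already-classified sequence of line break classes.

```rust
// Hypothetical, simplified set of line break classes (not the ICU4X types).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    PR,
    PO,
    OP,
    HY,
    NU,
    AL,
}

/// Returns true if the tailored rule `(PR | PO) × (OP | HY)? NU` forbids a
/// break between `classes[i]` and `classes[i + 1]`.
fn tailored_lb25_no_break(classes: &[Class], i: usize) -> bool {
    use Class::*;
    // The rule only applies immediately after PR or PO.
    if !matches!(classes.get(i).copied(), Some(PR | PO)) {
        return false;
    }
    match classes.get(i + 1).copied() {
        // One class of lookahead is enough when NU follows directly.
        Some(NU) => true,
        // With an intervening OP or HY, the decision at position i depends
        // on the class at i + 2: this is the second step of lookahead.
        Some(OP | HY) => classes.get(i + 2).copied() == Some(NU),
        _ => false,
    }
}
```

In this sketch, the answer for `PR OP NU` and `PR OP AL` differs at the boundary right after `PR`, even though the next class is `OP` in both cases. A segmenter that decides each boundary from the adjacent pair of classes alone cannot tell the two apart.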
Showing 8 changed files with 7,647 additions and 1,600 deletions.
components/segmenter/tests/testdata/LineBreakExtraTest.txt (208 changes: 208 additions & 0 deletions)
components/segmenter/tests/testdata/LineBreakTest.txt (209 changes: 105 additions & 104 deletions)