Fix Emoji handling for wide.py #204
Oh, good catch. This problem isn't limited to emojis, and handling only ZWJs won't solve it for all emojis either. Looking at this Wikipedia page, there are also things like skin color modifiers and accents, and I'm not sure how your code would handle those. The actual problem is segmenting Unicode text into composed symbols rather than bytes, and I might suggest playing around with the
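To make the segmentation point concrete, here is a small sketch using only Python's standard `unicodedata` module. The sample strings and the `describe` helper are illustrative assumptions, not taken from wide.py or from this thread. It shows that one visible symbol can span several code points and even more bytes.

```python
import unicodedata

# Example strings chosen for illustration (assumptions, not from the PR).
samples = {
    "accent": "e\u0301",                             # e + COMBINING ACUTE ACCENT
    "skin tone": "\U0001F44D\U0001F3FD",             # thumbs up + medium skin tone modifier
    "zwj sequence": "\U0001F468\u200D\u2695\uFE0F",  # man + ZWJ + staff symbol + VS16
}

def describe(text):
    """Return (code point count, UTF-8 byte count, per-code-point details)."""
    details = [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]
    return len(text), len(text.encode("utf-8")), details

for label, text in samples.items():
    count, nbytes, details = describe(text)
    print(f"{label}: {count} code points, {nbytes} UTF-8 bytes")
    for code, category, name in details:
        print(f"  {code} {category} {name}")
```

Each sample renders as a single symbol in a capable client, yet iterating over the string visits every code point separately, which is why per-character processing splits them.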
The bad output occurs because GNOME Terminal converts the ZWJ to text.
When I had to solve a very similar problem for the old Terminal Emulator for Android ages ago, the key observation was this: at least when it comes to everything other than emojis, Unicode code points with a display width of zero (combining diacritics, non-spacing marks and the like) generally can be thought of as if they attach to the previous code point. In other words, you'd want something like (UNTESTED)
where
(Alternately, look into the

Unfortunately, because emojis were a terrible mistake, you're still going to have to special-case emoji sequences. As far as I can tell, if you want to limit special-case handling to sequences that clients will actually display specially, the only way to do this is with a giant table (https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html, https://www.unicode.org/emoji/charts/full-emoji-modifiers.html, and https://www.unicode.org/emoji/charts/emoji-list.html#country-flag at least; there may be others). If you're okay with your algorithm picking up potential sequences that clients don't recognize as special, you need to read and digest https://www.unicode.org/reports/tr51/ .
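A rough sketch of the "attach zero-width code points to the previous one" idea, extended with a few permissive emoji rules, might look like the following. This is not the commenter's original snippet. The function names are hypothetical, "zero width" is approximated by the Unicode general categories Mn, Me, and Cf, and the emoji rules (ZWJ joins, skin tone modifiers U+1F3FB to U+1F3FF, regional indicator pairs) are the table-free kind that may group sequences no client actually renders specially.

```python
import unicodedata

ZWJ = "\u200d"

def is_zero_width(ch):
    """Approximation: non-spacing marks, enclosing marks, and format characters."""
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")

def is_emoji_modifier(ch):
    """Skin tone modifiers, U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF

def is_regional_indicator(ch):
    """Regional indicator symbols, which pair up to form flags."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def clusters(text):
    """Split text into display units, attaching dependent code points to the previous unit."""
    out = []
    join_next = False  # True immediately after a ZWJ
    for ch in text:
        if out and (join_next or is_zero_width(ch) or is_emoji_modifier(ch)):
            out[-1] += ch
        elif (out and is_regional_indicator(ch)
              and len(out[-1]) == 1 and is_regional_indicator(out[-1])):
            out[-1] += ch  # second half of a flag pair
        else:
            out.append(ch)
        join_next = (ch == ZWJ)
    return out
```

For example, `clusters("e\u0301a")` yields two units and a man + ZWJ + symbol + VS16 sequence stays in one unit. Restricting the result to sequences that clients really display as one glyph would still need the tables linked above.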
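Some emojis are composed of multiple Unicode code points joined by the Zero Width Joiner (U+200D). In its current form, create breaks those emojis apart. For example, passing a single ZWJ emoji (a base emoji, U+200D, a second symbol, and the variation selector U+FE0F) to !w2 returns the four code points individually, each followed by a wide space, so the emoji no longer renders as one glyph.
My pull request aims to fix this behavior. Everything seems to work as intended during my limited testing.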
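As an illustration of the ZWJ-only approach, the sketch below widens text one unit at a time, where a unit is a code point plus anything chained to it through ZWJs. It is not the code from this pull request. The name `widen_zwj_aware`, the regular expression, and the U+3000 ideographic space used as the spacer are all assumptions made for the example.

```python
import re

# One unit: a code point with an optional VS16, followed by any number of
# (ZWJ, code point, optional VS16) groups. A stray ZWJ or VS16 falls through
# to the final "." alternative and becomes its own unit.
UNIT = re.compile(
    "[^\u200d\ufe0f]\ufe0f?(?:\u200d[^\u200d\ufe0f]\ufe0f?)*|.",
    re.DOTALL,
)

def widen_zwj_aware(text, spacer="\u3000"):
    """Append the spacer after each unit instead of after each code point."""
    return "".join(unit + spacer for unit in UNIT.findall(text))
```

With this, a man + ZWJ + symbol + VS16 sequence is followed by a single spacer rather than being split into four pieces. Skin tone modifiers, combining accents, and flags are not covered by this pattern.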