Skip to content

Library for Enhanced Thai Language Support in Unity's TextMeshPro

License

MIT, Unknown licenses found

Licenses found

MIT
LICENSE
Unknown
LICENSE.meta
Notifications You must be signed in to change notification settings

phanphantz/ThaiTextCare-for-Unity

Repository files navigation

Thai Text Care for Unity

This library provides enhanced Thai language support for Unity's TextMeshPro such as Thai Word Segmentation and Thai Font Glyphs fixer for overlapped vowels/tone marks, significantly improving your experience when working with Thai language in Unity.

‎️‍🔥See it in Action: Try WebGL Demo‎️‍🔥 (Best Viewed on PC)
Buy Me a Coffee: If this project made your day a bit easier, consider Buying me a coffee

Tested On:

  • Unity 6 with com.unity.ugui 2.0.0
  • Unity 2022.3.22f with TextMeshPro 3.2.0 Preview Versions
  • Unity 2022.3.22f with TextMeshPro 3.0.9
  • Unity 2021.3.2f with TextMeshPro 3.0.6

Overview


Thai Word Segmentation


Thai Glyph Adjustment Automation


ThaiTextNurse

This component tokenizes and separates Thai words on TextMeshPro components using Zero Width Space. Just attach it to any TextMeshPro component and you're all set! It will beautifully wrap the Thai text for you!

Key Features

  • Real-time Tokenization - It updates in real-time when the text is modified via the Inspector or Script.
  • Word Breaks GUI - The word segmentation is visualized in the Scene View. You can configure the visualization using GUIMode and GUIColor
  • Dictionary Edit - If tokenization is incorrect for unknown words, you can easily Add or Remove words from the dictionary directly through the Dictionary section on the ThaiTextNurse Inspector.
  • Custom Separator - You can insert a custom separator before each word breaks.
  • Glyph Correction - You can set Correction option to YoorYingAndToorTaan or FullC90 to fix the issues where the lower vowels like 'ุ' and 'ู' get overlapped with characters like 'ญ' and 'ฐ'. But the font you use must support C90 encoding. If not, you can try to modify it using Fontforge. See Thai Font Modification using FontForge for more details.

Under The Hood

  • ThaiTextNurse makes no changes to the original text instead, it only modifies the displayed message using the TMP_Text component's ITextPreprocessor feature, so you can add or remove it without any side effects.
  • Behind the scenes, it uses a TrieNode structure with the Longest Match Searching technique. See PhTokenizer.cs for searching and tokenization logic.
  • The word segmentation relies on a simple Text Dictionary file located under the Resources folder.

Handling the Dictionary

  • The dictionary used by the ThaiTextCore package must always be under any of the project folders named 'Resources'. You can specify the resource sub-directory path and other global settings of ThaiTextCare at Project Settings > Thai Text Care Settings
  • The dictionary is loaded/reloaded automatically when :
    • UnityEditor starts up.
    • Scripts of ThaiTextCare assembly recompile.
    • The first ThaiTextNurse triggers the first tokenization during runtime.
    • You Force Rebuild the dictionary either by pressing a button on ThaiTextNurse Inspector or by calling ThaiTextNurse.RebuildDictionary()
  • If you modify the dictionary in any way other than Adding or Removing words via the ThaiTextNurse Inspector (e.g., by editing the dictionary file directly), you must manually Force Rebuild the dictionary by yourself.
  • In the dictionary file, each word is separated by new line characters (\r\n, \n, \r), and words beginning with the / character are ignored by the tokenizer.

Scripting

Runtime & Editor

  • To receive a notification when ThaiTextNurse completes text tokenization process, you can bind callbacks to ThaiTextNurse.OnTokenized event.
    public class ThaiTextCare_ExampleScript : MonoBehaviour
    {
        [SerializeField] TMP_InputField inputField;
        [SerializeField] TMP_InputField separatorInputField;
        [SerializeField] TMP_Text outputText;
        [SerializeField] TMP_Text wordCountText;
        [SerializeField] ThaiTextNurse nurse;

        void Start()
        {
            nurse.OnTokenized += result => RefreshWordCount(result.WordCount);
            inputField.onValueChanged.AddListener(OnOriginalMessageChanged);
            separatorInputField.onValueChanged.AddListener(OnSeparatorChanged);
        }

        void OnOriginalMessageChanged(string input)
        {
            //this outputText has a ThaiTextNurse attached, so it will be tokenized
            outputText.text = input;

            //Do not get the tokenized text here, rely on OnTokenized event binding in Start() instead
        }

        void OnSeparatorChanged(string value)
        {
            nurse.Separator = value;
        }

        void RefreshWordCount(int count)
        {
            wordCountText.text = count.ToString("N0") + " Words";
        }
    }
  • You can manually call ThaiTextNurse.RebuildDictionary() or RebuildDictionaryAsync() (as a coroutine) to load the dictionary in advance during a loading screen to hide the performance hiccup from dictionary initialization which can be noticeable on low-end devices.
  • To get a tokenized version of any string, use one of the ThaiTextNurse.SafeTokenize() or TryTokenize() methods
 // Example input string in Thai
string inputText = "สวัสดีครับ";

// Using SafeTokenize to tokenize the input
 string tokenizedResult = ThaiTextNurse.SafeTokenize(inputText);
 Debug.Log("Tokenized Result: " + tokenizedResult);

// Using TryTokenize for more control
if (ThaiTextNurse.TryTokenize(inputText, out TokenizeResult result))
 {
    Debug.Log("Successfully tokenized!");
    Debug.Log("Tokenized String: " + result.Result);
    Debug.Log("Word Count: " + result.WordCount);
}
else
{
    Debug.LogError("Tokenization failed.");
}

Editor-Only

  • In Editor, you can utilize ThaiTextCareEditorCore.AddWordsToDictionary(), RemoveWordsFromDictionary(), or SaveDictionaryAsset() to modify the dictionary via Editor scripting.

ThaiFontDoctor

ThaiFontDoctor is a ScriptableObject that processes TextMeshPro's TMP_FontAsset, automating adjustments to glyph pairs based on predefined Glyph Combinations.

When you set a Glyph Combination, you specify which Thai character glyphs should pair together and the appropriate offset for each pair. ThaiFontDoctor then updates the GlyphAdjustmentTable in your TMP_FontAsset in real-time, making it easy to fine-tune how vowels and tone marks appear in your TMP_Text components.

Take a look at ThaiFontDoctor.asset for an example. This instance of ThaiFontDoctor's ScriptableObject already has some common GlyphCombinations for solving Thai font issues on it:

Key Features

  • Use the Add button to create new Glyph Combinations or click on existing ones to edit them.
  • On each GlyphCombination, you can specify the characters and adjustment offset of the Leading Glyphs and Following Glyphs.
  • Use the Create In Scene button behind the Tester TMP Text field to quickly create a TMP Text component to see how the font is displayed in real-time.
  • Every action performed on the ThaiFontDoctor inspector is fully Undo-supported, so you can safely experiment with your adjustments.
  • When modifying Glyph Combination, You can use GlyphPreset to easily insert common Thai character sets, or choose the กำหนดเอง (Custom) preset to fine-tune specific glyphs as needed.

Glyph Presets

Here are the available GlyphPresets and its glyph members :

ThaiGlyphPreset Display Name Glyph Members
AllConsonants ก - ฮ ก, ข, ฃ, ค, ฅ, ฆ, ง, จ, ฉ, ช, ซ, ฌ, ญ, ฎ, ฏ, ฐ, ฑ, ฒ, ณ, ด, ต, ถ, ท, ธ, น, บ, ป, ผ, ฝ, พ, ฟ, ภ, ม, ย, ร, ล, ว, ศ, ษ, ส, ห, ฬ, อ, ฮ
AscenderConsonants พยัญชนะหางบน ป, ฝ, ฟ, ฬ
DescenderConsonants พยัญชนะหางล่าง ฎ, ฏ
AllUpperGlyphs อักขระด้านบน - ิ, - ี, - ึ, - ื, - ็, - ั, - ์, - ่, - ้, - ๊, - ๋
UpperVowels สระบน - ิ, - ี, - ึ, - ื, - ็, - ั
ToneMarks วรรณยุกต์ - ่, - ้, - ๊, - ๋
ThanThaKhaat ทัณฑฆาต - ์
LeadingVowels สระหน้า เ-, แ-, โ-, ไ-, ใ-
AllFollowingVowels สระหลัง -ะ, - ำ, -า, -ๅ
SaraAum สระอำ - ำ
LowerVowels สระล่าง - ุ, - ู

Limitations

  • If multiple Glyph Combinations target the same glyphs, only the last adjustment will be applied. You can reorder combinations by selecting one of them and using the Up Arrow or Down Arrow buttons to prioritize adjustments.
  • The multi-edit feature is not supported.

How to Install Thai Text Care

You have 2 options for installing the library. Either via the package manager (Recommended) or by downloading this repository and putting it in your Unity Projects.

Package Manager Installation

Installing the package via the Package Manager allows you to easily install or update ThaiTextCare as a third-party library.

  1. In UnityEditor, Go to Window > Package Manager
  2. Click + button and choose Add package by git URL
  3. Use this link to install the package: https://github.com/phanphantz/ThaiTextCare-for-Unity.git
  4. That's it! You're all set!

Other Considerations

  • Please be aware that installing the library via Package Manager is not an ideal method if you wish to modify the source code. Any changes made to files within the Packages folder will be lost if the package is re-downloaded, and those modifications aren't tracked by GIT if the Packages folder is included in the .gitignore file.
  • To easily copy the example assets into your project, You can drag the Examples folder from under Packages/Thai Text Care/ into the Assets folder.



Troubleshooting

  • Font letters are not showing up on the TextMeshPro components? Make sure that you have already pressed the TMP essential resources import button.
  • If your font asset modifications aren't showing up in TextMeshPro components, it might indicate an issue with TextMeshPro's EventManager logic. If this happens, try restarting the Unity Editor to resolve the problem.
  • You may encounter warnings like 'Unable to add the requested character to font asset ....'s atlas texture. Please make the texture .... readable' To fix this, follow the instructions on this link to mark the font texture as readable.
  • While ThaiTextNurse works with TextMeshPro components in TMP_InputField, the caret position in the input field may display incorrectly due to the injected Zero Width Space characters in the text. I have yet to find an effective solution to fully support TMP_InputField.

Future Plans

  • Support using Addressables to load and manage the dictionary instead of Resources
  • Runtime modifications of the dictionary

Thai Font Modification using FontForge

To modify your Thai font extended character glyphs to have the correct Unicode, follow these steps :

  1. Download FontForge
  2. Open the desired font with FontForge (On Windows, you can use Open With... from the Right-click menu and Choose FontForge)
  3. You will need to map the extended characters to the correct Unicode by yourself by right-clicking on the target glyph > Glyph Info... > Then edit the Unicode field to the correct value.

A Python script can also be used to automate the process. You can copy glyphs by their names to the target Unicode, but the challenge is that many fonts use different naming conventions for their characters.

According to C90 encoding, Extended glyphs consist of:

Code Description C90 Category
U+F700 uni0E10.descless base.descless
U+F701~04 uni0E34~37.left upper.left
U+F705~09 uni0E48~4C.lowleft top.lowleft
U+F70A~0E uni0E48~4C.low top.low
U+F70F uni0E0D.descless base.descless
U+F710~12 uni0E31,4D,47.left upper.left
U+F713~17 uni0E48~4C.left top.left
U+F718~1A uni0E38~3A.low lower.low

Here’s a summary of the Thai characters in each row:

  • U+F700: ฐ ไม่มีเชิง (Descender-less Toor Than)
  • U+F701~04: ิ, ี, ึ, ื (Sara I, Sara II, Sara UE, Sara UEE)
  • U+F705~09: ่, ้, ๊, ๋, ์ (Tone marks: Mai Ek, Mai Tho, Mai Tri, Mai Chattawa, Thanthakhat)
  • U+F70A~0E: ่, ้, ๊, ๋, ์ (Same tone marks as above, different positioning)
  • U+F70F: ญ ไม่มีเชิง (Descender-less Yo Ying)
  • U+F710~12: ั, ํ, ็ (Sara Am, Yamakkan, Mai Taikhu)
  • U+F713~17: ่, ้, ๊, ๋, ์ (Tone marks again, different positioning)
  • U+F718~1A: ุ, ู, ฺ (Sara U, Sara UU, Phinthu)
  1. After you modify the glyphs, You can now Export the font and use it to create TMP_FontAsset in Unity.

Acknowledgements

  • Huge thanks to SaladLab for the ThaiFontAdjuster, which is the approach for correcting characters using extended glyphs.
  • I would also like to express my gratitude to Chaiwat Matarak for ThaiStringTokenizer. The dictionary file used in this project is based on his work, and it also inspired the creation of my own PhTokenizer class.
  • A special thanks to Onchulee for the ThaiText repository which inspired me about how to handle the Dictionary effectively. His C# implementation of the LexTo library for word segmentation is also an impressive work.
  • And of course, Big thanks to all of you amazing supporters!

You are all true heroes🙏

With gratitude,
Phun