From 3540fb10400ffcc583ee334196c04dd9ff6fdde1 Mon Sep 17 00:00:00 2001
From: Gordon Lichtstein <72274426+generic-account@users.noreply.github.com>
Date: Sat, 17 Feb 2024 13:20:05 -0500
Subject: [PATCH] Update README.md

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index 4488f1c..d5c312a 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,15 @@
 # Esperanto Morphological Tokenization
+
+
+
+
 ## Introduction
 #### Esperanto Background
 Esperanto is an agglutinative constructed international auxiliary language, boasting a unique and regular set of grammatical features along with the largest speaker base of any constructed language. While it has its quirks which we will soon note, its word structure is incredibly regular, fitting only a handful of common patterns, making it uniquely suited to morphological segmentation, tokenization, and subword modeling, the process of splitting up words based on their structure for use in natural language processing models. We investigate the impact of morphological tokenization on the translation quality of English to Esperanto translations, using Fairseq, a simple sequence modeling toolkit built by Facebook. [^1]