#LyX 2.0 created this file. For more info see http://www.lyx.org/
\lyxformat 413
\begin_document
\begin_header
\textclass article
\begin_preamble
% vim: tw=70

\usepackage{lrec2006}

% \usepackage[utf8x]{inputenc}
% \usepackage{times}
% \usepackage{url}
% \usepackage[small,bf]{caption}
% \usepackage{latexsym}


\newcommand{\ana}[1]{\texttt{#1}}
\newcommand{\f}[1]{`#1'}
\newcommand{\tool}[1]{\texttt{#1}}

\title{FST Intersection: Ending Dictionary Redundancy in Apertium} % TITLE TODO ugh
% good word: Decomposition

\name{Author1, Author2, Author3}

\address{ Affiliation1, Affiliation2, Affiliation3 \\
  Address1, Address2, Address3 \\
  author1@xxx.yy, author2@zzz.edu, author3@hhh.com\\}

\abstract{
  A Finite State Transducer (FST) used as an analyser, whose output is
  input to another FST, may have entries that don't pass through the
  second FST. We discuss certain problems that this creates in the
  Apertium machine translation platform, and describe the development
  of a tool to \emph{trim} such entries. The tool is made part of
  Apertium's \tool{lttoolbox} package.
}
\end_preamble
\use_default_options false
\maintain_unincluded_children false
\language english
\language_package none
\inputencoding latin9
\fontencoding default
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize 10
\spacing single
\use_hyperref true
\pdf_bookmarks true
\pdf_bookmarksnumbered false
\pdf_bookmarksopen false
\pdf_bookmarksopenlevel 1
\pdf_breaklinks false
\pdf_pdfborder false
\pdf_colorlinks false
\pdf_backref section
\pdf_pdfusetitle true
\papersize a4paper
\use_geometry false
\use_amsmath 1
\use_esint 1
\use_mhchem 0
\use_mathdots 0
\cite_engine basic
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\use_refstyle 0
\index Index
\shortcut idx
\color #008000
\end_index
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\paragraph_indentation default
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\end_header

\begin_body

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
maketitleabstract
\end_layout

\end_inset


\end_layout

\begin_layout Section
Introduction and background
\end_layout

\begin_layout Standard
Apertium
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "forcada2011afp"

\end_inset

 is a rule-based machine translation platform, where the data and tools
 are released under a Free and Open Source license (primarily GNU GPL).
 Apertium translators use Finite State Transducers (FSTs) for morphological
 analysis, bilingual dictionary lookup and generation of surface forms;
 most language pairs
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
A 
\emph on
language pair
\emph default
 is a set of resources to translate between a certain set of languages in
 Apertium, e.g.
 Basque–Spanish.
\end_layout

\end_inset

 created with Apertium use the 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 FST library for compiling XML dictionaries into binary FSTs and for processing
 text with such FSTs.
\end_layout

\begin_layout Standard
Below we give some background on how these FSTs fit into Apertium as well
 as their capabilities; then we discuss the problems that have led to redundant
 dictionary data, and introduce our solution.
\end_layout

\begin_layout Subsection
FSTs in the Apertium pipeline
\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:pipeline"

\end_inset


\end_layout

\begin_layout Standard
Translation with Apertium works as a pipeline, where each 
\emph on
module
\emph default
 processes some text and feeds its output as input to the next module.
 First, a surface form like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

fishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 passes through the 
\series bold
analyser
\series default
 FST module, giving a set of analyses like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fish.n.pl/fish.vblex.pres
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, or, if it is unknown, simply 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

*fishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 Tokenisation is done during analysis, letting the FST decide in a left-to-right,
 longest-match fashion which words are tokens.
 The analyser is technically the union of several FSTs, each marked for
 whether it contains entries which are tokenised in the regular way (like
 regular words), or entries that may separate other tokens, like punctuation.
 Anything that has an analysis is a token, and any other sequence consisting
 of letters of the 
\family typewriter
alphabet
\family default
 of the analyser is an unknown word token.
 Anything else can separate tokens.
\end_layout

\begin_layout Standard
After analysis, one or more 
\series bold
disambiguation
\series default
 modules select which of the analyses is the correct one.
 The 
\series bold
pretransfer
\series default
 module makes some minor formal changes related to multiwords.
\end_layout

\begin_layout Standard
Then a disambiguated analysis like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fish.n.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 passes through the 
\series bold
bilingual
\series default
 FST.
 Using English–Norwegian as an example, we would get 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fisk.n.m.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 if the bilingual FST had a matching entry, or simply 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

@fish.n.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 if it was unknown in that dictionary.
 So a known entry may get changes to both lemma (
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fish
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

→
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fisk
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

) and tags (
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

n.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

→
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

n.m.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

) by the bilingual FST.
 When processing input to the bilingual FST, it is enough that the 
\emph on
prefix
\emph default
 of the tag sequence matches, so a bilingual dictionary writer can specify
 that 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fish.n
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 goes to 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fisk.n.m
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and not bother with specifying all inflectional tags like number, definiteness,
 tense, and so on.
\end_layout

\begin_layout Standard
The output of the bilingual FST is then passed to the 
\series bold
structural transfer
\series default
 module (which may change word order, ensure determiner agreement, etc.),
 and finally a 
\series bold
generator
\series default
 FST which turns analyses like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fisk.n.m.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 into forms like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

fiskar
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 Generation is the reverse of analysis; the dictionary which was compiled
 into a generator for Norwegian can also be used as an analyser for Norwegian,
 by switching the compilation direction.
\end_layout

\begin_layout Standard
A major feature of the 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 FST package is the support for multiwords and compounds.
 A 
\series bold
lexical unit
\series default
 may be 
\end_layout

\begin_layout Itemize
a simple non-multiword unit like the noun 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

fish
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, 
\end_layout

\begin_layout Itemize
a space-separated word like the noun 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

hairy frogfish
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, which will be analysed as one token, but otherwise have no formal differences
 from other words, 
\end_layout

\begin_layout Itemize
a multiword with inner inflection like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

takes out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, analysed as 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

take.vblex.pri.p3.sg# out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and then turned into 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

take# out.vblex.pri.p3.sg
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 (the uninflected part is moved onto the lemma) before bilingual dictionary
 lookup, 
\end_layout

\begin_layout Itemize
a token which is actually two words which should be separated before bilingual
 dictionary lookup, like 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

they'll
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, analysed as 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

prpers.prn.subj.p3.mf.pl+will.vaux.inf
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and then split into 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

prpers.prn.subj.p3.mf.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

will.vaux.inf
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 before bilingual dictionary lookup, 
\end_layout

\begin_layout Itemize
a combination of these three multiword types, like Catalan 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

creure-ho que
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, analysed as 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

creure.vblex.inf+ho.prn.enc.p3.nt# que
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and then moved and split into 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

creure# que.vblex.inf
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

ho.prn.enc.p3.nt
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 before bilingual dictionary lookup.
 
\end_layout

\begin_layout Standard
In addition to the above multiwords, where the whole string is explicitly
 defined as a path in the FST, we have dynamically analysed compounds which
 are not defined as single paths in the FST, but still get an analysis during
 lookup.
 To mark a word as being able to form a compound with words to the right,
 we give it the 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

hidden
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 tag 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

compound-only-L
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, while a word that is able to be a right-side of a compound (or a word
 on its own) gets the tag 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

compound-R
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 These hidden tags are not shown in the analysis output, but used by the
 FST processor during analysis.
 If the noun form 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

frog
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 is tagged 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

compound-only-L
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

fishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 is tagged 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

compound-R
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, the 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 FST processor will analyse 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

frogfishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 as a single compound token 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

frog.n.sg+fish.n.pl
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 (unless it was already in the dictionary as an explicit token) by trying
 all possible ways to split the word.
 After disambiguation, but before bilingual dictionary lookup, this compound
 analysis is split into two tokens, so the full word does not need to be
 specified in either dictionary.
 This feature is very useful for e.g.
 Norwegian, which has very productive compounding.
\end_layout

\begin_layout Subsection
The Problem: Redundant data
\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:problem"

\end_inset


\end_layout

\begin_layout Standard
Ideally, when a monolingual dictionary for, say, English is created, that
 dictionary would be available for reuse unaltered (or with only bug fixes
 and additions) in all language pairs where one of the languages is English.
 Common data would be factored out of language pairs, avoiding redundancy,
 giving 
\emph on
data decomposition
\emph default
.
 Unfortunately, that has not been the case in Apertium until recently.
\end_layout

\begin_layout Standard
If a word is in the analyser, but not in the bilingual translation dictionary,
 certain difficulties arise.
 As the example above showed, if 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

fishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 were unknown to both dictionaries, the output would be 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

*fishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, while if it were unknown to only the second, the output would be 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

@fish
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 Given 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

*fishes
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, a post-editor who knows both languages can immediately see what the original
 was, while the half-translated 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

@fish
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 hides the number information in the source text.
 Removing features like number, definiteness or tense can skew meaning.
 But it gets worse: Some languages inflect verbs for 
\emph on
negation
\emph default
, where the half-translated lemma would hide the fact that the meaning is
 negative.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
For simple cases like this, a workaround is to carry surface form information
 throughout the pipeline, but this fails with multiwords (described below)
 and compounds, which are heavily used in many Apertium language pairs.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
And, as mentioned above, a word not known to the bilingual FST may not have
 its tags translated (or translated correctly) either; when the transfer
 module tries to use the half-translated tags to determine agreement, the
 
\emph on
context
\emph default
 of the half-translated word may have its meaning skewed as well.
\end_layout

\begin_layout Standard
Trying to write transfer rules to deal with half-translated tags also 
\emph on
increases the complexity of transfer rules
\emph default
.
 For example, if any noun can be missing its gender, that's one more exception
 to all rules that apply gender agreement (as well as any feature that interacts
 with gender).
\end_layout

\begin_layout Standard
Finally, there are issues with tokenisation and multiwords.
 Multiwords in Apertium are entries in the dictionaries that may consist
 of what would otherwise be several tokens.
 As an example, say you have 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

take
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 listed in your English dictionary, and they translate fine in isolation.
 Now, for Catalan we want to translate the phrasal verb 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

take out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 into a single word 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

treure
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, so we list it as a 
\emph on
multiword with inner inflection
\emph default
 in the English dictionary.
 This makes any occurrence of forms of 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

take out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 get a single-token multiword analysis, e.g.
 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

takes out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 gets the analysis 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

take.vblex.pri.p3.sg# out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 But then the whole multiword 
\emph on
has
\emph default
 to be in the bilingual dictionary if it is to be translated.
 If another language pair using the same English dictionary has both 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

take
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 in its bilingual dictionary, but not the multiword, the individual words
 in isolation may be translated, but the whole string together will not
 be translated.
\end_layout

\begin_layout Standard
Due to these issues, most language pairs in Apertium have a separate copy
 of each monolingual dictionary, manually 
\emph on
trimmed
\emph default
 to match the entries of the bilingual dictionary; so in the example above,
 if 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
f{
\end_layout

\end_inset

take out
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 did not make sense to have in the bilingual dictionary, it would be removed
 from the copy of the monolingual dictionary.
 This of course leads to a lot of redundancy and duplicated effort; as an
 example, there are currently (as of SVN revision 50180) twelve Spanish
 monolingual dictionaries in stable (SVN trunk) language pairs, with sizes
 varying from 36798 lines to 204447 lines.
\end_layout

\begin_layout Standard
The redundancy is not limited to Spanish; in SVN trunk we also find 10 English,
 7 Catalan, and 4 French dictionaries.
 If we include unreleased pairs, the numbers for Spanish, English, Catalan
 and French rise to 19, 28, 8 and 16, respectively; in the worst case, if
 you add some words to an English dictionary, there are still 27 dictionaries
 which miss out on your work.
 The numbers get even worse if we look at potential new language pairs.
 Given 3 languages, you 
\begin_inset Quotes eld
\end_inset

only
\begin_inset Quotes erd
\end_inset

 need 
\begin_inset Formula $3\times(3-1)=6$
\end_inset

 monolingual dictionaries for all possible pairs (remember that a dictionary
 provides both an analyser and a generator).
 But for 4 languages, you need 
\begin_inset Formula $4\times(4-1)=12$
\end_inset

 dictionaries; if we were to create all possible translation pairs of the
 34 languages appearing in currently released language pairs, we would need
 
\begin_inset Formula $34\times(34-1)=1122$
\end_inset

 monolingual dictionaries, where 34 ought to be enough.
\end_layout
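
\begin_layout Standard
The arithmetic above can be restated in general form; the symbol 
\begin_inset Formula $n$
\end_inset

 is introduced here only for illustration and stands for the number of languages.
 Assuming one trimmed monolingual dictionary per language in every pair,
 all possible pairs require 
\begin_inset Formula $n(n-1)$
\end_inset

 dictionaries where 
\begin_inset Formula $n$
\end_inset

 shared ones ought to suffice, so the overhead factor is 
\begin_inset Formula 
\[
\frac{n(n-1)}{n}=n-1,\qquad\frac{34\times33}{34}=33.
\]

\end_inset

In other words, under this assumption the amount of duplicated dictionary
 data grows linearly with the number of languages.
\end_layout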

\begin_layout Standard
\begin_inset Float figure
placement h
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename pairs-before.eps
	scale 50
	draft
	BoundingBox 0 0 200 100
	special type=eps

\end_inset

 
\begin_inset Caption

\begin_layout Plain Layout
Current monolingual dictionaries (monodixes) with pairs of four languages
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
centering
\end_layout

\end_inset


\begin_inset ERT
status collapsed

\begin_layout Plain Layout

{}
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "fig.1"

\end_inset

 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Float figure
placement h
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename pairs-after.eps
	scale 50
	draft
	BoundingBox 0 0 200 100
	special type=eps

\end_inset

 
\begin_inset Caption

\begin_layout Plain Layout
Ideal number of monolingual dictionaries (monodixes) with four languages
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
centering
\end_layout

\end_inset


\begin_inset ERT
status collapsed

\begin_layout Plain Layout

{}
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "fig.2"

\end_inset

 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The lack of shared monolingual dictionaries also means that other monolingual
 resources, like disambiguator data, are not shared, since the effort of
 copying files is less than the effort of letting one module depend on another
 for little gain.
 And it complicates the reuse of Apertium's extensive
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "tyers2010fosresources"

\end_inset

 set of language resources for other systems: If you want to create a speller
 for some language supported by Apertium, you either have to manually merge
 dictionaries in order to gain from all the work, or (more likely) pick
 the largest one and hope it's good enough.
\end_layout

\begin_layout Subsection
A Solution: Intersection
\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:solution"

\end_inset


\end_layout

\begin_layout Standard
However, there is a way around these troubles.
 Finite state machines can be intersected with one another to produce a
 new finite state machine.
 In the case of the Apertium transducers, what we want is to intersect the
 output (or 
\series bold
right
\series default
) side of the full analyser with the input (or 
\series bold
left
\series default
) side of the bilingual FST, producing a 
\emph on
trimmed
\emph default
 FST.
 We call this 
\emph on
trimming
\emph default
.
\end_layout

\begin_layout Standard
Some recent language pairs in Apertium use the alternative FST framework
 HFST
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "linden2011hfst"

\end_inset


\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Partly due to available data in that formalism, partly due to features missing
 from 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 like 
\emph on
flag diacritics
\emph default
.
\end_layout

\end_inset

.
 Using HFST, one can create a 
\begin_inset Quotes eld
\end_inset

prefixed
\begin_inset Quotes erd
\end_inset

 version of the bilingual FST; this is the concatenation of the bilingual
 FST and the regular expression
 
\family typewriter
.*
\family default
, i.e.
 match any symbol zero or more times.
 Then the command 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

hfst-compose-intersect
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 on the analyser and the prefixed FST creates the FST where only those paths
 of the analyser remain where the right side of the analyser matches the
 left side of the bilingual FST.
 The prefixing is necessary since, as mentioned above, the bilingual dictionary
 is underspecified for inflectional tags such as number, definiteness, tense,
 and so on.
\end_layout

\begin_layout Standard
The HFST solution works, but is missing many of the Apertium-specific features
 such as different types of tokenisation FSTs, and it does not handle the
 fact that multiwords may split or change format before bilingual dictionary
 lookup.
 Also, HFST represents compounds with an optional transition from the end
 of the noun to the beginning of the noun dictionary -- so if 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

frog.n
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fish.n
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 were in the analyser, but 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

fish.n
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 were missing from the bilingual FST, 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
ana{
\end_layout

\end_inset

frog.n+fish.n
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 would remain in the trimmed FST since the prefix matches.
 In addition, using HFST in language pairs whose data are all in 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 format would introduce a new dependency.
\end_layout

\begin_layout Standard
Thus we decided to create a new tool within 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, called 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lt-trim
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 This tool should trim an analyser using a bilingual FST, creating a trimmed
 analyser, and handle all the 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 multiwords and compounds, as well as letting us retain the special tokenisation
 features of 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 The end result should be the same as perfect manual trimming.
 The next section details its implementation.
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
Available from 
\begin_inset CommandInset href
LatexCommand href
target "http://example.com/anonymized-until-peer-review"

\end_inset

.
\end_layout

\end_inset


\end_layout

\begin_layout Section
Implementation of 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lt-trim
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Prefixing the bilingual dictionary
\end_layout

\begin_layout Standard
Apertium alphabets consist of symbol pairs, each with a left symbol (the
 input in a transducer) and a right symbol (the output).
 Both the pairs and the symbols themselves, which can be either letters
 or tags, are identified by integers.
 First, the identifiers of the identical left-right pairs of the desired
 symbols are determined.
 For intersection by depth-first traversal, the method implemented in Apertium,
 only the tags are needed, though the option to include letter pairs still
 exists because of the deprecated multiplicative method (see Section 2.3).
 The side from which the symbols are taken can also be specified, though
 when prefixing a bilingual dictionary only the right (output) symbols are
 used.
 All symbol pairs of the given alphabet are looped through and, depending
 on which side was specified, the respective symbols are analysed.
 The method differs between letters and tags.
 The identifiers of letters are the letters themselves cast to integers
 and are therefore consistent across all alphabets, so the identifiers of
 letter pairs can be determined directly.
 The identifiers of tags, however, can differ between alphabets, so the
 identifier of each tag must be determined before that of its pair.
 Once the identifiers of the desired symbol pairs are known, they are used
 to create loopbacks on the bilingual dictionary using the function appendDotStar:
 for each identifier, a transition is created from every final state of
 the bilingual transducer directly back to that same state.
\end_layout
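
\begin_layout Standard
The loopback step described above can be illustrated with a minimal Python
 sketch.
 It is not the lttoolbox code: the Transducer class, the angle-bracket tag
 convention, the helper names and the example entries are all assumptions
 made for illustration, and symbols are kept as strings rather than integer
 identifiers.
 The sketch collects the tags found on the right side of the bilingual arcs
 and adds one identity loop per tag to every final state.
\end_layout

```python
from collections import defaultdict


class Transducer:
    """Toy transducer: arcs[state] is a list of (input, output, target)."""

    def __init__(self):
        self.arcs = defaultdict(list)
        self.finals = set()

    def add_arc(self, src, inp, out, dst):
        self.arcs[src].append((inp, out, dst))


def is_tag(symbol):
    # Assumption: tags are written in angle brackets, e.g. "<n>".
    return symbol.startswith("<") and symbol.endswith(">")


def right_side_tags(t):
    """Collect the tags that occur on the right (output) side of any arc."""
    return {out for arcs in t.arcs.values() for _inp, out, _dst in arcs if is_tag(out)}


def append_dot_star(t, symbols):
    """Add an identity loop s:s on every final state, for each given symbol."""
    for state in sorted(t.finals):
        for s in sorted(symbols):
            t.add_arc(state, s, s, state)


# Hypothetical bilingual entries: one noun and one verb.
bidix = Transducer()
bidix.add_arc(0, "frog", "rana", 1)
bidix.add_arc(1, "<n>", "<n>", 2)
bidix.add_arc(0, "jump", "saltar", 3)
bidix.add_arc(3, "<vblex>", "<vblex>", 4)
bidix.finals = {2, 4}

tags = right_side_tags(bidix)
append_dot_star(bidix, tags)
```

\begin_layout Standard
After the call, each final state of the toy bilingual transducer accepts
 any further sequence of the collected tags, which is what lets an analysis
 with extra trailing tags still match a shorter bilingual entry.
\end_layout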

\begin_layout Subsection
Moving lemq's
\end_layout

\begin_layout Subsection
Intersection
\end_layout

\begin_layout Standard
The first method of intersecting FSTs in Apertium consisted of multiplying
 them.
 However, it was extremely inefficient and crippled every system it was
 tested on when trimming real Apertium dictionaries.
 First, the states of the two transducers were multiplied: every possible
 state pair, consisting of one state from the monolingual dictionary and
 one from the bilingual dictionary, was assigned a state in the trimmed
 transducer.
 Next, the transitions were multiplied.
 As the intersection is only concerned with the output of the monolingual
 dictionary and the input of the bilingual dictionary, the respective symbols
 had to match.
 Although very many pairs of transitions did not match, a significant number
 of those that did led to redundant or unreachable states.
 These were removed by minimisation.
\end_layout
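
\begin_layout Standard
To make the cost of the multiplicative method concrete, the following Python
 sketch builds the full product of two toy transducers.
 The dictionary-based representation, the function names and the example
 data are assumptions made for illustration; they do not reproduce the lttoolbox
 implementation.
\end_layout

```python
from itertools import product


def multiply(mono, bi):
    """Product construction: pair every state and every matching transition."""
    # Every state pair gets a state, whether or not it can ever be reached.
    states = set(product(mono["states"], bi["states"]))
    # Pair transitions whose symbols agree: mono output against bi input.
    arcs = [
        ((p, q), a_in, a_out, (p2, q2))
        for (p, a_in, a_out, p2) in mono["arcs"]
        for (q, b_in, _b_out, q2) in bi["arcs"]
        if a_out == b_in
    ]
    finals = set(product(mono["finals"], bi["finals"]))
    start = (mono["start"], bi["start"])
    return {"states": states, "start": start, "finals": finals, "arcs": arcs}


def reachable(fst):
    """States that can be reached from the start state."""
    seen = {fst["start"]}
    stack = [fst["start"]]
    while stack:
        state = stack.pop()
        for src, _inp, _out, dst in fst["arcs"]:
            if src == state and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


# Arcs are (source, input, output, target); the data are hypothetical.
mono = {
    "states": {0, 1, 2, 3},
    "start": 0,
    "finals": {2, 3},
    "arcs": [(0, "a", "a", 1), (1, "b", "b", 2), (1, "a", "a", 3)],
}
bi = {
    "states": {0, 1, 2},
    "start": 0,
    "finals": {2},
    "arcs": [(0, "a", "x", 1), (1, "b", "y", 2)],
}

prod = multiply(mono, bi)
```

\begin_layout Standard
In this toy case the product allocates 12 states, of which only three are
 reachable from the start pair, and one of its three transitions starts
 from an unreachable state.
 On real dictionaries the same effect is what made the method impractical.
\end_layout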

\begin_layout Standard
Apertium now implements a much more efficient method of intersection,
 depth-first traversal.
 The monolingual and bilingual dictionaries are traversed in lockstep.
 Only transitions of the monolingual dictionary whose output matches the
 input of the corresponding transition in the bilingual dictionary are included
 in the trimmed transducer.
 Multiwords require some additional handling.
 If a 
\begin_inset Quotes eld
\end_inset

+
\begin_inset Quotes erd
\end_inset

 is encountered in the monolingual dictionary (indicating a 
\begin_inset Quotes eld
\end_inset

+
\begin_inset Quotes erd
\end_inset

-type multiword), the traversal of the bilingual dictionary resumes from
 its beginning.
 In addition, in the event that a 
\begin_inset Quotes eld
\end_inset

#
\begin_inset Quotes erd
\end_inset

 is later encountered, the current position is recorded.
 
\begin_inset Quotes eld
\end_inset

#
\begin_inset Quotes erd
\end_inset

-type multiwords alone can be handled easily if the bilingual dictionary
 is preprocessed to be in the same format as the monolingual dictionary;
 the tags must be moved before the 
\begin_inset Quotes eld
\end_inset

#.
\begin_inset Quotes erd
\end_inset

 However, a combination of both 
\begin_inset Quotes eld
\end_inset

+
\begin_inset Quotes erd
\end_inset

 and 
\begin_inset Quotes eld
\end_inset

#
\begin_inset Quotes erd
\end_inset

 requires that the traversal of the bilingual dictionary return to the state
 at which the 
\begin_inset Quotes eld
\end_inset

+
\begin_inset Quotes erd
\end_inset

 was first encountered.
\end_layout
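
\begin_layout Standard
The lockstep traversal can be sketched in Python as follows.
 This is a simplified illustration under stated assumptions, not the lt-trim
 implementation: transducers are trie-shaped identity transducers built
 by a hypothetical from_entries helper, whole lemmas are single symbols,
 tags are written in angle brackets, the part before a + symbol is assumed
 to have to end in a final state of the bilingual dictionary before the
 traversal restarts, and the handling of # is left out.
\end_layout

```python
def from_entries(entries):
    """Build a trie-shaped identity transducer from lists of symbols."""
    fst = {"start": 0, "finals": set(), "arcs": {}}
    next_state = 1
    for symbols in entries:
        state = 0
        for s in symbols:
            for _inp, out, dst in fst["arcs"].get(state, []):
                if out == s:
                    state = dst
                    break
            else:
                fst["arcs"].setdefault(state, []).append((s, s, next_state))
                state = next_state
                next_state += 1
        fst["finals"].add(state)
    return fst


def trim(mono, bi):
    """Keep only mono transitions that can be followed in lockstep in bi."""
    start = (mono["start"], bi["start"])
    arcs, seen, stack = [], {start}, [start]
    while stack:
        m, b = stack.pop()
        for inp, out, m2 in mono["arcs"].get(m, []):
            if out == "+":
                # Assumption: restart bi only if the previous part is complete.
                targets = [bi["start"]] if b in bi["finals"] else []
            else:
                targets = [b2 for b_in, _o, b2 in bi["arcs"].get(b, []) if b_in == out]
            for b2 in targets:
                nxt = (m2, b2)
                arcs.append(((m, b), inp, out, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    finals = {(m, b) for (m, b) in seen if m in mono["finals"] and b in bi["finals"]}
    # Drop transitions from which no final state pair can be reached.
    useful, changed = set(finals), True
    while changed:
        changed = False
        for src, _i, _o, dst in arcs:
            if dst in useful and src not in useful:
                useful.add(src)
                changed = True
    arcs = [a for a in arcs if a[0] in useful and a[3] in useful]
    return {"start": start, "finals": finals, "arcs": arcs}


def analyses(fst):
    """Enumerate the output strings of an acyclic trimmed transducer."""
    found = set()

    def walk(state, acc):
        if state in fst["finals"]:
            found.add("".join(acc))
        for src, _inp, out, dst in fst["arcs"]:
            if src == state:
                walk(dst, acc + [out])

    walk(fst["start"], [])
    return found


# Hypothetical data: fish<n> is missing from the bilingual dictionary.
mono = from_entries([
    ["frog", "<n>"],
    ["fish", "<n>"],
    ["frog", "<n>", "+", "fish", "<n>"],
    ["frog", "<n>", "+", "frog", "<n>"],
])
bi = from_entries([["frog", "<n>"]])
trimmed = trim(mono, bi)
```

\begin_layout Standard
With these toy dictionaries, the analyses that survive trimming are frog<n>
 and frog<n>+frog<n>; the compound whose second part is missing from the
 bilingual dictionary is removed, because after the + symbol the second
 part is matched against the bilingual dictionary from its start state.
\end_layout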

\begin_layout Section
Ending Dictionary Redundancy
\end_layout

\begin_layout Standard
As mentioned in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:solution"

\end_inset

, there are already language pairs in Apertium that have moved to a decomposed
 data model, using the HFST trimming method.
 At first, the HFST language pairs would also copy dictionaries, even if
 they were automatically trimmed, just to make them available for the language
 pair.
 But over the last year, we have created GNU Autotools scripts that let
 a language pair have a formal dependency on one or more monolingual data packages
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
So if a user asks their package manager, e.g.
 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

apt-get
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, to install the language pair 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

apertium-foo-bar
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, it would automatically install dependencies 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

apertium-foo
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

apertium-bar
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 first.
\end_layout

\end_inset

.
 There is now an SVN module 
\family typewriter
languages
\family default

\begin_inset Foot
status collapsed

\begin_layout Plain Layout
\begin_inset CommandInset href
LatexCommand href
target "http://wiki.apertium.org/wiki/Languages"

\end_inset


\end_layout

\end_inset

 where such monolingual data packages reside, and all of the new HFST-based
 language pairs now use such dependencies, which are trimmed automatically,
 instead of making redundant dictionary copies.
 Disambiguation data is also fetched from the dependency instead of being
 redundantly copied.
\end_layout

\begin_layout Standard
Most of the released and 
\begin_inset Quotes eld
\end_inset

stable
\begin_inset Quotes erd
\end_inset

 Apertium language pairs use 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and still have dictionary redundancy.
 With the new 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lt-trim
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

 tool, it is finally possible to end the redundancy
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
One could argue that there is still 
\emph on
cross-lingual
\emph default
 redundancy in the bilingual dictionaries -- Apertium by design does not
 use an interlingua.
 Instead, the Apertium dictionary crossing tool 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

crossdics
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset


\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "toral2011crossdics-it-ca"

\end_inset

 provides ways to extract new translations during development: Given bilingual
 dictionaries between languages A-B and B-C, it creates a new bilingual
 dictionary between languages A-C.
 One argument for not using an interlingua during the translation process
 is that the dictionary resulting from automatic crossing needs a lot of
 manual cleaning to root out false friends, unidiomatic translations and
 other errors -- thus an interlingua would have to contain a lot more informatio
n than our current bilingual dictionaries in order to automatically disambiguate
 such issues.
 It would also require more linguistic knowledge from developers and raise
 the entry barrier for new contributors.
\end_layout

\end_inset

 for the pairs which use 
\begin_inset ERT
status collapsed

\begin_layout Plain Layout


\backslash
tool{
\end_layout

\end_inset

lttoolbox
\begin_inset ERT
status collapsed

\begin_layout Plain Layout

}
\end_layout

\end_inset

, with its tokenisation, multiword and compounding features, and without
 having to make those pairs dependent on HFST.
\end_layout

\begin_layout Section*
Acknowledgements
\end_layout

\begin_layout Standard
Part of the development was funded by the Google Code-In
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
\begin_inset CommandInset href
LatexCommand href
target "https://code.google.com/gci/"

\end_inset


\end_layout

\end_inset

 programme.
\end_layout

\begin_layout Standard
\begin_inset CommandInset bibtex
LatexCommand bibtex
bibfiles "apertium"
options "lrec2006"

\end_inset


\end_layout

\end_body
\end_document