-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCSVidler 7LN004 - Russian Cyrillic to Latin Transliteration Program.py
396 lines (362 loc) · 19.3 KB
/
CSVidler 7LN004 - Russian Cyrillic to Latin Transliteration Program.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
# -*- coding: utf-8 -*-
"""
Created on Mon Feb 1 23:14:47 2021
@author: chris
"""
# Computational Linguistics Essay Program:
# A Basic Program for Transliterating (Russian) Cyrillic Characters
# to the Latin Alphabet
# A function is defined here at the start of the program that ensures
# the user can only enter "Y" (for yes) or "N" (for no) when prompted.
# The function's if condition ensures that only these letters
# can be entered to break the infinite while loop within the function
# and continue the program. This is used to control program flow
# towards the end of the program and to allow the user to loop back
# to various steps, enabling them to test multiple inputs
# and/or transliteration systems in single run of the program.
def y_or_n_input(prompt):
"""Restrict user input to return either "Y" or "N"."""
while True:
y_or_n_answer = input(prompt).upper()
if (y_or_n_answer != "Y" and y_or_n_answer != "N"):
print("Error: Please enter either Y or N to proceed.")
continue
else:
break
return y_or_n_answer
# To begin, instructions are printed here that introduce the program
# and describe its functionality to the user.
print("""Welcome to the Russian text transliteration program!
This program allows the user to input a string of text containing
Russian letters, and have this string returned with the Cyrillic
characters romanised to Latin equivalents according to
a transliteration system selected by the user.
Please note that this program is only designed to convert
the 33 letters that appear in the Russian alphabet. Certain Cyrillic
characters that feature in other languages, such as Ukrainian "ї"
or Serbian "ђ", will not be converted by this program,
and the systems used to convert compatible letters will handle these
according to conventions used in transliterating Russian (i.e. "г"
will become "g", rather than "h" as it is commonly romanised
from Ukrainian and Belarusian.)
Although there is no commonly accepted standard for romanising Russian,
a few official systems exist for use in various applications such as
linguistics research, producing machine readable documents or
transliterating place names on road signs. The user can have
their string of text transliterated according to one of three systems:
- The International Scholarly System (also known as the scientific
transliteration system), which has been used in linguistics
publications since the 19th century
- The ISO 9 standard for transliterating Cyrillic characters into Latin
equivalents, in its most recent version from 1995, which is notable
for its use of one-to-one character mappings (which make use
of letters with diacritics) that also allow for reliable reverse
transliteration.
The ISO 9 system has also been adopted into the GOST (ГОСТ) standards
used by the countries of the CIS as GOST 7.79-2000,
with an "A" subsystem that is identical to the ISO standard
and retains the one Cyrillic character to one Latin character
mapping system. A subsystem "B" also exists that features mappings
from single Cyrillic characters to multiple Latin characters,
although this version is not covered by this program.
- The system stipulated by the International Civil Aviation
Organization (ICAO) for machine readable travel documents,
which has been used in Russian passports since 2013
according to Order No. 320 of Russia's Federal Migratory Service.
These transliteration systems are summarised in a table of character
mappings on Wikipedia, which can be found here:
https://en.wikipedia.org/wiki/Romanization_of_Russian
More information on the International Scholarly System can be found at:
https://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic
More information on ISO 9:1995 can be found at:
https://www.iso.org/standard/3589.html
More information on the ICAO system can be found at:
https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf
This program comprises the following steps:
- Input the string for transliteration
- Input the code for the transliteration system to use
After the final step, you will have the option of entering a code
for a different transliteration system to use on the same string,
or to start from scratch by entering a new string for processing.
Alternatively, you will also be able to exit the program.
""")
# The program's main loop is initiated here using a while loop
# in conjunction with a sentry variable called loop_string_selection.
# If the user does not enter "N" as a value for this variable
# at the end when asked if they wish to end the program,
# the program flow reverts back here and repeats the steps,
# allowing them to enter a new string for conversion.
loop_string_selection = ""
while loop_string_selection != "N":
# The user's first prompt is printed here, which asks them to enter
# a string of text for transliteration. This could be,
# for example, the Russian pangram "Разъяренный чтец эгоистично
# бьёт пятью жердями шустрого фехтовальщика."
input_string = input("Please enter a string for processing: ")
# Since strings are immutable objects in Python, the string
# is converted into a mutable list format for ease of modification
# and processing. The list() function breaks the string down
# into a list with its characters (including spaces and punctuation
# marks) stored as individual items. This list is then assigned
# to the string_as_list variable.
string_as_list = list(input_string)
# A secondary loop is initiated here, this time using
# a sentry variable called loop_dict_selection.
# If the user does not enter "N" as a value for this variable
# at the end when asked if they wish to enter a new string
# for conversion or end the program, the program flow
# reverts back here to repeat the last step,
# allowing them to select a new transliteration system
# for converting the string that was previously entered.
loop_dict_selection = ""
while loop_dict_selection != "N":
# Mapping dictionaries for the various transliteration systems
# are established here, along with the codes that are used
# to refer to them and to select them. The relevant dictionary
# is accessed later in the program to return Latin character
# sets (in the value entry) for each identified Cyrillic
# character (in the key entry).
#
# This dictionary has been created for the International
# Scholarly System. Note that since double quotation marks
# are used for the strings here, a backslash is required to
# escape the double quotation mark used to represent
# the hard sign "ъ" in this system. Also note that
# this system uses one-to-two character mappings
# in a few cases.
scholarly_code = "SCHOLARLY"
scholarly_dict = {"А": "A", "а": "a",
"Б": "B", "б": "b",
"В": "V", "в": "v",
"Г": "G", "г": "g",
"Д": "D", "д": "d",
"Е": "E", "е": "e",
"Ё": "Ë", "ё": "ë",
"Ж": "Ž", "ж": "ž",
"З": "Z", "з": "z",
"И": "I", "и": "i",
"Й": "J", "й": "j",
"К": "K", "к": "k",
"Л": "L", "л": "l",
"М": "M", "м": "m",
"Н": "N", "н": "n",
"О": "O", "о": "o",
"П": "P", "п": "p",
"Р": "R", "р": "r",
"С": "S", "с": "s",
"Т": "T", "т": "t",
"У": "U", "у": "u",
"Ф": "F", "ф": "f",
"Х": "X", "х": "x",
"Ц": "C", "ц": "c",
"Ч": "Č", "ч": "č",
"Ш": "Š", "ш": "š",
"Щ": "Šč", "щ": "šč",
"Ъ": "\"", "ъ": "\"",
"Ы": "Y", "ы": "y",
"Ь": "'", "ь": "'",
"Э": "È", "э": "è",
"Ю": "Ju", "ю": "ju",
"Я": "Ja", "я": "ja"}
# This dictionary has been created for the IS0 9:1995/
# GOST 7.79-2000(A) system. Note that since double quotation
# marks are used for the strings here, a backslash is required
# to escape the double quotation mark used to represent
# the hard sign "ъ" in this system. This system is notable
# for consisting solely of one-to-one character mappings,
# which would mean that text romanised with this system
# would be quite easy to back-transliterate as well
# using a similar method.
ISO_9_code = "ISO9"
ISO_9_dict = {"А": "A", "а": "a",
"Б": "B", "б": "b",
"В": "V", "в": "v",
"Г": "G", "г": "g",
"Д": "D", "д": "d",
"Е": "E", "е": "e",
"Ё": "Ë", "ё": "ë",
"Ж": "Ž", "ж": "ž",
"З": "Z", "з": "z",
"И": "I", "и": "i",
"Й": "J", "й": "j",
"К": "K", "к": "k",
"Л": "L", "л": "l",
"М": "M", "м": "m",
"Н": "N", "н": "n",
"О": "O", "о": "o",
"П": "P", "п": "p",
"Р": "R", "р": "r",
"С": "S", "с": "s",
"Т": "T", "т": "t",
"У": "U", "у": "u",
"Ф": "F", "ф": "f",
"Х": "H", "х": "h",
"Ц": "C", "ц": "c",
"Ч": "Č", "ч": "č",
"Ш": "Š", "ш": "š",
"Щ": "Ŝ", "щ": "ŝ",
"Ъ": "\"", "ъ": "\"",
"Ы": "Y", "ы": "y",
"Ь": "'", "ь": "'",
"Э": "È", "э": "è",
"Ю": "Û", "ю": "û",
"Я": "Â", "я": "â"}
# This dictionary has been created for the ICAO system
# for producing machine readable travel documents.
# This system is notable for not using any diacritics,
# so there is greater reliance on one-to-multiple character
# mappings instead. The hard sign "ъ" is transliterated
# more unusually as "ie", while the soft sign "ь" is not
# reflected in the transliteration at all in this case,
# and so it returns an empty string (i.e. no character) here.
ICAO_code = "ICAO"
ICAO_dict = {"А": "A", "а": "a",
"Б": "B", "б": "b",
"В": "V", "в": "v",
"Г": "G", "г": "g",
"Д": "D", "д": "d",
"Е": "E", "е": "e",
"Ё": "E", "ё": "e",
"Ж": "Zh", "ж": "zh",
"З": "Z", "з": "z",
"И": "I", "и": "i",
"Й": "I", "й": "i",
"К": "K", "к": "k",
"Л": "L", "л": "l",
"М": "M", "м": "m",
"Н": "N", "н": "n",
"О": "O", "о": "o",
"П": "P", "п": "p",
"Р": "R", "р": "r",
"С": "S", "с": "s",
"Т": "T", "т": "t",
"У": "U", "у": "u",
"Ф": "F", "ф": "f",
"Х": "Kh", "х": "kh",
"Ц": "Ts", "ц": "ts",
"Ч": "Ch", "ч": "ch",
"Ш": "Sh", "ш": "sh",
"Щ": "Shch", "щ": "shch",
"Ъ": "Ie", "ъ": "ie",
"Ы": "Y", "ы": "y",
"Ь": "", "ь": "",
"Э": "E", "э": "e",
"Ю": "Iu", "ю": "iu",
"Я": "Ia", "я": "ia"}
# The user's second prompt is printed here, with instructions
# printed before the input prompt itself that ask the user
# to enter the corresponding code for the transliteration
# system to be used. The variables that contain the strings
# that will be matched against the user's input
# are concatenated to the instruction string in this case.
#
# The input itself is framed within an infinite while loop
# that restricts input in a similar way to the y_or_n_input()
# function defined above: the user can only enter one
# of the specified codes (with the upper() method applied,
# to remove case sensitivity) to trigger the if condition
# that will break out of the while loop and continue
# the program. If a non-matching string is entered,
# the else condition will continue the loop until a valid
# input is entered.
while True:
print("\nPlease enter the relevant code following the colon "
"for one of the transliteration systems below "
"for converting the text."
"\nInternational Scholarly System: "
+ scholarly_code +
"\nISO 9:1995/GOST 7.79-2000(A): "
+ ISO_9_code +
"\nICAO MRTD specification: "
+ ICAO_code)
input_code = input("Enter code: ")
if (input_code.upper() == scholarly_code
or input_code.upper() == ISO_9_code
or input_code.upper() == ICAO_code):
break
else:
print("The code entered was invalid. Please try again.")
continue
# With a valid input entered, a series of (el)if conditions
# are defined that assign the relevant mapping dictionary
# to the chosen_dict variable depending on the code
# that was entered.
if input_code.upper() == scholarly_code:
chosen_dict = scholarly_dict
elif input_code.upper() == ISO_9_code:
chosen_dict = ISO_9_dict
elif input_code.upper() == ICAO_code:
chosen_dict = ICAO_dict
# The crux of the program occurs here in the form of a list
# comprehension expression that assigns the modified characters
# of the string (along with any unchanged characters)
# to a new variable named modified_list. The expression itself
# makes use of Python's get() method to access the chosen
# dictionary and return the Latin equivalents
# in this dictionary for any Russian Cyrillic characters
# that appear. The program does this by iterating
# over each item in string_as_list, which in this
# case is each character extracted from the original string.
# Each item is assigned to a temporary variable, char,
# for each iteration of the for loop. By passing this char
# variable as the first parameter for the get() method,
# the method will return the value for any dictionary key
# that matches the string contained in this variable
# should this string be a Russian Cyrillic character, and add
# this value to the new list. The second parameter indicates
# the value that should be returned if no dictionary key
# is found, and in this case, the char variable is passed
# again, meaning that the character will be simply be re-added
# to the new list unmodified if no matching dictionary key
# is found. This creates a new list in the modified_list
# variable with the same characters in the same order as the
# string_as_list variable, but with only the Russian Cyrillic
# characters changed.
modified_list = [chosen_dict.get(char, char)
for char in string_as_list]
# The items in modified_list are then combined back
# into a single string using the join() method, with an empty
# string as a separator to ensure that no spaces are inserted
# between the characters.
modified_string = "".join(modified_list)
# Finally, the results of the program are printed:
# The original string is printed for reference,
# along with the modified string and the reference code
# for the system that was used to transliterate it.
print("\nOrginal string:\n" + input_string)
print("Transliteration of string using the "
+ input_code.upper() + " system:\n"
+ modified_string)
# With the final step of the program concluded, the user
# is asked to enter "Y" or "N" according to the earlier
# defined input function to determine whether they would
# like to process the same string again with a different
# system (i.e. revert to the beginning of the secondary
# loop), or choose to enter a new string or exit the program
# (i.e. exit the secondary loop and proceed to the end
# of the main loop).
loop_dict_selection = y_or_n_input("To process the same string again "
"using a different "
"transliteration system, "
"please enter the letter Y."
"\nTo process a different string "
"or exit the program, "
"please enter the letter N."
"\nEnter Y/N: ")
# Having entered "N" to exit the secondary loop above,
# the user is asked to enter "Y" or "N" according to
# the earlier defined input function to determine whether
# they would like to enter a new string for processing
# (i.e. revert to the beginning of the main loop) or exit
# the program (i.e exit the main loop and proceed to the end
# of the program).
loop_string_selection = y_or_n_input("To enter a new string "
"for processing, "
"please enter the letter Y."
"\nTo exit the program, "
"please enter the letter N."
"\nEnter Y/N: ")
# With the program's main loop ended, the user is simply prompted
# to press enter to end the program.
input("Thank you for trying out this program. "
"Press enter to exit.")