Translation: Give more context for auto-translate #447

Open
benbucksch opened this issue Feb 19, 2025 · 15 comments
@benbucksch
Collaborator

Problem

When we send a string to an auto-translator, we need to give it more context about that string. Otherwise, it cannot translate correctly. The translations are generally really good, but 1-word strings often go wrong.

For example:

  • "End time" -> translated as "Apocalypse" in German
  • "Decline" -> translated as "Going down" in German. The intended meaning, in the context of a calendar invitation, was "Refuse".
  • "Quote" -> translated in many languages as "to give a price". The meaning was "to cite".

In most other cases, where the string had more words and therefore carried its own context, it was auto-translated correctly. So context makes a big difference.

I understand that it's difficult to give the context of the strings in the same dialog, because the "extract strings" step and the hash IDs of the strings have already removed the entire context.

Approach

So, is there a way to tell json-autotranslate to add some context to the string? E.g. the other strings around it?

Solutions - Ideas

1. IDs with code file name

It might be sufficient to generate IDs that include the code file name and file path, send that ID along with the string for translation, and then remove it from the translated result. Maybe the file name and path already give the translator enough context. E.g. instead of sending Decline and getting Niedergang, send "Calendar/Invitation/Display#hzlsh0" : "Decline", which is translated as "Kalender/Einladung/Anzeige#something" : "Ablehnen" -> (remove context) "Ablehnen" -> add that as the translation.

Disadvantage: Changing the string IDs will significantly increase the size of the translation files in the distribution, and the number of characters to be translated, and therefore the price. I worry more about the size of the app distribution.
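
A minimal sketch of the "add context, translate, strip context" round trip from idea 1. The helper names and the " : " separator are assumptions made for illustration; they are not part of json-autotranslate.

```typescript
// Hypothetical helpers: names and separator layout are illustrative assumptions.
const CONTEXT_SEPARATOR = " : ";

/** Prefix a source string with its context path and ID before sending it for translation. */
function withContext(contextPath: string, id: string, text: string): string {
  return contextPath + "#" + id + CONTEXT_SEPARATOR + text;
}

/**
 * Remove everything up to and including the first separator.
 * If the separator did not survive translation, return the input unchanged.
 */
function stripContext(translated: string): string {
  const index = translated.indexOf(CONTEXT_SEPARATOR);
  return index === -1
    ? translated
    : translated.slice(index + CONTEXT_SEPARATOR.length);
}
```

The context path is only ever present in the request, so the stored translation files would not grow. The number of translated characters, and therefore the price, still would.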

2. Find source code file and context from ID

It would be much better if the auto-translation engine could take the IDs as we have them right now, go back to the English string, find that string in the source code, then find the other strings in the same source file, and give those strings as context. For example:

  1. German: "hztle": ""
  2. -> English: "hztle": "Decline"
  3. Find source file: Search for "tDecline" and "gtDecline" in source code. Find the source code file name.
  4. Find context: Find other t and gt strings in the dialog.
  5. Build translation source string: "Decline" ### "Accept" "Maybe" "Time" "Topic" "Description"
  6. Send that to auto-translation
  7. Get back "Ablehnen" ### "Annehmen" "Vielleicht", "Zeit", "Thema", "Beschreibung"
  8. Extract the first string "Ablehnen"
  9. Put that into translation: German: "hztle": "Ablehnen"

If you want to optimize it, you can combine multiple strings that are untranslated.
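
Steps 5 to 8 above could look roughly like this. It is only a sketch: the function names are hypothetical, and it assumes the translation service returns the "###" marker unchanged.

```typescript
// Hypothetical helpers for steps 5-8; assumes "###" survives translation.
const CONTEXT_MARKER = "###";

/** Step 5: put the target string first, then the neighbouring strings as context. */
function buildContextRequest(target: string, neighbours: string[]): string {
  const quoted = neighbours.map((s) => '"' + s + '"');
  return '"' + target + '" ' + CONTEXT_MARKER + " " + quoted.join(" ");
}

/** Step 8: take the part before the marker and strip surrounding quotes. */
function extractTarget(response: string): string | null {
  const markerIndex = response.indexOf(CONTEXT_MARKER);
  if (markerIndex === -1) {
    return null; // marker lost: the caller should retry without context
  }
  const head = response.slice(0, markerIndex).trim();
  const unquoted = head.replace(/^["„“«]+|["“”»]+$/g, "").trim();
  return unquoted.length > 0 ? unquoted : null;
}
```

Returning null instead of guessing lets the caller fall back to translating the bare string when the response cannot be parsed.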

3. Send multiple strings at once

Given that we likely add all the strings in the same dialog at the same time, a simple solution might be to not send individual strings one-by-one to the auto-translator, but to send multiple strings at the same time.

  1. We have new untranslated strings "Decline" "Accept" "Maybe" "Time" "Topic" "Description"
  2. Concatenate them and send Decline | Accept | Maybe | Time | Topic | Description (with these exact string separators)
  3. Get back Ablehnen | Annehmen | Vielleicht | Zeit | Thema | Beschreibung
  4. Parse the separators, and verify that we get back the same number of strings that we sent. If not, fall back to individual strings.
  5. Put them into the translation.
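
The batching steps above can be sketched as follows. The `translate` parameter stands in for whatever service call is used and is shown as a synchronous function for brevity (a real API call would be asynchronous); all names here are assumptions.

```typescript
// Stand-in for the real translation call; an assumption for illustration.
type TranslateFn = (text: string) => string;

const SEPARATOR = " | ";

/** Translate several strings in one request, falling back to one-by-one requests. */
function translateBatch(strings: string[], translate: TranslateFn): string[] {
  if (strings.length === 0) {
    return [];
  }
  // A source string that already contains "|" would corrupt the batch.
  const safe = strings.every((s) => s.indexOf("|") === -1);
  if (safe) {
    const parts = translate(strings.join(SEPARATOR))
      .split("|")
      .map((p) => p.trim());
    // Step 4: verify that we got back as many strings as we sent.
    const complete =
      parts.length === strings.length && parts.every((p) => p.length > 0);
    if (complete) {
      return parts;
    }
  }
  // Fallback: translate each string individually.
  return strings.map((s) => translate(s));
}
```

The count check is what makes the approach safe: if the service merges, drops, or translates a separator, the batch result is discarded rather than written to the wrong IDs.
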
@benbucksch
Collaborator Author

Please try Solution 3 first, then 2, and 1 only as fallback when everything else fails.

@jermy-c
Collaborator

jermy-c commented Feb 20, 2025

Problems

  1. Even if you manage to translate the strings with different variations, the hash ID would be the same for all variations
  2. You would need a custom function that records the fileName during extraction, and whether the string is a single word
  3. The format of the context would need to be supported by the script that submits the strings for translation
  4. The script would need a feature to group single-word strings that appear in the same dialog
  5. The extra context would make the file in the repo much larger, but it can be removed while compiling
  6. If every string had its own hash and the fileName context, the file would become really large, since there are instances where the same string is used in multiple places even though the translation is exactly the same

Solutions

Tolgee
Tolgee has a feature to group strings in the same dialog and submit them together for translation

Custom Lib
We would need the t function to know when the string is a single word, so that it generates a hash based on string+fileName.
Then the extractor would also need to know the hash and output a format compatible with the auto-translate script. And the auto-translate script would need a feature to handle that format.
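
As a rough illustration of the string+fileName idea, the sketch below derives a different ID for a single-word string depending on the file it appears in, while multi-word strings keep one shared ID. The djb2-style hash and all function names are assumptions; the project's actual hashing scheme is not shown in this thread.

```typescript
/** Simple djb2-style string hash (xor variant), kept within 32 bits. Illustrative only. */
function djb2(input: string): string {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    hash = ((hash * 33) ^ input.charCodeAt(i)) >>> 0;
  }
  return hash.toString(36);
}

/** True when the string contains no whitespace between words. */
function isSingleWord(text: string): boolean {
  return text.trim().split(/\s+/).length === 1;
}

/** Hypothetical ID scheme: only single-word strings include the file name in the hash. */
function messageId(text: string, fileName: string): string {
  const key = isSingleWord(text) ? text + "\n" + fileName : text;
  return djb2(key);
}
```

Limiting the file name to single-word strings keeps the duplication described in problem 6 confined to the strings that are most likely to be ambiguous.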

Tweak Settings
Use the Lingui style output, which already has the fileName, but the hash ID is still an issue. Or add the fileName as a comment in the t function if the string is a single word; then the hash would be different. We'd still need to modify the auto-translate script to support grouping the strings together.

Merge the Extractor and Auto Translate scripts together
json-autotranslate would also need to contain the extractor, so that it could go into the source code and search for the string, and there would also be the need for the extra context in the message files.

@jermy-c
Collaborator

jermy-c commented Feb 26, 2025

Solution 3 seems to be working well.

@benbucksch
Collaborator Author

benbucksch commented Feb 26, 2025

Solution 3 = "Send multiple strings at once" or "Tweak Settings" ?

@jermy-c
Collaborator

jermy-c commented Feb 26, 2025

Solution 3 = "Send multiple strings at once" or "Tweak Settings" ?

Send multiple strings at once

@jermy-c
Collaborator

jermy-c commented Feb 26, 2025

The process would be:

  1. Extract strings from source code into messages.template.json, in this format:
{
  "Options": {
    "abcd": "A user-readable string"
  }
}
  2. Auto-translate uses the above format to translate with context, and outputs messages.json:
{
  "abcd": "A user-readable string"
}
  3. Since messages.template.json is not imported, it would not be bundled, which keeps the bundle smaller.

So the files in the repo would be:

  1. en/messages.template.json - the template file with format grouping by dialog
  2. en/messages.json - the source language file that would be imported and bundled
  3. {locale}/messages.json - the translations for target languages, it would be imported and bundled
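
Deriving the flat messages.json from the grouped template could be as simple as the sketch below. The type and function names are assumptions; only the two JSON shapes come from the process described above.

```typescript
// Shapes taken from the two JSON examples; the names are illustrative.
type FlatMessages = { [id: string]: string };
type GroupedTemplate = { [dialog: string]: FlatMessages };

/** Collapse the per-dialog grouping into the flat ID -> string map that gets bundled. */
function flattenTemplate(template: GroupedTemplate): FlatMessages {
  const flat: FlatMessages = {};
  for (const dialog of Object.keys(template)) {
    const group = template[dialog];
    for (const id of Object.keys(group)) {
      // The same ID in two dialogs means the same source string, so overwriting is harmless.
      flat[id] = group[id];
    }
  }
  return flat;
}
```

For example, `flattenTemplate({ Options: { abcd: "A user-readable string" } })` returns `{ abcd: "A user-readable string" }`.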

@benbucksch
Collaborator Author

That's great! Do it.

@benbucksch
Collaborator Author

@jermy-c What's the status here? Is this done? Or still TODO?

@benbucksch
Collaborator Author

BTW: Your fix to not send huge files to auto-translate worked. I was now able to translate only a few strings, with a minimal number of characters sent to the API. That's great, thank you! This means that we can run the Auto-Translate script more often.

@jermy-c
Collaborator

jermy-c commented Mar 14, 2025

@jermy-c What's the status here? Is this done? Or still TODO?

No, it is not done. It might require a big change to json-autotranslate: it already supports a context, but only a single one for the entire translation file/app. It also requires changes at the json-autotranslate script level and then at the DeepL API service level. I've started it, but it's not finished yet.

@benbucksch
Collaborator Author

benbucksch commented Mar 14, 2025

OK, thanks for the update. Let's put this on hold for the moment. This is important long-term, but not short-term.

Is there something valuable in what you already did? If so, can you push what you did to a branch, and post it here? And briefly describe (mostly for yourself, not for me) what your implementation plan is and what you already finished, so it's easier for you to pick up later? If you didn't do much yet (that's OK), then never mind.

@jermy-c
Collaborator

jermy-c commented Mar 14, 2025

You're welcome.

Sorry, I implemented the plan to group words together, not the comment/context one.

Implementation Plan

  1. Batch strings without context together for translation (that is how it works now; keep it that way), to make it faster and avoid hitting the rate limit
  2. Send strings with context individually, because you can only send one context per request
  3. Take care not to break the script, since all services use the same request structure
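
A sketch of how the plan above might split the work into requests. The `SourceString` and `TranslationRequest` shapes and the `planRequests` name are hypothetical and do not reflect the actual json-autotranslate request structure.

```typescript
// Hypothetical shapes; not the real json-autotranslate request structure.
interface SourceString {
  id: string;
  text: string;
  context?: string;
}

interface TranslationRequest {
  ids: string[];
  texts: string[];
  context?: string;
}

/** Batch context-free strings; give every string with context its own request. */
function planRequests(strings: SourceString[], batchSize: number): TranslationRequest[] {
  const size = Math.max(1, Math.floor(batchSize)); // guard against 0 or negative sizes
  const requests: TranslationRequest[] = [];

  const withoutContext = strings.filter((s) => !s.context);
  for (let i = 0; i < withoutContext.length; i += size) {
    const chunk = withoutContext.slice(i, i + size);
    requests.push({ ids: chunk.map((s) => s.id), texts: chunk.map((s) => s.text) });
  }

  for (const s of strings) {
    if (s.context) {
      requests.push({ ids: [s.id], texts: [s.text], context: s.context });
    }
  }
  return requests;
}
```

Keeping `context` optional on the request means services that ignore it can share the same request structure, which is the concern in step 3.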

@benbucksch
Collaborator Author

benbucksch commented Mar 26, 2025

Hey Jeremy, you have implemented commentSeparator = *=>. This is useful to disambiguate terms, e.g. an email has been "read" vs. "Read"ing. I've committed some such comments.

Unfortunately, the comment doesn't seem to be considered by the auto-translation.

I see that messages.template.json has these comments as a description property. Is that submitted to the machine translation? If not, could you make that happen? If so, why does it have no effect on the translation?

Another way is to simply submit our string, including the *=> comment, to the machine translation, and strip the comment after the translation, before the string is written into the translated file messages.json. At runtime, when reading messages.json, we then no longer need to strip the comment, saving a tiny bit of runtime.
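
The "submit with comment, strip afterwards" idea might look like this. It is a sketch under two assumptions: that the comment follows the *=> separator, and that `translate` stands in for the real (asynchronous) service call.

```typescript
const COMMENT_SEPARATOR = "*=>";

/** Keep only the part before the separator; assumes the comment comes after it. */
function stripComment(text: string): string {
  const index = text.indexOf(COMMENT_SEPARATOR);
  return index === -1 ? text : text.slice(0, index).trim();
}

/** Translate a string together with its comment, then drop the comment from the result. */
function translateWithComment(
  source: string,
  translate: (text: string) => string
): string {
  if (source.indexOf(COMMENT_SEPARATOR) === -1) {
    return translate(source);
  }
  const translated = translate(source);
  if (translated.indexOf(COMMENT_SEPARATOR) === -1) {
    // The separator was lost, so the comment cannot be located reliably.
    // Fall back to translating the bare string.
    return translate(stripComment(source));
  }
  return stripComment(translated);
}
```

Whether a machine translator really uses the trailing comment as context, and leaves the separator intact, would need to be checked against the actual service.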

@benbucksch
Collaborator Author

benbucksch commented Mar 26, 2025

@jermy-c This is important. I have huge problems in the translations, due to lack of context.

  • "Read" (message is read) was translated in German as 'You shall read it'
  • "Decline" (the meeting) was translated in many languages as "Going down"
  • Merely to work around the bad translations, I changed "Decline" to "Refuse" (the meeting), and that was translated in a few languages as "Trash" 🤦 👊🤡 🤦

Without context or manual human translators, there's no way out of this. This is urgent.

Bad translations are highly embarrassing and give the impression of a very low quality product.
