28 September, 2008

Hebrew-English online translation

Filed under: Hebrew, Technology by Joel @ 12:22 pm, 28 September 2008.

It seems Google Translate has finally added Hebrew to its canon of translated languages (along with another 35). Apparently they don’t have translation from web search enabled yet, but you can play with it (translate Dutch to Hebrew, for instance) at Google Translate.

I borrow the example text used in one reporting blog:

משטרת גרמניה עצרה שני צעירים בחשד שהתכוונו לבצע פיגוע במטוס של חברת התעופה ההולנדית קיי-אל-אם. כוחות משטרת גרמניה פשטו על המטוס שחנה בשדה התעופה בקלן, זמן קצר לפני שהמריא בחזרה להולנד והוציאו ממנו את שני הצעירים, אזרח גרמני יליד סומליה בן 24 ואזרח סומליה בן 23.

Google Translate says:

German police arrested two youths suspected Shaatcwano an attack on the plane of Dutch airline Kay – to – if. German police forces raided the plane parked at the airport Cologne, shortly before Smria Leclnde back and took him to the two young men, a German citizen born in Somalia 24 Uezarh Somalia age 23.

There are a number of interesting things here:

Assuming something is a proper name if it can’t otherwise be understood is quite a normal approach. But it’s surprising that Google has particular trouble with “שהתכוונו”, “שהמריא” and “ואזרח”, which I don’t consider particularly uncommon words. These, and the messed-up “להולנד”, all share the feature of attached prefixes (proclitics), and Google gets all of them right except “המריא” when the prefixes are removed. Obviously their word segmentation could be improved, or adjusted so that when the system is about to fall back on treating a word as a proper noun, it first goes back and checks whether there are proclitics it failed to lop off. In practice, such a feedback loop may not be worth implementing if the system needs to be fast.
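To make the feedback-loop idea concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not how Google's (statistical) system works: the tiny `LEXICON`, the `PROCLITICS` table and the function `translate_token` are all made up for this example, and I have put “המריא” in the toy lexicon even though Google itself seems not to know it.

```python
# Toy lexicon, invented for illustration; a real system would use a full
# dictionary or a statistical model rather than a hand-written table.
LEXICON = {
    "התכוונו": "intended",
    "המריא": "took off",
    "אזרח": "citizen",
    "הולנד": "Holland",
}

# Common one-letter Hebrew proclitics with rough English glosses.
PROCLITICS = {
    "ו": "and",
    "ש": "that",
    "ה": "the",
    "ב": "in",
    "כ": "as",
    "ל": "to",
    "מ": "from",
}


def translate_token(token, max_prefixes=2):
    """Look the token up whole; only if that fails, try stripping proclitics.

    Returns a gloss string, or None if nothing matched, in which case the
    caller would fall back to treating the token as a proper name.
    """
    if token in LEXICON:
        return LEXICON[token]

    glosses = []
    rest = token
    for _ in range(max_prefixes):
        # Stop if the remainder is very short or does not start with a proclitic.
        if len(rest) < 3 or rest[0] not in PROCLITICS:
            break
        glosses.append(PROCLITICS[rest[0]])
        rest = rest[1:]
        if rest in LEXICON:
            return " ".join(glosses + [LEXICON[rest]])
    return None


for word in ["שהתכוונו", "שהמריא", "ואזרח", "להולנד", "בקלן"]:
    print(word, "->", translate_token(word))
```

With this toy data the first four words come out as “that intended”, “that took off”, “and citizen” and “to Holland”, while “בקלן” still returns `None`, because “קלן” (Cologne) is not in the lexicon. The retry only runs when the plain lookup fails, so the extra cost is paid only for words that would otherwise have been guessed at as names.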

Go take a look at the proper names it forms. It puts some funny letters in there, transliterating:

  • ה ([h]) as nothing (which a lot of Israelis do, but I’m guessing that the system is being hugely biased by the silent הs at the ends of many female names);
  • ו ([v]) as “w”, maybe because “w” always translates to Hebrew in names as ו, but it makes Google look very academic (or Iraqi/Yemenite) to transliterate the vavs in words as waws.
  • כ ([k]) becomes “c”, but so does some non-existent letter in להולנד! What’s going on there?
  • ח (usu. [x]) becomes “h” (rather than “ch” or “kh”), but I guess it is only ever found when transliterating Arabic names, and Ahmed is more common than Achmed.
  • The vowels are also interesting. Especially the spurious “e” on the end of להולנד, but it’s already clear that it’s done a strange job on that one.

Kay – to – if (KLM) is obviously entertaining, but there’s not really much to say about it (except that apparently they split tokens on hyphens).
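A small sketch of what splitting on hyphens does to this token, and of one way a system could avoid it. Everything here is hypothetical: the `GLOSSES` and `NAMES` tables and both functions are invented for the example, and I am only guessing that something like the first function is what happens inside Google's system.

```python
# Toy tables, invented for illustration only.
GLOSSES = {"אל": "to", "אם": "if"}   # two very common function words
NAMES = {"קיי-אל-אם": "KLM"}         # hypothetical table of known names


def gloss(token):
    """Translate a token if it is known; otherwise mark it as unknown."""
    return GLOSSES.get(token, "<" + token + ">")


def translate_split(text):
    """Split on hyphens first, then translate each piece separately."""
    return " - ".join(gloss(part) for part in text.split("-"))


def translate_whole(text):
    """Try the whole hyphenated token as a name before splitting it."""
    return NAMES.get(text) or translate_split(text)


klm = "קיי-אל-אם"
print(translate_split(klm))
print(translate_whole(klm))
```

Splitting first leaves “קיי” unknown (which Google then transliterates as “Kay”) and turns the other two pieces into “to” and “if”; checking the unsplit token first returns “KLM” with this toy name table.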

The most interesting phrase translation is “and took him to the two young men” from “והוציאו ממנו את שני הצעירים”. It would appear as if they took the ו on the end of והוציאו as referring to the object (והוציאוֹ) rather than the subject (והוציאוּ), but seeing as the former is quite rare in contemporary written Hebrew, this may mean they have a wide variety of texts from various ages. And then ממנו seems to disappear altogether. So maybe I’ve just misinterpreted how the system makes a mistake. At the end of the day, the system is all numbers, so no one can really be certain how it made the mistake…

One of the few other online Hebrew-English translation services is Reverso:

A police of Germany stopped two young on suspicion that meant to execute an attack in the airplane of the Dutch airline KAY but them. Forces a police of Germany spreaded on the airplane that parked in the airfield Bkln, a short time before took Off back/in return to Holland and withdrew from him you two the young, German born citizen Somalia ben 24 and citizen Somalia ben23.

Comparing the two translations, we see that Reverso generally does a better job of splitting off proclitics and so makes fewer obvious mistakes. But its grammar is certainly much poorer, both in English and in Hebrew, thinking for instance that “צעירים” should be understood as an adjective rather than a noun; that one makes an attack in a plane rather than on it; that the singular משטרת should be translated “a police”; or that “את” is better translated “you” than as a direct-object marker. Compare also Google’s handling of the compound noun phrase “כוחות משטרת גרמניה” as “German police forces” rather than “Forces a police of Germany”. Also interesting is Reverso’s offering of a choice for בחזרה as “back/in return”.

Overall, while Reverso handles word segmentation somewhat better, Google has a much more fluid grammar and chooses more appropriate words in translation.

I haven’t tried translating the other direction (English to Hebrew) yet, or any other combination of languages where I would be under-qualified. I leave that as an exercise to the reader.

And no, they don’t do Yiddish yet. Real Soon Now.

Yes, it’s been a long time. Yes, I won’t be talking much here till November. Shana tova anyway! Enjoy translating your New Year cards from strange Israeli rellies…
