No, that’s not “moron machine translation”, but if you read Hugo Kornelis’ candid and very helpful comments on Tuesday’s Windows Live Translator post, you’re certainly aware that machine translation (MT) has its pitfalls.
The pun in the paragraph above is a good example of the challenges faced by MT. English-speaking readers might stifle a small chuckle (or, more likely, fail to suppress a groan and an accompanying roll of the eyes) at my play on the homophones “more on” and “moron,” but unless these terms translate to homophones in other languages (an unlikely state of affairs), the joke likely loses something, perhaps everything, in the translation.
MT works best on professionally authored content — short, crisp sentences with good grammar — and even then the results aren’t always perfect. Blogs tend not to be edited to such high standards; many of them, including this one, are authored with a strongly colloquial voice. This aspect is usually lauded as a virtue of the blogging model, but it does tend to limit the utility of MT tools run against blogs.
The good news in this scenario should come as no surprise to those of us in the database business. An MT application is only as good as its underlying databases, and one of the Windows Live Translator team’s motivations in releasing these products for public consumption at this stage is to get more data for their databases, which is just the sort of feedback Hugo offers. They’ve set up a feedback website that reports directly to the Live Translator team; I’ve added a link to the site under my Windows Live Translator link to the left (for you web-based viewers). I’ve also added a Machine Translation tag to the blog to group this conversation.
The Windows Live Translator team is actively pursuing sources of data for their databases. They already get monthly refreshes of localized content from Microsoft product groups, and there are plans afoot to set up a sort of “translation Wiki” where customers can suggest translations. Once a sufficient number of solid translations have been collected for a term, the statistical favorite will be incorporated into the translator. Two of the words Hugo mentions, aardigheidje and giechel, will be included in the next Dutch refresh.
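For the database-minded, here is a minimal sketch of what picking a “statistical favorite” from customer suggestions might look like. This is my own illustration, not the team’s actual method: the function name `pick_statistical_favorite`, the `min_votes` and `min_share` thresholds, and the sample suggestions are all hypothetical.

```python
from collections import Counter


def pick_statistical_favorite(suggestions, min_votes=5, min_share=0.6):
    """Return the most-suggested translation, or None if the evidence is too thin.

    min_votes and min_share are assumed thresholds, chosen only for illustration.
    """
    # Normalize lightly so "Giggle " and "giggle" count as the same suggestion.
    cleaned = [s.strip().lower() for s in suggestions if s.strip()]

    # Not enough suggestions collected yet: keep waiting.
    if len(cleaned) < min_votes:
        return None

    # Find the single most common suggestion and how often it appeared.
    winner, count = Counter(cleaned).most_common(1)[0]

    # Require a clear majority before accepting it.
    if count / len(cleaned) < min_share:
        return None
    return winner


# Hypothetical suggestions submitted for a single source term.
votes = ["giggle", "giggle", "Giggle ", "chuckle", "giggle", "giggle"]
favorite = pick_statistical_favorite(votes)
```

With these sample votes, five of the six suggestions agree, so the favorite is accepted; a handful of votes or an even split would return `None` and leave the term waiting for more data.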
Windows Live Translator is going to perform better and better as its underlying databases are populated. It’s just going to take time, and forthright feedback such as Hugo’s only hastens the process. I’ve given feedback of my own, which was graciously received by the team.
My last name is also a noun, and a component of place names, in my native language. My full name as rendered here is also a place name in at least two US states (there appear to be two Ward Ponds in Massachusetts alone). A name that doubles as an ordinary word like this challenges current MT technology, which has to decide whether to translate it or leave it alone.
My brief testing of the technology before I posted on Tuesday showed that it had trouble with the words “ruminating,” “appendectomy,” and “telecommute.”
As the databases grow, I’m sure that these issues will subside, although I’m quite curious to see how they’ll deal with the whole names-as-proper-nouns situation.
Thanks to Hugo for his interest and comments. I urge all of you fluent in any of the supported languages (German, French, Italian, Spanish, Portuguese, Chinese Simplified, Chinese Traditional, Japanese, Korean, Arabic, and Netherlands Dutch) to liberally exercise both the technology and the feedback page.
Remember, based on the state of the underlying database for your selected language, your mileage may vary. Think of yourself as a beta tester for your language, with commensurate privilege to raise issues and concerns with the team. Based on my experience with them, they’ll jump on the chance to make their product better.
When this is working up to par, think of the extended reach this can give the web! It will never replace the loving touch of a person fluent in both the source and destination languages, but it has strong potential to further democratize the web, and that can’t be a bad thing.
Moron.. oops.. I mean.. More on this topic as events unfold.