Why Evaluating Machine Translation Quality is Hard

Imagine it’s your first time in Beijing, and you sit down at a restaurant to get something to eat. The waiters do not speak English, and the menu is all in Chinese, with no pictures. Luckily, you have Waygo installed on your phone! You point it at some Chinese text:

白菜香菇炒年糕 bái cài xiāng gū chǎo nián gāo

Sauteed Glutinous Rice Cake with Black Mushrooms and Chinese Cabbage

Sounds amazing! What else do they have?

如意汉堡 rú yì hàn bǎo

As One Wants Burger

Hmm. Sounds good, but something is a bit off in that translation. Let’s try another one:

老家担担面(辣) lǎo jiā dàn dàn miàn (là)

Native Place Noodles in Chili Sauce, Sichuan Style (Spicy)

Also sounds good, but where is this “Native Place”?

You don’t need to have used Waygo to know this as a familiar experience. If you have ever used an online translation service, like Google Translate or Bing Translator, you have probably come across some wonky translations that had you scratching your head to figure out the meaning.

At Waygo, we are proud to say the vast majority of the time when you point our app at a menu, the translations are both sensible and fluent. But we do spend a lot of time thinking about how to improve the cases where translations are not-so-fluent. Since this involves tweaking complex machine translation algorithms, we need a way to ensure the changes we make nudge us ever-closer to human-level translations, not further away. But how does one measure, objectively, how well a sentence is translated?

One way to measure this would be to have a human judge the accuracy and fluency of the translations, one by one. Considering there are hundreds of thousands of possible food items on Chinese menus, this is a time-consuming task. We could regularly have humans evaluate translations of a small sample of dishes, but even with a sample of, say, 1,000 dishes, it would still take someone a couple of hours to finish this mind-numbingly boring task. Even then, how do we compare one set of evaluations with a previous one? We need a way to evaluate translations that is fast, objective and automatic. Luckily, computer scientists working on machine translation have long been wondering about this exact problem. The most popular solution, which we also use at Waygo, is called BLEU.

Measuring Translation Quality

BLEU (an acronym for Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one language to another. The idea behind BLEU is that “the closer a machine translation is to a professional human translation, the better it is”. This sounds about the same as our first idea above: let a translator evaluate the quality every time. The key difference is that before a machine gets involved, BLEU first has professional human translators translate the source text, the results of which we then call the “reference translations”. Now, every time we want to evaluate the quality of a machine translation algorithm, we give it the same list of source sentences, and compare its translations to the reference translations. The closer it is to the reference translations, the better the score. Simple, right?

Hang on. Just how does one compare one (machine) translation with a number of reference translations? This is where BLEU helps us out, and running through an example will help. Let’s go back to our burger example, 如意汉堡, which Waygo translated as “As One Wants Burger”. Give this to two human translators, and they might instead come up with these translations: “Burger made as you like it” and “Hamburger with anything you want”.

Now, to compare Waygo’s translation with the reference translations, the BLEU algorithm starts by counting the number of times each word in the Waygo translation appears in at least one reference translation.

“As” appears in one of the references, so that’s 1. “One” does not appear in either reference, so that’s 0. “Wants” does not appear, so another 0. And finally, “Burger” appears in one of the references, so that’s 1. Add them up, and we get 1 + 0 + 0 + 1 = 2. We now divide that by the number of words in the candidate sentence, and in BLEU parlance, obtain a unigram precision of 2 / 4 = 0.5.
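The word counting above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name `unigram_precision` is made up for this example, and it assumes words are compared case-insensitively after splitting on whitespace. Note that the original BLEU definition also caps how often a repeated word may be counted; that detail does not matter for this sentence and is left out here.

```python
def unigram_precision(candidate, references):
    """Return per-word hits and the fraction of candidate words
    that appear in at least one reference translation."""
    # Assumption: lowercase and split on whitespace; a real tokenizer does more.
    reference_words = set()
    for reference in references:
        reference_words.update(reference.lower().split())

    words = candidate.lower().split()
    hits = [1 if word in reference_words else 0 for word in words]
    return hits, sum(hits) / len(words)


references = ["Burger made as you like it", "Hamburger with anything you want"]
hits, precision = unigram_precision("As One Wants Burger", references)
print(hits, precision)  # [1, 0, 0, 1] 0.5
```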

(“Unigram”, in this context, simply means a single word. Similarly, a “bigram” is a sequence of two words, a “trigram” is a sequence of three words, and from there on it becomes “4-gram”, “5-gram”, etc.)

Continuing in this fashion, we can repeat this process for pairs of words, i.e. “As One”, “One Wants” and “Wants Burger”, and compare these pairs to the pairs in the reference translations. In this case, the bigram precision is 0. The trigram precision and 4-gram precisions are also 0.

The BLEU algorithm now allows us to decide how much weight each of these n-gram precisions should have when calculating the final score between 0 and 1. It is common practice to view them all as equally important, so in this simplified illustration the final score is their plain average: (0.5 + 0 + 0 + 0) / 4 = 0.125. (The original BLEU definition combines the precisions with a geometric mean and adds a penalty for overly short translations, but the simple average is enough to convey the idea.) Is that good or bad? A score only makes sense when you compare it against the score of another machine translation. Google Translate, for example, translates the same text as “Ruyi Hamburg”. With our two reference sentences, this achieves a score of 0, telling us that the Waygo translation is, according to BLEU, slightly better.
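To make the arithmetic concrete, here is a rough Python sketch of the calculation as walked through in this post: n-gram precisions for n = 1 to 4, combined with equal weights as a plain average. The helper names (`ngrams`, `ngram_precision`, `simple_score`) are invented for this example, matching is case-insensitive on whitespace-split words, and a candidate too short to contain any n-grams of a given size is assumed to score 0 for that size.

```python
def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_precision(candidate, references, n):
    """Fraction of the candidate's n-grams found in at least one reference."""
    candidate_ngrams = ngrams(candidate.lower().split(), n)
    if not candidate_ngrams:
        return 0.0  # assumption: no n-grams of this size means no credit
    reference_ngrams = set()
    for reference in references:
        reference_ngrams.update(ngrams(reference.lower().split(), n))
    matches = sum(1 for gram in candidate_ngrams if gram in reference_ngrams)
    return matches / len(candidate_ngrams)


def simple_score(candidate, references, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted average of the 1-gram to 4-gram precisions (simplified scoring)."""
    precisions = [ngram_precision(candidate, references, n)
                  for n in range(1, len(weights) + 1)]
    return sum(w * p for w, p in zip(weights, precisions))


references = ["Burger made as you like it", "Hamburger with anything you want"]
waygo_score = simple_score("As One Wants Burger", references)
google_score = simple_score("Ruyi Hamburg", references)
print(waygo_score, google_score)  # 0.125 0.0
```

Changing `weights` is all it takes to emphasise, say, single words over longer phrases.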

The BLEU score is used widely in the machine translation world to compare machine translation results. We applied it to a single sentence here, but since it is just a matter of calculation after the initial human translation, it is commonly run on thousands of sentences, for which we can then calculate the average BLEU score.

There are lots of possible variations on the BLEU score, and at Waygo we use a slight modification called Smoothed BLEU (Lin and Och 2004). You may also object by saying that the way BLEU works, comparing sets of words with reference translations, does not necessarily mean that the final translation will be fluent. Worse still, a professional human translator might not even get a full score! These are valid concerns, but numerous studies have found the BLEU score to correlate very well with human judgement of translation quality, and it is one of the best known methods.
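For the curious, here is a sketch of one common form of smoothing that is usually attributed to Lin and Och: add 1 to both the match count and the total count for every n-gram size above 1, so that a single missing 4-gram no longer drags the whole score to zero. This is an illustration under assumptions, not a description of our library: the function names are hypothetical, the precisions are combined with a geometric mean, repeated n-grams are capped at their highest count in any one reference, and the penalty for short translations is left out for brevity.

```python
import math
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def smoothed_precisions(candidate, references, max_n=4):
    """Capped n-gram precisions with add-one smoothing for n > 1."""
    candidate_words = candidate.lower().split()
    reference_words = [reference.lower().split() for reference in references]
    precisions = []
    for n in range(1, max_n + 1):
        candidate_counts = ngram_counts(candidate_words, n)
        # Highest count of each n-gram in any single reference.
        max_reference = Counter()
        for words in reference_words:
            for gram, count in ngram_counts(words, n).items():
                max_reference[gram] = max(max_reference[gram], count)
        matches = sum(min(count, max_reference[gram])
                      for gram, count in candidate_counts.items())
        total = sum(candidate_counts.values())
        if n > 1:  # add-one smoothing, applied to bigrams and longer only
            matches, total = matches + 1, total + 1
        precisions.append(matches / total if total else 0.0)
    return precisions


def geometric_mean(values):
    """Geometric mean; zero if any value is zero."""
    if min(values) == 0:
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


references = ["Burger made as you like it", "Hamburger with anything you want"]
precisions = smoothed_precisions("As One Wants Burger", references)
score = geometric_mean(precisions)
print(precisions, round(score, 2))
```

For the burger example, the smoothed precisions come out as 1/2, 1/4, 1/3 and 1/2, so the sentence gets a non-zero score even though none of its word pairs match a reference.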

Using BLEU to Compare Machine Translation Systems

Now that we know how the BLEU score works, let’s take it for a test drive! We assembled a collection of about 1,000 sentences in two languages, Chinese and Japanese, and across three different categories:

  • menus and foods,
  • chat conversations, and
  • idioms and expressions.

These categories represent sentences that we thought would range from easy to difficult for most machine translation systems, though this does depend on what the designers optimize for. We then assembled reference sentences for each sentence. Below is one example sentence from each category in Chinese:

盐烤鸡肉大葱串 Salt Baked Chicken and Leek Skewers (Food)

我给你讲讲这段经历 I will tell you… (Chat)

全体人员凝成一股劲 Everyone is of one mind (Idiom)

Now we can put three popular machine translation systems to the test on the three categories: our very own Waygo Translator, Google Translate, and Microsoft’s Bing Translator. Here is how each fared on the sentences above:

盐烤鸡肉大葱串 Salt Baked Chicken and Leek Skewers

Waygo: Salt baked chicken with leek kebabs (BLEU: 0.8409)

Google: Grilled chicken green onions salt string (BLEU: 0.7598)

Bing: Salt roasted chicken with onion strings (BLEU: 0.7598)

我给你讲讲这段经历 I will tell you…

Waygo: I to you speak this experience (BLEU: 0.6389)

Google: I tell you about this experience … (BLEU: 0.7311)

Bing: I tell you about the experience … (BLEU: 0.7311)

全体人员凝成一股劲 Everyone is of one mind

Waygo: Crew become one (BLEU: 0.3901)

Google: Crew cemented vigor (BLEU: 0.0000)

Bing: All cemented himself wholeheartedly (BLEU: 0.0000)

The BLEU score does quite well at estimating the fluency of each machine translation output. In these examples, the most fluent translation invariably ended up with the highest score. Applied to our full list of 1,213 sentences, the average BLEU scores are shown in the tables below:

System   Food      Chat      Idioms
Waygo    0.6301    0.3641    0.3592*
Google   0.6809*   0.4543*   0.3554
Bing     0.6267    0.4438    0.3526

Table 1: Average BLEU scores for different machine translation systems translating from Chinese to English. The highest score for each category is marked with an asterisk.

System   Food      Chat      Idioms
Waygo    0.8300*   0.2584    0.3464
Google   0.7352    0.2753    0.4486*
Bing     0.6893    0.3283*   0.4334

Table 2: Average BLEU scores for different machine translation systems translating from Japanese to English. The highest score for each category is marked with an asterisk.
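If you want to pick out the winners programmatically, a few lines of Python will do. The sketch below simply re-enters the numbers from the two tables; the dictionary layout and the helper name `best_per_category` are our own choices for this example.

```python
# Average BLEU scores copied from Table 1 (Chinese) and Table 2 (Japanese).
scores = {
    "Chinese": {
        "Waygo": {"Food": 0.6301, "Chat": 0.3641, "Idioms": 0.3592},
        "Google": {"Food": 0.6809, "Chat": 0.4543, "Idioms": 0.3554},
        "Bing": {"Food": 0.6267, "Chat": 0.4438, "Idioms": 0.3526},
    },
    "Japanese": {
        "Waygo": {"Food": 0.8300, "Chat": 0.2584, "Idioms": 0.3464},
        "Google": {"Food": 0.7352, "Chat": 0.2753, "Idioms": 0.4486},
        "Bing": {"Food": 0.6893, "Chat": 0.3283, "Idioms": 0.4334},
    },
}


def best_per_category(table):
    """Map each category to the system with the highest average score."""
    categories = next(iter(table.values())).keys()
    return {category: max(table, key=lambda system: table[system][category])
            for category in categories}


winners = {language: best_per_category(table) for language, table in scores.items()}
print(winners["Chinese"])   # {'Food': 'Google', 'Chat': 'Google', 'Idioms': 'Waygo'}
print(winners["Japanese"])  # {'Food': 'Waygo', 'Chat': 'Bing', 'Idioms': 'Google'}
```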

These tables illustrate that machine translation systems generally fare about the same in each category: food translations are generally good, but the systems struggle to make any sense of idioms, and don’t fare particularly well in Japanese chat conversations either. Waygo emerges the winner when it comes to Japanese food translations, and places a close second in the Chinese food category. Not bad for an app that doesn’t require an Internet connection!


This was a quick intro to how we measure translation quality at Waygo. If you would like to learn more, see the Wikipedia links below. We also open-sourced our library for calculating the BLEU score; feel free to check it out on GitHub and perform your own translation quality tests.


Herman Schaaf