On Wednesday, I attended a Joshua tutorial hosted by Matt Post. As a key developer of the Joshua Decoder, he was very helpful in building my understanding of the decoder and the overall pipeline. He helped me set up my account with the Human Language Technology Center of Excellence so I could access its large store of translation data. He then gave me a sample of Spanish-to-English translation data. This data set was much larger than the one I used for my small experiment last week and therefore took much longer to process and decode. With the larger data set, my resulting BLEU score of .6151 was also much higher.
After looking at the sentence-by-sentence.html analysis, I noticed some interesting things about the translation, specifically cases where the BLEU score does not accurately reflect the quality of a sentence. In every sentence containing an apostrophe, the spacing in the translated sentence differed from the spacing in the reference sentence. For example:
reference sentence: “ah , well , look at that . i ‘ m from”
output sentence: “ah , well , look at that . i ‘m from”
The words in the two sentences are identical, yet the output sentence did not receive a score of 1 but instead .7612, because ‘i ‘ m’ and ‘i ‘m’ are treated as different token sequences when computing the score. In this example the BLEU score was still quite high, but in other sentences with the same ‘i’m’ problem the score dropped to 0, because the mismatch fell in the middle of the sentence and prevented any matching 4-gram. Normalizing the output so the spacing around apostrophes matches the references would raise the overall BLEU score.
Another miscalculation in the BLEU score occurs when a sentence is too short to contain a 4-gram: the 4-gram precision comes out as 0, which drives the overall BLEU score to 0. Because of these miscalculations, the translations are actually better than the .6151 BLEU score indicates.
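To make both failure modes concrete, here is a minimal sketch of unsmoothed sentence-level BLEU (the textbook formula, not Joshua's actual scoring code, so the numbers will not exactly match the scores above): a spacing mismatch like ‘i ‘ m’ versus ‘i ‘m’ breaks several n-grams, and any sentence shorter than four tokens scores 0 outright.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference:
    geometric mean of the 1..4-gram modified precisions, times a
    brevity penalty. One zero precision zeroes the whole score."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Tokenization mismatch: "i ' m" vs "i 'm" breaks several n-grams.
ref = "ah , well , look at that . i ' m from".split()
out = "ah , well , look at that . i 'm from".split()
print(sentence_bleu(ref, out))   # below 1.0 despite identical words

# A sentence with fewer than four tokens has no 4-grams at all,
# so its 4-gram precision -- and the whole score -- is 0.
short = "let 's see".split()
print(sentence_bleu(short, short))  # 0.0 even for a perfect match
```

The second print is the short-sentence problem in its purest form: the candidate matches the reference exactly, yet the score is 0, which is why smoothed or clipped variants are often used for sentence-level reporting.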
There were also some words that were consistently mistranslated, such as the Spanish phrase ‘a ver’:
reference sentence: “let ‘ s see if you go for the second marriage.”
output sentence: “to see if you go for the second marriage.”
google translate: “to see if it is on the second marriage.”
Google Translate was also unable to translate the phrase successfully. In this example, Joshua’s output sentence retained more of the meaning than the Google translation. The example above received a BLEU score of .8801, which is very high even though the meaning of ‘let’s see’ is much different from ‘to see’.
The inclusion of four reference sentences, as opposed to the single reference in the small experiment, should result in a higher BLEU score, because the output sentence can match any of the paraphrased references. This counteracts misleadingly low BLEU scores when the output sentences are correct paraphrases of the references.
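A sketch of why multiple references help, assuming the standard clipped n-gram precision used by BLEU; the second reference below is a hypothetical paraphrase I made up for illustration, not one from my data:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(references, candidate, n):
    """Modified n-gram precision against several references: each
    candidate n-gram count is clipped by its MAXIMUM count across
    all references, so any one paraphrase can license a match."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    overlap = sum(min(c, max_ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "to see if you go for the second marriage .".split()
ref1 = "let 's see if you go for the second marriage .".split()
# Hypothetical paraphrase reference that happens to use "to see":
ref2 = "to see whether you go for the second marriage .".split()

print(clipped_precision([ref1], cand, 1))        # "to" unmatched: 0.9
print(clipped_precision([ref1, ref2], cand, 1))  # paraphrase licenses it: 1.0
```

With only one reference, the unigram ‘to’ finds no match; adding a paraphrase that uses it lifts the precision, which is exactly the effect the four-reference setup should have on the overall score.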