Getting into the Data

I am now moving on to paraphrasing tasks and away from the two translation experiments that I ran and analyzed (the very small English-to-Spanish translation experiment and the larger English-to-Spanish translation from the Joshua tutorial).

Juri gave me a large amount of paraphrasing input data and asked me to sort it by compression ratio. I wrote a Python script that takes the input (formatted as SOURCE ||| TARGET ||| ALIGNMENTS), calculates the compression ratio by number of words and by number of characters, and stores the output in a file (formatted as SOURCE ||| TARGET ||| ALIGNMENTS ||| word-compression-ratio ||| character-compression-ratio). The compression ratio by words is word-length-of-target / word-length-of-source; likewise, the compression ratio by characters is character-length-of-target / character-length-of-source. Whenever the target sentence is longer than the source sentence (by number of words), the script swaps the two sentences and also swaps all of the alignments, so every word compression ratio falls between 0 and 1. This matters for paraphrasing because we want the machine translation system to learn and output shorter, more compact sentences. I then sorted the sentence pairs by word compression ratio and stored the sorted data in another text file.
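A minimal sketch of that annotation step is below. It is not the original script. The names annotate and sort_by_word_ratio are made up for illustration, and it assumes that alignments are space-separated "i-j" index pairs, that words are whitespace-separated tokens, and that character length counts every character including spaces.

```python
def annotate(line):
    """Return 'SOURCE ||| TARGET ||| ALIGNMENTS ||| word-ratio ||| char-ratio'.

    Returns None for malformed lines or empty sentences.
    """
    fields = [f.strip() for f in line.split("|||")]
    if len(fields) != 3:
        return None  # not SOURCE ||| TARGET ||| ALIGNMENTS
    source, target, alignments = fields
    src_words, tgt_words = source.split(), target.split()
    if not src_words or not tgt_words:
        return None  # avoid dividing by zero

    # If the target is longer (in words), swap so the shorter side is the target.
    if len(tgt_words) > len(src_words):
        source, target = target, source
        src_words, tgt_words = tgt_words, src_words
        # Assumption: each alignment is "i-j"; swapping sides turns it into "j-i".
        alignments = " ".join(
            "-".join(reversed(pair.split("-"))) for pair in alignments.split()
        )

    word_ratio = len(tgt_words) / len(src_words)
    char_ratio = len(target) / len(source)  # counts spaces too (assumption)
    return " ||| ".join(
        [source, target, alignments, f"{word_ratio:.4f}", f"{char_ratio:.4f}"]
    )


def sort_by_word_ratio(annotated_lines):
    """Sort annotated lines by the fourth field, the word compression ratio."""
    return sorted(annotated_lines, key=lambda l: float(l.split("|||")[3]))
```

For example, annotate("a b c ||| a b c d e ||| 0-1 1-0 2-4") swaps the two sides, flips the alignments to "1-0 0-1 4-2", and reports a word ratio of 0.6000.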

I also wrote a Python script that reads a desired range of word compression ratios from the user and writes the sentence pairs in that range to another text file. The reasoning is that many of the paraphrase sentence pairs in the data should not be used for machine translation training. Specifically, smaller compression ratios often yield paraphrases that are either grammatically incorrect or retain none of the source's meaning.
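The range selection can be sketched as follows. This is an illustration rather than the script I actually ran: select_range is a hypothetical name, the bounds are treated as inclusive, and each line is assumed to carry the word compression ratio as its fourth |||-separated field.

```python
def select_range(annotated_lines, low, high):
    """Yield lines whose word compression ratio lies in [low, high]."""
    for line in annotated_lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 5:
            continue  # skip lines that were never annotated
        word_ratio = float(fields[3])
        if low <= word_ratio <= high:
            yield line


# Usage: keep only the pairs in the .6 to .8 range.
lines = [
    "s1 ||| t1 ||| 0-0 ||| 0.0588 ||| 0.0123",
    "s2 ||| t2 ||| 0-0 ||| 0.6800 ||| 0.7000",
    "s3 ||| t3 ||| 0-0 ||| 0.8300 ||| 0.9000",
]
kept = list(select_range(lines, 0.6, 0.8))
```

In practice the low and high values would come from user input and kept would be written to the output file.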

For very small compression ratios, in the range of .05 to .1, most if not all of the sentence pairs should be disregarded. One such example has a word compression ratio of .0588:

source sentence – “she bore two daughters to a syrian husband , who died some time after their birth.”

target sentence – “.”

This range and below will not be helpful for machine translation tasks.

Sampling the sentences in the .3 to .4 range turns up pairs such as this one:

source sentence – “the southern part of america will be drier than usual . it means there could be a lasting drought in texas and new mexico , and the possibility of big forest fire in florida will be higher than normal.”

target sentence – “however , the opportunities for forest fires in florida may increase.”

Although the paraphrased target sentence is grammatically correct, it drops the first two statements altogether. This example had a word compression ratio of .3. Overall, the paraphrasing in this segment of the data was much better than I expected, with some target sentences descriptive enough to be good paraphrases of the source sentence. Still, most of the sentence pairs had target sentences that did not hold the same meaning as the source sentences.

I am going to focus on the segment of the data around the .6 to .8 range. Below .6, the target sentences stop retaining the full meaning of the source sentence. In the .6 to .7 range, the paraphrases are good overall, with a small number of sentence pairs that do not hold up as paraphrases. It is better to start at around .6 than to set a higher lower bound, because there are many good paraphrases with a compression ratio of .6. For example:

source sentence – “11 palestinians hit by israeli army fire in khan yunis.”

target sentence – “11 palestinians wounded in khan yunis.”

Word compression ratios around .7 appear to be ideal for translation tasks, though this still needs to be verified by running the translation tasks with subsets of this data for training and tuning. An example with a compression ratio of .68:

source sentence – “gary had hacked the american computer system in the years 2001 and 2002 , causing the american government to lose eight hundred thousand dollars.”

target sentence – “in 2001-2 mckinnon hacked the us military computer system , causing damage that cost $ 800,000.”

In the example above, the sentence was reordered to give a more concise paraphrase that still retains the full meaning of the source sentence. In the .7 to .8 range, target sentences retain adequate meaning from the source sentence but could be more compressed. Here is a comparison between two target sentences for the same source:

source sentence – “he said that this ceremony is a ritual that has a deep relationship between the peaceful transfer of power from one leader to another leader and the american grandeur.”

target sentence (.7) – “he said the ceremony signifies the great us tradition of a peaceful transfer of power from one leader to another.”

target sentence (.83) – “he said that this ceremony has a deep connection with the great american custom of transferring authority peacefully from one leader to the next.”

Both target sentences retain the meaning of the source sentence, but the target sentence with a compression ratio of .7 is more concise than the wordier target sentence with a compression ratio of .83.

Above .8 there is not much paraphrasing at all, which is only good if the sentence is very short and no paraphrasing should occur anyway. At a word compression ratio of 1, only one or two words differ between the source and target sentences. Here is an example where only word reordering occurs:

source sentence – “arafat welcomes american president ‘s statements.”

target sentence – “arafat welcomes statements of american president.”

For some sentence pairs in the .8 to 1 range, the only difference between the source sentence and the target sentence is a ‘.’; in others, the target sentence simply removes articles (‘the’ or ‘a’). For example:

source sentence – “arab nations unanimously opposed to the us attack on iraq.”

target sentence – “arab nations universally oppose us attack on iraq.”

The only changes are the deletion of ‘to the’ and the substitution of ‘universally oppose’ for ‘unanimously opposed,’ resulting in a compression ratio of .8.

Sorting the original input data also revealed that some of the sentences are grammatically incorrect. As an example, this sentence pair has a compression ratio of .65:

source sentence – “moreover , for these , the man who bring these to them plays a role of good sample , to improve local human right , to improve directly the labor right , have , will have very direct affect.”

target sentence – “and what they brought can be used as examples, it has very direct influence on improving the local human rights and workers ‘ rights.”

The target sentence is grammatically correct and contains fewer words than the muddled source sentence. Ungrammatical sentences like this appear throughout the entire data set, and removing such sentence pairs altogether would likely lead to better machine translation. The only way I can think of to do this is to go through the data manually and remove the ungrammatical sentences. This data came from a group of translators, so for future data collection tasks it would be better not to include translations from weaker translators.

Another interesting observation is that there are a lot of “garbage” sentences in the data. With a word compression ratio of .8:

source sentence – “*****.”

target sentence – “****.”

There is a substantial number of sentence pairs like this in the input data, especially at word compression ratios closer to 1. It would be better for training and tuning if these sentences were removed from the data sets.
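One possible way to drop such pairs automatically is sketched below. This is only a heuristic I am assuming for illustration, not something that was run on the data: a pair is treated as garbage when either side contains no letters or digits. The names has_content and drop_garbage are hypothetical.

```python
def has_content(sentence):
    """True if the sentence contains at least one letter or digit."""
    return any(ch.isalnum() for ch in sentence)


def drop_garbage(pairs):
    """Keep only (source, target) pairs where both sides have real content."""
    return [(s, t) for s, t in pairs if has_content(s) and has_content(t)]


pairs = [
    ("*****.", "****."),
    ("11 palestinians hit by israeli army fire in khan yunis.",
     "11 palestinians wounded in khan yunis."),
    ("she bore two daughters to a syrian husband , who died some time after their birth.",
     "."),
]
clean = drop_garbage(pairs)  # only the second pair survives
```

The same check also removes pairs whose target is a lone ‘.’, like the .0588 example earlier.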

Character compression ratio is another parameter to consider when selecting training and tuning data. It is usually close to the word compression ratio, but there are sentence pairs where the character compression ratio is above 1. Since the target never has more words than the source after the swap, this means the target's words are longer on average than the source's. Another idea would be to disregard sentence pairs for the training and tuning data sets if the character compression ratio is above 1, so as to include only sentence pairs whose targets have both fewer words and fewer characters than their sources.
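To make the character-ratio case concrete, here is a small sketch with an invented sentence pair (it is not from the data set). The function name passes_char_filter is hypothetical, and character length is assumed to count spaces.

```python
def passes_char_filter(source, target, max_char_ratio=1.0):
    """True if the character compression ratio is at most max_char_ratio."""
    return len(target) / len(source) <= max_char_ratio


# Invented example: fewer words in the target, but longer words.
source = "he is in the us ."
target = "he resides stateside ."

word_ratio = len(target.split()) / len(source.split())  # 4 / 6
char_ratio = len(target) / len(source)                   # 22 / 17
keep = passes_char_filter(source, target)                # False
```

Here the word compression ratio is below 1 while the character compression ratio is above 1, so the pair would be dropped under this rule.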

