Specific Text-to-Text Generation Tasks

For the past couple of days, I have focused on two specific text-to-text generation tasks: legalese-to-plain-English translation and prose-to-poetry translation.

Legalese is the style of writing used in legal documents, and it can be very difficult for the general public to understand. Signed in 2010, the Plain Writing Act requires that Federal agencies promote clear communication so that the public can understand new laws and regulations that may affect them. There have also been many other movements promoting plain language, for example encouraging lawyers to use plain English when they address the jury. An obvious market exists for translation tools that can effectively compress and simplify legalese.

As stated in my previous post, I have been collecting translations from legalese to plain English. I now have more than 200 passage translations, along with approximately 700 key phrase translations. These key phrases include English-to-English rewordings, such as replacing “concerning the matter of” with the more colloquial word “about,” and translations of Latin phrases that are common in legalese, such as translating “corpus delicti” as “material evidence.”

A major challenge of this translation task is that sentence boundaries do not carry over: when translating from legalese to plain English, sentences are often combined and sometimes deleted. Paraphrasing will therefore need to operate across sentence boundaries to produce accurate translations. Another challenge is the visual layout of the translation: if the legalese contains a list of items or notes, the translation is structured so that each item gets its own line. For example, the legalese statement, “Federal employees are required to participate in this program if they are involved in the direct care of animals or their living quarters or have direct contact with animals (live or dead), their viable tissues, body fluids or waste” will be formatted as

“Federal employees must participate in this program if they:

  • Take care of animals;
  • Take care of animal living quarters; or
  • Have other direct contact with:
    • Live or dead animals;
    • Viable animal tissues; or
    • Animal body fluids or waste.”

I think a good place to start with this translation task is to focus on paraphrasing single sentences, specifically using the key phrase translations I have collected.
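As a rough illustration of that starting point, here is a minimal sketch of key phrase substitution within a single sentence. The `KEY_PHRASES` table and the function names are hypothetical; the two entries are the examples mentioned above, and a real table would hold the roughly 700 collected phrases. It only does string replacement, so it does not handle the grammatical changes a full translation would need.

```python
import re

# Hypothetical phrase table; the two entries are the examples from this post.
KEY_PHRASES = {
    "concerning the matter of": "about",
    "corpus delicti": "material evidence",
}


def build_pattern(phrases):
    # Try longer phrases first so they win over shorter phrases they contain.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def simplify(sentence, phrases=KEY_PHRASES):
    # Replace every known legalese phrase with its plain English counterpart.
    pattern = build_pattern(phrases)
    lowered = {key.lower(): value for key, value in phrases.items()}
    return pattern.sub(lambda m: lowered[m.group(0).lower()], sentence)


print(simplify("The court heard testimony concerning the matter of the corpus delicti."))
# prints: The court heard testimony about the material evidence.
```

Matching is case-insensitive, but the replacement is inserted as written in the table, so capitalization at the start of a sentence would need a separate step.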

Another interesting topic is translation from prose to poetry. Writing poetry is viewed as a human intelligence task, one that artificial intelligence could not possibly perform successfully. In fact, I have not been able to find any research papers that discuss this translation task. The closest topics I have found are translating Shakespeare into plain English (as discussed in the following research paper) and foreign-language translation of poetry (discussed in this research paper). The latter addresses some of the difficulties to expect when dealing with poetic rhythm and rhyme schemes.

Because so little prose-to-poetry translation is available on the internet, I have been unable to collect any passage translations. To move forward with this translation task, it may be necessary to post HITs on Amazon Mechanical Turk. The problem is that this would require “Turkers” to write poetry, and every Turker would produce a different output. My suggestion is to collect input in the form of paraphrases and synonyms instead of sentence pairs. One reason is that collecting prose-to-poetry sentence pairs seems challenging, if not impossible. Another is that this type of translation will depend more on rhyme and syllable recognition, which will require a collection of synonyms and paraphrases, along with a collection of words or phrases that rhyme and convey the same meaning.
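To make the rhyme and syllable idea concrete, here is a small sketch of how a synonym list could be filtered by rhyme. Everything in it is an assumption for illustration: the function names are made up, the syllable counter is a crude vowel-group heuristic, and the rhyme test only compares spelling. A pronunciation dictionary would be far more reliable than either heuristic.

```python
import re


def count_syllables(word):
    """Rough estimate: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def crude_rhyme(a, b, tail=3):
    """Spelling-based check: the words differ but share their last `tail` letters."""
    a, b = a.lower(), b.lower()
    return a != b and a[-tail:] == b[-tail:]


def pick_rhyming_synonym(synonyms, target):
    """Return the first synonym that rhymes with `target`, or None if none does."""
    for candidate in synonyms:
        if crude_rhyme(candidate, target):
            return candidate
    return None


print(count_syllables("translation"))
print(pick_rhyming_synonym(["radiant", "light", "brilliant"], "night"))
```

The spelling-based test will miss rhymes that are spelled differently and accept some pairs that do not actually rhyme, which is one reason a collected list of rhyming words and phrases would still be needed.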

In addition to researching legalese-to-plain-English and prose-to-poetry translation, I have also spent more time looking at the Java code and the pipeline of the Joshua Decoder to learn more about how the decoder works. I plan to attend a tutorial on the Joshua Decoder tomorrow to further my understanding.



  1. Hi Hilary,

    Thanks for the nice summary of what you’ve been working on. I have two suggestions, one for legalese and one for prose to poetry:

    For prose to poetry, I recommend that you check out Erica Greene and Kevin Knight’s paper “Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation”. It was published at the EMNLP conference in 2010. You can download it here: http://www.aclweb.org/anthology/D/D10/D10-1051.pdf

    For legalese to plain English translation, I would suggest that you start off by targeting a simple subproblem like mapping Latin legal terms onto more colloquial English (like your “corpus delicti” = “material evidence” example). I did a pilot study of this by running an earlier version of our paraphrase system over the 256 Latin legal terms listed here:
    It found paraphrases for 75 of them in the Europarl corpus.
    There are some good paraphrases for most of the 75 phrases that were found. For instance (each line gives the Latin term, a candidate paraphrase, and its score):

    ad infinitum ad infinitum 0.49749339
    ad infinitum infinitely 0.133333334
    ad infinitum on hold forever 0.125
    ad infinitum indefinitely 0.12310185
    ad infinitum a disadvantage 0.016666666
    ad infinitum solutions 0.016666666
    ad infinitum chance 0.01025641

    affidavit declaration 0.6328629025
    affidavit statement 0.22358871
    affidavit affidavit 0.125
    affidavit elections 0.0125

    caveat caveat 0.343533613846154
    caveat reservation 0.14859456
    caveat reserve 0.116732395384615
    caveat warning 0.0859813661538461
    caveat exception 0.0636752130769231
    caveat caveats 0.0341880338461538
    caveat more observation 0.0307692307692308
    caveat condition 0.0256410253846154
    caveat reservations 0.0154700853846154
    caveat limitation 0.0139194138461538
    caveat limit 0.0139194138461538
    caveat subject 0.0103586023076923

    conditio sine qua non conditio sine qua non 0.44166666
    conditio sine qua non sine qua non 0.268494917142857
    conditio sine qua non linchpin 0.117032968571429
    conditio sine qua non condition 0.0765735457142857
    conditio sine qua non prerequisite 0.0256036285714286
    conditio sine qua non condition sine qua non 0.0140931371428571
    conditio sine qua non essential prerequisite 0.0113493042857143

    in extremis in extremis 0.22222222
    in extremis in the end 0.22222222
    in extremis at the last minute 0.17948718
    in extremis at the very last moment 0.11538462
    in extremis in the nick of time 0.11111111
    in extremis at the eleventh hour 0.11111111
    in extremis at this late stage 0.02564103
    in extremis at the last moment 0.01282051

    in toto in its entirety 0.48858599
    in toto in toto 0.4418746725
    in toto by 0.03125
    in toto in full 0.020249595
    in toto in their entirety 0.01171875

    inter alia inter alia 0.356385564545455
    inter alia things 0.136363636363636
    inter alia example 0.132290579090909
    inter alia among other things 0.102272727272727
    inter alia the 0.0681818181818182
    inter alia one 0.05511386
    inter alia amongst other things 0.0454545454545455
    inter alia among others 0.0227272727272727
    inter alia even the 0.0227272727272727

    Why don’t you try the same experiment using the full PPDB? It probably has better coverage than the Europarl corpus that I used in my earlier experiment. In fact, I had Juri include the Acquis Communautaire in the set of documents that he used to construct the PPDB. The Acquis Communautaire is the body of law for the EU, so it has tons of legalese in it.
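For anyone repeating this experiment, the listing above can be processed with a few lines of code. The sketch below assumes each line has exactly the layout shown (the known Latin term, then the paraphrase, then the score, separated by spaces); the function names are hypothetical, and the actual PPDB files may be laid out differently, in which case the parsing step would need to change.

```python
def parse_line(line, term):
    """Split a 'term paraphrase score' line, given the known source term."""
    line = line.strip()
    if not line.startswith(term + " "):
        raise ValueError(f"line does not start with term {term!r}")
    rest = line[len(term) + 1:]
    # The score is the last space-separated field; the paraphrase may contain spaces.
    paraphrase, score = rest.rsplit(" ", 1)
    return paraphrase, float(score)


def best_paraphrase(lines, term):
    """Return the highest-scoring (paraphrase, score) that is not the term itself."""
    candidates = [parse_line(line, term) for line in lines]
    candidates = [(p, s) for p, s in candidates if p != term]
    return max(candidates, key=lambda c: c[1], default=None)


# Lines copied from the "in toto" group above.
in_toto = [
    "in toto in its entirety 0.48858599",
    "in toto in toto 0.4418746725",
    "in toto by 0.03125",
]
print(best_paraphrase(in_toto, "in toto"))
```

Skipping the identity paraphrase matters because for several terms, such as “ad infinitum” and “caveat”, the top-scoring entry is the Latin term itself.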


