I have compiled 411 legalese to plain English sentence pairs. I found 285 of them on the internet; sources include plainlanguage.gov and the Michigan Bar Journal. Of those 285, I am keeping only the 131 sentence pairs in which the plain English structurally resembles the legalese. I have discarded the other 154 pairs because they would most likely lead to a weaker translation system: I believe pairs with a very low word compression ratio will force the system to compress too much while translating, which will result in weaker grammaticality and less meaning preservation. I translated the remaining 126 pairs myself from a housing agreement that Dr. Callison-Burch provided me with. So for now I have a total of 257 sentence pairs to work with. Together with these sentence pairs, the 677 phrase pairs I found (381 Latin to plain English phrases and 296 convoluted English to plain English phrases) will let me focus on running a legalese to plain English translation task.
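The filtering step described above can be sketched as follows. This is a minimal illustration, not my actual script: the 0.6 threshold and the example pairs are hypothetical, chosen only to show the idea of dropping pairs whose plain-English side is drastically shorter than the legalese.

```python
# Sketch of the compression-ratio filter described above.
# The 0.6 threshold and the sample pairs are illustrative, not the actual values used.

def word_compression_ratio(legal: str, plain: str) -> float:
    """Ratio of plain-English word count to legalese word count."""
    return len(plain.split()) / len(legal.split())

def filter_pairs(pairs, threshold=0.6):
    """Keep only pairs whose plain side is not drastically shorter than the legalese."""
    return [(l, p) for (l, p) in pairs if word_compression_ratio(l, p) >= threshold]

pairs = [
    # heavily compressed pair (ratio 6/11 ~ 0.55): dropped
    ("in the event that the tenant shall fail to remit payment",
     "if the tenant does not pay"),
    # structurally similar pair (ratio 7/10 = 0.7): kept
    ("the party of the first part agrees to the terms",
     "the first party agrees to the terms"),
]
kept = filter_pairs(pairs)
```

A pair like the first one would push the system toward aggressive compression during translation, which is exactly the behavior the filter is meant to avoid.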
In order to set up the decoder with the sentence pairs, I have to format the data to match the format the decoder expects. To format the sentence pairs, I had to normalize the punctuation, tokenize the sentences, and convert all letters to lowercase. To do this, I piped the data through three scripts that ship with the Joshua Decoder (note that the variable JOSHUA=/home/hltcoe/hpaisley/code/joshua-v5.0rc2):
cat english_sentences_1.txt | $JOSHUA/scripts/training/normalize-punctuation.pl | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > sentences_1.en
The sentence pairs I extracted from the internet are in english_sentences_1.txt, and the sentence pairs I translated are in english_sentences_2.txt. The corresponding legalese sentences are located in legalese_sentences_1.txt and legalese_sentences_2.txt, respectively.
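Since the decoder treats each line of the English file and the matching line of the legalese file as one sentence pair, the formatted files must stay line-aligned. A quick sanity check for this can be sketched as follows (the helper name and usage are my own, not part of Joshua):

```python
# Sketch of a line-alignment sanity check for a parallel corpus.
# The function name and usage are illustrative, not part of the Joshua toolkit.

def check_parallel(en_lines, legal_lines):
    """Raise ValueError if the two sides of a parallel corpus are misaligned."""
    if len(en_lines) != len(legal_lines):
        raise ValueError(
            f"line count mismatch: {len(en_lines)} vs {len(legal_lines)}")
    for i, (en, legal) in enumerate(zip(en_lines, legal_lines), start=1):
        if not en.strip() or not legal.strip():
            raise ValueError(f"empty line at line {i}")

# Example usage on the formatted files (names follow the convention above):
# for n in (1, 2):
#     check_parallel(open(f"sentences_{n}.en").read().splitlines(),
#                    open(f"sentences_{n}.legal").read().splitlines())
```

Catching a misalignment here is much cheaper than discovering it after a long parsing or tuning run.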
After formatting the data, the sentence pairs must be parsed using the Joshua Decoder's parser. So I set up four separate runs that only parse the data, one for each data set: sentences_1.en, sentences_1.legal, sentences_2.en, and sentences_2.legal. The following script was used to parse the data:
#$ -l num_proc=16,h_vmem=120g,mem_free=120g,h_rt=168:00:00
#$ -S /bin/bash
#$ -M email@example.com
#$ -m eas
#$ -j y -o logs/1.log

# Invocation line reconstructed: the original listing showed only the flags.
# --type appeared twice (hiero, then samt); samt is kept since the PARSE step
# only applies to syntax-based grammars. --target en is inferred from the
# .en/.legal file naming and the dangling continuation after --source.
$JOSHUA/scripts/training/pipeline.pl \
  --rundir 1 \
  --readme "Parsing sentences_1.en" \
  --type samt \
  --first-step PARSE \
  --last-step PARSE \
  --corpus /home/hltcoe/hpaisley/data/legalese/sentences_1 \
  --grammar /home/hltcoe/jganitkevic/experiments/ppdb/release/v1.0/eng/xxxl/packed-2013-06-08/ppdb-1.0.packed \
  --tune-grammar /home/hltcoe/jganitkevic/experiments/ppdb/release/v1.0/eng/xxxl/packed-2013-06-08/ppdb-1.0.packed \
  --test-grammar /home/hltcoe/jganitkevic/experiments/ppdb/release/v1.0/eng/xxxl/packed-2013-06-08/ppdb-1.0.packed \
  --tune /home/hltcoe/hpaisley/data/legalese/sentences_1 \
  --test /home/hltcoe/hpaisley/data/legalese/sentences_1 \
  --tuner pro \
  --lmfile /home/hltcoe/mpost/expts/wmt13/runs/hiero/de-en/8.6/lm-merged.kenlm \
  --joshua-mem 40g \
  --optimizer-runs 1 \
  --threads 32 \
  --jobs 20 \
  --source legal \
  --target en
I am now focusing on writing a Python script to analyze the parsed sentence pairs I have collected. My goal is to discover key differences between legalese and plain English so that the translation system can implement these differences while translating legalese. In the script, I will focus on differences in sentence length and word length to get data on the word compression ratio and character compression ratio, respectively. I will also try to break down the sentence structures and compare histograms of the parts of speech to note any differences between legalese and plain English.
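The core of that analysis can be sketched as follows. This is a minimal outline, assuming the parser output is in Penn Treebank bracketed form; the sample sentence and parse below are illustrative, not taken from my data.

```python
# Sketch of the planned analysis: compression ratios plus a POS histogram.
# Assumes parses are in Penn Treebank bracketed form; sample data is illustrative.
import re
from collections import Counter

def compression_ratios(legal: str, plain: str):
    """Return (word ratio, character ratio) of plain English relative to legalese."""
    word_ratio = len(plain.split()) / len(legal.split())
    char_ratio = len(plain) / len(legal)
    return word_ratio, char_ratio

def pos_histogram(parse: str) -> Counter:
    """Count preterminal POS tags in a bracketed parse, e.g. '(DT the)'."""
    return Counter(re.findall(r"\((\S+) [^()]+\)", parse))

legal = "the party of the first part agrees"
plain = "the first party agrees"
w, c = compression_ratios(legal, plain)

parse = "(S (NP (DT the) (NN tenant)) (VP (MD shall) (VB pay)))"
hist = pos_histogram(parse)
```

Aggregating these histograms over all parsed pairs should expose, for example, whether legalese leans more heavily on modals and nominalizations than its plain-English counterparts.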
I am also in the process of reading Lifting the Fog of Legalese by Joseph Kimble to get a better understanding of the difficulties of legalese translation, along with the reasoning for getting rid of legalese altogether.