I had difficulty running the legalese from the document I translated through the parser. The other 5 text files I sent through the parser took approximately 100 seconds to finish. This text file (sentences_2.legal) took more than one day to finish running, and the corpus.legal.Parsed file had 20 lines instead of the expected 126. Looking at the file, it appears that the parser stopped abruptly, even though the standard output implied it had finished. After consulting with Juri, I called the parser directly from the command line instead of using a script to call the Joshua Decoder:
cat /home/hltcoe/hpaisley/data/legalese/sentences_2.legal | java -d64 -Xmx2g -jar /home/hltcoe/hpaisley/joshua-v5.0rc2/lib/BerkeleyParser.jar -gr /home/hltcoe/hpaisley/joshua-v5.0rc2/lib/eng_sm6.gr -nThreads 1 | sed 's/^(())$//; s/^(/(TOP/' | perl /home/hltcoe/hpaisley/joshua-v5.0rc2/scripts/training/add-OOVs.pl /home/hltcoe/hpaisley/expts/legalese_parsing/20/data/train/vocab.legal | tee /home/hltcoe/hpaisley/expts/legalese_parsing/20/data/train/corpus.legal.Parsed | /home/hltcoe/hpaisley/joshua-v5.0rc2/scripts/training/lowercase-leaves.pl > /home/hltcoe/hpaisley/expts/legalese_parsing/20/data/train/corpus.parsed.legal
Juri told me that this might work better because, when the sentences are too long, calling the decoder can take a while and run out of memory. The command ran for about a minute and correctly parsed the file. I then ran the output through my scripts to analyze the parsing and got the following results:
Word Ratio – 0.79265716594
Character Ratio – 0.784720658004
Nesting Ratio – 0.790867870058
Comma Ratio – 0.4164808516
Fragment Ratio – 0.261904761905
OOV Ratio – 0.670171957672
Verb Ratio – 0.760199533707
SBAR Ratio – 0.524754346183
VBN Ratio – 0.416931216931
Shall Ratio – 0.0
And Ratio – 0.337641723356
SV Ratio – 0.904761904762
NP-VP Ratio – 0.665497341634
With my translated sentences, the Verb Ratio and OOV Ratio were both lower than they were for the internet sentences, which means that when I translated the legalese to plain English, I used fewer verbs and more common words in general. Every other ratio stayed within range of the internet sentence pair results.
I have collected more sentence pairs from the book Lifting the Fog of Legalese. I put these 33 new sentence pairs into a file to be analyzed separately so that I can consider how they differ from the internet sentences and my translated sentences. I got the following results after running the pairs through the parser and then my analysis script:
Word Ratio – 0.76917978335
Character Ratio – 0.738876753047
Nesting Ratio – 0.739335216038
Comma Ratio – 0.504748029748
Fragment Ratio – 0.242424242424
OOV Ratio – 0.904637410519
Verb Ratio – 0.980316027716
SBAR Ratio – 0.603896103896
VBN Ratio – 0.259595959596
Shall Ratio – 0.0
And Ratio – 0.420202020202
SV Ratio – 0.969696969697
NP-VP Ratio – 0.552687539492
These results also closely matched the results I got from the internet sentences, except for the significantly lower VBN Ratio and NP-VP Ratio. This suggests that Kimble was more cautious with his use of past participles when writing his plain English translations.
I have also added a new ratio to the analysis script. The NP to VP distance ratio measures the distance between the noun phrase and the verb phrase in the sentences. Distance is measured by the number of words between the phrase labels “NP” and “VP” in each parsed sentence. I added this ratio because, in the book Lifting the Fog of Legalese, Kimble mentions that plain English sentences should have a short distance between the subject and the verb, minimizing clauses and wordiness. This also promotes the use of action verbs. Because the sentence pairs sometimes have multiple sentences per line, I had to take the average distance between the NP and VP over the sentences in each line to get an accurate ratio. I have also computed the NP-VP Ratio for the first sentence pairs from my previous post (Parsed Data Analysis):
Internet Sentence Pairs from Previous Post: NP-VP Ratio – 0.866280086741
Comparing all of the NP-VP Ratios, the internet sentence pairs had a significantly higher NP-VP Ratio than either my translated sentence pairs or Kimble’s sentence pairs. Because Kimble was the one who spoke of using a short distance between the subject and the action verb, it makes sense that the ratio is the lowest for his sentence pairs.
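The NP to VP distance described above can be sketched roughly as follows. This is only a minimal illustration, not my actual analysis script: it assumes the parser output is a Penn Treebank style bracketed tree on one line, and it interprets "distance" as the number of leaf words between the first "NP" label and the next "VP" label. The function names np_vp_distance and mean_np_vp_distance are made up for this example.

```python
import re

# Split a bracketed parse into "(", ")" and everything else (labels and words).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def np_vp_distance(parse):
    """Count leaf words between the first NP label and the next VP label.

    Returns None when the tree has no NP followed by a VP
    (for example, a fragment). This reading of "distance" is an assumption.
    """
    seen_np = False
    words = 0
    prev = None
    for tok in TOKEN.findall(parse):
        if prev == "(":
            # A token right after "(" is a constituent or POS label.
            if tok == "NP" and not seen_np:
                seen_np = True
            elif tok == "VP" and seen_np:
                return words
        elif tok not in ("(", ")") and seen_np:
            # Any other non-bracket token is a leaf word.
            words += 1
        prev = tok
    return None


def mean_np_vp_distance(parses):
    """Average the distance over several parses, skipping trees without NP..VP."""
    distances = [d for d in map(np_vp_distance, parses) if d is not None]
    if not distances:
        return None
    return sum(distances) / len(distances)


if __name__ == "__main__":
    tree = "(TOP (S (NP (DT The) (NN tenant)) (VP (MD shall) (VP (VB pay) (NP (NN rent))))))"
    print(np_vp_distance(tree))
```

Averaging per line, as in mean_np_vp_distance, is one way to handle lines that contain more than one sentence; a long subject with embedded clauses pushes the count up, which is the effect Kimble warns about.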
Based on these results, I think the SBAR and VBN Ratios could be good ratios for separating legalese from plain English sentences. These ratios are both around 0.5, which means legalese sentences are about twice as likely to contain past participles and multiple clauses. The Fragment Ratio is interesting to note because it is consistently the lowest ratio, but it is hard to use because most sentences do not contain fragments. The Shall Ratio could also be used as a separator: if a sentence contains “shall”, immediately categorize it as a legalese sentence. There is a whole list of words that are used in legalese but omitted in plain English because they serve no structural or meaningful purpose, including:
albeit, ambit, hence, henceforth, herein, heretofore, hitherto, inasmuch as, notwithstanding, oftentimes, thereafter, therein, thereof, theretofore, thereupon, thitherto
Then there is also a slew of words that are exchanged for a different, more common word or phrase. With the list of legalese to plain English word/phrase pairs that I already collected, if a sentence contains one of the words from the legalese column, it can be classified as legalese.
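The keyword rule described above could be sketched like this. It is a rough illustration only: the marker list contains just "shall" plus the words listed earlier in this post (my collected legalese to plain English pairs are not included), and the names LEGALESE_MARKERS and looks_like_legalese are made up for the example.

```python
import re

# "shall" plus the legalese-only words listed in this post.
LEGALESE_MARKERS = [
    "shall", "albeit", "ambit", "hence", "henceforth", "herein",
    "heretofore", "hitherto", "inasmuch as", "notwithstanding",
    "oftentimes", "thereafter", "therein", "thereof", "theretofore",
    "thereupon", "thitherto",
]

# Match whole words or phrases only, ignoring case. Longer entries are
# tried first so multi-word phrases are not shadowed by shorter ones.
MARKER_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(m) for m in sorted(LEGALESE_MARKERS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def looks_like_legalese(sentence):
    """Return True if the sentence contains any marker word or phrase."""
    return MARKER_PATTERN.search(sentence) is not None


if __name__ == "__main__":
    for s in ["The tenant shall pay rent.", "The tenant must pay rent."]:
        print(s, looks_like_legalese(s))
```

A rule like this can only flag sentences as legalese; a sentence without any marker word is not necessarily plain English, so it would probably work best alongside the SBAR and VBN Ratios.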