Parsed Data Analysis II

I had difficulty running the legalese from the document I translated through the parser. The other five text files I sent through the parser took approximately 100 seconds to finish. This text file took more than one day to finish, and the output file had 20 lines instead of the expected 126. Looking at the file, it appears that the parser stopped abruptly, even though the standard output implied that it had finished. After consulting with Juri, I called the parser directly from the command line instead of using a script that calls the Joshua Decoder.

cat /home/hltcoe/hpaisley/data/legalese/ | java -d64 -Xmx2g -jar /home/hltcoe/hpaisley/joshua-v5.0rc2/lib/BerkeleyParser.jar -gr /home/hltcoe/hpaisley/joshua-v5.0rc2/lib/ -nThreads 1 | sed 's/^(())$//; s/^(/(TOP/' | perl /home/hltcoe/hpaisley/joshua-v5.0rc2/scripts/training/ /home/hltcoe/hpaisley/expts/legalese_parsing/20/data/train/ | tee /home/hltcoe/hpaisley/expts/legalese_parsing/20/data/train/ | /home/hltcoe/hpaisley/joshua-v5.0rc2/scripts/training/ > /home/hltcoe/hpaisley/expts/legalese_parsing/20/data/train/
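A truncated run like the 20-line output described earlier can be caught automatically by comparing line counts, since the parser is expected to emit one output line per input line. The following is only a minimal sketch: the function names are hypothetical, and the file paths are whatever input and output files were used for the run.

```python
def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def parse_is_complete(input_path, parsed_path):
    """Return True when the parsed file has one line per input line.

    Assumption: the parser writes exactly one output line (possibly an
    empty parse) for every input line, so a shorter output means the run
    stopped early.
    """
    return count_lines(input_path) == count_lines(parsed_path)
```

Running this check right after the parser finishes would flag the mismatch (20 lines versus 126) before any analysis is done on the incomplete output.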

Juri told me that calling the parser directly may work better because, when the sentences are very long, going through the decoder can take a long time and run out of memory. The direct call ran for about a minute and correctly parsed the file. I then ran the output through my scripts to analyze the parsing and got the following results:

Word Ratio – 0.79265716594

Character Ratio – 0.784720658004

Nesting Ratio – 0.790867870058

Comma Ratio – 0.4164808516

Fragment Ratio – 0.261904761905

OOV Ratio – 0.670171957672

Verb Ratio – 0.760199533707

SBAR Ratio – 0.524754346183

VBN Ratio – 0.416931216931

Shall Ratio – 0.0

And Ratio – 0.337641723356

SV Ratio – 0.904761904762

NP-VP Ratio – 0.665497341634

With my translated sentences, the Verb Ratio and OOV Ratio were both lower than with the internet sentences, which means that when I translated the legalese to plain English, I used fewer verbs and more common words in general. Every other ratio stayed within range of the internet sentence pair results.

I have collected more sentence pairs from the book Lifting the Fog of Legalese. I put these 33 new sentence pairs into a separate file so that I can compare them with the internet sentences and my translated sentences. I got the following results after running the pairs through the parser and then my analysis script:

Word Ratio – 0.76917978335

Character Ratio – 0.738876753047

Nesting Ratio – 0.739335216038

Comma Ratio – 0.504748029748

Fragment Ratio – 0.242424242424

OOV Ratio – 0.904637410519

Verb Ratio – 0.980316027716

SBAR Ratio – 0.603896103896

VBN Ratio – 0.259595959596

Shall Ratio – 0.0

And Ratio – 0.420202020202

SV Ratio – 0.969696969697

NP-VP Ratio – 0.552687539492

These results also closely matched the results I got from the internet sentences, except for the significantly lower VBN Ratio and NP-VP Ratio (a new ratio, described below). When Kimble was writing plain English translations, he was presumably more cautious with his use of past participles.

I have also added a new ratio to the analysis script. The NP-VP Ratio (NP to VP distance ratio) compares the distance between the noun phrase and the verb phrase in the sentences. Distance is measured as the number of words between the “NP” and “VP” labels in each sentence's parse. I added this ratio because, in Lifting the Fog of Legalese, Kimble says that plain English sentences should keep the subject close to the verb, minimizing intervening clauses and wordiness. This also promotes the use of action verbs. Because the sentence pairs sometimes have multiple sentences per line, I took the average NP-to-VP distance over the sentences in each line to get an accurate ratio. I have also computed the NP-VP Ratio for the first sentence pairs from my previous post (Parsed Data Analysis):

Internet Sentence Pairs from Previous Post: NP-VP Ratio – 0.866280086741

Comparing all of the NP-VP Ratios, the internet sentence pairs had a significantly higher NP-VP Ratio than my translated sentence pairs or Kimble’s sentence pairs. Because Kimble is the one who recommends a short distance between the subject and the action verb, it makes sense that the ratio is lowest for his sentence pairs.
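To make the NP-VP distance concrete, here is a minimal sketch of one way it could be computed from a bracketed parse. It is not the actual analysis script. The function names are hypothetical, and several details are assumptions: each distance runs from an NP label to the next VP label, punctuation tokens count as words, and the ratio is the plain English distance divided by the legalese distance, averaged over pairs.

```python
import re

# Split a bracketed parse into "(", ")" and everything else (labels and words).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def np_vp_distances(parse):
    """Return the word counts between each NP label and the next VP label."""
    distances = []
    count = None  # None means no NP is currently waiting for its VP
    prev = None
    for tok in TOKEN.findall(parse):
        is_label = prev == "("  # a token right after "(" is a tree label
        if is_label and tok == "NP" and count is None:
            count = 0
        elif is_label and tok == "VP" and count is not None:
            distances.append(count)
            count = None
        elif not is_label and tok not in ("(", ")") and count is not None:
            count += 1  # a leaf word between the NP and the VP
        prev = tok
    return distances


def mean(values):
    """Average of a list, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def np_vp_ratio(pairs):
    """Average plain/legalese distance ratio over (legalese, plain) parse pairs.

    Assumption: pairs whose legalese side has no measurable distance are skipped.
    """
    ratios = []
    for legalese, plain in pairs:
        legalese_dist = mean(np_vp_distances(legalese))
        plain_dist = mean(np_vp_distances(plain))
        if legalese_dist > 0:
            ratios.append(plain_dist / legalese_dist)
    return mean(ratios)


# np_vp_distances("(TOP (S (NP (DT The) (NN court)) (VP (VBD ruled))))") returns [2]
```

Averaging the distances within a line before taking the ratio is what handles lines that contain more than one sentence.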

Based on these results, I think the SBAR and VBN Ratios could be good features for separating legalese from plain English sentences. These ratios are both around 0.5, which means the legalese sentences are roughly twice as likely to contain past participles and subordinate clauses. Fragment Ratio is interesting because it is consistently the lowest ratio, but it is hard to use because most sentences do not contain fragments. The Shall Ratio could also be used as a separator: if a sentence contains “shall”, immediately categorize it as legalese. There is a whole list of words that are used in legalese but omitted in plain English because they serve no structural or meaningful purpose, including:

albeit, ambit, hence, henceforth, herein, heretofore, hitherto, inasmuch as, notwithstanding, oftentimes, thereafter, therein, thereof, theretofore, thereupon, thitherto

Then there is also a slew of words that are exchanged for a different, more common word or phrase. With the list of legalese to plain English word/phrase pairs that I already collected, if a sentence contains one of the words from the legalese column, it can be classified as legalese.
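The word-based rules above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the term list is “shall” plus the omitted words listed in this post, the legalese column of the word/phrase pair list is not reproduced here, and the names `LEGALESE_TERMS` and `is_legalese` are hypothetical.

```python
import re

# "shall" plus the legalese-only words listed in this post.
# The legalese column of the word/phrase pair list could be appended here.
LEGALESE_TERMS = [
    "shall", "albeit", "ambit", "hence", "henceforth", "herein",
    "heretofore", "hitherto", "inasmuch as", "notwithstanding",
    "oftentimes", "thereafter", "therein", "thereof", "theretofore",
    "thereupon", "thitherto",
]

# Whole-word, case-insensitive match so that "Marshall" does not trigger "shall".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in LEGALESE_TERMS) + r")\b",
    re.IGNORECASE,
)


def is_legalese(sentence):
    """Return True if the sentence contains any listed legalese term."""
    return PATTERN.search(sentence) is not None
```

A rule like this can only flag sentences as legalese. A sentence with none of the listed terms is not necessarily plain English, so it would still need the ratio-based features.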


