Parsed Data Analysis

I have been trying to determine good ways to distinguish legalese from plain English programmatically. I need to come up with rules to teach the decoder so that, when presented with unseen phrases or challenging sentences, it will have a better success rate translating into plain English. The easiest measures to compute are the word compression ratio and the character compression ratio. I calculated these for each sentence in my file (I am focusing on the file that contains 131 sentence pairs from the internet), then took the average of each line's word and character compression ratios. The results were 0.8647212037 and 0.857087615914, respectively. But these numbers confirmed what I already knew: that plain English, in general, contains fewer and simpler words than legalese.

My Python script (/home/hltcoe/hpaisley/data/legalese/) calculates much more than word and character compression ratios. The closer a number is to 0, the more of the assessed variable appears in the legalese sentence. I then run the line-by-line text file through another Python script (/home/hltcoe/hpaisley/data/legalese/) that averages each of the columns separately. It is possible for a compression ratio to be larger than 1, but when sampling the legalese-to-plain-English translation data, all of the average ratios remain below 1.
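The two simplest ratios can be sketched in a few lines. This is a minimal stand-in for the analysis script (which is not shown here), assuming the convention that a ratio is the plain English count divided by the legalese count, so values below 1 mean the plain English side is shorter:

```python
# Sketch: word and character compression ratios for one sentence pair.
# Ratio = plain English count / legalese count (hypothetical helper,
# not the actual analysis script).

def compression_ratios(legalese: str, plain: str) -> tuple[float, float]:
    word_ratio = len(plain.split()) / len(legalese.split())
    char_ratio = len(plain) / len(legalese)
    return word_ratio, char_ratio

pair = ("In witness whereof the parties hereto have executed this agreement.",
        "The parties have signed this agreement.")
word_r, char_r = compression_ratios(*pair)
```

Averaging these per-line values over all 131 pairs gives the file-level ratios reported below.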

In addition to word and character compression ratio, the script analyzes many other aspects of the parsed data.

  • Nesting ratio: the number of clauses and the degree of convolution in the sentence; the higher the number of clauses, the deeper the nesting
  • Comma ratio: to try to determine if there was a significant difference between the number of commas used in plain English vs. legalese
  • Fragment ratio: I figured that the parser would be unable to correctly categorize the legal jargon phrases, such as “In witness whereof” or “hereby,” and would therefore classify these phrases as fragments
  • OOV ratio: OOV stands for “out of vocabulary”; my thought was that legalese would contain a higher number of OOV words than plain English
  • Verb ratio: legalese usually uses more helper verbs instead of action verbs, so legalese should contain more verbs than plain English does
  • SBAR ratio: definition from “The Penn Treebank” – a clause introduced by a (possibly empty) subordinating conjunction; legalese has more clauses and is more convoluted than plain English
  • VBN ratio: this stands for past participle verbs; usually past participles are used more frequently in legalese
  • Shall ratio: the word “shall” is common in legalese, but replaced by “must” in plain English
  • And ratio: legalese uses a lot of doublets and triplets, such as “terms and conditions” or “signed, sealed and delivered,” so this ratio might show that distinction; plain English, on the other hand, often uses lists, which will diminish the use of “and”
  • SV Ratio: the subject-before-verb ratio was calculated a little differently from the ratios above; if the legalese sentence had SV order, the ratio was 1, and if it had VS order, the ratio was 0
  • NP-VP Ratio: the word distance between the noun phrase and verb phrase; discussed further in Parsed Data Analysis II
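Several of the tag-based ratios above (SBAR, VBN, Verb) can be illustrated by counting labels in the bracketed parses. This is a sketch under assumed conventions (the regex and the plain-over-legalese ratio direction are my assumptions, not the actual script):

```python
import re

# Sketch: count a POS/constituent label (e.g. "VBN", "SBAR") in a
# bracketed Penn Treebank-style parse. A label appears immediately
# after an opening parenthesis, e.g. "(VBN signed".
def count_tag(parse: str, tag: str) -> int:
    return len(re.findall(r"\(" + re.escape(tag) + r"\b", parse))

# Ratio = plain English count / legalese count, guarding against
# division by zero when the tag never occurs on the legalese side.
def tag_ratio(legalese_parse: str, plain_parse: str, tag: str) -> float:
    legal = count_tag(legalese_parse, tag)
    return count_tag(plain_parse, tag) / legal if legal else 0.0
```

For example, `tag_ratio(legal, plain, "VBN")` below 0.5 would indicate that legalese uses more than twice as many past participles as the plain English side.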

After running the data through the scripts, I was surprised by some of the ratios. To run the data:

./ /home/hltcoe/hpaisley/expts/legalese_parsing/1/data/train/corpus.en.Parsed /home/hltcoe/hpaisley/expts/legalese_parsing/10/data/train/ 1_analysis.txt

cat 1_analysis.txt | ./
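The averaging step can be sketched as follows. This is a minimal stand-in for the column-averaging script (whose file name was lost above), assuming each input line holds the per-sentence ratios separated by `|||`:

```python
import sys

# Sketch: average each "|||"-delimited numeric column across all lines
# (hypothetical stand-in for the averaging script, not the real one).
def average_columns(lines):
    sums, n = None, 0
    for line in lines:
        values = [float(v) for v in line.split("|||")]
        sums = values if sums is None else [s + v for s, v in zip(sums, values)]
        n += 1
    return [s / n for s in sums]

if __name__ == "__main__":
    # Read per-line ratios from standard input, print the column averages.
    print(" ||| ".join(str(a) for a in average_columns(sys.stdin)))
```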

The final averages will be output to standard output. For my first plain English and legalese files (the 131 sentences from the internet), I got the following results:

Word Ratio ||| Character Ratio ||| Nesting Ratio ||| Comma Ratio ||| Fragment Ratio ||| OOV Ratio ||| Verb Ratio ||| SBAR Ratio ||| VBN Ratio ||| Shall Ratio ||| And Ratio ||| SV Ratio ||| NP-VP Ratio
0.8647212037 ||| 0.857087615914 ||| 0.86374038229 ||| 0.544165757906 ||| 0.058524173028 ||| 0.953806329379 ||| 0.992319428276 ||| 0.662031988368 ||| 0.48905246577 ||| 0.0 ||| 0.403307888041 ||| 0.977099236641 ||| 0.866280086741

Because it is difficult to read in this format, below is the reformatted output:

Word Ratio – 0.8647212037

Character Ratio – 0.857087615914

Nesting Ratio – 0.86374038229

Comma Ratio – 0.544165757906

Fragment Ratio – 0.058524173028

OOV Ratio – 0.953806329379

Verb Ratio – 0.992319428276

SBAR Ratio – 0.662031988368

VBN Ratio – 0.48905246577

Shall Ratio – 0.0

And Ratio – 0.403307888041

SV Ratio – 0.977099236641

NP-VP Ratio – 0.866280086741

I was not surprised by the Word, Character, Nesting, or Comma Ratios. It was interesting that the Fragment Ratio was so low, whereas the OOV Ratio was quite high. What really surprised me was the Verb Ratio, as it was the closest number to 1.0. As shown by the data, the number of verbs is not a distinguishing factor, but the VBN Ratio (past participle verbs) is a key determiner between plain English and legalese, as legalese contains more than twice as many past participles as plain English. The SBAR Ratio was as I expected: legalese contains more unnecessary clauses. The Shall Ratio was also expected, because plain English should replace every “shall” with “must” or another similar word. The And Ratio was lower than I thought it would be, but I did expect that “and” would appear more in the legalese due to the doublets and lists, as stated above. Finally, the SV Ratio was much higher than I expected. A possible explanation is that legalese does have a subject at the front of the sentence, but it is usually not the main subject, and this results in the common use of past participles.

I plan to look into other distinguishing qualities between legalese and plain English. I would really like to figure out whether there are key words that are classified differently by the parser in the plain English vs. the legalese, though I am not sure how to come up with a proper algorithm. I also want to look into the differences in the word distance between the subject, verb, and object. Lifting the Fog of Legalese is also helping me to think about key distinctions and has provided me with some more examples that I will add to my sentence-pair data. I also plan to run these scripts on the data I translated myself to see if the results are similar.

By manually looking through the parsed data, I noticed some discrepancies with the parser that could lead to slightly skewed ratios. For example, a list in a plain English sentence was not tokenized as a list because it used ‘a’, ‘b’, ‘c’, as opposed to ‘1’, ‘2’, ‘3’:

(VBZ includes)))) (JJ -lrb-) (NP (NP (DT a) (NN -rrb-)) (JJ -lrb-) (X (SYM b)) (JJ -rrb-))

The ‘a’ was classified as a DT (determiner), which makes sense, whereas the ‘b’ was classified as a SYM (symbol). This would be fine, except that the tokenizer never establishes this as a list, which would alter my results if I were to look at a List Ratio, because the listing did not appear in the paired legalese sentence. I will continue looking for discrepancies in the parsed data, but most of the sentences I have looked at are parsed correctly, so there is minimal chance of large variance in the results.
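One quick way to flag such cases would be to scan the parses for a lone letter tagged SYM or DT between the `-lrb-`/`-rrb-` bracket tokens, which suggests an unrecognized (a), (b), (c) list. This is a hypothetical check, not part of the existing scripts:

```python
import re

# Sketch: flag parses where a single letter between -lrb-/-rrb- tokens
# was tagged SYM or DT, hinting at a letter-enumerated list the
# tokenizer did not recognize (hypothetical helper).
def has_letter_list_marker(parse: str) -> bool:
    pattern = r"-lrb-.*?\((?:SYM|DT)\s+[a-z]\).*?-rrb-"
    return bool(re.search(pattern, parse, re.IGNORECASE))
```

Running this over both sides of each pair would show how often a letter list appears in the plain English sentence with no counterpart in the legalese.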



  1. Dennis Paisley

    Very interesting with the OOV ratio. Legalese has some words that require several plain English words to convey the same meaning.

  2. […] I have also included a new ratio into the analysis script. The NP to VP distance ratio measures the distance between the noun phrase and the verb phrase in the sentences. Distance is measured by the number of words between the parts of speech tokens “NP” and “VP” in each sentence. I added this ratio because in the book Lifting the Fog of Legalese, Kimble mentions that plain English sentences should have a short distance between the subject and the verb, minimizing clauses and wordiness. This also promotes the use of action verbs. Because the sentence pairs sometimes have multiple sentences per line, I had to take the average distance between the NP and VP in each sentence in each line to get an accurate ratio. I have also added the NP-VP Ratio to the first sentence pairs in my previous post (Parsed Data Analysis): […]

  3. Hi Hilary, How do you compute OOV? Are you looking up the words in a pre-existing dictionary of English words? Where does that dictionary come from?

    1. I compute OOV by adding up how many of the parsed words have a part of speech classified as OOV in each sentence. I am using the default Berkeley parser from the Joshua Decoder as the source of the English words.
