I have been trying to determine good ways to distinguish legalese from plain English programmatically. I need to come up with rules to teach the decoder so that, when presented with unseen phrases or challenging sentences, it will have a better success rate translating into plain English. The easiest measures to compute are the word compression ratio and the character compression ratio. I calculated these for each sentence in my file (I am focusing on the file which contains 131 sentence pairs from the internet), then averaged each line’s word and character compression ratios. The results were 0.8647212037 and 0.857087615914, respectively. But these numbers only confirmed what I already knew: plain English generally contains fewer and simpler words than legalese.
My Python script (/home/hltcoe/hpaisley/data/legalese/analyze_3.py) calculates much more than word and character compression ratios. The closer a number is to 0, the more the assessed feature appears in the legalese sentence relative to its plain English counterpart. I then run the line-by-line text file through another Python script (/home/hltcoe/hpaisley/data/legalese/analyze_4.py) that averages each of the columns separately. It is possible for the compression ratios to be larger than 1, but when sampling the legalese to plain English translation data, all of the average ratios remain below 1.
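To make the two compression ratios concrete, here is a minimal sketch (illustrative code and function names, not taken from analyze_3.py), under the assumption that each ratio divides the plain-English count by the legalese count, so values below 1 mean the plain English side is shorter:

```python
# Illustrative sketch of the two compression ratios -- not the actual
# analyze_3.py. Assumption: ratio = plain-English count / legalese count.

def word_ratio(plain: str, legal: str) -> float:
    """Plain-English word count over legalese word count."""
    return len(plain.split()) / len(legal.split())

def char_ratio(plain: str, legal: str) -> float:
    """Plain-English character count over legalese character count."""
    return len(plain) / len(legal)

# Made-up example pair for illustration only.
legal = "The parties hereto agree that the agreement shall terminate forthwith."
plain = "The agreement ends now."

print(word_ratio(plain, legal))            # 0.4 -- plain English uses far fewer words
print(round(char_ratio(plain, legal), 3))
```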
In addition to word and character compression ratio, the script analyzes many other aspects of the parsed data.
- Nesting ratio: the number of clauses and the degree of convolution in the sentence; the more clauses, the more nesting in the sentence
- Comma ratio: to try to determine if there was a significant difference between the number of commas used in plain English vs. legalese
- Fragment ratio: I figured that the parser would be unable to correctly categorize legal jargon phrases, such as “In witness whereof” or “hereby,” and would therefore classify these phrases as fragments
- OOV ratio: “OOV” stands for “out of vocabulary”; my thought was that legalese would contain a higher number of OOV words than plain English
- Verb ratio: legalese tends to use helper verbs instead of action verbs, so legalese should contain more verbs overall than plain English does
- SBAR ratio: definition from “The Penn Treebank” – a clause introduced by a (possibly empty) subordinating conjunction; legalese has more clauses and is more convoluted than plain English
- VBN ratio: this stands for past participle verbs; usually past participles are used more frequently in legalese
- Shall ratio: the word “shall” is common in legalese, but replaced by “must” in plain English
- And ratio: legalese uses a lot of doublets and triplets, such as “terms and conditions” or “signed, sealed and delivered,” so this ratio might show that distinction; plain English, on the other hand, often uses lists instead, which diminishes the use of “and”
- SV Ratio: the subject-before-verb ratio was calculated a little differently from the ratios above: if the legalese sentence had SV order, the ratio was 1, and if it had VS order, the ratio was 0
- NP-VP Ratio: the word distance between the noun phrase and verb phrase; discussed further in Parsed Data Analysis II
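Several of these ratios are simple token counts. The sketch below is my own illustration, not the actual analyze_3.py; in particular, token_ratio and its convention of returning 1.0 when the word is absent from both sides (to avoid division by zero) are assumptions about how such a count-based ratio could work:

```python
import re

# Illustrative per-pair ratio, e.g. for the Shall or And Ratio. Not taken
# from analyze_3.py; the zero-count convention below is my own assumption.

def count(word: str, sentence: str) -> int:
    """Case-insensitive whole-word occurrence count."""
    return len(re.findall(r"\b{}\b".format(re.escape(word)), sentence, re.IGNORECASE))

def token_ratio(word: str, plain: str, legal: str) -> float:
    """Plain-English count divided by legalese count."""
    p_n, l_n = count(word, plain), count(word, legal)
    if l_n == 0:
        # Assumed convention: 1.0 when absent from both sides.
        return 1.0 if p_n == 0 else float(p_n)
    return p_n / l_n

# Made-up example pair for illustration only.
legal = "The tenant shall pay the rent, and the landlord shall give notice."
plain = "The tenant must pay the rent. The landlord must give notice."

print(token_ratio("shall", plain, legal))  # 0.0 -- "shall" disappears in plain English
```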
After running the data through the scripts, I was surprised by some of the ratios. To run the analysis:
./analyze_3.py /home/hltcoe/hpaisley/expts/legalese_parsing/1/data/train/corpus.en.Parsed /home/hltcoe/hpaisley/expts/legalese_parsing/10/data/train/corpus.legal.Parsed 1_analysis.txt
cat 1_analysis.txt | ./analyze_4.py
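The averaging step can be sketched roughly as follows (average_columns is an invented name; I am guessing at analyze_4.py’s behavior of summing each “|||”-separated numeric column and dividing by the number of lines):

```python
# Rough sketch of column averaging over "|||"-separated rows, as I
# understand analyze_4.py's behavior (illustrative, not the real script).

def average_columns(lines):
    """Average each "|||"-separated numeric column across all lines."""
    sums, n = None, 0
    for line in lines:
        vals = [float(field) for field in line.split("|||")]
        if sums is None:
            sums = [0.0] * len(vals)
        for i, v in enumerate(vals):
            sums[i] += v
        n += 1
    return [s / n for s in sums]

rows = ["0.8 ||| 0.9", "0.6 ||| 0.7"]
print([round(a, 6) for a in average_columns(rows)])  # [0.7, 0.8]
```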
The final averages will be output to standard output. For my first plain English and legalese files (the 131 sentences from the internet), I got the following results:
Word Ratio ||| Character Ratio ||| Nesting Ratio ||| Comma Ratio ||| Fragment Ratio ||| OOV Ratio ||| Verb Ratio ||| SBAR Ratio ||| VBN Ratio ||| Shall Ratio ||| And Ratio ||| SV Ratio ||| NP-VP Ratio
0.8647212037 ||| 0.857087615914 ||| 0.86374038229 ||| 0.544165757906 ||| 0.058524173028 ||| 0.953806329379 ||| 0.992319428276 ||| 0.662031988368 ||| 0.48905246577 ||| 0.0 ||| 0.403307888041 ||| 0.977099236641 ||| 0.866280086741
Because it is difficult to read in this format, below is the reformatted output:
Word Ratio – 0.8647212037
Character Ratio – 0.857087615914
Nesting Ratio – 0.86374038229
Comma Ratio – 0.544165757906
Fragment Ratio – 0.058524173028
OOV Ratio – 0.953806329379
Verb Ratio – 0.992319428276
SBAR Ratio – 0.662031988368
VBN Ratio – 0.48905246577
Shall Ratio – 0.0
And Ratio – 0.403307888041
SV Ratio – 0.977099236641
NP-VP Ratio – 0.866280086741
I was not surprised by the Word, Character, Nesting, or Comma Ratios. The Fragment Ratio was interestingly low, whereas the OOV Ratio was quite high. What really surprised me was the Verb Ratio, as it was the closest number to 1.0. As the data shows, the number of verbs is not a distinguishing factor, but the VBN Ratio (those past participle verbs) is a key determiner between plain English and legalese, as legalese contains more than twice the number of past participles that plain English does. The SBAR Ratio was as I expected: legalese contains more unnecessary clauses. The Shall Ratio is also as expected, because plain English should replace every “shall” with “must” or another similar word. The And Ratio was lower than I thought it would be, but I did expect that “and” would appear more in the legalese due to the doublets and lists, as stated above. Finally, the SV Ratio was much higher than I expected. A possible explanation is that legalese does have a subject at the front of the sentence, but it is usually not the main subject, and this results in the common use of past participles.
I plan to look into other distinguishing qualities between legalese and plain English. I would really like to figure out whether there are key words that the parser classifies differently in plain English vs. legalese, though I am not sure how to come up with a proper algorithm for this. I also want to look into differences in the word-length distance between the subject, verb, and object. Lifting the Fog of Legalese is also helping me think about key distinctions and has provided me with some more examples that I will add to my sentence-pair data. I also plan to run these scripts on the data I translated myself to see if the results are similar.
By manually looking through the parsed data, I noticed some discrepancies in the parser output that could lead to slightly skewed ratios. For example, a list in a plain English sentence was not tokenized as a list because it used ‘a’, ‘b’, ‘c’, as opposed to ‘1’, ‘2’, ‘3’:
(VBZ includes)))) (JJ -lrb-) (NP (NP (DT a) (NN -rrb-)) (JJ -lrb-) (X (SYM b)) (JJ -rrb-))
The ‘a’ was classified as a DT (determiner), which makes sense. The ‘b’, however, was classified as a SYM (symbol). This would be fine, except that the tokenizer never establishes this as a list, which would alter my results if I were to look at a List Ratio, because the listing did not appear in the paired legalese sentence. I will continue looking for discrepancies in the parsed data, but most of the sentences I have looked at are parsed correctly, so there is minimal chance of large variance in the results.
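One way to hunt for such quirks systematically is to tally the labels in each parsed line: a SYM on the plain English side with no counterpart in the paired legalese parse would flag exactly this kind of list-marker issue. A rough sketch (tag_counts is my own illustrative helper, not part of the analysis scripts):

```python
import re
from collections import Counter

# Tally every constituent/POS label in a bracketed parse, i.e. the token
# immediately following an opening '('. Illustrative helper for auditing
# parser quirks; not part of analyze_3.py or analyze_4.py.

def tag_counts(parse: str) -> Counter:
    """Count each label appearing right after an opening parenthesis."""
    return Counter(re.findall(r"\((\S+)\s", parse))

# The parse fragment discussed above.
parse = "(VBZ includes) (JJ -lrb-) (NP (NP (DT a) (NN -rrb-)) (JJ -lrb-) (X (SYM b)) (JJ -rrb-))"
print(tag_counts(parse)["SYM"])  # 1 -- the stray list marker tagged as a symbol
```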