Support Vector Machines

Support Vector Machines (SVMs) are supervised learning models used for classification. Classification is the problem of determining which category an unseen observation belongs to, using a classifier trained on labeled examples. There is a good article on the theory behind SVMs that can be accessed here: SVM. I am going to use an SVM classifier to help separate legalese sentences from plain English sentences.

I downloaded the SVM-Light classifier from Thorsten Joachims at Cornell University. After installing the classifier, I had to create a Python script that would correctly format the input (/home/hltcoe/hpaisley/data/legalese/format.py). This script formats the input data to look like this (/home/hltcoe/hpaisley/data/legalese/svm_light/sentences_2/train.dat):

1 1:34 2:102 3:58 4:3 6:1 7:7 9:1 11:2 12:30

The initial 1 is the category label (1 means plain English and -1 means legalese), and the following numbers are feature:value pairs; features whose value is zero can be left out, which is why indices 5, 8, and 10 are missing from the example line. There were 12 different features when I first ran this classifier. The test.dat file is formatted identically to the train.dat file.
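format.py itself is not reproduced here, but a minimal sketch of the formatting step might look like the following (the function name and details are mine, not the actual script):

# Rough sketch of the formatting step: turn a label and a dict of feature
# values into one line in the sparse SVM-light input format.
# label is 1 (plain English) or -1 (legalese);
# features maps a 1-based feature index to its value.
def to_svm_light_line(label, features):
    pairs = " ".join("%d:%g" % (idx, val)
                     for idx, val in sorted(features.items())
                     if val != 0)          # zero-valued features can be omitted
    return "%d %s" % (label, pairs)

# Reproduces the example line shown above.
print(to_svm_light_line(1, {1: 34, 2: 102, 3: 58, 4: 3, 6: 1,
                            7: 7, 9: 1, 11: 2, 12: 30}))

The model file looks like this (/home/hltcoe/hpaisley/data/legalese/svm_light/sentences_2/model):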

SVM-light Version V6.02
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty# kernel parameter -u
12 # highest feature index
262 # number of training documents
246 # number of support vectors plus 1
-1.3146557 # threshold b, each following line is a SV (starting with alpha*y)
-5.367323281782505629269527935854e-05 1:38 2:114 3:78 4:1 6:6 7:24 8:2 9:3 12:26 #

Following the threshold line, there is one line like the last one shown above for each support vector (245 in this model), each beginning with that support vector's alpha*y value.
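Since the kernel type in this model is 0 (a linear kernel), the support vector lines can be collapsed into a single weight vector: each line contributes its alpha*y value times its feature vector. A rough sketch of reading the model back in, assuming the layout shown above:

# Sketch of collapsing a linear-kernel SVM-light model into a weight vector.
# The line offsets follow the header layout shown above.
def load_linear_model(path):
    lines = open(path).readlines()
    num_features = int(lines[7].split("#")[0])   # "12 # highest feature index"
    b = float(lines[10].split("#")[0])           # "-1.3146557 # threshold b, ..."
    w = [0.0] * (num_features + 1)               # features are 1-based; index 0 unused
    for sv_line in lines[11:]:                   # one line per support vector
        parts = sv_line.split("#")[0].split()
        if not parts:
            continue
        alpha_y = float(parts[0])                # alpha*y for this support vector
        for feat in parts[1:]:
            idx, val = feat.split(":")
            w[int(idx)] += alpha_y * float(val)
    return w, b

If I am reading the SVM-light conventions correctly, the decision value that svm_classify computes for a test vector x is then the dot product of w and x minus b, with positive values predicted as plain English (class 1).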

I then ran the sentence pairs I translated through the classifier as the testing data, using the features I had specified in the previous post, Parsed Data Analysis II. The sentences I found on the internet served as the training data.

svm_learn sentences_2/train.dat sentences_2/model

svm_classify sentences_2/test.dat sentences_2/model sentences_2/predictions

The svm_classify program sends the results to standard output:

Reading model…OK. (245 support vectors read)
Classifying test examples..100..200..done
Runtime (without IO) in cpu-seconds: 0.00
Accuracy on test set: 58.78% (154 correct, 108 incorrect, 262 total)
Precision/recall on test set: 55.96%/82.44%

The results I got were very poor, probably because of the small training set and the small number of features.
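The accuracy and precision/recall figures that svm_classify reports can also be checked by hand: the predictions file contains one decision value per test example, and its sign gives the predicted class. A rough sketch, assuming the test.dat and predictions file names from the commands above:

# Recompute accuracy, precision, and recall from the svm_classify output.
# File names are assumed to match the commands above.
def evaluate(test_path="sentences_2/test.dat", pred_path="sentences_2/predictions"):
    gold = [int(line.split()[0]) for line in open(test_path) if line.strip()]
    pred = [1 if float(line) > 0 else -1 for line in open(pred_path)]
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == -1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == -1)
    accuracy = float(correct) / len(gold)
    precision = float(tp) / (tp + fp) if tp + fp else float("nan")
    recall = float(tp) / (tp + fn) if tp + fn else float("nan")
    return accuracy, precision, recall

Precision is undefined when nothing is predicted as class 1, which is presumably where the -nan values in some of the later runs come from.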

Just to make sure the classifier was working correctly, I decided to run svm_learn and svm_classify on the same train.dat, consisting of the sentences I translated. The accuracy should be very high when the same data is used for both training and testing. These were my results, in line with what I expected:

Accuracy on test set: 100.00% (252 correct, 0 incorrect, 252 total)
Precision/recall on test set: -nan%/-nan%

I then ran the same kind of test (using train.dat for both learning and classifying) with the example data provided with the classifier. I got the following results:

Accuracy on test set: 99.75% (1995 correct, 5 incorrect, 2000 total)
Precision/recall on test set: 99.70%/99.80%

I am unsure why the classifier could classify my data with 100% accuracy but the example data with only 99.75% accuracy. Still, the classifier seems to be working properly, so I focused on collecting more data and more features. I changed my analyze_2.py script to include additional features: the script now also compares every word and phrase in each sentence against the legalese/Latin words and phrases I collected from the internet. The legalese_phrases.txt file has 680 different legalese and Latin phrases. Every time a legalese phrase appears in a sentence, the corresponding feature is incremented by 1. This way the classifier should be able to recognize that legalese sentences have higher values for the legalese phrase features than plain English sentences do.
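A sketch of what that phrase check might look like (the feature numbering, starting at 13 after the original 12 features, and the substring matching are my assumptions, not necessarily what analyze_2.py does):

# Sketch of the legalese-phrase features: one feature per phrase from
# legalese_phrases.txt, incremented each time that phrase appears in a sentence.
# phrases: the 680 entries read from legalese_phrases.txt.
def legalese_phrase_features(sentence, phrases, first_index=13):
    sentence = sentence.lower()
    feats = {}
    for offset, phrase in enumerate(phrases):
        count = sentence.count(phrase.lower())   # occurrences of this phrase
        if count:
            feats[first_index + offset] = count
    return feats

These counts would then be merged with the original 12 features before each sentence is written out in the SVM-light format shown earlier.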

I ran the classifier again, this time using the 252 sentences that I translated as the training data (a very small training set) and the 66 sentences from the Lifting the Fog of Legalese book as the testing data. Even with the large number of features, the data sets, and specifically the training set, were too small to classify accurately. Here were my results:

Accuracy on test set: 50.00% (33 correct, 33 incorrect, 66 total)
Precision/recall on test set: -nan%/0.00%

50% accuracy is essentially the worst result the classifier can give, since it is no better than guessing which category a sentence belongs in. I am surprised that this was worse than the run with only 12 features, because with 12 features the learning step misclassified 101 of the 252 sentences:

Reading examples into memory…100..200..OK. (252 examples read)
Setting default regularization parameter C=0.0001
Optimizing…done. (345 iterations)
Optimization finished (101 misclassified, maxdiff=0.00007).

The run with the many features, on the other hand, did not misclassify any of the sentences:

Reading examples into memory…100..200..OK. (252 examples read)
Setting default regularization parameter C=0.0000
Optimizing.done. (2 iterations)
Optimization finished (0 misclassified, maxdiff=0.00000).

Just to reiterate, the two outputs above are from the same sentence pairs. In the top output, the classifier only had 12 features to work with, whereas in the bottom output, the classifier had closer to 700 different features.

I needed to focus on finding more training data. Juri gave me the path to a large legalese corpus (with 1.3 million sentences) and an even larger plain English corpus (with over 8 million sentences). These can be accessed at:

/home/hltcoe/jganitkevic/experiments/ppdb/data/jrc/fr/data/train and

/home/hltcoe/jganitkevic/experiments/ppdb/data/gale/ar/data/train, respectively.

He asked me to cut down the plain English corpus so that each corpus made up approximately half of the classifier's training data. I ran both corpora through my analyze_2.py script to get the features for each sentence and then through the format.py script to create the proper formatting for the classifier.
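A rough sketch of that cut-down step (the file names and target size are placeholders, not the actual paths):

# Sketch of trimming the plain English corpus (over 8 million sentences) so the
# two corpora contribute roughly equal halves of the training data.
import random
TOTAL = 8000000      # approximate size of the plain English corpus
TARGET = 1000000     # placeholder target, chosen to roughly balance the corpora
keep_prob = float(TARGET) / TOTAL
random.seed(0)       # make the subsample reproducible
with open("plain_english.txt") as infile, open("plain_english_cut.txt", "w") as outfile:
    for line in infile:
        if random.random() < keep_prob:
            outfile.write(line)

I then ran the combined corpus (approximately 2.3 million sentences) through the classifier as the training data. These were the results from svm_learn: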

WARNING: Relaxing KT-Conditions due to slow progress! Terminating!
done. (1934453 iterations)
Optimization finished (741721 misclassified, maxdiff=0.65934).
Runtime in cpu-seconds: -961.06
Number of SV: 1561192 (including 1561170 at upper bound)
L1 loss: loss=1498925.08650
Norm of weight vector: |w|=3.14609
Norm of longest example vector: |x|=554.65575
Estimated VCdim of classifier: VCdim<=2511868.97226
Computing XiAlpha-estimates…done
Runtime for XiAlpha-estimates in cpu-seconds: 0.48
XiAlpha-estimate of the error: error<=67.48% (rho=1.00,depth=0)
XiAlpha-estimate of the recall: recall=>39.95% (rho=1.00,depth=0)
XiAlpha-estimate of the precision: precision=>39.95% (rho=1.00,depth=0)
Number of kernel evaluations: 134197941
Writing model file…done

I then used the sentences I translated as the testing data. The results:

Reading model…OK. (1561192 support vectors read)
Classifying test examples..100..200..done
Runtime (without IO) in cpu-seconds: 0.00
Accuracy on test set: 55.56% (140 correct, 112 incorrect, 252 total)
Precision/recall on test set: 53.76%/79.37%

The results were lower than I expected. Part of the problem may be that the learning process was overloaded: the warning above shows that the optimization terminated early due to slow progress rather than running to convergence. Given these poor results, I need to focus on the features. For my subsequent classification runs, I also plan to decrease the number of training sentences so that the learning step does not take three days to run.
