Sunday, March 20, 2011

A simple tutorial on using LIBSVM

This article explains how to use LIBSVM and how to test the accuracy of the resulting classifier. LIBSVM is a tool that lets you
use support vector machines (SVMs) in your own project.

I'll be posting other tutorials/programs using LIBSVM here.

SVMs are used for classifying data of one or more dimensions into two or more classes. At the core, all they do is try to separate two classes from each other as clearly as possible.

labelled training data-------->SVM------->trained SVM

unlabelled/labelled testing data--------->trained SVM--------->predicted labels

SVMs are trained on labelled data (for unlabelled test data, you will need a ground truth to establish the accuracy of the SVM). I recommend reading the LIBSVM documentation completely (it is less than 16 pages). After reading it you will have some insight into what LIBSVM is about and how you can use it in your project.

Basically, an SVM takes in a set of labelled feature vectors (values in multiple dimensions) during training and outputs predicted labels in the testing phase. Given labelled test data, we can measure how accurately it differentiates the classes.

Suppose we want to differentiate a square from a rectangle. We have a 2-dimensional feature vector: length and width.

The following logic defines the labels for our feature vectors, and thus what the SVM should learn.

If length = width, it's a square
else it's a rectangle.

If this is the logic we want the SVM to learn, we have to give it data in the following format:

1 1:2 2:2
1 1:3 2:3
1 1:4 2:4
1 1:5 2:5

-1 1:2 2:3
-1 1:1 2:2
-1 1:2 2:4
-1 1:3 2:1
.....
.....
....

You have probably figured out what 1 and -1 mean. In the data above, the 1st column is the label (square = 1, rectangle = -1). The 2nd and 3rd columns are the length and width. The prefixes 1: and 2: tell LIBSVM that the values belong to the 1st and 2nd dimensions of the data, respectively.
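If you would rather generate such a file than type it by hand, here is a minimal Python sketch. The function names to_libsvm_line and label_shape are my own for illustration, they are not part of LIBSVM.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' (indices start at 1)."""
    pairs = [f"{index}:{value}" for index, value in enumerate(features, start=1)]
    return " ".join([str(label)] + pairs)


def label_shape(length, width):
    """Return 1 for a square and -1 for a rectangle, as in the data above."""
    return 1 if length == width else -1


# (length, width) samples taken from the rows shown above
samples = [(2, 2), (3, 3), (2, 3), (3, 1)]
lines = [to_libsvm_line(label_shape(l, w), [l, w]) for l, w in samples]
print("\n".join(lines))
```

Writing these lines to a text file, one per row, gives a file in the same layout as the data above.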

The same format works for other kinds of data. If you have a series of images as input, each image gets its own feature vector, whose size depends on the image size and colorspace.

If you have a 640 x 480 black and white image as input (grayscale colorspace), you will have data of 307200 dimensions (each dimension with a range of 0 to 255). It would look something like this:

1 1:255 2:233 3:0 4:44 ........307200:233
-1 1:55 2:3 3:20 4:240 ........307200:233
1 1:155 2:123 3:50 4:42 ........307200:233

Here 1 and -1 represent the class labels (which you define yourself). Each row represents an image.
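As a rough sketch of how an image becomes one such row, the Python snippet below flattens a grayscale image row by row. I assume the image is already loaded as a list of rows of 0 to 255 integers; image_to_libsvm_line is a name I made up, and the 2 x 2 image is only a stand-in for a real 640 x 480 one.

```python
def image_to_libsvm_line(label, image):
    """Flatten a grayscale image (list of rows of 0-255 ints) into one LIBSVM row."""
    pixels = [pixel for row in image for pixel in row]
    pairs = [f"{index}:{pixel}" for index, pixel in enumerate(pixels, start=1)]
    return f"{label} " + " ".join(pairs)


# A tiny 2 x 2 stand-in image; a 640 x 480 image works the same way
tiny_image = [[255, 233],
              [0, 44]]
line = image_to_libsvm_line(1, tiny_image)
print(line)

# Number of dimensions for a 640 x 480 grayscale image
dimensions = 640 * 480
print(dimensions)
```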

Now let's download LIBSVM and get started. I would also suggest installing Python.



Now that we have downloaded and installed LIBSVM, let's try a simple classification.

Download the testing data and training data and put them in a folder.

In the downloaded data, newtraining.txt represents the training data and newtesting.txt represents the testing data. 

Copy svm-predict, svm-train, svm-scale (easy.py calls it too), easy.py, and grid.py from the folder where you installed LIBSVM to the folder where you saved the testing and training data.

Now you can do this in 2 ways:
1) Manually:

Go to command prompt / Shell Terminal and give the following command:
$svm-train <training_data>

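where <training_data> is the text file that contains the training data.

Executing the above command trains the SVM and writes the model to the file <training_data>.model. It prints optimization statistics, but it does not report accuracy unless you ask for cross-validation with the -v option.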



Then issue the following command:

$svm-predict <testing_data> <training_data.model> outputlabels.txt

where <testing_data> is the text file that contains the testing data,
<training_data.model> is the model file generated by the previous step (svm-train <training_data>), and
outputlabels.txt is a text file that stores the predicted labels for the input feature vectors in the <testing_data> file.

Executing the above command shows the accuracy of the model generated by LIBSVM.
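To see what that accuracy number means, here is a small Python sketch that compares the true labels in a testing file with the predicted labels in an output file. This is my own illustration, not svm-predict's code; the sample lines are made up and kept in memory so the snippet runs without any files.

```python
def true_labels(libsvm_lines):
    """Read the label (first token) from each non-empty LIBSVM-format line."""
    return [line.split()[0] for line in libsvm_lines if line.strip()]


def accuracy(expected, predicted):
    """Percentage of predicted labels that match the expected labels."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    # Compare numerically so that "1", "+1" and "1.0" count as the same label
    correct = sum(float(e) == float(p) for e, p in zip(expected, predicted))
    return 100.0 * correct / len(expected)


# Made-up stand-ins for the contents of <testing_data> and outputlabels.txt
testing_lines = ["1 1:2 2:2", "-1 1:2 2:3", "1 1:4 2:4", "-1 1:3 2:1"]
output_labels = ["1", "1", "1", "-1"]

acc = accuracy(true_labels(testing_lines), output_labels)
print(f"Accuracy = {acc:.2f}%")
```

Here three of the four predictions match, so the sketch reports 75%.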




If you follow the above procedure, it yields an accuracy of 51.85%.
This is low for an SVM. The reason is that we have not scaled the data or selected good SVM parameters for the given input data. To do so, we use the automatic way of using LIBSVM, described in the next part.

2) Automatically:
I assume that you have installed Python and gnuplot. Open easy.py in the IDLE Python editor or any other text editor and go to the lines that say the following:


svmscale_exe = .....
svmtrain_exe = .....
svmpredict_exe = .......
grid_py = ......
gnuplot_exe = .....

Basically, this is the code that points to the svm-train, svm-predict, etc. executables. Edit these lines so that they point to where those executables are on your machine.

Once the above part is done, just issue the command:

$python easy.py <training_data> <testing_data>

Here <training_data> and <testing_data> refer to the training and testing files. The above command automatically scales the data (you should know what scaling is if you've read the LIBSVM documentation).
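If you have not read that part yet, the idea of scaling can be sketched in a few lines of Python. This is my own simplified illustration of linearly mapping one feature to the range -1 to 1, not svm-scale's actual code; fit_range and scale are hypothetical names. Note that the test data is scaled with the minimum and maximum found in the training data.

```python
def fit_range(values):
    """Find the minimum and maximum of one feature in the training data."""
    return min(values), max(values)


def scale(value, lo, hi, lower=-1.0, upper=1.0):
    """Linearly map value from [lo, hi] to [lower, upper]."""
    if lo == hi:
        return 0.0  # sketch choice: a constant feature carries no information
    return lower + (upper - lower) * (value - lo) / (hi - lo)


# One feature (for example, a pixel intensity) across the training samples
train_values = [0, 5, 10]
lo, hi = fit_range(train_values)
scaled_train = [scale(v, lo, hi) for v in train_values]
print(scaled_train)

# Test values reuse the training range, so they may fall outside [-1, 1]
scaled_test = scale(20, lo, hi)
print(scaled_test)
```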

Beyond scaling, the command runs cross-validation and selects good SVM parameters (C and gamma for the RBF kernel) automatically, so you get a reliable SVM without having to worry about the training and testing parameters yourself.
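The parameter selection is a grid search: try many (C, gamma) pairs and keep the one with the best cross-validation accuracy. Below is a rough Python sketch of that idea. The exponent ranges are what I believe grid.py uses by default (treat that as an assumption and check your copy), and toy_cv_accuracy is a made-up stand-in for actually running cross-validation with svm-train.

```python
import math


def exponents(begin, end, step):
    """List begin, begin+step, ... up to and including end."""
    values = []
    current = begin
    while (step > 0 and current <= end) or (step < 0 and current >= end):
        values.append(current)
        current += step
    return values


def candidate_parameters(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Build (C, gamma) pairs on a log2 grid; the default ranges are an assumption."""
    return [(2.0 ** c, 2.0 ** g)
            for c in exponents(*log2c)
            for g in exponents(*log2g)]


def grid_search(cv_accuracy, candidates):
    """Return the (C, gamma) pair with the highest cross-validation score."""
    return max(candidates, key=lambda pair: cv_accuracy(*pair))


def toy_cv_accuracy(c, gamma):
    """Hypothetical score that peaks at C = 2^3, gamma = 2^-5 (not a real SVM)."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 5)


candidates = candidate_parameters()
best_c, best_gamma = grid_search(toy_cv_accuracy, candidates)
print(len(candidates), best_c, best_gamma)
```

In a real run the score for each pair would come from cross-validation on the scaled training data, which is the slow part of the automatic way.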




The output on the terminal/command prompt shows the accuracy of the SVM. The files containing the predicted output labels are also written to the current folder.

Using the automatic way, we get an accuracy of 74%.

The automatic way is the best way to use LIBSVM, as it scales the training and test data, does cross-validation, and selects the best SVM parameters.

16 comments:

  1. Is there an easy.py equivalent tool or utility developed in Java to work with Java version of LibSVM?

  2. I don't think so. But easy.py uses the svm-scale, svm-train, and svm-predict executables, which are written in C.

    All I can suggest is, go through the easy.py code and change the above executables to the Java classes of the same, i.e. svm_scale.class, svm_train.class, svm_predict.class, and they should work (theoretically).

    The above suggested solution might just work. Otherwise, you can go through the source code, see what it does (scales the training and testing data, gives them to grid.py to estimate the best training parameters, trains and then tests) and do the same in Java.

  3. For unlabelled data, how to classify ?

  4. @above: Please be clearer in your question. If you are solving a machine learning problem with unlabelled data, then it is an unsupervised learning problem. If you are dealing with unlabelled data in the testing phase, read this article carefully; it has the answer.

  5. hai,
    i have data format like ,time, iperf(internet performance),thruput min,thruputmax,..
    how to identify the class label and conversion of the format..

  6. You need to know the class labels beforehand in order to use an SVM. Please read up on basic pattern recognition and machine learning.

  7. and how can i manually decide these parameters...???

  8. easy.py takes care of most of the stuff. Run it on the data and it will generate the appropriate parameters (available in the model files created). Otherwise, go with a trial and error strategy to guess the parameters. I suggest using the scikit-learn toolkit in Python. With it, you can manually try out kernel type, C, and gamma values and look at the results.

  9. i want to apply svm to annotate images how to do that?

  10. @Ayesh: Use pixel values. An 8 x 8 image translates to a feature vector of length 64. As in
    1 1:255 2:0 ...... 64:223
    1 1:255 2:0 ...... 64:223
    1 1:255 2:0 ...... 64:223
    1 1:255 2:0 ...... 64:223
    -1 1:25 2:10 ...... 64:1
    -1 1:22 2:11 ...... 64:22
    -1 1:23 2:84 ...... 64:10
    -1 1:24 2:58 ...... 64:5

  11. How can i run python? I have python installed at C:\Python27 , how can i run

    $python easy.py

    thankss.

  12. How can i run python? I have python installed at C:\Python27 , how can i run

    $python easy.py

    thankss.

  13. Put your data files in the folder where easy.py is installed. The data should be in the given order.

  14. i want to use libsvm to identify 4 signs how can i set labels
    each image in binary 30X30 pix
    please help me to setup this i have no idea how to do it

  15. @Ellian Laura: You should set the system variables. Go to My Computer -> System Properties -> Advanced -> Environment Variables -> System Variables. In the system variables, edit the "Path" variable, add "C:\Python27\;" at the end of the text, and save it. Close everything. Then open "CMD" from Windows->Run. If you can run the "python" command successfully, you have set the path correctly.

  16. @Rishan: Read the article and use the technique mentioned above. Instead of the feature vectors mentioned above, use your own feature vectors. Each will have 30 x 30 = 900 dimensions. Keep separate training and testing feature vectors.
