Tuesday, January 17, 2012

Kmeans clustering in OpenCV with C++

Kmeans clustering is one of the most widely used unsupervised learning algorithms. If you are not sure what Kmeans is, refer to this article. If you have heard the term Vector Quantization, Kmeans is closely related to it (refer to this article to learn more). Autonlab has a great ppt on Kmeans clustering.

First, I'll talk about kmeans usage in OpenCV with C++, and then I'll explain it with a program. If you are not yet comfortable with OpenCV in C++, please refer to this article; pretty much everything else is the same as in the C API (where you use IplImage*, etc.).

Btw, my other OpenCV programs will be posted here.

The function call in the C++ API of OpenCV takes its input in the following format:
double kmeans(const Mat& samples, int clusterCount, Mat& labels, TermCriteria termcrit, int attempts, int flags, Mat* centers);

The parameters are explained as follows:
  1. samples: It contains the data. Each row represents a feature vector, and each column in a row represents one dimension, so the feature vectors can have multiple dimensions. For example, if we have 50 five-dimensional feature vectors, this matrix will have 50 rows and 5 columns. One interesting thing I've noticed is that kmeans doesn't work with the CV_64F type, so the samples should be stored as CV_32F.
  2. clusterCount: The number of clusters to divide the data into. It is an integer and must be specified beforehand.
  3. labels: It is an output matrix. For an input matrix of the size given above (i.e. 50 x 5), the output is a 50 x 1 matrix. It tells which cluster each feature vector belongs to. Labels run from 0 to (number of clusters - 1).
  4. TermCriteria: It determines when the algorithm stops: maximum number of iterations, accuracy, etc.
  5. attempts: The number of times the algorithm is run with different initial labellings. Refer to the documentation for more detail on this parameter.
  6. flags: It can be
    KMEANS_RANDOM_CENTERS   (for random initialization of cluster centers).
    KMEANS_PP_CENTERS   (for the kmeans++ way of initializing cluster centers).
    KMEANS_USE_INITIAL_LABELS   (for user-defined initialization).
  7. centers: Matrix holding the center of each cluster. If we divide the 50 x 5 matrix of feature vectors into 2 clusters, we will have 2 centers, each with 5 dimensions (i.e. a 2 x 5 matrix).
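
To make the roles of samples, labels and centers more concrete, here is a minimal plain C++ sketch of the iteration that kmeans is built on (Lloyd's algorithm). It is not OpenCV's implementation and does not use OpenCV at all: the function name lloydKmeans, the flat row-major std::vector layout and the caller-supplied initial centers are assumptions made for illustration only. It has no equivalent of attempts or flags and stops either when no label changes or after maxIterations passes.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not part of OpenCV: plain Lloyd iterations on row-major data.
// samples: n rows x dims columns, one feature vector per row.
// centers: k rows x dims columns; holds the initial centers on entry and the
//          final centers on return.
// Returns one label per row, each in the range 0 .. k-1.
std::vector<int> lloydKmeans(const std::vector<float>& samples, std::size_t dims,
                             std::vector<float>& centers, int maxIterations) {
    const std::size_t n = samples.size() / dims;
    const std::size_t k = centers.size() / dims;
    std::vector<int> labels(n, 0);

    for (int iter = 0; iter < maxIterations; ++iter) {
        bool changed = false;

        // Assignment step: give each row the label of its nearest center.
        for (std::size_t i = 0; i < n; ++i) {
            int best = 0;
            float bestDist = std::numeric_limits<float>::max();
            for (std::size_t c = 0; c < k; ++c) {
                float dist = 0.0f;
                for (std::size_t d = 0; d < dims; ++d) {
                    const float diff = samples[i * dims + d] - centers[c * dims + d];
                    dist += diff * diff;  // squared Euclidean distance
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = static_cast<int>(c);
                }
            }
            if (labels[i] != best) {
                labels[i] = best;
                changed = true;
            }
        }

        // Stop once a pass after the first one leaves every label unchanged.
        if (!changed && iter > 0) break;

        // Update step: move each center to the mean of the rows assigned to it.
        std::vector<float> sums(k * dims, 0.0f);
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t c = static_cast<std::size_t>(labels[i]);
            ++counts[c];
            for (std::size_t d = 0; d < dims; ++d)
                sums[c * dims + d] += samples[i * dims + d];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // an empty cluster keeps its old center
            for (std::size_t d = 0; d < dims; ++d)
                centers[c * dims + d] = sums[c * dims + d] / static_cast<float>(counts[c]);
        }
    }
    return labels;
}
```

With four 2D samples (1,1), (2,2), (10,10), (11,11) and initial centers (0,0) and (12,12), the first two rows get label 0, the last two get label 1, and the centers move to (1.5,1.5) and (10.5,10.5). The shapes match the description above: one label per sample row, and one center row per cluster with the same number of columns as the samples.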
Sample program is explained as follows:


  1. How would I use kmeans if I had an array of 2D points (x, y)?
    I have a hard time figuring out how to convert my convex hull points so that I can pass them to the kmeans function.

  2. I'm not sure that I got you right. If you want to use Kmeans on a set of points, you need all the points in a single matrix and you need to know the number of clusters. You can put all the convex hull points (2D points x, y) in a single matrix, specify the number of clusters, and perform kmeans.

  3. OK, I figured it out. Thanks for your sample code.
    I had to turn the array of convex hull points into an array of single float values.
    I am using openFrameworks.

    int sampleCount = contourFinder.getConvexHull(i).size();
    int dimensions = 2;
    float pointsdata[sampleCount*2]; //[] = {1,1, 2,2, 6,6, 5,5, 10,10};

    int cnt = 0;
    for(int a=0; a<sampleCount; a++){
        pointsdata[cnt++] = contourFinder.getConvexHull(i)[a].x; // even index: x
        pointsdata[cnt++] = contourFinder.getConvexHull(i)[a].y; // odd index: y
    }
    Mat points(sampleCount,dimensions, CV_32F,pointsdata);

    int clusterCount = 3; //i want 3 averaged points back

    cv::Mat labels;
    Mat centers(clusterCount, 1, points.type());

    kmeans(points, 3, labels, cv::TermCriteria(), 2,cv::KMEANS_PP_CENTERS, &centers);
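
The flattening step described in this comment (an array of 2D points turned into single float values, one row per point) can be sketched without openFrameworks or OpenCV. The struct Point2D and the function flattenPoints below are hypothetical names used only for illustration; any point type with x and y members would do.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for whatever 2D point type the hull uses;
// only the x and y members matter here.
struct Point2D {
    float x;
    float y;
};

// Packs n points into a row-major buffer of n rows x 2 columns:
// {x0, y0, x1, y1, ...}.
std::vector<float> flattenPoints(const std::vector<Point2D>& pts) {
    std::vector<float> data;
    data.reserve(pts.size() * 2);
    for (const Point2D& p : pts) {
        data.push_back(p.x);  // column 0
        data.push_back(p.y);  // column 1
    }
    return data;
}
```

A buffer like this can then be wrapped as the samples matrix with the cv::Mat constructor that takes a data pointer, e.g. Mat points(n, 2, CV_32F, data.data()). That constructor does not copy, so the vector has to stay alive for as long as the Mat is in use.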