pspp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: K-Means Clustering


From: Mehmet Hakan Satman
Subject: Re: K-Means Clustering
Date: Thu, 10 Mar 2011 03:35:33 -0800 (PST)


Hi John

it is good to receive such an email from you. 

Actually it is an important case to be engaged such a project like PSPP for me. 


I think i can add these functionalties to PSPP if the members of the project help me. Another way, i can try implementing other clustering algorithms to PSPP, such as single linkage and complete linkage with several distance functions. 



Note: If i am not wrong, "implementing such a command in PSPP" means making the k-means clustering codes be callable using the 
"QUICK CLUSTER x y
  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(2) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL ANOVA CLUSTER DISTAN."
syntax. is that right?

also: I have no experience about coding GTK+

best regards


From: John Darrington <address@hidden>
To: Mehmet Hakan Satman <address@hidden>
Cc: address@hidden
Sent: Thu, March 10, 2011 1:07:59 PM
Subject: Re: K-Means Clustering

Hello Mehmet,

I think the code is well written and is something that we could use in PSPP.

Like you say, perhaps it doesn't lend itself to a generalized clustering algorithm,
but I think it would nicely satisfy the needs of the QUICK CLUSTER command.

Would you like to try implementing such a command in PSPP ?  Ask on the mailing list
if you need help getting started.

Regards,


John

On Wed, Mar 09, 2011 at 05:51:44AM -0800, Mehmet Hakan Satman wrote:
    Hi John,
   
    I am sending the .h, the .c and the test.c file as attachments.
   
   
    My implementation clusters a random data set with number of cases 10000, the
    number of parameters 10 and the number of clusters 10 in average 0.45 seconds.
   
   
    Algorithm starts with the randomly selected cluster centers and in each single
    steps these centers are changed. Distances between these centers and the
    observations must be calculated in every step. I am not sure if it can be
    implemented a "create-one-and-use-anywhere" distances in this algorithm. In
    hierarchial cluster algorithms such as "single linkage clustering", it is
    possible to build a distance matrix and use it during the clustering.
   
    Best Regards.
   
    Hakan.
   
   
   
   
    ________________________________
    From: John Darrington <address@hidden>
    To: Mehmet Hakan Satman <address@hidden>
    Cc: address@hidden
    Sent: Wed, March 9, 2011 3:21:59 PM
    Subject: Re: K-Means Clustering
   
    Hi Mehmet!
   
    As I mentioned, the CLUSTER command is soemthing which I think it would be
    great to support.
   
    One issue with clustering is its memory complexity. It requires O(n^2) where
    n is the number of cases being clustered.
    Have you tested your algorithm with large numbers of cases?
   
    Maybe Ben has some ideas how an efficient distance matrix can be  implemented
    in PSPP (maybe sparse-array.c can help?) .
   
    In any case, I'd be interested to see your code, and the results of your
    comparisons.  Can you post them somewhere?
   
   
    J'
   
    On Tue, Mar 08, 2011 at 05:56:37AM -0800, Mehmet Hakan Satman wrote:
          hi everybody,
       
          I am interested in PSPP and i read about something about the needs for
          developing some functionality.
          I implemented a k-means clustering  library using the GNU scientific
    library and
   
          sent an informative e-mail to John. He suggested me to join this group and
    share
   
          my ideas with the stuff.
       
       
          I compared the results with SPSS outputs. The analysis of variance table is
    not
   
          completed but we may add this feature.
       
          I would be glad to integrate something to PSPP and work with you.
       
          What do you think about this?
       
       
             
    --
    PGP Public key ID: 1024D/2DE827B3
    fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
    See http://pgp.mit.edu or any PGP keyserver for public key.
   
   
         
    /*
        kmeans - K-Means Clustering Library for the GNU PSPP (People Should Prefer PSPP) project.
        Copyright (C) 2011  Dr.Mehmet Hakan Satman <address@hidden>
   
        This program is free software: you can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
        the Free Software Foundation, either version 3 of the License, or
        (at your option) any later version.
   
        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
        GNU General Public License for more details.
   
        You should have received a copy of the GNU General Public License
        along with this program.  If not, see <http://www.gnu.org/licenses/>.
    */
    #include <stdio.h>
    #include <math.h>
    #include <gsl/gsl_matrix.h>
    #include <gsl/gsl_statistics.h>
   
   
    /*
    Struct KMeans:
    Holds all of the information for the functions.
    */
    struct Kmeans {
        gsl_matrix *data;          //User Data (Given by the user)
        gsl_matrix *centers;        //Centers for groups
        gsl_vector_int *index;      //Integer values from zero to ngroups. Shows group of an observation.
        gsl_vector *v1,*v2,*v3;    //Temporary vector for program. Do not use.
        int ngroups;                //Number of group. (Given by the user)
        int n;                      //Number of observations. (Given by the user)
        int m;                      //Number of observations. (Given by the user)
        int maxiter;                //Maximum number of iterations (Given by the user)
        int lastiter;              //Show at which iteration it found the solution.
        double *weights;            //Double values for handling weights for program use.
    };
   
    /*
    Creates a Kmeans structure for a given data 'data', number of observations 'n',
    number of columns 'm', number of groups 'ngroups' (it is usually called 'k') and
    number of maximum iterations 'maxiter'.
    */
    struct Kmeans* kmeans_create(double* data, int n, int m, int ngroups, int maxiter);
   
   
    /*
    Randomly chooses centers of 'ngroup' groups.
    These centers are initial and will be changed iteratively.
    */
    void kmeans_randomize_centers(struct Kmeans *kmeans);
   
    /*
    Prints the given Kmeans structure
    */
    void kmeans_print(struct Kmeans* kmeans);
   
   
    /*
    Calculates the squared euclidean distance between vector v1 and v2
    */
    double kmeans_euclidean_distance(gsl_vector *v1, gsl_vector *v2);
   
   
    /*
    Calculates and returns the number of elements contained in the
    specific group.
    */
    int kmeans_num_elements_group(struct Kmeans *kmeans, int group);
   
   
    /*
    Calculates group centers. Those centers are calculated using iteration
    and they are usually different from the initial centers. Recalculation
    process remains while the last two solutions are not equal.
    */
    void kmeans_recalculate_centers(struct Kmeans *kmeans);
   
   
    /*
    Constructs the index variable. This variable shows the current
    group of the each single observation.
    */
    void kmeans_calculate_indexes(struct Kmeans *kmeans);
   
   
    /*
    Checks if the last two index variables are equal. If they are equal,
    algorithm can not find a better classification anymore and stops.
    */
    int kmeans_check_converge(gsl_vector_int *current, gsl_vector_int *old);
   
   
    /*
    This is the main method of the algorithm.
    */
    void kmeans_cluster(struct Kmeans *kmeans);
   
   
   

    #include "kmeans.h"
   
   
    struct Kmeans*
    kmeans_create(double* data,
                  int n,
                  int m,
                  int ngroups,
                  int maxiter){
        int i,j;
        struct Kmeans *k = (struct Kmeans*) malloc(sizeof(struct Kmeans));
        k->data="">        k->centers=gsl_matrix_alloc(ngroups, m);
        k->ngroups=ngroups;
        k->index=gsl_vector_int_alloc(n);
        k->n=n;
        k->m=m;
        k->maxiter=maxiter;
        k->lastiter=0;
        for (i=0;i<n;i++){
            for(j=0;j<m;j++){
                gsl_matrix_set(k->data, i, j, data[i * m +j]);
            }
        }
        k->weights = (double*)malloc(sizeof(double) * k->index->size);
        k->v1 = gsl_vector_alloc(k->centers->size2);
        k->v2 = gsl_vector_alloc(k->centers->size2);
        k->v3 = gsl_vector_alloc(k->n);
        return(k);
    }
   
    void
    kmeans_randomize_centers(struct Kmeans *kmeans){
        int i,j;
        double min=0,max=0;
        min=gsl_matrix_min(kmeans->data);
        max=gsl_matrix_max(kmeans->data);
        gsl_matrix_minmax(kmeans->data, &min, &max);
        for (i=0;i<kmeans->centers->size1;i++){
            for (j=0;j<kmeans->centers->size2; j++){
                gsl_matrix_set(kmeans->centers, i, j, min + (((double)rand())/RAND_MAX)*(max-min));
            }
        }
    }
   
   
   
    void
    kmeans_print(struct Kmeans* kmeans){
        int i,j;
        printf("Number of groups: %d\n",kmeans->ngroups);
        printf("Centers:\n");
        for (i=0;i<kmeans->centers->size1;i++) {
            for (j=0;j<kmeans->centers->size2;j++){
                printf("%f ",gsl_matrix_get(kmeans->centers, i,j));
            }
            printf("\n");
        }
   
        printf("Index:\n");
        for (i=0;i<kmeans->n;i++){
            printf("%d ",gsl_vector_int_get(kmeans->index, i));
        }
        printf("\nLast iter: %d\n",kmeans->lastiter);
    }
   
   
    void print_matrix(gsl_matrix *m){
        int i,j;
        for (i=0;i<m->size1;i++){
            for (j=0;j<m->size2;j++){
                printf("%f ",m->data[i * m->size2 + j]);
            }
            printf("\n");
        }
    }
   
   
    double
    kmeans_euclidean_distance(gsl_vector *v1,
                              gsl_vector *v2){
        double result=0.0;
        double val;
        int i;
        for (i=0;i<v1->size;i++){
            val=v1->data[i] - v2->data[i];
            result+=val*val;
        }
        return(result);
    }
   
   
   
    int
    kmeans_num_elements_group(struct Kmeans *kmeans, int group){
        int total=0;
        int i;
        for (i=0;i<kmeans->n;i++){
            if(gsl_vector_int_get(kmeans->index,i)==group) total++;
        }
        return(total);
    }
   
   
    void
    kmeans_recalculate_centers(struct Kmeans *kmeans){
        int i,j,h;
        int elm;
        double mean;
        gsl_vector *v1=kmeans->v3;
   
        for (i=0;i<kmeans->ngroups;i++){
            elm=kmeans_num_elements_group(kmeans,i);
            for (j=0;j<kmeans->index->size;j++){
                if(gsl_vector_int_get(kmeans->index,j)==i){
                    kmeans->weights[j]=1.0;
                }else{
                    kmeans->weights[j]=0.0;
                }
            }
   
            for (h=0;h<kmeans->m;h++){
                gsl_matrix_get_col(v1,kmeans->data, h);
                mean=gsl_stats_wmean(kmeans->weights, 1, v1->data ,1, v1->size);
                gsl_matrix_set(kmeans->centers, i,h, mean);
            }
        }
    }
   
   
    void
    kmeans_calculate_indexes(struct Kmeans *kmeans){
        double dist;
        double mindist;
        int bestindex=0;
        int i,j;
        gsl_vector *v1 = kmeans->v1;
        gsl_vector *v2 = kmeans->v2;
   
        for (i=0;i<kmeans->n; i++){
            mindist=INFINITY;
            gsl_matrix_get_row(v1, kmeans->data, i);
            for (j=0;j<kmeans->ngroups; j++){
                gsl_matrix_get_row(v2, kmeans->centers,j);
                dist=kmeans_euclidean_distance(v1,v2);
                if(dist<mindist){
                    mindist=dist;
                    bestindex=j;
                }
            }
            gsl_vector_int_set(kmeans->index,i,bestindex);
        }
    }
   
   
   
    int
    kmeans_check_converge(gsl_vector_int *current,
                          gsl_vector_int *old){
        int i;
        int total=0;
        for (i=0;i<current->size;i++) {
            if(current->data[i] == old->data[i]) total++;
            old->data[i]=current->data[i];
        }
        return(current->size-total);
    }
   
   
   
    gsl_matrix*
    kmeans_getGroup(struct Kmeans *kmeans, int index){
            int i;
            int j=0;
            int elm=kmeans_num_elements_group(kmeans,index);
            gsl_matrix *agroup=gsl_matrix_alloc(elm, kmeans->data->size2);
            gsl_vector *v1=gsl_vector_alloc(kmeans->data->size2);
            for(i=0;i<kmeans->data->size1;i++){
                if(kmeans->index->data[i]==index){
                    gsl_matrix_get_row(v1, kmeans->data, i);
                    gsl_matrix_set_row(agroup, j, v1);
                    j++;
                }
            }
            return(agroup);
    }
   
   
   
    void
    kmeans_cluster(struct Kmeans *kmeans){
        int i;
        gsl_vector_int *oldindex = gsl_vector_int_alloc(kmeans->index->size);
        kmeans_randomize_centers(kmeans);
   
        for (i=0;i<kmeans->maxiter;i++){
            kmeans->lastiter=i;
            kmeans_calculate_indexes(kmeans);
            kmeans_recalculate_centers(kmeans);
            if(kmeans_check_converge(kmeans->index, oldindex)==0) break;
        }
   
    }
   
   

    #include "kmeans.h"
   
    int
    main()
    {
        srand(12345);
        int i,j;
        double val;
        int n=10000; //Number of cases
        int m=10;    //Number of variables
        int k=10;    //Number of groups
        int maxiter=100;
        double * data = "" malloc(n* m * sizeof(double));
        for (i=0;i<n;i++){
            for (j=0;j<m;j++){
                val= ((float)rand())/(float)RAND_MAX;
                data[i * m + j]=val*(i+j);
            }
        }
   
        struct Kmeans *km = kmeans_create(data, n, m ,k, maxiter);
        kmeans_cluster(km);
        //kmeans_print(km);
   
        return 0;
    }


--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]