Re: K-Means Clustering

Hi John

it is good to receive such an email from you. 

Actually it is an important case to be engaged such a project like PSPP for me. 

I think i can add these functionalties to PSPP if the members of the project help me. Another way, i can try implementing other clustering algorithms to PSPP, such as single linkage and complete linkage with several distance functions. 

Note: If i am not wrong, "implementing such a command in PSPP" means making the k-means clustering codes be callable using the 

"QUICK CLUSTER x y

  /MISSING=LISTWISE
  /CRITERIA=CLUSTER(2) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL ANOVA CLUSTER DISTAN."

syntax. is that right?

also: I have no experience about coding GTK+

best regards

From: John Darrington <address@hidden>
To: Mehmet Hakan Satman <address@hidden>
Cc: address@hidden
Sent: Thu, March 10, 2011 1:07:59 PM
Subject: Re: K-Means Clustering

Hello Mehmet,

I think the code is well written and is something that we could use in PSPP.

Like you say, perhaps it doesn't lend itself to a generalized clustering algorithm,
but I think it would nicely satisfy the needs of the QUICK CLUSTER command.

Would you like to try implementing such a command in PSPP ?   Ask on the mailing list
if you need help getting started.

Regards, 

John

On Wed, Mar 09, 2011 at 05:51:44AM -0800, Mehmet Hakan Satman wrote:
     Hi John,

     I am sending the .h, the .c and the test.c file as attachments. 

     My implementation clusters a random data set with number of cases 10000, the 
     number of parameters 10 and the number of clusters 10 in average 0.45 seconds. 

     Algorithm starts with the randomly
 selected cluster centers and in each single 
     steps these centers are changed. Distances between these centers and the 
     observations must be calculated in every step. I am not sure if it can be 
     implemented a "create-one-and-use-anywhere" distances in this algorithm. In 
     hierarchial cluster algorithms such as "single linkage clustering", it is 
     possible to build a distance matrix and use it during the clustering.

     Best Regards.

     Hakan.

     ________________________________
     From: John Darrington <address@hidden>
     To: Mehmet Hakan Satman <address@hidden>
     Cc: address@hidden
     Sent: Wed, March 9, 2011 3:21:59 PM
     Subject: Re: K-Means Clustering

     Hi Mehmet!

     As I mentioned, the CLUSTER command is soemthing which I think it would be
     great to support.

     One issue with clustering is its memory complexity. It requires O(n^2) where
     n is the number of cases being clustered.
     Have you tested your algorithm with large numbers of cases?

     Maybe Ben has some ideas how an efficient distance matrix can be  implemented
     in PSPP (maybe sparse-array.c can help?) .

     In any case, I'd be interested to see your code, and the results of your 
     comparisons.  Can you post them somewhere?

     J'

     On Tue, Mar 08, 2011 at 05:56:37AM -0800, Mehmet Hakan Satman wrote:
          hi everybody,

          I am interested in PSPP and i read about something about the needs for 
          developing some functionality.
          I implemented a k-means clustering  library using the GNU scientific 
     library and 

          sent an informative e-mail to John. He suggested me to join this group and 
     share 

          my ideas with the
 stuff. 

          I compared the results with SPSS outputs. The analysis of variance table is 
     not 

          completed but we may add this feature.

          I would be glad to integrate something to PSPP and work with you.

          What do you think about this?

     -- 
     PGP Public key ID: 1024D/2DE827B3 
     fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
     See http://pgp.mit.edu or any PGP keyserver for public
 key.

     /*
         kmeans - K-Means Clustering Library for the GNU PSPP (People Should Prefer PSPP) project.
         Copyright (C) 2011  Dr.Mehmet Hakan Satman <address@hidden>

         This program is free software: you can redistribute it and/or modify
         it under the terms of the GNU General Public License as published by
         the Free Software Foundation, either version 3 of the License, or
         (at your option) any later version.

         This program is distributed in the hope that it will be useful,
         but WITHOUT ANY
 WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
         GNU General Public License for more details.

         You should have received a copy of the GNU General Public License
         along with this program.  If not, see <http://www.gnu.org/licenses/>.
     */
     #include <stdio.h>
     #include <math.h>
     #include <gsl/gsl_matrix.h>
     #include <gsl/gsl_statistics.h>

     /*
     Struct KMeans:
     Holds all of the information for the functions.
     */
     struct Kmeans {

        gsl_matrix *data;           //User Data (Given by the user)
         gsl_matrix *centers;        //Centers for groups
         gsl_vector_int *index;      //Integer values from zero to ngroups. Shows group of an observation.
         gsl_vector *v1,*v2,*v3;     //Temporary vector for program. Do not use.
         int ngroups;                //Number of group. (Given by the user)
         int n;                      //Number of observations. (Given by the user)
         int m;                      //Number of observations. (Given by the user)
         int
 maxiter;                //Maximum number of iterations (Given by the user)
         int lastiter;               //Show at which iteration it found the solution.
         double *weights;            //Double values for handling weights for program use.
     };

     /*
     Creates a Kmeans structure for a given data 'data', number of observations 'n',
     number of columns 'm', number of groups 'ngroups' (it is usually called 'k') and
     number of maximum iterations 'maxiter'.
     */
     struct Kmeans* kmeans_create(double* data, int n, int m, int ngroups, int maxiter);

     /*
     Randomly chooses centers of 'ngroup'
 groups.
     These centers are initial and will be changed iteratively.
     */
     void kmeans_randomize_centers(struct Kmeans *kmeans);

     /*
     Prints the given Kmeans structure
     */
     void kmeans_print(struct Kmeans* kmeans);

     /*
     Calculates the squared euclidean distance between vector v1 and v2
     */
     double kmeans_euclidean_distance(gsl_vector *v1, gsl_vector *v2);

     /*
     Calculates and returns the number of elements contained in the
     specific group.
     */
     int kmeans_num_elements_group(struct Kmeans *kmeans, int group);

     /*
     Calculates
 group centers. Those centers are calculated using iteration
     and they are usually different from the initial centers. Recalculation
     process remains while the last two solutions are not equal.
     */
     void kmeans_recalculate_centers(struct Kmeans *kmeans);

     /*
     Constructs the index variable. This variable shows the current
     group of the each single observation.
     */
     void kmeans_calculate_indexes(struct Kmeans *kmeans);

     /*
     Checks if the last two index variables are equal. If they are equal,
     algorithm can not find a better classification anymore and stops.
     */
     int kmeans_check_converge(gsl_vector_int *current, gsl_vector_int *old);

     /*
     This is the main method of the algorithm.
     */
     void kmeans_cluster(struct Kmeans *kmeans);

     #include "kmeans.h"

     struct Kmeans*
     kmeans_create(double* data,
                   int n,
                   int m,
                   int ngroups,
                   int maxiter){
         int i,j;
         struct Kmeans *k = (struct Kmeans*) malloc(sizeof(struct Kmeans));
         k->data="">        
 k->centers=gsl_matrix_alloc(ngroups, m);
         k->ngroups=ngroups;
         k->index=gsl_vector_int_alloc(n);
         k->n=n;
         k->m=m;
         k->maxiter=maxiter;
         k->lastiter=0;
         for (i=0;i<n;i++){
             for(j=0;j<m;j++){
                 gsl_matrix_set(k->data, i, j, data[i * m +j]);
             }
         }
         k->weights = (double*)malloc(sizeof(double) * k->index->size);
         k->v1 = gsl_vector_alloc(k->centers->size2);
         k->v2 = gsl_vector_alloc(k->centers->size2);

      k->v3 = gsl_vector_alloc(k->n);
         return(k);
     }

     void
     kmeans_randomize_centers(struct Kmeans *kmeans){
         int i,j;
         double min=0,max=0;
         min=gsl_matrix_min(kmeans->data);
         max=gsl_matrix_max(kmeans->data);
         gsl_matrix_minmax(kmeans->data, &min, &max);
         for (i=0;i<kmeans->centers->size1;i++){
             for (j=0;j<kmeans->centers->size2; j++){
                 gsl_matrix_set(kmeans->centers, i, j, min + (((double)rand())/RAND_MAX)*(max-min));
             }
         }

 }

     void
     kmeans_print(struct Kmeans* kmeans){
         int i,j;
         printf("Number of groups: %d\n",kmeans->ngroups);
         printf("Centers:\n");
         for (i=0;i<kmeans->centers->size1;i++) {
             for (j=0;j<kmeans->centers->size2;j++){
                 printf("%f ",gsl_matrix_get(kmeans->centers, i,j));
             }
             printf("\n");
         }

         printf("Index:\n");
         for (i=0;i<kmeans->n;i++){
             printf("%d
 ",gsl_vector_int_get(kmeans->index, i));
         }
         printf("\nLast iter: %d\n",kmeans->lastiter);
     }

     void print_matrix(gsl_matrix *m){
         int i,j;
         for (i=0;i<m->size1;i++){
             for (j=0;j<m->size2;j++){
                 printf("%f ",m->data[i * m->size2 + j]);
             }
             printf("\n");
         }
     }

     double
     kmeans_euclidean_distance(gsl_vector *v1,
                               gsl_vector
 *v2){
         double result=0.0;
         double val;
         int i;
         for (i=0;i<v1->size;i++){
             val=v1->data[i] - v2->data[i];
             result+=val*val;
         }
         return(result);
     }

     int
     kmeans_num_elements_group(struct Kmeans *kmeans, int group){
         int total=0;
         int i;
         for (i=0;i<kmeans->n;i++){
             if(gsl_vector_int_get(kmeans->index,i)==group) total++;
         }
         return(total);

 }

     void
     kmeans_recalculate_centers(struct Kmeans *kmeans){
         int i,j,h;
         int elm;
         double mean;
         gsl_vector *v1=kmeans->v3;

         for (i=0;i<kmeans->ngroups;i++){
             elm=kmeans_num_elements_group(kmeans,i);
             for (j=0;j<kmeans->index->size;j++){
                 if(gsl_vector_int_get(kmeans->index,j)==i){
                     kmeans->weights[j]=1.0;
                 }else{

 kmeans->weights[j]=0.0;
                 }
             }

             for (h=0;h<kmeans->m;h++){
                 gsl_matrix_get_col(v1,kmeans->data, h);
                 mean=gsl_stats_wmean(kmeans->weights, 1, v1->data ,1, v1->size);
                 gsl_matrix_set(kmeans->centers, i,h, mean);
             }
         }
     }

     void
     kmeans_calculate_indexes(struct Kmeans *kmeans){
         double dist;
         double mindist;
         int bestindex=0;

      int i,j;
         gsl_vector *v1 = kmeans->v1;
         gsl_vector *v2 = kmeans->v2;

         for (i=0;i<kmeans->n; i++){
             mindist=INFINITY;
             gsl_matrix_get_row(v1, kmeans->data, i);
             for (j=0;j<kmeans->ngroups; j++){
                 gsl_matrix_get_row(v2, kmeans->centers,j);
                 dist=kmeans_euclidean_distance(v1,v2);
                 if(dist<mindist){
                     mindist=dist;
                     bestindex=j;

            }
             }
             gsl_vector_int_set(kmeans->index,i,bestindex);
         }
     }

     int
     kmeans_check_converge(gsl_vector_int *current,
                           gsl_vector_int *old){
         int i;
         int total=0;
         for (i=0;i<current->size;i++) {
             if(current->data[i] == old->data[i]) total++;
             old->data[i]=current->data[i];
         }
         return(current->size-total);
     }

     gsl_matrix*
     kmeans_getGroup(struct Kmeans *kmeans, int index){
             int i;
             int j=0;
             int elm=kmeans_num_elements_group(kmeans,index);
             gsl_matrix *agroup=gsl_matrix_alloc(elm, kmeans->data->size2);
             gsl_vector *v1=gsl_vector_alloc(kmeans->data->size2);
             for(i=0;i<kmeans->data->size1;i++){
                 if(kmeans->index->data[i]==index){
                     gsl_matrix_get_row(v1, kmeans->data, i);

 gsl_matrix_set_row(agroup, j, v1);
                     j++;
                 }
             }
             return(agroup);
     }

     void
     kmeans_cluster(struct Kmeans *kmeans){
         int i;
         gsl_vector_int *oldindex = gsl_vector_int_alloc(kmeans->index->size);
         kmeans_randomize_centers(kmeans);

         for (i=0;i<kmeans->maxiter;i++){
             kmeans->lastiter=i;
             kmeans_calculate_indexes(kmeans);

 kmeans_recalculate_centers(kmeans);
             if(kmeans_check_converge(kmeans->index, oldindex)==0) break;
         }

     }

     #include "kmeans.h"

     int
     main()
     {
         srand(12345);
         int i,j;
         double val;
         int n=10000; //Number of cases
         int m=10;    //Number of variables
         int k=10;    //Number of groups
         int maxiter=100;
         double * data = "" malloc(n* m * sizeof(double));
         for (i=0;i<n;i++){

 for (j=0;j<m;j++){
                 val= ((float)rand())/(float)RAND_MAX;
                 data[i * m + j]=val*(i+j);
             }
         }

         struct Kmeans *km = kmeans_create(data, n, m ,k, maxiter);
         kmeans_cluster(km);
         //kmeans_print(km);

         return 0;
     }

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

From:	Mehmet Hakan Satman
Subject:	Re: K-Means Clustering
Date:	Thu, 10 Mar 2011 03:35:33 -0800 (PST)