
K-Means Clustering using R

The Algorithm

K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This algorithm should not be confused with k-nearest neighbors, another popular machine learning technique.

The procedure provides a simple way to classify a given data set through a certain number of clusters (assume k clusters) fixed beforehand. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations produce different results. The best choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is left unassigned, the first step is complete and an early grouping is done. At this point we re-calculate k new centroids as the centers of mass of the clusters resulting from the previous step. After we have these k new centroids, a new binding is done between the same data set points and the nearest new centroid. This generates a loop, and as a result the k centroids change their location step by step until no more changes occur; in other words, the centroids no longer move.

Finally, this algorithm aims at minimizing an objective function, in this case a squared-error function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster center $c_j$; the function measures how far the n data points lie from their respective cluster centers.

The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.

2. Assign each object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
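
For intuition, here is a minimal R sketch of these four steps. It is illustrative only, and simple_kmeans is a hypothetical helper, not the implementation used later; it assumes k >= 2 and that no cluster empties out during the iteration, cases that the kmeans() function in the stats package handles properly.

# Illustrative sketch of the k-means loop (not the implementation used below)
simple_kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  # Step 1: choose k observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max.iter)) {
    # Step 2: assign each object to its nearest centroid
    d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    cluster <- max.col(-d2)
    # Step 3: recalculate the centroids as cluster means
    # (assumes every cluster keeps at least one member)
    new.centers <- t(sapply(seq_len(k),
                            function(j) colMeans(x[cluster == j, , drop = FALSE])))
    # Step 4: stop when the centroids no longer move
    if (all(abs(new.centers - centers) < 1e-8)) break
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}

A call such as simple_kmeans(mydata, 5) would mimic the procedure on the prepared data, though in what follows we rely on kmeans().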

Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial, randomly selected cluster centers. The k-means algorithm can be run multiple times to reduce this effect.
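
In practice, the kmeans() function in the stats package automates these restarts through its nstart argument; for example (mydata here stands for the prepared data used later in this article):

> fit <- kmeans(mydata, centers=5, nstart=25) # keep the best of 25 random starts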

K-means Algorithm using R

We will use R to implement the k-means algorithm for cluster analysis of the DavisThin data set. The DavisThin data frame has 191 rows and 7 columns and is included with the car package. It is part of a larger dataset for a study of eating disorders. The seven variables in the data frame comprise a "drive for thinness" scale, to be formed by summing the items.
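
Before preparing the data, it is worth a quick look at its structure (a brief sketch; depending on your version of car, DavisThin may ship in the companion carData package, which is attached together with car):

> require(car)
> dim(DavisThin)   # 191 rows, 7 columns
> head(DavisThin)  # items DT1 through DT7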

Data Preparation

Data that is analyzed by the k-means algorithm in R must be prepared as follows:

  1. The data must not have any missing values: remove the incomplete records or estimate the missing entries
  2. The data must be scaled so that the variables are comparable

> # Prepare Data
> require(car)
> mydata <- DavisThin
> mydata <- na.omit(mydata) # listwise deletion of missing
> mydata <- scale(mydata)   # standardize variables

Determining the Number of Clusters

A plot of the within-groups sum of squares by number of clusters extracted can help determine the appropriate number of clusters. We look for a bend in the plot, similar to a scree test in factor analysis.

> # Determine number of clusters
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
> plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

[Figure: within-groups sum of squares plotted against the number of clusters]

Based on the plot, we should try our cluster analysis with 5 or 6 clusters (centroids).

Fitting a Model

The stats package contains the kmeans function that we use here. It performs k-means clustering on a data matrix. We will fit two models, one with 5 clusters and the other with 6.

> # K-Means Clustering with 5 clusters
> fit1 <- kmeans(mydata, 5)
> # get cluster means
> aggregate(mydata,by=list(fit1$cluster),FUN=mean)

Group.1    DT1   DT2    DT3   DT4   DT5   DT6    DT7  fit1   fit2
1     1 -0.374 -0.72 -0.659 -0.39 -0.70 -0.69 -0.501     1    6.0
2     2  0.398  0.89  1.408 -0.32  1.28  1.35  1.665     2    2.1
3     3  0.058  0.61 -0.035 -0.32  0.31  0.12 -0.320     3    1.8
4     4  1.646  1.30  1.631  2.73  1.37  1.48  1.800     4    4.0
5     5  0.454  1.18  1.078  1.87  0.97  1.17 -0.099     5    3.3
> # append cluster assignment
> mydata1 <- data.frame(mydata, fit1$cluster)

> # K-Means Clustering with 6 clusters
> fit2 <- kmeans(mydata, 6)
> # get cluster means
> aggregate(mydata,by=list(fit2$cluster),FUN=mean)

Group.1   DT1   DT2   DT3   DT4   DT5   DT6   DT7 fit1  fit2
1     1 -0.47  0.71 -0.08 -0.29  0.24  0.26 -0.30  3.0     1
2     2  0.29  0.96  1.40 -0.32  1.37  1.33  1.64  2.0     2
3     3  0.10  1.19  1.12  2.06  1.02  1.16 -0.02  5.0     3
4     4  1.65  1.30  1.63  2.73  1.37  1.48  1.80  4.0     4
5     5  2.33  0.11  0.23 -0.22  0.23 -0.01 -0.22  2.9     5
6     6 -0.43 -0.72 -0.66 -0.39 -0.69 -0.69 -0.50  1.0     6
> # append cluster assignment
> mydata2 <- data.frame(mydata, fit2$cluster)

Alternative

A robust version of k-means based on medoids can be invoked by using pam() instead of kmeans(). The function pamk() in the fpc package is a wrapper for pam that also prints the suggested number of clusters based on optimum average silhouette width. We will not implement this method here.
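
For reference, a minimal sketch of what the calls would look like (not run in this article; pam() lives in the cluster package and pamk() in fpc):

> require(cluster)
> require(fpc)
> fit.pam <- pam(mydata, 5)  # k-medoids with k=5
> fit.pamk <- pamk(mydata)   # estimates k by average silhouette width
> fit.pamk$nc                # suggested number of clusters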

Plotting the Results

It is always a good idea to look at the cluster results. These can be used to assess the choice of the number of clusters as well as to compare two different cluster analyses. Here we use the clusplot function from the cluster package and plot both models.

> require(cluster)
> clusplot(mydata1, fit1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
> clusplot(mydata2, fit2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

[Figure: clusplot of mydata1, 5-cluster model]

[Figure: clusplot of mydata2, 6-cluster model]

The plotcluster function from the fpc package provides plots to distinguish given classes by ten available projection methods. The data from mydata1 and mydata2 are projected onto discriminant coordinates, with each point labeled by its cluster number.

> # Centroid Plot against 1st 2 discriminant functions
> require(fpc)

> plotcluster(mydata1, fit1$cluster)
> plotcluster(mydata2, fit2$cluster)

[Figure: plotcluster of mydata1, 5-cluster model]

[Figure: plotcluster of mydata2, 6-cluster model]

Conclusion

Based on our plots, it appears that model fit1 best describes the clustering of the DavisThin data.
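
One way to back the visual impression with numbers (an additional check, not part of the original analysis) is to compare the fit statistics that kmeans() returns:

> fit1$tot.withinss           # total within-cluster sum of squares, k=5
> fit2$tot.withinss           # the same for k=6
> fit1$betweenss / fit1$totss # proportion of variance explained, k=5
> fit2$betweenss / fit2$totss # the same for k=6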

References

  • MacQueen, J. B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297. Berkeley: University of California Press.


Authored by:
Jeffrey Strickland, Ph.D.

Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books including:

  • Operations Research using Open-Source Tools
  • Discrete Event simulation using ExtendSim
  • Crime Analysis and Mapping
  • Missile Flight Simulation
  • Mathematical Modeling of Warfare and Combat Phenomenon
  • Predictive Modeling and Analytics
  • Using Math to Defeat the Enemy
  • Verification and Validation for Modeling and Simulation
  • Simulation Conceptual Modeling
  • System Engineering Process and Practices


