The Algorithm
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This algorithm has nothing to do with and should not be confused with k-nearest neighbor, another popular machine learning technique.
The procedure provides a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed beforehand. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a clever way, because different location causes different result. The best choice is to place them as much as possible far away from each other. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is unresolved, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as centers of mass of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. This generates a loop and as a result of this loop, we may notice that the k centroids change their location step by step until no more changes are done. In other words centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids. 4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated. |
Although we can be prove that the procedure will always terminate, the k-means algorithm does not necessarily find the most optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centers. The k-means algorithm can be run multiple times to reduce this effect.
K-means Algorithm using R
We will use R to implement the k-means algorithm for cluster analysis or the DavisThin data set. The DavisThin data frame has 191 rows and 7 columns and is included with the car
-package. This is part of a larger dataset for a study of eating disorders. The seven variables in the data frame comprise a “drive for thinness” scale, to be formed by summing the items.
Data Preparation
Data that is analyzed by the k-means algorithm in R must be prepared as follows:
- The data must not have any missing values—remove the records or estimate them
- The data must be scaled for comparison
# Prepare Data
> require(car)
> mydata
> mydata # listwise deletion of missing
> mydata # standardize variables
Determining the Number of Clusters
A plot of the within groups sum of squares by number of clusters extracted can help determine the appropriate number of clusters. We can look for a bend in the plot similar to a screen test in factor analysis.
> # Determine number of clusters
> wss
> for (i in 2:15) wss[i]
> plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
Based on the plot, we should try our cluster analysis with 5 or 6 clusters (centroids).
Fitting a Model
The stats
package contains the kmeans
function that we use here. It perform k-means clustering on a data matrix. We will fit two models, one with 5 clusters and the other 6.
> # K-Means Clustering with 5 clusters
> fit1
> # get cluster means
> aggregate(mydata,by=list(fit1$cluster),FUN=mean)
Group.1 DT1 DT2 DT3 DT4 DT5 DT6 DT7 fit1 fit2
1 1 -0.374 -0.72 -0.659 -0.39 -0.70 -0.69 -0.501 1 6.0
2 2 0.398 0.89 1.408 -0.32 1.28 1.35 1.665 2 2.1
3 3 0.058 0.61 -0.035 -0.32 0.31 0.12 -0.320 3 1.8
4 4 1.646 1.30 1.631 2.73 1.37 1.48 1.800 4 4.0
5 5 0.454 1.18 1.078 1.87 0.97 1.17 -0.099 5 3.3
> # append cluster assignment
> mydata1
> # K-Means Clustering with 6 clusters
> fit2
> # get cluster means
> aggregate(mydata,by=list(fit2$cluster),FUN=mean)
Group.1 DT1 DT2 DT3 DT4 DT5 DT6 DT7 fit1 fit2
1 1 -0.47 0.71 -0.08 -0.29 0.24 0.26 -0.30 3.0 1
2 2 0.29 0.96 1.40 -0.32 1.37 1.33 1.64 2.0 2
3 3 0.10 1.19 1.12 2.06 1.02 1.16 -0.02 5.0 3
4 4 1.65 1.30 1.63 2.73 1.37 1.48 1.80 4.0 4
5 5 2.33 0.11 0.23 -0.22 0.23 -0.01 -0.22 2.9 5
6 6 -0.43 -0.72 -0.66 -0.39 -0.69 -0.69 -0.50 1.0 6
> # append cluster assignment
> mydata2
Alternative
A robust version of K-means based on mediods can be invoked by using pam( )
instead of kmeans( )
. The function pamk( )
in the fpc
package is a wrapper for pam that also prints the suggested number of clusters based on optimum average silhouette width. We will not implement this method here.
Plotting the Results
It is always a good idea to look at the cluster results. These can be used to assess the choice of number of clusters as well as comparing two different cluster analyses. Here we use the clustplot
function from the cluster
package, and plot both models.
> require(cluster)
> clusplot(mydata1, fit1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
> clusplot(mydata2, fit2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
The plotcluster
function from the fpc
package provides plots to distinguish given classes by ten available projection methods. The one-dimensional data from mydata1
and mydata2
is plotted against the cluster number.
> # Centroid Plot against 1st 2 discriminant functions
> require(fpc)
Attaching package: ‘fpc’
The following object is masked from ‘package:DAAG’:
confusion
> plotcluster(mydata1, fit1$cluster)
> plotcluster(mydata2, fit2$cluster)
Conclusion
Based on our plots, it appears that model fit1
best describes the clustering of the DavisThin data.
References
- J. B. MacQueen (1967): “Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability”, Berkeley, University of California Press, 1:281-297.
Authored by:
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the Author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:
- Operations Research using Open-Source Tools
- Discrete Event simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices
Connect with Jeffrey Strickland
Contact Jeffrey Strickland
Categories: Articles, Education & Training, Featured, Jeffrey Strickland