The simplest method of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups. A partitioning algorithm divides the objects into k partitions, where each partition represents a cluster.
The best-known and most commonly used partitioning methods are k-means and k-medoids. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point.
The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2

where
E is the sum of the squared error for all objects in the data set,
p is the point in space representing a given object, and
ci is the centroid of cluster Ci.
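As a rough illustration, the within-cluster variation E could be computed as in the sketch below, assuming numerical data held in NumPy arrays and Euclidean distance; the function and variable names are illustrative, not part of any standard library.

```python
import numpy as np

def within_cluster_variation(points, labels, centroids):
    """Sum of squared Euclidean distances from each object to its cluster centroid."""
    total = 0.0
    for i, centroid in enumerate(centroids):
        members = points[labels == i]   # objects assigned to cluster Ci
        diffs = members - centroid      # p - ci for every p in Ci
        total += np.sum(diffs ** 2)     # accumulate the squared error
    return total

# Example: four 2-D objects grouped into k = 2 clusters
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == i].mean(axis=0) for i in range(2)])
print(within_cluster_variation(points, labels, centroids))
```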
Optimizing the within-cluster variation is computationally challenging. The problem is NP-hard (non-deterministic polynomial-time hard) for a general number of clusters k, even in 2-D Euclidean space.
If the number of clusters k and the dimensionality of the space d are fixed, the problem can be solved in time O(n^{dk+1} log n), where n is the number of objects.
The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.
The time complexity of the k-means algorithm is O(nkt), where
n is the total number of objects,
k is the number of clusters, and
t is the number of iterations.
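The iterative-relocation loop of k-means can be sketched as follows, assuming Euclidean distance, NumPy arrays, and initial centroids sampled from the data; this is an illustrative sketch rather than a full implementation (for example, empty clusters are not handled).

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct objects from the data set.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):  # at most t iterations
        # Assignment step: each object goes to its nearest centroid (O(nk) distance computations).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of the objects assigned to it.
        # (Empty clusters are not handled in this sketch.)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when relocation no longer changes anything
            break
        centroids = new_centroids
    return labels, centroids
```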
The k-means method can be applied only when the mean of a set of objects is defined.
The k-modes method is a variant of k-means that extends the paradigm to cluster nominal data by replacing the means of clusters with modes.
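To make the idea concrete on nominal data, the sketch below shows a mode-based cluster center together with the simple matching dissimilarity (number of mismatched attributes) it is typically paired with; the helper names are hypothetical.

```python
from collections import Counter

def mode_center(cluster_rows):
    """Mode of each nominal attribute across the objects of one cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster_rows))

def matching_dissimilarity(row, center):
    """Number of attributes on which the object and the cluster mode disagree."""
    return sum(a != b for a, b in zip(row, center))

cluster = [("red", "small", "round"),
           ("red", "large", "round"),
           ("blue", "small", "round")]
center = mode_center(cluster)  # ('red', 'small', 'round')
print(matching_dissimilarity(("blue", "large", "round"), center))  # 2
```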