
k-means Partition Clustering

The simplest method of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups. Given the number of clusters k, a partitioning algorithm organizes the objects into k partitions, where each partition represents a cluster.

The best-known and most commonly used partitioning methods are k-means and k-medoids. An objective function is used to assess the partitioning quality, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. A centroid-based partitioning technique uses the centroid of a cluster Ci to represent that cluster. Conceptually, the centroid of a cluster is its center point.

The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci. Summed over all k clusters, it is defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2$$

where

E is the sum of the squared error for all objects in the data set

p is the point in space representing a given object

ci is the centroid of cluster Ci

dist(p, ci) is the Euclidean distance between p and ci
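The sum of squared error E described above can be computed directly once the clusters and their centroids are known. The following is a minimal sketch, assuming points are tuples of numbers and the error is the squared Euclidean distance; the function name `sum_squared_error` and the sample data are hypothetical and chosen only for illustration.

```python
def sum_squared_error(clusters, centroids):
    """Return E: the sum, over every cluster, of the squared Euclidean
    distance between each point p and the centroid of its cluster."""
    total = 0.0
    for points, centroid in zip(clusters, centroids):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, centroid))
    return total


# Hypothetical data: two clusters in 2-D, each listed with its centroid.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(8.0, 8.0), (8.0, 10.0)]]
centroids = [(2.0, 1.0), (8.0, 9.0)]
print(sum_squared_error(clusters, centroids))  # 4.0
```

Each of the four points lies at distance 1 from its centroid, so every point contributes 1 to E.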

Minimizing the within-cluster variation is computationally challenging. The problem is NP-hard (non-deterministic polynomial-time hard) for a general number of clusters k, even in 2-D Euclidean space.

If the number of clusters k and the dimensionality of the space d are fixed, the problem can be solved in time O(n^(dk+1) log n), where n is the number of objects.

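The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. It starts from k initial centroids, assigns each object to the cluster whose centroid is nearest, and then recomputes each centroid as the mean of its assigned objects, repeating until the assignments no longer change. This process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.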
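Iterative relocation can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation. It makes several assumptions that are not stated in the text: points are tuples of numbers, squared Euclidean distance is used, the initial centroids are k randomly sampled objects unless `initial_centroids` is passed, and an empty cluster keeps its previous centroid. All function and parameter names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Return (centroids, assignment) after iterative relocation."""
    if initial_centroids is None:
        # Assumption: pick k distinct objects at random as starting centroids.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)

    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each object goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the partitioning is stable
        assignment = new_assignment

        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = mean_point(members)
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(
    points, k=2, initial_centroids=[(0.0, 0.0), (10.0, 10.0)]
)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial centroids, which is why the usage example fixes them explicitly. With random starting points, different runs can end in different partitions.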

The time complexity of the k-means algorithm is O(nkt),

where

n is the total number of objects

k is the number of clusters

t is the number of iterations
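To get a feel for the O(nkt) bound, here is a rough illustration with made-up sizes (the numbers are hypothetical and do not come from any particular data set). With n = 10,000 objects, k = 5 clusters, and t = 20 iterations, the number of object-to-centroid distance computations is on the order of

$$n \cdot k \cdot t = 10{,}000 \times 5 \times 20 = 1{,}000{,}000$$

so the cost grows linearly in each of n, k, and t.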

The k-means method can be applied only when the mean of a set of objects is defined.

The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes.
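The idea of replacing means with modes can be sketched as follows. This is a minimal illustration, assuming each object is a tuple of nominal values and that dissimilarity is the number of mismatching attributes. The function names and sample data are hypothetical, and the mismatch count is an assumption of this sketch that the text above does not specify.

```python
from collections import Counter


def mode_of_cluster(objects):
    """Attribute-wise mode: the most frequent value in each column."""
    return tuple(
        Counter(column).most_common(1)[0][0] for column in zip(*objects)
    )


def mismatch_count(a, b):
    """Number of attributes on which two nominal objects differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical cluster of objects described by (color, size).
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
cluster_mode = mode_of_cluster(cluster)
print(cluster_mode)                                   # ('red', 'small')
print(mismatch_count(("blue", "large"), cluster_mode))  # 2
```

The mode plays the role that the mean plays in k-means: it is the representative against which each object is compared when objects are reassigned.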
