k-means Partition Clustering

The simplest method of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups. A partitioning algorithm organizes the objects into k partitions, where each partition represents a cluster.

The best-known and most commonly used partitioning methods are k-means and k-medoids. An objective function is used to assess the partitioning quality, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. A centroid-based partitioning technique uses the centroid of a cluster C_i, written c_i, to represent that cluster. Conceptually, the centroid of a cluster is its center point.

The quality of cluster C_i can be measured by the within-cluster variation, which is the sum of squared errors between all objects in C_i and the centroid c_i. Summed over all k clusters, it is defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \operatorname{dist}(p, c_i)^2$$

where

E is the sum of the squared error for all objects in the data set

p is the point in space representing a given object

c_i is the centroid of cluster C_i

dist(p, c_i) is the Euclidean distance between p and c_i
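The within-cluster variation described above can be computed directly. The following is a minimal sketch, assuming Euclidean distance and taking each centroid to be the mean of its cluster's points; the helper names `centroid` and `within_cluster_variation` and the sample points are illustrative only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def centroid(points):
    """Assumption: the centroid is the coordinate-wise mean of a non-empty cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def within_cluster_variation(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(p, c) ** 2 for p in points)
    return total

# Two small illustrative clusters in 2-D space
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],  # centroid is (1, 2)
    [(5.0, 5.0), (7.0, 5.0)],  # centroid is (6, 5)
]
E = within_cluster_variation(clusters)
print(E)
```

Every point here lies at distance 1 from its centroid, so each of the four points contributes 1 to E. A lower E indicates more compact clusters.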

Optimizing the within-cluster variation is computationally challenging. The problem is NP-hard (non-deterministic polynomial-time hardness) for a general number of clusters k even in the 2-D Euclidean space.

If the number of clusters k and the dimensionality of the space d are fixed, the problem can be solved in time O(n^{dk+1} log n), where n is the number of objects.

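The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. Starting from k initial centroids, it assigns each object to the cluster whose centroid is nearest and then recomputes each centroid from its current members. This process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.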
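Iterative relocation can be sketched in a few lines. The version below is only an illustration under stated assumptions: initial centroids are k objects picked at random, distance is Euclidean, an empty cluster keeps its previous centroid, and the loop stops when no object changes cluster. The names `k_means`, `mean`, `max_iter`, and `seed` are hypothetical.

```python
import random
from math import dist

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen objects as centroids
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Reassign each object to the cluster with the nearest centroid
        new_assignment = [
            min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved, so the partitioning is stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = mean(members)
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Each pass computes one distance per object per centroid. The result depends on the initial centroids in general, so it is common to run the algorithm several times with different seeds and keep the partitioning with the lowest within-cluster variation.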

The time complexity of the k-means algorithm is O(nkt), where

n is the total number of objects

k is the number of clusters

t is the number of iterations

The k-means method can be applied only when the mean of a set of objects is defined.

The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes.
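To illustrate the idea of replacing means with modes, here is a minimal sketch that computes the mode of a cluster of nominal objects attribute by attribute. It assumes a simple matching dissimilarity (the count of attributes on which two objects differ) in place of Euclidean distance; the function names and sample data are hypothetical.

```python
from collections import Counter

def cluster_mode(objects):
    """Most frequent value of each attribute across a non-empty cluster."""
    return tuple(
        Counter(column).most_common(1)[0][0] for column in zip(*objects)
    )

def mismatches(a, b):
    """Assumed dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

# Illustrative nominal objects described by (color, size)
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = cluster_mode(cluster)
d = mismatches(("blue", "large"), mode)
print(mode, d)
```

The mode plays the role that the mean plays in k-means: objects are assigned to the cluster whose mode they mismatch least, and modes are then recomputed from the new members.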

The Data Ilm
