
Clustering Evaluation

The major tasks of clustering evaluation are as follows.

Clustering tendency assessment. This assessment checks whether a nonrandom structure exists in the given data set. The Hopkins statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. The Hopkins statistic value is about 0.5 if the data set is uniformly distributed, and close to 0 if the data set is highly skewed. A uniformly distributed data set contains no meaningful clusters; that is, if the Hopkins statistic value is greater than 0.5, it is unlikely that the data set has significant clusters.

Number of clusters determination. The number of clusters can be regarded as an important summary statistic of a data set, and it is desirable to estimate it even before a clustering algorithm is used to derive detailed clusters. The appropriate number of clusters controls the proper granularity of cluster analysis. The right num...
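The Hopkins statistic described above can be sketched as follows. This is a minimal illustration that follows the convention used in this post (H ≈ 0.5 for uniform data, H close to 0 for highly skewed data); some other references define the ratio the other way around, so that clustered data gives H near 1. The function name `hopkins` and its parameters are illustrative, not a standard API.

```python
import numpy as np

def hopkins(D, m=None, seed=0):
    """Hopkins statistic sketch: H ~ 0.5 for uniform data,
    H -> 0 for skewed (clustered) data, per the convention above."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = m or max(1, n // 10)            # number of sample points

    # x: nearest-neighbor distances of m points sampled from D (self excluded)
    idx = rng.choice(n, size=m, replace=False)
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    x = dists[idx].min(axis=1)

    # y: distances from m uniform random points (in the bounding box of D)
    # to their nearest neighbor in D
    U = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    y = np.linalg.norm(U[:, None, :] - D[None, :, :], axis=2).min(axis=1)

    # clustered data: x is tiny relative to y, so H is close to 0
    return x.sum() / (x.sum() + y.sum())
```

For tightly clustered data the within-data nearest-neighbor distances (x) shrink while the random-point distances (y) stay large, driving H toward 0; for uniform data the two sums are comparable and H stays near 0.5.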

Grid-based Clustering

Grid-based clustering is a space-driven approach: the grid data structure is formed by quantizing the object space into a finite number of cells, and clustering is performed on this grid structure. Because the approach depends only on the number of cells in each dimension of the quantized space, not on the number of data objects, its processing time is fast. Two representative grid-based methods are STING (Statistical Information Grid) and CLIQUE (Clustering In Quest). STING is a grid-based multiresolution clustering technique in which the embedding spatial area of the input objects is divided into rectangular cells. Each cell at a high level is partitioned into a number of cells at the next lower level. Statistical parameters such as the mean, maximum, and minimum values are computed and stored for query processing and other data analysis tasks. STING approaches the clustering result of DBSCAN as the granularity approaches 0. STING offe...
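The core grid-based idea (quantize the space into cells, then cluster the cells rather than the objects) can be sketched with a toy 2-D example: keep only cells holding at least a threshold number of points, and merge adjacent dense cells into clusters. This is a simplified illustration in the spirit of CLIQUE, not an implementation of STING or CLIQUE; the function name `grid_cluster` and its parameters are made up for illustration.

```python
import numpy as np
from collections import deque

def grid_cluster(X, cell_size, density_threshold):
    """Toy 2-D grid-based clustering: quantize space into square cells,
    keep dense cells, and flood-fill over adjacent dense cells."""
    # quantize: map each point to its cell coordinates
    cells = {}
    for idx, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(idx)

    # keep only dense cells
    dense = {k: v for k, v in cells.items() if len(v) >= density_threshold}

    labels = np.full(len(X), -1)        # -1 = point in a sparse cell
    cid = 0
    unvisited = set(dense)
    while unvisited:
        # flood-fill over face/diagonal-adjacent dense cells
        queue = deque([unvisited.pop()])
        while queue:
            cell = queue.popleft()
            for i in dense[cell]:
                labels[i] = cid
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cell[0] + dx, cell[1] + dy)
                    if nb in unvisited:
                        unvisited.remove(nb)
                        queue.append(nb)
        cid += 1
    return labels
```

Note that the work after quantization is proportional to the number of dense cells, not the number of objects, which is exactly the efficiency argument made above.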

Density-based Clustering

Partitioning and hierarchical methods are designed to find spherical-shaped clusters and often fail to find clusters of arbitrary shape. Such clusters can be found as dense regions separated by sparse regions in the data space. The main density-based methods are DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), and DENCLUE (Clustering Based on Density Distribution Functions). DBSCAN finds core objects, that is, objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters. A user-specified parameter ε > 0 specifies the radius of the neighborhood considered for every object, and a second parameter, MinPts, specifies the density threshold of dense regions. An object is a core object if the ε-neighborhood of the object contains at least MinPts objects. All core objects can be identified with respect t...
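The DBSCAN logic above (find core objects via their ε-neighborhoods, then grow clusters by connecting core objects to their neighbors) can be sketched compactly. This is a minimal brute-force version for illustration; a real implementation would use a spatial index for the neighborhood queries.

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns a label per point; -1 marks noise."""
    n = len(X)
    # brute-force epsilon-neighborhoods (each point's neighborhood includes itself)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in d]
    # core object: epsilon-neighborhood contains at least MinPts objects
    core = [len(nb) >= min_pts for nb in neighbors]

    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from this unvisited core object
        labels[i] = cid
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid
                if core[j]:            # only core objects extend the cluster
                    queue.extend(neighbors[j])
        cid += 1
    return labels
```

Points that are reachable from a core object but are not core themselves become border points of the cluster; points reachable from no core object remain labeled -1 (noise).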

Hierarchical Clustering

The method of grouping data objects into a hierarchy of clusters is called hierarchical clustering. It is useful for data summarization and data visualization. Hierarchical clustering can be done using the agglomerative or the divisive method. The agglomerative method starts with the individual objects as clusters, which are iteratively merged to form larger clusters. The divisive method initially lets all the given objects form one cluster, which is iteratively split into smaller clusters. Hierarchical clustering methods can encounter difficulties because they neither undo what was done previously nor swap objects between clusters; this may lead to low-quality clusters if merge or split decisions are not well chosen. Hierarchical clustering methods can be categorized into algorithmic methods, probabilistic methods, and Bayesian methods. The agglomerative and divisive methods are algorithmic. An agglomerative method requires at most n iterations. A tree structur...
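The agglomerative process described above can be sketched with single-linkage merging: start with every object as its own cluster and repeatedly merge the two closest clusters until k clusters remain. The brute-force distance search is O(n³) overall and only for illustration; `agglomerative` is a hypothetical helper name, not a library API.

```python
import numpy as np

def agglomerative(X, k):
    """Single-linkage agglomerative clustering sketch: merge the two
    closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]          # each object is a cluster
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > k:
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between closest members
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best_d:
                    best_d, best = dist, (a, b)
        a, b = best
        clusters[a] += clusters.pop(b)               # merge, never undone
    return clusters
```

Note that each merge is final, which illustrates the limitation mentioned above: the method never undoes a previous merge or moves an object between clusters.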

k-means Partition Clustering

The simplest method of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups. A partitioning algorithm organizes the objects into k partitions, where each partition represents a cluster. The well-known and commonly used partitioning methods are k-means and k-medoids. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster; conceptually, the centroid of a cluster is its center point. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci, defined as

E = Σ(i = 1..k) Σ(p ∈ Ci) dist(p, ci)²

where E is the sum of the squared error for all objects in the data set, p is the point in space representing a given object, and ci is the centroid of...
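Lloyd's algorithm for k-means, which locally minimizes the within-cluster variation E defined above, can be sketched as follows. This minimal version assumes no cluster goes empty during the iterations; the function name `kmeans` and its parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """k-means sketch: alternate assignment and centroid update until
    the centroids stop moving. Returns labels, centroids, and E (SSE)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # seed from the data
    for _ in range(iters):
        # assignment step: each object goes to its nearest centroid
        labels = np.linalg.norm(
            X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        # (assumes every cluster keeps at least one member)
        new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # within-cluster variation E: sum of squared errors to the centroids
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse
```

Each iteration can only decrease E (or leave it unchanged), which is why the alternation converges, although only to a local minimum that depends on the initial centroids.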