

Showing posts from August, 2021

Clustering Evaluation

The major tasks of clustering evaluation are as follows.

Clustering tendency assessment: an assessment carried out to check whether a nonrandom structure exists in the given data set. The Hopkins statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. The Hopkins statistic would be about 0.5 if the data set were uniformly distributed, and close to 0 if the data set were highly skewed. A uniformly distributed data set contains no meaningful clusters; that is, if the Hopkins statistic is greater than 0.5, it is unlikely that the data set has significant clusters.

Number of clusters determination: the number of clusters can be regarded as an important summary statistic of a data set. It is desirable to estimate the number of clusters even before a clustering algorithm is used to derive detailed clusters. The appropriate number of clusters controls the proper granularity of cluster analysis. The right num...
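A minimal sketch of the Hopkins statistic under the convention used above (about 0.5 for uniform data, near 0 for clustered data); the helper name, the 10% sample size, and the sampling scheme are illustrative assumptions, not a standard API:

```python
import numpy as np

def hopkins(X, m=None, seed=None):
    """Hopkins statistic, following the convention above: ~0.5 for
    uniformly distributed data, near 0 for clustered (skewed) data.
    (Hypothetical helper; the 10% sample size is an assumption.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)

    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))         # m uniform random points
    S = X[rng.choice(n, size=m, replace=False)]  # m sampled data points

    def nearest(P, exclude_self):
        dists = np.linalg.norm(P[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            dists[dists == 0] = np.inf           # skip a point's own distance
        return dists.min(axis=1)

    u = nearest(U, exclude_self=False)  # random point -> nearest data point
    w = nearest(S, exclude_self=True)   # data point -> nearest other data point
    return w.sum() / (u.sum() + w.sum())
```

For uniformly scattered points, e.g. hopkins(np.random.rand(500, 2)), the value should come out near 0.5.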

Grid based clustering

Grid-based clustering is a space-driven approach: a grid data structure is formed by quantizing the object space into a finite number of cells, and clustering is then performed on that grid structure. Because this approach is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space, its processing time is fast. Two methods in the grid-based approach are STING (Statistical Information Grid) and CLIQUE (Clustering in Quest). STING is a grid-based multiresolution clustering technique in which the embedding spatial area of the input objects is divided into rectangular cells. Each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical parameters such as mean, maximum, and minimum values are computed and stored for query processing and other data analysis tasks. STING approaches the clustering result of DBSCAN as the granularity approaches 0. STING offe...
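To make the quantization idea concrete, here is a toy sketch that bins points into grid cells and keeps only the dense cells; it illustrates the general grid idea, not STING or CLIQUE themselves (the function and parameter names are hypothetical):

```python
import numpy as np
from collections import Counter

def dense_cells(X, cell_size=1.0, min_points=3):
    """Quantize points into grid cells; return the cells that are dense.
    cell_size and min_points are illustrative parameters."""
    cells = [tuple(c) for c in np.floor(X / cell_size).astype(int)]
    counts = Counter(cells)
    return {cell for cell, n in counts.items() if n >= min_points}

X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [5.0, 5.0]])
print(dense_cells(X))   # {(0, 0)}
```

Adjacent dense cells could then be merged into clusters; the merging works on cells rather than points, which is where the speed of grid methods comes from.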

Density based Clustering

Partitioning and hierarchical methods are designed to find spherical-shaped clusters and often fail to find clusters of arbitrary shape. Clusters of arbitrary shape can be found as dense regions separated by sparse regions in the data space. The density-based methods include DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), and DENCLUE (Clustering Based on Density Distribution Functions). The DBSCAN method finds core objects, that is, objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters. A user-specified parameter ε > 0 specifies the radius of the neighborhood considered for every object, and a second parameter, MinPts, specifies the density threshold of dense regions. An object is a core object if the ε-neighborhood of the object contains at least MinPts objects. All core objects can be identified with respect t...
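A minimal sketch using scikit-learn's DBSCAN implementation (assuming scikit-learn is installed; eps plays the role of ε and min_samples the role of MinPts, and the data is illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (50, 2)),
    rng.normal(5, 0.3, (50, 2)),
    rng.uniform(-2, 7, (5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # eps = ε, min_samples = MinPts
print(set(db.labels_))                      # cluster ids; -1 marks noise
```

Points labelled -1 are noise: they fall in no dense region.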

Hierarchy Clustering

The method of grouping data objects into a hierarchy of clusters is called hierarchical clustering. It is useful for data summarization and visualization. Hierarchical clustering can be done using the agglomerative or the divisive method. The agglomerative method starts with individual objects as clusters, which are iteratively merged to form larger clusters. The divisive method initially lets all the given objects form one cluster, which it iteratively splits into smaller clusters. Hierarchical clustering methods can encounter difficulties because they neither undo what was done previously nor swap objects between clusters; this may lead to low-quality clusters if merge or split decisions are not well chosen. Hierarchical clustering methods can be categorized into algorithmic methods, probabilistic methods, and Bayesian methods. The agglomerative and divisive methods are algorithmic. An agglomerative method requires at most n iterations. A tree structur...
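A minimal agglomerative example using SciPy (assuming SciPy is available; the data and the choice of average linkage are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two small groups of 2-D points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]], dtype=float)

Z = linkage(X, method="average")                  # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 flat clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]
```

Z records the merge tree (the dendrogram); cutting it at a chosen level yields the flat clusters, here two.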

k-means Partition Clustering

The simplest method of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups. A partitioning algorithm organizes the objects into k partitions, where each partition represents a cluster. The well-known and commonly used partitioning methods are k-means and k-medoids. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. A centroid-based partitioning technique uses the centroid of a cluster, C_i, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The quality of cluster C_i can be measured by the within-cluster variation, which is the sum of squared error between all objects in C_i and the centroid c_i, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)²

where E is the sum of the squared error for all objects in the data set, p is the point in space representing a given object, and c_i is the centroid of...
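A bare-bones sketch of Lloyd's k-means iteration, alternating the two steps implied by the objective above (assign points to the nearest centroid, then recompute centroids); illustrative only, with no convergence test or empty-cluster handling:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """A minimal Lloyd's k-means sketch (illustrative: no convergence
    test and no empty-cluster handling)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        labels = np.linalg.norm(X[:, None] - centroids[None, :],
                                axis=2).argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centroids[labels]) ** 2).sum()   # the E defined above
    return labels, centroids, sse
```

The returned sse is the within-cluster variation E defined above; each k-means iteration decreases it.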

Clustering Analysis

Clustering is the process of grouping a set of data objects into multiple groups so that objects within a group have high similarity to one another but are very dissimilar to objects in other groups. Clustering helps discover previously unknown groups within the data. Cluster analysis can be used to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. The main focus here is on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages. Clustering is a form of learning by observation, rather than learning by examples. Requirements of clustering:
- Scalability
- Ability to deal with various types of attributes
- Discovery of clusters with arbitrary shape
- Requirements for domain knowledge
- Ability to deal with noisy data
- Incremental clustering and insensitivity ...

Scheduling of Computer Processes

The process scheduler selects an available process from the Ready state for execution. Each CPU core can run one process at a time. To maximize CPU utilization, a multicore system runs multiple processes at a time. If there are more processes than cores, the excess processes have to wait until a core is free and they can be rescheduled. The number of processes currently in memory is known as the degree of multiprogramming. Most processes can be described as either I/O bound or CPU bound. An I/O-bound process is one that spends more of its time doing I/O than doing computations, whereas a CPU-bound process spends more of its time doing computations. Processes are put into a Ready queue once they enter the system. The queue is generally stored as a linked list whose header contains a pointer to the first PCB. When a process is allocated a CPU core, it executes for a while and eventually terminates, is interrupted, or waits for the occurrence of a particular event. The processes that are wai...
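A toy sketch of a FIFO ready queue holding PCB-like records; the dictionary fields are illustrative stand-ins for a real process control block:

```python
from collections import deque

# Processes enter the system and join the ready queue.
ready_queue = deque()
ready_queue.append({"pid": 1, "state": "ready"})
ready_queue.append({"pid": 2, "state": "ready"})

# The scheduler dispatches the process at the head of the queue.
pcb = ready_queue.popleft()
pcb["state"] = "running"
print(pcb)               # {'pid': 1, 'state': 'running'}
print(len(ready_queue))  # 1 process still waiting for a free core
```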

Association rule analysis - FP Growth Algorithm

The Apriori algorithm generates candidates and tests whether they are frequent. The generation of candidate itemsets and the support counting are expensive in both space and time. The FP-Growth algorithm allows frequent itemset discovery without candidate itemset generation. The FP-tree is constructed by:
1. Scanning the data and finding the support for each item.
2. Discarding the infrequent items.
3. Sorting the frequent items in decreasing order of support.
4. Building a compact data structure called the FP-tree that maps every transaction to a path in the tree.
The height of the tree is bounded by the maximum number of items in a transaction. FP-growth is an efficient method for mining frequent patterns in a large database: it uses a highly compact FP-tree and is divide-and-conquer in nature.
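A minimal example using the mlxtend library's FP-growth implementation (assuming mlxtend and pandas are installed; the transactions and the 0.5 support threshold are illustrative):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Illustrative transactions.
transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["milk", "diaper"],
                ["bread", "butter", "diaper"]]

# One-hot encode the transactions, then mine frequent itemsets.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```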

The Computer Process

A process is the unit of work in a modern computing system. A system consists of a collection of processes, some executing user code, others executing operating system code. The status of the current activity of a process is represented by the value of the program counter and the contents of the processor's registers. Process memory is typically divided into a Text section, a Data section, a Heap section, and a Stack section. The Text section carries the executable code, the Data section holds global variables, the Heap section is dynamically allocated during program run time, and the Stack section keeps temporary data such as function parameters, return addresses, and local variables. The sizes of the text and data sections are fixed, whereas the stack and heap sections can shrink and grow dynamically during program execution. A program is a passive entity containing a list of instructions stored on disk, whereas a process is an active entity with a program counter specifying the next instruction to execut...

Sine & Cosine Function Graph

The values of the Sine and Cosine functions for given x are tabulated below. Plotting the points from the table and continuing along the x-axis gives the shape of the Sine and Cosine graphs. The shape of each graph repeats after 2π, which means the functions are periodic with a period of 2π. The domain of both the Sine and Cosine functions is (−∞, ∞) and the range is [−1, 1]. Since sine is an odd function, the graph of y = sin x is symmetric about the origin, whereas the graph of y = cos x is symmetric about the y-axis since cosine is an even function.
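A short matplotlib sketch that reproduces the plot described above (assuming matplotlib is installed):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2 * np.pi, 2 * np.pi, 400)
plt.plot(x, np.sin(x), label="y = sin x")
plt.plot(x, np.cos(x), label="y = cos x")
plt.axhline(0, color="gray", lw=0.5)   # x-axis for reference
plt.xlabel("x (radians)")
plt.legend()
plt.show()
```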

Maxima and Minima

A function has a global maximum at a if f(a) ≥ f(x) for all x in D, the domain of f; f(a) is called the maximum value of f on D. A function has a global minimum at a if f(a) ≤ f(x) for all x in D; f(a) is called the minimum value of f on D. If a function is continuous on a closed interval [a, b], then the function attains both an absolute maximum value M and an absolute minimum value m in [a, b]. That is, there are numbers c and d in [a, b] with f(c) = M and f(d) = m, and m ≤ f(x) ≤ M for every other x in [a, b]. This is called the Extreme Value Theorem. The theorem doesn't apply to functions that are not continuous on [a, b]. A function has a local maximum at a point a if f(a) ≥ f(x) for all x in some open interval containing a. A function f has a local minimum at a point a if f(a) ≤ f(x) for all x in some open interval containing a. If f has a local maximum or minimum at a, and if f'(a) exists, then f'(a) = 0.
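A small SymPy sketch of the closed-interval method this suggests: find where f'(x) = 0, then compare the critical values with the endpoint values (the function and interval here are illustrative):

```python
import sympy as sp

x = sp.symbols("x")
f = x**3 - 3 * x**2 + 1              # an illustrative function on [-1, 3]

critical = sp.solve(sp.diff(f, x), x)                 # where f'(x) = 0
candidates = [c for c in critical if -1 <= c <= 3] + [-1, 3]
values = {c: f.subs(x, c) for c in candidates}
print(values)   # {0: 1, 2: -3, -1: -3, 3: 1} -> M = 1, m = -3
```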

Taylor Series

Taylor series is based on a theorem that states that any smooth function f(x) can be expressed as a polynomial in the neighborhood of a point a. Assuming all derivatives of the function exist at a, the series takes the form

f(x) = C_0 + C_1(x − a) + C_2(x − a)² + C_3(x − a)³ + ...

Evaluating at x = a leaves only the constant term, so f(a) = C_0. Differentiating the series term by term gives another power series, and evaluating each successive derivative at x = a again isolates its constant term:

f'(a) = 1 · C_1
f''(a) = 2 · 1 · C_2
f'''(a) = 3 · 2 · 1 · C_3
...
f^(k)(a) = k! · C_k

Solving the equation for the k-th coefficient gives C_k = f^(k)(a) / k!. If a function has a power series expansion at a with radius of convergence R > 0, that is,

f(x) = Σ_{n=0}^{∞} C_n (x − a)^n,  |x − a| < R,

then C_n = f^(n)(a) / n!. Substituting the C_n value into the formula,

f(x) = Σ_{n=0}^{∞} [f^(n)(a) / n!] (x − a)^n.

The above series is called the Taylor series of the function f at a. When a = 0, the series becomes

f(x) = Σ_{n=0}^{∞} [f^(n)(0) / n!] x^n,

which is called the Maclaurin series.
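A quick SymPy check of a Maclaurin series (a = 0), assuming SymPy is installed:

```python
import sympy as sp

x = sp.symbols("x")
# Maclaurin series (Taylor series at a = 0) of sin x up to x^7.
print(sp.series(sp.sin(x), x, 0, 8))
# x - x**3/6 + x**5/120 - x**7/5040 + O(x**8)
```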

Derivative of a function (Differentiation)

Let f(x) be a function defined in an open interval containing a. The derivative of the function f(x) at a, denoted by f'(a), is defined by

f'(a) = lim_{h→0} [f(a + h) − f(a)] / h

The process of finding a derivative is called differentiation. Some common functions and their derivatives.
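A tiny numerical sketch of this definition, approximating the limit with a small h (this uses the central-difference variant for accuracy; the step size is illustrative):

```python
def derivative(f, a, h=1e-6):
    """Approximate f'(a) by a difference quotient with small h."""
    return (f(a + h) - f(a - h)) / (2 * h)

print(derivative(lambda x: x**2, 3))   # ~6.0, since d/dx x^2 = 2x
```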

Continuity and Intermediate Value Theorem

For a function to be continuous at a point, it must be defined at that point, its limit must exist at the point, and the value of the function at that point must equal the value of the limit at that point. A function is continuous over an open interval if it is continuous at every point in the interval. It is continuous over a closed interval if it is continuous at every point in its interior and is continuous at its endpoints. A function f(x) is continuous at a point a if and only if the following three conditions are satisfied:
1. f(a) is defined
2. lim_{x→a} f(x) exists
3. lim_{x→a} f(x) = f(a)
A function is discontinuous at a point a if it fails to be continuous at a. Discontinuities may be classified as removable, jump, or infinite. Let f and g be continuous functions. Then f + g is a continuous function, fg is a continuous function, and f/g is a continuous function when g(x) ≠ 0. The Intermediate Value Theorem: Suppose f(x) is a continuous function on the interval [a,b] with ...
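The Intermediate Value Theorem is what makes root bracketing work: if a continuous f changes sign on [a, b], it must take the value 0 somewhere in between. A minimal bisection sketch under that assumption:

```python
def bisect(f, a, b, tol=1e-8):
    """Find a root of continuous f on [a, b], assuming f(a) and f(b)
    have opposite signs (the IVT guarantees a root exists)."""
    assert f(a) * f(b) < 0, "f must change sign on [a, b]"
    while b - a > tol:
        mid = (a + b) / 2
        if f(a) * f(mid) <= 0:   # the sign change is in the left half
            b = mid
        else:                    # the sign change is in the right half
            a = mid
    return (a + b) / 2

print(bisect(lambda x: x**2 - 2, 0, 2))   # ~1.41421356 (sqrt 2)
```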

Greedy Algorithm

A greedy algorithm is a simple and efficient algorithmic approach that solves a problem by selecting the best available option at each moment, without regard to future consequences. The most basic purpose of a greedy algorithm is optimization, which can be either minimization or maximization, depending on the problem statement. Greedy algorithms have several advantages over other algorithmic approaches:
- They are often easier to describe and code up than other algorithms.
- They can often be implemented more efficiently than other algorithms.
- Though a greedy algorithm does not produce an optimal solution for every problem, it does for many problems, namely those where a locally best choice leads to a globally optimal solution.
Because greedy algorithms are fast and efficient, they appear inside many commonly used algorithms, such as: Huffman Coding: the most common algorithm in the field of data compression (see the sketch below). It analyzes a given message and, based on the frequenc...
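A compact Huffman coding sketch, greedily merging the two least frequent subtrees with a heap (the helper name and the code-table representation are illustrative):

```python
import heapq
from collections import Counter

def huffman_codes(message):
    """Greedy Huffman construction: repeatedly merge the two least
    frequent subtrees. Each heap entry is (frequency, tie-break id,
    {symbol: code-so-far})."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(message).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, i, right = heapq.heappop(heap)
        # Prepend a bit: 0 for the left subtree, 1 for the right.
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

print(huffman_codes("abracadabra"))
# e.g. {'a': '0', 'b': '110', 'r': '111', 'c': '100', 'd': '101'}
```

Frequent symbols such as 'a' end up with shorter codes, which is what makes the greedy choice optimal here.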

Function - One to One & Onto

One to One
A function f from A to B is called one-to-one (or 1-1) if whenever f(a) = f(b) then a = b. No element of B is the image of more than one element of A. In a one-to-one function, given any y there is only one x that can be paired with the given y. f is one-to-one (injective) if f maps every element of A to a unique element of B; in other words, no element of B is mapped to by two or more elements of A.
Onto
A function f from A to B is called onto if for every b in B there is an a in A such that f(a) = b. All elements of B are used. f is onto (surjective) if every element of B is mapped to by some element of A; in other words, nothing is left out.
f is one-to-one onto (bijective) if it is both one-to-one and onto. In this case the map f is also called a one-to-one correspondence. Notice that "f is one-to-one" asserts uniqueness, while "f is onto" asserts existence. Let A and B be two finite sets such that there is a function f: A -> B. We claim the foll...
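For finite sets, both properties can be checked directly; a small sketch (the example function f(x) = 2x and the sets are illustrative):

```python
def is_injective(f, A):
    """f is one-to-one if distinct inputs give distinct outputs."""
    images = [f(a) for a in A]
    return len(images) == len(set(images))

def is_surjective(f, A, B):
    """f is onto if every element of B is hit by some element of A."""
    return set(B) <= {f(a) for a in A}

A, B = {1, 2, 3}, {2, 4, 6}
f = lambda x: 2 * x
print(is_injective(f, A), is_surjective(f, A, B))   # True True
```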

Limit of a function

Let f(x) be a function defined at all values in an open interval containing a, with the possible exception of a itself, and let L be a real number. If all values of the function f(x) approach the real number L as the values of x (≠ a) approach the number a, then we say that the limit of f(x) as x approaches a is L. It is represented as

lim_{x→a} f(x) = L

For the limit of a function to exist at a point, the functional values must approach a single real-number value at that point. If the functional values do not approach a single value, then the limit does not exist.
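A quick SymPy illustration (assuming SymPy is installed): sin(x)/x is undefined at x = 0, yet its limit there exists.

```python
import sympy as sp

x = sp.symbols("x")
print(sp.limit(sp.sin(x) / x, x, 0))   # 1
```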

Exercise 14 - Matrix Consistency

If the systems A_{n×n} x = b_1 and A_{n×n} y = b_2 are consistent, does that mean Ax = b_1 + b_2 is consistent? Justify. Answer: Since both systems are consistent, there exists x_1 such that Ax_1 = b_1 and there exists x_2 such that Ax_2 = b_2. So A(x_1 + x_2) = Ax_1 + Ax_2 = b_1 + b_2. Therefore Ax = b_1 + b_2 is consistent, with x = x_1 + x_2 as a solution.
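A quick numerical check of the argument (the matrix and solutions are illustrative; note this A is even singular, so consistency is not automatic):

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 4.0]])     # a singular matrix
x1 = np.array([1.0, 1.0]); b1 = A @ x1     # Ax1 = b1 is consistent
x2 = np.array([2.0, 0.0]); b2 = A @ x2     # Ax2 = b2 is consistent

# x1 + x2 solves Ax = b1 + b2, as the argument shows.
print(np.allclose(A @ (x1 + x2), b1 + b2))  # True
```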

Exercise 13 - Rank of a Matrix

Let A be an m × m matrix with elements a_jk = j + k − Θ for a fixed Θ such that π ≤ Θ ≤ 2π. Compute the rank of A. Answer: Row j of A is (j − Θ)(1, 1, ..., 1) + (1, 2, ..., m), so every row is a linear combination of the same two vectors. Hence the rank of the matrix is 2 (for m ≥ 2).
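A numerical spot check (m and Θ illustrative):

```python
import numpy as np

m, theta = 5, 4.0                       # a Θ in [π, 2π]
j = np.arange(1, m + 1)
A = j[:, None] + j[None, :] - theta     # a_jk = j + k - Θ
print(np.linalg.matrix_rank(A))         # 2
```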

Exercise 12 - Operation Count

A student X is interested in solving the system A_{n×n} x = b using Crout's method. However, her friend Y has already worked out the LU decomposition of the matrix PA, where P is a nonsingular matrix typically used in biological applications. Can X still use this information to solve the original system? If so, how many divisions, multiplications, and additions would be required? Answer: Yes.
Ax = b
PAx = Pb
LUx = Pb
Let Ux = y; then Ly = Pb. Forming Pb costs n² multiplications and n² − n additions; y can then be found by forward substitution and x by backward substitution, each triangular solve costing n divisions, (n² − n)/2 multiplications, and (n² − n)/2 additions. So:
Total multiplications = n² + (n² − n) = 2n² − n
Total additions = (n² − n) + (n² − n) = 2(n² − n)
Total divisions = 2n
Total operations = 4n² − n

Exercise 11 - Operation Count

If A is an 823 × 823 matrix and one wishes to do a decomposition of the form A = LU with Doolittle's method, compute separately the number of divisions, multiplications, and additions required in the computation of L and U. Answer: With n = 823,
Additions = (2n³ − 3n² + n)/6 = 185,475,395
Multiplications = (2n³ − 3n² + n)/6 = 185,475,395
Divisions = (n² − n)/2 = 338,253
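The counts follow directly from the formulas; a one-line check:

```python
n = 823
adds = mults = (2 * n**3 - 3 * n**2 + n) // 6
divs = (n**2 - n) // 2
print(adds, mults, divs)   # 185475395 185475395 338253
```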

Function - Domain, Codomain, Range

A function is a relation from a set of inputs to a set of possible outputs where each input is related to exactly one output. E.g., f(x) = x², where f is the function, x the input, and x² the output. One input with many outputs can't be a function, but many inputs with one output can be a function. E.g., a rule that assigns two outputs to one input, such as x = y², is not a function, whereas f(x) = x², which sends both x and −x to the same output, is a function. The domain is all the values that go into a function (the inputs). The range is all the values that come out of a function (the outputs). The codomain is all the possible outcomes of a function. For example, for the mapping that doubles each input:
Domain: 1, 3, 5, 7, 9
Codomain: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20
Range: 2, 6, 10, 14, 18
The range is a subset of the codomain.

Exercise 10 - Apriori Algorithm

Find the frequent itemsets and generate association rules on the above transactions. Assume a minimum support threshold s = 40% and a minimum confidence threshold c = 60%. Answer: Minimum support count = (40/100) × 8 = 3.2 ≈ 3. Only one itemset meets this minimum support count of 3, so there is only one frequent itemset: {Grapes, Banana}. Association rules:
Grapes ⇒ Banana: Sup(Grapes, Banana) / Sup(Grapes) = 3/3 = 100% > c, so VALID
Banana ⇒ Grapes: Sup(Grapes, Banana) / Sup(Banana) = 3/4 = 75% > c, so VALID

Association rules analysis - Apriori algorithm

A set of items together is called an itemset; it may contain one or more items. An itemset that occurs frequently is called a frequent itemset, e.g., {bread, butter} or {milk, diaper}. An itemset is called frequent if it satisfies minimum threshold values for support and confidence. Support measures how frequently the items occur together in a single transaction; confidence measures how often a rule holds, that is, the fraction of transactions containing one itemset that also contain the other. Only those itemsets that meet the minimum support and confidence requirements are considered by the frequent itemset mining method. Frequent itemset mining algorithms aim to mine the hidden patterns among itemsets quickly and with low memory consumption. Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association: support, confidence, and lift. Market Basket Analysis is a popular application of Association R...
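A tiny pure-Python sketch of support and confidence on illustrative transactions:

```python
# Illustrative transactions.
T = [{"bread", "butter"}, {"bread", "milk"},
     {"bread", "butter", "milk"}, {"milk"}]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in T) / len(T)

def confidence(X, Y):
    """Of the transactions containing X, the fraction also containing Y."""
    return support(X | Y) / support(X)

print(support({"bread", "butter"}))        # 0.5
print(confidence({"bread"}, {"butter"}))   # 2/3 ~ 0.667
```

Lift can be computed analogously as support(X ∪ Y) / (support(X) · support(Y)).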