
Posts

Showing posts from May, 2021

Decision Tree - Information Gain & Gain Ratio

ID3 uses information gain as its attribute selection measure. The attribute with the highest information gain is chosen as the splitting attribute for node N. The expected (average) amount of information needed to identify the class label of a tuple in D is Info(D) = -Σ_i p_i log2(p_i), where p_i is the non-zero probability that a tuple belongs to class C_i, estimated by |C_i,D| / |D|. Info(D) is also known as the entropy of D. The additional information still required after partitioning on attribute A is Info_A(D) = Σ_j (|D_j| / |D|) x Info(D_j), where |D_j| / |D| is the weight of the jth partition. Information gain is defined as the difference between the original information requirement and the new requirement: Gain(A) = Info(D) - Info_A(D). The information gain measure is biased toward tests with many outcomes. C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a normalization to information gain using a split information value, SplitInfo_A(D) = -Σ_j (|D_j| / |D|) log2(|D_j| / |D|), so the gain ratio is GainRatio(A) = Gain(A) / SplitInfo_A(D). The attribute with the maxim...
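As a minimal illustration of these formulas, the following Python sketch computes Info(D), Info_A(D), and Gain(A) for a small hypothetical data set; the attribute names and values are invented for demonstration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class probabilities."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(records, attr, label):
    """Gain(A) = Info(D) - Info_A(D), with records given as dicts."""
    labels = [r[label] for r in records]
    info_d = entropy(labels)
    # Partition D by the values of attribute A; weight each partition by |D_j|/|D|
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr], []).append(r[label])
    info_a = sum(len(p) / len(records) * entropy(p) for p in partitions.values())
    return info_d - info_a

# Hypothetical toy tuples (values invented for illustration)
data = [
    {"age": "youth", "buys": "no"},
    {"age": "youth", "buys": "no"},
    {"age": "middle", "buys": "yes"},
    {"age": "senior", "buys": "yes"},
    {"age": "senior", "buys": "no"},
]
print(info_gain(data, "age", "buys"))
```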

Decision Tree Classification

 A decision tree is a flowchart-like tree structure. The topmost node in a tree is the root node. Each internal node (non-leaf node) denotes a test on an attribute, and each branch represents an outcome of the test. Each leaf node (or terminal node) holds a class label. Decision trees can handle multidimensional data. Some of the decision tree algorithms are Iterative Dichotomiser (ID3), C4.5 (a successor of ID3), and Classification and Regression Trees (CART). Most algorithms for decision tree induction follow a top-down approach: the tree starts with a training set of tuples and their associated class labels. The algorithm is called with a data partition, an attribute list, and an attribute selection method, where the data partition is initially the complete set of training tuples and their associated class labels. The splitting criterion is determined by the attribute selection method, which indicates the splitting attribute and possibly a split point or splitting subset. Attribu...
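A brief sketch of decision tree induction in practice, using scikit-learn's DecisionTreeClassifier on a toy data set; the encoded feature values are invented for illustration and the library is only one possible choice.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training tuples: [age_code, income_code] with class labels (invented data)
X = [[0, 0], [0, 1], [1, 1], [2, 1], [2, 0], [1, 0]]
y = ["no", "no", "yes", "yes", "no", "yes"]

# Induce the tree top-down; criterion="entropy" corresponds to information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[2, 1]]))   # classify an unseen tuple
```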

Classification Concept

 Classification is a form of data analysis that extracts models describing important data classes. The models predict categorical class labels for given data. Fraud detection and target marketing are examples of classification applications. A classification model is constructed in a learning step, also called the training phase. The learning step builds the classifier by analyzing the training set, made up of tuples and their associated class labels. The class label attribute is discrete-valued and unordered. Because the class labels of the training data are provided, this kind of learning is called supervised learning. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified.

Matrix Eigenvalues & Eigenvectors

The process of finding an unknown scalar λ and a nonzero vector x for a given nonzero square matrix A of dimension n x n is called the matrix eigenvalue problem: Ax = λx. A λ and x that satisfy this equation are called an eigenvalue and an eigenvector of A. Ax is proportional to x: the multiplication produces a new vector that has the same or opposite direction as the original vector. The set of all the eigenvalues of A is called the spectrum of A. The largest of the absolute values of the eigenvalues of A is called the spectral radius of A. To determine the eigenvalues and eigenvectors, the equation can be written in matrix notation as (A - λI)x = 0. By Cramer's theorem, this homogeneous linear system of equations has a nontrivial solution if and only if the corresponding determinant of the coefficients is zero: D(λ) = det(A - λI) = 0. A - λI is called the characteristic matrix and D(λ) is the characteristic determinant of A. The above equation is called the characteristic ...
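For instance, a short NumPy sketch that solves Ax = λx for a small 2 x 2 matrix; the matrix entries are arbitrary example values.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# NumPy solves det(A - lambda*I) = 0 and returns eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("eigenvalues:", eigenvalues)            # the spectrum of A
print("spectral radius:", max(abs(eigenvalues)))

# Check A x = lambda x for the first eigenpair
x = eigenvectors[:, 0]
print(np.allclose(A @ x, eigenvalues[0] * x))
```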

Logarithmic Properties

The logarithm is the inverse function to exponentiation: the logarithm of a given number x is the exponent to which another fixed number, the base b, must be raised to produce that number x. e.g., log_b x = y is the same as b^y = x, so log_2 8 = 3 is the same as 2^3 = 8. The following are the properties of logarithms.
log_b (xy) = log_b x + log_b y
log_b (x/y) = log_b x - log_b y
log_b (x^n) = n log_b x
log_b a = log_c a / log_c b (change of base)
log_b a = 1 / log_a b
Some basic values of the logarithm: log_b 1 = 0, log_b b = 1, log_b b^2 = 2.
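These identities are easy to verify numerically; a small Python check using the math module (the sample values are arbitrary):

```python
import math

b, x, y, n, a, c = 2, 8, 4, 3, 5, 10

# log_b(xy) = log_b x + log_b y
print(math.isclose(math.log(x * y, b), math.log(x, b) + math.log(y, b)))
# log_b(x^n) = n * log_b x
print(math.isclose(math.log(x ** n, b), n * math.log(x, b)))
# change of base: log_b a = log_c a / log_c b
print(math.isclose(math.log(a, b), math.log(a, c) / math.log(b, c)))
# log_2 8 = 3, i.e. 2^3 = 8
print(math.log2(8))
```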

Abstract Data Type

An Abstract Data Type (ADT) defines a data type only in terms of its operations and the axioms to which those operations must adhere. List, Stack, Queue, Set, Map, and Stream are examples of ADTs. In a stack, the element deleted is the one most recently inserted: this is the last-in, first-out (LIFO) policy. In a queue, the element deleted is always the one that has been in the set for the longest time: this is the first-in, first-out (FIFO) policy. Stack: the insert operation on a stack is called Push and the delete operation is called Pop. An attempt to pop an empty stack is called an underflow, and if the top exceeds the stack size n, it is an overflow. Queue: the insert operation on a queue is called Enqueue and the delete operation is called Dequeue. The queue has a head and a tail. When the queue head is equal to the tail, the queue is empty; when the head is tail + 1, the queue is full. Linked List: a linked list is a data structure in which the objects are arranged in a linear order. The ord...
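A minimal Python sketch of the stack and queue operations described above, with explicit underflow checks; the class and method names are my own choices for illustration.

```python
from collections import deque

class Stack:
    """LIFO: push inserts, pop removes the most recently inserted element."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        if not self._items:                 # popping an empty stack -> underflow
            raise IndexError("stack underflow")
        return self._items.pop()

class Queue:
    """FIFO: enqueue at the tail, dequeue from the head (oldest element)."""
    def __init__(self):
        self._items = deque()
    def enqueue(self, item):
        self._items.append(item)
    def dequeue(self):
        if not self._items:                 # dequeueing an empty queue -> underflow
            raise IndexError("queue underflow")
        return self._items.popleft()

s, q = Stack(), Queue()
s.push(1); s.push(2); print(s.pop())          # 2 (last in, first out)
q.enqueue(1); q.enqueue(2); print(q.dequeue())  # 1 (first in, first out)
```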

Cache Memory

Cache memory is a supplementary memory system that temporarily stores frequently used instructions and data for quicker processing by the CPU of a computer. Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address. There are M = 2^n / K fixed-length blocks in main memory, where K is the number of words per block. The cache consists of m blocks called lines. The length of a line, not including tag and control bits, is the line size. As there are more blocks in main memory than lines in the cache, an individual line cannot be uniquely and permanently dedicated to a particular block. Thus, each line includes a tag that identifies which particular block is currently being stored. The cache connects to the processor via data, control, and address lines. When a cache hit occurs, the data and address buffers are disabled and communication is only between processor and cache, with no system bus traffic. When a cache miss occurs, the desired address is loaded onto the system bus and the data are r...
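As a rough sketch of the tag idea (assuming a simple direct-mapped cache, which is only one possible organization and not something this post specifies), the Python snippet below splits a word address into tag, line, and offset fields and checks a line's tag on each lookup; all sizes are invented.

```python
# Hypothetical sizes (invented for illustration): K words per block, 8 cache lines
K = 4          # words per block
M_LINES = 8    # number of cache lines

def split_address(addr):
    """Split a word address into (tag, line, offset) for a direct-mapped cache."""
    offset = addr % K
    block = addr // K
    line = block % M_LINES
    tag = block // M_LINES
    return tag, line, offset

cache_tags = [None] * M_LINES   # the tag stored with each line

def access(addr):
    tag, line, _ = split_address(addr)
    if cache_tags[line] == tag:
        return "hit"            # processor <-> cache only, no system bus traffic
    cache_tags[line] = tag      # miss: block fetched from main memory into the line
    return "miss"

print(access(100), access(100), access(228))   # miss, hit, then a conflicting block
```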

Characteristics of Computer Memory

The key characteristics of computer memory systems are location, capacity, unit of transfer, performance, access method, and physical type & characteristics. The internal memory capacity is expressed in terms of bytes or words, whereas external memory capacity is expressed in bytes. The size of a word is equal to the number of bits used to represent an integer and to the instruction length. The relationship between the length in bits A of an address and the number N of addressable units is 2^A = N. The unit of transfer is the number of bits read out of or written into memory at a time for main memory. The important performance parameters are access time, memory cycle time, and transfer rate. The access time is the time it takes to perform a read or write operation for random-access memory, and the time it takes to position the read-write mechanism at the desired location for non-random-access memory. The access time plus any additional time required before a second access can c...

Dot and Cross product

Vectors can be multiplied in two ways: the scalar (dot) product and the vector (cross) product. The dot product is the product of the magnitudes of the vectors and the cosine of the angle between them: a · b = |a||b| cos α. The cross product has magnitude equal to the product of the magnitudes of the vectors and the sine of the angle between them: |a × b| = |a||b| sin α. The dot product is a scalar and the cross product is a vector.
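A quick NumPy check of both products on example vectors (the values are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = np.dot(a, b)       # scalar: |a||b| cos(alpha)
cross = np.cross(a, b)   # vector perpendicular to a and b, magnitude |a||b| sin(alpha)

# Recover the angle between a and b from the dot product
alpha = np.arccos(dot / (np.linalg.norm(a) * np.linalg.norm(b)))
print(dot, cross, np.degrees(alpha))
```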

Real and Complex Numbers

A real number is any number that can be placed on a number line that extends to infinity in both the positive and negative directions, e.g., 12, 45, 45.98, -0.985, -537. Consider the imaginary unit i, which is equal to the square root of -1; the square of i is i^2 = -1. A complex number is any number that includes i, such as 3i, 4 + 5i, -8i. The real numbers are a subset of the complex numbers, since any real number r can be written as r + 0i.

Exercise 8 - Data Central Tendency

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. (a) What is the mean of the data? (b) What is the median? (c) What is the mode of the data? (d) What is the midrange of the data? (e) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data? (f) Give the five-number summary of the data. Solution: a) Total number of age values given = 27. Mean of the data = 809/27 = 29.96. The mean age value is 29.96. b) The median, the center of the sorted data, is the 14th value. The median age value is 25. c) The mode is the value that occurs most frequently in the set. The data set is bimodal; the modes are 25 and 35. d) The midrange is the average of the smallest and largest values in the data set: smallest value = 13, largest value = 70, midrange = 83/2 = 41.5. The midrange age value is 41.5. e) First quartile (Q1) ...
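The same summary can be reproduced with Python's statistics module; a small sketch (multimode and quantiles require Python 3.8+):

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(statistics.mean(ages))            # ~29.96
print(statistics.median(ages))          # 25
print(statistics.multimode(ages))       # [25, 35]
print((min(ages) + max(ages)) / 2)      # midrange = 41.5
print(statistics.quantiles(ages, n=4))  # rough quartiles Q1, Q2, Q3
```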

Basic Data Statistics Measurement

Measures of central tendency of a data set include the mean, median, mode, and midrange. The mean of a set of N values x_1, ..., x_N is (x_1 + x_2 + ... + x_N) / N. Removing a certain percentage of the top and bottom values from the sorted set of values before computing the mean gives the trimmed mean. The middle value in a sorted set of values is known as the median, which is used to measure the center of data in a skewed or asymmetric data set. The approximate median of grouped data is median ≈ L_1 + ((N/2 - (Σfreq)_l) / freq_median) x width, where L_1 is the lower boundary of the median interval, N is the number of values in the data set, (Σfreq)_l is the sum of the frequencies of all intervals lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval. The value that occurs most frequently in a data set is known as the mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. A data set with two or more modes is multimodal. The average of the largest and smallest values in the data set is known as the midrange. ...
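A short Python sketch of the approximate grouped-data median formula above; the interval boundaries and frequencies are invented for illustration.

```python
def grouped_median(intervals):
    """intervals: list of (lower_boundary, width, frequency), sorted by boundary.
    median ~ L1 + ((N/2 - cumulative_freq_below) / freq_median) * width
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:          # this is the median interval
            return lower + ((n / 2 - cumulative) / freq) * width
        cumulative += freq

# Hypothetical age intervals: (lower boundary, width, frequency)
data = [(0, 10, 5), (10, 10, 12), (20, 10, 20), (30, 10, 8), (40, 10, 5)]
print(grouped_median(data))
```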

Data objects & its attribute

Data sets are made up of data objects. A data object represents an entity such as a customer, a product, etc. Data objects are typically described by attributes. Data objects stored in a database are called data tuples: rows in a database correspond to data objects and columns to their attributes. A feature of a data object is called an attribute. The types of attributes are nominal, binary, ordinal, and numeric. A nominal attribute represents some kind of category, code, or state; such attributes are also referred to as categorical. Though a nominal attribute may have integers as values, it is not considered a numeric attribute because the integers are not meant to be used quantitatively. A binary attribute is a nominal attribute with only two states: 0 or 1. It is also called Boolean when the states correspond to true and false. A binary attribute is symmetric if both of its states are equally valuable. An ordinal attribute is an attribute with possible values that have a meaningful order ...

Exercise 7 - Matrix Operation Count

Let A (n x n) be a given square matrix. Compute the number of multiplications and additions required to evaluate A^28 using a) the naive method, A^28 = A x A x A x ... x A (28 times), and b) repeated squaring, A^2, A^4 = A^2 x A^2, etc. Solution: a) For the given square matrix, A x A requires: number of multiplications = n x n x n = n^3; number of additions = (n - 1) x n x n = n^3 - n^2. For the 27 products of the naive method, number of multiplications = 27n^3 and number of additions = 27(n^3 - n^2). b) Step 1: A^2 = A x A. Step 2: A^4 = A^2 x A^2. Step 3: A^8 = A^4 x A^4. Step 4: A^16 = A^8 x A^8. Step 5: A^12 = A^8 x A^4. Step 6: A^28 = A^16 x A^12. Total number of steps required = 6. So, the total numbers of multiplications and additions are 6n^3 and 6(n^3 - n^2), respectively.
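A small NumPy sketch of the repeated-squaring approach, counting the matrix multiplications it actually performs; the matrix size is arbitrary.

```python
import numpy as np

def matrix_power_counted(A, k):
    """Compute A^k by repeated squaring; return (result, number of multiplications)."""
    result = None
    base = A
    mults = 0
    while k > 0:
        if k & 1:                       # include this power of two in the product
            if result is None:
                result = base
            else:
                result = result @ base
                mults += 1
        if k > 1:
            base = base @ base          # square for the next binary digit
            mults += 1
        k >>= 1
    return result, mults

A = np.random.rand(3, 3)
power, mults = matrix_power_counted(A, 28)
print(mults)                            # 6 multiplications, vs 27 for the naive method
print(np.allclose(power, np.linalg.matrix_power(A, 28)))
```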

Exercise 6 - Disk Operation

The average I/O size of an application is 64 KB. The following specifications are available from the disk manufacturer: average seek time = 5 ms, 7,200 RPM, transfer rate = 40 MB/s. Determine the rotational latency, the total access time, and the number of read/write operations per second (IOPS). Solution: Given specifications: application I/O size 64 KB = 65,536 bytes; T avg seek = 5 ms; rotation speed = 7,200 RPM; transfer rate = 40 MB/s = 41,943,040 bytes/s. Rotational latency = 1/2 x max rotation time = 1/2 x (60 s / 7200) = 0.00417 s = 4.17 ms. Avg transfer time = 65,536 / 41,943,040 = 0.00156 s = 1.56 ms. Avg seek time = 5 ms. T access = T avg seek + T avg rotation + T avg transfer = 5 + 4.17 + 1.56 = 10.73 ms. The average I/O operations per second (IOPS) = 1 / (disk average rotational latency in seconds + disk average seek time in seconds). Average IOPS = 1 / (0.00417 + 0.005) ≈ 109.
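The same calculation as a short Python sketch (values taken from the exercise above):

```python
io_size = 64 * 1024               # bytes
seek_time = 0.005                 # 5 ms average seek
rpm = 7200
transfer_rate = 40 * 1024 * 1024  # bytes per second

rotational_latency = 0.5 * (60 / rpm)        # half a revolution, ~4.17 ms
transfer_time = io_size / transfer_rate      # ~1.56 ms
access_time = seek_time + rotational_latency + transfer_time

iops = 1 / (seek_time + rotational_latency)  # ignoring transfer time

print(round(rotational_latency * 1000, 2), "ms")
print(round(access_time * 1000, 2), "ms")
print(round(iops))                           # ~109
```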

Exercise 5 - Disk Capacity

Compute the capacity of a disk with the following parameters: 256 bytes per sector, 600 sectors per track, 40,000 tracks per surface, 2 surfaces per platter, and 6 platters per disk. Solution: Specified parameters of the disk: 256 bytes per sector, 600 sectors per track, 40,000 tracks per surface, 2 surfaces per platter, 6 platters per disk. Disk capacity = (256 x 600 x 40,000 x 2 x 6) / (1024 x 1024 x 1024) = 68.66 GB. The disk capacity is 68.66 GB.

Computer Basic Measurement Units

 Hertz (Hz) = clock cycles per second (frequency). 1 MHz = 1,000,000 Hz. Processor speed is measured in MHz or GHz.

Exercise 4 - Disk Operation

Estimate the time to read a 512-byte sector-sized block from SRAM and DRAM. The access time for a doubleword stored in SRAM and DRAM is roughly 4 ns and 60 ns respectively. Solution: A doubleword is 8 bytes. SRAM access time = (512/8) x 4 ns = 256 ns. DRAM access time = (512/8) x 60 ns = 3,840 ns. The time required to access a 512-byte sector-sized block is 256 ns from SRAM and 3,840 ns from DRAM.

Exercise 3 - Disk Operation

Estimate the average time to access a sector on a disk whose parameters are given below: rotational speed 12,000 RPM, average number of sectors per track 500, average seek time 8 ms. Solution: Avg access time = avg seek time + avg rotation time + avg transfer time. Avg rotation time = 1/2 x (max rotation time) = 1/2 x (60 s / 12,000) x 1000 ms = 2.5 ms. Avg transfer time = (60 s / 12,000) x (1/500) x 1000 ms = 0.01 ms. Avg access time = 8 ms + 2.5 ms + 0.01 ms = 10.51 ms. The average time to access a sector is 10.51 ms.

Exercise 2 - Amdahl's Law

A programmer has parallelized 99% of a program, but there is no value in increasing the problem size, i.e., the program will always be run with the same problem size regardless of the number of processors or cores used. What is the expected speedup on 20 processors? Solution: As per Amdahl's law, speedup = 1 / ((1 - f) + f/N), where N is the number of processors = 20 and f is the fraction of parallel operation = 99%. Speedup = 1 / ((1 - 0.99) + (0.99 / 20)) = 1 / (0.01 + 0.0495) = 16.807. The expected speedup on 20 processors is 16.807.

Exercise 1 - Amdahl's Law

A programmer is given the job of parallelizing a program to achieve a speedup factor of 3.8 on 4 processors. He makes it 95% parallel and goes home dreaming of a big pay raise. Using Amdahl's law, assuming the problem size is the same as the serial version, and ignoring communication costs, what is the speedup factor that the programmer will actually get? Solution: Speedup formula as per Amdahl's law: speedup = 1 / ((1 - f) + f/N), where N is the number of processors = 4 and f is the fraction of parallel operation = 95%. Speedup = 1 / ((1 - 0.95) + (0.95/4)) = 1 / (0.05 + 0.2375) = 3.478. The programmer gets 3.478 as the speedup factor.
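Both Amdahl's law exercises follow from the same formula; a small Python helper (the function name is my own):

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup = 1 / ((1 - f) + f / N), the fixed-problem-size form of Amdahl's law."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / processors)

print(round(amdahl_speedup(0.95, 4), 3))   # Exercise 1: ~3.478
print(round(amdahl_speedup(0.99, 20), 3))  # Exercise 2: ~16.807
```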

Normalization or Standardization Process

The process of transforming the data to fall within a smaller or common range such as 0 to 1 is called normalization or standardization. Normalizing the data attempts to give all attributes an equal weight. Data normalization methods (a short code sketch follows this list):
Min-max normalization - performs a linear transformation on the original data, mapping a value v of an attribute A to a new value v' in the range [new_min_A, new_max_A] by computing v' = ((v - min_A) / (max_A - min_A)) x (new_max_A - new_min_A) + new_min_A, where min_A and max_A are the minimum and maximum values of the attribute. Min-max normalization preserves the relationships among the original data values.
z-score (zero-mean) normalization - the values for an attribute A are normalized based on the mean Ā and standard deviation σ_A of the attribute: v' = (v - Ā) / σ_A. A variation of z-score normalization replaces the standard deviation in this equation by the mean absolute deviation of the attribute, s_A = (1/n) Σ |v_i - Ā|, so the z-score normalization using the mean absolute deviation is v' = (v - Ā) / s_A.
Decimal scaling - this method normalizes ...
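A compact Python sketch of min-max and z-score normalization as defined above; the sample attribute values are arbitrary.

```python
import statistics

values = [200, 300, 400, 600, 1000]           # arbitrary attribute values

# Min-max normalization to the new range [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) * (1.0 - 0.0) + 0.0 for v in values]

# z-score normalization using mean and standard deviation
mean = statistics.mean(values)
stdev = statistics.pstdev(values)
z_scores = [(v - mean) / stdev for v in values]

# Variant: z-score using the mean absolute deviation s_A
mad = sum(abs(v - mean) for v in values) / len(values)
z_mad = [(v - mean) / mad for v in values]

print(min_max)
print(z_scores)
print(z_mad)
```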

Data Transformation Process

In data transformation, the data are consolidated into forms appropriate for mining, making the mining process more efficient and the patterns found easier to understand. Data transformation strategies:
Smoothing - removing noise from the data using techniques such as binning, regression, and clustering.
Attribute construction - constructing new attributes from the given set of attributes to help the mining process.
Aggregation - summary or aggregation operations are applied to the data.
Normalization - scaling the attribute data to fall within a smaller range such as 0 to 1.
Discretization - replacing values of numeric attributes by interval or conceptual labels.
Concept hierarchy generation - generalizing the attributes to higher-level concept labels.

Regression & Log linear models

Regression models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line y = wx + b, where y is the response variable, x is the predictor variable, and w, b are the regression coefficients; the coefficients specify the slope of the line and the y-intercept. The method of least squares can be used to solve for the coefficients. Log-linear models approximate discrete multidimensional probability distributions. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Regression and log-linear models can both be used on sparse data and skewed data. Regression can be computationally intensive when applied to high-dimensional data.
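As an example of fitting y = wx + b by least squares, a short NumPy sketch; the data points are invented for illustration.

```python
import numpy as np

# Invented (x, y) observations roughly following a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Solve the least-squares problem for [w, b] in y = w*x + b
X = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)

print(w, b)             # slope and y-intercept
print(w * 6.0 + b)      # predict the response for a new x
```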

Attribute subset selection

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all the attributes. It also helps to make the discovered patterns easier to understand, since they involve fewer attributes. Attribute subset selection techniques (see the sketch after this list):
Forward selection - starts with an empty set and adds the best attribute(s) in subsequent steps.
Backward elimination - starts with the full set and eliminates the worst attribute(s) in further steps.
Decision tree induction - attributes that do not appear in the tree are assumed to be irrelevant.
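A rough sketch of greedy forward selection using cross-validated accuracy as the evaluation measure; the estimator, data set, and stopping rule here are all illustrative choices, not the only way to do this.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

# Forward selection: start with an empty set, greedily add the best attribute
for _ in range(2):                               # stop after choosing 2 attributes
    scores = {}
    for attr in remaining:
        cols = selected + [attr]
        model = LogisticRegression(max_iter=1000)
        scores[attr] = cross_val_score(model, X[:, cols], y, cv=5).mean()
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```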

Gauss Seidel & Jacobi Iteration

Methods that solve a linear system of equations by Gauss elimination and its variants are called direct methods. Methods in which successive approximations are computed that converge toward the true solution are called indirect or iterative methods. The Gauss-Seidel iteration method is a method of successive corrections: each component is replaced by a corresponding new approximation as soon as it has been computed. For the linear system Ax = b, write A = I + L + U (assuming the system has been scaled so that the diagonal entries of A are 1), where I is the n x n unit matrix and L and U are, respectively, lower and upper triangular matrices with zero main diagonals. Then (I + L + U)x = b gives x = b - Lx - Ux, and the iteration is x^(m+1) = b - L x^(m+1) - U x^(m), where x^(m) is the mth approximation and x^(m+1) is the (m+1)th approximation. The Jacobi iteration, also known as the method of simultaneous corrections, is an iteration in which no component of the new approximation x^(m+1) is used until all components of x^(m+1) have been computed; it can be written x^(m+1) = b - (L + U) x^(m).
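A small Python sketch of Gauss-Seidel iteration in component form (each new value is used as soon as it is computed); the test system is an arbitrary diagonally dominant example so the iteration converges.

```python
import numpy as np

def gauss_seidel(A, b, iterations=25):
    """Successive corrections: new components replace old ones immediately."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(iterations):
        for i in range(n):
            # Use already-updated x[j] for j < i and old x[j] for j > i
            s = sum(A[i, j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i, i]
    return x

# Arbitrary diagonally dominant system (invented for illustration)
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 6.0, 2.0])

print(gauss_seidel(A, b))
print(np.linalg.solve(A, b))   # direct method, for comparison
```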

Principal components analysis (PCA)

Principal components analysis (PCA), also known as the Karhunen-Loève (K-L) method, searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k is less than or equal to n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. Basic procedure of PCA: The input data are normalized so that each attribute falls within the same range, to ensure that attributes with large domains do not dominate attributes with smaller domains. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components. The principal components are sorted in order of decreasing significance or strength. As the components are sorted in decreasing order of significance, the data s...
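A brief NumPy sketch of this procedure: normalize the data, compute orthonormal components from the covariance matrix, sort them by significance, and project onto the top k; the data here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # placeholder data: 100 tuples, 5 attributes

# 1. Normalize each attribute (z-score) so large domains do not dominate
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Orthonormal eigenvectors of the covariance matrix are the principal components
cov = np.cov(Xn, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort the components in decreasing order of significance (eigenvalue)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]

# 4. Project the data onto the first k components (dimensionality reduction)
k = 2
reduced = Xn @ components[:, :k]
print(reduced.shape)                  # (100, 2)
```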

Data Reduction Process

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression. Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration; its methods include wavelet transforms, principal components analysis, and attribute subset selection. Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation; these methods are parametric (regression, log-linear models) and non-parametric (histograms, clustering, sampling, data cube aggregation). In data compression techniques, transformations are applied so as to obtain a reduced or compressed representation of the original data.

Discrete wavelet transform (DWT)

The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector, transforms it into a numerically different vector of wavelet coefficients. The procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The length of the input data vector must be an integer power of 2; the data vector is padded with zeros as necessary to meet this condition. Two functions are applied to transform the data sets: the first performs data smoothing, such as a sum or weighted average, and the second performs a weighted difference. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data. The popular wavelet ...
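As a rough illustration (using the simple Haar wavelet, which is only one possible choice), the pyramid algorithm below repeatedly applies a pairwise average (smoothing) and a pairwise difference to a vector whose length is a power of 2.

```python
def haar_dwt(data):
    """Hierarchical pyramid algorithm: the data are halved at each iteration.
    The input length must be a power of 2 (pad with zeros beforehand if needed)."""
    coefficients = []
    current = list(data)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        differences = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coefficients = differences + coefficients   # keep the detail coefficients
        current = averages                          # recurse on the smoothed half
    return current + coefficients                   # overall average + details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```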

Storage Disk

Disks are storage devices that hold enormous amounts of data, on the order of hundreds to thousands of gigabytes. Disks are constructed from platters. Each platter consists of two sides, or surfaces. A rotating spindle in the center of the platter spins the platter at a fixed rotational rate, typically ranging from 5,400 to 15,000 RPM. Each surface consists of a collection of concentric rings called tracks. Each track is partitioned into a collection of sectors. Sectors are separated by gaps where no data bits are stored. The geometry of multiple platters is described in terms of cylinders, where a cylinder is the collection of tracks on all the surfaces that are equidistant from the center of the spindle. The maximum number of bits that can be recorded on a disk is known as the disk capacity. The number of bits that can be recorded in a 1-inch segment of a track is known as the recording density, measured in bits per inch. The number of tracks that can be squeezed into a 1-inch segment of the...

Data Integration Process

 Data mining often requires data integration, the merging of data from multiple data sources. Entity identification problem: when matching attributes from one database to another during integration, special attention must be paid to the structure of the data. This is to ensure that any attribute functional dependencies and referential constraints in the source system match those in the target system. Redundancy and correlation analysis: some redundancies can be detected by correlation analysis. For nominal data, we use the χ² (chi-square) test. For numeric attributes, the correlation coefficient and covariance can be used. Chi-square correlation test (Pearson statistic): χ² = Σ_i Σ_j (o_ij - e_ij)² / e_ij, where o_ij is the observed frequency (actual count) of the joint event (A_i, B_j) and e_ij is the expected frequency, e_ij = (count(A = a_i) x count(B = b_j)) / n, where n is the number of data tuples. The test evaluates the hypothesis that A and B are independent, i.e., that there is no correlation between them. The cells that contribute the most to the chi-square value are those for which the actual count is ...
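For nominal attributes, a short sketch of the chi-square correlation test using SciPy; the contingency-table counts below are invented for illustration.

```python
from scipy.stats import chi2_contingency

# Invented 2 x 2 contingency table of counts for two nominal attributes A and B
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi-square:", round(chi2, 2))
print("expected frequencies:", expected)
# A small p-value means we reject the hypothesis that A and B are independent,
# i.e. they are correlated and one of them may be redundant.
print("p-value:", p_value)
```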