
Basic Data Statistics Measurement

Measures of the central tendency of a data set include the mean, median, mode, and midrange.

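The mean of a set of N values x_1, x_2, ..., x_N is

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$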

The trimmed mean is the mean computed after removing a certain percentage of the top and bottom values from the sorted set of values.
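
As a minimal sketch of the mean and the trimmed mean, the Python below drops the same share of values from each end before averaging. The function names, the `fraction` parameter (a proportion per tail, rounded down to a whole number of values), and the sample values are assumptions made for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end.

    `fraction` is a proportion per tail (0.1 drops the lowest 10% and
    the highest 10%); the number of values dropped is rounded down.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # values to drop from each tail
    return mean(ordered[k:len(ordered) - k])


# Illustrative values only; 110 is an extreme value that pulls the mean up.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))               # 58.0
print(trimmed_mean(salaries, 0.1))  # 55.6
```

Trimming one value from each end removes both 30 and 110, so the result is less sensitive to the extreme value.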

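The middle value in a sorted set of values is known as the median. For skewed or asymmetric data it is a better measure of the center than the mean. When the number of values is even, the median is taken as the average of the two middle values.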

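The approximate median of grouped data is

$$\text{median} \approx L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

where

$L_1$ - lower boundary of the median interval

$N$ - number of values in the data set

$(\sum \text{freq})_l$ - sum of the frequencies of all intervals lower than the median interval

$\text{freq}_{\text{median}}$ - frequency of the median interval

width - width of the median interval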
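
The grouped-data approximation can be sketched in Python as follows. The input layout (a sorted list of `(lower_boundary, upper_boundary, frequency)` tuples), the function name, and the sample intervals are assumptions made for illustration. The median interval is taken to be the first interval at which the cumulative frequency reaches N/2.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower_boundary, upper_boundary, frequency)
    tuples sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("intervals must contain at least one observation")


# Illustrative grouped data: 20 values in three intervals of width 10.
ages = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(grouped_median(ages))  # 15.0
```

Here N/2 = 10 falls in the interval 10 to 20, with 5 values below it, so the estimate is 10 + ((10 - 5) / 10) × 10 = 15.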

The value that occurs most frequently in a data set is known as the mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal.

The average of the largest and smallest values in the data set is known as the midrange.
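
The median, mode, and midrange can be computed with the Python standard library, as in the hedged sketch below. The helper names `modes` and `midrange` and the sample values are assumptions made for illustration. `modes` returns every value tied for the highest count, so a bimodal data set yields two values.

```python
import statistics
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


# Illustrative values only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(statistics.median(salaries))  # 54.0 (average of 52 and 56)
print(modes(salaries))              # [52, 70], so the data set is bimodal
print(midrange(salaries))           # 70.0
```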

Measures of the dispersion of a data set include the range, quantiles, quartiles, percentiles, interquartile range, variance, and standard deviation.

The range is the difference between the largest and smallest values of a data set.

The quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-sized consecutive sets.

The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles.

The percentiles (100-quantiles) divide the data distribution into 100 equal-sized consecutive sets.

1st quartile - 25th percentile

2nd quartile - 50th percentile

3rd quartile - 75th percentile

The interquartile range (IQR) is the distance between the first and third quartiles, IQR = Q3 - Q1. It gives the range covered by the middle half of the data.

A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
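
A minimal sketch of the 1.5 × IQR rule in Python, using `statistics.quantiles` (available from Python 3.8). The function name and the sample data are assumptions made for illustration. Quartile conventions differ between tools, so the `method="inclusive"` choice here is also an assumption and other conventions can give slightly different cut points.

```python
import statistics


def iqr_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 x IQR rule of thumb."""
    # n=4 yields the three quartile cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # "At least 1.5 x IQR" beyond a quartile, hence <= and >=.
    outliers = [x for x in values if x <= low_fence or x >= high_fence]
    return q1, q3, iqr, outliers


# Illustrative values only.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print(iqr_outliers(data))  # (4.0, 8.0, 4.0, [30])
```

With Q1 = 4 and Q3 = 8 the fences are -2 and 14, so only 30 is flagged as a suspected outlier.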

The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
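
The five-number summary can be assembled in a few lines of Python. This is a sketch under the same assumptions as any quartile computation: the function name and sample data are illustrative, and the `method="inclusive"` quartile convention of `statistics.quantiles` is one choice among several.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Illustrative values only.
print(five_number_summary([2, 4, 4, 5, 6, 7, 8, 9, 30]))
# (2, 4.0, 6.0, 8.0, 30)
```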

Variance and standard deviation are measures of data dispersion. A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.

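The variance of N observations x_1, x_2, ..., x_N of a numeric attribute X is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2$$

where $\bar{x}$ is the mean of the observations.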

The standard deviation is the square root of the variance.
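
As a short Python sketch, the variance and standard deviation can be computed directly from the definition. This divides by N (the population variance); the sample variance, which divides by N - 1, gives a slightly larger value. The function names and sample data are assumptions made for illustration.

```python
import math


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


# Illustrative values only; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))  # 4.0
print(std_dev(data))   # 2.0
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same quantities.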
