Basic Data Statistics Measurement

Measures of the central tendency of a data set include the mean, median, mode, and midrange.

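The mean of a set of N values x₁, x₂, …, x_N is

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$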

The trimmed mean is the mean computed after removing a fixed percentage of the top and bottom values from the sorted set of values.
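As a minimal sketch of both definitions, the following computes the mean and a trimmed mean in plain Python. The function names, the sample data, and the choice to round the number of trimmed values down are assumptions made for illustration only.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    `proportion` is a fraction in [0, 0.5). The number of values cut from
    each end is rounded down (an assumption of this sketch).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values to drop at each end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Hypothetical sample data with one unusually large value.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(sample))                # 58.0
print(trimmed_mean(sample, 0.10))  # drops 30 and 110 before averaging
```

Trimming one value from each end removes the extreme 110, so the trimmed mean sits below the plain mean for this sample.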

The median is the middle value in a sorted set of values; when the number of values is even, it is the average of the two middle values. For skewed or asymmetric data, it is a better measure of the center than the mean.

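The approximate median of a grouped data set is

$$\text{median} \approx L_1 + \left(\frac{N/2 - (\sum \text{freq})_1}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

where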

L₁ - lower boundary of median interval

N - number of values in the data set

(Σfreq)₁ - sum of frequencies on all intervals lower than median interval

freq_median - frequency of the median interval

width - width of the median interval
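The interpolation described by these terms can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name, the tuple layout `(lower_boundary, upper_boundary, frequency)`, and the sample intervals are hypothetical, and the intervals are assumed to be sorted and contiguous.

```python
def grouped_median(intervals):
    """Approximate median from (lower_boundary, upper_boundary, frequency) tuples.

    Intervals are assumed to be sorted, contiguous, and non-overlapping.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("the intervals contain no observations")
    cumulative = 0  # sum of frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq


# Hypothetical grouped data: 50 values spread over four intervals.
groups = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
print(grouped_median(groups))  # 22.5
```

Here N/2 = 25 falls in the interval 20 to 30, which has 20 values below it, so the estimate is 20 + (25 - 20) / 20 × 10 = 22.5.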

The mode is the value that occurs most frequently in a data set. Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively. In general, a data set with two or more modes is multimodal.

The average of the largest and smallest values in the data set is known as midrange.
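The median, mode, and midrange can be illustrated with Python's standard `statistics` module. This is a small sketch on hypothetical sample data; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Hypothetical sample data with two equally frequent values.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# With an even count, the median is the average of the two middle values.
median = statistics.median(values)

# multimode returns every value tied for the highest frequency.
modes = statistics.multimode(values)

# Midrange: average of the smallest and largest values.
midrange = (min(values) + max(values)) / 2

print(median)    # 54.0
print(modes)     # [52, 70], so this sample is bimodal
print(midrange)  # 70.0
```

The midrange is pulled up to 70.0 by the single value 110, while the median stays at 54.0, which shows how sensitive the midrange is to extreme values.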

Measures of the dispersion of a data set include the range, quantiles, quartiles, percentiles, interquartile range, variance, and standard deviation.

The range is the difference between the largest and smallest values of a data set.

Quantiles are points taken at regular intervals of a data distribution, dividing it into consecutive sets of essentially equal size.

The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles.

The percentiles divide the data distribution into 100 equal-sized consecutive sets.

1st quartile - 25th percentile

2nd quartile - 50th percentile

3rd quartile - 75th percentile
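The correspondence between quartiles and percentiles can be checked with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The sample data is hypothetical, and the `"inclusive"` interpolation method is an assumption of this sketch: several conventions for computing quantiles exist, and they can give slightly different cut points on small samples.

```python
import statistics

# Hypothetical sample data.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 returns the 3 cut points that split the data into four parts.
quartiles = statistics.quantiles(values, n=4, method="inclusive")

# n=100 returns the 99 cut points that split the data into 100 parts;
# index k - 1 holds the k-th percentile.
percentiles = statistics.quantiles(values, n=100, method="inclusive")

print(quartiles)  # [49.25, 54.0, 64.75]
print(percentiles[24], percentiles[49], percentiles[74])  # 49.25 54.0 64.75
```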

The interquartile range (IQR) is the distance between the first and third quartiles, IQR = Q3 − Q1. It gives the range covered by the middle half of the data.

A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
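The rule of thumb can be sketched in a few lines. The function name and sample data are hypothetical, and the quartiles are computed with the `"inclusive"` method of `statistics.quantiles`, which is one convention among several.

```python
import statistics


def suspected_outliers(values, factor=1.5):
    """Return values at least `factor` x IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [x for x in values if x <= lower_fence or x >= upper_fence]


# Hypothetical sample data.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(suspected_outliers(data))  # [110]
```

For this sample Q1 = 49.25 and Q3 = 64.75, so IQR = 15.5 and the fences are 26.0 and 88.0; only 110 lies outside them.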

The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
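A five-number summary might be assembled as below. This is only a sketch: the function name and data are hypothetical, and the quartile values depend on the quantile convention chosen (here the `"inclusive"` method of `statistics.quantiles`).

```python
import statistics


def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum) for the given values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical sample data.
observations = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(observations))  # (30, 49.25, 54.0, 64.75, 110)
```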

Variance and standard deviation are measures of data dispersion. A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.

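The variance of N observations x₁, x₂, …, x_N for a numeric attribute X is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2$$

where x̄ is the mean of the observations.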

The standard deviation is the square root of the variance.
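A short sketch of both measures follows, assuming the population form of the variance, which divides by N (the sample form divides by N − 1 instead). The function names and data are hypothetical.

```python
import math


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the population variance."""
    return math.sqrt(variance(values))


# Hypothetical sample data with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(scores))            # 4.0
print(standard_deviation(scores))  # 2.0
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities, while `statistics.variance` and `statistics.stdev` use the N − 1 divisor.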
