Data Sets & Their Types

Data Set - collection of data objects

Data objects can be described by different types of attributes: quantitative or qualitative.

Other names for a data object are record, point, vector, pattern, event, case, sample, observation, or entity.

Data objects are described by a number of attributes that capture the basic characteristics of an object.

Other names for an attribute are variable, characteristic, field, feature, or dimension.

An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

The properties of an attribute need not be the same as the properties of the values used to measure it.

Four types of attributes:

Categorical (Qualitative)

Nominal

just different names

Eg. Zip Codes, Employee ID 

Ordinal

ordering objects

Eg. Grades, Rating

Numeric (Quantitative)

Interval

difference between values

Eg. Calendar dates, Temperature in Celsius or Fahrenheit

Ratio

both differences & ratios are meaningful

Eg. Age, Length
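
The four attribute types differ in which operations on their values are meaningful, and each type keeps the operations of the types before it. A minimal Python sketch of that idea follows; the `MEANINGFUL_OPS` table and the `supports` helper are names invented here for illustration, not part of any library.

```python
# Operations that are meaningful for each attribute type.
# Each level keeps the operations of the levels before it.
MEANINGFUL_OPS = {
    "nominal": {"equality"},
    "ordinal": {"equality", "order"},
    "interval": {"equality", "order", "difference"},
    "ratio": {"equality", "order", "difference", "ratio"},
}


def supports(attribute_type: str, operation: str) -> bool:
    """Return True if the operation is meaningful for the attribute type."""
    return operation in MEANINGFUL_OPS[attribute_type]


# Zip codes are nominal: equality makes sense, ordering does not.
print(supports("nominal", "order"))    # False
# Celsius temperature is interval: 20 degrees is not "twice as hot" as 10.
print(supports("interval", "ratio"))   # False
# Age is ratio: a 40-year-old is twice as old as a 20-year-old.
print(supports("ratio", "ratio"))      # True
```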

Value of Attribute

Discrete

Discrete attributes are often represented using integer variables.

Binary attributes are a special case of discrete attributes and assume only two values (eg. yes/no, 0/1).

Continuous

Continuous attributes are typically represented as floating-point variables.

Type of Datasets

  • Record data
  • Graphical data
  • Ordered data

Characteristics of Datasets

Dimensionality

Number of attributes that the objects in the dataset possess.

Sparsity

Most attribute values of an object are zero; in many cases, fewer than 1% of the entries are non-zero.
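
Dimensionality and sparsity can both be read directly off a data matrix. The sketch below is one possible way to compute them in plain Python; the function names and the small example matrix are made up for illustration.

```python
def dimensionality(data_matrix):
    """Number of attributes (columns) each object (row) has."""
    return len(data_matrix[0]) if data_matrix else 0


def nonzero_fraction(data_matrix):
    """Fraction of entries that are non-zero; small values mean sparse data."""
    total = sum(len(row) for row in data_matrix)
    nonzero = sum(1 for row in data_matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


# Illustrative matrix: 3 objects, 4 attributes, 2 non-zero entries.
matrix = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

print(dimensionality(matrix))    # 4
print(nonzero_fraction(matrix))  # 2 of the 12 entries are non-zero
```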

Resolution

Data can often be obtained at different levels of resolution, and the properties of the data are often different at different resolutions.

The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.

Record Data

collection of records (data objects), each of which consists of a fixed set of data fields (attributes).

Types of Record data

  • Transaction or Market Basket data
  • Data Matrix
  • Sparse Data Matrix
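
The three kinds of record data are closely related: transaction data can be written as a binary data matrix, and a mostly-zero matrix can be stored in sparse form by keeping only its non-zero entries. A small sketch, assuming made-up transactions and a simple dictionary-based sparse layout chosen here for illustration:

```python
# Transaction (market basket) data: each record is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "eggs"},
    {"milk", "diapers"},
]

# Fix a column order: one column per distinct item.
items = sorted(set().union(*transactions))

# Data matrix: one row per transaction, 1 if the item is present, else 0.
data_matrix = [
    [1 if item in basket else 0 for item in items]
    for basket in transactions
]

# Sparse data matrix: store only the non-zero entries as {(row, column): value}.
sparse = {
    (r, c): value
    for r, row in enumerate(data_matrix)
    for c, value in enumerate(row)
    if value != 0
}

print(items)        # ['bread', 'diapers', 'eggs', 'milk']
print(data_matrix)  # [[1, 0, 0, 1], [1, 1, 1, 0], [0, 1, 0, 1]]
print(len(sparse))  # 7
```

Libraries such as SciPy provide dedicated sparse matrix types; the dictionary above only illustrates the idea.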

Graphical data

the graph captures relationships among data objects

if the objects contain sub-objects that have relationships, then such objects are frequently represented as graphs.
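
A common way to hold such relationships in code is an adjacency structure that maps each object to the objects it is related to. A minimal sketch, assuming undirected relationships and hypothetical page names:

```python
def build_graph(edges):
    """Build an undirected graph as {node: set of neighbouring nodes}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Hypothetical data objects (web pages) and the links between them.
links = [("page_a", "page_b"), ("page_a", "page_c"), ("page_b", "page_c")]
graph = build_graph(links)

print(sorted(graph["page_c"]))  # ['page_a', 'page_b']
```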

Ordered data

the attributes have relationships that involve order in time or space

Types of Ordered data

Sequential data (Temporal data)

record data in which each record has a time associated with it

Sequence data

data set that is a sequence of individual entities, such as a sequence of words or letters.

there are no time stamps; instead, there are positions in an ordered sequence.

Time Series data

a special type of sequential data in which each record is a time series, i.e. a series of measurements taken over time
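
The difference between sequence data and time series data shows up in how they are stored: a sequence only needs positions, while a time series carries a time stamp with every measurement. A short sketch with made-up values:

```python
from datetime import date

# Sequence data: no time stamps, only positions in an ordered sequence.
sequence = list("GATTACA")
print(sequence[2])  # 'T' is the entity at position 2

# Time series data: measurements with time stamps, kept in time order.
# The temperatures below are illustrative values, not real readings.
daily_temperature = [
    (date(2024, 1, 1), 4.5),
    (date(2024, 1, 2), 6.0),
    (date(2024, 1, 3), 5.0),
]

# Because the records are ordered in time, consecutive measurements
# can be compared directly.
changes = [
    later[1] - earlier[1]
    for earlier, later in zip(daily_temperature, daily_temperature[1:])
]
print(changes)  # [1.5, -1.0]
```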

Spatial data

Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
