Data Sets & Their Types

Data Set - collection of data objects

Data objects can be described by different types of attributes: quantitative or qualitative.

Other names for a data object are record, point, vector, pattern, event, case, sample, observation, or entity.

Data objects are described by a number of attributes that capture the basic characteristics of an object.

Other names for an attribute are variable, characteristic, field, feature, or dimension.

An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

The properties of an attribute need not be the same as the properties of the values used to measure it.
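A small illustration of this point, as a sketch with made-up values: employee IDs and ages can both be stored as integers, but only for ages is an average meaningful. The variable names and numbers below are hypothetical.

```python
from statistics import mean

# Both attributes are stored as integers (made-up values).
employee_ids = [1001, 1002, 1003]
ages = [25, 35, 45]

# The average age says something about the employees.
mean_age = mean(ages)

# The average ID can be computed, but it says nothing about the employees:
# the integers have an arithmetic property that the attribute itself lacks.
mean_id = mean(employee_ids)
```

The computation succeeds in both cases; it is the attribute, not the integer representation, that decides whether the result means anything.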

Four types of attributes:

Categorical (Qualitative)

Nominal

values are just different names; they only tell one object from another

Eg. Zip Codes, Employee ID 

Ordinal

values provide enough information to order objects

Eg. Grades, Rating

Numeric (Quantitative)

Interval

differences between values are meaningful

Eg. Calendar dates, Temperature in Celsius or Fahrenheit

Ratio

both differences and ratios are meaningful

Eg. Age, Length
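The four attribute types can be summarised by the comparisons that are meaningful for each one. The following is a minimal sketch of that idea; the names `AttributeType`, `MEANINGFUL_OPERATIONS`, and `is_meaningful` are hypothetical and used only for illustration.

```python
from enum import Enum


class AttributeType(Enum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Each type supports the operations of the types before it, plus one more.
MEANINGFUL_OPERATIONS = {
    AttributeType.NOMINAL: {"equality"},
    AttributeType.ORDINAL: {"equality", "order"},
    AttributeType.INTERVAL: {"equality", "order", "difference"},
    AttributeType.RATIO: {"equality", "order", "difference", "ratio"},
}


def is_meaningful(attr_type, operation):
    """Return True if the operation is meaningful for the attribute type."""
    return operation in MEANINGFUL_OPERATIONS[attr_type]
```

For example, `is_meaningful(AttributeType.INTERVAL, "ratio")` is `False`: a temperature of 20 °C is not "twice as hot" as 10 °C, although the difference of 10 degrees is meaningful.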

Value of Attribute

Discrete

A discrete attribute has a finite or countably infinite set of values. Discrete attributes are often represented using integer variables.

Binary attributes are a special case of discrete attributes and assume only two values (eg. yes/no, 0/1).

Continuous

A continuous attribute has real numbers as values. Continuous attributes are typically represented as floating-point variables.

Types of Datasets

  • Record data
  • Graphical data
  • Ordered data

Characteristics of Datasets

Dimensionality

Number of attributes that the objects in the dataset possess.

The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.

Sparsity

Most attribute values are zero; in many sparse datasets, fewer than 1% of the entries are non-zero.

Resolution

Data can often be obtained at different levels of resolution, and the properties of the data are often different at different resolutions.
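Dimensionality and sparsity can both be read directly off a data matrix. The sketch below assumes the dataset is a list of equal-length rows (one row per object, one column per attribute); the function names are hypothetical.

```python
def dimensionality(data_matrix):
    """Number of attributes (columns) per object; 0 for an empty dataset."""
    return len(data_matrix[0]) if data_matrix else 0


def nonzero_fraction(data_matrix):
    """Fraction of entries that are non-zero; a small value means sparse data."""
    total = sum(len(row) for row in data_matrix)
    nonzero = sum(1 for row in data_matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


# Made-up data: 3 objects, 4 attributes, 2 non-zero entries.
example = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
```

Here `dimensionality(example)` is 4 and `nonzero_fraction(example)` is 2/12, so this toy matrix is far denser than the sub-1% case mentioned above.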

Record Data

collection of records (data objects), each of which consists of a fixed set of data fields (attributes).

Types of Record data

  • Transaction or Market Basket data
  • Data Matrix
  • Sparse Data Matrix
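As a rough sketch of how these three kinds of record data relate, the Python below turns a few made-up market-basket transactions into a binary data matrix, and then into a sparse form that stores only the non-zero entries. The item names and the functions `to_binary_matrix` and `to_sparse` are hypothetical, chosen only for illustration.

```python
# Transaction data: each record is the set of items in one basket (made-up).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "eggs"},
    {"milk", "diapers"},
]


def to_binary_matrix(transactions):
    """One row per transaction, one column per item, 1 if the item is present."""
    items = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix


def to_sparse(matrix):
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }
```

With many distinct items and small baskets, most entries of the binary matrix are 0, which is why the sparse form is usually preferred for transaction data.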

Graphical data

The graph captures relationships among data objects.

If the objects contain sub-objects that have relationships, such objects are frequently represented as graphs.
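One common way to hold such relationships in code is an adjacency list. The sketch below assumes undirected relationships given as pairs of object names; `build_adjacency` and the example labels are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each object to the set of objects it is directly related to."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)  # undirected: record the relationship both ways
    return dict(adjacency)


# Made-up relationships between three data objects.
links = [("A", "B"), ("B", "C")]
graph = build_adjacency(links)
```

For directed relationships (for example, one web page linking to another), the second `add` call would be dropped.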

Ordered data

the attributes have relationships that involve order in time or space

Types of Ordered data

Sequential data (Temporal data)

record data in which each record has a time associated with it

Sequence data

data set that is a sequence of individual entities, such as a sequence of words or letters.

there are no time stamps; instead, there are positions in an ordered sequence.

Time Series data

a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time
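The difference between sequence data and time series data can be sketched in a few lines. All values below are made up, and `is_time_ordered` is a hypothetical helper.

```python
from datetime import date

# Sequence data: order is given by position only; there are no time stamps.
sequence = list("GATTACA")
positions = list(enumerate(sequence))  # (position, symbol) pairs

# Time series data: a series of measurements taken over time.
daily_temperature = [
    (date(2024, 1, 1), 20.5),
    (date(2024, 1, 2), 21.0),
    (date(2024, 1, 3), 19.8),
]


def is_time_ordered(series):
    """Return True if time stamps never decrease from one record to the next."""
    return all(a[0] <= b[0] for a, b in zip(series, series[1:]))
```

In the sequence, only the position of each symbol matters; in the time series, each measurement carries an explicit time stamp.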

Spatial data

Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
