The Data Ilm

Posts

Showing posts from March, 2021

Data Preprocessing techniques

Data preprocessing is a data mining technique that transforms raw data into a readable, understandable format. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain many errors. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and feature selection. Preprocessed data can be more valuable and more informative, and it can also reduce the computational load. Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. The other data preprocessing techniques are:

Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature extraction
Discretization
Attribute transformation
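As a rough illustration of cleaning and normalization, here is a minimal sketch in R. The data frame `raw`, its columns and the helper `min_max` are made up for this example, and min-max scaling is only one of several possible normalization choices.

```r
# Hypothetical raw data: one duplicated row and one missing value.
raw <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  height = c(150, 160, 160, NA, 180)
)

# Cleaning: drop exact duplicate rows, then rows with missing values.
clean <- unique(raw)
clean <- clean[complete.cases(clean), ]

# Normalization: rescale height to the range [0, 1] (min-max scaling).
# min_max is a helper defined here, not a built-in R function.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
clean$height_scaled <- min_max(clean$height)

print(clean)
```

Dropping incomplete rows is the simplest way to handle missing values; depending on the data, filling them in (imputation) may be a better fit.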

Data Quality & its problems

Data quality is a measure of the condition of data based on the following factors:

Accuracy
Uniqueness
Consistency
Reliability
Availability
Usability
Sufficiency

Data is generally considered high quality if it is fit for its intended uses in decision making and strategy planning. Several types of quality problems can arise in data collection. The most common problems are:

Noise and Outliers - Noise is a distortion of the actual data. Outliers are values that differ considerably from most of the other data in the data set.
Missing Values - Missing values are incomplete data in the data set.
Duplicates - Duplicate data are repeated records in the data set.
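The three problems above can be checked for with a few lines of R. This is only a sketch: the `records` data frame is invented for illustration, and the 1.5 * IQR rule used for outliers is a common convention rather than something prescribed by the post.

```r
# Hypothetical sensor records; values are made up for illustration.
records <- data.frame(
  id      = c(1, 2, 3, 3, 4, 5, 6, 7),
  reading = c(10.1, 9.8, 10.3, 10.3, NA, 10.0, 55.0, 9.9)
)

# Missing values: count the NA entries in the reading column.
n_missing <- sum(is.na(records$reading))

# Duplicates: count rows that repeat an earlier row exactly.
n_duplicated <- sum(duplicated(records))

# Outliers: among unique, non-missing readings, flag values that lie
# more than 1.5 * IQR below the first or above the third quartile.
x <- records$reading[!duplicated(records) & !is.na(records$reading)]
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- x[x < q[1] - fence | x > q[2] + fence]

print(c(missing = n_missing, duplicated = n_duplicated))
print(outliers)
```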

Data and its types

Data is stored in different forms in various applications, depending on the need. The most popular form is tabular, such as the table given below. The table has a few rows and columns. Each row represents one record. Each column represents a different attribute of each record. A record is also referred to as a Sample, Instance, Case etc. An attribute of a record is also referred to as a Variable, Field, Feature etc. An attribute can be captured as one of the following types:

Nominal - Provides distinguishable information such as Student ID, Gender etc.
Ordinal - Provides comparable information such as Score, Grades etc.
Interval - Provides interval information such as Dates etc.
Ratio - Provides ratio information such as Age, Height etc.

The properties of attribute values are:

Distinctness [ Equal to or Not Equal to ]
Ordering [ Greater than or Less than ]
Addition/Subtraction
Multiplication/Division

The attributes of the re...
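One way to picture the four attribute types is to map them onto R data types. This is a sketch under assumptions: the variable names and values are invented, and representing the types as factor, ordered factor, Date and numeric is one reasonable mapping, not the only one.

```r
# Nominal: only distinctness (equal / not equal) is meaningful.
gender <- factor(c("F", "M", "F"))
same_gender <- gender[1] == gender[3]

# Ordinal: ordering is meaningful as well.
grade <- factor(c("B", "A", "C"), levels = c("C", "B", "A"), ordered = TRUE)
a_above_b <- grade[2] > grade[1]

# Interval: differences (addition/subtraction) are meaningful.
dates <- as.Date(c("2021-03-01", "2021-03-31"))
gap_days <- as.numeric(dates[2] - dates[1])

# Ratio: a true zero makes multiplication/division meaningful.
age <- c(10, 40)
age_ratio <- age[2] / age[1]

print(list(same_gender = same_gender, a_above_b = a_above_b,
           gap_days = gap_days, age_ratio = age_ratio))
```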

What is Data Mining?

Data mining is the extraction of potentially useful information or patterns from large volumes of available data. It is also known as Knowledge Discovery in Databases (KDD). Searching on Google, searching for products on e-commerce portals, or searching for flight tickets on booking websites is not data mining. These are all simple queries against data that already exists in a database. Data mining is the process of extracting knowledge from historical data. Different types of data:

Sensor data
Time series data
Graphical data
Heterogeneous data
Spatial data
Multimedia data
Text data
Web data

Value assignment to name in R

As in other programming languages, a value can be assigned to a named variable in R. The assignment can be done as below.

    > y <- 12
    > y
    [1] 12

'=' can also be used in place of '<-'. If '<-' is used, there must not be any space between the two characters. A name can be built from letters, digits, the period and the underscore. However, a name must not start with a digit, an underscore, or a period followed by a digit.

    > 1. = 5
    Error in 1 = 5 : invalid (do_set) left-hand side to assignment
    > .1 = 1
    Error in 0.1 = 1 : invalid (do_set) left-hand side to assignment

Some single-letter names are already defined in R, so using them to hold a value would cause confusion. A few of these predefined names are c, q, t, F, T etc.

c - Combine values into a vector or list
q - Terminate the current R session
t - Return the transpose of the given matrix
F & T - Shorthand for the logical values FALSE and TRUE

Except F & T, other charac...
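The naming rules and the risk of reusing predefined names can be tried out with a short R sketch. The candidate names below are invented for illustration; `make.names()` is a base R function that rewrites a string into a syntactically valid name, so a string it leaves unchanged is already valid.

```r
# Candidate names, made up for this example.
candidates <- c("y", "my.value", ".x", "1.", ".1", "2fast")

# A candidate is a valid name if make.names() does not need to change it.
is_valid <- make.names(candidates) == candidates
print(data.frame(candidates, is_valid))

# Reusing a function name for a value: R still finds the function t()
# when it is called, but the code becomes harder to read.
t <- 5
m <- matrix(1:4, nrow = 2)
tm <- t(m)

# T and F are ordinary variables, so they can be overwritten silently.
T <- FALSE
t_is_true <- isTRUE(T)

# Remove the two variables so the built-in T and t are visible again.
rm(t, T)
```

Writing TRUE and FALSE in full avoids the last problem, because those two words are reserved and cannot be reassigned.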

How to download & install RStudio?

RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio is available in two editions:

RStudio Desktop: a regular desktop application.
RStudio Server: runs on a remote server and is accessed through a web browser.

The latest version of RStudio can be downloaded from https://rstudio.com/products/rstudio/download/ Once the appropriate 'Download' button is clicked, the browser is redirected to the download page. Once RStudio is downloaded and installed, it can be opened through its shortcut icon. RStudio has four sections:

Script Section: To write R code.
Console Section: To execute the code.
Environment Section: To display data and the values of variables.
Plot Section: To display the different types of plots.

What is Data in Data Science?

Data is a collection of information. The singular form of data is datum. Data are sets of values of qualitative and quantitative variables. Qualitative data is descriptive, and quantitative data is numerical in nature. Examples:

Qualitative: Gender (Male, Female)
Quantitative: Age (10, 45, ...)
Data: Census data (2011 Census data)

Data Science is the domain of study that deals with vast volumes of data, using modern tools and techniques to derive meaningful information, identify patterns, and provide visibility that aids decision making in business strategy and goals.
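The difference between qualitative and quantitative variables shows up in how they are summarized. Here is a minimal sketch in R, assuming a small made-up `people` data set: the qualitative column is counted by category, while the quantitative column is averaged.

```r
# Hypothetical data set with one qualitative and one quantitative variable.
people <- data.frame(
  gender = factor(c("Male", "Female", "Female", "Male", "Female")),
  age    = c(10, 45, 32, 27, 61)
)

# Qualitative data is summarized by counting categories.
gender_counts <- table(people$gender)

# Quantitative data supports numerical summaries such as the mean.
mean_age <- mean(people$age)

print(gender_counts)
print(mean_age)
```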