
Association rules analysis - Apriori algorithm

A group of items considered together is called an itemset; an itemset may contain one or more items. An itemset that occurs frequently in the transaction data is called a frequent itemset.

E.g., {bread, butter}, {milk, diaper}

An itemset is called frequent if it satisfies minimum threshold values for support and confidence. Support measures how often the items appear together in the transactions; confidence measures how often a transaction containing one item (or itemset) also contains the other.

Only those itemsets that meet the minimum support and confidence thresholds are considered by the frequent itemset mining method.

Frequent itemset mining algorithms aim to uncover these hidden patterns among itemsets efficiently, in a short time and with low memory consumption.

Association rules analysis is a technique to uncover how items are associated with each other. There are three common measures of association: support, confidence, and lift.

Market Basket Analysis is a popular application of Association Rules.

For example: Bread => Butter [Support = 5%, Confidence = 75%]

The above statement means that bread and butter were bought together in 5% of all transactions, and that 75% of the customers who bought bread also bought butter.

Support and confidence for itemsets A and B are given by:

Support (A) = (number of transactions in which A appears) / (total number of transactions)

Confidence (A -> B) = Support (A U B) / Support (A)
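The formulas above can be sketched directly in Python. The transaction database below is assumed toy data (not from the original post), chosen so that the bread => butter rule comes out with the 75% confidence mentioned earlier:

```python
# Toy transaction database (assumed example data)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "diaper"},
    {"milk", "diaper"},
    {"bread", "butter", "diaper"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Support(A U B) / Support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread"}))                 # 0.8  (4 of 5 transactions)
print(support({"bread", "butter"}))       # 0.6  (3 of 5 transactions)
print(confidence({"bread"}, {"butter"}))  # 0.75 (0.6 / 0.8)
```

Lift, mentioned above but not formalized in this post, is usually computed as Confidence(A -> B) / Support(B); a value above 1 suggests A and B occur together more often than chance.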

The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It uses two steps, "join" and "prune", to reduce the search space, and proceeds iteratively to discover the most frequent itemsets.

If an itemset has support below the minimum threshold, then all of its supersets will also fall below minimum support and can be ignored. This property is called the antimonotone property.

The 'join' step generates candidate (k+1)-itemsets by joining the frequent k-itemsets with themselves. The 'prune' step scans the database to count the support of each candidate; any candidate that does not meet the minimum support is regarded as infrequent and removed.
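The join and prune steps can be sketched as follows. This is a minimal illustration, not a full Apriori implementation; the prune shown here applies the antimonotone property (dropping candidates with an infrequent subset) before any database scan, and the frequent 2-itemsets at the bottom are assumed example data:

```python
from itertools import combinations

def join(frequent_k):
    """Join step: union pairs of frequent k-itemsets that share k-1 items,
    producing candidate (k+1)-itemsets."""
    candidates = set()
    items = list(frequent_k)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            union = items[i] | items[j]
            if len(union) == len(items[i]) + 1:  # they overlap in k-1 items
                candidates.add(union)
    return candidates

def prune(candidates, frequent_k):
    """Prune step (antimonotone property): drop any candidate that has a
    k-subset which is not frequent."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_k for s in combinations(c, len(c) - 1))
    }

# Assumed frequent 2-itemsets for illustration
f2 = {frozenset(p) for p in [("bread", "butter"),
                             ("bread", "milk"),
                             ("butter", "milk")]}
c3 = prune(join(f2), f2)
print(c3)  # the single surviving candidate: {bread, butter, milk}
```

Candidates that survive this prune would then be counted against the database and kept only if they meet minimum support.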

To discover a frequent pattern of size 100, one would need to generate on the order of 2^100 candidates.

i.e., the number of candidates generated for a pattern of size k is on the order of 2^k.
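Putting the pieces together, the iterative join-and-prune loop can be sketched end to end. This is a minimal, unoptimized sketch on assumed toy data; here the prune is done by support counting directly, and `min_support` is a fraction of the total transactions:

```python
def apriori(transactions, min_support):
    """Minimal Apriori sketch: iteratively grow frequent itemsets,
    keeping only candidates whose support meets `min_support`."""
    n = len(transactions)
    # Frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    all_frequent = set(current)
    while current:
        # Join: candidate (k+1)-itemsets from frequent k-itemsets
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # Prune: keep only candidates that meet minimum support
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= current
    return all_frequent

# Assumed toy transactions
txns = [{"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "diaper"},
        {"milk", "diaper"},
        {"bread", "butter", "diaper"}]
result = apriori(txns, min_support=0.4)  # itemsets in at least 2 of 5 transactions
```

With these transactions and min_support = 0.4, all four single items survive, and {bread, butter} and {bread, diaper} are the only frequent pairs; {bread, butter, diaper} appears in just one transaction and is pruned.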
