Apriori Algorithm

The Apriori principle says that if an itemset is frequent, then all of its subsets are also frequent. This means that if {0,1} is frequent, then {0} and {1} must be frequent.

Turned around, the rule says that if an itemset is infrequent, then its supersets are also infrequent. For example, if {2,3} turns out to be infrequent, we never need to count {0,2,3}, {1,2,3}, or any other superset, since none of them can be frequent either.

We first need to find the frequent itemsets, and then we can find association rules.

Pros: Easy to code up

Cons: May be slow on large datasets

Works with: Numeric values, nominal values

Association analysis

Looking for hidden relationships in large datasets is known as association analysis or association rule learning. The problem is that finding different combinations of items can be a time-consuming task and prohibitively expensive in terms of computing power.

These interesting relationships can take two forms: frequent item sets or association rules. Frequent item sets are a collection of items that frequently occur together. The second way to view interesting relationships is association rules. Association rules suggest that a strong relationship exists between two items.

With frequent item sets and association rules, retailers gain a much better understanding of their customers. Retail transactions are the classic example, but the same analysis applies to other data, such as the search terms users enter into a search engine.

Support and confidence are the measures we use to quantify the success of our association analysis.

The support of an itemset is defined as the percentage of the dataset that contains this itemset.

The confidence for a rule P ➞ H is defined as support(P | H)/support(P). Remember, in Python, the | symbol is the set union operator; the mathematical symbol is ∪. P | H means all the items in set P together with all the items in set H.
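
For example, here is a minimal sketch (not part of the original code) that computes support and confidence by brute force; the helper names support() and confidence() are made up for illustration:

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    # fraction of transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(P, H):
    # confidence(P -> H) = support(P | H) / support(P)
    return support(P | H) / support(P)

print(support({2, 5}))       # 0.75: {2,5} appears in 3 of the 4 transactions
print(confidence({2}, {5}))  # 1.0: every transaction containing 2 also contains 5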

General approach to the Apriori algorithm

  1. Collect: Any method.
  2. Prepare: Any data type will work as we’re storing sets.
  3. Analyze: Any method.
  4. Train: Use the Apriori algorithm to find frequent itemsets.
  5. Test: Doesn’t apply.
  6. Use: This will be used to find frequent itemsets and association rules between items.

Finding frequent itemsets

The way to find frequent itemsets is the Apriori algorithm, which takes a dataset and a minimum support level as inputs. The algorithm first generates a list of all candidate itemsets with one item. The transaction dataset is then scanned to see which sets meet the minimum support level; sets that don't are tossed out. The remaining sets are combined to make candidate itemsets with two elements. Again, the transaction dataset is scanned and itemsets not meeting the minimum support level are tossed. This procedure is repeated until all sets are tossed out.

Scanning the dataset

For each transaction tran in the dataset:
    For each candidate itemset can:
        Check to see if can is a subset of tran
        If so, increment the count of can
For each candidate itemset:
    If its support meets the minimum, keep this itemset
Return the list of frequent itemsets

In [1]:
from numpy import *

Create a simple dataset for testing

In [2]:
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

It creates C1. C1 is the list of candidate itemsets of size one. In the Apriori algorithm, we create C1 and then scan the dataset to see which of these one-item sets meet our minimum support requirement. The itemsets that do become L1. L1 then gets combined to become C2, and C2 is filtered to become L2.

Frozensets are sets that are frozen, which means they’re immutable; you can’t change them. You need to use the type frozenset instead of set because you’ll later use these sets as the key in a dictionary.

You can't pass a single integer to frozenset(); it needs an iterable (try it out). That's why you create a list of single-item lists. Finally, you sort the list, map every item in it to frozenset(), and return this list of frozensets.
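
A quick illustration of both points, assuming an interactive Python 3 session:

d = {}
d[frozenset([1])] = 1   # works: frozensets are hashable and can be dict keys
# d[{1}] = 1            # TypeError: unhashable type: 'set'
frozenset([1])          # fine: frozenset() takes an iterable
# frozenset(1)          # TypeError: 'int' object is not iterable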

In [11]:
def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # use frozenset so the itemsets can be used as dictionary keys
    return list(map(frozenset, C1))

This function takes three arguments: D, a dataset; Ck, a list of candidate sets; and minSupport, which is the minimum support you're interested in. This is the function you'll use to generate L1 from C1. It also returns a dictionary with the support values.

In [28]:
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1  # count transactions containing can
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems  # support = fraction of transactions
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support  # record support for every candidate, even pruned ones
    return retList, supportData
In [29]:
dataSet = loadDataSet()
dataSet
Out[29]:
[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
In [30]:
C1 = createC1(dataSet)
In [31]:
C1
Out[31]:
[frozenset({1}),
 frozenset({2}),
 frozenset({3}),
 frozenset({4}),
 frozenset({5})]

C1 contains a list of all the candidate one-item sets, each stored as a frozenset.

In [32]:
#D is the dataset in set form.

D = list(map(set,dataSet))
In [33]:
D
Out[33]:
[{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

Now that you have everything in set form, you can remove the items that don't meet the minimum support.

In [34]:
L1,suppDat0 = scanD(D,C1,0.5)
L1
Out[34]:
[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})]

These four items make up our L1 list, that is, the list of one-item sets that occur in at least 50% of all transactions. Item 4 didn't meet the minimum support level, so it isn't part of L1. That's OK. By removing it, you've reduced the amount of work needed when you find the list of two-item sets.
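
Note that scanD() records the support of every candidate, including the pruned ones, so you can still look up why item 4 was dropped (continuing in the same session):

suppDat0[frozenset([4])]   # 0.25, below the 0.5 threshold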

Pseudo-code for the whole Apriori algorithm

While the number of items in the set is greater than 0:
    Create a list of candidate itemsets of length k
    Scan the dataset to see if each itemset is frequent
    Keep frequent itemsets to create itemsets of length k+1

The main function is apriori(); it calls aprioriGen() to create candidate itemsets: Ck.

The function aprioriGen() will take a list of frequent itemsets, Lk, and the size of the itemsets, k, to produce Ck. For example, it will take the itemsets {0}, {1}, and {2} and produce {0,1}, {0,2}, and {1,2}.

The sets are combined using the set union, which is the | operator in Python. Two sets are merged only when their first k-2 elements (after sorting) are equal; this guarantees that each k-item union is generated exactly once instead of several times.
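
As a small illustration (assuming a Python 3 session), here is how the k-2 rule plays out for k = 3: the sorted one-element prefixes of {0,1} and {0,2} are both [0], so the pair merges, while {0,1} and {1,2} have prefixes [0] and [1] and are skipped, so {0,1,2} is generated exactly once:

frozenset([0, 1]) | frozenset([0, 2])   # frozenset({0, 1, 2})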

In [35]:
def aprioriGen(Lk, k): #creates Ck
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
            L1.sort(); L2.sort()
            if L1==L2: #if first k-2 elements are equal
                retList.append(Lk[i] | Lk[j]) #set union
    return retList
In [38]:
def apriori(dataSet, minSupport = 0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
In [39]:
L,suppData = apriori(dataSet)
In [40]:
L
Out[40]:
[[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})],
 [frozenset({3, 5}), frozenset({1, 3}), frozenset({2, 5}), frozenset({2, 3})],
 [frozenset({2, 3, 5})],
 []]

L contains some lists of frequent itemsets that met a minimum support of 0.5. The variable suppData is a dictionary with the support values of our itemsets.

In [46]:
L[0]
Out[46]:
[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})]
In [47]:
L[1]
Out[47]:
[frozenset({3, 5}), frozenset({1, 3}), frozenset({2, 5}), frozenset({2, 3})]
In [48]:
L[2]
Out[48]:
[frozenset({2, 3, 5})]
In [49]:
L[3]
Out[49]:
[]
In [50]:
aprioriGen(L[0],2)
Out[50]:
[frozenset({1, 3}),
 frozenset({1, 2}),
 frozenset({1, 5}),
 frozenset({2, 3}),
 frozenset({3, 5}),
 frozenset({2, 5})]

Mining association rules from frequent item sets

To find association rules, we first start with a frequent itemset. We know these items frequently occur together, but we want to see whether there is anything more we can get out of them: one item, or one set of items, may imply another item.

generateRules() is the main function; it calls the other two, rulesFromConseq() and calcConf().

The generateRules() function takes three inputs: a list of frequent itemsets, a dictionary of support data for those itemsets, and a minimum confidence threshold. It’s going to generate a list of rules with confidence values that we can sort through later.

In [51]:
def generateRules(L, supportData, minConf=0.7):  #supportData is a dict coming from scanD
    bigRuleList = []
    for i in range(1, len(L)):#only get the sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

calcConf() calculates the confidence of each rule and then finds which rules meet the minimum confidence.

In [53]:
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = [] #create new list to return
    for conseq in H:
        conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
        if conf >= minConf:
            print (freqSet-conseq,'-->',conseq,'conf:',conf)
            brl.append((freqSet-conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

rulesFromConseq() generates more association rules from our initial itemsets. It takes a frequent itemset, freqSet, and H, a list of items that could be on the right-hand side of a rule.
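
As a hypothetical trace using the toy data: calling rulesFromConseq() on freqSet = {2,3,5} with H = [{2}, {3}, {5}] gives m = 1; since len(freqSet) > m + 1, aprioriGen(H, 2) builds the two-item consequents [{2,3}, {2,5}, {3,5}], and calcConf() then tests rules such as {5} ➞ {2,3} and {2} ➞ {3,5}.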

In [54]:
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if (len(freqSet) > (m + 1)): #try further merging
        Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
In [55]:
L,suppData= apriori(dataSet,minSupport=0.5)
In [56]:
rules= generateRules(L,suppData, minConf=0.7)
frozenset({1}) --> frozenset({3}) conf: 1.0
frozenset({5}) --> frozenset({2}) conf: 1.0
frozenset({2}) --> frozenset({5}) conf: 1.0

This gives you three rules: {1} ➞ {3}, {5} ➞ {2}, and {2} ➞ {5}. It's interesting to see that the rule with 2 and 5 can be flipped around but not the rule with 1 and 3.
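
You can check the asymmetry against the support values by hand: support({1,3}) = 0.5 and support({1}) = 0.5, so the confidence of {1} ➞ {3} is 0.5/0.5 = 1.0, but support({3}) = 0.75, so the confidence of {3} ➞ {1} is 0.5/0.75 ≈ 0.67, which falls below the 0.7 threshold.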

Finding similar features in poisonous mushrooms

In [70]:
mushDatSet = [line.split() for line in open('mushroom.dat').readlines()]
In [71]:
L,suppData= apriori(mushDatSet, minSupport=0.3)

Search the frequent itemsets for the poisonous feature 2

In [73]:
for item in L[1]:
    if item.intersection('2'):
        print (item)
frozenset({'93', '2'})
frozenset({'36', '2'})
frozenset({'53', '2'})
frozenset({'23', '2'})
frozenset({'59', '2'})
frozenset({'67', '2'})
frozenset({'86', '2'})
frozenset({'39', '2'})
frozenset({'85', '2'})
frozenset({'76', '2'})
frozenset({'63', '2'})
frozenset({'34', '2'})
frozenset({'28', '2'})
frozenset({'90', '2'})

You can also repeat this for the larger itemsets:

In [79]:
for item in L[6]:
    if item.intersection('2'):
        print (item)
frozenset({'86', '39', '34', '23', '36', '2', '59'})
frozenset({'86', '53', '85', '28', '90', '2', '39'})
frozenset({'93', '86', '34', '23', '90', '59', '2'})
frozenset({'86', '85', '34', '90', '2', '39', '63'})
frozenset({'93', '85', '34', '23', '90', '59', '2'})
frozenset({'93', '86', '39', '23', '59', '2', '36'})
frozenset({'86', '85', '34', '23', '36', '2', '39'})
frozenset({'93', '86', '85', '34', '23', '59', '2'})
frozenset({'93', '86', '34', '23', '59', '2', '39'})
....
....
....
frozenset({'93', '86', '85', '34', '23', '90', '2'})
frozenset({'86', '34', '23', '36', '2', '59', '63'})
In [83]:
rules= generateRules(L,suppData, minConf=0.7)
frozenset({'76'}) --> frozenset({'36'}) conf: 0.7135036496350365
frozenset({'56'}) --> frozenset({'86'}) conf: 1.0
frozenset({'2'}) --> frozenset({'93'}) conf: 0.7490494296577946
.....
.....
.....
frozenset({'23', '85'}) --> frozenset({'86', '39', '34', '59', '2', '36', '63'}) conf: 0.7298578199052134
frozenset({'86', '23'}) --> frozenset({'85', '34', '59', '2', '39', '36', '63'}) conf: 0.7298578199052134
frozenset({'23'}) --> frozenset({'86', '85', '34', '59', '2', '39', '36', '63'}) conf: 0.7298578199052134