# FP-growth algorithm

Have you ever gone to a search engine, typed in a word or part of a word, and the search engine automatically completed the search term for you? Perhaps it recommended something you didn’t even know existed, and you searched for that instead. This requires a way to find frequent itemsets efficiently. FP-growth algorithm find frequent itemsets or pairs, sets of things that commonly occur together, by storing the dataset in a special structure called an FP-tree.

The FP-Growth Algorithm, proposed by Han in , is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns named frequent-pattern tree (FP-tree). In his study, Han proved that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm and the TreeProjection .

The FP-growth algorithm scans the dataset only twice. The basic approach to finding frequent itemsets using the FP-growth algorithm is as follows:

1 Build the FP-tree.

2 Mine frequent itemsets from the FP-tree.

The FP stands for “frequent pattern.” An FP-tree looks like other trees in computer science, but it has links connecting similar items. The linked items can be thought of as a linked list.

The FPtree is used to store the frequency of occurrence for sets of items. Sets are stored as paths

in the tree. Sets with similar items will share part of the tree. Only when they differ will the tree split. A node identifies a single item from the set and the number of times it occurred in this sequence. A path will tell you how many times a sequence occurred. The links between similar items, known as node links, will be used to rapidly find the location of similar items.

# FP-growth algorithm

Pros: Usually faster than Apriori.

Cons: Difficult to implement; certain datasets degrade the performance.

Works with: Nominal values.

## General approach to FP-growth algorithm

1. Collect: Any method.
2. Prepare: Discrete data is needed because we’re storing sets. If you have continuous data, it will need to be quantized into discrete values.
3. Analyze: Any method.
4. Train: Build an FP-tree and mine the tree.
5. Test: Doesn’t apply.
6. Use: This can be used to identify commonly occurring items that can be used to make decisions, suggest items, make forecasts, and so on.
In [3]:
```#variables:
#name of the node, a count
#parent vaiable used to refer to the parent of the node in the tree
#node contains an empty dictionary for the children in the node
class treeNode:
def __init__(self, nameValue, numOccur, parentNode):
self.name = nameValue
self.count = numOccur
self.parent = parentNode      #needs to be updated
self.children = {}
#increments the count variable with a given amount
def inc(self, numOccur):
self.count += numOccur
#display tree in text. Useful for debugging
def disp(self, ind=1):
print ('  '*ind, self.name, ' ', self.count)
for child in self.children.values():
child.disp(ind+1)
```
In [4]:
```rootNode = treeNode('pyramid',9,None)
```
In [5]:
```rootNode.children['eye'] = treeNode('eye',13,None)
```
In [6]:
```rootNode.disp()
```
```   pyramid   9
eye   13
```

## Constructing the FP-tree

In addition to the FP-tree, you need a header table to point to the first instance of a given type. The header table will allow you to quickly access all of the elements of a given type in the FP-tree.

In addition to storing pointers, you can use the header table to keep track of the total count of every type of element in the FP-tree.

createTree(), takes the dataset and the minimum support as arguments and builds the FP-tree. This makes two passes through the dataset. The first pass goes through everything in the dataset and counts the frequency of each term. These are stored in the header table.

In [19]:
```def createTree(dataSet, minSup=1): #create FP-tree from dataset but don't mine
#go over dataSet twice
for trans in dataSet:#first pass counts frequency of occurance
for item in trans:
for k in list(headerTable):  #remove items not meeting minSup
#print 'freqItemSet: ',freqItemSet
if len(freqItemSet) == 0: return None, None  #if no items meet min support -->get out
retTree = treeNode('Null Set', 1, None) #create tree
for tranSet, count in dataSet.items():  #go through dataset 2nd time
localD = {}
for item in tranSet:  #put transaction items in order
if item in freqItemSet:
if len(localD) > 0:
orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
updateTree(orderedItems, retTree, headerTable, count)#populate tree with ordered freq itemset
```

updateTree() grow the Fp-tree with an itemset.

In [20]:
```def updateTree(items, inTree, headerTable, count):
if items[0] in inTree.children:#check if orderedItems[0] in retTree.children
inTree.children[items[0]].inc(count) #incrament count
inTree.children[items[0]] = treeNode(items[0], count, inTree)
else:
if len(items) > 1:#call updateTree() with remaining ordered items
```

updateHeader() makes sure the node links points to every intance of the this item on the tree

In [21]:
```def updateHeader(nodeToTest, targetNode):   #this version does not use recursion
while (nodeToTest.nodeLink != None):    #Do not use recursion to traverse a linked list!
```
In [22]:
```def loadSimpDat():
simpDat = [['r', 'z', 'h', 'j', 'p'],
['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
['z'],
['r', 'x', 'n', 'o', 's'],
['y', 'r', 'x', 'z', 'q', 't', 'p'],
['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
return simpDat
```

The createTree() function doesn’t take the input data as lists. It expects a dictionary with the itemsets as the dictionary keys and the frequency as the value. A createInitSet() function does this conversion for you

In [23]:
```def createInitSet(dataSet):
retDict = {}
for trans in dataSet:
retDict[frozenset(trans)] = 1
return retDict
```
In [24]:
```simpDat = loadSimpDat()
```
In [25]:
```simpDat
```
Out[25]:
```[['r', 'z', 'h', 'j', 'p'],
['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
['z'],
['r', 'x', 'n', 'o', 's'],
['y', 'r', 'x', 'z', 'q', 't', 'p'],
['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]```
In [26]:
```initSet = createInitSet(simpDat)
```
In [27]:
```initSet
```
Out[27]:
```{frozenset({'z'}): 1,
frozenset({'s', 't', 'u', 'v', 'w', 'x', 'y', 'z'}): 1,
frozenset({'e', 'm', 'q', 's', 't', 'x', 'y', 'z'}): 1,
frozenset({'p', 'q', 'r', 't', 'x', 'y', 'z'}): 1,
frozenset({'h', 'j', 'p', 'r', 'z'}): 1,
frozenset({'n', 'o', 'r', 's', 'x'}): 1}```
In [28]:
```#The FP-tree
```
In [29]:
```myFPtree.disp()
```
```   Null Set   1
x   1
r   1
s   1
z   5
x   3
t   2
y   2
r   1
s   1
s   1
t   1
y   1
r   1
```

The item and its frequency count are displayed with indentation representing the depth of the tree.

### Mining frequent items from an FP-tree

There are three basic steps to extract the frequent itemsets from the FP-tree:

1 Get conditional pattern bases from the FP-tree.

2 From the conditional pattern base, construct a conditional FP-tree.

3 Recursively repeat steps 1 and 2 on until the tree contains a single item.

The conditional pattern base is a collection of paths that end with the item you’re looking for. Each of those paths is a prefix path. In short, a prefix path is anything on the tree between the item you’re looking for and the tree root.

ascendTree(), which ascends the tree, collecting the names of items it encounters

In [30]:
```def ascendTree(leafNode, prefixPath): #ascends from leaf node to root
if leafNode.parent != None:
prefixPath.append(leafNode.name)
ascendTree(leafNode.parent, prefixPath)
```

The findPrefixPath() function iterates through the linked list until it hits the end. For each item it encounters, it calls ascendTree().

This list is returned and added to the conditional pattern base dictionary called condPats.

In [31]:
```def findPrefixPath(basePat, treeNode): #treeNode comes from header table
condPats = {}
while treeNode != None:
prefixPath = []
ascendTree(treeNode, prefixPath)
if len(prefixPath) > 1:
condPats[frozenset(prefixPath[1:])] = treeNode.count
return condPats
```
In [33]:
```findPrefixPath('x', myHeaderTab['x'][1])
```
Out[33]:
`{frozenset({'z'}): 3}`
In [34]:
```findPrefixPath('r', myHeaderTab['r'][1])
```
Out[34]:
```{frozenset({'x'}): 1, frozenset({'z'}): 1, frozenset({'t', 'x', 'y', 'z'}): 1}

The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset. The FP-growth algorithm works with the Apriori principle but is much faster. The Apriori algorithm generates candidate itemsets and then scans the dataset to see if they’re frequent. FP-growth is faster because it goes over the dataset only twice. The dataset is stored in a structure called an FP-tree. After the FP-tree is built, you can find frequent itemsets by finding conditional bases for an item and building a conditional FP-tree. This process is repeated, conditioning on more items until the conditional FPtree has only one item.

excerpts from
photo```

### This Post Has 9 Comments

1. Sandeep Kumar Ojha

rootNode = treeNode(‘pyramid’,9,None)

rootNode.children[‘eye’] = treeNode(‘eye’,13,None)

in google collab it’s showing error

2. H i, I am beginner in programming. have to implement fp-growth algorithm in c++. can any one provide its full working code please?

3. BURRE CHANDU

The full code can be found here:
https://nitapcseprojects-chandu.blogspot.com/2018/10/coding-fpgrowth-in-python-from-scratch.html

class fptree:
def __init__(self,ide,cnt,parent):
self.ide=ide
self.cnt=cnt
self.parent=parent
self.child={}
def increm(self,cnt):
self.cnt+=cnt
def genFi(data,minsp,dic=False):
kdc={}
fi,sfi=[],[]
nnewdata={}
for dat in data:
for i in range(0,len(dat)):
if dat[i] not in kdc:
kdc[dat[i]]=1
elif dat[i] in dat[0:i]:
continue
else:
kdc[dat[i]]+=1
for k,v in kdc.items():
if v1:
genTree(data[1:],cnt,null.child[data[0]],kdc)
#nodes with same names but in different paths

4. Patrick

Hi, nice article. One note, your ‘createInitSet’ function does not seem to do what it needs to do. It doesn’t count the frequencies, it just sets all frequencies to 1. I suggest the following:

def createInitSet(dataSet):
retDict = {}
for trans in dataSet:
if frozenset(trans) in retDict:
retDict[frozenset(trans)] += 1
else:
retDict[frozenset(trans)] = 1
return retDict

Please note I’m no python expert, there might be a more efficient expression, but it does what it needs to do.

Kind regards.

1. amrit puhan

thanks mate

5. Gary

Hi, I really like your website and appreciated your efforts to write this out in python3.
For FP growth, I think you are taking reference of by Peter Harrington, right?
I realized that the there is a function to mine the tree automatically, which is missing from your post. Have you tried translate it into python3? not successful??

def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])] # (sort header table)
for basePat in bigL: # start from bottom of header table
newFreqSet = preFix.copy()
# print ‘finalFrequent Item: ‘,newFreqSet #append to set
freqItemList.append(newFreqSet)
# print ‘condPattBases :’,basePat, condPattBases
# 2. construct cond FP-tree from cond. pattern base
if myHead != None: # 3. mine cond. FP-tree
# print ‘conditional tree for: ‘,newFreqSet
# myCondTree.disp(1)

1. Yes it is referenced from Peter Harrington’s book. I think I did not try it. I was more interested in finding out how the algorithm works. Thanks for the suggestion. I will give it a try.

1. Tao Song

Hi, thank you for your codes. I tried to do it by myself and I found that items of which frequencies less than the minSup still showed up. I am wondering if I misunderstand anything.

Best

2. Joel Carneiro

Hi, using the mineFPtree with python 3 I have an error:
bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])] # (sort header table)
AttributeError: ‘NoneType’ object has no attribute ‘items’