 The AdaBoost (adaptive boosting) algorithm was proposed in 1995 by Yoav Freund and Robert Shapire as a general method for generating a strong classifier out of a set of weak classifiers . AdaBoost works even when the classifiers come from a continuum of potential classifiers (such as neural networks, linear discriminants, etc.)

Pros: Low generalization error, easy to code, works with most classifiers, no parameters to adjust

Cons: Sensitive to outliers

Works with: Numeric values, nominal values

1. Collect: Any method.
2. Prepare: It depends on which type of weak learner you’re going to use. In this chapter, we’ll use decision stumps, which can take any type of data. You could use any classifier, so any of the classifiers from chapters 2–6 would work. Simple classifiers work better for a weak learner.
3. Analyze: Any method.
4. Train: The majority of the time will be spent here. The classifier will train the weak learner multiple times over the same dataset.
5. Test: Calculate the error rate.
6. Use: Like support vector machines, AdaBoost predicts one of two classes.

## Creating a weak learner with a decision stump

A decision stump is a simple decision tree.It’s a tree with only one split, so it’s a stump.

In :
```from numpy import *

datMat = matrix([[ 1. ,  2.1],
[ 2. ,  1.1],
[ 1.3,  1. ],
[ 1. ,  1. ],
[ 2. ,  1. ]])
classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
return datMat,classLabels
```
In :
```datMat,classLabels = loadSimpData()
```
In :
```def loadDataSet(fileName):      #general function to parse tab -delimited floats
numFeat = len(open(fileName).readline().split('t')) #get number of fields
dataMat = []; labelMat = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('t')
for i in range(numFeat-1):
lineArr.append(float(curLine[i]))
dataMat.append(lineArr)
labelMat.append(float(curLine[-1]))
return dataMat,labelMat
```

#### test If any of the values are less than or gretaer than the threshold value

In :
```def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data
retArray = ones((shape(dataMatrix),1))
if threshIneq == 'lt':
retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
else:
retArray[dataMatrix[:,dimen] > threshVal] = -1.0
return retArray
```

#### Loop over a weighted version and find the stump that yeilds the lowest error

In :
```def buildStump(dataArr,classLabels,D):
dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
m,n = shape(dataMatrix)
numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))
minError = inf #init error sum, to +infinity
for i in range(n):#loop over all dimensions
rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();
stepSize = (rangeMax-rangeMin)/numSteps
for j in range(-1,int(numSteps)+1):#loop over all range in current dimension
for inequal in ['lt', 'gt']: #go over less than and greater than
threshVal = (rangeMin + float(j) * stepSize)
predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#call stump classify with i, j, lessThan
errArr = mat(ones((m,1)))
errArr[predictedVals == labelMat] = 0
#Line where AdaBoost interacts with the classifier.
weightedError = D.T*errArr  #calc total error multiplied by D
#print ("split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
if weightedError < minError:
minError = weightedError
bestClasEst = predictedVals.copy()
bestStump['dim'] = i
bestStump['thresh'] = threshVal
bestStump['ineq'] = inequal
return bestStump,minError,bestClasEst
```
In :
```D = mat(ones((5,1))/5)
```
In :
```buildStump(datMat,classLabels,D)
```
Out:
```({'dim': 0, 'ineq': 'lt', 'thresh': 1.3}, matrix([[ 0.2]]), array([[-1.],
[ 1.],
[-1.],
[-1.],
[ 1.]]))```

### Implementing the full AdaBoost algorithm

For each iteration:

``````Find the best stump using buildStump()

Add the best stump to the stump array

Calculate alpha

Calculate the new weight vector – D

Update the aggregate class estimate

If the error rate ==0.0 : break out of the for loop``````
In :
```def adaBoostTrainDS(dataArr,classLabels,numIt=40):
weakClassArr = []
#mis the number of datapoints in a dataset
m = shape(dataArr)
#D holds all the weights of each peice of data
D = mat(ones((m,1))/m)   #init D to all equal
#aggregrate estimate of the class for every data point
aggClassEst = mat(zeros((m,1)))
for i in range(numIt):
bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump
print("D:",D.T)
alpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0
bestStump['alpha'] = alpha
weakClassArr.append(bestStump)                  #store Stump Params in Array
print ("classEst: ",classEst.T)
expon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messy
D = multiply(D,exp(expon))                              #Calc New D for next iteration
D = D/D.sum()
#calc training error of all classifiers, if this is 0 quit for loop early (use break)
aggClassEst += alpha*classEst
print ("aggClassEst: ",aggClassEst.T)
aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))
errorRate = aggErrors.sum()/m
print ("total error: ",errorRate)
if errorRate == 0.0: break
return weakClassArr ,aggClassEst
```
In :
```classifierArray = adaBoostTrainDS(datMat,classLabels,9)
```
```D: [[ 0.2  0.2  0.2  0.2  0.2]]
classEst:  [[-1.  1. -1. -1.  1.]]
aggClassEst:  [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
total error:  0.2
D: [[ 0.5    0.125  0.125  0.125  0.125]]
classEst:  [[ 1.  1. -1. -1. -1.]]
aggClassEst:  [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error:  0.2
D: [[ 0.28571429  0.07142857  0.07142857  0.07142857  0.5       ]]
classEst:  [[ 1.  1.  1.  1.  1.]]
aggClassEst:  [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
total error:  0.0
```

Remember, our class labels were [1.0, 1.0, -1.0, -1.0, 1.0]. In the first iteration, all the D values were equal; then only one value, the first data point, was misclassified. So, in the next iteration, the D vector puts 0.5 weight on the first data point because it was misclassified previously. You can see the total class by looking at the sign of aggClassEst. After the second iteration, you can see that the first data point is correctly classified, but the last data point is now wrong. The D value now becomes 0.5 for the last element, and the other values in the D vector are much smaller. Finally, in the third iteration the sign of all the values in aggClassEst matches your class labels and the training error becomes 0, so you can quit.

In :
```classifierArray
```
Out:
```[{'alpha': 0.6931471805599453, 'dim': 0, 'ineq': 'lt', 'thresh': 1.3},
{'alpha': 0.9729550745276565, 'dim': 1, 'ineq': 'lt', 'thresh': 1.0},
{'alpha': 0.8958797346140273,
'dim': 0,
'ineq': 'lt',
'thresh': 0.90000000000000002}]```

All you need to do is take the train of weak classifiers from your training function and apply these to an instance. The result of each weak classifier is weighted by its alpha. The weighted results from all of these weak classifiers are added together, and you take the sign of the final weighted sum to get your final answer.

In :
```def adaClassify(datToClass,classifierArr):
dataMatrix = mat(datToClass)#do stuff similar to last aggClassEst in adaBoostTrainDS
m = shape(dataMatrix)
aggClassEst = mat(zeros((m,1)))
for i in range(len(classifierArr)):
classEst =stumpClassify(dataMatrix,classifierArr[i]['dim'],
classifierArr[i]['thresh'],
classifierArr[i]['ineq'])#call stump classify
aggClassEst += classifierArr[i]['alpha']*classEst
print (aggClassEst)
return sign(aggClassEst)
```

In :
```adaClassify([[5, 5],[0,0]],classifierArray)
```
```[[ 0.69314718]
[-0.69314718]]
[[ 1.66610226]
[-1.66610226]]
[[ 2.56198199]
[-2.56198199]]
```
Out:
```matrix([[ 1.],
[-1.]])```

### AdaBoost on a difficult dataset

In :
```datArr,labelArr = loadDataSet('horseColicTraining2.txt')
```
```D: [[ 0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
....
....
....```
```0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448]]
classEst:  [[-1.  1.  1.  1.  1.  1.  1.  1. -1. -1.  1.  1.  1.  1.  1. -1. -1.  1.
1.  1. -1.  1.  1.  1.  1.  1.  1.  1.  1.  1. -1.  1.  1.  1. -1. -1.
1. -1.  1.  1. -1.  1.  1. -1. -1. -1. -1.  1.  1. -1.  1.  1.  1.  1.
....
....
....```
```-1. -1. -1.  1. -1. -1.  1.  1.  1.  1.  1.  1.  1.  1.  1. -1.  1.  1.
1.  1.  1. -1.  1.  1.  1. -1. -1.  1.  1.]]
aggClassEst:  [[-0.46166238  0.46166238  0.46166238  0.46166238  0.46166238  0.46166238
0.46166238  0.46166238 -0.46166238 -0.46166238  0.46166238  0.46166238
0.46166238  0.46166238  0.46166238 -0.46166238 -0.46166238  0.46166238
....
....```
```0.46166238  0.46166238  0.46166238 -0.46166238  0.46166238  0.46166238
0.46166238 -0.46166238 -0.46166238  0.46166238  0.46166238]]
total error:  0.284280936455
D: [[ 0.00233645  0.00588235  0.00233645  0.00588235  0.00588235  0.00233645
0.00233645  0.00588235  0.00233645  0.00588235  0.00233645  0.00233645
0.00233645  0.00588235  0.00233645  0.00233645  0.00233645  0.00233645
....
....```
```1.20719077  0.91726555 -0.10329743 -0.57967867  0.27123572  1.69342306
0.05809528 -0.65208904 -1.02662219  0.27123572  0.5606821 ]]
total error:  0.230769230769
```
In :
```testArr,testLabelArr = loadDataSet('horseColicTest2.txt')
```
```[[ 0.46166238]
[ 0.46166238]
[-0.46166238]
...
...
...```
```[ 1.47791472]
[ 0.80958618]
[ 0.54030781]
[ 0.5273375 ]]
```
In :
```errArr=mat(ones((67,1)))
errArr[prediction10!=mat(testLabelArr).T].sum()
```
Out:
`16.0`
In :
```def plotROC(predStrengths, classLabels):
import matplotlib.pyplot as plt
%matplotlib inline
cur = (1.0,1.0) #cursor
ySum = 0.0 #variable to calculate AUC
numPosClas = sum(array(classLabels)==1.0)
yStep = 1/float(numPosClas); xStep = 1/float(len(classLabels)-numPosClas)
sortedIndicies = predStrengths.argsort()#get sorted index, it's reverse
fig = plt.figure()
fig.clf()
ax = plt.subplot(111)
#loop through all the values, drawing a line segment at each point
for index in sortedIndicies.tolist():
if classLabels[index] == 1.0:
delX = 0; delY = yStep;
else:
delX = xStep; delY = 0;
ySum += cur
#draw line from cur to (cur-delX,cur-delY)
ax.plot([cur,cur-delX],[cur,cur-delY], c='b')
cur = (cur-delX,cur-delY)
ax.plot([0,1],[0,1],'b--')
plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
plt.title('ROC curve for AdaBoost horse colic detection system')
ax.axis([0,1,0,1])
plt.show()
print ("the Area Under the Curve is: ",ySum*xStep)
```
In :
```datArr,labelArr = loadDataSet('horseColicTraining2.txt')
```
In :
```classifierArray,aggClassEst = adaBoostTrainDS(datArr,labelArr,10)
```
```D: [[ 0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
0.00334448  0.00334448  0.00334448  0.00334448  0.00334448  0.00334448
....
....
....```
``` -0.01293737  1.53203035  0.95841088 -1.04249592  0.23749438  0.5606821
1.20719077  0.91726555 -0.10329743 -0.57967867  0.27123572  1.69342306
0.05809528 -0.65208904 -1.02662219  0.27123572  0.5606821 ]]
total error:  0.230769230769
```
In :
```plotROC(aggClassEst.T,labelArr)
``` ```the Area Under the Curve is:  0.8582969635063604
```

There are many ways to include the cost information in classification algorithms. In AdaBoost, you can adjust the error weight vector D based on the cost function. In naïve Bayes, you could predict the class with the lowest expected cost instead of the class with the highest probability. In SVMs, you can use different C parameters in the cost function for the different classes. This gives more weight to the smaller class, which when training the classifier will allow fewer errors in the smaller class.

excerpts from

tutorial

photo

` `

### This Post Has One Comment

1. Basemah

I believe that the dataset you used is classified as 1 or -1. What about Datasets that have 3 or more classes? how can I modify your code to run it with multi-classes sets?

This site uses Akismet to reduce spam. Learn how your comment data is processed.