This notebook contains my notes for Predictive Analysis on Binary Classification. It acts as a cookbook.
Importing and sizing up a New Data Set
from urllib.request import urlopen
import sys

#read data from the UCI data repository
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
              "undocumented/connectionist-bench/sonar/sonar.all-data")
data = urlopen(target_url)

#arrange data into a list for labels and a list of lists for attributes
xList = []
labels = []
for line in data:
    #split on comma
    row = line.strip().decode().split(",")
    xList.append(row)

sys.stdout.write("Number of Rows of Data = " + str(len(xList)) + '\n')
sys.stdout.write("Number of Columns of Data = " + str(len(xList[1])) + '\n')
If the data set has many more columns than rows, you may be more likely to get the best prediction with penalized linear regression and vice versa.
Determining the Nature of Attributes
The next step in the checklist is to determine how many of the columns of data are numeric versus categorical.
#count how many entries in each column are numeric, non-empty strings, or empty
nrow = len(xList)
ncol = len(xList[1])

type = [0]*3
colCounts = []

for col in range(ncol):
    for row in xList:
        try:
            a = float(row[col])
            if isinstance(a, float):
                type[0] += 1
        except ValueError:
            if len(row[col]) > 0:
                type[1] += 1
            else:
                type[2] += 1
    colCounts.append(type)
    type = [0]*3

#single tabs in the header so the columns align
sys.stdout.write("Col#" + '\t' + "Number" + '\t' + "Strings" + '\t' + "Other\n")
iCol = 0
for types in colCounts:
    sys.stdout.write(str(iCol) + '\t' + str(types[0]) + '\t' +
                     str(types[1]) + '\t' + str(types[2]) + "\n")
    iCol += 1
The code runs down each column and adds up the number of entries that are numeric (int or float), the number of entries that are non-empty strings, and the number that are empty. The result is that the first 60 columns contain all numeric values and the last column contains all strings. The string values are the labels.
Generally, categorical variables are presented as strings, as in this example. In some cases, binary‐valued categorical variables are presented as a 0,1 numeric variable.
After determining which attributes are categorical and which are numeric, you’ll
want some descriptive statistics for the numeric variables and a count of the unique categories in each categorical attribute.
Summary Statistics for Numeric and Categorical Attributes
import numpy as np

#generate summary statistics for column 3 (e.g.)
col = 3
colData = []
for row in xList:
    colData.append(float(row[col]))

colArray = np.array(colData)
colMean = np.mean(colArray)
colsd = np.std(colArray)
sys.stdout.write("Mean = " + '\t' + str(colMean) + '\t\t' +
                 "Standard Deviation = " + '\t' + str(colsd) + "\n")

#calculate quantile boundaries
ntiles = 4
percentBdry = []
for i in range(ntiles+1):
    percentBdry.append(np.percentile(colArray, i*(100)/ntiles))
sys.stdout.write("\nBoundaries for 4 Equal Percentiles \n")
print(percentBdry)
sys.stdout.write(" \n")

#run again with 10 equal intervals
ntiles = 10
percentBdry = []
for i in range(ntiles+1):
    percentBdry.append(np.percentile(colArray, i*(100)/ntiles))
sys.stdout.write("Boundaries for 10 Equal Percentiles \n")
print(percentBdry)
sys.stdout.write(" \n")

#the last column contains categorical variables
col = 60
colData = []
for row in xList:
    colData.append(row[col])

unique = set(colData)
sys.stdout.write("Unique Label Values \n")
print(unique)

#count up the number of elements having each value
catDict = dict(zip(list(unique), range(len(unique))))
catCount = [0]*2
for elt in colData:
    catCount[catDict[elt]] += 1
sys.stdout.write("\nCounts for Each Value of Categorical Label \n")
print(list(unique))
print(catCount)
The first step is to calculate the mean and standard deviation for the chosen attribute.
The next section of code looks for outliers. One way to reveal a mismatch between the bulk of the data and a few extreme values is to divide the set of numbers into percentiles.
First the program calculates the quartiles. That shows that the upper quartile is much wider than the others. To be more certain, the decile boundaries are also calculated and similarly demonstrate that the upper decile is unusually wide. Some widening is normal because distributions often thin out in the tails.
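To make the "unusually wide upper decile" observation concrete, a rough sketch is to compare the width of the top decile to the median decile width for the column examined above (the 3x flag threshold is an arbitrary choice for illustration, not something from the text):

#rough heavy-tail check: compare the width of the top decile to the median decile width
bdry = [np.percentile(colArray, 10 * i) for i in range(11)]
widths = [bdry[i + 1] - bdry[i] for i in range(10)]
medianWidth = sorted(widths)[len(widths) // 2]
if widths[-1] > 3.0 * medianWidth:
    print("Top decile is unusually wide - possible outliers in column 3")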
Visualization of Outliers Using Quantile‐Quantile Plot
import pylab
import scipy.stats as stats

#generate a quantile-quantile plot for column 3 (e.g.)
col = 3
colData = []
for row in xList:
    colData.append(float(row[col]))

stats.probplot(colData, dist="norm", plot=pylab)
pylab.show()
The resulting plot shows how the boundaries associated with empirical percentiles in the data compare to the boundaries for the same percentiles of a Gaussian distribution. If the data being analyzed come from a Gaussian distribution, the plotted points will lie approximately on a straight line.
Outliers may cause trouble either for model building or prediction. After you’ve trained a model on this data set, you can look at the errors your model makes and see whether the errors are correlated with these outliers.
You can segregate them out and train them as a separate class. You can also edit them out of the data if they represent an abnormality that won’t be present in the data your model will see when deployed.
A reasonable process for this might be to generate quartile boundaries during the exploration phase and note potential outliers to get a feel for how much of a problem you might (or might not) have with it. Then when you’re evaluating performance data, use quantile‐quantile (Q‐Q) plots to determine which points to call outliers for use in your error analysis.
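A hedged sketch of that process for the column examined above, using the common 1.5 x IQR rule of thumb to note candidate outliers (the 1.5 multiplier is a convention, not something the text prescribes):

#flag candidate outlier rows in column 3 using the interquartile range
q1, q3 = np.percentile(colArray, [25, 75])
iqr = q3 - q1
lowerBound = q1 - 1.5 * iqr
upperBound = q3 + 1.5 * iqr
outlierRows = [i for i, v in enumerate(colArray) if v < lowerBound or v > upperBound]
print("Candidate outlier rows for column 3:", outlierRows)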
Using Python Pandas to Read Data
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plot
%matplotlib inline
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-"
"databases/undocumented/connectionist-bench/sonar/sonar.all-data")
#read rocks versus mines data into pandas data frame
rocksVMines = pd.read_csv(target_url,header=None, prefix="V")
#print head and tail of data frame
print(rocksVMines.head())
print(rocksVMines.tail())
Structure in the way the data are stored might need to be factored into your approach for doing subsequent sampling.
Using Python Pandas to Summarize Data
#print summary of data frame
summary = rocksVMines.describe()
print(summary)
Notice that the summary produced by the describe function is itself a data frame, so you can automate the process of screening for attributes that have outliers. To do that, compare the differences between the various quantiles and raise a flag if any of the differences for an attribute are out of scale with the other differences for the same attribute. The quantiles shown in the output indicate that several attributes have outliers; it would be worth determining how many rows are involved.
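A sketch of that screening using the summary frame produced above: it compares the upper-tail spread (max minus the 75th percentile) against the interquartile range and flags columns where the ratio is large (the 2.0 cutoff is an arbitrary illustration, not a prescribed value):

#flag attributes whose upper tail is long relative to their interquartile range
for colName in summary.columns:
    q25 = summary.loc['25%', colName]
    q75 = summary.loc['75%', colName]
    colMax = summary.loc['max', colName]
    iqr = q75 - q25
    if iqr > 0 and (colMax - q75) > 2.0 * iqr:
        print(colName, "has a long upper tail - possible outliers")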
Visualizing
Parallel Coordinates Plots
for i in range(208):
    #assign color based on "M" or "R" labels
    if rocksVMines.iat[i,60] == "M":
        pcolor = "red"
    else:
        pcolor = "blue"
    #plot rows of data as if they were series data
    dataRow = rocksVMines.iloc[i,0:60]
    dataRow.plot(color=pcolor, alpha=0.5)

plot.xlabel("Attribute Index")
plot.ylabel("Attribute Values")
plot.show()
No extremely clear separation is evident in the line plot, but there are some areas where the blues and reds are separated. Along the bottom of the plot, the blues stand out a bit, and in the range of attribute indices from 30 to 40, the blues are somewhat higher than the reds. These kinds of insights can help in interpreting and confirming predictions made by your trained model.
Visualizing Interrelationships between Attributes and Labels
Another question you might ask of the data is how the various attributes relate to one another. One quick way to get an idea of pair‐wise relationships is to crossplot the attributes with the labels.
Scatter Plot / Cross-Plots
#cross-plot pairs of attribute rows against one another
dataRow2 = rocksVMines.iloc[1,0:60]
dataRow3 = rocksVMines.iloc[2,0:60]
plot.scatter(dataRow2, dataRow3)
plot.xlabel("2nd Attribute")
plot.ylabel("3rd Attribute")
plot.show()

dataRow21 = rocksVMines.iloc[20,0:60]
plot.scatter(dataRow2, dataRow21)
plot.xlabel("2nd Attribute")
plot.ylabel("21st Attribute")
plot.show()
If you want to develop your intuition about the relation between numeric correlation and the shape of the scatter plot, just search for "correlation" and compare example scatter plots at different correlation values.
Basically, if the points in the scatter plot lie along a thin straight line, the two variables are highly correlated; if they form a ball of points, they’re uncorrelated.
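To tie the plots to a number, the correlations for the two cross-plots above can be computed directly with numpy (a quick sketch; the hand calculation appears a little further down):

#numeric correlation for the two cross-plots above
print("corr(dataRow2, dataRow3) =", np.corrcoef(dataRow2, dataRow3)[0, 1])
print("corr(dataRow2, dataRow21) =", np.corrcoef(dataRow2, dataRow21)[0, 1])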
Correlation between Classification Target and Real Attributes
The next step is a scatter plot of the targets against attribute 35. The idea of using attribute 35 for the example showing correlation with the target came from the parallel coordinates graph.
from random import uniform

#change the targets to numeric values
target = []
for i in range(208):
    #assign 0 or 1 target value based on "M" or "R" labels
    if rocksVMines.iat[i,60] == "M":
        target.append(1.0)
    else:
        target.append(0.0)

#scatter plot attribute 35 against the target
dataRow = rocksVMines.iloc[0:208,35]
plot.scatter(dataRow, target)
plot.xlabel("Attribute Value")
plot.ylabel("Target Value")
plot.show()

#to improve the visualization, this version dithers the points a little
#and makes them somewhat transparent
target = []
for i in range(208):
    #assign 0 or 1 target value based on "M" or "R" labels
    #and add some dither
    if rocksVMines.iat[i,60] == "M":
        target.append(1.0 + uniform(-0.1, 0.1))
    else:
        target.append(0.0 + uniform(-0.1, 0.1))

#scatter plot attribute 35 against the dithered target
dataRow = rocksVMines.iloc[0:208,35]
plot.scatter(dataRow, target, alpha=0.5, s=120)
plot.xlabel("Attribute Value")
plot.ylabel("Target Value")
plot.show()
Notice the somewhat higher concentration of attribute 35 values at the left end of the upper band of points, whereas the points are more uniformly spread from right to left in the lower band. The upper band of points corresponds to mines; the lower band corresponds to rocks. You could build a crude classifier for this problem by testing whether attribute 35 is greater than or less than 0.5: if it is greater than 0.5, call the example a rock; if it is less than 0.5, call it a mine. The examples where attribute 35 is less than 0.5 contain a higher concentration of mines than rocks, and the examples where attribute 35 is greater than 0.5 contain a lower concentration of mines, so you'd get better performance than you would with random guessing.
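Here is a quick sketch of that single-attribute classifier, reusing the rocksVMines frame (the 0.5 cut point comes from the discussion above; the accuracy print is only a sanity check and isn't part of the original example):

#toy classifier: predict mine (1.0) when attribute 35 is below 0.5, rock (0.0) otherwise
correct = 0
for i in range(208):
    predictedLabel = 1.0 if rocksVMines.iat[i, 35] < 0.5 else 0.0
    actualLabel = 1.0 if rocksVMines.iat[i, 60] == "M" else 0.0
    if predictedLabel == actualLabel:
        correct += 1
print("Single-attribute accuracy:", correct / 208.0)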
The degree of correlation between two attributes (or an attribute and a target) can be quantified using Pearson’s correlation coefficient.
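For reference, Pearson's correlation coefficient for two attribute vectors $u$ and $v$ of length $n$ is

$$r_{uv} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(u_i-\bar{u})(v_i-\bar{v})}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(u_i-\bar{u})^2}\;\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(v_i-\bar{v})^2}}$$

which is what the listing below computes: the means, the variances (normalized by $n$), and the averaged cross-product.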
The attributes that have close index numbers have relatively higher correlations than those that are separated further.
Pearson’s Correlation Calculation for Attributes 2 versus 3 and 2 versus 21
from math import sqrt

#calculate correlations between real-valued attributes
dataRow2 = rocksVMines.iloc[1,0:60]
dataRow3 = rocksVMines.iloc[2,0:60]
dataRow21 = rocksVMines.iloc[20,0:60]

mean2 = 0.0; mean3 = 0.0; mean21 = 0.0
numElt = len(dataRow2)
for i in range(numElt):
    mean2 += dataRow2[i]/numElt
    mean3 += dataRow3[i]/numElt
    mean21 += dataRow21[i]/numElt

var2 = 0.0; var3 = 0.0; var21 = 0.0
for i in range(numElt):
    var2 += (dataRow2[i] - mean2) * (dataRow2[i] - mean2)/numElt
    var3 += (dataRow3[i] - mean3) * (dataRow3[i] - mean3)/numElt
    var21 += (dataRow21[i] - mean21) * (dataRow21[i] - mean21)/numElt

corr23 = 0.0; corr221 = 0.0
for i in range(numElt):
    corr23 += (dataRow2[i] - mean2) * \
              (dataRow3[i] - mean3) / (sqrt(var2*var3) * numElt)
    corr221 += (dataRow2[i] - mean2) * \
               (dataRow21[i] - mean21) / (sqrt(var2*var21) * numElt)

sys.stdout.write("Correlation between attribute 2 and 3 \n")
print(corr23)
sys.stdout.write(" \n")
sys.stdout.write("Correlation between attribute 2 and 21 \n")
print(corr221)
sys.stdout.write(" \n")
Visualizing Attribute and Label Correlations Using a Heat Map
One way to check correlations with a large number of attributes is to calculate the Pearson's correlation coefficient for pairs of attributes, arrange those correlations into a matrix where the ij-th entry is the correlation between the ith attribute and the jth attribute, and then plot them as a heat map.
#calculate correlations between real-valued attributes
corMat = DataFrame(rocksVMines.corr())
#visualize correlations using heatmap
plot.pcolor(corMat)
plot.show()
The light areas along the diagonal confirm that attributes close to one another in index have relatively high correlations. As mentioned earlier, this is due to the way in which the data are generated. Close indices are sampled at short time intervals from one another and consequently have similar frequencies. Similar frequencies reflect off the targets similarly (and so on).
Perfect correlation (correlation = 1) between attributes means that you may have made a mistake and included the same thing twice. Very high correlation between a set of attributes (pairwise correlations > 0.7) is known as multicollinearity and can lead to unstable estimates. Correlation with the targets is a different matter. Having an attribute that’s correlated with the target generally indicates a predictive relation.
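A short sketch of screening for that condition using the corMat frame computed above (the 0.7 cutoff mirrors the rule of thumb just mentioned):

#count attribute pairs whose absolute correlation exceeds 0.7
highCorrPairs = []
for i in range(corMat.shape[0]):
    for j in range(i + 1, corMat.shape[1]):
        if abs(corMat.iloc[i, j]) > 0.7:
            highCorrPairs.append((corMat.index[i], corMat.columns[j], corMat.iloc[i, j]))
print("Number of attribute pairs with |corr| > 0.7:", len(highCorrPairs))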
This part of the notes continues from the previous section on Pearson's correlation. It discusses assessing the performance of predictive models.
One of the most commonly used performance measures is the misclassification error, that is, the fraction of examples that the prediction function classifies incorrectly.
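A minimal sketch of that definition (misclassificationError and its arguments are illustrative names, not from the original text):

#misclassification error: fraction of examples classified incorrectly
def misclassificationError(predicted, actual):
    nWrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return nWrong / float(len(actual))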
Reading and Arranging data
from urllib.request import urlopen
import sys

#read data from the UCI data repository
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
              "undocumented/connectionist-bench/sonar/sonar.all-data")
data = urlopen(target_url)

#arrange data into a list for labels and a list of lists for attributes
xList = []
labels = []
for line in data:
    #split on comma
    row = line.strip().decode().split(",")
    #assign label 1.0 for "M" and 0.0 for "R"
    if row[-1] == 'M':
        labels.append(1.0)
    else:
        labels.append(0.0)
    #remove label from row
    row.pop()
    #convert remaining entries to floats
    floatRow = [float(num) for num in row]
    xList.append(floatRow)
confusionMatrix() takes the predictions, the corresponding actual values (labels), and a threshold value as input. It compares the predictions to the threshold to determine whether to assign each example to the “predicted positive” or “predicted negative” column in the confusion matrix. It uses the actual value to make the assignment to the appropriate row of the confusion matrix.
from sklearn import linear_model

def confusionMatrix(predicted, actual, threshold):
    if len(predicted) != len(actual): return -1
    tp = 0.0
    fp = 0.0
    tn = 0.0
    fn = 0.0
    for i in range(len(actual)):
        if actual[i] > 0.5: #labels that are 1.0 (positive examples)
            if predicted[i] > threshold:
                tp += 1.0 #correctly predicted positive
            else:
                fn += 1.0 #incorrectly predicted negative
        else: #labels that are 0.0 (negative examples)
            if predicted[i] < threshold:
                tn += 1.0 #correctly predicted negative
            else:
                fp += 1.0 #incorrectly predicted positive
    rtn = [tp, fn, fp, tn]
    return rtn
#divide attribute matrix and label vector into training(2/3 of data) and test sets (1/3 of data)
indices = range(len(xList))
xListTest = [xList[i] for i in indices if i%3 == 0 ]
xListTrain = [xList[i] for i in indices if i%3 != 0 ]
labelsTest = [labels[i] for i in indices if i%3 == 0]
labelsTrain = [labels[i] for i in indices if i%3 != 0]
#form list of list input into numpy arrays to match input class for scikit-learn linear model
xTrain = np.array(xListTrain); yTrain = np.array(labelsTrain); xTest = np.array(xListTest); yTest = np.array(labelsTest)
#check shapes to see what they look like
print("Shape of xTrain array", xTrain.shape)
print("Shape of yTrain array", yTrain.shape)
print("Shape of xTest array", xTest.shape)
print("Shape of yTest array", yTest.shape)
The data labeled test will not be used in training the classifier, but will be reserved for assessing performance after the classifier is trained.
#train linear regression model
rocksVMinesModel = linear_model.LinearRegression()
rocksVMinesModel.fit(xTrain,yTrain)
#generate predictions on the training (in-sample) data
trainingPredictions = rocksVMinesModel.predict(xTrain)
print("Some values predicted by model", trainingPredictions[0:5], trainingPredictions[-6:-1])
The classifier is trained by converting the labels M (for mine) and R (for rock) in the original data set into numeric values—1.0 corresponding to mine, and 0.0 corresponding to rock—and then using the ordinary least squares regression to fit a linear model.
Then the trained model is used to generate predictions on the training set and on the test set.
#generate confusion matrix for predictions on the training set (in-sample)
confusionMatTrain = confusionMatrix(trainingPredictions, yTrain, 0.50)
#pick threshold value and generate confusion matrix entries
tp = confusionMatTrain[0]; fn = confusionMatTrain[1]; fp = confusionMatTrain[2]; tn = confusionMatTrain[3]
print("tp = " + str(tp) + "\tfn = " + str(fn) + "\n" + "fp = " + str(fp) + "\ttn = " + str(tn) + '\n')
#misclassification error rate
def errorRate(tp, fn, fp, tn):
    #accuracy
    acc = (tp + tn) / (tp + fn + fp + tn)
    return 1.0 - acc
The linear regression model generates numbers that are mostly in the interval between 0.0 and 1.0, but not entirely. The predictions aren’t quite probabilities. They can still be used to generate predicted classifications by comparing to a threshold value.
#generate predictions on out-of-sample data
testPredictions = rocksVMinesModel.predict(xTest)
#generate confusion matrix from predictions on out-of-sample data
conMatTest = confusionMatrix(testPredictions, yTest, 0.25)
#pick threshold value and generate confusion matrix entries
tp = conMatTest[0]; fn = conMatTest[1]; fp = conMatTest[2]; tn = conMatTest[3]
print("tp = " + str(tp) + "\tfn = " + str(fn) + "\n" + "fp = " + str(fp) + "\ttn = " + str(tn) + '\n')
print("Misclassification error rate is: " + str(errorRate(tp, fn, fp, tn)))
The code generates confusion matrices for the in-sample data and the out-of-sample data and prints them both. The misclassification error rate is about 8 percent on the in-sample data and about 26 percent on the out-of-sample data.
The misclassification error changes when the thresholds are changed.
The best value for the threshold may be the one that minimizes the misclassification error. Sometimes, however, there’s more cost associated with one type of error than with another.
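To see that effect directly, one rough sketch is to sweep a few candidate thresholds over the in-sample predictions and print the misclassification error for each (the particular threshold grid is arbitrary):

#sweep candidate decision thresholds and report the in-sample misclassification error for each
for threshold in [0.25, 0.35, 0.45, 0.50, 0.55, 0.65, 0.75]:
    tpT, fnT, fpT, tnT = confusionMatrix(trainingPredictions, yTrain, threshold)
    print("threshold = " + str(threshold) + "  error rate = " + str(errorRate(tpT, fnT, fpT, tnT)))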
To characterize the overall performance of the classifier instead of relying on the misclassification error rate at one particular decision threshold, a common technique is the receiver operating characteristic, or ROC, curve.
The ROC curve plots the true positive rate (abbreviated TPR) versus the false positive rate (FPR).
TPR is the proportion of positive examples that are correctly classified as positive. FPR is the proportion of actual negatives that are incorrectly classified as positive.
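As a quick check, the TPR and FPR at a single threshold can be computed directly from the confusion-matrix entries printed above (a minimal sketch reusing the tp, fn, fp, tn values from the out-of-sample confusion matrix):

#one (FPR, TPR) operating point on the ROC curve, taken from the confusion matrix above
tprPoint = tp / (tp + fn)
fprPoint = fp / (fp + tn)
print("TPR = " + str(tprPoint) + "  FPR = " + str(fprPoint))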
from sklearn.metrics import roc_curve, auc
#generate ROC curve for in-sample
fpr, tpr, thresholds = roc_curve(yTrain,trainingPredictions)
roc_auc = auc(fpr, tpr)
print( 'AUC for in-sample ROC curve: %f' % roc_auc)
import pylab as pl
# Plot ROC curve
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('In sample ROC rocks versus mines')
pl.legend(loc="lower right")
pl.show()
#generate ROC curve for out-of-sample
fpr, tpr, thresholds = roc_curve(yTest,testPredictions)
roc_auc = auc(fpr, tpr)
print( 'AUC for out-of-sample ROC curve: %f' % roc_auc)
# Plot ROC curve
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Out-of-sample ROC rocks versus mines')
pl.legend(loc="lower right")
pl.show()
For a perfect classifier, the ROC curve steps straight up from (0, 0) to (0, 1) and then goes straight across to (1, 1).
Not surprisingly, in-sample data comes closer to perfection than out-of-sample data.
The closer that a classifier can come to hitting the upper-left corner, the better it is.
If the ROC curve drops significantly below the diagonal line, it usually means that the data scientist has gotten a sign switched somewhere and should examine his code carefully.
A perfect classifier has an AUC of 1.0, and random guessing has an AUC of 0.5.
The next example uses ridge regression, with its complexity tuning parameter alpha, and AUC as the performance measure for the classifier.
#form list of lists input into numpy arrays to match input class for scikit-learn linear model
xTrain = np.array(xListTrain); yTrain = np.array(labelsTrain); xTest = np.array(xListTest); yTest = np.array(labelsTest)

alphaList = [0.1**i for i in [-3, -2, -1, 0, 1, 2, 3, 4, 5]]
aucList = []
for alph in alphaList:
    rocksVMinesRidgeModel = linear_model.Ridge(alpha=alph)
    rocksVMinesRidgeModel.fit(xTrain, yTrain)
    fpr, tpr, thresholds = roc_curve(yTest, rocksVMinesRidgeModel.predict(xTest))
    roc_auc = auc(fpr, tpr)
    aucList.append(roc_auc)

print("AUC alpha")
for i in range(len(aucList)):
    print(aucList[i], alphaList[i])

#plot auc values versus alpha values
x = [-3, -2, -1, 0, 1, 2, 3, 4, 5]
plot.plot(x, aucList)
plot.xlabel('-log(alpha)')
plot.ylabel('AUC')
plot.show()

#visualize the performance of the best classifier
indexBest = aucList.index(max(aucList))
alph = alphaList[indexBest]
rocksVMinesRidgeModel = linear_model.Ridge(alpha=alph)
rocksVMinesRidgeModel.fit(xTrain, yTrain)

#scatter plot of actual vs predicted
plot.scatter(rocksVMinesRidgeModel.predict(xTest), yTest, s=100, alpha=0.25)
plot.xlabel("Predicted Value")
plot.ylabel("Actual Value")
plot.show()
For Deployment
Coefficient Trajectories for ElasticNet
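The ElasticNet and penalized logistic regression code below assumes column-normalized attributes (xNormalized) and normalized labels (labelNormalized), which are not built earlier in these notes. Here is a minimal sketch of that normalization, assuming the xList and labels lists from the reading step above:

from math import sqrt

#number of rows and columns in the attribute matrix (labels were already removed from the rows)
nrow = len(xList)
ncol = len(xList[0])

#column-wise means and (population) standard deviations
xMeans = []
xSD = []
for i in range(ncol):
    colVals = [xList[j][i] for j in range(nrow)]
    mean = sum(colVals) / nrow
    xMeans.append(mean)
    xSD.append(sqrt(sum([(v - mean) ** 2 for v in colVals]) / nrow))

#normalize attributes and labels to zero mean and unit standard deviation
xNormalized = [[(xList[i][j] - xMeans[j]) / xSD[j] for j in range(ncol)] for i in range(nrow)]
meanLabel = sum(labels) / nrow
sdLabel = sqrt(sum([(l - meanLabel) ** 2 for l in labels]) / nrow)
labelNormalized = [(l - meanLabel) / sdLabel for l in labels]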
from sklearn.linear_model import enet_path

#convert normalized labels to numpy array
Y = np.array(labelNormalized)
#convert normalized attributes to numpy array
X = np.array(xNormalized)

#compute the ElasticNet coefficient path (the original call also passed fit_intercept=False
#and return_models=False, which newer scikit-learn versions no longer accept)
alphas, coefs, _ = enet_path(X, Y, l1_ratio=0.4)

plot.plot(alphas, coefs.T)
plot.xlabel('alpha')
plot.ylabel('Coefficients')
plot.axis('tight')
plot.semilogx()
ax = plot.gca()
ax.invert_xaxis()
plot.show()
nattr, nalpha = coefs.shape

#find coefficient ordering
nzList = []
for iAlpha in range(1, nalpha):
    coefList = list(coefs[:, iAlpha])
    nzCoef = [index for index in range(nattr) if coefList[index] != 0.0]
    for q in nzCoef:
        if not(q in nzList):
            nzList.append(q)

#make up names for columns of X
names = ['V' + str(i) for i in range(ncol)]
nameList = [names[nzList[i]] for i in range(len(nzList))]
print("Attributes Ordered by How Early They Enter the Model")
print(nameList)
print('')
Two orderings of the attributes are of interest. One is the order in which variables enter the solution as alpha is decremented downward. The other is the ordering by the magnitude of the coefficients at the optimum solution.
Apparently, V10 is important when the coefficient penalty is so large that the algorithm only permits a single attribute, but when the coefficient penalty has shrunk to the point that a multitude of attributes are included, the attribute V10 levels off and drops in importance somewhat as other attributes are added to the mix.
The value for alpha at which coefficients are sought is hard-coded and comes directly from the results generated previously.
Best alpha for AUC = 0.0203348835893
#find coefficients corresponding to best alpha value. alpha value corresponding to
#normalized X and normalized Y is 0.020334883589342503
alphaStar = 0.020334883589342503
indexLTalphaStar = [index for index in range(100) if alphas[index] > alphaStar]
indexStar = max(indexLTalphaStar)
#here's the set of coefficients to deploy
coefStar = list(coefs[:,indexStar])
print("Best Coefficient Values ")
print(coefStar)
print('')
#The coefficients on normalized attributes give another slightly different ordering
absCoef = [abs(a) for a in coefStar]
#sort by magnitude
coefSorted = sorted(absCoef, reverse=True)
idxCoefSize = [absCoef.index(a) for a in coefSorted if not(a == 0.0)]
namesList2 = [names[idxCoefSize[i]] for i in range(len(idxCoefSize))]
print("Attributes Ordered by Coef Size at Optimum alpha")
print(namesList2)
Penalized Logistic Regression – Glmnet
The process has to be iterated until the probabilities (and corresponding weights) stop changing. Basically, IRLS for logistic regression adds another layer of iteration to the coordinate descent algorithm for ordinary (non-logistic) penalized regression.
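For reference, the quantities the code below recomputes on every pass are the logistic probability, the IRLS weight, and the working response for each example; they correspond to the variables p, w, and z in the listing:

$$p_i = \frac{1}{1+e^{-(\beta_0 + x_i^{T}\beta)}}, \qquad w_i = p_i(1-p_i), \qquad z_i = \frac{y_i - p_i}{w_i} + \beta_0 + x_i^{T}\beta$$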
from math import sqrt, fabs, exp

def S(z, gamma):
    #soft-thresholding operator used by the coordinate descent update
    if gamma >= fabs(z):
        return 0.0
    if z > 0.0:
        return z - gamma
    else:
        return z + gamma

def Pr(b0, b, x):
    #logistic probability for a single example
    n = len(x)
    sum = b0
    for i in range(n):
        sum += b[i]*x[i]
    if sum < -100: sum = -100    #guard against overflow in exp()
    return 1.0/(1.0 + exp(-sum))
#Do Not Normalize labels but do calculate averages
meanLabel = sum(labels)/nrow
sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrow)])/nrow)

#initialize probabilities and weights
sumWxr = [0.0] * ncol
sumWxx = [0.0] * ncol
sumWr = 0.0
sumW = 0.0

#calculate starting points for betas
for iRow in range(nrow):
    p = meanLabel
    w = p * (1.0 - p)
    #residual for logistic
    r = (labels[iRow] - p) / w
    x = xNormalized[iRow]
    sumWxr = [sumWxr[i] + w * x[i] * r for i in range(ncol)]
    sumWxx = [sumWxx[i] + w * x[i] * x[i] for i in range(ncol)]
    sumWr = sumWr + w * r
    sumW = sumW + w

avgWxr = [sumWxr[i]/nrow for i in range(ncol)]
avgWxx = [sumWxx[i]/nrow for i in range(ncol)]

maxWxr = 0.0
for i in range(ncol):
    val = abs(avgWxr[i])
    if val > maxWxr:
        maxWxr = val

alpha = 0.8
#calculate starting value for lambda
lam = maxWxr/alpha
#this value of lambda corresponds to beta = list of 0's

#initialize a vector of coefficients beta
beta = [0.0] * ncol
beta0 = sumWr/sumW

#initialize matrix of betas at each step
betaMat = []
betaMat.append(list(beta))
beta0List = []
beta0List.append(beta0)
#begin iteration
nSteps = 100
lamMult = 0.93 #100 steps gives a reduction by a factor of roughly 1000 in lambda (recommended by the glmnet authors)
nzList = []

for iStep in range(nSteps):
    #decrease lambda
    lam = lam * lamMult

    #Use incremental change in betas to control the inner iteration.
    #The middle-loop (IRLS) betas are set equal to the outer values; those are used
    #for calculating weights and probabilities, while the inner values are used for
    #calculating the penalized regression updates.
    betaIRLS = list(beta)
    beta0IRLS = beta0
    distIRLS = 100.0

    #middle loop to calculate new betas with fixed IRLS weights and probabilities
    iterIRLS = 0
    while distIRLS > 0.01:
        iterIRLS += 1
        iterInner = 0.0

        betaInner = list(betaIRLS)
        beta0Inner = beta0IRLS
        distInner = 100.0
        while distInner > 0.01:
            iterInner += 1
            if iterInner > 100: break

            #cycle through attributes and update one at a time
            #record starting value for comparison
            betaStart = list(betaInner)
            for iCol in range(ncol):
                sumWxr = 0.0
                sumWxx = 0.0
                sumWr = 0.0
                sumW = 0.0
                for iRow in range(nrow):
                    x = list(xNormalized[iRow])
                    y = labels[iRow]
                    p = Pr(beta0IRLS, betaIRLS, x)
                    if abs(p) < 1e-5:
                        p = 0.0
                        w = 1e-5
                    elif abs(1.0 - p) < 1e-5:
                        p = 1.0
                        w = 1e-5
                    else:
                        w = p * (1.0 - p)
                    z = (y - p) / w + beta0IRLS + sum([x[i] * betaIRLS[i] for i in range(ncol)])
                    r = z - beta0Inner - sum([x[i] * betaInner[i] for i in range(ncol)])
                    sumWxr += w * x[iCol] * r
                    sumWxx += w * x[iCol] * x[iCol]
                    sumWr += w * r
                    sumW += w

                avgWxr = sumWxr / nrow
                avgWxx = sumWxx / nrow
                beta0Inner = beta0Inner + sumWr / sumW
                uncBeta = avgWxr + avgWxx * betaInner[iCol]
                betaInner[iCol] = S(uncBeta, lam * alpha) / (avgWxx + lam * (1.0 - alpha))

            sumDiff = sum([abs(betaInner[n] - betaStart[n]) for n in range(ncol)])
            sumBeta = sum([abs(betaInner[n]) for n in range(ncol)])
            distInner = sumDiff/sumBeta

        #print number of steps for inner and middle loop convergence to monitor behavior
        #print(iStep, iterIRLS, iterInner)

        #after exiting the inner while loop, check the change in the IRLS betas
        #to see whether IRLS has converged
        a = sum([abs(betaIRLS[i] - betaInner[i]) for i in range(ncol)])
        b = sum([abs(betaIRLS[i]) for i in range(ncol)])
        distIRLS = a / (b + 0.0001)
        dBeta = [betaInner[i] - betaIRLS[i] for i in range(ncol)]
        gradStep = 1.0
        temp = [betaIRLS[i] + gradStep * dBeta[i] for i in range(ncol)]
        betaIRLS = list(temp)

    beta = list(betaIRLS)
    beta0 = beta0IRLS
    betaMat.append(list(beta))
    beta0List.append(beta0)

    nzBeta = [index for index in range(ncol) if beta[index] != 0.0]
    for q in nzBeta:
        if not(q in nzList):
            nzList.append(q)

#make up names for the attribute columns
names = ['V' + str(i) for i in range(ncol)]
nameList = [names[nzList[i]] for i in range(len(nzList))]
print("Attributes Ordered by How Early They Enter the Model")
print(nameList)

#plot the trajectory of each coefficient over the lambda steps
for i in range(ncol):
    coefCurve = [betaMat[k][i] for k in range(nSteps)]
    xaxis = range(nSteps)
    plot.plot(xaxis, coefCurve)

plot.xlabel("Steps Taken")
plot.ylabel("Coefficient Values")
plot.show()
Ensemble Methods
Classifying using Random Forest
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn import ensemble

#xNum: the numeric attribute rows; here assumed to be the xList of floats built in the reading step above
xNum = xList

#number of rows and columns in x matrix
nrows = len(xNum)
ncols = len(xNum[1])

#form x and y into numpy arrays and make up column names
X = np.array(xNum)
y = np.array(labels)
rocksVMinesNames = np.array(['V' + str(i) for i in range(ncols)])

#break into training and test sets
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size=0.30, random_state=531)

auc = []
nTreeList = range(50, 2000, 50)
for iTrees in nTreeList:
    depth = None
    maxFeat = 8 #try tweaking
    rocksVMinesRFModel = ensemble.RandomForestClassifier(n_estimators=iTrees, max_depth=depth,
                                                         max_features=maxFeat,
                                                         oob_score=False, random_state=531)
    rocksVMinesRFModel.fit(xTrain, yTrain)

    #accumulate auc on the test set
    prediction = rocksVMinesRFModel.predict_proba(xTest)
    aucCalc = roc_auc_score(yTest, prediction[:, 1])
    auc.append(aucCalc)

print("AUC")
print(auc[-1])

#plot test-set AUC vs number of trees in ensemble
plot.plot(nTreeList, auc)
plot.xlabel('Number of Trees in Ensemble')
plot.ylabel('Area Under ROC Curve - AUC')
plot.show()
For AUC, 1.0 is perfect, and 0.5 is perfectly bad. So, with AUC, larger is better, and instead of looking for a valley in the plot, you’re looking for a peak.
The figure shows a peak toward the left side of the plot.
However, because Random Forest only reduces variance and does not overfit, the peak can be attributed to random fluctuation.
The best choice of model is the one including all the trees whose performance is the rightmost point on the curve.
#plot feature importance
featureImportance = rocksVMinesRFModel.feature_importances_

#normalize by max importance
featureImportance = featureImportance / featureImportance.max()

#plot importance of the top 30 attributes
idxSorted = np.argsort(featureImportance)[30:60]
idxTemp = np.argsort(featureImportance)[::-1]
print(idxTemp)
barPos = np.arange(idxSorted.shape[0]) + .5
plot.barh(barPos, featureImportance[idxSorted], align='center')
plot.yticks(barPos, rocksVMinesNames[idxSorted])
plot.xlabel('Variable Importance')
plot.show()
The different attributes in the mine detection problem correspond to different frequencies of sonar signal and therefore different wavelengths. If you were given the problem of designing the machine learning for this problem, your next step might be to determine the wavelengths corresponding to these variables and compare those wavelengths to the characteristic dimensions of the rocks and mines in the test and training set.
#plot ROC curve for the random forest predictions
fpr, tpr, thresh = roc_curve(yTest, prediction[:, 1])
ctClass = [i*0.01 for i in range(101)]
plot.plot(fpr, tpr, linewidth=2)
plot.plot(ctClass, ctClass, linestyle=':')
plot.xlabel('False Positive Rate')
plot.ylabel('True Positive Rate')
plot.show()
#pick some threshold values and calculate the confusion matrix for the random forest predictions
#pick threshold values at the 25th, 50th and 75th percentiles of the returned thresholds
idx25 = int(len(thresh) * 0.25)
idx50 = int(len(thresh) * 0.50)
idx75 = int(len(thresh) * 0.75)
#calculate total points, total positives and total negatives
totalPts = len(yTest)
P = sum(yTest)
N = totalPts - P
print('')
print('Confusion Matrices for Different Threshold Values')
#25th
TP = tpr[idx25] * P; FN = P - TP; FP = fpr[idx25] * N; TN = N - FP
print('')
print('Threshold Value = ', thresh[idx25])
print('TP = ', TP/totalPts, 'FP = ', FP/totalPts)
print('FN = ', FN/totalPts, 'TN = ', TN/totalPts)
#50th
TP = tpr[idx50] * P; FN = P - TP; FP = fpr[idx50] * N; TN = N - FP
print('')
print('Threshold Value = ', thresh[idx50])
print('TP = ', TP/totalPts, 'FP = ', FP/totalPts)
print('FN = ', FN/totalPts, 'TN = ', TN/totalPts)
#75th
TP = tpr[idx75] * P; FN = P - TP; FP = fpr[idx75] * N; TN = N - FP
print('')
print('Threshold Value = ', thresh[idx75])
print('TP = ', TP/totalPts, 'FP = ', FP/totalPts)
print('FN = ', FN/totalPts, 'TN = ', TN/totalPts)
Detecting Unexploded Mines with Python Gradient Boosting
#form x and y into numpy arrays and make up column names
X = np.array(xNum)
y = np.array(labels)
rockVMinesNames = np.array(['V' + str(i) for i in range(ncols)])

#break into training and test sets
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size=0.30, random_state=531)

#instantiate model
nEst = 2000
depth = 3
learnRate = 0.007
maxFeatures = 20
rockVMinesGBMModel = ensemble.GradientBoostingClassifier(n_estimators=nEst, max_depth=depth,
                                                         learning_rate=learnRate,
                                                         max_features=maxFeatures)
#train
rockVMinesGBMModel.fit(xTrain, yTrain)

#compute auc on the test set as a function of ensemble size
auc = []
aucBest = 0.0
predictions = rockVMinesGBMModel.staged_decision_function(xTest)
for p in predictions:
    aucCalc = roc_auc_score(yTest, p)
    auc.append(aucCalc)
    #capture best predictions
    if aucCalc > aucBest:
        aucBest = aucCalc
        pBest = p
idxBest = auc.index(max(auc))
#print best values
print("Best AUC" )
print(auc[idxBest])
print("Number of Trees for Best AUC")
print(idxBest)
#plot training deviance and test auc's vs number of trees in ensemble
plot.figure()
plot.plot(range(1, nEst + 1), rockVMinesGBMModel.train_score_, label='Training Set Deviance', linestyle=":")
plot.plot(range(1, nEst + 1), auc, label='Test Set AUC')
plot.legend(loc='upper right')
plot.xlabel('Number of Trees in Ensemble')
plot.ylabel('Deviance / AUC')
plot.show()
Deviance is related to how far the probability estimates are from correct, but it differs slightly from misclassification error. Deviance is plotted because that quantity is what gradient boosting is training to improve; it's included to show the progress of training. The AUC on out-of-sample data is also plotted to show how out-of-sample performance changes as the number of trees increases (equivalently, as more gradient steps are taken; each step results in training an additional tree).
#plot feature importance
featureImportance = rockVMinesGBMModel.feature_importances_

#normalize by max importance
featureImportance = featureImportance / featureImportance.max()

#plot importance of the top 30 attributes
idxSorted = np.argsort(featureImportance)[30:60]
barPos = np.arange(idxSorted.shape[0]) + .5
plot.barh(barPos, featureImportance[idxSorted], align='center')
plot.yticks(barPos, rockVMinesNames[idxSorted])
plot.xlabel('Variable Importance')
plot.show()
The variable importances have a somewhat different order than the ones for Random Forest.
There is some commonality; for example, variables V10, V11, V20 and V51 are near the top of both lists although not in quite the same order.
#pick some threshold values and calc confusion matrix for best predictions
#notice that GBM predictions don't fall in range of (0, 1)
#plot best version of ROC curve
fpr, tpr, thresh = roc_curve(yTest, list(pBest))
ctClass = [i*0.01 for i in range(101)]
plot.plot(fpr, tpr, linewidth=2)
plot.plot(ctClass, ctClass, linestyle=':')
plot.xlabel('False Positive Rate')
plot.ylabel('True Positive Rate')
plot.show()
Gradient Boosting does get better results using Random Forest base learners, but the difference isn’t large enough to be obvious in the graph.
#pick some threshold values and calc confusion matrix for best predictions
#notice that GBM predictions don't fall in range of (0, 1)
#pick threshold values at 25th, 50th and 75th percentiles
idx25 = int(len(thresh) * 0.25)
idx50 = int(len(thresh) * 0.50)
idx75 = int(len(thresh) * 0.75)
#calculate total points, total positives and total negatives
totalPts = len(yTest)
P = sum(yTest)
N = totalPts - P
print('')
print('Confusion Matrices for Different Threshold Values')
#25th
TP = tpr[idx25] * P; FN = P - TP; FP = fpr[idx25] * N; TN = N - FP
print('')
print('Threshold Value = ', thresh[idx25])
print('TP = ', TP/totalPts, 'FP = ', FP/totalPts)
print('FN = ', FN/totalPts, 'TN = ', TN/totalPts)
#50th
TP = tpr[idx50] * P; FN = P - TP; FP = fpr[idx50] * N; TN = N - FP
print('')
print('Threshold Value = ', thresh[idx50])
print('TP = ', TP/totalPts, 'FP = ', FP/totalPts)
print('FN = ', FN/totalPts, 'TN = ', TN/totalPts)
#75th
TP = tpr[idx75] * P; FN = P - TP; FP = fpr[idx75] * N; TN = N - FP
print('')
print('Threshold Value = ', thresh[idx75])
print('TP = ', TP/totalPts, 'FP = ', FP/totalPts)
print('FN = ', FN/totalPts, 'TN = ', TN/totalPts)
Reference
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1118961749.html
Great Read