Predictive Analysis, Binary Classification (Cookbook) – 2

This notebook collects my notes on predictive analysis for binary classification, written as a cookbook. It continues from the previous post.

Summary Statistics for Numeric and Categorical Attributes

In [6]:
import sys
import numpy as np
#xList (the parsed rows of the data set) is assumed to come from the previous post
#generate summary statistics for column 3 (e.g.)
col = 3
colData = []
for row in xList:
    colData.append(float(row[col]))

colArray = np.array(colData)
colMean = np.mean(colArray)
colsd = np.std(colArray)
sys.stdout.write("Mean = " + '\t' + str(colMean) + '\t\t' +
            "Standard Deviation = " + '\t ' + str(colsd) + "\n")

#calculate quantile boundaries
ntiles = 4

percentBdry = []

for i in range(ntiles+1):
    percentBdry.append(np.percentile(colArray, i*(100)/ntiles))

sys.stdout.write("\nBoundaries for 4 Equal Percentiles \n")
print(percentBdry)
sys.stdout.write(" \n")

#run again with 10 equal intervals
ntiles = 10

percentBdry = []

for i in range(ntiles+1):
    percentBdry.append(np.percentile(colArray, i*(100)/ntiles))

sys.stdout.write("Boundaries for 10 Equal Percentiles \n")
print(percentBdry)
sys.stdout.write(" \n")

#The last column contains categorical variables

col = 60
colData = []
for row in xList:
    colData.append(row[col])

unique = set(colData)
sys.stdout.write("Unique Label Values \n")
print(unique)

#count up the number of elements having each value

catDict = dict(zip(list(unique),range(len(unique))))

catCount = [0]*len(unique)

for elt in colData:
    catCount[catDict[elt]] += 1

sys.stdout.write("\nCounts for Each Value of Categorical Label \n")
print(list(unique))
print(catCount)
Mean = 	0.0538923076923		Standard Deviation = 	 0.0464159832226

Boundaries for 4 Equal Percentiles 
[0.0057999999999999996, 0.024375000000000001, 0.044049999999999999, 0.064500000000000002, 0.4264]
Boundaries for 10 Equal Percentiles 
[0.0057999999999999996, 0.0141, 0.022740000000000003, 0.027869999999999995, 0.036220000000000002, 0.044049999999999999, 0.050719999999999987, 0.059959999999999986, 0.077940000000000009, 0.10836, 0.4264]
Unique Label Values 
{'M', 'R'}

Counts for Each Value of Categorical Label 
['M', 'R']
[111, 97]

The first step is to calculate the mean and standard deviation for the chosen attribute.

The next section of code looks for outliers. One way to reveal them is to divide the values into percentiles and compare the widths of the resulting intervals.

First the program calculates the quartiles. The output shows that the upper quartile (0.0645 to 0.4264) is much wider than the others, which are each about 0.02 wide. To be more certain, the decile boundaries are also calculated, and they similarly show that the upper decile (0.10836 to 0.4264) is unusually wide. Some widening is normal because distributions often thin out in the tails.
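
The width comparison can be made explicit by differencing neighboring boundaries. The following is a minimal sketch that reuses the quartile boundaries printed in the output above; the helper name interval_widths is introduced here for illustration and is not part of the original notebook.

```python
# Quartile boundaries for column 3, copied from the notebook output.
quartile_bdry = [0.0058, 0.024375, 0.04405, 0.0645, 0.4264]

def interval_widths(boundaries):
    """Return the width of each interval between consecutive boundaries."""
    return [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]

widths = interval_widths(quartile_bdry)
for k, w in enumerate(widths, start=1):
    print("quartile", k, "width =", round(w, 6))

# Ratio of the top interval to the widest of the remaining intervals.
ratio = widths[-1] / max(widths[:-1])
print("upper quartile is about", round(ratio, 1), "times wider")
```

A ratio far above 1 for the top interval is the signal discussed here. How large it must be before it counts as a problem is a judgment call.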

Visualization of Outliers Using Quantile‐Quantile Plot

In [8]:
import pylab
import scipy.stats as stats

#generate a Q-Q plot for column 3 (e.g.)
col = 3
colData = []
for row in xList:
    colData.append(float(row[col]))

stats.probplot(colData, dist="norm", plot=pylab)
pylab.show()

The resulting plot shows how the boundaries associated with empirical percentiles in the data compare to the boundaries for the same percentiles of a Gaussian distribution. If the data being analyzed come from a Gaussian distribution, the plotted points will lie close to a straight line.
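
To make the comparison concrete, the coordinates of a Q-Q plot can be computed by hand. The sketch below is a simplified illustration using only the standard library, not a reproduction of what stats.probplot does internally. It assumes plotting positions of (i + 0.5)/n, and the function name qq_points and the small sample are made up for the example.

```python
from statistics import NormalDist

def qq_points(data):
    """Pair each sorted observation with the standard normal quantile
    at the same plotting position (i + 0.5) / n (an assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Made-up sample with one large value in the upper tail.
sample = [0.02, 0.03, 0.01, 0.05, 0.04, 0.06, 0.43]
points = qq_points(sample)
for t, x in points:
    print(round(t, 3), x)
```

For Gaussian data the pairs fall close to a line. A point such as the last one here, sitting far above the trend of the others, is the kind of tail departure the plot is meant to expose.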

The tails of the rocks versus mines data contain more examples than the tails of a Gaussian density.

Outliers may cause trouble either for model building or prediction. After you’ve trained a model on this data set, you can look at the errors your model makes and see whether the errors are correlated with these outliers.
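
One simple way to check for such a correlation is to compare the error rate on the flagged outliers with the error rate on the remaining examples. The sketch below assumes you already have two parallel boolean lists. The names is_outlier, is_error, and error_rates are hypothetical, and the values are made up to show the mechanics.

```python
def error_rates(is_outlier, is_error):
    """Return (error rate among outliers, error rate among the rest)."""
    out_errs = [e for o, e in zip(is_outlier, is_error) if o]
    rest_errs = [e for o, e in zip(is_outlier, is_error) if not o]
    # Guard against empty groups to avoid dividing by zero.
    out_rate = sum(out_errs) / len(out_errs) if out_errs else float("nan")
    rest_rate = sum(rest_errs) / len(rest_errs) if rest_errs else float("nan")
    return out_rate, rest_rate

# Made-up flags for eight examples (not results from the rocks versus mines data).
is_outlier = [False, False, True, False, True, False, False, False]
is_error = [False, True, True, False, True, False, False, False]

out_rate, rest_rate = error_rates(is_outlier, is_error)
print("error rate on outliers:", out_rate)
print("error rate on the rest:", rest_rate)
```

A clearly higher rate in the first group suggests the outliers deserve separate treatment. With only a handful of outliers the comparison is noisy, so treat it as a hint rather than a test.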

If they are, you have a few options. You can segregate the outliers and train on them as a separate class. You can also edit them out of the data if they represent an abnormality that won’t be present in the data your model will see when deployed.

A reasonable process for this might be to generate quartile boundaries during the exploration phase and note potential outliers to get a feel for how much of a problem you might (or might not) have with them. Then when you’re evaluating performance data, use quantile‐quantile (Q‐Q) plots to determine which points to call outliers for use in your error analysis.
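
The exploration step could be sketched as follows. The notes above do not prescribe a cutoff, so this example swaps in Tukey's rule (flag anything above Q3 + 1.5 × IQR) as an assumed convention. The function name flag_upper_outliers and the sample values are made up for illustration.

```python
from statistics import quantiles

def flag_upper_outliers(values, k=1.5):
    """Flag values above Q3 + k * IQR (Tukey's rule, an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    fence = q3 + k * (q3 - q1)
    return [v > fence for v in values], fence

# Made-up attribute values with one large value in the upper tail.
values = [0.01, 0.02, 0.02, 0.03, 0.04, 0.04, 0.05, 0.06, 0.43]
flags, fence = flag_upper_outliers(values)
print("fence =", fence)
print("flagged:", [v for v, f in zip(values, flags) if f])
```

The resulting flags can be kept alongside the data so that they are available when you later examine the model's errors.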

continued - part 3

