This notebook contains my notes for Predictive Analysis on Binary Classification. It acts as a cookbook.
Importing and Sizing Up a New Data Set
import sys
import urllib.request
from urllib.request import urlopen

# read data from the UCI data repository
target_url = urllib.request.Request("https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data")
data = urlopen(target_url)

# arrange data into a list of lists, one inner list per row
# (labels is declared here but not filled in this step)
xList = []
labels = []
for line in data:
    # split on comma
    row = line.strip().decode().split(",")
    xList.append(row)

sys.stdout.write("Number of Rows of Data = " + str(len(xList)) + '\n')
sys.stdout.write("Number of Columns of Data = " + str(len(xList[0])) + '\n')
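The row and column counts matter for choosing a method. If the data set has many more columns than rows, you may be more likely to get the best prediction with penalized linear regression; if it has many more rows than columns, penalized linear regression is less likely to be the best choice.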
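A minimal sketch of this size check follows. The helper name `shape_summary` and the tiny sample table are made up for illustration; what counts as "many more" columns than rows is a judgment call that the sketch does not settle.

```python
def shape_summary(rows):
    """Return (n_rows, n_cols, rows_per_col) for a list-of-lists table."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rows_per_col = n_rows / n_cols if n_cols else float("nan")
    return n_rows, n_cols, rows_per_col

# hypothetical 3-row, 4-column sample standing in for xList
sample = [["0.1", "0.2", "0.3", "R"],
          ["0.4", "0.5", "0.6", "M"],
          ["0.7", "0.8", "0.9", "M"]]

n_rows, n_cols, ratio = shape_summary(sample)
print(f"rows={n_rows}, cols={n_cols}, rows per column={ratio:.2f}")
```

With the real data, passing `xList` instead of `sample` would give the same three numbers for the sonar file.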
Determining the Nature of Attributes
The next step in the checklist is to determine how many of the columns of data are numeric versus categorical.
# count numeric, non-empty string, and empty entries in each column
nrow = len(xList)
ncol = len(xList[1])
# [numeric, non-empty string, empty]; renamed from type to avoid shadowing the built-in
typeCounts = [0]*3
colCounts = []
for col in range(ncol):
    for row in xList:
        try:
            # float() raises ValueError for anything that is not a number
            float(row[col])
            typeCounts[0] += 1
        except ValueError:
            if len(row[col]) > 0:
                typeCounts[1] += 1
            else:
                typeCounts[2] += 1
    colCounts.append(typeCounts)
    typeCounts = [0]*3

# I deleted one \t from the header to make it align.
sys.stdout.write("Col#" + '\t' + "Number" + '\t' + "Strings" + '\t ' + "Other\n")
iCol = 0
for types in colCounts:
    sys.stdout.write(str(iCol) + '\t' + str(types[0]) + '\t' +
                     str(types[1]) + '\t' + str(types[2]) + "\n")
    iCol += 1
The code runs down each column and adds up the number of entries that are numeric (int or float), the number of entries that are non-empty strings, and the number that are empty. The result is that the first 60 columns contain all numeric values and the last column contains all strings. The string values are the labels.
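One way to turn those per-column tallies into a decision is sketched here. The function `classify_columns` and the sample tallies are hypothetical; the tallies only mimic the `[numeric, string, empty]` layout of `colCounts`, and the rule used (all numeric versus all string) is an assumption, not something the notes prescribe.

```python
def classify_columns(col_counts):
    """Label each column from its [numeric, string, empty] tallies."""
    kinds = []
    for numeric, strings, empty in col_counts:
        if numeric > 0 and strings == 0:
            kinds.append("numeric")
        elif strings > 0 and numeric == 0:
            kinds.append("categorical")
        else:
            # both kinds present, or nothing but empty entries
            kinds.append("mixed")
    return kinds

# hypothetical tallies for a 5-row table with three columns
sample_counts = [[5, 0, 0], [4, 0, 1], [0, 5, 0]]

kinds = classify_columns(sample_counts)
print(kinds)

# columns made up only of strings are candidates for the label column
label_cols = [i for i, kind in enumerate(kinds) if kind == "categorical"]
print("candidate label columns:", label_cols)
```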
Generally, categorical variables are presented as strings, as in this example. In some cases, binary‐valued categorical variables are presented as a 0,1 numeric variable.
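Because a 0,1 column passes the numeric test, it can hide a categorical variable. The sketch below flags such columns; `is_binary_coded` is a made-up helper, and treating any column whose values are all 0 or 1 as categorical is only a heuristic.

```python
def is_binary_coded(column):
    """True if every non-empty entry parses as a number equal to 0 or 1."""
    seen = set()
    for entry in column:
        if entry == "":
            continue  # skip empty entries
        try:
            seen.add(float(entry))
        except ValueError:
            return False  # a non-numeric string rules the column out
    return len(seen) > 0 and seen <= {0.0, 1.0}

# hypothetical columns, each given as a list of strings like a column of xList
print(is_binary_coded(["0", "1", "1", "0"]))
print(is_binary_coded(["0.2", "0.9", "1"]))
print(is_binary_coded(["R", "M"]))
```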
After determining which attributes are categorical and which are numeric, you’ll want some descriptive statistics for the numeric variables and a count of the unique categories in each categorical attribute.
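A minimal sketch of both summaries, using only the standard library. The helper names `summarize_numeric` and `summarize_categorical` and the sample columns are assumptions for illustration; the choice of mean, standard deviation, minimum, and maximum is one reasonable set of statistics, not the only one.

```python
from collections import Counter
import statistics

def summarize_numeric(column):
    """Basic descriptive statistics for a column of numeric strings."""
    values = [float(v) for v in column]
    return {"mean": statistics.mean(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
            "min": min(values),
            "max": max(values)}

def summarize_categorical(column):
    """Count how often each category appears in a column."""
    return dict(Counter(column))

# hypothetical columns in the same string form as the columns of xList
numeric_col = ["1.0", "2.0", "3.0", "4.0"]
label_col = ["R", "M", "M", "R", "M"]

num_stats = summarize_numeric(numeric_col)
cat_counts = summarize_categorical(label_col)
print(num_stats)
print(cat_counts)
```

For the real data, a column can be pulled out of `xList` with a list comprehension such as `[row[0] for row in xList]` before passing it to either helper.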
Reference
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1118961749.html
Great Read