Categories

# Predictive Analysis , Binary Classification (Cookbook) – 1

This notebook contains my notes for Predictive Analysis on Binary Classification. It acts as a cookbook.

# Importing and sizing up a New Data Set

The file is comma delimited, with the data for one experiment occupying one line of text. This makes it a simple matter to read a line, split it on the comma delimiters, and stack the resulting lists into an outer list containing the whole data set.
In [1]:
```import urllib
from urllib.request import urlopen
import sys
#read data from uci data repository
target_url = urllib.request.Request("https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data")
data = urlopen(target_url)

#arrange data into list for labels and list of lists for attributes
xList = []
labels = []
for line in data:
#split on comma
row = line.strip().decode().split(",")
xList.append(row)
sys.stdout.write("Number of Rows of Data = " + str(len(xList)) + '\n')
sys.stdout.write("Number of Columns of Data = " + str(len(xList[1])))
```
```Number of Rows of Data = 208
Number of Columns of Data = 61```

If the data set has many more columns than rows, you may be more likely to get the best prediction with penalized linear regression and vice versa.

# Determining the Nature of Attributes

The next step in the checklist is to determine how many of the columns of data are numeric versus categorical

In [4]:
```#arrange data into list for labels and list of lists for attributes

nrow = len(xList)
ncol = len(xList[1])

type = [0]*3
colCounts = []

for col in range(ncol):
for row in xList:
try:
a = float(row[col])
if isinstance(a, float):
type[0] += 1
except ValueError:
if len(row[col]) > 0:
type[1] += 1
else:
type[2] += 1

colCounts.append(type)
type = [0]*3

sys.stdout.write("Col#" + '\t' + "Number" + '\t' + "Strings" + '\t ' + "Other\n")
iCol = 0
#I deleted one /t to make it align.
for types in colCounts:
sys.stdout.write(str(iCol) + '\t' + str(types[0]) + '\t' +
str(types[1]) + '\t' + str(types[2]) + "\n")
iCol += 1
```
```Col#	Number	Strings	 Other
0	208	0	0
1	208	0	0
2	208	0	0
3	208	0	0
4	208	0	0
5	208	0	0
6	208	0	0
7	208	0	0
8	208	0	0
9	208	0	0
10	208	0	0
11	208	0	0
12	208	0	0
13	208	0	0
14	208	0	0
15	208	0	0
16	208	0	0
17	208	0	0
18	208	0	0
19	208	0	0
20	208	0	0
21	208	0	0
22	208	0	0
23	208	0	0
24	208	0	0
25	208	0	0
26	208	0	0
27	208	0	0
28	208	0	0
29	208	0	0
30	208	0	0
31	208	0	0
32	208	0	0
33	208	0	0
34	208	0	0
35	208	0	0
36	208	0	0
37	208	0	0
38	208	0	0
39	208	0	0
40	208	0	0
41	208	0	0
42	208	0	0
43	208	0	0
44	208	0	0
45	208	0	0
46	208	0	0
47	208	0	0
48	208	0	0
49	208	0	0
50	208	0	0
51	208	0	0
52	208	0	0
53	208	0	0
54	208	0	0
55	208	0	0
56	208	0	0
57	208	0	0
58	208	0	0
59	208	0	0
60	0	208	0
```

The code runs down each column and adds up the number of entries that are numeric (int or float), the number of entries that are non-empty strings, and the number that are empty. The result is that the first 60 columns contain all numeric values and the last column contains all strings. The string values are the labels.

Generally, categorical variables are presented as strings, as in this example. In some cases, binary‐valued categorical variables are presented as a 0,1 numeric variable.

###### After determining which attributes are categorical and which are numeric, you’ll

want some descriptive statistics for the numeric variables and a count of the unique categories in each categorical attribute.

continued – Part 2

Reference

http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1118961749.html

Gabrielsays: