
Predictive Analysis, Binary Classification (Cookbook) – 3

Posted in Data Analysis Resources, Machine Learning, Predictive Analysis

This notebook contains my notes on predictive analysis for binary classification, organized as a cookbook. It continues from the previous post on summary statistics.

Using Python Pandas to Read Data

In [12]:
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plot
%matplotlib inline
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-"
"databases/undocumented/connectionist-bench/sonar/sonar.all-data")

#read rocks versus mines data into pandas data frame
#(note: the prefix argument was removed in pandas 2.0; on newer
# versions pass names=["V%d" % i for i in range(61)] instead)
rocksVMines = pd.read_csv(target_url, header=None, prefix="V")

#print head and tail of data frame
print(rocksVMines.head())
print(rocksVMines.tail())
       V0      V1      V2      V3      V4      V5      V6      V7      V8  \
0  0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1  0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2  0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3  0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4  0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   

       V9 ...      V51     V52     V53     V54     V55     V56     V57  \
0  0.2111 ...   0.0027  0.0065  0.0159  0.0072  0.0167  0.0180  0.0084   
1  0.2872 ...   0.0084  0.0089  0.0048  0.0094  0.0191  0.0140  0.0049   
2  0.6194 ...   0.0232  0.0166  0.0095  0.0180  0.0244  0.0316  0.0164   
3  0.1264 ...   0.0121  0.0036  0.0150  0.0085  0.0073  0.0050  0.0044   
4  0.4459 ...   0.0031  0.0054  0.0105  0.0110  0.0015  0.0072  0.0048   

      V58     V59  V60  
0  0.0090  0.0032    R  
1  0.0052  0.0044    R  
2  0.0095  0.0078    R  
3  0.0040  0.0117    R  
4  0.0107  0.0094    R  

[5 rows x 61 columns]
         V0      V1      V2      V3      V4      V5      V6      V7      V8  \
203  0.0187  0.0346  0.0168  0.0177  0.0393  0.1630  0.2028  0.1694  0.2328   
204  0.0323  0.0101  0.0298  0.0564  0.0760  0.0958  0.0990  0.1018  0.1030   
205  0.0522  0.0437  0.0180  0.0292  0.0351  0.1171  0.1257  0.1178  0.1258   
206  0.0303  0.0353  0.0490  0.0608  0.0167  0.1354  0.1465  0.1123  0.1945   
207  0.0260  0.0363  0.0136  0.0272  0.0214  0.0338  0.0655  0.1400  0.1843   

         V9 ...      V51     V52     V53     V54     V55     V56     V57  \
203  0.2684 ...   0.0116  0.0098  0.0199  0.0033  0.0101  0.0065  0.0115   
204  0.2154 ...   0.0061  0.0093  0.0135  0.0063  0.0063  0.0034  0.0032   
205  0.2529 ...   0.0160  0.0029  0.0051  0.0062  0.0089  0.0140  0.0138   
206  0.2354 ...   0.0086  0.0046  0.0126  0.0036  0.0035  0.0034  0.0079   
207  0.2354 ...   0.0146  0.0129  0.0047  0.0039  0.0061  0.0040  0.0036   

        V58     V59  V60  
203  0.0193  0.0157    M  
204  0.0062  0.0067    M  
205  0.0077  0.0031    M  
206  0.0036  0.0048    M  
207  0.0061  0.0115    M  

[5 rows x 61 columns]

Notice from the head and tail that the rows are ordered by label: the first rows are all rocks (R) and the last rows are all mines (M). Structure in the way the data are stored, such as this ordering, needs to be factored into your approach for any subsequent sampling.
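Because the file is sorted by class, a naive head/tail split would put one class entirely in each set. One way to guard against that is a stratified sample per label. The sketch below (an assumption, not code from the original post) uses a small toy frame in place of rocksVMines and plain pandas, drawing 75% of each class for training:

```python
import pandas as pd

# Toy stand-in for rocksVMines: 20 "R" rows followed by 20 "M" rows,
# mimicking the class-sorted layout of the sonar file
toy = pd.DataFrame({"V0": range(40),
                    "V60": ["R"] * 20 + ["M"] * 20})

# Draw 75% of each class for training; the remainder is the test set
train = toy.groupby("V60").sample(frac=0.75, random_state=42)
test = toy.drop(train.index)

# Both splits keep the 50/50 R/M balance of the full data
print(train["V60"].value_counts().to_dict())
print(test["V60"].value_counts().to_dict())
```

A plain row shuffle (`toy.sample(frac=1.0)`) before splitting would usually suffice too, but stratifying guarantees the class proportions exactly.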

Using Python Pandas to Summarize Data

In [13]:
#print summary of data frame
summary = rocksVMines.describe()
print(summary)
               V0          V1          V2          V3          V4          V5  \
count  208.000000  208.000000  208.000000  208.000000  208.000000  208.000000   
mean     0.029164    0.038437    0.043832    0.053892    0.075202    0.104570   
std      0.022991    0.032960    0.038428    0.046528    0.055552    0.059105   
min      0.001500    0.000600    0.001500    0.005800    0.006700    0.010200   
25%      0.013350    0.016450    0.018950    0.024375    0.038050    0.067025   
50%      0.022800    0.030800    0.034300    0.044050    0.062500    0.092150   
75%      0.035550    0.047950    0.057950    0.064500    0.100275    0.134125   
max      0.137100    0.233900    0.305900    0.426400    0.401000    0.382300   

               V6          V7          V8          V9     ...             V50  \
count  208.000000  208.000000  208.000000  208.000000     ...      208.000000   
mean     0.121747    0.134799    0.178003    0.208259     ...        0.016069   
std      0.061788    0.085152    0.118387    0.134416     ...        0.012008   
min      0.003300    0.005500    0.007500    0.011300     ...        0.000000   
25%      0.080900    0.080425    0.097025    0.111275     ...        0.008425   
50%      0.106950    0.112100    0.152250    0.182400     ...        0.013900   
75%      0.154000    0.169600    0.233425    0.268700     ...        0.020825   
max      0.372900    0.459000    0.682800    0.710600     ...        0.100400   

              V51         V52         V53         V54         V55         V56  \
count  208.000000  208.000000  208.000000  208.000000  208.000000  208.000000   
mean     0.013420    0.010709    0.010941    0.009290    0.008222    0.007820   
std      0.009634    0.007060    0.007301    0.007088    0.005736    0.005785   
min      0.000800    0.000500    0.001000    0.000600    0.000400    0.000300   
25%      0.007275    0.005075    0.005375    0.004150    0.004400    0.003700   
50%      0.011400    0.009550    0.009300    0.007500    0.006850    0.005950   
75%      0.016725    0.014900    0.014500    0.012100    0.010575    0.010425   
max      0.070900    0.039000    0.035200    0.044700    0.039400    0.035500   

              V57         V58         V59  
count  208.000000  208.000000  208.000000  
mean     0.007949    0.007941    0.006507  
std      0.006470    0.006181    0.005031  
min      0.000300    0.000100    0.000600  
25%      0.003600    0.003675    0.003100  
50%      0.005800    0.006400    0.005300  
75%      0.010350    0.010325    0.008525  
max      0.044000    0.036400    0.043900  

[8 rows x 60 columns]

Notice that the summary produced by the describe function is itself a data frame, so you can automate the process of screening attributes for outliers. To do that, compare the differences between the various quantiles and raise a flag if any difference for an attribute is out of scale with the other differences for the same attribute (for example, if the gap between the 75th percentile and the maximum dwarfs the interquartile range). The output shown suggests that several attributes have outliers; it is worth determining how many rows are involved.
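The quantile-gap screen described above can be sketched as follows. This is an illustrative assumption, not code from the original post: the toy frame and the ratio threshold of 3.0 are both made up for the example; in practice you would run it on the describe() output of rocksVMines and tune the threshold.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Toy frame: V0 is well-behaved, V1 has one gross outlier injected
df = pd.DataFrame({"V0": rng.normal(0.1, 0.02, 208),
                   "V1": rng.normal(0.1, 0.02, 208)})
df.loc[0, "V1"] = 5.0

summary = df.describe()          # the describe output is a data frame
iqr = summary.loc["75%"] - summary.loc["25%"]
upper_gap = summary.loc["max"] - summary.loc["75%"]

# Flag attributes whose upper tail is out of scale with the IQR
flagged = upper_gap > 3.0 * iqr
print(flagged.to_dict())
```

The same comparison can be run against the lower tail (`25% - min`) to catch outliers on the small side.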

continued – part 4
