4 different ways to predict survival on Titanic – part 2

Posted in Data Analysis Resources, Kaggle, Predictive Analysis
continued from part 1

Classification

KNeighborsClassifier

In [16]:
from sklearn.neighbors import KNeighborsClassifier
alg_ngbh = KNeighborsClassifier(n_neighbors=3)
scores = cross_validation.cross_val_score(alg_ngbh, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (k-neighbors): {}/{}".format(scores.mean(), scores.std()))
Accuracy (k-neighbors): 0.7957351290684623/0.011110544261068086

SGDClassifier

In [17]:
from sklearn.linear_model import SGDClassifier
alg_sgd = SGDClassifier(random_state=1)
scores = cross_validation.cross_val_score(alg_sgd, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (sgd): {}/{}".format(scores.mean(), scores.std()))
Accuracy (sgd): 0.7239057239057239/0.015306601231185043

SVC

In [18]:
from sklearn.svm import SVC
alg_svm = SVC(C=1.0)
scores = cross_validation.cross_val_score(alg_svm, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (svm): {}/{}".format(scores.mean(), scores.std()))
Accuracy (svm): 0.8193041526374859/0.010408101566212916

GaussianNB

In [19]:
from sklearn.naive_bayes import GaussianNB
alg_nbs = GaussianNB()
scores = cross_validation.cross_val_score(alg_nbs, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (naive bayes): {}/{}".format(scores.mean(), scores.std()))
Accuracy (naive bayes): 0.6206509539842874/0.16339123708526998

LinearRegression

In [ ]:
from sklearn import metrics

def linear_scorer(estimator, x, y):
    # predict real-valued outputs and threshold them at 0.5 to get class labels
    scorer_predictions = estimator.predict(x)

    scorer_predictions[scorer_predictions > 0.5] = 1
    scorer_predictions[scorer_predictions <= 0.5] = 0

    return metrics.accuracy_score(y, scorer_predictions)
In [ ]:
from sklearn import linear_model
alg_lnr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(alg_lnr, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1,
                         scoring=linear_scorer)
print("Accuracy (linear regression): {}/{}".format(scores.mean(), scores.std()))

The linear_scorer function is needed because LinearRegression returns an arbitrary real number rather than a class label. We therefore split its output at the 0.5 boundary and assign each prediction to one of the two classes, 0 or 1.
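
To make that thresholding step concrete, here is a tiny stand-alone sketch on made-up prediction values (not output from the notebook):

import numpy as np

# made-up raw regression outputs, just to show the 0.5 threshold
raw = np.array([0.12, 0.47, 0.51, 0.93])
labels = (raw > 0.5).astype(int)
print(labels)  # [0 0 1 1]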

In [ ]:
from sklearn import linear_model
alg_log = linear_model.LogisticRegression(random_state=1)
scores = cross_validation.cross_val_score(alg_log, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1,
                         scoring=linear_scorer)
print("Accuracy (logistic regression): {}/{}".format(scores.mean(), scores.std()))
In [20]:
from sklearn.ensemble import RandomForestClassifier
alg_frst = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=8, min_samples_leaf=2)
scores = cross_validation.cross_val_score(alg_frst, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (random forest): {}/{}".format(scores.mean(), scores.std()))
Accuracy (random forest): 0.8294051627384961/0.004199391006480274
In [21]:
from sklearn.grid_search import GridSearchCV

alg_frst_model = RandomForestClassifier(random_state=1)
alg_frst_params = [{
    "n_estimators": [350, 400, 450],
    "min_samples_split": [6, 8, 10],
    "min_samples_leaf": [1, 2, 4]
}]
alg_frst_grid = GridSearchCV(alg_frst_model, alg_frst_params, cv=cv, refit=True, verbose=1, n_jobs=-1)
alg_frst_grid.fit(train_data_scaled, train_data_munged["Survived"])
alg_frst_best = alg_frst_grid.best_estimator_
print("Accuracy (random forest auto): {} with params {}"
      .format(alg_frst_grid.best_score_, alg_frst_grid.best_params_))
Fitting 3 folds for each of 27 candidates, totalling 81 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.0s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   15.3s finished
Accuracy (random forest auto): 0.8372615039281706 with params {'min_samples_split': 6, 'min_samples_leaf': 4, 'n_estimators': 400}
In [22]:
alg_test = alg_frst_best

alg_test.fit(train_data_scaled, train_data_munged["Survived"])

predictions = alg_test.predict(test_data_scaled)

submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions
})

submission.to_csv("titanic-submission.csv", index=False)

2. Way to predict survival on Titanic

These notes are from this link

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv",header=0)
In [3]:
#Lets take a look at the data format below
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

If you look carefully at the pandas summary above, there are 891 rows in total, but Age has only 714 non-null values (so some are missing), Embarked is missing 2, and Cabin is missing a lot as well. Object data types are non-numeric, so we have to find a way to encode them as numerical values. One such way is columnisation, i.e. turning the distinct row values into column headers (one-hot encoding).
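
As a quick illustration of what this encoding produces, here is a toy example on a made-up Series (not part of the notebook's data; the real dataframe is encoded a few cells below):

import pandas as pd

# made-up column just to show how get_dummies turns values into column headers
sex = pd.Series(["male", "female", "female", "male"])
print(pd.get_dummies(sex))  # one indicator column per distinct value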

Let's try to drop some of the columns which may not contribute much to the prediction.

In [4]:
cols = ['Name','Ticket','Cabin']
df = df.drop(cols,axis=1)

If we wanted to, we could simply drop all rows with missing values:

In [5]:
#df = df.dropna()

If we did that, the dataset would shrink from 891 rows to 712, which means we would be throwing data away. Machine learning models need data for training to perform well, so we preserve the rows and make use of as much of the data as we can.
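
A quick way to see how much data dropna() would cost, without actually modifying df (a small sketch; the 712 figure assumes the same train.csv as above, with Name, Ticket and Cabin already dropped):

# count complete rows a dropna() would keep, leaving df unchanged
print(len(df), len(df.dropna()))  # expected: 891 712 for this dataset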

Now we convert the Pclass, Sex, Embarked to columns in pandas and drop them after conversion.

In [6]:
dummies = []
cols = ['Pclass','Sex','Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
In [7]:
titanic_dummies = pd.concat(dummies, axis=1)
In [8]:
titanic_dummies.head(2)
Out[8]:
     1    2    3  female  male    C    Q    S
0  0.0  0.0  1.0     0.0   1.0  0.0  0.0  1.0
1  1.0  0.0  0.0     1.0   0.0  1.0  0.0  0.0
In [9]:
#finally we concatenate to the original dataframe columnwise
df = pd.concat((df,titanic_dummies),axis=1)

Now that we have converted the Pclass, Sex, and Embarked values into columns, we drop the now-redundant original columns from the dataframe.

In [10]:
df = df.drop(['Pclass','Sex','Embarked'],axis=1)
In [11]:
#now look on the new dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
1              891 non-null float64
2              891 non-null float64
3              891 non-null float64
female         891 non-null float64
male           891 non-null float64
C              891 non-null float64
Q              891 non-null float64
S              891 non-null float64
dtypes: float64(10), int64(4)
memory usage: 97.5 KB

All is good, except Age, which still has lots of missing values. Let's fill those missing ages, either with the median or with interpolate(). Pandas has a handy interpolate() function that replaces all the missing NaNs with interpolated values.
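
The median-based alternative mentioned above would look something like this (a sketch only; the notebook uses interpolate() below, so this line is shown commented out, in the same spirit as the dropna() cell earlier):

#df['Age'] = df['Age'].fillna(df['Age'].median())  # fill missing ages with the median age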

In [12]:
df['Age'] = df['Age'].interpolate()
In [13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
1              891 non-null float64
2              891 non-null float64
3              891 non-null float64
female         891 non-null float64
male           891 non-null float64
C              891 non-null float64
Q              891 non-null float64
S              891 non-null float64
dtypes: float64(10), int64(4)
memory usage: 97.5 KB

Machine Learning

X = the input set with 14 attributes (we will remove Survived from it in a moment)

y = the output, in this case 'Survived'

Now we convert our dataframe from pandas to numpy arrays and assign the input and output:

In [14]:
X = df.values
y = df['Survived'].values

X still has the Survived values in it, which should not be there, so we delete that column (column index 1) from the numpy array.

In [15]:
X = np.delete(X,1,axis=1)

Now that we are ready with X and y, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's cross-validation utilities.

In [18]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

Let's start with a simple decision tree classifier and see how it goes.

In [19]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
Out[19]:
0.78731343283582089

Not bad, it gives a score of 78.73%.

After fitting a decision tree on the dataset, the attribute feature_importances_ tells you which columns of the data contribute most to the splits, and hence to the predictions.

In [20]:
clf.feature_importances_
Out[20]:
array([ 0.08944147,  0.11138782,  0.05743406,  0.02186072,  0.05721831,
        0.03783151,  0.        ,  0.10597366,  0.50209245,  0.        ,
        0.        ,  0.        ,  0.01676001])

This output shows that the second element in the array, 0.111, corresponds to 'Age' with about 11% importance, and the fifth element from the end, 0.502, is 'female' with about 50% importance. Very interesting! Yes, a large share of the Titanic's survivors were women and children.
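
To avoid counting array positions by hand, we can pair each importance with its column name (a small sketch; it assumes df and clf as defined above, and drops 'Survived' from the column list so that the names line up with the columns of X):

# pair each column name with its importance; 'Survived' was removed from X earlier
feature_names = [c for c in df.columns if c != 'Survived']
for name, importance in zip(feature_names, clf.feature_importances_):
    print(name, round(importance, 3))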

Random Forests

In [21]:
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(n_estimators=100)
clf.fit (X_train, y_train)
clf.score (X_test, y_test)
Out[21]:
0.81343283582089554

Gradient boosting

In [22]:
clf = ensemble.GradientBoostingClassifier()
clf.fit (X_train, y_train)
clf.score (X_test, y_test)
Out[22]:
0.81343283582089554
In [23]:
#Let's not give up; let's play around and fine-tune this gradient booster.
clf = ensemble.GradientBoostingClassifier(n_estimators=50)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
Out[23]:
0.83208955223880599
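
If you want to push the tuning further, a grid search over a few gradient-boosting parameters is a natural next step. The sketch below reuses the GridSearchCV approach from part 1; the parameter ranges are illustrative assumptions, not tuned values (and with a recent scikit-learn you would import GridSearchCV from sklearn.model_selection instead of sklearn.grid_search):

from sklearn.grid_search import GridSearchCV

# illustrative parameter grid; the ranges are assumptions, not tuned values
gbc_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.1]
}
gbc_grid = GridSearchCV(ensemble.GradientBoostingClassifier(), gbc_params, cv=3, n_jobs=-1)
gbc_grid.fit(X_train, y_train)
print(gbc_grid.best_params_, gbc_grid.best_score_)
print(gbc_grid.best_estimator_.score(X_test, y_test))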

continued in part 3
