4 different ways to predict survival on Titanic – part 3

Posted in Data Analysis Resources, Kaggle

continued from part 2

3. Way to predict survival on Titanic

These notes are from this link

I – Exploratory data analysis

We tweak the style of this notebook a little bit to have centered plots.

In [1]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")
Out[1]:
In [2]:
#Import the libraries

# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

pd.options.display.max_rows = 100
In [3]:
#Now let's start by loading the training set.
data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
In [4]:
#Pandas allows you to have a sneak peek at your data.
data.head(2)
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
In [5]:
data.describe()
Out[5]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [6]:
#The count variable shows that 177 values are missing in the Age column.
data['Age'].fillna(data['Age'].median(), inplace=True)
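
A quick sanity check that the imputation worked; a minimal sketch (before the fillna call above, the same expression would have returned 177):

# after the median fill, no ages should be missing
assert data['Age'].isnull().sum() == 0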

Let's now make some charts.

In [7]:
#Let's visualize survival based on the gender.
survived_sex = data[data['Survived']==1]['Sex'].value_counts()
dead_sex = data[data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survived','Dead']
df.plot(kind='bar',stacked=True, figsize=(15,8))
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x172f7150a58>

The Sex variable seems to be a decisive feature. Women are more likely to survive.

In [8]:
#Let's now correlate the survival with the age variable.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'],data[data['Survived']==0]['Age']], stacked=True, color = ['g','r'],
         bins = 30,label = ['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
Out[8]:
<matplotlib.legend.Legend at 0x172f74f4320>

If you follow the chart bin by bin, you will notice that passengers under 10 are more likely to survive than those aged roughly between 12 and 50. The oldest passengers also seem to have been rescued.

These first two charts are consistent with an old code of conduct that sailors and captains follow in threatening situations: “Women and children first!”.
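
To put numbers on this, a quick groupby gives the survival rate per sex; a small sketch using the same data frame (on this training set it comes out around 74% for women and 19% for men):

# survival rate by sex: the mean of the 0/1 Survived flag within each group
data.groupby('Sex')['Survived'].mean()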

In [9]:
#Let's now focus on the Fare ticket of each passenger and correlate it with the survival.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Fare'],data[data['Survived']==0]['Fare']], stacked=True, color = ['g','r'],
         bins = 30,label = ['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend()
Out[9]:
<matplotlib.legend.Legend at 0x172f7d5add8>

Passengers with cheaper ticket fares are more likely to die. Put differently, passengers with more expensive tickets, and therefore a higher social status, seem to have been rescued first.
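
One way to make this concrete is to bucket the fares into quartiles and compare survival rates; a sketch (pd.qcut and the bucket labels are our own additions, not part of the original notebook):

# split fares into four quantile-based buckets and compute survival per bucket
fare_buckets = pd.qcut(data['Fare'], 4, labels=['lowest', 'low', 'high', 'highest'])
data.groupby(fare_buckets)['Survived'].mean()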

In [10]:
# Let's now combine the age, the fare and the survival on a single chart.
plt.figure(figsize=(15,8))
ax = plt.subplot()
ax.scatter(data[data['Survived']==1]['Age'],data[data['Survived']==1]['Fare'],c='green',s=40)
ax.scatter(data[data['Survived']==0]['Age'],data[data['Survived']==0]['Fare'],c='red',s=40)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.legend(('survived','dead'),scatterpoints=1,loc='upper right',fontsize=15,)
Out[10]:
<matplotlib.legend.Legend at 0x172f7fc0c88>

A distinct cluster of dead passengers (the red one) appears on the chart. Those people are adults (age between 15 and 50) of lower class (lowest ticket fares).

In [11]:
#The ticket fare correlates with the class, as we can see in the chart below.
ax = plt.subplot()
ax.set_ylabel('Average fare')
data.groupby('Pclass').mean()['Fare'].plot(kind='bar',figsize=(15,8), ax = ax)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x172f7f626d8>
In [12]:
#Let's now see how the embarkation site affects the survival.
survived_embark = data[data['Survived']==1]['Embarked'].value_counts()
dead_embark = data[data['Survived']==0]['Embarked'].value_counts()
df = pd.DataFrame([survived_embark,dead_embark])
df.index = ['Survived','Dead']
df.plot(kind='bar',stacked=True, figsize=(15,8))
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x172f85dd320>

There seems to be no distinct correlation here.

II – Feature engineering

In [14]:
#Let's define a print function that confirms a feature has been processed.
def status(feature):
    print ('Processing',feature,': ok')

Loading the data

One trick when starting a machine learning problem is to combine the training set and the test set. This is especially useful when the test set contains feature values that don't appear in the training set: if the two sets are processed separately, the dummy-encoded columns won't match and testing our model on the test set will fail.

Besides, combining the two sets saves us some repeated work later on when testing.

The procedure is quite simple.

  • We start by loading the train set and the test set.
  • We extract the targets (Survived) from the train set and drop the column.
  • Then we append test to train and assign the result to combined.
In [15]:
def get_combined_data():
    # reading train data
    train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
    
    # reading test data
    test = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\test.csv")

    # extracting and then removing the targets from the training data 
    targets = train.Survived
    train.drop('Survived', axis=1, inplace=True)
    

    # merging train data and test data for future feature engineering
    # (DataFrame.append was removed in recent pandas; pd.concat is equivalent here)
    combined = pd.concat([train, test])
    combined.reset_index(inplace=True)
    combined.drop('index',inplace=True,axis=1)
    
    return combined
In [16]:
combined = get_combined_data()
In [17]:
combined.shape
Out[17]:
(1309, 11)

You may notice that the total number of rows (1309) is exactly the sum of the number of rows in the train set (891) and the test set (418).
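
A one-line check makes this explicit; a minimal sketch:

# 891 train rows stacked on top of 418 test rows
assert combined.shape[0] == 891 + 418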

Extracting the passenger titles

In [18]:
def get_titles():

    global combined
    
    # we extract the title from each name
    combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
    
    # a map of more aggregated titles
    Title_Dictionary = {
                        "Capt":       "Officer",
                        "Col":        "Officer",
                        "Major":      "Officer",
                        "Jonkheer":   "Royalty",
                        "Don":        "Royalty",
                        "Sir" :       "Royalty",
                        "Dr":         "Officer",
                        "Rev":        "Officer",
                        "the Countess":"Royalty",
                        "Dona":       "Royalty",
                        "Mme":        "Mrs",
                        "Mlle":       "Miss",
                        "Ms":         "Mrs",
                        "Mr" :        "Mr",
                        "Mrs" :       "Mrs",
                        "Miss" :      "Miss",
                        "Master" :    "Master",
                        "Lady" :      "Royalty"

                        }
    
    # we map each title
    combined['Title'] = combined.Title.map(Title_Dictionary)
In [19]:
get_titles()
In [21]:
combined.head(2)
Out[21]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
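
Since any raw title missing from Title_Dictionary ends up as NaN, it is worth checking that every row received a title; a quick sketch:

# any unmapped title would show up as NaN in this count
combined['Title'].value_counts(dropna=False)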

Processing the ages

Simply replacing the missing ages with the overall mean or median might not be the best solution, since the typical age differs across groups and categories of passengers.

To understand why, let's group our dataset by Sex, Pclass and Title, and compute the median age for each subset.

In [22]:
grouped = combined.groupby(['Sex','Pclass','Title'])
grouped.median()
Out[22]:
PassengerId Age SibSp Parch Fare
Sex Pclass Title
female 1 Miss 529.5 30.0 0.0 0.0 99.9625
Mrs 853.5 45.0 1.0 0.0 78.1125
Officer 797.0 49.0 0.0 0.0 25.9292
Royalty 760.0 39.0 0.0 0.0 86.5000
2 Miss 606.5 20.0 0.0 0.0 20.2500
Mrs 533.0 30.0 1.0 0.0 26.0000
3 Miss 603.5 18.0 0.0 0.0 8.0500
Mrs 668.5 31.0 1.0 1.0 15.5000
male 1 Master 803.0 6.0 1.0 2.0 134.5000
Mr 634.0 41.5 0.0 0.0 47.1000
Officer 678.0 52.0 0.0 0.0 37.5500
Royalty 600.0 40.0 0.0 0.0 27.7208
2 Master 550.0 2.0 1.0 1.0 26.0000
Mr 723.5 30.0 0.0 0.0 13.0000
Officer 513.0 41.5 0.0 0.0 13.0000
3 Master 789.0 6.0 3.0 1.0 22.3583
Mr 640.5 26.0 0.0 0.0 7.8958

Look at the median Age column and see how much this value varies depending on Sex, Pclass and Title combined.

For example:

  • If the passenger is female, from Pclass 1, and from royalty the median age is 39.
  • If the passenger is male, from Pclass 3, with a Mr title, the median age is 26.
In [23]:
def process_age():
    
    global combined
    
    # a function that fills the missing values of the Age variable
    
    def fillAges(row):
        if row['Sex']=='female' and row['Pclass'] == 1:
            if row['Title'] == 'Miss':
                return 30
            elif row['Title'] == 'Mrs':
                return 45
            elif row['Title'] == 'Officer':
                return 49
            elif row['Title'] == 'Royalty':
                return 39

        elif row['Sex']=='female' and row['Pclass'] == 2:
            if row['Title'] == 'Miss':
                return 20
            elif row['Title'] == 'Mrs':
                return 30

        elif row['Sex']=='female' and row['Pclass'] == 3:
            if row['Title'] == 'Miss':
                return 18
            elif row['Title'] == 'Mrs':
                return 31

        elif row['Sex']=='male' and row['Pclass'] == 1:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 41.5
            elif row['Title'] == 'Officer':
                return 52
            elif row['Title'] == 'Royalty':
                return 40

        elif row['Sex']=='male' and row['Pclass'] == 2:
            if row['Title'] == 'Master':
                return 2
            elif row['Title'] == 'Mr':
                return 30
            elif row['Title'] == 'Officer':
                return 41.5

        elif row['Sex']=='male' and row['Pclass'] == 3:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 26
    
    combined.Age = combined.apply(lambda r : fillAges(r) if np.isnan(r['Age']) else r['Age'], axis=1)
    
    status('age')
In [24]:
process_age()
Processing age : ok
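
As an aside, the hard-coded ladder above just mirrors the grouped medians. The same imputation can be written more compactly with a groupby transform; this is a sketch of an alternative, not the notebook's original code (it recomputes the medians from the data instead of hard-coding them):

# fill each missing age with the median age of its (Sex, Pclass, Title) group
medians = combined.groupby(['Sex', 'Pclass', 'Title'])['Age'].transform('median')
combined['Age'] = combined['Age'].fillna(medians)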
In [25]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Title          1309 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 122.8+ KB
In [26]:
#Let's now process the names.
def process_names():
    
    global combined
    # we clean the Name variable
    combined.drop('Name',axis=1,inplace=True)
    
    # encoding in dummy variable
    titles_dummies = pd.get_dummies(combined['Title'],prefix='Title')
    combined = pd.concat([combined,titles_dummies],axis=1)
    
    # removing the title variable
    combined.drop('Title',axis=1,inplace=True)
    
    status('names')

This function drops the Name column, which we no longer need now that we have extracted a Title column.

Then we encode the title values using a dummy encoding.
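
If dummy encoding is new to you, here is what pd.get_dummies does on a toy series; a self-contained sketch, unrelated to the Titanic data:

# each distinct value becomes its own 0/1 indicator column
toy = pd.Series(['Mr', 'Mrs', 'Mr', 'Miss'])
pd.get_dummies(toy, prefix='Title')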

In [27]:
process_names()
Processing names : ok
In [28]:
combined.head()
Out[28]:
PassengerId Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty
0 1 3 male 22.0 1 0 A/5 21171 7.2500 NaN S 0.0 0.0 1.0 0.0 0.0 0.0
1 2 1 female 38.0 1 0 PC 17599 71.2833 C85 C 0.0 0.0 0.0 1.0 0.0 0.0
2 3 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0.0 1.0 0.0 0.0 0.0 0.0
3 4 1 female 35.0 1 0 113803 53.1000 C123 S 0.0 0.0 0.0 1.0 0.0 0.0
4 5 3 male 35.0 0 0 373450 8.0500 NaN S 0.0 0.0 1.0 0.0 0.0 0.0

As you can see :

  • there is no longer a name feature.
  • new variables (Title_X) appeared. These features are binary.
    • For example, if Title_Mr = 1, the corresponding Title is Mr.

Processing Fare

In [30]:
#This function simply replaces the one missing Fare value with the mean.
def process_fares():
    
    global combined
    # there's one missing fare value - replacing it with the mean.
    combined.Fare.fillna(combined.Fare.mean(),inplace=True)
    
    status('fare')
In [31]:
process_fares()
Processing fare : ok

Processing Embarked

In [32]:
#This function replaces the two missing values of Embarked with the most frequent Embarked value.
def process_embarked():
    
    global combined
    # two missing embarked values - filling them with the most frequent one (S)
    combined.Embarked.fillna('S',inplace=True)
    
    # dummy encoding 
    embarked_dummies = pd.get_dummies(combined['Embarked'],prefix='Embarked')
    combined = pd.concat([combined,embarked_dummies],axis=1)
    combined.drop('Embarked',axis=1,inplace=True)
    
    status('embarked')
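
Before running it, you can confirm that S really is the most frequent port; a quick sketch (run it before calling process_embarked(), which drops the Embarked column):

# 'S' (Southampton) dominates, so it is a reasonable fill value for the two NaNs
combined['Embarked'].value_counts(dropna=False)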
In [33]:
process_embarked()
Processing embarked : ok

Processing Cabin

In [34]:
#This function replaces NaN values with U (for Unknown). It then maps each Cabin value to its first letter, and encodes the cabin values using dummy encoding again.
def process_cabin():
    
    global combined
    
    # replacing missing cabins with U (for Unknown)
    combined.Cabin.fillna('U',inplace=True)
    
    # mapping each Cabin value with the cabin letter
    combined['Cabin'] = combined['Cabin'].map(lambda c : c[0])
    
    # dummy encoding ...
    cabin_dummies = pd.get_dummies(combined['Cabin'],prefix='Cabin')
    
    combined = pd.concat([combined,cabin_dummies],axis=1)
    
    combined.drop('Cabin',axis=1,inplace=True)
    
    status('cabin')
In [35]:
process_cabin()
Processing cabin : ok
In [36]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 26 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1309 non-null float64
Title_Master     1309 non-null float64
Title_Miss       1309 non-null float64
Title_Mr         1309 non-null float64
Title_Mrs        1309 non-null float64
Title_Officer    1309 non-null float64
Title_Royalty    1309 non-null float64
Embarked_C       1309 non-null float64
Embarked_Q       1309 non-null float64
Embarked_S       1309 non-null float64
Cabin_A          1309 non-null float64
Cabin_B          1309 non-null float64
Cabin_C          1309 non-null float64
Cabin_D          1309 non-null float64
Cabin_E          1309 non-null float64
Cabin_F          1309 non-null float64
Cabin_G          1309 non-null float64
Cabin_T          1309 non-null float64
Cabin_U          1309 non-null float64
dtypes: float64(20), int64(4), object(2)
memory usage: 266.0+ KB
OK, no missing values now.

Processing Sex

In [37]:
#This function maps the string values male and female to 1 and 0 respectively.
def process_sex():
    
    global combined
    # mapping string values to numerical one 
    combined['Sex'] = combined['Sex'].map({'male':1,'female':0})
    
    status('sex')
In [38]:
process_sex()
Processing sex : ok

Processing Pclass

In [39]:
#This function encodes the values of Pclass (1,2,3) using a dummy encoding.
def process_pclass():
    
    global combined
    # encoding into 3 categories:
    pclass_dummies = pd.get_dummies(combined['Pclass'],prefix="Pclass")
    
    # adding dummy variables
    combined = pd.concat([combined,pclass_dummies],axis=1)
    
    # removing "Pclass"
    
    combined.drop('Pclass',axis=1,inplace=True)
    
    status('pclass')
In [40]:
process_pclass()
Processing pclass : ok

Processing Ticket

  • This function preprocesses the tickets by first extracting the ticket prefix; when it fails to extract a prefix, it returns XXX.
  • Then it encodes the prefixes using dummy encoding.
In [47]:
def process_ticket():
    
    global combined
    
    # a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
    def cleanTicket(ticket):
        ticket = ticket.replace('.','')
        ticket = ticket.replace('/','')
        ticket = ticket.split()
        ticket = map(lambda t : t.strip() , ticket)
        ticket = list(filter(lambda t : not t.isdigit(), ticket))
        if len(ticket) > 0:
            return ticket[0]
        else: 
            return 'XXX'
    

    # Extracting dummy variables from tickets:

    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'],prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies],axis=1)
    combined.drop('Ticket',inplace=True,axis=1)

    status('ticket')
In [48]:
process_ticket()
Processing ticket : ok
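
To see what cleanTicket produces, here is a standalone copy applied to a few sample tickets; a sketch (the function is re-declared here only because the original is nested inside process_ticket):

def clean_ticket_demo(ticket):
    # same logic as cleanTicket above: strip punctuation, keep a non-numeric prefix
    ticket = ticket.replace('.', '').replace('/', '')
    parts = [t.strip() for t in ticket.split()]
    parts = [t for t in parts if not t.isdigit()]
    return parts[0] if parts else 'XXX'

for t in ['A/5 21171', 'STON/O2. 3101282', '373450']:
    print(t, '->', clean_ticket_demo(t))   # -> A5, STONO2, XXX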

Processing Family

This function introduces 4 new features:

  • FamilySize : the total number of relatives, including the passenger him/herself.
  • Singleton : a boolean variable describing families of size 1.
  • SmallFamily : a boolean variable describing families of 2 <= size <= 4.
  • LargeFamily : a boolean variable describing families of size >= 5.
In [49]:
def process_family():
    
    global combined
    # introducing a new feature : the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s : 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s : 1 if 2<=s<=4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s : 1 if 5<=s else 0)
    
    status('family')
In [50]:
process_family()
Processing family : ok
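
A quick way to sanity-check the new features is to look at the FamilySize distribution; a small sketch (most passengers travel alone, which is what the Singleton flag captures):

# distribution of the new FamilySize feature
combined['FamilySize'].value_counts().sort_index()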
In [51]:
combined.shape
Out[51]:
(1309, 68)
In [52]:
#Let's now normalize all the features to the unit interval,
#all except PassengerId, which we'll need for the submission.
def scale_all_features():
    
    global combined
    
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    
    print ('Features scaled successfully !')
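
Note that the cell above only defines the function. To actually apply the scaling before modeling, call it:

scale_all_features()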

III – Modeling

We now have to:

  1. Break the combined dataset into a train set and a test set.
  2. Use the train set to build a predictive model.
  3. Evaluate the model using the train set.
  4. Test the model on the test set and generate an output file for the submission.
In [53]:
#Let's start by importing the useful libraries.

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
# cross_validation and grid_search were removed in scikit-learn 0.20+;
# their contents now live in model_selection
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
In [54]:
#To evaluate our model we'll be using a 5-fold cross validation with the Accuracy metric.
def compute_score(clf, X, y,scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5,scoring=scoring)
    return np.mean(xval)
In [55]:
#Recover the train set and the test set from the combined dataset
def recover_train_test_target():
    global combined
    
    train0 = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
    
    targets = train0.Survived
    # .ix was removed from pandas; .iloc gives the same positional split
    train = combined.iloc[:891]
    test = combined.iloc[891:]
    
    return train,test,targets
In [56]:
train,test,targets = recover_train_test_target()

Feature selection

In [57]:
#Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
clf = ExtraTreesClassifier(n_estimators=200)
clf = clf.fit(train, targets)
In [58]:
#Let's have a look at the importance of each feature.
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
In [59]:
features.sort_values('importance', ascending=False)
Out[59]:
feature importance
0 PassengerId 0.130185
2 Age 0.117193
5 Fare 0.112109
8 Title_Mr 0.111754
1 Sex 0.107180
7 Title_Miss 0.042467
26 Pclass_3 0.038323
9 Title_Mrs 0.031687
23 Cabin_U 0.029953
24 Pclass_1 0.022660
66 SmallFamily 0.020168
64 FamilySize 0.019737
67 LargeFamily 0.019370
3 SibSp 0.017105
4 Parch 0.014946
6 Title_Master 0.013714
25 Pclass_2 0.013251
63 Ticket_XXX 0.012658
14 Embarked_S 0.012487
65 Singleton 0.011039
12 Embarked_C 0.010579
19 Cabin_E 0.009528
10 Title_Officer 0.007628
13 Embarked_Q 0.007047
41 Ticket_PC 0.006552
16 Cabin_B 0.006545
18 Cabin_D 0.006261
60 Ticket_SWPP 0.006228
17 Cabin_C 0.005919
57 Ticket_STONO 0.005363
29 Ticket_A5 0.003393
34 Ticket_CA 0.003102
61 Ticket_WC 0.002522
15 Cabin_A 0.002515
55 Ticket_SOTONOQ 0.002055
33 Ticket_C 0.001993
58 Ticket_STONO2 0.001938
20 Cabin_F 0.001799
53 Ticket_SOPP 0.001636
21 Cabin_G 0.001578
11 Title_Royalty 0.001162
42 Ticket_PP 0.000798
62 Ticket_WEP 0.000739
50 Ticket_SCParis 0.000576
49 Ticket_SCPARIS 0.000552
39 Ticket_LINE 0.000551
28 Ticket_A4 0.000551
51 Ticket_SOC 0.000542
36 Ticket_FC 0.000495
37 Ticket_FCC 0.000463
22 Cabin_T 0.000333
47 Ticket_SCAH 0.000198
52 Ticket_SOP 0.000186
56 Ticket_SP 0.000122
54 Ticket_SOTONO2 0.000108
44 Ticket_SC 0.000107
38 Ticket_Fa 0.000086
43 Ticket_PPP 0.000080
46 Ticket_SCA4 0.000064
35 Ticket_CASOTON 0.000051
32 Ticket_AS 0.000043
48 Ticket_SCOW 0.000027
59 Ticket_STONOQ 0.000000
27 Ticket_A 0.000000
45 Ticket_SCA3 0.000000
30 Ticket_AQ3 0.000000
31 Ticket_AQ4 0.000000
40 Ticket_LP 0.000000

As you may notice, Title_Mr, Age, Fare and Sex carry a great deal of importance.

There is also a non-negligible importance assigned to PassengerId, which is surprising for an arbitrary identifier.

In [60]:
#Let's now transform our train set and test set into more compact datasets.
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape
Out[60]:
(891, 15)
In [61]:
test_new = model.transform(test)
test_new.shape
Out[61]:
(418, 15)
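
Before tuning, it can be useful to get a baseline cross-validated accuracy with the compute_score helper defined earlier; a sketch (the hyperparameters here are arbitrary):

# 5-fold CV accuracy for an untuned forest on the reduced feature set
compute_score(RandomForestClassifier(n_estimators=50, max_features='sqrt'),
              train_new, targets)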

Hyperparameters tuning

In [63]:
#Random Forests are quite handy. They do, however, come with some parameters to tweak in order to get an optimal model for the prediction task.
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [200,210,240,250],
                 'criterion': ['gini','entropy']
                 }

# StratifiedKFold's old (labels, n_folds) signature is gone; with model_selection,
# GridSearchCV passes the targets to the splitter itself
cross_validation = StratifiedKFold(n_splits=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train_new, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
Best score: 0.8316498316498316
Best parameters: {'criterion': 'gini', 'max_depth': 4, 'n_estimators': 250}
In [64]:
output = grid_search.predict(test_new).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('output.csv',index=False)

Your submission scored 0.78947

 

continued in part 4
