4 different ways to predict survival on Titanic – part 1

Posted in Data Analysis Resources, Kaggle

These are my notes, collected from various blogs, on different ways to predict survival on the Titanic using the Python stack. I am interested in comparing how different people have attempted this Kaggle competition, so I will compare and contrast their analyses to find the similarities and differences in their approaches.

This Notebook will show basic examples of:

  • Data Handling
    • Importing Data with Pandas
    • Cleaning Data
    • Exploring Data through Visualizations with Matplotlib
  • Data Analysis
    • Supervised Machine learning Techniques:
      • Logit Regression Model
      • Plotting results
      • Support Vector Machine (SVM) using 3 kernels
      • Basic Random Forest
      • Plotting results
      • etc.
  • Validation of the Analysis
    • K-fold cross-validation to validate results locally
    • Output the results from the Notebook to Kaggle

Columns Information

  • The Survived column is the target variable. If Survived = 1 the passenger survived; otherwise they did not.
  • The other variables that describe the passengers are:
    • PassengerId: an ID given to each traveler on the boat
    • Pclass: the passenger class. It has three possible values: 1, 2, 3
    • The Name
    • The Sex
    • The Age
    • SibSp: number of siblings and spouses traveling with the passenger
    • Parch: number of parents and children traveling with the passenger
    • The ticket number
    • The ticket Fare
    • The cabin number
    • The port of embarkation. It has three possible values: S, C, Q

1. First way to predict survival on Titanic

These notes are taken from this link

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
In [2]:
train_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
test_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\test.csv")
In [3]:
#Combine the train and test samples into one full dataset.
all_data = pd.concat([train_data, test_data])

Why do this? Because the Survived field is missing from the test sample. The combined sample is still useful for computing statistics for all the other fields (mean, median, quantiles, minima and maxima), as well as relationships between those fields.
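For example (a quick illustration, not from the original notebook), medians and quantiles can be computed on the combined sample even though Survived is NaN for the test rows:

print(all_data["Age"].median())
print(all_data["Fare"].quantile([0.25, 0.5, 0.75]))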

Data analysis

In [4]:
print("===== survived by class and sex")
print(train_data.groupby(["Pclass", "Sex"])["Survived"].value_counts(normalize=True))
===== survived by class and sex
Pclass  Sex     Survived
1       female  1           0.968085
                0           0.031915
        male    0           0.631148
                1           0.368852
2       female  1           0.921053
                0           0.078947
        male    0           0.842593
                1           0.157407
3       female  0           0.500000
                1           0.500000
        male    0           0.864553
                1           0.135447
dtype: float64

We see that women were far more likely to survive: a woman's chance of survival is 96.8%, 92.1% and 50%, depending on the ticket class. Men's chances of survival are lower: 36.9%, 15.7% and 13.5% respectively.

In [5]:
describe_fields = ["Age", "Fare", "Pclass", "SibSp", "Parch"]

print("===== train: males")
print(train_data[train_data["Sex"] == "male"][describe_fields].describe())

print("===== test: males")
print(test_data[test_data["Sex"] == "male"][describe_fields].describe())

print("===== train: females")
print(train_data[train_data["Sex"] == "female"][describe_fields].describe())

print("===== test: females")
print(test_data[test_data["Sex"] == "female"][describe_fields].describe())
===== train: males
              Age        Fare      Pclass       SibSp       Parch
count  453.000000  577.000000  577.000000  577.000000  577.000000
mean    30.726645   25.523893    2.389948    0.429809    0.235702
std     14.678201   43.138263    0.813580    1.061811    0.612294
min      0.420000    0.000000    1.000000    0.000000    0.000000
25%     21.000000    7.895800    2.000000    0.000000    0.000000
50%     29.000000   10.500000    3.000000    0.000000    0.000000
75%     39.000000   26.550000    3.000000    0.000000    0.000000
max     80.000000  512.329200    3.000000    8.000000    5.000000
===== test: males
              Age        Fare      Pclass       SibSp       Parch
count  205.000000  265.000000  266.000000  266.000000  266.000000
mean    30.272732   27.527877    2.334586    0.379699    0.274436
std     13.389528   41.079423    0.808497    0.843735    0.883745
min      0.330000    0.000000    1.000000    0.000000    0.000000
25%     22.000000    7.854200    2.000000    0.000000    0.000000
50%     27.000000   13.000000    3.000000    0.000000    0.000000
75%     40.000000   26.550000    3.000000    1.000000    0.000000
max     67.000000  262.375000    3.000000    8.000000    9.000000
===== train: females
              Age        Fare      Pclass       SibSp       Parch
count  261.000000  314.000000  314.000000  314.000000  314.000000
mean    27.915709   44.479818    2.159236    0.694268    0.649682
std     14.110146   57.997698    0.857290    1.156520    1.022846
min      0.750000    6.750000    1.000000    0.000000    0.000000
25%     18.000000   12.071875    1.000000    0.000000    0.000000
50%     27.000000   23.000000    2.000000    0.000000    0.000000
75%     37.000000   55.000000    3.000000    1.000000    1.000000
max     63.000000  512.329200    3.000000    8.000000    6.000000
===== test: females
              Age        Fare      Pclass       SibSp       Parch
count  127.000000  152.000000  152.000000  152.000000  152.000000
mean    30.272362   49.747699    2.144737    0.565789    0.598684
std     15.428613   73.108716    0.887051    0.974313    1.105434
min      0.170000    6.950000    1.000000    0.000000    0.000000
25%     20.500000    8.626050    1.000000    0.000000    0.000000
50%     27.000000   21.512500    2.000000    0.000000    0.000000
75%     38.500000   55.441700    3.000000    1.000000    1.000000
max     76.000000  512.329200    3.000000    8.000000    9.000000

Next we build a small digest of the full sample – we will need it later when transforming the data.

In [6]:
import re
class DataDigest:

    def __init__(self):
        self.ages = None
        self.fares = None
        self.titles = None
        self.cabins = None
        self.families = None
        self.tickets = None

def get_title(name):
    if pd.isnull(name):
        return "Null"

    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1).lower()
    else:
        return "None"


def get_family(row):
    last_name = row["Name"].split(",")[0]
    if last_name:
        family_size = 1 + row["Parch"] + row["SibSp"]
        if family_size > 3:
            return "{0}_{1}".format(last_name.lower(), family_size)
        else:
            return "nofamily"
    else:
        return "unknown"


data_digest = DataDigest()
# ages - median age per gender;
data_digest.ages = all_data.groupby("Sex")["Age"].median()
# fares - median ticket fare per passenger class;
data_digest.fares = all_data.groupby("Pclass")["Fare"].median()
# titles - index of known titles (built from the test sample);
data_digest.titles = pd.Index(test_data["Name"].apply(get_title).unique())
# families - index of family identifiers (surname + family size);
data_digest.families = pd.Index(test_data.apply(get_family, axis=1).unique())
# cabins - index of cabin identifiers;
data_digest.cabins = pd.Index(test_data["Cabin"].fillna("unknown").unique())
# tickets - index of ticket identifiers.
data_digest.tickets = pd.Index(test_data["Ticket"].fillna("unknown").unique())
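A quick sanity check of the helpers and the digest (an illustration added here, not from the original notebook; exact values depend on the CSV files loaded above):

print(get_title("Braund, Mr. Owen Harris"))  # -> "mr"
print(data_digest.ages)   # median age per gender
print(data_digest.fares)  # median fare per passenger class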

Select features

Convert categorical data to numeric
In [7]:
def get_index(item, index):
    if pd.isnull(item):
        return -1

    try:
        return index.get_loc(item)
    except KeyError:
        return -1


def munge_data(data, digest):
    # Age
    data["AgeF"] = data.apply(lambda r: digest.ages[r["Sex"]] if pd.isnull(r["Age"]) else r["Age"], axis=1)

    # Fare
    data["FareF"] = data.apply(lambda r: digest.fares[r["Pclass"]] if pd.isnull(r["Fare"]) else r["Fare"], axis=1)

    # Gender: numeric encoding
    genders = {"male": 1, "female": 0}
    data["SexF"] = data["Sex"].apply(lambda s: genders.get(s))

    # Gender: one-hot (dummy) encoding
    gender_dummies = pd.get_dummies(data["Sex"], prefix="SexD", dummy_na=False)
    data = pd.concat([data, gender_dummies], axis=1)

    # Embarkation: numeric encoding
    embarkments = {"U": 0, "S": 1, "C": 2, "Q": 3}
    data["EmbarkedF"] = data["Embarked"].fillna("U").apply(lambda e: embarkments.get(e))

    # Embarkation: one-hot encoding
    embarkment_dummies = pd.get_dummies(data["Embarked"], prefix="EmbarkedD", dummy_na=False)
    data = pd.concat([data, embarkment_dummies], axis=1)

    # Relatives
    data["RelativesF"] = data["Parch"] + data["SibSp"]

    # SingleF: traveling without relatives
    data["SingleF"] = data["RelativesF"].apply(lambda r: 1 if r == 0 else 0)

    # Deck: numeric encoding (first letter of the cabin)
    decks = {"U": 0, "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "T": 8}
    data["DeckF"] = data["Cabin"].fillna("U").apply(lambda c: decks.get(c[0], -1))

    # Deck: one-hot encoding
    deck_dummies = pd.get_dummies(data["Cabin"].fillna("U").apply(lambda c: c[0]), prefix="DeckD", dummy_na=False)
    data = pd.concat([data, deck_dummies], axis=1)

    # Title: one-hot encoding
    title_dummies = pd.get_dummies(data["Name"].apply(lambda n: get_title(n)), prefix="TitleD", dummy_na=False)
    data = pd.concat([data, title_dummies], axis=1)

    # Label-encoded features looked up in the digest indices
    data["CabinF"] = data["Cabin"].fillna("unknown").apply(lambda c: get_index(c, digest.cabins))

    data["TitleF"] = data["Name"].apply(lambda n: get_index(get_title(n), digest.titles))

    data["TicketF"] = data["Ticket"].apply(lambda t: get_index(t, digest.tickets))

    data["FamilyF"] = data.apply(lambda r: get_index(get_family(r), digest.families), axis=1)

    # Age bins (categorical age ranges)
    age_bins = [0, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]
    data["AgeR"] = pd.cut(data["Age"].fillna(-1), bins=age_bins).astype(object)

    return data
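get_index is a simple label-encoding helper: it maps a value to its position in a reference index and returns -1 for missing or unseen values. A quick illustration (not part of the original notebook):

idx = pd.Index(["A", "B", "C"])
print(get_index("B", idx))   # 1
print(get_index("Z", idx))   # -1 (unseen value)
print(get_index(None, idx))  # -1 (missing value)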
In [8]:
train_data_munged = munge_data(train_data, data_digest)
test_data_munged = munge_data(test_data, data_digest)
all_data_munged = pd.concat([train_data_munged, test_data_munged])
In [9]:
all_data_munged.head(5)
Out[9]:
Age AgeF AgeR Cabin CabinF DeckD_A DeckD_B DeckD_C DeckD_D DeckD_E TitleD_master TitleD_miss TitleD_mlle TitleD_mme TitleD_mr TitleD_mrs TitleD_ms TitleD_rev TitleD_sir TitleF
0 22.0 22.0 (20, 25] NaN 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0
1 38.0 38.0 (30, 40] C85 46 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
2 26.0 26.0 (25, 30] NaN 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2
3 35.0 35.0 (30, 40] C123 -1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
4 35.0 35.0 (30, 40] NaN 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0

5 rows × 56 columns

In [10]:
predictors = ["Pclass",
              "AgeF",
              "TitleF",
              "TitleD_mr", "TitleD_mrs", "TitleD_miss", "TitleD_master", "TitleD_ms", 
              "TitleD_col", "TitleD_rev", "TitleD_dr",
              "CabinF",
              "DeckF",
              "DeckD_U", "DeckD_A", "DeckD_B", "DeckD_C", "DeckD_D", "DeckD_E", "DeckD_F", "DeckD_G",
              "FamilyF",
              "TicketF",
              "SexF",
              "SexD_male", "SexD_female",
              "EmbarkedF",
              "EmbarkedD_S", "EmbarkedD_C", "EmbarkedD_Q",
              "FareF",
              "SibSp", "Parch",
              "RelativesF",
              "SingleF"]
In [11]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(all_data_munged[predictors])

train_data_scaled = scaler.transform(train_data_munged[predictors])
test_data_scaled = scaler.transform(test_data_munged[predictors])
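StandardScaler standardizes each feature as (x - mean) / std. Since the scaler was fit on the combined sample, the transformed training data should have per-feature means close to (though not exactly) 0 and standard deviations close to 1. A quick check, for illustration:

print(train_data_scaled.mean(axis=0).round(2))  # approximately 0 everywhere
print(train_data_scaled.std(axis=0).round(2))   # approximately 1 everywhere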
In [12]:
print("===== survived by age")
print(train_data.groupby(["AgeF"])["Survived"].value_counts(normalize=True))

print("===== survived by gender and age")
print(train_data.groupby(["Sex", "AgeF"])["Survived"].value_counts(normalize=True))

print("===== survived by class and age")
print(train_data.groupby(["Pclass", "AgeF"])["Survived"].value_counts(normalize=True))
===== survived by age
AgeF   Survived
0.42   1           1.000000
0.67   1           1.000000
0.75   1           1.000000
0.83   1           1.000000
0.92   1           1.000000
1.00   1           0.714286
       0           0.285714
2.00   0           0.700000
       1           0.300000
3.00   1           0.833333
       0           0.166667
4.00   1           0.700000
       0           0.300000
5.00   1           1.000000
6.00   1           0.666667
       0           0.333333
7.00   0           0.666667
       1           0.333333
8.00   0           0.500000
       1           0.500000
9.00   0           0.750000
       1           0.250000
10.00  0           1.000000
11.00  0           0.750000
       1           0.250000
12.00  1           1.000000
13.00  1           1.000000
14.00  0           0.500000
       1           0.500000
14.50  0           1.000000
                     ...   
51.00  0           0.714286
       1           0.285714
52.00  0           0.500000
       1           0.500000
53.00  1           1.000000
54.00  0           0.625000
       1           0.375000
55.00  0           0.500000
       1           0.500000
55.50  0           1.000000
56.00  0           0.500000
       1           0.500000
57.00  0           1.000000
58.00  1           0.600000
       0           0.400000
59.00  0           1.000000
60.00  0           0.500000
       1           0.500000
61.00  0           1.000000
62.00  0           0.500000
       1           0.500000
63.00  1           1.000000
64.00  0           1.000000
65.00  0           1.000000
66.00  0           1.000000
70.00  0           1.000000
70.50  0           1.000000
71.00  0           1.000000
74.00  0           1.000000
80.00  1           1.000000
dtype: float64
===== survived by gender and age
Sex     AgeF   Survived
female  0.75   1           1.000000
        1.00   1           1.000000
        2.00   0           0.666667
               1           0.333333
        3.00   0           0.500000
               1           0.500000
        4.00   1           1.000000
        5.00   1           1.000000
        6.00   0           0.500000
               1           0.500000
        7.00   1           1.000000
        8.00   0           0.500000
               1           0.500000
        9.00   0           1.000000
        10.00  0           1.000000
        11.00  0           1.000000
        13.00  1           1.000000
        14.00  1           0.750000
               0           0.250000
        14.50  0           1.000000
        15.00  1           1.000000
        16.00  1           0.833333
               0           0.166667
        17.00  1           0.833333
               0           0.166667
        18.00  1           0.615385
               0           0.384615
        19.00  1           1.000000
        20.00  0           1.000000
        21.00  1           0.571429
                             ...   
male    48.00  0           0.400000
        49.00  0           0.500000
               1           0.500000
        50.00  0           0.800000
               1           0.200000
        51.00  0           0.833333
               1           0.166667
        52.00  0           0.750000
               1           0.250000
        54.00  0           1.000000
        55.00  0           1.000000
        55.50  0           1.000000
        56.00  0           0.666667
               1           0.333333
        57.00  0           1.000000
        58.00  0           1.000000
        59.00  0           1.000000
        60.00  0           0.666667
               1           0.333333
        61.00  0           1.000000
        62.00  0           0.666667
               1           0.333333
        64.00  0           1.000000
        65.00  0           1.000000
        66.00  0           1.000000
        70.00  0           1.000000
        70.50  0           1.000000
        71.00  0           1.000000
        74.00  0           1.000000
        80.00  1           1.000000
dtype: float64
===== survived by class and age
Pclass  AgeF   Survived
1       0.92   1           1.000000
        2.00   0           1.000000
        4.00   1           1.000000
        11.00  1           1.000000
        14.00  1           1.000000
        15.00  1           1.000000
        16.00  1           1.000000
        17.00  1           1.000000
        18.00  1           0.750000
               0           0.250000
        19.00  1           0.600000
               0           0.400000
        21.00  1           0.666667
               0           0.333333
        22.00  1           0.800000
               0           0.200000
        23.00  1           1.000000
        24.00  1           0.714286
               0           0.285714
        25.00  1           0.666667
               0           0.333333
        26.00  1           1.000000
        27.00  1           0.923077
               0           0.076923
        28.00  0           0.720000
               1           0.280000
        29.00  0           0.666667
               1           0.333333
        30.00  1           0.833333
               0           0.166667
                             ...   
3       35.00  1           0.166667
        36.00  0           0.833333
               1           0.166667
        37.00  0           1.000000
        38.00  0           0.750000
               1           0.250000
        39.00  0           0.833333
               1           0.166667
        40.00  0           1.000000
        40.50  0           1.000000
        41.00  0           1.000000
        42.00  0           1.000000
        43.00  0           1.000000
        44.00  0           0.750000
               1           0.250000
        45.00  0           0.800000
               1           0.200000
        45.50  0           1.000000
        47.00  0           1.000000
        48.00  0           1.000000
        49.00  0           1.000000
        50.00  0           1.000000
        51.00  0           1.000000
        55.50  0           1.000000
        59.00  0           1.000000
        61.00  0           1.000000
        63.00  1           1.000000
        65.00  0           1.000000
        70.50  0           1.000000
        74.00  0           1.000000
dtype: float64

We see that children up to 5 years old have a high chance of survival, and that in old age the chance of survival decreases with age. This does not hold for women, however – a woman's chance of survival is high at any age.

In [13]:
import seaborn as sns
sns.pairplot(train_data_munged, vars=["AgeF", "Pclass", "SexF"], hue="Survived", dropna=True)
plt.show()  # sns.plt was removed from seaborn; call matplotlib's pyplot directly

Beautiful, but the class and gender structure is not very clear from these pairwise plots. Let's estimate the importance of our features with SelectKBest (its default scoring function is an ANOVA F-test; see the Wikipedia article on the F-test).

In [14]:
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=5)
selector.fit(train_data_munged[predictors], train_data_munged["Survived"])

scores = -np.log10(selector.pvalues_)

plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
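To see which k features SelectKBest actually kept, by name (a small addition, not in the original):

selected = [p for p, keep in zip(predictors, selector.get_support()) if keep]
print(selected)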

Before we begin running classifiers, we need to understand how we will evaluate them. In the case of Kaggle it is very simple: we just read the competition rules.

For the Titanic competition, the score is the ratio of correctly classified passengers to the total number of passengers – in other words, the accuracy. But before sending the classification results on the test sample to Kaggle for evaluation, it would be nice to understand at least the approximate performance of our classifier locally. The naive approach is to score the classifier on the same data it was trained on:

from sklearn import metrics

def train_accuracy(classifier, train_X, train_y):
    # Note: this measures accuracy on the training data itself, which is
    # optimistic; the cross-validation below gives a more honest estimate.
    classifier.fit(train_X, train_y)
    predict_y = classifier.predict(train_X)
    return metrics.accuracy_score(train_y, predict_y)

In [15]:
from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation before scikit-learn 0.18
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
# With the modern API the labels are passed later, e.g. via cross_val_score(..., cv=cv).

Here we define the cross-validation scheme: the training data will be split into three folds, with records assigned to each fold at random (to neutralize any possible dependence on record order), while stratification keeps the class ratio approximately equal in each fold. We then take three measurements – training on folds 1+2 and testing on fold 3, then 1+3 vs 2, then 2+3 vs 1 – after which we can compute the average accuracy of the classifier (which characterizes its performance) and the variance of the estimates (which characterizes the stability of its results).
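A minimal sketch of how this cv object can be used, assuming scikit-learn 0.18+ (the logistic regression here is just a placeholder – the actual model comparison follows in part 2):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Mean and spread of the accuracy across the three stratified folds.
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         train_data_scaled, train_data_munged["Survived"],
                         cv=cv, scoring="accuracy")
print("accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))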

Continued in part 2.
