These are my notes from various blogs on different ways to predict survival on the Titanic using the Python stack. I am interested in how different people have attempted the Kaggle competition, and I will compare and contrast their analyses to find similarities and differences in their approaches.
This Notebook will show basic examples of:
- Data Handling
  - Importing Data with Pandas
  - Cleaning Data
  - Exploring Data through Visualizations with Matplotlib
- Data Analysis
  - Supervised Machine Learning Techniques:
    - Logit Regression Model
    - Plotting results
    - Support Vector Machine (SVM) using 3 kernels
    - Basic Random Forest
    - Plotting results
    - etc.
- Validation of the Analysis
  - K-folds cross-validation to evaluate results locally
  - Output the results from the Notebook to Kaggle
Column Information
- The Survived column is the target variable. If Survived = 1 the passenger survived; otherwise they died.
- The other variables that describe the passengers are:
  - PassengerId: an id given to each traveler on the boat
  - Pclass: the passenger class. It has three possible values: 1, 2, 3
  - Name: the passenger's name
  - Sex
  - Age
  - SibSp: number of siblings and spouses traveling with the passenger
  - Parch: number of parents and children traveling with the passenger
  - Ticket: the ticket number
  - Fare: the ticket fare
  - Cabin: the cabin number
  - Embarked: the port of embarkation, with three possible values: S, C, Q
1. Way to predict survival on Titanic
These notes are taken from this link
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
train_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
test_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\test.csv")
# Put the train and test samples together into one combined sample.
all_data = pd.concat([train_data, test_data])
Why do this? The test sample is missing the Survived field, but the combined sample is useful for computing statistics over all the other fields (mean, median, quantiles, minima and maxima), as well as the relationships between those fields.
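For example (a minimal illustration using the frames loaded above), a statistic computed on the combined sample can differ from the one computed on the training sample alone:
# Median fare on the combined sample vs. the training sample alone
print(all_data["Fare"].median())
print(train_data["Fare"].median())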
Data analysis
print("===== survived by class and sex")
print(train_data.groupby(["Pclass", "Sex"])["Survived"].value_counts(normalize=True))
We see that women were far more likely to survive: their survival rates are 96.8%, 92.1% and 50.0% depending on ticket class. The survival rates for men are much lower: 36.9%, 15.7% and 13.5% respectively.
describe_fields = ["Age", "Fare", "Pclass", "SibSp", "Parch"]
print("===== train: males")
print(train_data[train_data["Sex"] == "male"][describe_fields].describe())
print("===== test: males")
print(test_data[test_data["Sex"] == "male"][describe_fields].describe())
print("===== train: females")
print(train_data[train_data["Sex"] == "female"][describe_fields].describe())
print("===== test: females")
print(test_data[test_data["Sex"] == "female"][describe_fields].describe())
Next we build a small digest of the full sample; we will need it later when transforming the samples.
import re

class DataDigest:
    def __init__(self):
        self.ages = None
        self.fares = None
        self.titles = None
        self.cabins = None
        self.families = None
        self.tickets = None
def get_title(name):
    # Extract the title ("Mr", "Mrs", ...) from a passenger's name
    if pd.isnull(name):
        return "Null"
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1).lower()
    return "None"
def get_family(row):
    # Families larger than 3 get a unique id (surname + size); smaller groups are lumped together
    last_name = row["Name"].split(",")[0]
    if last_name:
        family_size = 1 + row["Parch"] + row["SibSp"]
        if family_size > 3:
            return "{0}_{1}".format(last_name.lower(), family_size)
        else:
            return "nofamily"
    else:
        return "unknown"
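As a quick, illustrative sanity check of these helpers (the row below is constructed by hand for this example):
# A hand-built row to exercise the helpers
sample = pd.Series({"Name": "Braund, Mr. Owen Harris", "Parch": 0, "SibSp": 1})
print(get_title(sample["Name"]))  # "mr"
print(get_family(sample))         # "nofamily": family size 2 is not > 3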
data_digest = DataDigest()
# ages - reference median age per sex
data_digest.ages = all_data.groupby("Sex")["Age"].median()
# fares - reference median fare per passenger class
data_digest.fares = all_data.groupby("Pclass")["Fare"].median()
# titles - reference index of titles
data_digest.titles = pd.Index(test_data["Name"].apply(get_title).unique())
# families - reference index of family identifiers (surname + number of family members)
data_digest.families = pd.Index(test_data.apply(get_family, axis=1).unique())
# cabins - reference index of cabin identifiers
data_digest.cabins = pd.Index(test_data["Cabin"].fillna("unknown").unique())
# tickets - reference index of ticket identifiers
data_digest.tickets = pd.Index(test_data["Ticket"].fillna("unknown").unique())
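A quick peek at the reference medians stored in the digest:
print(data_digest.ages)   # median age per sex
print(data_digest.fares)  # median fare per passenger class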
Select features and convert categorical data to numeric.
def get_index(item, index):
    # Map a value to its position in a reference index; unknown or null values map to -1
    if pd.isnull(item):
        return -1
    try:
        return index.get_loc(item)
    except KeyError:
        return -1
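A small illustration of get_index with a made-up reference index (not from the dataset):
idx = pd.Index(["A10", "B20", "unknown"])  # hypothetical reference index
print(get_index("B20", idx))      # 1: position in the reference index
print(get_index("missing", idx))  # -1: not in the reference index
print(get_index(None, idx))       # -1: nulls also map to -1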
def munge_data(data, digest):
    # Age: fill missing ages with the median age for the passenger's sex
    data["AgeF"] = data.apply(lambda r: digest.ages[r["Sex"]] if pd.isnull(r["Age"]) else r["Age"], axis=1)

    # Fare: fill missing fares with the median fare for the passenger's class
    data["FareF"] = data.apply(lambda r: digest.fares[r["Pclass"]] if pd.isnull(r["Fare"]) else r["Fare"], axis=1)

    # Gender: ordinal encoding
    genders = {"male": 1, "female": 0}
    data["SexF"] = data["Sex"].apply(lambda s: genders.get(s))

    # Gender: one-hot dummies
    gender_dummies = pd.get_dummies(data["Sex"], prefix="SexD", dummy_na=False)
    data = pd.concat([data, gender_dummies], axis=1)

    # Embarkment: ordinal encoding ("U" for unknown)
    embarkments = {"U": 0, "S": 1, "C": 2, "Q": 3}
    data["EmbarkedF"] = data["Embarked"].fillna("U").apply(lambda e: embarkments.get(e))

    # Embarkment: one-hot dummies
    embarkment_dummies = pd.get_dummies(data["Embarked"], prefix="EmbarkedD", dummy_na=False)
    data = pd.concat([data, embarkment_dummies], axis=1)

    # Relatives: total number of relatives on board
    data["RelativesF"] = data["Parch"] + data["SibSp"]

    # SingleF: 1 if the passenger travels alone
    data["SingleF"] = data["RelativesF"].apply(lambda r: 1 if r == 0 else 0)

    # Deck: ordinal encoding of the first letter of the cabin
    decks = {"U": 0, "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "T": 8}
    data["DeckF"] = data["Cabin"].fillna("U").apply(lambda c: decks.get(c[0], -1))

    # Deck: one-hot dummies
    deck_dummies = pd.get_dummies(data["Cabin"].fillna("U").apply(lambda c: c[0]), prefix="DeckD", dummy_na=False)
    data = pd.concat([data, deck_dummies], axis=1)

    # Titles: one-hot dummies
    title_dummies = pd.get_dummies(data["Name"].apply(lambda n: get_title(n)), prefix="TitleD", dummy_na=False)
    data = pd.concat([data, title_dummies], axis=1)

    # Lookup features against the reference indices built in the digest
    data["CabinF"] = data["Cabin"].fillna("unknown").apply(lambda c: get_index(c, digest.cabins))
    data["TitleF"] = data["Name"].apply(lambda n: get_index(get_title(n), digest.titles))
    data["TicketF"] = data["Ticket"].apply(lambda t: get_index(t, digest.tickets))
    data["FamilyF"] = data.apply(lambda r: get_index(get_family(r), digest.families), axis=1)

    # Age ranges: bin ages into intervals for grouping
    age_bins = [0, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]
    data["AgeR"] = pd.cut(data["Age"].fillna(-1), bins=age_bins).astype(object)

    return data
train_data_munged = munge_data(train_data, data_digest)
test_data_munged = munge_data(test_data, data_digest)
all_data_munged = pd.concat([train_data_munged, test_data_munged])
all_data_munged.head(5)
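To list the engineered columns that munge_data added, a rough name-pattern check like this works (the pattern match is a heuristic added here, not part of the original code):
# Engineered columns end in F or R, or contain a dummy prefix like "D_"
print([c for c in all_data_munged.columns if c.endswith(("F", "R")) or "D_" in c])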
predictors = ["Pclass",
"AgeF",
"TitleF",
"TitleD_mr", "TitleD_mrs", "TitleD_miss", "TitleD_master", "TitleD_ms",
"TitleD_col", "TitleD_rev", "TitleD_dr",
"CabinF",
"DeckF",
"DeckD_U", "DeckD_A", "DeckD_B", "DeckD_C", "DeckD_D", "DeckD_E", "DeckD_F", "DeckD_G",
"FamilyF",
"TicketF",
"SexF",
"SexD_male", "SexD_female",
"EmbarkedF",
"EmbarkedD_S", "EmbarkedD_C", "EmbarkedD_Q",
"FareF",
"SibSp", "Parch",
"RelativesF",
"SingleF"]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the combined sample so train and test share the same scaling
scaler.fit(all_data_munged[predictors])
train_data_scaled = scaler.transform(train_data_munged[predictors])
test_data_scaled = scaler.transform(test_data_munged[predictors])
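As a sanity check, the scaled training features should have mean close to 0 and standard deviation close to 1 (not exactly, since the scaler was fit on the combined sample):
print(train_data_scaled.mean(axis=0).round(2))
print(train_data_scaled.std(axis=0).round(2))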
print("===== survived by age")
print(train_data.groupby(["AgeF"])["Survived"].value_counts(normalize=True))
print("===== survived by gender and age")
print(train_data.groupby(["Sex", "AgeF"])["Survived"].value_counts(normalize=True))
print("===== survived by class and age")
print(train_data.groupby(["Pclass", "AgeF"])["Survived"].value_counts(normalize=True))
We see that the chances of survival are high for children up to 5 years old, while in older age groups the chance of survival decreases with age. This does not hold for women, though: a woman's chance of survival is high at any age.
import seaborn as sns
sns.pairplot(train_data_munged, vars=["AgeF", "Pclass", "SexF"], hue="Survived", dropna=True)
plt.show()
Beautiful, but the class/sex relationship is hard to read from the pairwise plot. Let us instead estimate feature importance with SelectKBest, which by default scores features with the ANOVA F-test (see the Wikipedia article on the F-test).
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=5)
selector.fit(train_data_munged[predictors], train_data_munged["Survived"])
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
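To read the winners off the chart, the scores can also be paired with the predictor names and sorted (a small addition, not in the original analysis):
# Five features with the highest -log10(p-value)
top_predictors = sorted(zip(predictors, scores), key=lambda t: -t[1])[:5]
print(top_predictors)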
Before we start running classifiers, we need to understand how we will evaluate them. With Kaggle this is very simple: we just read the competition rules.
For the Titanic, the score is the share of passengers the classifier labels correctly, out of the total number of passengers.
In other words, this metric is called accuracy. But before sending a classifier's predictions on the test sample to Kaggle for evaluation, it would be good to get at least an approximate idea of its performance locally.
from sklearn import metrics

def training_accuracy(classifier, train_X, train_y):
    classifier.fit(train_X, train_y)
    predict_y = classifier.predict(train_X)
    return metrics.accuracy_score(train_y, predict_y)
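For example, with a hypothetical random forest (note this score is optimistic, because the model is evaluated on the same data it was fit on):
from sklearn.ensemble import RandomForestClassifier
print(training_accuracy(RandomForestClassifier(random_state=1),
                        train_data_scaled, train_data_munged["Survived"]))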
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
Here we define the validation process: the training data is split into three folds, with records assigned to folds at random (to neutralize any dependence on record order), while stratification keeps the class ratio approximately equal in each fold. We then take three measurements: folds 1+2 vs 3, 1+3 vs 2, and 2+3 vs 1. From these we obtain the mean accuracy of the classifier (which characterizes its performance) and the variance of that estimate (which characterizes the stability of its work).
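A minimal sketch of that measurement, assuming a hypothetical random forest and scikit-learn's cross_val_score:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(random_state=1),
                            train_data_scaled, train_data_munged["Survived"],
                            cv=cv, scoring="accuracy")
print("accuracy: {:.3f} +/- {:.3f}".format(cv_scores.mean(), cv_scores.std()))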