Statistics and math are the two things a data scientist must be good at. Effect Size: This notebook is a copy of statistics inference from PyCon 2016. In [1]: from __future__ import print_function, division import numpy import scipy.stats import matplotlib.pyplot as pyplot from ipywidgets import interact, interactive, fixed import ipywidgets as widgets # seed the random number generator so we…

# Category: Data Analysis Resources

## How to win a Data Science competition?

These are mainly notes for myself on how to win a Data Science competition, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome! Typically, steps 1-5 would happen once per competition or problem, while steps 6-9 would be repeated in a loop or occur in parallel…

## The Comprehensive Guide for Feature Engineering

Feature Engineering is the art/science of representing data in the best way possible. I wrote this comprehensive guide to Feature Engineering for myself, but I figured that it might be of interest to some of the blog readers too. Comments on what is written below are most welcome! Good Feature Engineering involves an elegant blend of domain knowledge, intuition, and…

## 4 different ways to predict survival on Titanic – part 4

Continued from part 3. 4. Way to predict survival on Titanic. These notes are taken from this link. In [2]: import matplotlib.pyplot as plt %matplotlib inline import numpy as np import pandas as pd import statsmodels.api as sm from statsmodels.nonparametric.kde import KDEUnivariate from statsmodels.nonparametric import smoothers_lowess from pandas import Series, DataFrame from patsy import dmatrices from sklearn import datasets, svm In [3]:…

## 4 different ways to predict survival on Titanic – part 3

Continued from part 2. 3. Way to predict survival on Titanic. These notes are from this link. I – Exploratory data analysis. We tweak the style of this notebook a little bit to have centered plots. In [1]: from IPython.core.display import HTML HTML(""" <style> .output_png { display: table-cell; text-align: center; vertical-align: middle; } </style> """) Out[1]: In [2]: #Import the libraries #…

## 4 different ways to predict survival on Titanic – part 1

These are my notes from various blogs on different ways to predict survival on Titanic using the Python stack. I am interested in comparing how different people have attempted the Kaggle competition. I am going to compare and contrast different analyses to find similarities and differences in approaches to predicting survival on Titanic. This Notebook will show basic examples of: Data…

## 4 different ways to predict survival on Titanic – part 2

Continued from part 1. Classification. KNeighborsClassifier In [16]: from sklearn.neighbors import KNeighborsClassifier alg_ngbh = KNeighborsClassifier(n_neighbors=3) scores = cross_validation.cross_val_score(alg_ngbh, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1) print("Accuracy (k-neighbors): {}/{}".format(scores.mean(), scores.std())) Accuracy (k-neighbors): 0.7957351290684623/0.011110544261068086 SGDClassifier In [17]: from sklearn.linear_model.stochastic_gradient import SGDClassifier alg_sgd = SGDClassifier(random_state=1) scores = cross_validation.cross_val_score(alg_sgd, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1) print("Accuracy (sgd): {}/{}".format(scores.mean(), scores.std())) Accuracy (sgd): 0.7239057239057239/0.015306601231185043 SVC In [18]: from sklearn.svm import SVC alg_svm = SVC(C=1.0)…
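The excerpt above calls `cross_validation.cross_val_score`, which lived in a module that newer scikit-learn releases no longer ship; the same function is now found in `sklearn.model_selection`. The following is a minimal sketch of the k-neighbors step under that newer layout. The synthetic data from `make_classification`, the `StratifiedKFold` splitter, and the scaling pipeline are assumptions standing in for `train_data_scaled`, `train_data_munged["Survived"]`, and `cv`, none of which are shown in the excerpt.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features and labels (an assumption,
# not the real competition data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)

# Hypothetical splitter; the original notebook's `cv` object is not shown.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Scale inside the pipeline so each fold is scaled using only its training part.
alg_ngbh = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# One accuracy score per fold.
scores = cross_val_score(alg_ngbh, X, y, cv=cv)
print("Accuracy (k-neighbors): {:.3f}/{:.3f}".format(scores.mean(), scores.std()))
```

Swapping `alg_ngbh` for another estimator, such as `SGDClassifier(random_state=1)` or `SVC(C=1.0)`, reuses the same splitter, which keeps the fold-by-fold comparison between models consistent.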

## Dublin R El Dorado Competition

Quick Summary: 5,000 data points with pseudo-geological information (including proven gold reserves). 50 (or more) new sites up for auction with limited data. Each team starts with a $50,000,000.00 budget to bid with. Blind, sealed-bid auctions for rights to mine the parcel of land (min bid $100,000.00). Auctions happen in order by parcel_id. Extraction costs are non-trivial. Winning team has most…