Data Analysis Resources

Computational Statistics

Statistics and math are the two things a data scientist must be good at.

Effect Size

This notebook is a copy of the statistical inference material from PyCon 2016.

In [1]:
    from __future__ import print_function, division
    import numpy
    import scipy.stats
    import matplotlib.pyplot as pyplot
    from ipywidgets import interact, interactive, fixed
    import ipywidgets as widgets
    # seed the random number generator so we…
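The excerpt stops before any effect size is computed. As a rough illustration of the idea, here is a minimal sketch of Cohen's d, one common effect size measure. The function name `cohen_effect_size`, the seed, and the sample parameters are assumptions made for this example and are not taken from the notebook.

```python
import numpy as np


def cohen_effect_size(group1, group2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    n1, n2 = len(group1), len(group2)
    # Pooled variance: each group's (population, ddof=0) variance weighted by its size.
    pooled_var = (n1 * group1.var() + n2 * group2.var()) / (n1 + n2)
    return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)


# Hypothetical samples, drawn only to exercise the function.
rng = np.random.default_rng(17)
sample_a = rng.normal(loc=178.0, scale=7.7, size=1000)
sample_b = rng.normal(loc=163.0, scale=7.3, size=1000)

d = cohen_effect_size(sample_a, sample_b)
print("Cohen's d: {:.2f}".format(d))
```

Because d is expressed in units of standard deviations, it can be compared across measurements with different scales, which a raw difference in means cannot.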

Continue Reading

Data Analysis Resources, Kaggle

4 different ways to predict survival on Titanic – part 4

Continued from part 3.

4. Way to predict survival on Titanic

These notes are taken from this link.

In [2]:
    import matplotlib.pyplot as plt
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.nonparametric.kde import KDEUnivariate
    from statsmodels.nonparametric import smoothers_lowess
    from pandas import Series, DataFrame
    from patsy import dmatrices
    from sklearn import datasets, svm

In [3]:…

Continue Reading

Data Analysis Resources, Kaggle

4 different ways to predict survival on Titanic – part 3

Continued from part 2.

3. Way to predict survival on Titanic

These notes are from this link.

I – Exploratory data analysis

We tweak the style of this notebook a little bit to have centered plots.

In [1]:
    from IPython.core.display import HTML
    HTML("""
    <style>
    .output_png {
        display: table-cell;
        text-align: center;
        vertical-align: middle;
    }
    </style>
    """)

Out[1]:

In [2]:
    # Import the libraries
    #…

Continue Reading

Data Analysis Resources, Kaggle

4 different ways to predict survival on Titanic – part 1

These are my notes from various blogs on different ways to predict survival on Titanic using the Python stack. I am interested in comparing how different people have attempted the Kaggle competition. I am going to compare and contrast the different analyses to find similarities and differences in their approaches to predicting survival on Titanic. This notebook will show basic examples of: Data…

Continue Reading

Data Analysis Resources, Kaggle, Predictive Analysis

4 different ways to predict survival on Titanic – part 2

Continued from part 1.

Classification

KNeighborsClassifier

In [16]:
    from sklearn.neighbors import KNeighborsClassifier
    alg_ngbh = KNeighborsClassifier(n_neighbors=3)
    scores = cross_validation.cross_val_score(alg_ngbh, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
    print("Accuracy (k-neighbors): {}/{}".format(scores.mean(), scores.std()))

    Accuracy (k-neighbors): 0.7957351290684623/0.011110544261068086

SGDClassifier

In [17]:
    from sklearn.linear_model.stochastic_gradient import SGDClassifier
    alg_sgd = SGDClassifier(random_state=1)
    scores = cross_validation.cross_val_score(alg_sgd, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
    print("Accuracy (sgd): {}/{}".format(scores.mean(), scores.std()))

    Accuracy (sgd): 0.7239057239057239/0.015306601231185043

SVC

In [18]:
    from sklearn.svm import SVC
    alg_svm = SVC(C=1.0)…
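The cells above rely on the `sklearn.cross_validation` module, which has been removed from newer scikit-learn releases in favour of `sklearn.model_selection`. The sketch below shows the same cross-validated scoring pattern with the newer module. It is not the notebook's code: the synthetic data from `make_classification` stands in for the Titanic features, the 3-fold `StratifiedKFold` is an assumed choice for `cv`, and scaling is done inside a pipeline rather than on a pre-scaled array like `train_data_scaled`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data (an assumption; not the Titanic data).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Assumed fold setup; the excerpt does not show how `cv` was defined.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the held-out fold.
alg_ngbh = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

scores = cross_val_score(alg_ngbh, X, y, cv=cv)
print("Accuracy (k-neighbors): {:.3f}/{:.3f}".format(scores.mean(), scores.std()))
```

Swapping `KNeighborsClassifier` for `SGDClassifier(random_state=1)` or `SVC(C=1.0)` in the pipeline would give a like-for-like comparison of the three classifiers on the same folds.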

Continue Reading

Business, Data Analysis Resources

Dublin R El Dorado Competition

Quick Summary: 5,000 data points with pseudo-geological information (including proven gold reserves). 50 (or more) new sites up for auction with limited data. Each team starts with a $50,000,000.00 budget to bid with. Blind, sealed-bid auctions for the rights to mine each parcel of land (min bid $100,000.00). Auctions happen in order by parcel_id. Extraction costs are non-trivial. Winning team has most…

Continue Reading