Kaggle Tutorial using Kobe Bryant Dataset – Part 1

This is a Kaggle tutorial. You can get the data from https://www.kaggle.com/c/kobe-bryant-shot-selection. What excites me about this dataset is that it is excellent for practicing classification basics, feature engineering, and time series analysis.

Importing Data

Let us start with importing the basic libraries we need and the data set.

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore') 

# sk learn import 
from sklearn.decomposition import PCA, KernelPCA
# cross_validation and grid_search were merged into model_selection in newer scikit-learn releases
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.feature_selection import VarianceThreshold, RFE, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, VotingClassifier, RandomForestClassifier, AdaBoostClassifier

sns.set_style('whitegrid')
pd.set_option('display.max_columns', None) # display all columns


# importing the dataset
df = pd.read_csv('data.csv')
df.head(5)
Out[1]:
action_type combined_shot_type game_event_id game_id lat loc_x loc_y lon minutes_remaining period playoffs season seconds_remaining shot_distance shot_made_flag shot_type shot_zone_area shot_zone_basic shot_zone_range team_id team_name game_date matchup opponent shot_id
0 Jump Shot Jump Shot 10 20000012 33.9723 167 72 -118.1028 10 1 0 2000-01 27 18 NaN 2PT Field Goal Right Side(R) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 1
1 Jump Shot Jump Shot 12 20000012 34.0443 -157 0 -118.4268 10 1 0 2000-01 22 15 0.0 2PT Field Goal Left Side(L) Mid-Range 8-16 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 2
2 Jump Shot Jump Shot 35 20000012 33.9093 -101 135 -118.3708 7 1 0 2000-01 45 16 1.0 2PT Field Goal Left Side Center(LC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 3
3 Jump Shot Jump Shot 43 20000012 33.8693 138 175 -118.1318 6 1 0 2000-01 52 22 0.0 2PT Field Goal Right Side Center(RC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 4
4 Driving Dunk Shot Dunk 155 20000012 34.0443 0 0 -118.2698 6 2 0 2000-01 19 0 1.0 2PT Field Goal Center(C) Restricted Area Less Than 8 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 5

Optionally, shot_id can be made the index and the ID-like and label columns cast to categorical types. This step was not applied to the outputs shown below, which still report the original integer dtypes:

df.set_index('shot_id', inplace=True)
df["action_type"] = df["action_type"].astype('object')
df["combined_shot_type"] = df["combined_shot_type"].astype('category')
df["game_event_id"] = df["game_event_id"].astype('category')
df["game_id"] = df["game_id"].astype('category')
df["period"] = df["period"].astype('object')
df["playoffs"] = df["playoffs"].astype('category')
df["season"] = df["season"].astype('category')
df["shot_made_flag"] = df["shot_made_flag"].astype('category')
df["shot_type"] = df["shot_type"].astype('category')
df["team_id"] = df["team_id"].astype('category')

Getting to know the data

Let us start with a few basic inspections: column types, shape, and non-null counts.

In [2]:
df.dtypes
Out[2]:
action_type            object
combined_shot_type     object
game_event_id           int64
game_id                 int64
lat                   float64
loc_x                   int64
loc_y                   int64
lon                   float64
minutes_remaining       int64
period                  int64
playoffs                int64
season                 object
seconds_remaining       int64
shot_distance           int64
shot_made_flag        float64
shot_type              object
shot_zone_area         object
shot_zone_basic        object
shot_zone_range        object
team_id                 int64
team_name              object
game_date              object
matchup                object
opponent               object
shot_id                 int64
dtype: object
In [3]:
# shape
df.shape
Out[3]:
(30697, 25)
In [4]:
#info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30697 entries, 0 to 30696
Data columns (total 25 columns):
action_type           30697 non-null object
combined_shot_type    30697 non-null object
game_event_id         30697 non-null int64
game_id               30697 non-null int64
lat                   30697 non-null float64
loc_x                 30697 non-null int64
loc_y                 30697 non-null int64
lon                   30697 non-null float64
minutes_remaining     30697 non-null int64
period                30697 non-null int64
playoffs              30697 non-null int64
season                30697 non-null object
seconds_remaining     30697 non-null int64
shot_distance         30697 non-null int64
shot_made_flag        25697 non-null float64
shot_type             30697 non-null object
shot_zone_area        30697 non-null object
shot_zone_basic       30697 non-null object
shot_zone_range       30697 non-null object
team_id               30697 non-null int64
team_name             30697 non-null object
game_date             30697 non-null object
matchup               30697 non-null object
opponent              30697 non-null object
shot_id               30697 non-null int64
dtypes: float64(3), int64(11), object(11)
memory usage: 5.9+ MB

Exploring the columns with Permutation and Random Sampling

To select a random subset without replacement, one way is to take the first k elements of the array returned by np.random.permutation, where k is the desired subset size.

In [5]:
sampler = np.random.permutation(5)
In [6]:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.take.html
df.take(sampler)
Out[6]:
action_type combined_shot_type game_event_id game_id lat loc_x loc_y lon minutes_remaining period playoffs season seconds_remaining shot_distance shot_made_flag shot_type shot_zone_area shot_zone_basic shot_zone_range team_id team_name game_date matchup opponent shot_id
3 Jump Shot Jump Shot 43 20000012 33.8693 138 175 -118.1318 6 1 0 2000-01 52 22 0.0 2PT Field Goal Right Side Center(RC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 4
1 Jump Shot Jump Shot 12 20000012 34.0443 -157 0 -118.4268 10 1 0 2000-01 22 15 0.0 2PT Field Goal Left Side(L) Mid-Range 8-16 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 2
4 Driving Dunk Shot Dunk 155 20000012 34.0443 0 0 -118.2698 6 2 0 2000-01 19 0 1.0 2PT Field Goal Center(C) Restricted Area Less Than 8 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 5
2 Jump Shot Jump Shot 35 20000012 33.9093 -101 135 -118.3708 7 1 0 2000-01 45 16 1.0 2PT Field Goal Left Side Center(LC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 3
0 Jump Shot Jump Shot 10 20000012 33.9723 167 72 -118.1028 10 1 0 2000-01 27 18 NaN 2PT Field Goal Right Side(R) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR 1
In [7]:
randomSample = df.take(np.random.permutation(len(df))[:3])
In [8]:
randomSample
Out[8]:
action_type combined_shot_type game_event_id game_id lat loc_x loc_y lon minutes_remaining period playoffs season seconds_remaining shot_distance shot_made_flag shot_type shot_zone_area shot_zone_basic shot_zone_range team_id team_name game_date matchup opponent shot_id
1904 Jump Shot Jump Shot 198 20100360 33.9113 -164 133 -118.4338 5 2 0 2001-02 39 21 NaN 2PT Field Goal Left Side Center(LC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2001-12-20 LAL @ HOU HOU 1905
22043 Jump Shot Jump Shot 258 21500269 33.9693 148 75 -118.1218 2 2 0 2015-16 34 16 0.0 2PT Field Goal Right Side(R) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2015-12-02 LAL @ WAS WAS 22044
8747 Layup Shot Layup 472 20500689 34.0443 0 0 -118.2698 3 4 0 2005-06 21 0 0.0 2PT Field Goal Center(C) Restricted Area Less Than 8 ft. 1610612747 Los Angeles Lakers 2006-02-04 LAL @ NOK NOP 8748
In [9]:
randomSample.T
Out[9]:
1904 22043 8747
action_type Jump Shot Jump Shot Layup Shot
combined_shot_type Jump Shot Jump Shot Layup
game_event_id 198 258 472
game_id 20100360 21500269 20500689
lat 33.9113 33.9693 34.0443
loc_x -164 148 0
loc_y 133 75 0
lon -118.434 -118.122 -118.27
minutes_remaining 5 2 3
period 2 2 4
playoffs 0 0 0
season 2001-02 2015-16 2005-06
seconds_remaining 39 34 21
shot_distance 21 16 0
shot_made_flag NaN 0 0
shot_type 2PT Field Goal 2PT Field Goal 2PT Field Goal
shot_zone_area Left Side Center(LC) Right Side(R) Center(C)
shot_zone_basic Mid-Range Mid-Range Restricted Area
shot_zone_range 16-24 ft. 16-24 ft. Less Than 8 ft.
team_id 1610612747 1610612747 1610612747
team_name Los Angeles Lakers Los Angeles Lakers Los Angeles Lakers
game_date 2001-12-20 2015-12-02 2006-02-04
matchup LAL @ HOU LAL @ WAS LAL @ NOK
opponent HOU WAS NOP
shot_id 1905 22044 8748
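
As a cross-check on the permutation trick, here is a minimal sketch on a small made-up frame (the column values and the seed are illustrative assumptions, not the Kobe data). It draws k rows without replacement by slicing a permutation, and compares that with pandas' built-in DataFrame.sample, which does the same job in one call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "shot_id": range(1, 11),
    "shot_distance": [18, 15, 16, 22, 0, 14, 0, 2, 12, 12],
})

k = 3
rng = np.random.RandomState(42)  # fixed seed (an assumption) so the draw is repeatable

# Permute all row positions, keep the first k, and take those rows.
positions = rng.permutation(len(toy))[:k]
sample_take = toy.take(positions)

# Built-in equivalent: k rows drawn without replacement.
sample_builtin = toy.sample(n=k, random_state=42)

print(sample_take)
print(sample_builtin)
```

Both approaches return k distinct rows; the two samples will generally contain different rows because they consume the random state differently.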

Summarize data

Descriptive statistics

In [10]:
# Let's take a brief look at the statistics of all numerical columns:
df.describe(include =['number'])
Out[10]:
game_event_id game_id lat loc_x loc_y lon minutes_remaining period playoffs seconds_remaining shot_distance shot_made_flag team_id shot_id
count 30697.000000 3.069700e+04 30697.000000 30697.000000 30697.000000 30697.000000 30697.000000 30697.000000 30697.000000 30697.000000 30697.000000 25697.000000 3.069700e+04 30697.000000
mean 249.190800 2.476407e+07 33.953192 7.110499 91.107535 -118.262690 4.885624 2.519432 0.146562 28.365085 13.437437 0.446161 1.610613e+09 15349.000000
std 150.003712 7.755175e+06 0.087791 110.124578 87.791361 0.110125 3.449897 1.153665 0.353674 17.478949 9.374189 0.497103 0.000000e+00 8861.604943
min 2.000000 2.000001e+07 33.253300 -250.000000 -44.000000 -118.519800 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.610613e+09 1.000000
25% 110.000000 2.050008e+07 33.884300 -68.000000 4.000000 -118.337800 2.000000 1.000000 0.000000 13.000000 5.000000 0.000000 1.610613e+09 7675.000000
50% 253.000000 2.090035e+07 33.970300 0.000000 74.000000 -118.269800 5.000000 3.000000 0.000000 28.000000 15.000000 0.000000 1.610613e+09 15349.000000
75% 368.000000 2.960047e+07 34.040300 95.000000 160.000000 -118.174800 8.000000 3.000000 0.000000 43.000000 21.000000 1.000000 1.610613e+09 23023.000000
max 659.000000 4.990009e+07 34.088300 248.000000 791.000000 -118.021800 11.000000 7.000000 1.000000 59.000000 79.000000 1.000000 1.610613e+09 30697.000000
In [11]:
#And for categorical columns:
df.describe(include=['object', 'category'])
Out[11]:
action_type combined_shot_type season shot_type shot_zone_area shot_zone_basic shot_zone_range team_name game_date matchup opponent
count 30697 30697 30697 30697 30697 30697 30697 30697 30697 30697 30697
unique 57 6 20 2 6 7 5 1 1559 74 33
top Jump Shot Jump Shot 2005-06 2PT Field Goal Center(C) Mid-Range Less Than 8 ft. Los Angeles Lakers 2016-04-13 LAL @ SAS SAS
freq 18880 23485 2318 24271 13455 12625 9398 30697 50 1020 1978

Data Visualization

In [12]:
#See target class distribution
ax = plt.axes()
sns.countplot(x='shot_made_flag', data=df, ax=ax);
ax.set_title('Target class distribution')
plt.show()
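
To put numbers on the plot, the class shares can be computed directly. The sketch below uses a tiny made-up Series (not the real column) and assumes missing labels should be left out of the proportions, which is what value_counts does by default.

```python
import numpy as np
import pandas as pd

# Made-up labels standing in for df['shot_made_flag']; NaN marks an unknown outcome.
flags = pd.Series([0.0, 1.0, 0.0, np.nan, 1.0, 0.0, 0.0, 1.0, np.nan, 0.0])

# Share of each class among the labelled rows (NaN is dropped by default).
shares = flags.value_counts(normalize=True).sort_index()
print(shares)

# For a 0/1 target, the mean (which also skips NaN) is the share of made shots.
made_share = flags.mean()
print(made_share)
```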

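The target variable is fairly balanced: the mean of shot_made_flag reported by describe() is 0.446, so about 45% of the labelled shots were made. We won’t take any action to deal with class imbalance. Next, the numeric features are plotted against the target using boxplots.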

In [14]:
f, axarr = plt.subplots(4, 2, figsize=(15, 15))
sns.boxplot(x='lat', y='shot_made_flag', data=df, showmeans=True, ax=axarr[0,0])
sns.boxplot(x='lon', y='shot_made_flag', data=df, showmeans=True, ax=axarr[0, 1])
sns.boxplot(x='loc_y', y='shot_made_flag', data=df, showmeans=True, ax=axarr[1, 0])
sns.boxplot(x='loc_x', y='shot_made_flag', data=df, showmeans=True, ax=axarr[1, 1])
sns.boxplot(x='minutes_remaining', y='shot_made_flag', showmeans=True, data=df, ax=axarr[2, 0])
sns.boxplot(x='seconds_remaining', y='shot_made_flag', showmeans=True, data=df, ax=axarr[2, 1])
sns.boxplot(x='shot_distance', y='shot_made_flag', data=df, showmeans=True, ax=axarr[3, 0])

axarr[0, 0].set_title('Latitude')
axarr[0, 1].set_title('Longitude')
axarr[1, 0].set_title('Loc y')
axarr[1, 1].set_title('Loc x')
axarr[2, 0].set_title('Minutes remaining')
axarr[2, 1].set_title('Seconds remaining')
axarr[3, 0].set_title('Shot distance')

plt.tight_layout()
plt.show()
In [28]:
# 'size' was renamed to 'height' in seaborn 0.9
sns.pairplot(df, vars=['loc_x', 'loc_y', 'lat', 'lon', 'shot_distance'], hue='shot_made_flag', height=3)
plt.show()
In [30]:
f, axarr = plt.subplots(8, figsize=(15, 25))

sns.countplot(x="combined_shot_type", hue="shot_made_flag", data=df, ax=axarr[0])
sns.countplot(x="season", hue="shot_made_flag", data=df, ax=axarr[1])
sns.countplot(x="period", hue="shot_made_flag", data=df, ax=axarr[2])
sns.countplot(x="playoffs", hue="shot_made_flag", data=df, ax=axarr[3])
sns.countplot(x="shot_type", hue="shot_made_flag", data=df, ax=axarr[4])
sns.countplot(x="shot_zone_area", hue="shot_made_flag", data=df, ax=axarr[5])
sns.countplot(x="shot_zone_basic", hue="shot_made_flag", data=df, ax=axarr[6])
sns.countplot(x="shot_zone_range", hue="shot_made_flag", data=df, ax=axarr[7])

axarr[0].set_title('Combined shot type')
axarr[1].set_title('Season')
axarr[2].set_title('Period')
axarr[3].set_title('Playoffs')
axarr[4].set_title('Shot Type')
axarr[5].set_title('Shot Zone Area')
axarr[6].set_title('Shot Zone Basic')
axarr[7].set_title('Shot Zone Range')

plt.tight_layout()
plt.show()

Data Cleaning

We assume that each shot is independent of the others, so columns that only identify the game or the event can be dropped, along with columns that are constant or redundant.

In [15]:
data_cl = df.copy() # create a copy of data frame
target = data_cl['shot_made_flag'].copy()

# Remove some columns
data_cl.drop('team_id', axis=1, inplace=True) # Always one number
data_cl.drop('lat', axis=1, inplace=True) # Correlated with loc_y
data_cl.drop('lon', axis=1, inplace=True) # Correlated with loc_x
data_cl.drop('game_id', axis=1, inplace=True) # Identifier; shots are assumed independent
data_cl.drop('game_event_id', axis=1, inplace=True) # Identifier; shots are assumed independent
data_cl.drop('team_name', axis=1, inplace=True) # Always LA Lakers
data_cl.drop('shot_made_flag', axis=1, inplace=True)
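
The redundancy between the geographic and court coordinates can be checked on the five rows printed by df.head(5) earlier. This sketch retypes those rows by hand and computes Pearson correlations. On these rows lat moves exactly with loc_y and lon with loc_x, which suggests that lat and lon add nothing beyond the court coordinates.

```python
import pandas as pd

# The five rows shown by df.head(5), retyped by hand.
head = pd.DataFrame({
    "lat":   [33.9723, 34.0443, 33.9093, 33.8693, 34.0443],
    "lon":   [-118.1028, -118.4268, -118.3708, -118.1318, -118.2698],
    "loc_x": [167, -157, -101, 138, 0],
    "loc_y": [72, 0, 135, 175, 0],
})

corr = head.corr()  # Pearson correlation by default
print(corr.loc["lat", "loc_y"])  # close to -1
print(corr.loc["lon", "loc_x"])  # close to +1
```

The same pairing shows up in the describe() output: the standard deviation of lat (0.0878) matches that of loc_y (87.79) up to a factor of 1000, and likewise for lon and loc_x.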
In [16]:
# There are also many outliers; this helper flags them:

def detect_outliers(series, whis=1.5):
    q75, q25 = np.percentile(series, [75 ,25])
    iqr = q75 - q25
    return ~((series - series.median()).abs() <= (whis * iqr))

## For now - do not remove anything
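
To see what the helper flags, here is a small sketch that repeats the function and applies it to a made-up Series (the values are an assumption chosen so that one point is clearly extreme). Note that the rule measures distance from the median, which is not the same as the usual Tukey fences (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR).

```python
import numpy as np
import pandas as pd

def detect_outliers(series, whis=1.5):
    # Same rule as the cell above: flag values farther than whis * IQR from the median.
    q75, q25 = np.percentile(series, [75, 25])
    iqr = q75 - q25
    return ~((series - series.median()).abs() <= (whis * iqr))

# Made-up distances with one extreme value.
distances = pd.Series([1, 2, 3, 4, 5, 100])
mask = detect_outliers(distances)

print(distances[mask])    # values flagged as outliers
print(distances[~mask])   # values that would be kept
```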

Data Transformation

New features
In [17]:
# Remaining time
data_cl['seconds_from_period_end'] = 60 * data_cl['minutes_remaining'] + data_cl['seconds_remaining']
data_cl['last_5_sec_in_period'] = data_cl['seconds_from_period_end'] < 5

data_cl.drop('minutes_remaining', axis=1, inplace=True)
data_cl.drop('seconds_remaining', axis=1, inplace=True)
data_cl.drop('seconds_from_period_end', axis=1, inplace=True)

## Matchup - (away/home)
data_cl['home_play'] = data_cl['matchup'].str.contains('vs').astype('int')
data_cl.drop('matchup', axis=1, inplace=True)

# Game date
data_cl['game_date'] = pd.to_datetime(data_cl['game_date'])
data_cl['game_year'] = data_cl['game_date'].dt.year
data_cl['game_month'] = data_cl['game_date'].dt.month
data_cl.drop('game_date', axis=1, inplace=True)

# Loc_x, and loc_y binning
data_cl['loc_x'] = pd.cut(data_cl['loc_x'], 25)
data_cl['loc_y'] = pd.cut(data_cl['loc_y'], 25)

# Replace 20 least common action types with value 'Other'
rare_action_types = data_cl['action_type'].value_counts().sort_values().index.values[:20]
data_cl.loc[data_cl['action_type'].isin(rare_action_types), 'action_type'] = 'Other'
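
The remaining-time and home/away features can be checked on a couple of made-up rows. In this sketch the first row copies the time and matchup values of the first shot in df.head(5); the second row, including the 'LAL vs. POR' string, is an illustrative assumption.

```python
import pandas as pd

# Two shots: the first mirrors row 0 of df.head(5), the second is made up.
toy = pd.DataFrame({
    "minutes_remaining": [10, 0],
    "seconds_remaining": [27, 3],
    "matchup": ["LAL @ POR", "LAL vs. POR"],
})

# Seconds left in the period, and a flag for shots taken in the last 5 seconds.
toy["seconds_from_period_end"] = 60 * toy["minutes_remaining"] + toy["seconds_remaining"]
toy["last_5_sec_in_period"] = toy["seconds_from_period_end"] < 5

# 'vs' in the matchup marks a home game, '@' an away game.
toy["home_play"] = toy["matchup"].str.contains("vs").astype("int")

print(toy)
```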

Encode categorical variables

Computing Indicator/Dummy Variables

A categorical variable can be transformed into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, the result is a matrix with k columns containing only 1’s and 0’s. Pandas provides the get_dummies function for this. In some cases, you may want to add a prefix to the column names, for example to record which original column each indicator came from.

In [18]:
categorial_cols = [
    'action_type', 'combined_shot_type', 'period', 'season', 'shot_type',
    'shot_zone_area', 'shot_zone_basic', 'shot_zone_range', 'game_year',
    'game_month', 'opponent', 'loc_x', 'loc_y']

for cc in categorial_cols:
    dummies = pd.get_dummies(data_cl[cc])
    dummies = dummies.add_prefix("{}#".format(cc))
    data_cl.drop(cc, axis=1, inplace=True)
    data_cl = data_cl.join(dummies)
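
A minimal sketch of the same encoding on a made-up column follows. It assumes get_dummies' own prefix and prefix_sep arguments are an acceptable substitute for the add_prefix call used in the loop; both produce names of the form column#value.

```python
import pandas as pd

# Made-up column with two distinct values.
toy = pd.DataFrame({"shot_type": ["2PT Field Goal", "3PT Field Goal", "2PT Field Goal"]})

# One indicator column per distinct value, named '<column>#<value>'.
dummies = pd.get_dummies(toy["shot_type"], prefix="shot_type", prefix_sep="#")
print(dummies)

# Swap the original column for its indicator columns.
encoded = toy.drop("shot_type", axis=1).join(dummies)
print(list(encoded.columns))
```

Each row has exactly one indicator set, so the k columns always sum to 1 across a row.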

Feature Selection

Let’s reduce the number of features. First, create views of the data for easier analysis.

In [20]:
# Rows with an unknown target: the set to predict for submission
unknown_mask = df['shot_made_flag'].isnull()
data_submit = data_cl[unknown_mask]

# Separate dataset for training
X = data_cl[~unknown_mask]
Y = target[~unknown_mask]
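
The split relies on the missing values in shot_made_flag: df.info() reports 25697 non-null values out of 30697, so 5000 rows have no label and end up in the submission set. A minimal sketch of the same masking on a made-up frame (column values and the names to_predict, X_toy and y_toy are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up frame: NaN in the target marks a shot whose outcome must be predicted.
toy = pd.DataFrame({
    "shot_distance": [18, 15, 16, 22, 0],
    "shot_made_flag": [np.nan, 0.0, 1.0, 0.0, 1.0],
})

target = toy["shot_made_flag"]
features = toy.drop("shot_made_flag", axis=1)

# True where the label is missing.
unknown_mask = target.isnull()

to_predict = features[unknown_mask]   # unlabelled rows
X_toy = features[~unknown_mask]       # labelled rows used for training
y_toy = target[~unknown_mask]         # their labels, aligned by index

print(len(to_predict), len(X_toy), len(y_toy))
```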

Variance Threshold

In [21]:
# Keep features whose variance exceeds that of a 0/1 feature with a 90/10 split, i.e. 0.9 * (1 - 0.9) = 0.09.
threshold = 0.90
vt = VarianceThreshold().fit(X)

# Find feature names
feat_var_threshold = data_cl.columns[vt.variances_ > threshold * (1-threshold)]
feat_var_threshold
Out[21]:
Index(['playoffs', 'shot_distance', 'shot_id', 'home_play',
       'action_type#Jump Shot', 'combined_shot_type#Jump Shot',
       'combined_shot_type#Layup', 'period#1', 'period#2', 'period#3',
       'period#4', 'shot_type#2PT Field Goal', 'shot_type#3PT Field Goal',
       'shot_zone_area#Center(C)', 'shot_zone_area#Left Side Center(LC)',
       'shot_zone_area#Left Side(L)', 'shot_zone_area#Right Side Center(RC)',
       'shot_zone_area#Right Side(R)', 'shot_zone_basic#Above the Break 3',
       'shot_zone_basic#In The Paint (Non-RA)', 'shot_zone_basic#Mid-Range',
       'shot_zone_basic#Restricted Area', 'shot_zone_range#16-24 ft.',
       'shot_zone_range#24+ ft.', 'shot_zone_range#8-16 ft.',
       'shot_zone_range#Less Than 8 ft.', 'game_month#1', 'game_month#2',
       'game_month#3', 'game_month#4', 'game_month#11', 'game_month#12',
       'loc_x#(-10.96, 8.96]', 'loc_y#(-10.6, 22.8]', 'loc_y#(22.8, 56.2]',
       'loc_y#(123, 156.4]'],
      dtype='object')
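
The cutoff used above comes from the variance of a 0/1 feature: if the feature equals 1 in a fraction p of the rows, its variance is p * (1 - p), so a 90/10 split gives 0.09. The sketch below shows the arithmetic on made-up indicator columns. It uses plain pandas rather than VarianceThreshold so that each step is visible; the column names are hypothetical.

```python
import pandas as pd

# Made-up 0/1 features over 100 rows.
toy = pd.DataFrame({
    "rare_flag":     [1] * 5 + [0] * 95,    # 1 in 5% of rows
    "balanced_flag": [1] * 50 + [0] * 50,   # 1 in 50% of rows
    "constant":      [0] * 100,             # never changes
})

threshold = 0.90
cutoff = threshold * (1 - threshold)  # variance of a 0/1 feature with a 90/10 split

# Population variance (ddof=0) of each column; for a 0/1 column this is p * (1 - p).
variances = toy.var(ddof=0)
kept = toy.columns[variances.values > cutoff]

print(variances)
print(list(kept))
```

Only the balanced column clears the cutoff here, which mirrors why rare dummy columns (for example, uncommon action types) do not appear in the selected list above.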
