Tutorial using Kobe Bryant Dataset – Part 4

This is Part 4 of a tutorial using the Kobe Bryant dataset. You can get the data from https://www.kaggle.com/c/kobe-bryant-shot-selection . What excites me is that this dataset is excellent for practicing classification basics, feature engineering, and time series analysis. This part continues from the previous one.

Exploring the data

In [215]:
#Shot accuracy
sns.countplot(x='shot_made_flag', data=data)
In [216]:
data['shot_made_flag'].value_counts() / data['shot_made_flag'].shape
#He scores around 45% of his shots.
0.0    0.553839
1.0    0.446161
Name: shot_made_flag, dtype: float64
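The same proportions can also be computed with `value_counts(normalize=True)`, which avoids dividing by the `shape` tuple. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical and not the Kobe data):

```python
import pandas as pd

# Hypothetical stand-in for the real data: 1.0 = made, 0.0 = missed.
toy = pd.DataFrame({'shot_made_flag': [1.0, 0.0, 0.0, 1.0, 0.0]})

# normalize=True returns fractions instead of raw counts.
shares = toy['shot_made_flag'].value_counts(normalize=True)
print(shares)

# For a 0/1 flag, the mean is the share of made shots.
accuracy = toy['shot_made_flag'].mean()
print(accuracy)
```

Note that `value_counts` skips missing values by default, while `shape` counts every row, so the two approaches agree only when the column contains no NaN.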
In [218]:
# Let's look at his attempts by seconds remaining in the period:
data['timeRemaining'].plot(kind='hist', bins=24, xlim=(720, 0), figsize=(12,6),
                            title='Attempts made over time\n(seconds to the end of period)')
In [219]:
# Accuracy of those shots:
time_bins = np.arange(0, 721, 30)
attempts_in_time = pd.cut(data['timeRemaining'], time_bins, right=False)
grouped = data.groupby(attempts_in_time)
prec = grouped['shot_made_flag'].mean()

prec[::-1].plot(kind='bar', figsize=(12, 6), ylim=(0.2, 0.5), 
                title='Shot accuracy over time\n(seconds to the end of period)')
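The binning step above relies on `pd.cut` with left-closed intervals. The sketch below shows the same idea on a tiny made-up frame (`toy` and its values are hypothetical), and reports the number of attempts next to the accuracy so that sparsely populated bins are easy to spot:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: seconds left in the period and shot outcome.
toy = pd.DataFrame({
    'timeRemaining': [5, 12, 29, 30, 45, 59, 60, 61],
    'shot_made_flag': [0, 0, 1, 1, 0, 1, 1, 1],
})

# Left-closed 30-second bins: [0, 30), [30, 60), [60, 90)
bins = np.arange(0, 91, 30)
time_bin = pd.cut(toy['timeRemaining'], bins, right=False)

# observed=False keeps empty bins; count and mean are shown side by side.
summary = (toy.groupby(time_bin, observed=False)['shot_made_flag']
              .agg(attempts='count', accuracy='mean'))
print(summary)
```

One thing to keep in mind: with `right=False`, a value equal to the last bin edge (720 in the cell above) falls outside every interval, so `pd.cut` marks it as NaN and the groupby drops it.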
In [220]:
# Lots of attempts in the last 30 seconds, with much worse accuracy than usual. Let's explore that more.
# Shots in the last seconds of a period

last_30 = data[data['timeRemaining'] < 30]
last_30['shot_made_flag'].value_counts() / last_30['shot_made_flag'].shape
#In the last 30 seconds he scores only about 33% of his shots. Pressure?
0.0    0.666305
1.0    0.333695
Name: shot_made_flag, dtype: float64
In [221]:
# Let's explore what happens in the last two minutes of a period.
last_2min = data[data['timeRemaining'] <= 120]

last_2min['timeRemaining'].plot(kind='hist', bins=30, xlim=(120, 0), figsize=(12,6),
                            title='Attempts made over time\n(seconds to the end of period)')
# OK, this explains things a bit: plenty of desperate last-second shots.
In [222]:
# Let's return to the last 30 seconds.
last_30['timeRemaining'].plot(kind='hist', bins=10, xlim=(30, 0), figsize=(12,6),
                            title='Attempts made over time\n(seconds to the end of period)')
In [224]:
last_5sec_misses = data[(data['timeRemaining'] <= 5) & (data['shot_made_flag'] == 0)]
last_5sec_scores = data[(data['timeRemaining'] <= 5) & (data['shot_made_flag'] == 1)]

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(12,7))
ax1.set_ylim(800, -50)

sns.regplot(x='loc_x', y='loc_y', data=last_5sec_misses, fit_reg=False, ax=ax1, color='r')
sns.regplot(x='loc_x', y='loc_y', data=last_5sec_scores, fit_reg=False, ax=ax2, color='b')
# In the last 5 seconds there are some desperate shots from far away and plenty of misses from the 3-point line,
# but he misses a lot even from close range.
In [226]:
# Accuracy from close distance (shotDistance <= 20) in the last 5 seconds:
last_5sec_close = data[(data['timeRemaining'] <= 5) & (data['shotDistance'] <= 20)]

last_5sec_close['shot_made_flag'].value_counts() / last_5sec_close['shot_made_flag'].shape
0.0    0.604317
1.0    0.395683
Name: shot_made_flag, dtype: float64
In [227]:
#For comparison, accuracy from close distance when there are more than 5 seconds to go:
close_shots = data[(data['timeRemaining'] > 5) & (data['shotDistance'] <= 20)]

close_shots['shot_made_flag'].value_counts() / close_shots['shot_made_flag'].shape
0.0    0.512264
1.0    0.487736
Name: shot_made_flag, dtype: float64
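The two cells above can also be combined into a single comparison by grouping on a boolean key. This is only a sketch on made-up numbers (`toy` is hypothetical; the column names follow the tutorial), and it prints the number of attempts alongside the accuracy:

```python
import pandas as pd

# Hypothetical toy data; column names follow the tutorial.
toy = pd.DataFrame({
    'timeRemaining': [2, 4, 5, 40, 100, 300, 3, 600],
    'shotDistance': [10, 18, 25, 5, 15, 19, 30, 12],
    'shot_made_flag': [0, 1, 0, 1, 1, 0, 0, 1],
})

# Keep only close shots, using the same threshold as above.
close = toy[toy['shotDistance'] <= 20]

# Boolean key: True for shots in the last 5 seconds of a period.
in_last_5 = (close['timeRemaining'] <= 5).rename('last_5_seconds')

comparison = (close.groupby(in_last_5)['shot_made_flag']
                   .agg(attempts='count', accuracy='mean'))
print(comparison)
```

Seeing the attempt counts next to the accuracies helps judge how much weight a difference between the two groups deserves.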

Period accuracy

In [228]:
#Number of shots taken in each period
plt.figure(figsize =(12,6))
sns.countplot(x = 'period',hue = "shot_made_flag",data = data)
In [229]:
period_acc = data['shot_made_flag'].groupby(data['period']).mean()
period_acc.plot(kind='barh', figsize=(12, 6))

# The period of the game doesn't seem to influence his accuracy much.

Accuracy depending on shot type

In [231]:
#Combined shot type
#Number of different kinds of shots:
sns.countplot(x="combined_shot_type", hue="shot_made_flag", data=data)    
In [232]:
shot_type_acc = data['shot_made_flag'].groupby(data['combined_shot_type']).mean()
shot_type_acc.plot(kind='barh', figsize=(12, 6))

Action type

In [233]:
# Number of shots per action type
sns.countplot(y="action_type", hue="shot_made_flag", data=data)
In [237]:
action_type = data['shot_made_flag'].groupby(data['action_type']).mean()

action_type.sort_values().plot(kind='barh', figsize=(12, 18))
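When there are many action types, a category with only a handful of attempts can show an extreme accuracy. One possible refinement, sketched here on made-up data, is to keep only the types with a minimum number of attempts before sorting. The `toy` frame, its labels, and the `MIN_ATTEMPTS` threshold are illustrative assumptions, not values taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data with illustrative action type labels.
toy = pd.DataFrame({
    'action_type': ['Jump Shot'] * 4 + ['Layup Shot'] * 3 + ['Hook Shot'],
    'shot_made_flag': [0, 1, 0, 0, 1, 1, 0, 1],
})

# Attempts and accuracy per action type.
stats = (toy.groupby('action_type')['shot_made_flag']
            .agg(attempts='count', accuracy='mean'))

MIN_ATTEMPTS = 3  # hypothetical threshold, tune to the real data
reliable = stats[stats['attempts'] >= MIN_ATTEMPTS].sort_values('accuracy')
print(reliable)
```

On the real data, `reliable['accuracy'].plot(kind='barh')` would give a chart comparable to the one above, restricted to the better-sampled action types.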

Career accuracy

In [235]:
#Number of shots over seasons:
sns.countplot(x="season", hue="shot_made_flag", data=data)
In [238]:
season_acc = data['shot_made_flag'].groupby(data['season']).mean()
season_acc.plot(figsize=(12, 6), title='Accuracy over seasons')