Time Series Forecast using Kobe Bryant Dataset

Posted in Data Analysis Resources, Kaggle

This script is my attempt at time series analysis.

Pandas has dedicated functionality for handling time series (TS) objects, particularly the datetime64[ns] dtype, which stores time information and allows us to perform some operations really fast.
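As a small illustration of what that buys us, the sketch below builds a tiny series with made-up dates and values (not taken from the Kobe data) and selects all rows from one year using a partial date string.

```python
import pandas as pd

# Hypothetical dates and values, for illustration only
dates = pd.to_datetime(['2000-10-31', '2000-11-01', '2001-01-15'])
demo = pd.Series([1.0, 0.0, 1.0], index=dates)

# The index is a DatetimeIndex with a datetime64 dtype
print(demo.index.dtype)

# Partial-string indexing: every row that falls in the year 2000
year_2000 = demo.loc['2000']
print(year_2000)
```

Selecting by a year string like this only works because the index holds real timestamps rather than plain strings.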

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt   # needed for the plt.plot calls below
#%matplotlib inline
import seaborn as sns
#from matplotlib.pylab import rcParams
#rcParams['figure.figsize'] = 15, 6
In [2]:
data = pd.read_csv('data.csv')
print (data.head(2))
print ('\n Data Types:')
print (data.dtypes)
  action_type combined_shot_type  game_event_id   game_id      lat  loc_x  \
0   Jump Shot          Jump Shot             10  20000012  33.9723    167   
1   Jump Shot          Jump Shot             12  20000012  34.0443   -157   

   loc_y       lon  minutes_remaining  period   ...          shot_type  \
0     72 -118.1028                 10       1   ...     2PT Field Goal   
1      0 -118.4268                 10       1   ...     2PT Field Goal   

  shot_zone_area  shot_zone_basic  shot_zone_range     team_id  \
0  Right Side(R)        Mid-Range        16-24 ft.  1610612747   
1   Left Side(L)        Mid-Range         8-16 ft.  1610612747   

            team_name   game_date    matchup opponent  shot_id  
0  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        1  
1  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        2  

[2 rows x 25 columns]

 Data Types:
action_type            object
combined_shot_type     object
game_event_id           int64
game_id                 int64
lat                   float64
loc_x                   int64
loc_y                   int64
lon                   float64
minutes_remaining       int64
period                  int64
playoffs                int64
season                 object
seconds_remaining       int64
shot_distance           int64
shot_made_flag        float64
shot_type              object
shot_zone_area         object
shot_zone_basic        object
shot_zone_range        object
team_id                 int64
team_name              object
game_date              object
matchup                object
opponent               object
shot_id                 int64
dtype: object
In [3]:
data['shot_made_flag'].unique()
Out[3]:
array([ nan,   0.,   1.])
In [4]:
#Dropped all the nan values
data_Season = data[['season','shot_made_flag']].dropna()
In [5]:
print(data_Season.head(2))
print ('\n Data Types:')
print (data_Season.dtypes)
    season  shot_made_flag
1  2000-01             0.0
2  2000-01             1.0

 Data Types:
season             object
shot_made_flag    float64
dtype: object

The data contains the season and, for each shot in that season, a flag for whether it was made. But this is still not read as a TS object, as the data types are ‘object’ and ‘float64’. To read the data as a time series, we have to pass special arguments to the read_csv command:

In [6]:
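#specifies a function which converts an input string into a datetime variable.
#pd.datetime no longer exists in recent pandas, so the standard library datetime is used.
#'%Y-%y' assigns the year twice and which one wins can depend on the Python version,
#so only the four-digit starting year is parsed; this matches the output below.
from datetime import datetime
dateparse = lambda dates: datetime.strptime(dates[:4],'%Y')
#parse_dates: specifies the column(s) containing the date-time information, given as a list.
#index_col: tells pandas to use the 'season' column as the index.
#date_parser: still accepted, but deprecated since pandas 2.0.

data = pd.read_csv('dataSeason.csv', parse_dates=['season'],index_col = 'season',date_parser=dateparse)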
print (data.head())
print (data.dtypes)
            Unnamed: 0  shot_made_flag
season                                
2000-01-01           1             0.0
2000-01-01           2             1.0
2000-01-01           3             0.0
2000-01-01           4             1.0
2000-01-01           5             0.0
Unnamed: 0          int64
shot_made_flag    float64
dtype: object
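Since pandas 2.0 deprecated the date_parser argument of read_csv, a version-independent alternative is to read the season column as text and convert it afterwards. The sketch below does that on a made-up three-row frame standing in for dataSeason.csv. Mapping each season to 1 January of its starting year is my assumption, chosen to match the output above.

```python
import pandas as pd

# Hypothetical stand-in for the contents of dataSeason.csv
demo = pd.DataFrame({'season': ['2000-01', '2000-01', '1999-00'],
                     'shot_made_flag': [0.0, 1.0, 1.0]})

# Take the four-digit starting year ('2000' from '2000-01') and parse it as a date
demo['season'] = pd.to_datetime(demo['season'].str[:4], format='%Y')
demo = demo.set_index('season')
print(demo.index)
```

The same two lines could be applied to the real frame right after a plain pd.read_csv call.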
In [7]:
data.columns
Out[7]:
Index(['Unnamed: 0', 'shot_made_flag'], dtype='object')
In [8]:
data.drop('Unnamed: 0',axis = 1,inplace = True)
In [9]:
data.head(2)
Out[9]:
            shot_made_flag
season                    
2000-01-01             0.0
2000-01-01             1.0
In [10]:
data.index
Out[10]:
DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
               '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
               '2000-01-01', '2000-01-01',
               ...
               '1999-01-01', '1999-01-01', '1999-01-01', '1999-01-01',
               '1999-01-01', '1999-01-01', '1999-01-01', '1999-01-01',
               '1999-01-01', '1999-01-01'],
              dtype='datetime64[ns]', name='season', length=25697, freq=None)
In [11]:
ts = data['shot_made_flag']
In [12]:
ts.head(5)
Out[12]:
season
2000-01-01    0.0
2000-01-01    1.0
2000-01-01    0.0
2000-01-01    1.0
2000-01-01    0.0
Name: shot_made_flag, dtype: float64

Check Stationarity of a Time Series

In [13]:
plt.plot(ts)
Out[13]:
[<matplotlib.lines.Line2D at 0x1f497b94f28>]

The plot shows no overall increasing trend in the data, though there may be some seasonal variation.

In [14]:
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    
    #Determine rolling statistics
    #(Series.rolling replaces the deprecated pd.rolling_mean / pd.rolling_std)
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    #Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    #Perform Dickey-Fuller test:
    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)
In [15]:
test_stationarity(ts)
Results of Dickey-Fuller Test:
Test Statistic                  -162.158185
p-value                            0.000000
#Lags Used                         0.000000
Number of Observations Used    25696.000000
Critical Value (5%)               -2.861652
Critical Value (1%)               -3.430605
Critical Value (10%)              -2.566830
dtype: float64
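To read this output: the null hypothesis of the Dickey-Fuller test is that the series has a unit root (is non-stationary), and it is rejected when the test statistic is below the critical value. The helper below applies that comparison to the numbers printed above. reject_unit_root is a hypothetical name of my own and is not part of statsmodels.

```python
def reject_unit_root(test_statistic, critical_values):
    """Return, per significance level, whether the unit-root null is rejected."""
    return {level: test_statistic < cv for level, cv in critical_values.items()}

# Values copied from the Dickey-Fuller output above
critical_values = {'1%': -3.430605, '5%': -2.861652, '10%': -2.566830}
decision = reject_unit_root(-162.158185, critical_values)
print(decision)
```

The statistic is far below all three critical values, so the test rejects non-stationarity. With a binary 0/1 series and many repeated timestamps, that result should be read with caution.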
In [131]:
ts_log = np.log(ts)
plt.plot(ts_log)
Out[131]:
[<matplotlib.lines.Line2D at 0x1eb17c75518>]

I have made a mistake somewhere, and I would welcome any comments. For now, I am taking another approach to the analysis.
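One likely source of trouble, offered as a guess rather than a diagnosis: shot_made_flag only takes the values 0 and 1, so np.log maps every miss to -inf and every make to 0, and all shots in a season share the same timestamp. A possible workaround is to aggregate to one value per season, for example the share of shots made, before applying time series tools. The sketch below uses a made-up five-row series in place of ts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ts: several shots share one season timestamp
idx = pd.to_datetime(['1999-01-01', '1999-01-01', '2000-01-01',
                      '2000-01-01', '2000-01-01'])
demo = pd.Series([1.0, 0.0, 0.0, 1.0, 1.0], index=idx, name='shot_made_flag')

# log(0) is -inf, so the log transform is not usable on the raw flags
with np.errstate(divide='ignore'):
    logged = np.log(demo)
print(np.isinf(logged).sum())   # number of misses turned into -inf

# One observation per season: the mean of the 0/1 flags is the share of shots made
season_accuracy = demo.groupby(level=0).mean()
print(season_accuracy)
```

The resulting series has unique, ordered timestamps, which is what the rolling statistics and the Dickey-Fuller test implicitly expect.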

Data Visualisation

In [21]:
data_Season.head(2)
Out[21]:
    season  shot_made_flag
1  2000-01             0.0
2  2000-01             1.0
In [25]:
import matplotlib.mlab as mlab
print (len(data_Season))
25697
In [39]:
#See target class distribution
ax = plt.axes()
sns.countplot(x='shot_made_flag', data=data_Season, ax=ax);
ax.set_title('Target class distribution')
plt.show()
In [49]:
pred = sns.boxplot(x='shot_made_flag', y='season', data=data_Season, showmeans=True)
