This script is my attempt at time series analysis.
Pandas has dedicated support for handling TS objects, particularly the datetime64[ns] dtype, which stores time information and allows us to perform some operations really fast.
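As a small illustration of what a datetime index makes possible, here is a minimal sketch. The values and the name `toy` are made up for the example and are not taken from data.csv.

```python
import pandas as pd

# Hypothetical toy series, one value per year (not the real data).
idx = pd.to_datetime(['2000-01-01', '2001-01-01', '2002-01-01'])
toy = pd.Series([0.45, 0.47, 0.43], index=idx)

# The index is a DatetimeIndex with a datetime64 dtype.
print(toy.index.dtype)

# Select everything that falls in a given year using a partial date string.
in_2001 = toy.loc['2001']
print(in_2001)

# Filter on a date component without any string handling.
from_2001 = toy[toy.index.year >= 2001]
print(from_2001)
```

Selections like these only work once the index really is a datetime index, which is why the parsing step matters.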
In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.plot calls below
#%matplotlib inline
import seaborn as sns
#from matplotlib.pylab import rcParams
#rcParams['figure.figsize'] = 15, 6
In [2]:
data = pd.read_csv('data.csv')
print (data.head(2))
print ('\n Data Types:')
print (data.dtypes)
In [3]:
data['shot_made_flag'].unique()
Out[3]:
In [4]:
#Dropped all the nan values
data_Season = data[['season','shot_made_flag']].dropna()
In [5]:
print(data_Season.head(2))
print ('\n Data Types:')
print (data_Season.dtypes)
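One way to go from one row per shot to one row per season is a groupby, sketched below. This is only an assumption about how a per-season table could be built. The frame `shots` and its values are hypothetical, and it is not necessarily how dataSeason.csv was produced.

```python
import pandas as pd

# Hypothetical per-shot rows in the same shape as data_Season.
shots = pd.DataFrame({
    'season': ['2000-01', '2000-01', '2001-02', '2001-02', '2001-02'],
    'shot_made_flag': [1.0, 0.0, 1.0, 1.0, 0.0],
})

# Shots made, shots attempted, and success rate for each season.
per_season = shots.groupby('season')['shot_made_flag'].agg(['sum', 'count', 'mean'])
print(per_season)
```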
The data contains a particular season and the number of shots in that season. However, it is still not read as a TS object, because the data types are 'object' and 'float'. To read the data as a time series, the 'season' column has to be parsed as dates and used as the index when loading the file:
In [6]:
from datetime import datetime
# Converts a season string in '%Y-%y' form into a datetime value (pd.datetime is no longer available in current pandas).
dateparse = lambda dates: datetime.strptime(dates, '%Y-%y')
# index_col: tells pandas to use the 'season' column as index
data = pd.read_csv('dataSeason.csv', index_col='season')
# Parse the index after loading, since the date_parser argument is deprecated in pandas 2.0 and later.
data.index = pd.to_datetime(data.index.map(dateparse))
print (data.head())
print (data.dtypes)
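An alternative to the strptime-based parsing above is to keep only the starting year of each season. This is a sketch, assuming the season labels look like '2000-01'. The sample values and the names `seasons` and `season_start` are hypothetical.

```python
import pandas as pd

# Hypothetical season labels (assumed 'YYYY-YY' format).
seasons = pd.Series(['1996-97', '1997-98', '2000-01'])

# Take the first four characters as the starting year of the season.
start_year = seasons.str.slice(0, 4)

# Turn the year strings into timestamps at January 1 of that year.
season_start = pd.to_datetime(start_year, format='%Y')
print(season_start)
```

This makes it explicit which of the two years in the label ends up in the index.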
In [7]:
data.columns
Out[7]:
In [8]:
data.drop('Unnamed: 0',axis = 1,inplace = True)
In [9]:
data.head(2)
Out[9]:
In [10]:
data.index
Out[10]:
In [11]:
ts = data['shot_made_flag']
In [12]:
ts.head(5)
Out[12]:
Check Stationarity of a Time Series
In [13]:
plt.plot(ts)
Out[13]:
It is clearly evident that there is no overall increasing trend in the data, although there are some seasonal variations.
In [14]:
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    # Determine rolling statistics (pd.rolling_mean and pd.rolling_std no longer exist in current pandas)
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()
    # Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    # Perform Dickey-Fuller test:
    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    # Add the critical values at each significance level to the summary
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print (dfoutput)
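To see what the rolling statistics in the function compute, here is a minimal sketch on a made-up series. It uses a window of 3 instead of 12 so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy series.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean and sample standard deviation over a sliding window of 3 values.
rolmean = toy.rolling(window=3).mean()
rolstd = toy.rolling(window=3).std()

# The first window - 1 entries are NaN because the window is not yet full.
print(rolmean)
print(rolstd)
```

With window=12, the first 11 points of the rolling curves are missing for the same reason, so a short series leaves very little to plot.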
In [15]:
test_stationarity(ts)
In [131]:
ts_log = np.log(ts)
plt.plot(ts_log)
Out[131]:
I have made a mistake somewhere, and I would welcome any comments. For now, I am taking another approach to the analysis.
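One possible cause, offered only as a guess since the output is not shown: if the series contains zeros, np.log returns -inf for them, and the plot breaks down. The sketch below uses made-up values to show the effect, with np.log1p as one way around it.

```python
import numpy as np
import pandas as pd

# Hypothetical values that include a zero.
toy = pd.Series([0.0, 1.0, 0.5])

# log(0) is -inf; silence the divide-by-zero warning for the demo.
with np.errstate(divide='ignore'):
    logged = np.log(toy)
print(np.isinf(logged).sum())

# log1p computes log(1 + x), which stays finite at zero.
shifted = np.log1p(toy)
print(shifted)
```

Whether this applies depends on what `ts` actually contains, so checking `ts.min()` first would be a reasonable step.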
Data Visualisation
In [21]:
data_Season.head(2)
Out[21]:
In [25]:
import matplotlib.mlab as mlab
print (len(data_Season))
In [39]:
#See target class distribution
ax = plt.axes()
sns.countplot(x='shot_made_flag', data=data_Season, ax=ax);
ax.set_title('Target class distribution')
plt.show()
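The count plot shows the class balance visually. The same information can be read as numbers with value_counts, sketched here on a hypothetical frame in the same shape as data_Season.

```python
import pandas as pd

# Hypothetical shot outcomes (not the real data).
toy = pd.DataFrame({'shot_made_flag': [1.0, 0.0, 0.0, 1.0, 0.0]})

# Absolute counts per class.
counts = toy['shot_made_flag'].value_counts()
print(counts)

# Share of each class, which is easier to compare than raw counts.
proportions = toy['shot_made_flag'].value_counts(normalize=True)
print(proportions)
```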
In [49]:
pred = sns.boxplot(x='shot_made_flag', y='season', data=data_Season, showmeans=True)