Categories

# Guide for Linear Regression using Python – Part 2

Guide for Linear Regression using Python – Part 2

This blog is the continuation of guide for linear regression using Python from this post.

1. There must be no correlation among independent variables. Multicollinearity is the presence of correlation in independent variables. If variables are correlated, it becomes extremely difficult for the model to determine the true effect of X on Y or find out which variable is contributing to predict the response variable. Additionally, the presence of correlated predictors tends to increase the standard errors. With large standard errors, the confidence interval becomes wider leading to less precise estimates of slope parameters. Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are available in the model. If this happens, we’ll end up with an incorrect conclusion that a variable strongly / weakly affects target variable.

How to check for multicollinearity:

• Use scatter plot to visualize correlation effect among variables.
• Use VIF factor. VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity.
• Correlation table

Steps: to overcome the issue of multicollinearity, use a correlation matrix to check correlated variables. Let’s say variables A and B are highly correlated. Now, instead of removing one of them, use this approach: Find the average correlation of A and B with the rest of the variables. Whichever variable has the higher average in comparison with other variables, remove it. Alternatively, we can use penalized regression methods such as lasso, ridge, elastic net, etc.

1. Error terms must be uncorrelated. Error at ∈t must not indicate the at error at ∈t+1.The presence of correlation in error terms is known as Autocorrelation. It drastically affects the regression coefficients and standard error values. It causes confidence intervals and prediction intervals to be narrower. Narrower confidence interval means that a 95% confidence interval would have lesser probability than 0.95 that it would contain the actual value of coefficients. Also, lower standard errors would cause the associated p-values to be lower than actual. This will make us incorrectly conclude a parameter to be statistically significant. It drastically reduces model’s accuracy. This usually occurs in time series models where the next instant is dependent on previous instant.

How to check for autocorrelation:

• Look for Durbin – Watson (DW) statistic. It must lie between 0 and 4. If DW = 2, implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation while 2 < DW < 4 indicates negative autocorrelation.
• Look for the seasonal or correlated pattern in residual values in residual vs time plot.
1. Dependent variable and the error terms must possess a normal distribution. Presence of non – normal distribution suggests that there are a few unusual data points which must be studied closely to make a better model.

Presence of non- normally distributed error terms lead to either too wide or narrow confidence intervals. Unstable confidence level leads to difficulty in estimating coefficients based on minimization of least squares.

How to check for normal distribution:

• Look at Normality QQ plot
• Perform statistical tests of normality such as Kolmogorov-Smirnov test, Shapiro-Wilk test.

#### Normality Q-Q Plot

q-q or quantile-quantile is a scatter plot which helps us validate the assumption of normal distribution. It uses standardized values of residuals to determine the normal distribution of errors. Ideally, this plot should show a straight line. A curved, distorted line suggests residuals have a non-normal distribution.

Image for q-q plot : https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

Steps: to overcome the issue if the errors are not normally distributed is through the nonlinear transformation of the both response or predictors variables.

### Interpretation of Residuals vs Leverage Plot

Residuals vs Leverage Plot is also known as Cook’s Distance plot. Cook’s distance attempts to identify the points which have more influence than other points. Such influential points tend to have a sizable impact on the regression line. The large values marked by Cook’s distance might require further investigation. Image: https://www.researchgate.net/figure/254457803_fig5_Figure-5-Residuals-vs-Leverage-plot-for-DB-GLM-with-Poisson-response-and-Logarithmic

Steps: to solve the issue for influential observations which are nothing but outliers is to remove those rows. Alternatively, we can scale down the outlier observation with the maximum value in data or else treat those values as missing values. The treatment of data points as outliers require a good domain knowledge

## How to know if the regression model is the best fit for your data?

The metrics used to determine model fit can have different values based on the type of data. Hence, we need to be extremely careful while interpreting regression analysis.

Following are some metrics you can use to evaluate your regression model:

1. p-Value are very important because we can consider a model to be statistically significant only when p-Values are less than the pre-determined statistical significance level, which is ideally 0.05. In Linear Regression, the Null Hypothesis is that the coefficients associated with the variables is equal to zero. The alternate hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).
2. t-value: Pr(>|t|) or p-value is the probability that you get a t-value as high or higher than the observed value when the Null Hypothesis is true. If the Pr(>|t|) is low, the coefficients are significant (significantly different from zero) and vice versa.

It is important for the model to be statistically significant before we can go ahead and use it to predict the dependent variable, otherwise, the confidence in predicted values from that model reduces and may be construed as an event of chance.

1. R Square (Coefficient of Determination): explains the percentage of variance explained by covariates in the model. It tells us the proportion of variation in the dependent (response) variable that has been explained by the model. It ranges between 0 and 1. Usually, higher values are desirable but it rests on the data quality and domain. We don’t necessarily discard a model based on a low R-Squared value. Its a better practice to look at the AIC and prediction accuracy on validation sample when deciding on the efficacy of a model.
2. Adjusted R²: penalizes total value for the number of predictors in the model. It doesn’t increase, stays same or decrease unless the newly added variable is truly useful. Therefore, when comparing nested models, it is a good practice to look at Adjusted R² value over R-squared.
3. F Statistics: is the ratio of explained variance of the model by unexplained variance. It compares the full model with an intercept to evaluates the overall significance of the model. Its value can range between zero and any arbitrary large number.
4. AIC and BIC: The Akaike’s information criterion – AIC and the Bayesian information criterion – BIC are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.
5. Mallow’s Cp: is defined as a criterion to assess fits when models with different numbers of parameters are being compared.
6. RMSE / MSE / MAE: are crucial error metric.
• MSE: is mean squared error. It amplifies the impact of outliers on the model’s accuracy.
• MAE: is mean absolute error. It is robust against the effect of outliers.
• RMSE: is root mean square error. It is interpreted as how far on an average; the residuals are from zero. It nullifies squared effect of MSE by square root and provides the result in original units as data.
1. MinMax Accuracy: is defined as mean (min(actual, predicted)/max(actual, predicted)). It tells how far the model’s prediction is off. For a perfect model, this measure is 1.0. The lower the measure, the worse the model, based on out-of-sample performance.

## CRITERION

R-Squared Higher the better (> 0.70)
F-Statistic Higher the better
Std. Error Closer to zero the better
t-statistic Should be greater 1.96 for p-value to be less than 0.05
AIC Lower the better
BIC Lower the better
Mallows Cp Should be close to the number of predictors in model
MAE Lower the better
MSE (Mean squared error) Lower the better
MinMax Accuracy Higher the better

## Example Problem

Let’s use our theoretical knowledge and create a model practically. For this analysis, we will use the diamond dataset. This dataset can be downloaded from here.

Let’s load the data set and do initial data analysis.

```import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
warnings.filterwarnings('ignore')

# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (4, 3)
```
In :
```#load the file and set the first column as the index
train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Assignment 1\diamonds.csv",index_col = 0)
```
In :
```train.head()
```
Out:
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
In :
```print ("\n\n---------------------")
print ("TRAIN SET INFORMATION")
print ("---------------------")
print ("Shape of training set:", train.shape, "\n")
print (train.dtypes)
```
```---------------------
TRAIN SET INFORMATION
---------------------
Shape of training set: (53940, 10)

Column Headers: ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object
```

So there are 53940 number of observations and 10 different variables.

The different kind of variables are : ‘carat’, ‘cut’, ‘color’, ‘clarity’, ‘depth’, ‘table’, ‘price’, ‘x’, ‘y’, ‘z’

In :
```import re
missing_values = []
nonumeric_values = []

print ("TRAINING SET INFORMATION")
print ("========================\n")

for column in train:
# Find all the unique feature values
uniq = train[column].unique()
print ("'{}' has {} unique values" .format(column,uniq.size))
if (uniq.size > 10):
print("~~Listing up to 10 unique values~~")
print (uniq[0:10])
print ("\n-----------------------------------------------------------------------\n")

# Find features with missing values
if (True in pd.isnull(uniq)):
s = "{} has {} missing" .format(column, pd.isnull(train[column]).sum())
missing_values.append(s)

# Find features with non-numeric values
for i in range (1, np.prod(uniq.shape)):
if (re.match('nan', str(uniq[i]))):
break
if not (re.search('(^\d+\.?\d*\$)|(^\d*\.?\d+\$)', str(uniq[i]))):
nonumeric_values.append(column)
break

print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
```
```TRAINING SET INFORMATION
========================

'carat' has 273 unique values
~~Listing up to 10 unique values~~
[ 0.23  0.21  0.29  0.31  0.24  0.26  0.22  0.3   0.2   0.32]

-----------------------------------------------------------------------

'cut' has 5 unique values
['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

-----------------------------------------------------------------------

'color' has 7 unique values
['E' 'I' 'J' 'H' 'F' 'G' 'D']

-----------------------------------------------------------------------

'clarity' has 8 unique values
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']

-----------------------------------------------------------------------

'depth' has 184 unique values
~~Listing up to 10 unique values~~
[ 61.5  59.8  56.9  62.4  63.3  62.8  62.3  61.9  65.1  59.4]

-----------------------------------------------------------------------

'table' has 127 unique values
~~Listing up to 10 unique values~~
[ 55.  61.  65.  58.  57.  56.  54.  62.  59.  63.]

-----------------------------------------------------------------------

'price' has 11602 unique values
~~Listing up to 10 unique values~~
[326 327 334 335 336 337 338 339 340 342]

-----------------------------------------------------------------------

'x' has 554 unique values
~~Listing up to 10 unique values~~
[ 3.95  3.89  4.05  4.2   4.34  3.94  4.07  3.87  4.    4.25]

-----------------------------------------------------------------------

'y' has 552 unique values
~~Listing up to 10 unique values~~
[ 3.98  3.84  4.07  4.23  4.35  3.96  4.11  3.78  4.05  4.28]

-----------------------------------------------------------------------

'z' has 375 unique values
~~Listing up to 10 unique values~~
[ 2.43  2.31  2.63  2.75  2.48  2.47  2.53  2.49  2.39  2.73]

-----------------------------------------------------------------------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Features with missing values:
[]

Features with non-numeric values:
['cut', 'color', 'clarity']

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

```
In :
```# Check for duplicate index or rows
idsUnique = len(set(train.index))
idsTotal = train.shape
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
```
```There are 0 duplicate IDs for 53940 total entries
```
##### Descriptive Statistics
In :
```#get summary of numerical variables
train.describe()
```
Out:
carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000

We can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure. It looks that Price is skewed

In :
```# Skewness of the distribution

print(train.skew())

# Values close to 0 show less skew
# The attributes are not skewed too much. Some algos may benefit if skew is corrected
```
```carat    1.116646
depth   -0.082294
table    0.796896
price    1.618395
x        0.378676
y        2.434167
z        1.522423
dtype: float64
```
In :
```# Skewness of the distribution

print(train.kurt())

#Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable.
```
```carat     1.256635
depth     5.739415
table     2.801857
price     2.177696
x        -0.618161
y        91.214557
z        47.086619
dtype: float64
```
###### We have ‘x’,’y’,’z’ as 0. let us check how many are there and deal with the outliers
In :
```#Check when a value in a cloumn is zero
train[(train['z'] == 0)].count()
```
Out:
```carat      20
cut        20
color      20
clarity    20
depth      20
table      20
price      20
x          20
y          20
z          20
dtype: int64```
In :
```#Separate carat and price
df = train[['carat','price']]
```
In :
```df.head()
```
Out:
carat price
1 0.23 326
2 0.21 326
3 0.23 327
4 0.29 334
5 0.31 335
In :
```#draw a histogram and not fit a kernel density estimate (KDE).
sns.distplot(df['carat'], kde = True, color = 'r', hist_kws={'alpha': 0.9})
```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33c55df60>` We see immediately that the carat weights are positively skewed: most diamonds are around 1 carat or below but there are extreme cases of larger diamonds. The plot above has fairly wide bins and there doesn’t appear to be any data beyond a carat size of 3.5. We can make try to get more out of hour histogram by adding some additional arguments to control the size of the bins and limits of the x-axis:

##### Boxplots

Boxplots are another type of univariate plot for summarizing distributions of numeric data graphically. Let’s make a boxplot of carat using the pd.boxplot() function:

In :
```train.boxplot(column="carat")
```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33c55db70>` the central box of the boxplot represents the middle 50% of the observations, the central bar is the median and the bars at the end of the dotted lines (whiskers) encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers. In this case, our data set has over 50,000 observations and we see many data points beyond the top whisker. We probably wouldn’t want to classify all of those points as outliers, but the handful of diamonds at 4 carats and above are definitely far outside the norm.

In :
```plt.rcParams['figure.figsize'] = (6.0, 3.0)
prices = pd.DataFrame({"price":train["price"], "log(price + 1)":np.log1p(train["price"])})
prices.hist(bins = 50)
```
Out:
```array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001D33CDD9390>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001D33CE8E048>]], dtype=object)``` This is a long tail distribution, with a high concentration of observations below the U\$5,000 mark. The distribution is right-skewed with small amounts of very large prices driving up the mean, while the median remains a more robust measure of the center of the distribution.

We can see that the data show some evidence of bomiodality on the scale of log10 scale. So there are two different kind of diamonds available for richer and porer buyers.

In :
```plt.rcParams['figure.figsize'] = (6.0, 3.0)
carat = pd.DataFrame({"carat":train["carat"], "log(carat + 1)":np.log1p(train["carat"])})
carat.hist(bins = 50)
```
Out:
```array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001D33CF55FD0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001D33D122BA8>]], dtype=object)``` In :
```train.plot(kind="scatter",     # Create a scatterplot
x="carat",          # Put carat on the x axis
y="price",          # Put price on the y axis
figsize=(5,5),
ylim=(0,2000))
```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33cde30f0>` It is a non-linear relationship and dispersion(variance) of the relationship also increases as carat size increases. Even though size/weight of the diamond set is a critical factor , the relationship is non-linear. Therefore, running a linear model will be a bad idea.

##### Density Plots

A density plot shows the distribution of a numeric variable with a continuous curve. It is similar to a histogram but without discrete bins, a density plot gives a better picture of the underlying shape of a distribution. Create a density plot with series.plot(kind=”density”)

In :
```df["carat"].plot(kind="density",  # Create density plot
figsize=(4,4),    # Set figure size
xlim= (0,5))      # Limit x axis values
```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33e90bc18>` In :
```df["price"].plot(kind="density",  # Create density plot
figsize=(4,4))    # Set figure size

```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33ea10b70>` In :
```correlationMatrix = df.corr().abs()

plt.subplots(figsize=(7, 4))
sns.heatmap(correlationMatrix,annot=True)

plt.show()
``` The above plot show 92% correlation between price and carat. It’s up to us if we should consider this correlation % as a damaging level. Usually, correlation above 80% (subjective) is considered higher, hence, we will not forego this combination.

### Linear Regression

In :
```import statsmodels.api as sm
import statsmodels.formula.api as smf
print ("OLS regression model for the association between price and carat")
# reg1 is the model name , followed by equal sign.
reg1 = smf.ols(formula = 'price ~ carat', data=df).fit()
# print the result
print (reg1.summary())
```
```OLS regression model for the association between price and carat
OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.849
Method:                 Least Squares   F-statistic:                 3.041e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:50:59   Log-Likelihood:            -4.7273e+05
No. Observations:               53940   AIC:                         9.455e+05
Df Residuals:                   53938   BIC:                         9.455e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept  -2256.3606     13.055   -172.830      0.000     -2281.949 -2230.772
carat       7756.4256     14.067    551.408      0.000      7728.855  7783.996
==============================================================================
Omnibus:                    14025.341   Durbin-Watson:                   0.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           153030.525
Skew:                           0.939   Prob(JB):                         0.00
Kurtosis:                      11.035   Cond. No.                         3.65
==============================================================================

Warnings:
 Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

Dep. variable is the name of the response variable

No. of observations show the valid observations

F-statistics is 3.041e+05 and Prob(F-statistics is very small. Hence we can reject the null hypothesis. )

Parameter estimates : Intercept and carat therefore

price = -2256.3606 + 7756.4256 * carat

p>|t| is the pearson value for explanotory variables… when 0.000 p < 0.001

R-squared value : proportion of the variance. such as 84.9 % variance in the dataset.

• Intercept – This is the βo value. It’s the prediction made by model when all the independent variables are set to zero.

• Estimate – This represents regression coefficients for respective variables. It’s the value of slope. Let’s interpret it for Chord_Length. We can say, when Chord_Length is increased by 1 unit, holding other variables constant, Sound_pressure_level decreases by a value of -35.69.

• Std. Error – This determines the level of variability associated with the estimates. Smaller the standard error of an estimate is, more accurate will be the predictions.

• t value – t statistic is generally used to determine variable significance, i.e. if a variable is significantly adding information to the model. t value > 2 suggests the variable is significant. I used it as an optional value as the same information can be extracted from the p value.

• p value – It’s the probability value of respective variables determining their significance in the model. p value < 0.05 is always desirable.

In :
```#Q-Q plot for normality
fig4=sm.qqplot(reg1.resid, line='r')
``` In :
```# simple plot of residuals
sns.residplot('carat','price', data=df)
```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33e9da8d0>` Among all, Residual vs. Fitted value catches my attention. Not exactly though, but I see signs of heteroskedasticity in this data. Remember funnel shape? You can see a similar pattern. To overcome this situation, we’ll build another model with log(y).

In :
```df['log_price'] = np.log(1 + df.price)
df['log_carat'] = np.log(1 + df.carat)
```
In :
```df.head()
```
Out:
carat price log_price log_carat
1 0.23 326 5.789960 0.207014
2 0.21 326 5.789960 0.190620
3 0.23 327 5.793014 0.207014
4 0.29 334 5.814131 0.254642
5 0.31 335 5.817111 0.270027
In :
```print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.

reg_log = smf.ols(formula = 'log_price ~ carat', data=df).fit()
# print the result
print (reg_log.summary())
```
```OLS regression model for the association between price and carat
OLS Regression Results
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.847
Method:                 Least Squares   F-statistic:                 2.983e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:51:01   Log-Likelihood:                -26685.
No. Observations:               53940   AIC:                         5.337e+04
Df Residuals:                   53938   BIC:                         5.339e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      6.2164      0.003   1858.062      0.000         6.210     6.223
carat          1.9688      0.004    546.166      0.000         1.962     1.976
==============================================================================
Omnibus:                    10807.191   Durbin-Watson:                   0.977
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            71530.061
Skew:                          -0.803   Prob(JB):                         0.00
Kurtosis:                       8.408   Cond. No.                         3.65
==============================================================================

Warnings:
 Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
In :
```#Q-Q plot for normality
fig4=sm.qqplot(reg_log.resid, line='r')
``` In :
```sns.residplot('carat','log_price', data=df,lowess= True, scatter_kws={"s": 5})
```
Out:
`<matplotlib.axes._subplots.AxesSubplot at 0x1d33d0ca550>`