Guide for Linear Regression using Python – Part 2

This blog is the continuation of guide for linear regression using Python from this post.

**There must be no correlation among independent variables**. Multicollinearity is the presence of correlation in independent variables. If variables are correlated, it becomes extremely difficult for the model to determine the true effect of X on Y or find out which variable is contributing to predict the response variable. Additionally, the presence of correlated predictors tends to increase the standard errors. With large standard errors, the confidence interval becomes wider leading to less precise estimates of slope parameters. Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are available in the model. If this happens, we’ll end up with an incorrect conclusion that a variable strongly / weakly affects target variable.

**How to check for multicollinearity**:

- Use scatter plot to visualize correlation effect among variables.
- Use VIF factor. VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity.
- Correlation table

**Steps**: to overcome the issue of multicollinearity, use a correlation matrix to check correlated variables. Let’s say variables A and B are highly correlated. Now, instead of removing one of them, use this approach: Find the average correlation of A and B with the rest of the variables. Whichever variable has the higher average in comparison with other variables, remove it. Alternatively, we can use penalized regression methods such as lasso, ridge, elastic net, etc.

- Error terms must be uncorrelated. Error at ∈t must not indicate the at error at ∈t+1.The presence of correlation in error terms is known as Autocorrelation. It drastically affects the regression coefficients and standard error values. It causes confidence intervals and prediction intervals to be narrower. Narrower confidence interval means that a 95% confidence interval would have lesser probability than 0.95 that it would contain the actual value of coefficients. Also, lower standard errors would cause the associated p-values to be lower than actual. This will make us incorrectly conclude a parameter to be statistically significant. It drastically reduces model’s accuracy. This usually occurs in time series models where the next instant is dependent on previous instant.

**How to check for autocorrelation**:

- Look for Durbin – Watson (DW) statistic. It must lie between 0 and 4. If DW = 2, implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation while 2 < DW < 4 indicates negative autocorrelation.
- Look for the seasonal or correlated pattern in residual values in residual vs time plot.

**Dependent variable and the error terms must possess a normal distribution**. Presence of non – normal distribution suggests that there are a few unusual data points which must be studied closely to make a better model.

Presence of non- normally distributed error terms lead to either too wide or narrow confidence intervals. Unstable confidence level leads to difficulty in estimating coefficients based on minimization of least squares.

**How to check for normal distribution**:

- Look at Normality QQ plot
- Perform statistical tests of normality such as Kolmogorov-Smirnov test, Shapiro-Wilk test.

#### Normality Q-Q Plot

q-q or quantile-quantile is a scatter plot which helps us validate the assumption of normal distribution. It uses standardized values of residuals to determine the normal distribution of errors. Ideally, this plot should show a straight line. A curved, distorted line suggests residuals have a non-normal distribution.

Image for q-q plot : https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

**Steps**: to overcome the issue if the errors are not normally distributed is through the nonlinear transformation of the both response or predictors variables.

### Interpretation of Residuals vs Leverage Plot

Residuals vs Leverage Plot is also known as Cook’s Distance plot. Cook’s distance attempts to identify the points which have more influence than other points. Such influential points tend to have a sizable impact on the regression line. The large values marked by Cook’s distance might require further investigation.

Image: https://www.researchgate.net/figure/254457803_fig5_Figure-5-Residuals-vs-Leverage-plot-for-DB-GLM-with-Poisson-response-and-Logarithmic

**Steps**: to solve the issue for influential observations which are nothing but outliers is to remove those rows. Alternatively, we can scale down the outlier observation with the maximum value in data or else treat those values as missing values. The treatment of data points as outliers require a good domain knowledge

## How to know if the regression model is the best fit for your data?

The metrics used to determine model fit can have different values based on the type of data. Hence, we need to be extremely careful while interpreting regression analysis.

Following are some metrics you can use to evaluate your regression model:

**p-Value**are very important because we can consider a model to be statistically significant only when p-Values are less than the pre-determined statistical significance level, which is ideally 0.05. In Linear Regression, the Null Hypothesis is that the coefficients associated with the variables is equal to zero. The alternate hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).**t-value**: Pr(>|t|) or p-value is the probability that you get a t-value as high or higher than the observed value when the Null Hypothesis is true. If the Pr(>|t|) is low, the coefficients are significant (significantly different from zero) and vice versa.

It is important for the model to be statistically significant before we can go ahead and use it to predict the dependent variable, otherwise, the confidence in predicted values from that model reduces and may be construed as an event of chance.

**R Square (Coefficient of Determination)**: explains the percentage of variance explained by covariates in the model. It tells us the proportion of variation in the dependent (response) variable that has been explained by the model. It ranges between 0 and 1. Usually, higher values are desirable but it rests on the data quality and domain. We don’t necessarily discard a model based on a low R-Squared value. Its a better practice to look at the AIC and prediction accuracy on validation sample when deciding on the efficacy of a model.**Adjusted R²**: penalizes total value for the number of predictors in the model. It doesn’t increase, stays same or decrease unless the newly added variable is truly useful. Therefore, when comparing nested models, it is a good practice to look at Adjusted R² value over R-squared.**F Statistics**: is the ratio of explained variance of the model by unexplained variance. It compares the full model with an intercept to evaluates the overall significance of the model. Its value can range between zero and any arbitrary large number.**AIC and BIC**: The Akaike’s information criterion – AIC and the Bayesian information criterion – BIC are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.**Mallow’s Cp**: is defined as a criterion to assess fits when models with different numbers of parameters are being compared.**RMSE / MSE / MAE**: are crucial error metric.

- MSE: is mean squared error. It amplifies the impact of outliers on the model’s accuracy.
- MAE: is mean absolute error. It is robust against the effect of outliers.
- RMSE: is root mean square error. It is interpreted as how far on an average; the residuals are from zero. It nullifies squared effect of MSE by square root and provides the result in original units as data.

**MinMax Accuracy**: is defined as mean (min(actual, predicted)/max(actual, predicted)). It tells how far the model’s prediction is off. For a perfect model, this measure is 1.0. The lower the measure, the worse the model, based on out-of-sample performance.

## STATISTIC |
## CRITERION |

R-Squared | Higher the better (> 0.70) |

Adj R-Squared | Higher the better |

F-Statistic | Higher the better |

Std. Error | Closer to zero the better |

t-statistic | Should be greater 1.96 for p-value to be less than 0.05 |

AIC | Lower the better |

BIC | Lower the better |

Mallows Cp | Should be close to the number of predictors in model |

MAE | Lower the better |

MSE (Mean squared error) | Lower the better |

MinMax Accuracy | Higher the better |

## Example Problem

Let’s use our theoretical knowledge and create a model practically. For this analysis, we will use the diamond dataset. This dataset can be downloaded from here.

Let’s load the data set and do initial data analysis.

```
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
warnings.filterwarnings('ignore')
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (4, 3)
```

```
#load the file and set the first column as the index
train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Assignment 1\diamonds.csv",index_col = 0)
```

```
train.head()
```

```
print ("\n\n---------------------")
print ("TRAIN SET INFORMATION")
print ("---------------------")
print ("Shape of training set:", train.shape, "\n")
print ("Column Headers:", list(train.columns.values), "\n")
print (train.dtypes)
```

So there are 53940 number of observations and 10 different variables.

The different kind of variables are : ‘carat’, ‘cut’, ‘color’, ‘clarity’, ‘depth’, ‘table’, ‘price’, ‘x’, ‘y’, ‘z’

```
import re
missing_values = []
nonumeric_values = []
print ("TRAINING SET INFORMATION")
print ("========================\n")
for column in train:
# Find all the unique feature values
uniq = train[column].unique()
print ("'{}' has {} unique values" .format(column,uniq.size))
if (uniq.size > 10):
print("~~Listing up to 10 unique values~~")
print (uniq[0:10])
print ("\n-----------------------------------------------------------------------\n")
# Find features with missing values
if (True in pd.isnull(uniq)):
s = "{} has {} missing" .format(column, pd.isnull(train[column]).sum())
missing_values.append(s)
# Find features with non-numeric values
for i in range (1, np.prod(uniq.shape)):
if (re.match('nan', str(uniq[i]))):
break
if not (re.search('(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(uniq[i]))):
nonumeric_values.append(column)
break
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
```

```
# Check for duplicate index or rows
idsUnique = len(set(train.index))
idsTotal = train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
```

##### Descriptive Statistics

```
#get summary of numerical variables
train.describe()
```

We can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure. It looks that Price is skewed

```
# Skewness of the distribution
print(train.skew())
# Values close to 0 show less skew
# The attributes are not skewed too much. Some algos may benefit if skew is corrected
```

```
# Skewness of the distribution
print(train.kurt())
#Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable.
```

###### We have ‘x’,’y’,’z’ as 0. let us check how many are there and deal with the outliers

```
#Check when a value in a cloumn is zero
train[(train['z'] == 0)].count()
```

```
#Separate carat and price
df = train[['carat','price']]
```

```
df.head()
```

```
#draw a histogram and not fit a kernel density estimate (KDE).
sns.distplot(df['carat'], kde = True, color = 'r', hist_kws={'alpha': 0.9})
```

We see immediately that the carat weights are positively skewed: most diamonds are around 1 carat or below but there are extreme cases of larger diamonds. The plot above has fairly wide bins and there doesn’t appear to be any data beyond a carat size of 3.5. We can make try to get more out of hour histogram by adding some additional arguments to control the size of the bins and limits of the x-axis:

##### Boxplots

Boxplots are another type of univariate plot for summarizing distributions of numeric data graphically. Let’s make a boxplot of carat using the pd.boxplot() function:

```
train.boxplot(column="carat")
```

the central box of the boxplot represents the middle 50% of the observations, the central bar is the median and the bars at the end of the dotted lines (whiskers) encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers. In this case, our data set has over 50,000 observations and we see many data points beyond the top whisker. We probably wouldn’t want to classify all of those points as outliers, but the handful of diamonds at 4 carats and above are definitely far outside the norm.

```
plt.rcParams['figure.figsize'] = (6.0, 3.0)
prices = pd.DataFrame({"price":train["price"], "log(price + 1)":np.log1p(train["price"])})
prices.hist(bins = 50)
```

This is a long tail distribution, with a high concentration of observations below the U$5,000 mark. The distribution is right-skewed with small amounts of very large prices driving up the mean, while the median remains a more robust measure of the center of the distribution.

We can see that the data show some evidence of bomiodality on the scale of log10 scale. So there are two different kind of diamonds available for richer and porer buyers.

```
plt.rcParams['figure.figsize'] = (6.0, 3.0)
carat = pd.DataFrame({"carat":train["carat"], "log(carat + 1)":np.log1p(train["carat"])})
carat.hist(bins = 50)
```

```
train.plot(kind="scatter", # Create a scatterplot
x="carat", # Put carat on the x axis
y="price", # Put price on the y axis
figsize=(5,5),
ylim=(0,2000))
```

It is a non-linear relationship and dispersion(variance) of the relationship also increases as carat size increases. Even though size/weight of the diamond set is a critical factor , the relationship is non-linear. Therefore, running a linear model will be a bad idea.

##### Density Plots

A density plot shows the distribution of a numeric variable with a continuous curve. It is similar to a histogram but without discrete bins, a density plot gives a better picture of the underlying shape of a distribution. Create a density plot with series.plot(kind=”density”)

```
df["carat"].plot(kind="density", # Create density plot
figsize=(4,4), # Set figure size
xlim= (0,5)) # Limit x axis values
```

```
df["price"].plot(kind="density", # Create density plot
figsize=(4,4)) # Set figure size
```

```
correlationMatrix = df.corr().abs()
plt.subplots(figsize=(7, 4))
sns.heatmap(correlationMatrix,annot=True)
# Mask unimportant features
sns.heatmap(correlationMatrix, mask=correlationMatrix < 1, cbar=False)
plt.show()
```

The above plot show 92% correlation between price and carat. It’s up to us if we should consider this correlation % as a damaging level. Usually, correlation above 80% (subjective) is considered higher, hence, we will not forego this combination.

### Linear Regression

```
import statsmodels.api as sm
import statsmodels.formula.api as smf
print ("OLS regression model for the association between price and carat")
# reg1 is the model name , followed by equal sign.
reg1 = smf.ols(formula = 'price ~ carat', data=df).fit()
# print the result
print (reg1.summary())
```

Dep. variable is the name of the response variable

No. of observations show the valid observations

F-statistics is 3.041e+05 and Prob(F-statistics is very small. Hence we can reject the null hypothesis. )

Parameter estimates : Intercept and carat therefore

**price = -2256.3606 + 7756.4256 * carat**

p>|t| is the pearson value for explanotory variables… when 0.000 p < 0.001

R-squared value : proportion of the variance. such as 84.9 % variance in the dataset.

• Intercept – This is the βo value. It’s the prediction made by model when all the independent variables are set to zero.

• Estimate – This represents regression coefficients for respective variables. It’s the value of slope. Let’s interpret it for Chord_Length. We can say, when Chord_Length is increased by 1 unit, holding other variables constant, Sound_pressure_level decreases by a value of -35.69.

• Std. Error – This determines the level of variability associated with the estimates. Smaller the standard error of an estimate is, more accurate will be the predictions.

• t value – t statistic is generally used to determine variable significance, i.e. if a variable is significantly adding information to the model. t value > 2 suggests the variable is significant. I used it as an optional value as the same information can be extracted from the p value.

• p value – It’s the probability value of respective variables determining their significance in the model. p value < 0.05 is always desirable.

```
#Q-Q plot for normality
fig4=sm.qqplot(reg1.resid, line='r')
```

```
# simple plot of residuals
sns.residplot('carat','price', data=df)
```

Among all, Residual vs. Fitted value catches my attention. Not exactly though, but I see signs of heteroskedasticity in this data. Remember funnel shape? You can see a similar pattern. To overcome this situation, we’ll build another model with log(y).

```
df['log_price'] = np.log(1 + df.price)
df['log_carat'] = np.log(1 + df.carat)
```

```
df.head()
```

```
print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.
reg_log = smf.ols(formula = 'log_price ~ carat', data=df).fit()
# print the result
print (reg_log.summary())
```

```
#Q-Q plot for normality
fig4=sm.qqplot(reg_log.resid, line='r')
```

```
sns.residplot('carat','log_price', data=df,lowess= True, scatter_kws={"s": 5})
```

The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0:

If there is structure in the residuals, it suggests that simple linear regression is not appropriate:

But unfortunately this creates a plot of residuals vs the x values. I think in many situations, this graph will tell you the same thing as residuals versus fitted values, but it is not what I was looking for. Here is how to create it with vanilla matplotlib:

http://jeffmax.io/using-python-for-statistics-coursework.html

```
fitted = 6.2164 + 1.9688*df["carat"] # This just needs to be whatever the linear regrsssion equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1.0])
axes.axhline(color="black", ls="--") # This creates a horizon line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, df["log_price"] - fitted, marker = "o")
```

#### ANOVA for linear regression

```
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
model = ols('log_price ~ carat', data=df)
anova_lm(model.fit())
```

```
help(anova_lm)
```

One often first encounters the term “analysis of variance” when the predictor is categorical, so that you’re fitting the model y=α+βi y=α+βi where ii identifies which category is the value of the predictor. If there are kk categories, you’d get k−1k−1 degrees of freedom in the numerator in the F-statistic, and usually n−kn−k degrees of freedom in the denominator. But the distinction between regression and analysis of variance is still the same for this kind of model.

###### We have two tunnel shape in our plot. Below we select diamonds which are below the mean carat and see if the model is little better

```
df_low_carat = df[df["carat"]< 0.75]
```

```
print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.
reg_log_low = smf.ols(formula = 'log_price ~ carat', data=df_low_carat).fit()
# print the result
print (reg_log_low.summary())
```

```
#Q-Q plot for normality
fig4=sm.qqplot(reg_log_low.resid, line='r')
```

```
fitted = 5.4605 + 3.4493*df_low_carat["carat"] # This just needs to be whatever the linear regrsssion equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1.0])
axes.axhline(color="black", ls="--") # This creates a horizon line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, df_low_carat["log_price"] - fitted, marker = "o")
```