Guide for Linear Regression using Python – Part 2

Posted in Machine Learning, Predictive Analysis


This blog is a continuation of the guide for linear regression using Python from this post.

  1. There must be no correlation among the independent variables. Multicollinearity is the presence of correlation among the independent variables. If predictors are correlated, it becomes extremely difficult for the model to determine the true effect of each X on Y, or to work out which variable is actually contributing to the prediction of the response. Additionally, correlated predictors tend to inflate the standard errors; with large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters. Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are included in the model. If this happens, we may end up with the incorrect conclusion that a variable strongly (or weakly) affects the target variable.

How to check for multicollinearity:

  • Use scatter plots to visualize correlation among variables.
  • Compute the Variance Inflation Factor (VIF). A VIF <= 4 suggests no serious multicollinearity, whereas a VIF >= 10 implies serious multicollinearity (a quick check is sketched after the steps below).
  • Inspect a correlation table / matrix.

Steps: to overcome multicollinearity, use a correlation matrix to identify correlated variables. Say variables A and B are highly correlated. Rather than arbitrarily removing one of them, compute the average correlation of A and of B with the rest of the variables, and remove whichever has the higher average correlation. Alternatively, we can use penalized regression methods such as lasso, ridge, or elastic net.
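As a rough illustration (not part of the original post), the VIF for each predictor can be computed with statsmodels. Here X is assumed to be a DataFrame of numeric predictors, e.g. the numeric columns of the diamonds data used later in this post:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """Return the variance inflation factor for each predictor in the DataFrame X."""
    X = sm.add_constant(X)  # include an intercept so the VIFs are meaningful
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.DataFrame({"feature": X.columns, "VIF": vifs}).query("feature != 'const'")

# Hypothetical call: vif_table(train[['carat', 'depth', 'table', 'x', 'y', 'z']])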

  2. Error terms must be uncorrelated. The error at time t (εt) must not predict the error at time t+1 (εt+1). The presence of correlation in the error terms is known as autocorrelation. It drastically affects the regression coefficients and standard errors: it causes confidence intervals and prediction intervals to be narrower than they should be. A narrower interval means that a nominal 95% confidence interval has less than a 0.95 probability of containing the actual value of the coefficient. Likewise, understated standard errors make the associated p-values lower than they should be, which can lead us to incorrectly conclude that a parameter is statistically significant. This drastically reduces the model's reliability. Autocorrelation usually occurs in time series models, where the next instant depends on the previous one.

How to check for autocorrelation:

  • Look at the Durbin–Watson (DW) statistic, which lies between 0 and 4. DW ≈ 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation (a quick check is sketched below).
  • Look for seasonal or correlated patterns in the residuals in a residuals vs. time plot.
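A minimal check with statsmodels (not in the original notebook), assuming model is a fitted OLS result such as the reg1 model fitted later in this post:

from statsmodels.stats.stattools import durbin_watson

# model is assumed to be a fitted statsmodels OLS result, e.g. reg1 further down
dw = durbin_watson(model.resid)
print("Durbin-Watson statistic: {:.3f}".format(dw))
# ~2 suggests no autocorrelation; below 2 positive, above 2 negative autocorrelation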
  3. The dependent variable and the error terms must follow a normal distribution. A non-normal distribution suggests that there are a few unusual data points which must be studied closely to build a better model.

Non-normally distributed error terms lead to confidence intervals that are either too wide or too narrow. Such unstable intervals make it difficult to estimate coefficients reliably based on the minimization of least squares.

How to check for normal distribution:

  • Look at the normal Q-Q plot.
  • Perform statistical tests of normality such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test (a quick check is sketched below).
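A hedged sketch of both tests using scipy, assuming resid holds the residuals of a fitted model (e.g. reg1.resid from the model fitted later in this post):

from scipy import stats

# resid is assumed to be a 1-D array or Series of residuals from a fitted model
shapiro_stat, shapiro_p = stats.shapiro(resid)  # note: the p-value is approximate for very large samples
ks_stat, ks_p = stats.kstest(resid, 'norm', args=(resid.mean(), resid.std()))

print("Shapiro-Wilk p-value: {:.4f}".format(shapiro_p))
print("Kolmogorov-Smirnov p-value: {:.4f}".format(ks_p))
# Small p-values indicate the residuals deviate from a normal distribution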

Normality Q-Q Plot

A q-q (quantile-quantile) plot is a scatter plot that helps us validate the assumption of normally distributed errors. It plots the standardized residuals against theoretical normal quantiles. Ideally, this plot should show a straight line; a curved, distorted line suggests the residuals have a non-normal distribution.

Normality Q-Q plot

Image for q-q plot : https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

Steps: if the errors are not normally distributed, the usual remedy is a nonlinear transformation (for example a log transformation) of the response and/or the predictor variables.

Interpretation of Residuals vs Leverage Plot

The Residuals vs Leverage plot is also known as the Cook's distance plot. Cook's distance attempts to identify points which have more influence than other points. Such influential points tend to have a sizable impact on the regression line. Points marked with a large Cook's distance may require further investigation.
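With statsmodels, Cook's distances and a residuals vs leverage (influence) plot can be obtained from a fitted OLS result; a minimal sketch, where model stands for a fitted result such as reg1 below:

import pandas as pd
import statsmodels.api as sm

influence = model.get_influence()        # model is a fitted statsmodels OLS result
cooks_d, _ = influence.cooks_distance    # Cook's distance for every observation
print(pd.Series(cooks_d).nlargest(5))    # the most influential observations

sm.graphics.influence_plot(model)        # residuals vs leverage; can be slow for large datasets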

Residual Vs Leverage

Image: https://www.researchgate.net/figure/254457803_fig5_Figure-5-Residuals-vs-Leverage-plot-for-DB-GLM-with-Poisson-response-and-Logarithmic

Steps: influential observations are essentially outliers, and one way to deal with them is to remove those rows. Alternatively, we can scale the outlier observations down (e.g. cap them at a maximum value in the data) or else treat those values as missing. Treating data points as outliers requires good domain knowledge.
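One hedged way to implement the "cap or treat as missing" idea in pandas (the column name 'y' is just an example from the diamonds data used later in this post):

import numpy as np

upper = train['y'].quantile(0.99)                  # 99th percentile as a cap
train['y_capped'] = train['y'].clip(upper=upper)   # scale extreme values down to the cap

# Alternative: treat the extreme values as missing and impute them afterwards
# train.loc[train['y'] > upper, 'y'] = np.nan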

How to know if the regression model is the best fit for your data?

The metrics used to determine model fit can have different values based on the type of data. Hence, we need to be extremely careful while interpreting regression analysis.

Following are some metrics you can use to evaluate your regression model:

  1. p-values are very important because we can consider a model to be statistically significant only when the p-values are less than the pre-determined significance level, which is typically 0.05. In linear regression, the null hypothesis is that the coefficients associated with the variables are equal to zero. The alternative hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).
  2. t-value: Pr(>|t|) or p-value is the probability of getting a t-value as high or higher than the observed value when the null hypothesis is true. If Pr(>|t|) is low, the coefficient is significant (significantly different from zero) and vice versa.

It is important for the model to be statistically significant before we go ahead and use it to predict the dependent variable; otherwise, confidence in the predicted values is reduced and they may be construed as the result of chance.

  3. R-squared (coefficient of determination): the proportion of variation in the dependent (response) variable that has been explained by the covariates in the model. It ranges between 0 and 1. Usually higher values are desirable, but this depends on data quality and domain. We don't necessarily discard a model based on a low R-squared value; it is better practice to also look at the AIC and the prediction accuracy on a validation sample when judging the efficacy of a model.
  4. Adjusted R²: penalizes R² for the number of predictors in the model. Unlike R², it does not increase simply because a variable was added; it increases only when the newly added variable is truly useful. Therefore, when comparing nested models, it is good practice to look at the Adjusted R² value rather than R-squared.
  5. F-statistic: the ratio of the variance explained by the model to the unexplained variance. It compares the full model against an intercept-only model to evaluate the overall significance of the model. Its value can range from zero to an arbitrarily large number.
  6. AIC and BIC: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.
  7. Mallows' Cp: a criterion for assessing fits when models with different numbers of parameters are being compared.
  8. RMSE / MSE / MAE: crucial error metrics (a small sketch computing them follows after this list).
  • MSE: mean squared error. It amplifies the impact of outliers on the model's accuracy.
  • MAE: mean absolute error. It is robust against the effect of outliers.
  • RMSE: root mean squared error. It is interpreted as how far, on average, the residuals are from zero. Taking the square root undoes the squaring in MSE and expresses the result in the original units of the data.
  9. MinMax accuracy: defined as mean(min(actual, predicted) / max(actual, predicted)). It tells how far off the model's predictions are. For a perfect model this measure is 1.0; the lower the measure, the worse the model, based on out-of-sample performance.
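A short sketch (not from the original post) computing these metrics with numpy, given arrays of actual and predicted values:

import numpy as np

def regression_metrics(actual, predicted):
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mse = np.mean(errors ** 2)                      # penalizes large errors heavily
    mae = np.mean(np.abs(errors))                   # robust to outliers
    rmse = np.sqrt(mse)                             # back in the original units of the data
    minmax = np.mean(np.minimum(actual, predicted) / np.maximum(actual, predicted))  # 1.0 = perfect
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "MinMax accuracy": minmax}

# Hypothetical usage with the model fitted later: regression_metrics(df['price'], reg1.predict(df))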

 

STATISTIC                    CRITERION
R-Squared                    Higher the better (> 0.70)
Adj R-Squared                Higher the better
F-Statistic                  Higher the better
Std. Error                   Closer to zero the better
t-statistic                  Should be greater than 1.96 for the p-value to be less than 0.05
AIC                          Lower the better
BIC                          Lower the better
Mallows' Cp                  Should be close to the number of predictors in the model
MAE                          Lower the better
MSE (Mean squared error)     Lower the better
MinMax Accuracy              Higher the better

Example Problem

Let’s use our theoretical knowledge and build a model in practice. For this analysis, we will use the diamonds dataset. This dataset can be downloaded from here.

Let’s load the data set and do initial data analysis.

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
warnings.filterwarnings('ignore') 

# Set some parameters to get good visuals - ggplot style and a small default figure size
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (4, 3)
In [2]:
#load the file and set the first column as the index
train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Assignment 1\diamonds.csv",index_col = 0)
In [3]:
train.head()
Out[3]:
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
In [4]:
print ("\n\n---------------------")
print ("TRAIN SET INFORMATION")
print ("---------------------")
print ("Shape of training set:", train.shape, "\n")
print ("Column Headers:", list(train.columns.values), "\n")
print (train.dtypes)
---------------------
TRAIN SET INFORMATION
---------------------
Shape of training set: (53940, 10) 

Column Headers: ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z'] 

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object

So there are 53,940 observations and 10 different variables.

The variables are: ‘carat’, ‘cut’, ‘color’, ‘clarity’, ‘depth’, ‘table’, ‘price’, ‘x’, ‘y’, ‘z’

In [5]:
import re
missing_values = []
nonumeric_values = []

print ("TRAINING SET INFORMATION")
print ("========================\n")

for column in train:
    # Find all the unique feature values
    uniq = train[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    if (uniq.size > 10):
        print("~~Listing up to 10 unique values~~")
    print (uniq[0:10])
    print ("\n-----------------------------------------------------------------------\n")
    
    # Find features with missing values
    if (True in pd.isnull(uniq)):
        s = "{} has {} missing" .format(column, pd.isnull(train[column]).sum())
        missing_values.append(s)
    
    # Find features with non-numeric values
    for i in range (1, np.prod(uniq.shape)):
        if (re.match('nan', str(uniq[i]))):
            break
        if not (re.search('(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(uniq[i]))):
            nonumeric_values.append(column)
            break
  
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
TRAINING SET INFORMATION
========================

'carat' has 273 unique values
~~Listing up to 10 unique values~~
[ 0.23  0.21  0.29  0.31  0.24  0.26  0.22  0.3   0.2   0.32]

-----------------------------------------------------------------------

'cut' has 5 unique values
['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

-----------------------------------------------------------------------

'color' has 7 unique values
['E' 'I' 'J' 'H' 'F' 'G' 'D']

-----------------------------------------------------------------------

'clarity' has 8 unique values
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']

-----------------------------------------------------------------------

'depth' has 184 unique values
~~Listing up to 10 unique values~~
[ 61.5  59.8  56.9  62.4  63.3  62.8  62.3  61.9  65.1  59.4]

-----------------------------------------------------------------------

'table' has 127 unique values
~~Listing up to 10 unique values~~
[ 55.  61.  65.  58.  57.  56.  54.  62.  59.  63.]

-----------------------------------------------------------------------

'price' has 11602 unique values
~~Listing up to 10 unique values~~
[326 327 334 335 336 337 338 339 340 342]

-----------------------------------------------------------------------

'x' has 554 unique values
~~Listing up to 10 unique values~~
[ 3.95  3.89  4.05  4.2   4.34  3.94  4.07  3.87  4.    4.25]

-----------------------------------------------------------------------

'y' has 552 unique values
~~Listing up to 10 unique values~~
[ 3.98  3.84  4.07  4.23  4.35  3.96  4.11  3.78  4.05  4.28]

-----------------------------------------------------------------------

'z' has 375 unique values
~~Listing up to 10 unique values~~
[ 2.43  2.31  2.63  2.75  2.48  2.47  2.53  2.49  2.39  2.73]

-----------------------------------------------------------------------


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Features with missing values:
[]


Features with non-numeric values:
['cut', 'color', 'clarity']

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In [6]:
# Check for duplicate index or rows
idsUnique = len(set(train.index))
idsTotal = train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
There are 0 duplicate IDs for 53940 total entries
Descriptive Statistics
In [7]:
#get summary of numerical variables
train.describe()
Out[7]:
carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000

We can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure. It looks like price is skewed.

In [8]:
# Skewness of the distribution

print(train.skew())

# Values close to 0 show less skew
# The attributes are not skewed too much. Some algos may benefit if skew is corrected
carat    1.116646
depth   -0.082294
table    0.796896
price    1.618395
x        0.378676
y        2.434167
z        1.522423
dtype: float64
In [9]:
# Kurtosis of the distribution

print(train.kurt())

#Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable.
carat     1.256635
depth     5.739415
table     2.801857
price     2.177696
x        -0.618161
y        91.214557
z        47.086619
dtype: float64
Some rows have ‘x’, ‘y’ or ‘z’ equal to 0, which is not physically possible for a diamond's dimensions. Let us check how many there are and deal with these outliers (a possible fix is sketched after the count below).
In [10]:
#Check how many rows have a value of zero in the z column
train[(train['z'] == 0)].count()
Out[10]:
carat      20
cut        20
color      20
clarity    20
depth      20
table      20
price      20
x          20
y          20
z          20
dtype: int64
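A possible way to deal with them (not applied in the rest of this post) is simply to drop every row where any of the dimensions is recorded as zero:

# Keep only rows where x, y and z are all non-zero (this includes the 20 rows with z == 0)
train_nonzero = train[(train[['x', 'y', 'z']] != 0).all(axis=1)]
print(train.shape[0] - train_nonzero.shape[0], "rows dropped")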
In [11]:
#Separate carat and price
df = train[['carat','price']]
In [12]:
df.head()
Out[12]:
carat price
1 0.23 326
2 0.21 326
3 0.23 327
4 0.29 334
5 0.31 335
In [13]:
#draw a histogram and fit a kernel density estimate (KDE)
sns.distplot(df['carat'], kde = True, color = 'r', hist_kws={'alpha': 0.9})
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33c55df60>

We see immediately that the carat weights are positively skewed: most diamonds are around 1 carat or below, but there are extreme cases of larger diamonds. The plot above has fairly wide bins and there doesn’t appear to be any data beyond a carat size of 3.5. We can try to get more out of our histogram by adding some additional arguments to control the size of the bins and the limits of the x-axis:
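A possible version of that tweaked histogram (not shown in the original notebook), with narrower bins and a capped x-axis:

sns.distplot(df['carat'], bins=100, kde=False, color='r', hist_kws={'alpha': 0.9})
plt.xlim(0, 3.5)
plt.show()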

Boxplots

Boxplots are another type of univariate plot for summarizing distributions of numeric data graphically. Let’s make a boxplot of carat using the DataFrame.boxplot() method:

In [14]:
train.boxplot(column="carat")
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33c55db70>

The central box of the boxplot represents the middle 50% of the observations, the central bar is the median, and the bars at the ends of the dotted lines (whiskers) encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers. In this case, our data set has over 50,000 observations and we see many data points beyond the top whisker. We probably wouldn’t want to classify all of those points as outliers, but the handful of diamonds at 4 carats and above are definitely far outside the norm.

In [15]:
plt.rcParams['figure.figsize'] = (6.0, 3.0)
prices = pd.DataFrame({"price":train["price"], "log(price + 1)":np.log1p(train["price"])})
prices.hist(bins = 50)
Out[15]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001D33CDD9390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001D33CE8E048>]], dtype=object)

This is a long-tailed distribution, with a high concentration of observations below the US$5,000 mark. The distribution is right-skewed, with a small number of very large prices driving up the mean, while the median remains a more robust measure of the center of the distribution.

We can see that the data show some evidence of bimodality on the log scale. So there are two different kinds of diamonds available, for richer and poorer buyers.

In [16]:
plt.rcParams['figure.figsize'] = (6.0, 3.0)
carat = pd.DataFrame({"carat":train["carat"], "log(carat + 1)":np.log1p(train["carat"])})
carat.hist(bins = 50)
Out[16]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001D33CF55FD0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001D33D122BA8>]], dtype=object)
In [17]:
train.plot(kind="scatter",     # Create a scatterplot
              x="carat",          # Put carat on the x axis
              y="price",          # Put price on the y axis
              figsize=(5,5),
              ylim=(0,2000))  
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33cde30f0>

The relationship is non-linear, and its dispersion (variance) also increases as carat size increases. Even though the size/weight of a diamond is a critical factor in its price, the relationship is not linear, so running a plain linear model on the raw values would be a bad idea.

Density Plots

A density plot shows the distribution of a numeric variable as a continuous curve. It is similar to a histogram, but without discrete bins a density plot can give a better picture of the underlying shape of a distribution. Create a density plot with Series.plot(kind="density").

In [18]:
df["carat"].plot(kind="density",  # Create density plot
                      figsize=(4,4),    # Set figure size
                      xlim= (0,5))      # Limit x axis values
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33e90bc18>
In [19]:
df["price"].plot(kind="density",  # Create density plot
                      figsize=(4,4))    # Set figure size
                      
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33ea10b70>
In [20]:
correlationMatrix = df.corr().abs()

plt.subplots(figsize=(7, 4))
sns.heatmap(correlationMatrix,annot=True)

# Mask unimportant features
sns.heatmap(correlationMatrix, mask=correlationMatrix < 1, cbar=False)
plt.show()

The above plot shows a 92% correlation between price and carat. It is up to us whether to consider this level of correlation damaging. Usually, a correlation above 80% (a subjective threshold) is considered high; nevertheless, we will not forego this combination, since price is the response and carat is the single predictor here.

Linear Regression

In [21]:
import statsmodels.api as sm
import statsmodels.formula.api as smf 
print ("OLS regression model for the association between price and carat")
# reg1 is the model name , followed by equal sign. 
reg1 = smf.ols(formula = 'price ~ carat', data=df).fit()
# print the result
print (reg1.summary())
OLS regression model for the association between price and carat
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.849
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                 3.041e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:50:59   Log-Likelihood:            -4.7273e+05
No. Observations:               53940   AIC:                         9.455e+05
Df Residuals:                   53938   BIC:                         9.455e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept  -2256.3606     13.055   -172.830      0.000     -2281.949 -2230.772
carat       7756.4256     14.067    551.408      0.000      7728.855  7783.996
==============================================================================
Omnibus:                    14025.341   Durbin-Watson:                   0.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           153030.525
Skew:                           0.939   Prob(JB):                         0.00
Kurtosis:                      11.035   Cond. No.                         3.65
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Dep. Variable is the name of the response variable.

No. Observations shows the number of valid observations.

The F-statistic is 3.041e+05 and Prob(F-statistic) is very small, hence we can reject the null hypothesis.

Parameter estimates: from the Intercept and carat coefficients, the fitted equation is

price = -2256.3606 + 7756.4256 * carat

P>|t| is the p-value for each explanatory variable; a reported value of 0.000 means p < 0.001.

R-squared is the proportion of variance explained: the model explains 84.9% of the variance in price.

• Intercept – This is the β₀ value. It is the prediction made by the model when all the independent variables are set to zero.

• Estimate (coef) – This is the regression coefficient for the respective variable, i.e. the slope. Interpreting it for carat: when carat increases by 1 unit, holding everything else constant, the predicted price increases by about 7756.43.

• Std. Error – This reflects the level of variability associated with the estimate. The smaller the standard error of an estimate, the more precise it is.

• t value – The t statistic is generally used to determine variable significance, i.e. whether a variable adds significant information to the model. A |t| value > 2 suggests the variable is significant. I treat it as optional, since the same information can be extracted from the p-value.

• p value – The probability value for the respective variable, determining its significance in the model. A p-value < 0.05 is generally desirable. All of these quantities can also be read off the fitted model programmatically, as sketched below.
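These quantities can be pulled directly from the fitted statsmodels result rather than read off the printed summary; a small sketch using the reg1 model fitted above:

print(reg1.params)       # intercept and slope estimates
print(reg1.bse)          # standard errors of the estimates
print(reg1.tvalues)      # t statistics
print(reg1.pvalues)      # p-values
print(reg1.conf_int())   # 95% confidence intervals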

In [22]:
#Q-Q plot for normality
fig4=sm.qqplot(reg1.resid, line='r')
In [23]:
# simple plot of residuals
sns.residplot('carat','price', data=df)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33e9da8d0>

Among all the diagnostics, the residuals vs. fitted values plot catches my attention: I see signs of heteroskedasticity in this data. Remember the funnel shape? You can see a similar pattern here. To overcome this situation, we’ll build another model with log(y).

In [24]:
df['log_price'] = np.log(1 + df.price)
df['log_carat'] = np.log(1 + df.carat)
In [25]:
df.head()
Out[25]:
carat price log_price log_carat
1 0.23 326 5.789960 0.207014
2 0.21 326 5.789960 0.190620
3 0.23 327 5.793014 0.207014
4 0.29 334 5.814131 0.254642
5 0.31 335 5.817111 0.270027
In [26]:
print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.

reg_log = smf.ols(formula = 'log_price ~ carat', data=df).fit()
# print the result
print (reg_log.summary())
OLS regression model for the association between price and carat
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.847
Model:                            OLS   Adj. R-squared:                  0.847
Method:                 Least Squares   F-statistic:                 2.983e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:51:01   Log-Likelihood:                -26685.
No. Observations:               53940   AIC:                         5.337e+04
Df Residuals:                   53938   BIC:                         5.339e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      6.2164      0.003   1858.062      0.000         6.210     6.223
carat          1.9688      0.004    546.166      0.000         1.962     1.976
==============================================================================
Omnibus:                    10807.191   Durbin-Watson:                   0.977
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            71530.061
Skew:                          -0.803   Prob(JB):                         0.00
Kurtosis:                       8.408   Cond. No.                         3.65
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [27]:
#Q-Q plot for normality
fig4=sm.qqplot(reg_log.resid, line='r')
In [28]:
sns.residplot('carat','log_price', data=df,lowess= True, scatter_kws={"s": 5})
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d33d0ca550>

The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0:

If there is structure in the residuals, it suggests that simple linear regression is not appropriate:

But unfortunately this creates a plot of residuals vs. the x values. In many situations this graph will tell you the same thing as residuals versus fitted values, but it is not quite what we are looking for. Here is how to create a residuals vs. fitted plot with vanilla matplotlib:

http://jeffmax.io/using-python-for-statistics-coursework.html

In [29]:
fitted = 6.2164 + 1.9688*df["carat"] # This just needs to be whatever the fitted linear regression equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1.0]) 
axes.axhline(color="black", ls="--") # This creates a horizontal line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, df["log_price"]  - fitted, marker = "o")
Out[29]:
<matplotlib.collections.PathCollection at 0x1d33eb38320>

ANOVA for linear regression

In [30]:
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols

model = ols('log_price ~ carat', data=df)
anova_lm(model.fit())
Out[30]:
df sum_sq mean_sq F PR(>F)
carat 1.0 46977.568051 46977.568051 298297.137856 0.0
Residual 53938.0 8494.469923 0.157486 NaN NaN
In [31]:
help(anova_lm)
Help on function anova_lm in module statsmodels.stats.anova:

anova_lm(*args, **kwargs)
    ANOVA table for one or more fitted linear models.
    
    Parameters
    ----------
    args : fitted linear model results instance
        One or more fitted linear models
    scale : float
        Estimate of variance, If None, will be estimated from the largest
        model. Default is None.
    test : str {"F", "Chisq", "Cp"} or None
        Test statistics to provide. Default is "F".
    typ : str or int {"I","II","III"} or {1,2,3}
        The type of ANOVA test to perform. See notes.
    robust : {None, "hc0", "hc1", "hc2", "hc3"}
        Use heteroscedasticity-corrected coefficient covariance matrix.
        If robust covariance is desired, it is recommended to use `hc3`.
    
    Returns
    -------
    anova : DataFrame
    A DataFrame containing.
    
    Notes
    -----
    Model statistics are given in the order of args. Models must have
    been fit using the formula api.
    
    See Also
    --------
    model_results.compare_f_test, model_results.compare_lm_test
    
    Examples
    --------
    >>> import statsmodels.api as sm
    >>> from statsmodels.formula.api import ols
    
    >>> moore = sm.datasets.get_rdataset("Moore", "car",
    ...                                  cache=True) # load data
    >>> data = moore.data
    >>> data = data.rename(columns={"partner.status" :
    ...                             "partner_status"}) # make name pythonic
    >>> moore_lm = ols('conformity ~ C(fcategory, Sum)*C(partner_status, Sum)',
    ...                 data=data).fit()
    
    >>> table = sm.stats.anova_lm(moore_lm, typ=2) # Type 2 ANOVA DataFrame
    >>> print table

One often first encounters the term “analysis of variance” when the predictor is categorical, so that you’re fitting the model y = α + β_i, where i identifies which category is the value of the predictor. If there are k categories, you’d get k − 1 degrees of freedom in the numerator of the F-statistic, and usually n − k degrees of freedom in the denominator. But the distinction between regression and analysis of variance is still the same for this kind of model.

https://stats.stackexchange.com/questions/34616/difference-between-regression-analysis-and-analysis-of-variance
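To make this concrete on the diamonds data, here is a hedged sketch (not in the original notebook) fitting a categorical predictor with the formula API; ‘cut’ has 5 categories, so its row in the ANOVA table has k - 1 = 4 degrees of freedom:

# Illustrative only: ANOVA with the categorical predictor 'cut'
cut_model = ols('np.log1p(price) ~ C(cut)', data=train).fit()
print(anova_lm(cut_model))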

We have two funnel shapes in our plot. Below, we select diamonds whose carat is below 0.75 (just under the mean) and see if the model is a little better.
In [33]:
df_low_carat = df[df["carat"]< 0.75]
In [36]:
print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.

reg_log_low = smf.ols(formula = 'log_price ~ carat', data=df_low_carat).fit()
# print the result
print (reg_log_low.summary())
OLS regression model for the association between price and carat
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.817
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                 1.337e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:55:36   Log-Likelihood:                -24.378
No. Observations:               30034   AIC:                             52.76
Df Residuals:                   30032   BIC:                             69.38
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.4605      0.005   1213.174      0.000         5.452     5.469
carat          3.4493      0.009    365.709      0.000         3.431     3.468
==============================================================================
Omnibus:                       63.342   Durbin-Watson:                   0.845
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               79.446
Skew:                          -0.003   Prob(JB):                     5.60e-18
Kurtosis:                       3.252   Cond. No.                         8.16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [37]:
#Q-Q plot for normality
fig4=sm.qqplot(reg_log_low.resid, line='r')
In [45]:
fitted = 5.4605 + 3.4493*df_low_carat["carat"] # This just needs to be whatever the fitted linear regression equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1.0]) 
axes.axhline(color="black", ls="--") # This creates a horizontal line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, df_low_carat["log_price"]  - fitted, marker = "o")
Out[45]:
<matplotlib.collections.PathCollection at 0x1d342fdb518>

References

http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-to-interpret-the-constant-y-intercept

http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients

https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-guide-regression-analysis-plot-interpretations/tutorial/?utm_campaign=machine-learning-indiahacks-2017&utm_medium=email&utm_source=recommendation

http://r-statistics.co/Linear-Regression.html

https://datascienceplus.com/correlation-and-linear-regression/

http://ucanalytics.com/blogs/intuitive-machine-learning-gradient-descent-simplified/

http://dataconomy.com/2017/03/beginners-guide-machine-learning/

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

http://mathworld.wolfram.com/Covariance.html

http://www.datasciencecentral.com/profiles/blogs/predicting-car-prices-part-1-linear-regression?utm_content=buffere52d1&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer

http://www.datasciencecentral.com/profiles/blogs/20-great-blogs-posted-in-the-last-12-months

http://ataspinar.com/2016/03/28/regression-logistic-regression-and-maximum-entropy/

http://www.datasciencecentral.com/profiles/blogs/26-great-articles-and-tutorials-about-regression-analysis?utm_content=bufferfe089&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer

https://www.r-bloggers.com/regression-model-with-auto-correlated-errors-part-2-the-models/

http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

http://www.datasciencecentral.com/profiles/blogs/the-best-kept-secret-about-linear-and-logistic-regression?utm_content=buffer6aaed&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer

http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel

https://www.r-bloggers.com/the-real-meaning-of-spurious-correlations/

http://www.datasciencecentral.com/page/search?q=Regression+Analysis&cx=007889287169707156432%3A04zouenp6u0&ie=UTF-8&cof=FORID%3A11%3BNB%3A1&sa=Search

http://www.kdnuggets.com/2016/11/linear-regression-least-squares-matrix-multiplication-concise-technical-overview.html

http://stattrek.com/regression/linear-regression.aspx

http://stattrek.com/statistics/dictionary.aspx?definition=Residual%20plot

https://people.duke.edu/~rnau/compare.htm

http://aakashjapi.com/housing-prices-in-berkeley/

http://www.pyimagesearch.com/2016/10/10/gradient-descent-with-python/

http://www.kdnuggets.com/2016/08/primer-logistic-regression-part-1.html

https://realdataweb.wordpress.com/2017/05/23/ridge-regression-and-the-lasso/

https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/

http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-tutorial-and-examples
