You are currently viewing One and Only Guide to Plotting Regression Line in Python

One and Only Guide to Plotting Regression Line in Python

This article is a guide to plotting regression line in Python. It will focus on linear regression. We will learn the crucial concepts with code in python. Once finished we’ll be able to build, improve, and optimize regression models.

What is Regression?

Before we discuss how plotting regression line in python is done, we should understand what is regression.

Regression is the first algorithm we need to master if we are aspiring to become data scientists. It is one of the easiest algorithms to learn yet requires understanding and effort to get to master it.

The term “regression” was used by Francis Galton in his 1886 paper “Regression towards mediocrity in hereditary stature”. He used the term in the context of regression toward the mean.

Guide for Linear Regression using Python

Today regression has an even broader meaning. It’s any model that makes continuous predictions.

Regression is a parametric technique used to predict the value of an outcome variable Y based on one or more input predictor variables X. It is parametric in nature because it makes certain assumptions based on the data set. If the data set follows those assumptions, regression gives incredible results. Otherwise, it struggles to provide convincing accuracy.

Regression Analysis helps us to find answers to:

  • Prediction of future observations
  • Find an association, the relationship between variables.
  • Identify which variables contribute more towards predicting the future outcomes.

Types of regression problems:

Simple Linear Regression:

If the model deals with one input, called an independent or predictor variable, and one output variable, called a dependent or response variable then it is called Simple Linear Regression. This type of Linear regression assumes that there exists a linear relationship between predictor and response variable of the form.

Multiple Linear Regression:

If the problem contains more than one input variable and one response variable, then it is called Multiple Linear regression.

Plotting Regression Line

The aim of linear regression is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable. This mathematical equation can be generalized as Y = β1 + β2X + ϵ

Regression

X is the known input variable and if we can estimate β1, β2 by some method then Y can be predicted. To predict future outcomes, by using the training data we need to estimate the unknown model parameters (β1 β2 ).

  • β1 is the intercept. It is the prediction value you get when X = 0.
  • β2 is the slope. It explains the change in Y when X changes by 1 unit.
  • ϵ is the error term, the part of Y the regression model is unable to explain. ϵ represents the residual value, i.e. the difference between actual and predicted values.

Collectively, they are called regression coefficients.

The error is an inevitable part of the prediction-making process. We can’t eliminate the (ϵ) error term, but we try to reduce it to the lowest.

To do this, regression uses a technique known as Ordinary Least Square (OLS), Generalized Least Square, Percentage Least Square, Total Least Squares, Least absolute deviation, and much more. In this article, we are referring to the OLS technique when using linear/multiple regression. We will be plotting regression line in python.

OLS technique tries to reduce the sum of squared errors ∑[Actual(y) – Predicted(y’)] ² by finding the best possible value of regression coefficients (β1, β2, etc).

OLS uses squared error which has nice mathematical properties, thereby making it easier to differentiate and compute gradient descent. It is also easy to analyze and computationally faster, i.e. it can be quickly applied to data sets having 1000s of features. Further, interpretation of OLS is much easier than other regression techniques.

Graphical Analysis of variables

Before modeling regression and plotting regression line in python, we need to understand the independent variables (predictors) and dependent variables. The best way to do it through visualizing their behavior is through:

  • Scatter plot: helps to visualize any linear relationship between the dependent (response) variable and independent (predictor) variables. Ideally, we have multiple predictor variables, a scatter plot is drawn for each one of them against the response variable.
  • Box plot helps to spot any outlier observations in the variable. Having outliers in our predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.
  • Density plot: checks if the response variable is close to normality and sees the distribution of the predictor variable. Ideally, a close to a normal distribution (a bell-shaped curve), without being skewed to the left or right is preferred.
  • Correlation: is a statistical measure that suggests the level of linear dependence between two variables, that occur in a pair. It can take values between -1 to +1. Correlation between two variables will be closer to 1 if there is a high positive relationship. The opposite is true for an inverse relationship, in which case, the correlation between the variables will be close to -1. A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation of the response variable (Y) is unexplained by the predictor (X), in which case, we should probably look for better explanatory variables. Covariance provides a measure of the strength of the correlation between two or more sets of random variates.

What are the assumptions made in regression?

Regression is a parametric technique, so it makes assumptions. The presence of these assumptions makes regression quite restrictive and is conditioned on fulfillment of these assumptions. Once these assumptions get violated, regression makes biased, erratic predictions. If data violates these regression assumptions an obvious solution is to use tree-based algorithms which capture nonlinearity quite well.

We can also check performance metrics to estimate violations.

Let’s look at the assumptions and interpretations of regression plots and plotting regression line in python.

Linear and additive relationship

Exists between the response variable (Y) and predictors (X). By linear, it means that the change in Y by 1-unit change in X is constant. If we fit a linear model to a nonlinear, non-additive data set, the regression algorithm would fail to capture the trend mathematically, thus resulting in an inefficient model.

Also, this will result in erroneous predictions on an unseen data set.

By additive, it refers to the effect of X on Y is independent of other variables.

How to check for a linear and additive relationship?

Look for residual vs fitted value plots.

Residual vs. Fitted Values Plot

The Scatter plot shows the distribution of residuals (errors) vs fitted values (predicted values). It reveals various useful insights including outliers. The outliers in this plot are labeled by their observation number which makes them easy to detect.

Ideally, this plot shouldn’t show any pattern. If there exists any shape (curve, U shape), it suggests non-linearity in the data. It means that the model doesn’t capture nonlinear effects.  But if you see nonlinearity in the data set. In addition, If a funnel shape is evident in the plot, consider it as the sign of non-constant variance i.e. heteroskedasticity.

Residuals Vs Fitted

Image for residuals Vs fitted value

Steps: to overcome the issue of nonlinearity, we can do a nonlinear transformation of predictors such as log (X), √X, or X² transform the dependent variable.

If your data is suffering from nonlinearity, transform the dependent variables using sqrt, log, square, etc. Also, we can include polynomial terms (X, X², X³) in your model to capture the nonlinear effect.

Error terms must possess constant variance

The absence of constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in presence of outliers or extreme leverage values.

These values get too much weight, hence disproportionately influences the model’s performance. When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.

How to check for heteroscedasticity?

  • Look at the residual vs fitted values plot. If heteroscedasticity exists, the plot would exhibit a funnel shape pattern.
  • Use Breusch-Pagan / Cook – Weisberg test or White general test to detect this phenomenon. If we find p < 0.05, we reject the null hypothesis and infer that heteroscedasticity is present.
  • Scale Location Plot
Scale Location Plot

This plot shows how the residuals are spread along with the range of predictors. It’s similar to the residual vs fitted value plot except it uses standardized residual values. We need to corroborate the findings of this plot with the funnel shape in residual vs. fitted values.

Ideally, this plot shouldn’t show any pattern. If there is no presence of a pattern errors are normally distributed. However, if the plot shows any discernible pattern (probably a funnel shape), it would imply a non-normal distribution of errors.

Scale Location Plot

Image for Scale Location Plot

Steps: to overcome the issue of heteroskedasticity, transform the response variable(Y)  using sqrt, log, square, etc. Also, we can use the weighted least square method to tackle this problem.

There must be no correlation among independent variables

Multicollinearity is the presence of correlation in independent variables. If variables are correlated, it becomes extremely difficult for the model to determine the true effect of X on Y or find out which variable is contributing to predict the response variable.

Additionally, the presence of correlated predictors tends to increase the standard errors. With large standard errors, the confidence interval becomes wider leading to less precise estimates of slope parameters.

Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are available in the model. If this happens, we’ll end up with an incorrect conclusion that a variable strongly / weakly affects the target variable.

How to check for multicollinearity?

  • Use scatter plot to visualize correlation effect among variables.
  • Use VIF factor. VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity.
  • Correlation table

Steps: to overcome the issue of multicollinearity, use a correlation matrix to check correlated variables. Let’s say variables A and B are highly correlated. Now, instead of removing one of them, use this approach: Find the average correlation of A and B with the rest of the variables. Whichever variable has the higher average in comparison with other variables, remove it. Alternatively, we can use penalized regression methods such as lasso, ridge, elastic net, etc.

Error terms must be uncorrelated

Error at ∈t must not indicate the at error at ∈t+1. The presence of correlation in error terms is known as Autocorrelation. It drastically affects the regression coefficients and standard error values. It causes confidence intervals and prediction intervals to be narrower.

A narrower confidence interval means that a 95% confidence interval would have a lesser probability than 0.95 that it would contain the actual value of coefficients.

Also, lower standard errors would cause the associated p-values to be lower than actual. This will make us incorrectly conclude a parameter to be statistically significant. It drastically reduces the model’s accuracy. This usually occurs in time series models where the next instant is dependent on the previous instant.

How to check for autocorrelation?

  • Look for Durbin – Watson (DW) statistic. It must lie between 0 and 4. If DW = 2, implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation while 2 < DW < 4 indicates negative autocorrelation.
  • Look for the seasonal or correlated pattern in residual values in the residual vs time plot

The dependent variable and the error terms must possess a normal distribution

The presence of non – normal distribution suggests that there are a few unusual data points that must be studied closely to make a better model.

The presence of non-normally distributed error terms leads to either too wide or narrow confidence intervals. Unstable confidence level leads to difficulty in estimating coefficients based on the minimization of least squares.

How to check for normal distribution?

  • Look at the Normality QQ plot
  • Perform statistical tests of normality such as the Kolmogorov-Smirnov test, Shapiro-Wilk test.
Normality Q-Q Plot

q-q or quantile-quantile is a scatter plot that helps us validate the assumption of normal distribution. It uses standardized values of residuals to determine the normal distribution of errors. Ideally, this plot should show a straight line. A curved, distorted line suggests residuals have a non-normal distribution.

Normality Q-Q plot

Image for q-q plot

Steps: to overcome the issue if the errors are not normally distributed is through the nonlinear transformation of the both response or predictors variables.

Interpretation of Residuals vs Leverage Plot

Residuals vs Leverage Plot is also known as Cook’s Distance plot. Cook’s distance attempts to identify the points which have more influence than other points. Such influential points tend to have a sizable impact on the regression line. The large values marked by Cook’s distance might require further investigation.

Residual Vs Leverage

Image

Steps: to solve the issue for influential observations which are nothing but outliers is to remove those rows. Alternatively, we can scale down the outlier observation with the maximum value in data or else treat those values as missing values. The treatment of data points as outliers require a good domain knowledge

How to know if the regression model is the best fit for your data?

The metrics used to determine model fit can have different values based on the type of data. Hence, we need to be extremely careful while interpreting regression analysis.

Following are some metrics along with plotting regression line in python to evaluate your regression model:

  1. p-Value is very important because we can consider a model to be statistically significant only when p-Values are less than the pre-determined statistical significance level, which is ideally 0.05. In linear regression, the Null Hypothesis is that the coefficients associated with the variables are equal to zero. The alternate hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).
  2. t-value: Pr(>|t|) or p-value is the probability that you get a t-value as high or higher than the observed value when the Null Hypothesis is true. If the Pr(>|t|) is low, the coefficients are significant (significantly different from zero) and vice versa.

It is important for the model to be statistically significant before we can go ahead and use it to predict the dependent variable, otherwise, the confidence in predicted values from that model reduces and may be construed as an event of chance.

  1. R Square (Coefficient of Determination): explains the percentage of variance explained by covariates in the model. It tells us the proportion of variation in the dependent (response) variable that has been explained by the model. It ranges between 0 and 1. Usually, higher values are desirable but it rests on the data quality and domain. We don’t necessarily discard a model based on a low R-Squared value. It is a better practice to look at the AIC and prediction accuracy on validation samples when deciding on the efficacy of a model.
  2. Adjusted R²: penalizes total value for the number of predictors in the model. It doesn’t increase, stays the same, or decrease unless the newly added variable is truly useful. Therefore, when comparing nested models, it is a good practice to look at the Adjusted R² value over R-squared.
  3. F Statistics: is the ratio of explained variance of the model by unexplained variance. It compares the full model with an intercept to evaluates the overall significance of the model. Its value can range between zero and any arbitrarily large number.
  4. AIC and BIC: The Akaike’s information criterion – AIC and the Bayesian information criterion – BIC are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.
  5. Mallow’s Cp: is defined as a criterion to assess fits when models with different numbers of parameters are being compared.
  6. RMSE / MSE / MAE: are crucial error metric. MSE is mean squared error. It amplifies the impact of outliers on the model’s accuracy. MAE is mean absolute error. It is robust against the effect of outliers. RMSE is the root mean square error. It is interpreted as how far on average; the residuals are from zero. It nullifies the squared effect of MSE by square root and provides the result in original units as data.
  7. MinMax Accuracy: is defined as mean (min(actual, predicted)/max(actual, predicted)). It tells how far the model’s prediction is off. For a perfect model, this measure is 1.0. The lower the measure, the worse the model, based on out-of-sample performance.

STATISTIC

CRITERION

R-SquaredHigher the better (> 0.70)
Adj R-SquaredHigher the better
F-StatisticHigher the better
Std. ErrorCloser to zero the better
t-statisticShould be greater than 1.96 for p-value to be less than 0.05
AICLower the better
BICLower the better
Mallows CpShould be close to the number of predictors in the model
MAELower the better
MSE (Mean squared error)Lower the better
MinMax AccuracyHigher the better

Example Problem for Plotting Regression Line in Python

Let’s use our theoretical knowledge and create a model practically. We will also be plotting regression line in python.

For this analysis, we will use the diamond dataset. This dataset can be downloaded from here.

Let’s load the data set and do the initial data analysis.

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
warnings.filterwarnings('ignore') 

# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (4, 3)
#load the file and set the first column as the index
train = pd.read_csv(r"C:UserspiushDesktopDatasetAssignment 1diamonds.csv",index_col = 0)
train.head()
caratcutcolorclaritydepthtablepricexyz
10.23IdealESI261.555.03263.953.982.43
20.21PremiumESI159.861.03263.893.842.31
30.23GoodEVS156.965.03274.054.072.31
40.29PremiumIVS262.458.03344.204.232.63
50.31GoodJSI263.358.03354.344.352.75
print ("nn---------------------")
print ("TRAIN SET INFORMATION")
print ("---------------------")
print ("Shape of training set:", train.shape, "n")
print ("Column Headers:", list(train.columns.values), "n")
print (train.dtypes)
---------------------
TRAIN SET INFORMATION
---------------------
Shape of training set: (53940, 10) 

Column Headers: ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z'] 

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object

So there are 53940 observations and 10 different variables.

The different kind of variables are : ‘carat’, ‘cut’, ‘color’, ‘clarity’, ‘depth’, ‘table’, ‘price’, ‘x’, ‘y’, ‘z’

import re
missing_values = []
nonumeric_values = []

print ("TRAINING SET INFORMATION")
print ("========================n")

for column in train:
    # Find all the unique feature values
    uniq = train[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    if (uniq.size > 10):
        print("~~Listing up to 10 unique values~~")
    print (uniq[0:10])
    print ("n-----------------------------------------------------------------------n")
    
    # Find features with missing values
    if (True in pd.isnull(uniq)):
        s = "{} has {} missing" .format(column, pd.isnull(train[column]).sum())
        missing_values.append(s)
    
    # Find features with non-numeric values
    for i in range (1, np.prod(uniq.shape)):
        if (re.match('nan', str(uniq[i]))):
            break
        if not (re.search('(^d+.?d*$)|(^d*.?d+$)', str(uniq[i]))):
            nonumeric_values.append(column)
            break
  
print ("n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~n")
print ("Features with missing values:n{}nn" .format(missing_values))
print ("Features with non-numeric values:n{}" .format(nonumeric_values))
print ("n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~n")
TRAINING SET INFORMATION
========================

'carat' has 273 unique values
~~Listing up to 10 unique values~~
[ 0.23  0.21  0.29  0.31  0.24  0.26  0.22  0.3   0.2   0.32]

-----------------------------------------------------------------------

'cut' has 5 unique values
['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

-----------------------------------------------------------------------

'color' has 7 unique values
['E' 'I' 'J' 'H' 'F' 'G' 'D']

-----------------------------------------------------------------------

'clarity' has 8 unique values
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']

-----------------------------------------------------------------------

'depth' has 184 unique values
~~Listing up to 10 unique values~~
[ 61.5  59.8  56.9  62.4  63.3  62.8  62.3  61.9  65.1  59.4]

-----------------------------------------------------------------------

'table' has 127 unique values
~~Listing up to 10 unique values~~
[ 55.  61.  65.  58.  57.  56.  54.  62.  59.  63.]

-----------------------------------------------------------------------

'price' has 11602 unique values
~~Listing up to 10 unique values~~
[326 327 334 335 336 337 338 339 340 342]

-----------------------------------------------------------------------

'x' has 554 unique values
~~Listing up to 10 unique values~~
[ 3.95  3.89  4.05  4.2   4.34  3.94  4.07  3.87  4.    4.25]

-----------------------------------------------------------------------

'y' has 552 unique values
~~Listing up to 10 unique values~~
[ 3.98  3.84  4.07  4.23  4.35  3.96  4.11  3.78  4.05  4.28]

-----------------------------------------------------------------------

'z' has 375 unique values
~~Listing up to 10 unique values~~
[ 2.43  2.31  2.63  2.75  2.48  2.47  2.53  2.49  2.39  2.73]

-----------------------------------------------------------------------


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Features with missing values:
[]


Features with non-numeric values:
['cut', 'color', 'clarity']

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Check for duplicate index or rows
idsUnique = len(set(train.index))
idsTotal = train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
There are 0 duplicate IDs for 53940 total entries
Descriptive Statistics
#get summary of numerical variables
train.describe()
caratdepthtablepricexyz
count53940.00000053940.00000053940.00000053940.00000053940.00000053940.00000053940.000000
mean0.79794061.74940557.4571843932.7997225.7311575.7345263.538734
std0.4740111.4326212.2344913989.4397381.1217611.1421350.705699
min0.20000043.00000043.000000326.0000000.0000000.0000000.000000
25%0.40000061.00000056.000000950.0000004.7100004.7200002.910000
50%0.70000061.80000057.0000002401.0000005.7000005.7100003.530000
75%1.04000062.50000059.0000005324.2500006.5400006.5400004.040000
max5.01000079.00000095.00000018823.00000010.74000058.90000031.800000

We can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure. It looks that Price is skewed.

# Skewness of the distribution

print(train.skew())

# Values close to 0 show less skew
# The attributes are not skewed too much. Some algos may benefit if skew is corrected
carat    1.116646
depth   -0.082294
table    0.796896
price    1.618395
x        0.378676
y        2.434167
z        1.522423
dtype: float64
# Skewness of the distribution

print(train.kurt())

#Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable.
carat     1.256635
depth     5.739415
table     2.801857
price     2.177696
x        -0.618161
y        91.214557
z        47.086619
dtype: float64
We have ‘x’,’y’,’z’ as 0. let us check how many are there and deal with the outliers
#Check when a value in a cloumn is zero
train[(train['z'] == 0)].count()
carat      20
cut        20
color      20
clarity    20
depth      20
table      20
price      20
x          20
y          20
z          20
dtype: int64
#Separate carat and price
df = train[['carat','price']]
df.head()
caratprice
10.23326
20.21326
30.23327
40.29334
50.31335
#draw a histogram and not fit a kernel density estimate (KDE).
sns.distplot(df['carat'], kde = True, color = 'r', hist_kws={'alpha': 0.9})

We see immediately that the carat weights are positively skewed: most diamonds are around 1 carat or below but there are extreme cases of larger diamonds. The plot above has fairly wide bins and there doesn’t appear to be any data beyond a carat size of 3.5.

We can make try to get more out of hour histogram by adding some additional arguments to control the size of the bins and limits of the x-axis.

Boxplots

Boxplots are another type of univariate plot for summarizing distributions of numeric data graphically. Let’s make a boxplot of carat using the pd.boxplot() function:

train.boxplot(column="carat")

The central box of the boxplot represents the middle 50% of the observations, the central bar is the median and the bars at the end of the dotted lines (whiskers) encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers. In this case, our data set has over 50,000 observations and we see many data points beyond the top whisker. We probably wouldn’t want to classify all of those points as outliers, but the handful of diamonds at 4 carats and above are definitely far outside the norm.

plt.rcParams['figure.figsize'] = (6.0, 3.0)
prices = pd.DataFrame({"price":train["price"], "log(price + 1)":np.log1p(train["price"])})
prices.hist(bins = 50)

This is a long-tail distribution, with a high concentration of observations below the U$5,000 mark. The distribution is right-skewed with small amounts of very large prices driving up the mean, while the median remains a more robust measure of the center of the distribution.

We can see that the data show some evidence of bimodality on the scale of log10 scale. So there are two different kinds of diamonds available for richer and poorer buyers.

plt.rcParams['figure.figsize'] = (6.0, 3.0)
carat = pd.DataFrame({"carat":train["carat"], "log(carat + 1)":np.log1p(train["carat"])})
carat.hist(bins = 50)
train.plot(kind="scatter",     # Create a scatterplot
              x="carat",          # Put carat on the x axis
              y="price",          # Put price on the y axis
              figsize=(5,5),
              ylim=(0,2000))

It is a non-linear relationship and the dispersion(variance) of the relationship also increases as carat size increases. Even though the size/weight of the diamond set is a critical factor, the relationship is non-linear. Therefore, running a linear model will be a bad idea.

Density Plots

A density plot shows the distribution of a numeric variable with a continuous curve. It is similar to a histogram but without discrete bins, a density plot gives a better picture of the underlying shape of a distribution. Create a density plot with series.plot(kind=”density”)

df["carat"].plot(kind="density",  # Create density plot
                      figsize=(4,4),    # Set figure size
                      xlim= (0,5))      # Limit x axis values
df["price"].plot(kind="density",  # Create density plot
                      figsize=(4,4))    # Set figure size
correlationMatrix = df.corr().abs()

plt.subplots(figsize=(7, 4))
sns.heatmap(correlationMatrix,annot=True)

# Mask unimportant features
sns.heatmap(correlationMatrix, mask=correlationMatrix < 1, cbar=False)
plt.show()

The above plot shows a 92% correlation between price and carat. It’s up to us if we should consider this correlation % as a damaging level. Usually, correlation above 80% (subjective) is considered higher, hence, we will not forego this combination.

Plotting Regression Line in Python – 1 Dependent Variable

import statsmodels.api as sm
import statsmodels.formula.api as smf 
print ("OLS regression model for the association between price and carat")
# reg1 is the model name , followed by equal sign. 
reg1 = smf.ols(formula = 'price ~ carat', data=df).fit()
# print the result
print (reg1.summary())
OLS regression model for the association between price and carat
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.849
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                 3.041e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:50:59   Log-Likelihood:            -4.7273e+05
No. Observations:               53940   AIC:                         9.455e+05
Df Residuals:                   53938   BIC:                         9.455e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept  -2256.3606     13.055   -172.830      0.000     -2281.949 -2230.772
carat       7756.4256     14.067    551.408      0.000      7728.855  7783.996
==============================================================================
Omnibus:                    14025.341   Durbin-Watson:                   0.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           153030.525
Skew:                           0.939   Prob(JB):                         0.00
Kurtosis:                      11.035   Cond. No.                         3.65
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  • Dep. variable is the name of the response variable
  • No. of observations show the valid observations
  • F-statistics is 3.041e+05 and Prob(F-statistics is very small. Hence we can reject the null hypothesis. )
  • Parameter estimates : Intercept and carat therefore
  • price = -2256.3606 + 7756.4256 * carat
  • p>|t| is the Pearson value for explanatory variables… when 0.000 p < 0.001
  • R-squared value: proportion of the variance. such as 84.9 % variance in the dataset.
  • Intercept – This is the βo value. It’s the prediction made by the model when all the independent variables are set to zero.
  • Estimate – This represents regression coefficients for respective variables. It’s the value of the slope. Let’s interpret it for Chord_Length. We can say, when Chord_Length is increased by 1 unit, holding other variables constant, Sound_pressure_level decreases by a value of -35.69.
  • Std. Error – This determines the level of variability associated with the estimates. The smaller the standard error of an estimate is, the more accurate will be the predictions.
  • t value – t statistic is generally used to determine variable significance, i.e. if a variable is significantly adding information to the model. t value > 2 suggests the variable is significant. I used it as an optional value as the same information can be extracted from the p-value.
  • p-value – It’s the probability value of respective variables determining their significance in the model. p-value < 0.05 is always desirable.
#Q-Q plot for normality
fig4=sm.qqplot(reg1.resid, line='r')
# simple plot of residuals
sns.residplot('carat','price', data=df)

Among all, Residual vs. Fitted value catches our attention. Not exactly though, but we see signs of heteroskedasticity in this data.

Remember funnel shape?  We can see a similar pattern. To overcome this situation, we’ll build another model with log(y).

Plotting Regression Line in Python – Log of Target Variable

df['log_price'] = np.log(1 + df.price)
df['log_carat'] = np.log(1 + df.carat)
df.head()
caratpricelog_pricelog_carat
10.233265.7899600.207014
20.213265.7899600.190620
30.233275.7930140.207014
40.293345.8141310.254642
50.313355.8171110.270027
print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.

reg_log = smf.ols(formula = 'log_price ~ carat', data=df).fit()
# print the result
print (reg_log.summary())
OLS regression model for the association between price and carat
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.847
Model:                            OLS   Adj. R-squared:                  0.847
Method:                 Least Squares   F-statistic:                 2.983e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:51:01   Log-Likelihood:                -26685.
No. Observations:               53940   AIC:                         5.337e+04
Df Residuals:                   53938   BIC:                         5.339e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      6.2164      0.003   1858.062      0.000         6.210     6.223
carat          1.9688      0.004    546.166      0.000         1.962     1.976
==============================================================================
Omnibus:                    10807.191   Durbin-Watson:                   0.977
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            71530.061
Skew:                          -0.803   Prob(JB):                         0.00
Kurtosis:                       8.408   Cond. No.                         3.65
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#Q-Q plot for normality
fig4=sm.qqplot(reg_log.resid, line='r')
sns.residplot('carat','log_price', data=df,lowess= True, scatter_kws={"s": 5})

The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0:

If there is a structure in the residuals, it suggests that simple linear regression is not appropriate:

But unfortunately, this creates a plot of residuals vs the x values. I think in many situations, this graph will tell you the same thing as residuals versus fitted values, but it is not what I was looking for.

Here is how to create it with vanilla matplotlib:

fitted = 6.2164 + 1.9688*df["carat"] # This just needs to be whatever the linear regrsssion equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1.0]) 
axes.axhline(color="black", ls="--") # This creates a horizon line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, df["log_price"]  - fitted, marker = "o")
<matplotlib.collections.PathCollection at 0x1d33eb38320>

ANOVA for linear regression

from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols

model = ols('log_price ~ carat', data=df)
anova_lm(model.fit())
dfsum_sqmean_sqFPR(>F)
carat1.046977.56805146977.568051298297.1378560.0
Residual53938.08494.4699230.157486NaNNaN
help(anova_lm)
Help on function anova_lm in module statsmodels.stats.anova:

anova_lm(*args, **kwargs)
    ANOVA table for one or more fitted linear models.
    
    Parameters
    ----------
    args : fitted linear model results instance
        One or more fitted linear models
    scale : float
        Estimate of variance, If None, will be estimated from the largest
        model. Default is None.
    test : str {"F", "Chisq", "Cp"} or None
        Test statistics to provide. Default is "F".
    typ : str or int {"I","II","III"} or {1,2,3}
        The type of ANOVA test to perform. See notes.
    robust : {None, "hc0", "hc1", "hc2", "hc3"}
        Use heteroscedasticity-corrected coefficient covariance matrix.
        If robust covariance is desired, it is recommended to use `hc3`.
    
    Returns
    -------
    anova : DataFrame
    A DataFrame containing.
    
    Notes
    -----
    Model statistics are given in the order of args. Models must have
    been fit using the formula api.
    
    See Also
    --------
    model_results.compare_f_test, model_results.compare_lm_test
    
    Examples
    --------
    >>> import statsmodels.api as sm
    >>> from statsmodels.formula.api import ols
    
    >>> moore = sm.datasets.get_rdataset("Moore", "car",
    ...                                  cache=True) # load data
    >>> data = moore.data
    >>> data = data.rename(columns={"partner.status" :
    ...                             "partner_status"}) # make name pythonic
    >>> moore_lm = ols('conformity ~ C(fcategory, Sum)*C(partner_status, Sum)',
    ...                 data=data).fit()
    
    >>> table = sm.stats.anova_lm(moore_lm, typ=2) # Type 2 ANOVA DataFrame
    >>> print table

One often first encounters the term “analysis of variance” when the predictor is categorical, so that you’re fitting the model y=α+βi y=α+βi where ii identifies which category is the value of the predictor. If there are kk categories, you’d get k−1k−1 degrees of freedom in the numerator in the F-statistic, and usually n−kn−k degrees of freedom in the denominator. But the distinction between regression and analysis of variance is still the same for this kind of model.

We have two tunnel shapes in our plot.

Plotting Regression Line in Python – Carat less than 0.75

Below we select diamonds that are below the mean carat and see if the model is a little better
df_low_carat = df[df["carat"]< 0.75]
print ("OLS regression model for the association between price and carat")
# reg_log is the model name , followed by equal sign.

reg_log_low = smf.ols(formula = 'log_price ~ carat', data=df_low_carat).fit()
# print the result
print (reg_log_low.summary())
OLS regression model for the association between price and carat
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.817
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                 1.337e+05
Date:                Fri, 11 Aug 2017   Prob (F-statistic):               0.00
Time:                        22:55:36   Log-Likelihood:                -24.378
No. Observations:               30034   AIC:                             52.76
Df Residuals:                   30032   BIC:                             69.38
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.4605      0.005   1213.174      0.000         5.452     5.469
carat          3.4493      0.009    365.709      0.000         3.431     3.468
==============================================================================
Omnibus:                       63.342   Durbin-Watson:                   0.845
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               79.446
Skew:                          -0.003   Prob(JB):                     5.60e-18
Kurtosis:                       3.252   Cond. No.                         8.16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#Q-Q plot for normality
fig4=sm.qqplot(reg_log_low.resid, line='r')
fitted = 5.4605 + 3.4493*df_low_carat["carat"] # This just needs to be whatever the linear regrsssion equation is
fig = plt.figure()
axes = fig.add_axes([0, 0, 1.0, 1.0]) 
axes.axhline(color="black", ls="--") # This creates a horizon line (like abline in R)
axes.set_xlabel('Fitted Value')
axes.set_ylabel('Residuals')
axes.set_title('Residuals Versus Fits')
axes.scatter(fitted, df_low_carat["log_price"]  - fitted, marker = "o")

We learned about the concepts of linear regression and how to do plotting of regression line in python. We implemented the model using statsmodel library as well.

Hope you liked our example and have tried coding the model as well. Do let me know your feedback in the comment section below.

References

This Post Has 2 Comments

  1. Peter Westfall

    Actually, kurtosis does not measure peakedness/flatness at all. The beta(.5,1) distribution is infinitely peaked, but with negative excess kurtosis. And if you mix a Uniform with a Cauchy, with .0000000001 mixing probability on the Cauchy, you have a flat-topped distribution with infinite kurtosis. All you can say for sure is that kurtosis measures the outlier characteristic of the data/distribution. Please see the current Wikipedia entry for details.

    1. piush vaish

      Thank you for pointing out my mistake. I have changed the comment about Kurtosis.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.