Visualization is the presentation of data in a pictorial or graphical format. It lets decision-makers see analytics presented visually, so they can grasp difficult concepts or identify new patterns. This post visualizes the Kaggle house prices dataset. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the competition challenges participants to predict the final price of each home.

Find all categorical data

In [5]:

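# Assumes `train` is the Kaggle training set, already loaded as a pandas DataFrame.
# Collect the names of all columns stored as strings (dtype 'object').
cats = []
for col in train.columns.values:
    if train[col].dtype == 'object':
        cats.append(col)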

Create separate datasets for Continuous vs Categorical

In [6]:

train_cont = train.drop(cats, axis=1)
train_cat = train[cats]
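
The same split can also be sketched with pandas' select_dtypes, which avoids the explicit loop. The small `toy` frame below is made up for illustration and only stands in for `train`; its values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `train`: two numeric columns and two string columns.
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000],
    "SalePrice": [200000, 180000, 220000],
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype="object"),
    "CentralAir": pd.Series(["Y", "Y", "N"], dtype="object"),
})

# Columns with dtype 'object' are treated as categorical, as in the loop above.
toy_cat = toy.select_dtypes(include=["object"])
# Everything else is treated as numerical.
toy_cont = toy.select_dtypes(exclude=["object"])

print(list(toy_cat.columns))
print(list(toy_cont.columns))
```

Note that this, like the loop, classifies by storage type only: a category coded as integers still ends up on the numerical side.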

Numerical Features

A numerical or continuous variable (attribute) is one that may take on any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical variables, interval and ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good example is temperature in degrees Celsius. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. For example, we cannot say that one day is twice as hot as another day. In contrast, a ratio variable has values with a true zero and can be added, subtracted, multiplied or divided (e.g., weight).
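
A small worked example may make the interval/ratio distinction concrete. The two temperatures below are illustrative values, and the helper name `celsius_to_kelvin` is introduced only for this sketch.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale, true zero)."""
    return c + 273.15

day_a_c, day_b_c = 10.0, 20.0  # illustrative temperatures in degrees Celsius

# Differences are meaningful on an interval scale: the gap is the same in both units.
diff_c = day_b_c - day_a_c
diff_k = celsius_to_kelvin(day_b_c) - celsius_to_kelvin(day_a_c)

# Ratios are not: 20/10 is 2 in Celsius, but only about 1.035 in kelvin.
ratio_c = day_b_c / day_a_c
ratio_k = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 3))
```

The ratio changes with the choice of zero point, which is why "twice as hot" has no meaning on the Celsius scale.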

In [100]:

train_cont.columns

Out[100]:

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [12]:

print("Some Statistics of the Housing Price:\n")
print(train['SalePrice'].describe())
print("\nThe median of the Housing Price is: ", train['SalePrice'].median(axis = 0))

Some Statistics of the Housing Price:

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

The median of the Housing Price is:  163000.0

Visualization of House Prices

Plotting univariate distributions

In [43]:

#Draw a histogram without fitting a kernel density estimate (KDE).
sns.distplot(train['SalePrice'], kde = False, color = 'b', hist_kws={'alpha': 0.9})

Out[43]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f1182802b0>

In [101]:

#draw a histogram and fit a kernel density estimate (KDE).
sns.distplot(train['YrSold'], hist_kws={'alpha': 0.9})

Out[101]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f12187db70>

In [102]:

#draw a histogram and fit a kernel density estimate (KDE).
sns.distplot(train['YearBuilt'], hist_kws={'alpha': 0.9})

Out[102]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f12217d780>

In [103]:

plt.figure(figsize = (12, 6))
sns.boxplot(x = 'MSSubClass', y = 'SalePrice',  data = train)
xt = plt.xticks(rotation=45)

In [104]:

#Plot a normalized histogram of YearBuilt; the SalePrice histogram is left commented out.
#density=True replaces the normed=True argument of older matplotlib versions.
#plt.hist(train['SalePrice'], bins=100, histtype='stepfilled', density=True, color='b', label='SalePrice')
plt.hist(train['YearBuilt'], bins=100, histtype='stepfilled', density=True, color='r', alpha=0.5, label='YearBuilt')
plt.title("YearBuilt Histogram")
plt.xlabel("Year")
plt.ylabel("Density")
plt.legend()
plt.show()

In [105]:

#Basic Data Plotting with Matplotlib: Lines, Points & Formatting
sns.set_style("whitegrid")
plt.plot(train['YearBuilt'], train['SalePrice'],marker='o', linestyle='--', color='r',label='Year Built')
#plt.plot(train['YrSold'], train['SalePrice'],  label='Year Sold')
plt.xlabel('YearBuilt')
plt.ylabel('SalePrice')
plt.title('Sales Price Vs Year Built')
plt.show()

Let us draw scatter plots of some variables against the sale price

In [106]:

plt.figure(1)
f, axarr = plt.subplots(3, 2, figsize=(10, 9))
price = train.SalePrice.values
axarr[0, 0].scatter(train['YearBuilt'].values, price)
axarr[0, 0].set_title('YearBuilt')
axarr[0, 1].scatter(train.GrLivArea.values, price)
axarr[0, 1].set_title('GrLivArea')
axarr[1, 0].scatter(train.LotFrontage.values, price)
axarr[1, 0].set_title('LotFrontage')
axarr[1, 1].scatter(train['LotArea'].values, price)
axarr[1, 1].set_title('LotArea')
axarr[2, 0].scatter(train.OverallQual.values, price)
axarr[2, 0].set_title('OverallQual')
axarr[2, 1].scatter(train.PoolArea.values, price)
axarr[2, 1].set_title('PoolArea')
f.text(-0.01, 0.5, 'Sale Price', va='center', rotation='vertical', fontsize = 18)
plt.tight_layout()
plt.show()

<matplotlib.figure.Figure at 0x1f12257fa58>

In [107]:

fig = plt.figure(2, figsize=(9, 7))
plt.subplot(211)
plt.scatter(train.YrSold.values, price)
plt.title('Year Sold')

plt.subplot(212)
plt.scatter(train.MoSold.values, price)
plt.title('Month Sold')

fig.text(-0.01, 0.5, 'Sale Price', va = 'center', rotation = 'vertical', fontsize = 12)

plt.tight_layout()

In [108]:

#corr = df.select_dtypes(include = ['float64', 'int64']).iloc[:, 1:].corr()
# selecting without ID
corr = train_cont.iloc[:, 1:].corr()
plt.figure(figsize=(8, 8))
sns.heatmap(corr, vmax=1, square=True)

Out[108]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f12266b160>

List the numerical features in descending order of their absolute correlation with Sale Price:

In [109]:

cor_dict = corr['SalePrice'].to_dict()
del cor_dict['SalePrice']
print("List the numerical features in descending order of their absolute correlation with Sale Price:\n")
for ele in sorted(cor_dict.items(), key = lambda x: -abs(x[1])):
    print("{0}: \t{1}".format(*ele))

List the numerical features in descending order of their absolute correlation with Sale Price:

OverallQual: 	0.7909816005838047
GrLivArea: 	0.7086244776126511
GarageCars: 	0.640409197258349
GarageArea: 	0.6234314389183598
TotalBsmtSF: 	0.6135805515591944
1stFlrSF: 	0.6058521846919166
FullBath: 	0.5606637627484452
TotRmsAbvGrd: 	0.5337231555820238
YearBuilt: 	0.5228973328794967
YearRemodAdd: 	0.5071009671113867
GarageYrBlt: 	0.48636167748786213
MasVnrArea: 	0.4774930470957107
Fireplaces: 	0.4669288367515242
BsmtFinSF1: 	0.38641980624215627
LotFrontage: 	0.35179909657067854
WoodDeckSF: 	0.32441344456813076
2ndFlrSF: 	0.31933380283206614
OpenPorchSF: 	0.31585622711605577
HalfBath: 	0.2841076755947784
LotArea: 	0.2638433538714063
BsmtFullBath: 	0.22712223313149718
BsmtUnfSF: 	0.214479105546969
BedroomAbvGr: 	0.1682131543007415
KitchenAbvGr: 	-0.1359073708421417
EnclosedPorch: 	-0.12857795792595636
ScreenPorch: 	0.11144657114291048
PoolArea: 	0.09240354949187278
MSSubClass: 	-0.08428413512659523
OverallCond: 	-0.0778558940486776
MoSold: 	0.04643224522381936
3SsnPorch: 	0.04458366533574792
YrSold: 	-0.028922585168730426
LowQualFinSF: 	-0.02560613000068015
MiscVal: 	-0.02118957964030379
BsmtHalfBath: 	-0.016844154297359294
BsmtFinSF2: 	-0.011378121450215216
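
The loop above sorts a dictionary by hand. Assuming a pandas Series shaped like corr['SalePrice'], the same ranking can be sketched as follows; the four coefficients are rounded copies from the listing above, and the name `target_corr` is introduced only for this example.

```python
import pandas as pd

# Rounded coefficients copied from the listing, standing in for corr['SalePrice'].
target_corr = pd.Series({
    "SalePrice": 1.0,
    "YrSold": -0.029,
    "GrLivArea": 0.709,
    "KitchenAbvGr": -0.136,
    "OverallQual": 0.791,
})

# Drop the target's correlation with itself, then order by absolute value.
ranked = target_corr.drop("SalePrice")
ranked = ranked.reindex(ranked.abs().sort_values(ascending=False).index)

for name, value in ranked.items():
    print("{0}: \t{1}".format(name, value))
```

Sorting on the absolute value keeps negatively correlated features such as KitchenAbvGr ahead of weaker positive ones.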

The housing price correlates most strongly with OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd, GarageYrBlt, MasVnrArea and Fireplaces, with coefficients ranging from about 0.79 down to about 0.47.

Some of the features have tiny correlation coefficients with the target variable, such as MiscVal (-0.021), BsmtHalfBath (-0.017) and BsmtFinSF2 (-0.011). Sometimes it is not worth keeping features like these. Eliminating features may help some machine learning methods, while it affects others (such as tree-based methods) very little or not at all. Yet these features may still be useful in combination with other features.
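
As a rough sketch of how such candidates could be flagged, the snippet below filters a few coefficients copied (rounded) from the listing above. The cutoff of 0.05 is an assumption chosen only for illustration, not a recommended value.

```python
# A few coefficients copied (rounded) from the correlation listing.
cor_sample = {
    "OverallQual": 0.791,
    "GrLivArea": 0.709,
    "MiscVal": -0.021,
    "BsmtHalfBath": -0.017,
    "BsmtFinSF2": -0.011,
}

THRESHOLD = 0.05  # assumed cutoff, for illustration only

# Features whose absolute correlation with SalePrice falls below the cutoff.
weak = sorted(name for name, r in cor_sample.items() if abs(r) < THRESHOLD)
print(weak)
```

A low linear correlation is only a hint: as noted above, a flagged feature may still carry signal in combination with others.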

But some of the features are highly correlated with each other.
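
One way to list such pairs is to scan the upper triangle of the correlation matrix. The sketch below uses synthetic data in place of train_cont; the column names `a`, `b`, `c` and the cutoff of 0.8 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_cont: 'b' nearly duplicates 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

toy_corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = toy_corr.where(np.triu(np.ones(toy_corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> r and keep pairs above the assumed cutoff.
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.8]
print(strong_pairs)
```

On the real data the same steps would start from the `corr` matrix computed earlier.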

In [110]:

train.plot(kind="scatter", x="GrLivArea", y="OverallQual")

Out[110]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f122d379e8>

In [111]:

# A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure
ax = sns.jointplot(x="SalePrice", y="OverallQual", data=train, height=10)

In [112]:

#Plot data and a linear regression model fit.
#Use a 68% confidence interval, which corresponds with the standard error of the estimate:
ax = sns.regplot(x = 'OverallQual', y = 'SalePrice', data = train, color = 'Green',ci=68)
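
The link between 68% and one standard error can be checked numerically, assuming the estimate is approximately normally distributed. The helper `normal_coverage` is introduced only for this sketch.

```python
import math

def normal_coverage(z):
    """Probability that a standard normal variable falls within +/- z."""
    return math.erf(z / math.sqrt(2))

# Coverage of an interval of +/- 1 standard error around the estimate.
print(round(normal_coverage(1.0), 4))
# The more common 95% interval corresponds to about +/- 1.96 standard errors.
print(round(normal_coverage(1.96), 4))
```

Under that assumption, ci=68 draws a band of roughly one standard error on each side of the fitted line.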

Hexbin plots

The bivariate analog of a histogram is known as a “hexbin” plot because it shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets. It’s available through the matplotlib plt.hexbin function and as a style in jointplot().
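
A minimal sketch of the plain matplotlib route follows. It uses synthetic data rather than the housing columns, and the gridsize of 30 and the colormap are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated data standing in for two numerical columns.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.5, size=n)

fig, ax = plt.subplots(figsize=(6, 5))
# gridsize sets the number of hexagons along the x direction.
hb = ax.hexbin(x, y, gridsize=30, cmap="Greens")
fig.colorbar(hb, ax=ax, label="count")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Each observation lands in exactly one hexagon, so the bin counts sum to n.
counts = hb.get_array()
print(int(counts.sum()), n)
```

With only a few hundred points most hexagons would hold zero or one observation, which is why the plot suits larger datasets.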

In [113]:

sns.jointplot(x="SalePrice", y="OverallQual", data=train, height=10, kind="hex", color="#4CB391")

Out[113]:

<seaborn.axisgrid.JointGrid at 0x1f122e27860>

Categorical Variables

A categorical or discrete variable is one that has two or more categories (values). There are two types of categorical variables, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering to the categories. An ordinal variable has a clear ordering. For example, temperature recorded as low, medium, or high is a variable with three ordered categories. A frequency table counts how often each category of the variable in question occurs. It may be enhanced by adding the percentage of observations that fall into each category.
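
As an illustration, a frequency table with percentages can be sketched from the SaleCondition counts reported further below in this post. The names `counts` and `freq_table` are introduced only for this example.

```python
import pandas as pd

# SaleCondition counts as reported for the training set in this post.
counts = pd.Series(
    {"Normal": 1198, "Partial": 125, "Abnorml": 101, "Family": 20, "Alloca": 12, "AdjLand": 4},
    name="count",
)

# Add the share of each category as a percentage of all observations.
freq_table = pd.DataFrame({
    "count": counts,
    "percent": (counts / counts.sum() * 100).round(2),
})
print(freq_table)
```

On the real data, train['SaleCondition'].value_counts(normalize=True) should give the same proportions directly.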

In [116]:

train_cat.columns

Out[116]:

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

In [117]:

# Let's see how many examples we have of SaleCondition
train["SaleCondition"].value_counts()

Out[117]:

Normal     1198
Partial     125
Abnorml     101
Family       20
Alloca       12
AdjLand       4
Name: SaleCondition, dtype: int64

In [118]:

# Show value counts for a single categorical variable:
ax = sns.countplot(x="SaleCondition", data=train)

In [119]:

#Show value counts for two categorical variables:
ax = sns.countplot(y="SaleType", hue="SaleCondition", data=train)

In [121]:

plt.figure(figsize = (12, 6))
sns.boxplot(x = 'Neighborhood', y = 'SalePrice',  data = train)
xt = plt.xticks(rotation=45)

In [122]:

fig, ax = plt.subplots(2, 1, figsize = (10, 6))
sns.boxplot(x = 'SaleType', y = 'SalePrice', data = train, ax = ax[0])
sns.boxplot(x = 'SaleCondition', y = 'SalePrice', data = train, ax = ax[1])
plt.tight_layout()

In [127]:

g = sns.FacetGrid(train, col = 'YrSold', col_wrap = 6)
#g.map returns the grid itself, so .set can be chained on the same line.
g.map(sns.boxplot, 'MoSold', 'SalePrice', palette='Set2', order = range(1, 13)).set(ylim = (0, 500000))
plt.tight_layout()

In [129]:

#Home Functionality
sns.violinplot(x = 'Functional', y = 'SalePrice', data = train)

In [148]:

#catplot(kind="point") replaces the older factorplot; height replaces size.
sns.catplot(x="HeatingQC", y="SalePrice", data=train, kind="point", height=5, aspect=.8)

Out[148]:

<seaborn.axisgrid.FacetGrid at 0x1f1213b3630>

In [155]:

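#catplot(kind='point') replaces the older factorplot; height replaces size.
sns.catplot(x = 'FireplaceQu', y = 'SalePrice', data = train, kind = 'point', color = 'm', col = "Street",
            estimator = np.median, order = ['Ex', 'Gd', 'TA', 'Fa', 'Po'], height = 4.5,  aspect=1.35)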

Out[155]:

<seaborn.axisgrid.FacetGrid at 0x1f1209be550>

In [152]:

#catplot(kind='point') replaces the older factorplot; height replaces size.
sns.catplot(x = 'HeatingQC', y = 'SalePrice', hue = 'CentralAir', estimator = np.mean, data = train,
            kind = 'point', height = 4.5, aspect = 1.4)

Out[152]:

<seaborn.axisgrid.FacetGrid at 0x1f11ce500b8>

In [151]:

#Heating
pd.crosstab(train.HeatingQC, train.CentralAir)

Out[151]:

CentralAir   N    Y
HeatingQC
Ex           8  733
Fa          24   25
Gd          13  228
Po           1    0
TA          49  379

In [156]:

#Street & Alley Access
fig, ax = plt.subplots(1, 2, figsize = (10, 4))
sns.boxplot(x = 'Street', y = 'SalePrice', data = train, ax = ax[0])
sns.boxplot(x = 'Alley', y = 'SalePrice', data = train, ax = ax[1])
plt.tight_layout()

What are your opinions and suggestions about these approaches? Please let me know in the comments below or contact me.
