Visualization is the presentation of data in a pictorial or graphical format. It lets decision-makers see analytics presented visually, so they can grasp difficult concepts or identify new patterns. This walkthrough visualizes the Kaggle house prices dataset. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the competition challenges participants to predict the final price of each home.
Find all categorical data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
cats = []
for col in train.columns.values:
    if train[col].dtype == 'object':
        cats.append(col)
Create separate datasets for Continuous vs Categorical
train_cont = train.drop(cats, axis=1)
train_cat = train[cats]
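As an alternative to the explicit loop, pandas can split the columns by dtype directly. The following is a minimal sketch on a tiny made-up frame: the name toy and its values are assumptions for illustration and are not taken from the Ames data.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from the Ames data.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})

# Numeric columns form the continuous set; everything else is treated as categorical.
toy_cont = toy.select_dtypes(include="number")
toy_cat = toy.select_dtypes(exclude="number")

print(list(toy_cont.columns))
print(list(toy_cat.columns))
```

The same two calls should work on train in place of toy, assuming its text columns are the only non-numeric ones.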
Numerical Features
train_cont.columns
print("Some Statistics of the Housing Price:\n")
print(train['SalePrice'].describe())
print("\nThe median of the Housing Price is: ", train['SalePrice'].median())
Visualization of House Prices
Plotting univariate distributions
# Draw a histogram without fitting a kernel density estimate (KDE).
sns.histplot(train['SalePrice'], kde=False, color='b', alpha=0.9)
plt.show()
# Draw a histogram and overlay a kernel density estimate (KDE).
sns.histplot(train['YrSold'], kde=True, stat='density', alpha=0.9)
plt.show()
# Draw a histogram and overlay a kernel density estimate (KDE).
sns.histplot(train['YearBuilt'], kde=True, stat='density', alpha=0.9)
plt.show()
plt.figure(figsize = (12, 6))
sns.boxplot(x = 'MSSubClass', y = 'SalePrice', data = train)
xt = plt.xticks(rotation=45)
# Plot a normalized histogram; the commented-out line would overlay a second variable on the same axes.
#plt.hist(train['SalePrice'], bins=100, histtype='stepfilled', density=True, color='b', label='SalePrice')
plt.hist(train['YearBuilt'], bins=100, histtype='stepfilled', density=True, color='r', alpha=0.5, label='YearBuilt')
plt.title("YearBuilt Histogram")
plt.xlabel("Year")
plt.ylabel("Density")
plt.legend()
plt.show()
#Basic Data Plotting with Matplotlib: Lines, Points & Formatting
sns.set_style("whitegrid")
plt.plot(train['YearBuilt'], train['SalePrice'],marker='o', linestyle='--', color='r',label='Year Built')
#plt.plot(train['YrSold'], train['SalePrice'], label='Year Sold')
plt.xlabel('YearBuilt')
plt.ylabel('SalePrice')
plt.title('Sales Price Vs Year Built')
plt.show()
Let us draw scatter plots of several variables against the sale price
f, axarr = plt.subplots(3, 2, figsize=(10, 9))
price = train.SalePrice.values
axarr[0, 0].scatter(train['YearBuilt'].values, price)
axarr[0, 0].set_title('YearBuilt')
axarr[0, 1].scatter(train.GrLivArea.values, price)
axarr[0, 1].set_title('GrLivArea')
axarr[1, 0].scatter(train.LotFrontage.values, price)
axarr[1, 0].set_title('LotFrontage')
axarr[1, 1].scatter(train['LotArea'].values, price)
axarr[1, 1].set_title('LotArea')
axarr[2, 0].scatter(train.OverallQual.values, price)
axarr[2, 0].set_title('OverallQual')
axarr[2, 1].scatter(train.PoolArea.values, price)
axarr[2, 1].set_title('PoolArea')
f.text(-0.01, 0.5, 'Sale Price', va='center', rotation='vertical', fontsize = 18)
plt.tight_layout()
plt.show()
fig = plt.figure(2, figsize=(9, 7))
plt.subplot(211)
plt.scatter(train.YrSold.values, price)
plt.title('Year Sold')
plt.subplot(212)
plt.scatter(train.MoSold.values, price)
plt.title('Month Sold')
fig.text(-0.01, 0.5, 'Sale Price', va = 'center', rotation = 'vertical', fontsize = 12)
plt.tight_layout()
#corr = df.select_dtypes(include = ['float64', 'int64']).iloc[:, 1:].corr()
# selecting without ID
corr = train_cont.iloc[:, 1:].corr()
plt.figure(figsize=(8, 8))
sns.heatmap(corr, vmax=1, square=True)
List the numerical features in descending order of their absolute correlation with the sale price:
cor_dict = corr['SalePrice'].to_dict()
del cor_dict['SalePrice']
print("Numerical features in descending order of their absolute correlation with SalePrice:\n")
for ele in sorted(cor_dict.items(), key = lambda x: -abs(x[1])):
    print("{0}: \t{1}".format(*ele))
The sale price correlates strongly with OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd, GarageYrBlt, MasVnrArea and Fireplaces.
Some features have tiny correlation coefficients with the target variable, for example MiscVal (-0.0212), BsmtHalfBath (-0.0168) and BsmtFinSF2 (-0.0114). It is sometimes not worth keeping features like these. Feature elimination may help some machine learning methods while having little or no effect on others (such as tree-based methods). Still, these features may be useful in combination with other features.
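A simple correlation filter of this kind can be sketched as follows. The frame toy, its column names, and the 0.1 cutoff are all hypothetical choices for illustration; the text does not prescribe a threshold, and a feature dropped this way may still matter in combination with others.

```python
import pandas as pd

# Made-up data: "strong" tracks the target, "flat" is constructed to be uncorrelated with it.
toy = pd.DataFrame({
    "strong": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "flat": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

threshold = 0.1  # hypothetical cutoff, chosen only for this example
weak = corr_with_target[corr_with_target.abs() < threshold].index.tolist()
kept = [col for col in corr_with_target.index if col not in weak]

print("weak:", weak)
print("kept:", kept)
```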
But some of the features are highly correlated with each other.
train.plot(kind="scatter", x="GrLivArea", y="OverallQual")
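One way to find which features are highly correlated with each other is to scan the correlation matrix pair by pair. The sketch below does this on a made-up three-column frame; the column names a, b, c and the 0.8 cutoff are assumptions for the example, not values from the Ames data.

```python
from itertools import combinations

import pandas as pd

# Made-up data: "b" is roughly twice "a", while "c" is unrelated to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
})

corr = toy.corr()
threshold = 0.8  # hypothetical cutoff for "highly correlated"

# Visit each unordered pair of columns once and keep those above the cutoff.
high_pairs = [
    (x, y, corr.loc[x, y])
    for x, y in combinations(corr.columns, 2)
    if abs(corr.loc[x, y]) > threshold
]

for x, y, r in high_pairs:
    print(f"{x} ~ {y}: r = {r:.3f}")
```

Applied to the numeric housing columns, a scan like this would be expected to surface pairs such as the two garage-size features, though that is not verified here.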
# A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure
g = sns.jointplot(x="SalePrice", y="OverallQual", data=train, height=10)
#Plot data and a linear regression model fit.
#Use a 68% confidence interval, which corresponds with the standard error of the estimate:
ax = sns.regplot(x = 'OverallQual', y = 'SalePrice', data = train, color = 'Green',ci=68)
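The link between a 68% interval and the standard error comes from the normal distribution: about 68% of its mass lies within one standard deviation of the mean. The short check below assumes a normally distributed estimate, and normal_coverage is a hypothetical helper written for this illustration. seaborn itself estimates the band by bootstrapping, so the correspondence is approximate.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


one_se = normal_coverage(1.0)    # coverage of a +/- 1 standard error band
wide = normal_coverage(1.96)     # coverage of the usual +/- 1.96 standard error band

print(f"within 1.00 SE: {one_se:.4f}")
print(f"within 1.96 SE: {wide:.4f}")
```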
Hexbin plots
The bivariate analog of a histogram is known as a “hexbin” plot because it shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets. It’s available through the matplotlib plt.hexbin function and as a style in jointplot().
sns.jointplot(x="SalePrice", y="OverallQual", data=train, height=10, kind="hex", color="#4CB391")
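For comparison, the same kind of plot can be drawn with matplotlib directly. This is a minimal sketch on synthetic data: the arrays x and y, the grid size, and the colormap are assumptions made for illustration and have nothing to do with the housing columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, positively related data standing in for two numeric columns.
rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = 0.7 * x + rng.normal(scale=0.5, size=2000)

fig, ax = plt.subplots(figsize=(6, 5))
# Each hexagon is colored by the number of observations that fall inside it.
hb = ax.hexbin(x, y, gridsize=25, cmap="Greens")
fig.colorbar(hb, ax=ax, label="count")
ax.set_xlabel("x")
ax.set_ylabel("y")

# The per-hexagon counts add up to the number of plotted observations.
counts = hb.get_array()
total_points = int(counts.sum())
print(total_points)
plt.close(fig)
```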
Categorical Variables
A categorical or discrete variable is one that has two or more categories (values). There are two types of categorical variable: nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering to the categories. An ordinal variable has a clear ordering. For example, temperature recorded as low, medium, or high is a variable with three ordered categories. A frequency table counts how often each category of the variable in question occurs. It can be enhanced by adding the percentage of observations that fall into each category.
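A frequency table with percentages can be built from two value_counts calls. The sketch below uses a short made-up series; the labels only mimic the style of the SaleCondition column and the counts are not real.

```python
import pandas as pd

# Made-up category labels for illustration only.
condition = pd.Series(
    ["Normal", "Normal", "Abnorml", "Partial", "Normal", "Partial", "Normal", "Normal"],
    name="SaleCondition",
)

counts = condition.value_counts()                         # how often each category occurs
percent = condition.value_counts(normalize=True) * 100    # share of each category in percent

# Both series are indexed by category, so they align when combined.
freq_table = pd.DataFrame({"count": counts, "percent": percent.round(1)})
print(freq_table)
```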
train_cat.columns
# Let's see how many examples we have of SaleCondition
train["SaleCondition"].value_counts()
# Show value counts for a single categorical variable:
ax = sns.countplot(x="SaleCondition", data=train)
#Show value counts for two categorical variables:
ax = sns.countplot(y="SaleType", hue="SaleCondition", data=train)
plt.figure(figsize = (12, 6))
sns.boxplot(x = 'Neighborhood', y = 'SalePrice', data = train)
xt = plt.xticks(rotation=45)
fig, ax = plt.subplots(2, 1, figsize = (10, 6))
sns.boxplot(x = 'SaleType', y = 'SalePrice', data = train, ax = ax[0])
sns.boxplot(x = 'SaleCondition', y = 'SalePrice', data = train, ax = ax[1])
plt.tight_layout()
g = sns.FacetGrid(train, col = 'YrSold', col_wrap = 6)
g.map(sns.boxplot, 'MoSold', 'SalePrice', palette='Set2', order = range(1, 13)).set(ylim = (0, 500000))
plt.tight_layout()
#Home Functionality
sns.violinplot(x = 'Functional', y = 'SalePrice', data = train)
sns.catplot(x = 'FireplaceQu', y = 'SalePrice', data = train, kind = 'point', color = 'm', col = "Street",
            estimator = np.median, order = ['Ex', 'Gd', 'TA', 'Fa', 'Po'], height = 4.5, aspect = 1.35)
sns.catplot(x = 'HeatingQC', y = 'SalePrice', hue = 'CentralAir', kind = 'point', estimator = np.mean, data = train,
            height = 4.5, aspect = 1.4)
#Heating
pd.crosstab(train.HeatingQC, train.CentralAir)
#Street & Alley Access
fig, ax = plt.subplots(1, 2, figsize = (10, 4))
sns.boxplot(x = 'Street', y = 'SalePrice', data = train, ax = ax[0])
sns.boxplot(x = 'Alley', y = 'SalePrice', data = train, ax = ax[1])
plt.tight_layout()