Exploratory Data Analysis with pandas – 1

This is the first post in a series on exploratory data analysis (EDA) with pandas. Clear plots that explicate the relationships between variables can lead you to create new and better features, with more predictive power than the existing ones.

Exploratory data analysis is effective when it has the following characteristics:

• It should be fast, allowing you to explore, develop new ideas and test them, and restart with a new exploration and fresh ideas.

• It should be graphic in order to better represent data as a whole, no matter how high its dimensionality is.

If, instead, your goal is to communicate your findings through polished visualizations, you will notice that the pandas graphical outputs are not easy to customize. When specific graphic outputs are paramount, it is better to work directly with matplotlib instructions from the start.
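
As a rough illustration of working with matplotlib from the start, the sketch below draws a boxplot without going through pandas. The data is made up (three random samples with hypothetical group names), and the styling calls are only examples of what you can control directly.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: three made-up samples, used only for illustration
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=1.0, size=50) for m in (4.0, 5.0, 6.5)]
labels = ['group A', 'group B', 'group C']

# Create the figure and axes explicitly so every element can be styled
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(samples)                  # one box per sample, at x = 1, 2, 3
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(labels)
ax.set_ylabel('measurement (cm)')
ax.set_title('Boxplot drawn directly with matplotlib')
ax.grid(axis='y', linestyle=':', alpha=0.5)
fig.tight_layout()
```

Holding on to fig and ax is the point of the exercise: titles, ticks, grids, and layout are all adjusted through them.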

In [5]:
import pandas as pd
%matplotlib inline
print('Your pandas version is: %s' % pd.__version__)
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
groups = list(iris.target)
iris_df['groups'] = pd.Series([iris.target_names[k] for k in groups])
Your pandas version is: 0.18.0

Boxplots and histograms

Distributions should always be the first aspect of your data that you check. Boxplots sketch the key figures of a distribution and help you spot outliers.

In [8]:
boxplots = iris_df.boxplot(return_type='axes')
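
To see the figures that a boxplot sketches, you can compute them yourself. The following is a minimal sketch, assuming the 1.5 × IQR whisker rule that matplotlib applies by default; the variable names (q1, iqr, lower_fence, and so on) are introduced here for illustration. It rebuilds the iris DataFrame so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Quartiles: the box spans Q1 to Q3, with a line at the median
q1 = iris_df.quantile(0.25)
median = iris_df.quantile(0.50)
q3 = iris_df.quantile(0.75)
iqr = q3 - q1

# Assumed fences: 1.5 * IQR beyond the quartiles (matplotlib's default whis)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot draws as outliers
is_outlier = (iris_df.lt(lower_fence, axis='columns')
              | iris_df.gt(upper_fence, axis='columns'))
outlier_counts = is_outlier.sum()

print(pd.DataFrame({'Q1': q1, 'median': median, 'Q3': q3,
                    'outliers': outlier_counts}))
```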

If your data already contains groups (from categorical variables, or derived from unsupervised learning), name the variable you want to plot with the column parameter and split the data by group with the by parameter, followed by the string name of the grouping variable:

In [7]:
boxplots = iris_df.boxplot(column='sepal length (cm)', by='groups', return_type='axes')

This way, you can quickly see whether the variable discriminates well between the groups.
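
As a numeric companion to the grouped boxplot, here is a small sketch: describe() applied per group returns the quartiles that each box is drawn from, so you can compare how far apart the groups sit. The name group_summary is just an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['groups'] = pd.Series([iris.target_names[k] for k in iris.target])

# Summary statistics of one variable, computed separately for each group
group_summary = iris_df.groupby('groups')['sepal length (cm)'].describe()

# The quartile columns correspond to the bottom, middle line, and top of each box
print(group_summary[['25%', '50%', '75%']])
```

If the quartile ranges of the groups barely overlap, the variable is likely to separate them well.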

However, boxplots cannot give you as complete a view of a distribution as histograms and density plots can. For instance, histograms and density plots let you see whether a distribution has peaks or valleys.

You can obtain both histograms and density plots by using the plot method. This method allows you to represent the whole dataset, specific groups of variables (you just have to provide a list of the string names and do some fancy indexing), or even single variables.
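
The three cases just described might look like the sketch below. It is only one possible arrangement: the subplot layout and the names axes and petal_cols are assumptions made for this example, and kind='density' relies on SciPy being installed.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One row of three axes, so each plot call draws on its own panel
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Whole dataset: overlaid histograms of all four measurements
iris_df.plot(kind='hist', bins=20, alpha=0.5, ax=axes[0])

# Specific group of variables, selected with a list of column names
petal_cols = ['petal length (cm)', 'petal width (cm)']
iris_df[petal_cols].plot(kind='density', ax=axes[1])  # requires SciPy

# Single variable: a Series also has the plot method
iris_df['petal width (cm)'].plot(kind='hist', bins=20, ax=axes[2])

fig.tight_layout()
```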

Exploratory data analysis with pandas – 1 is continued in part 2.
