
Exploratory Data Analysis with pandas – 2

Posted in Data Analysis Resources

Continued from part 1.

In [10]:
densityplot = iris_df.plot(kind='density')
In [11]:
single_distribution = iris_df['petal width (cm)'].plot(kind='hist', alpha=0.5)

Scatterplots

Scatterplots are an effective way to check whether two variables are related nonlinearly, and they can suggest which transformation would best linearize that relationship.

In [12]:
colors_palette = {0: 'red', 1: 'yellow', 2:'blue'}
colors = [colors_palette[c] for c in groups]
simple_scatterplot = iris_df.plot(kind='scatter', x=0, y=1, c=colors)
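
To make the linearization idea concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the iris measurements). It compares the linear correlation before and after a log transform of an exponentially growing variable; the variable names are hypothetical.

```python
import numpy as np

# Synthetic data (illustrative assumption): y grows exponentially with x,
# with a little multiplicative noise on top.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = np.exp(0.5 * x) * rng.lognormal(mean=0.0, sigma=0.1, size=x.size)

# Pearson correlation only measures linear association.
r_raw = np.corrcoef(x, y)[0, 1]

# Taking the log of y turns the exponential curve into a straight line.
r_log = np.corrcoef(x, np.log(y))[0, 1]

print(f"correlation of x with y:      {r_raw:.3f}")
print(f"correlation of x with log(y): {r_log:.3f}")
```

If a scatterplot of the raw pair looks curved while the transformed pair looks like a straight band, the transformation is a reasonable candidate for linearizing the relationship.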

Scatterplots can be turned into hexagonal binning plots. These help you visualize point densities and so reveal natural clusters hidden in your data, whether you plot some of the original variables or the dimensions obtained by PCA or another dimensionality reduction algorithm.

In [13]:
hexbin = iris_df.plot(kind='hexbin', x=0, y=1, gridsize=10)
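
The text mentions plotting dimensions obtained by PCA. As a rough sketch of how such dimensions could be produced, the block below projects a numeric table onto its first two principal components using only NumPy's SVD. The table is synthetic and the names `features`, `projected`, `pc1` and `pc2` are hypothetical; in practice a library implementation of PCA would do the same job.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric feature table (illustrative assumption).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(150, 4)) * [3.0, 1.0, 0.5, 0.2],
                        columns=['f1', 'f2', 'f3', 'f4'])

# PCA by hand: center each column, then take the SVD of the centered matrix.
values = features.to_numpy()
centered = values - values.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained = singular_values ** 2 / np.sum(singular_values ** 2)

# Project the observations onto the first two principal directions
# (the first two rows of `components`).
projected = pd.DataFrame(centered @ components[:2].T, columns=['pc1', 'pc2'])
print(explained.round(3))
```

The resulting frame can then be drawn in the same way as above, for example with projected.plot(kind='hexbin', x='pc1', y='pc2', gridsize=10).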

The scatterplot matrix shows the joint distributions of your features, pair by pair. It thus helps you locate groups in the data and verify how well they can be separated.

If you have only a few variables (with many, the visualization gets cluttered), a quick approach is to draw a matrix of scatterplots with a single command:

In [14]:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas releases
colors_palette = {0: "red", 1: "green", 2: "blue"}
colors = [colors_palette[c] for c in groups]   
matrix_of_scatterplots = scatter_matrix(iris_df, alpha=0.2, figsize=(6, 6), color=colors, diagonal='kde')

The alpha parameter controls the amount of transparency, and figsize sets the width and height of the matrix in inches. The color parameter accepts a list giving the color of each point, which lets you distinguish the groups in the data. Finally, by setting the diagonal parameter to 'kde' or 'hist', you choose whether the diagonal of the scatter matrix shows a density curve or a histogram (faster) for each variable.

Parallel coordinates

A parallel coordinates plot gives you a hint about which variables discriminate best between groups.

Each observation is drawn as a line running across all the variables, which are arranged side by side (in arbitrary order) along the abscissa. The plot helps you spot whether the observations form streams that match your classes, and which variables best separate those streams (the most useful predictor variables).

parallel_coordinates is a pandas function. To work properly, it needs just two arguments: the DataFrame holding the data and the name (as a string) of the column containing the groups whose separability you want to test. That column has to be part of the DataFrame, so add it before plotting. Don't forget to remove it once you finish exploring, using the DataFrame.drop('variable name', axis=1) method.

In [15]:
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting was removed in later pandas releases
iris_df['groups'] = [iris.target_names[k] for k in groups]
pll = parallel_coordinates(iris_df,'groups')
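
The cleanup step mentioned above can be sketched as follows. The small table is made up and only stands in for iris_df; the column name 'groups' matches the one used in the cell above.

```python
import pandas as pd

# Tiny made-up table standing in for iris_df (values are illustrative only).
df = pd.DataFrame({'sepal length (cm)': [5.1, 7.0, 6.3],
                   'petal length (cm)': [1.4, 4.7, 6.0]})
feature_columns = list(df.columns)

# Add the label column that parallel_coordinates needs...
df['groups'] = ['setosa', 'versicolor', 'virginica']

# ...and remove it again once the exploration is done.
# drop returns a new DataFrame, so the result has to be assigned back.
df = df.drop('groups', axis=1)
print(list(df.columns))
```

Assigning the result back matters because drop does not modify the DataFrame in place by default.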
