# I am just ignoring warnings because seaborn hasn't been updated in a while
# Never do this unless you know what warnings you are ignoring
import warnings
warnings.filterwarnings('ignore')
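
A narrower alternative to the blanket filter above is to silence only one warning category, so everything else still shows up. The sketch below uses only the standard library; `noisy_plot_call` is a hypothetical stand-in for a library call (not a seaborn function), and the assumption that the noise is a `FutureWarning` is mine.

```python
import warnings


def noisy_plot_call():
    # Hypothetical stand-in for a plotting call that emits two kinds of warnings
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("data looks suspicious", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # show every warning by default
    warnings.filterwarnings("ignore", category=FutureWarning)  # silence only this category
    result = noisy_plot_call()

# Only the warning that was not filtered out is recorded
remaining = [w.category.__name__ for w in caught]
print(remaining)
```

In a notebook, the single line `warnings.filterwarnings('ignore', category=FutureWarning)` would have the same effect for the rest of the session.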

A few EDA pointers

Some pointers on useful pandas and seaborn functions for conducting exploratory data analysis (EDA) on a new dataset. This is of course not the only way of doing EDA, so help yourself to what you find useful and leave the rest.

Textual and single variable EDA

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# Personal preference, this is about the maximum amount of customization I think is needed for EDA
sns.set(context='notebook', style='ticks', font_scale=1.2, rc={'axes.spines.right':False, 'axes.spines.top':False})

titanic = pd.read_csv('train.csv')
titanic.columns = titanic.columns.str.lower()  # lowercase names mean less Shift-key typing

head() is the most useful place to start to get a feel for what the data looks like.

titanic.head()

info() is good for viewing data shape, missing values, and column data types.

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
passengerid    891 non-null int64
survived       891 non-null int64
pclass         891 non-null int64
name           891 non-null object
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
ticket         891 non-null object
fare           891 non-null float64
cabin          204 non-null object
embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
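
The non-null counts in info() can also be turned into an explicit missing-value table, which is easier to sort. This is a minimal sketch on a made-up toy frame (the values are not from the Titanic file); only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; values are invented for illustration
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "cabin": [None, "C85", None, "C123", None],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Count and fraction of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "frac_missing": df.isna().mean(),
}).sort_values("frac_missing", ascending=False)

print(missing)
```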

nunique() displays the number of unique values for each column, which is useful to understand the structure of the data and for faceting and hueing later on.

titanic.nunique()

passengerid    891
survived         2
pclass           3
name           891
sex              2
age             88
sibsp            7
parch            7
ticket         681
fare           248
cabin          147
embarked         3
dtype: int64
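
One way to use these counts for picking hue and facet candidates is to filter on a maximum number of levels. The sketch below rebuilds the Series from the output above; the cutoff `max_levels` is a hypothetical threshold I picked, not a rule.

```python
import pandas as pd

# Unique-value counts copied from the nunique() output above
nunique = pd.Series({
    "passengerid": 891, "survived": 2, "pclass": 3, "name": 891,
    "sex": 2, "age": 88, "sibsp": 7, "parch": 7,
    "ticket": 681, "fare": 248, "cabin": 147, "embarked": 3,
})

max_levels = 10  # hypothetical cutoff for "few enough levels to color or facet by"
hue_candidates = nunique[nunique <= max_levels].index.tolist()

print(hue_candidates)
```

On a live frame the same filter would be `titanic.nunique()[titanic.nunique() <= max_levels]`.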

describe() gives an idea of distribution shapes (although I usually think it's easier to see this in plots, a quick peek can't hurt).

titanic.describe()
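
One thing worth reading off describe() is the gap between the mean and the median (the 50% row): for fare in this dataset the mean is about 32.2 while the median is about 14.5, which hints at a long right tail. A minimal sketch of that comparison, using only the five fare values visible in the head() output rather than the full column:

```python
import pandas as pd

# The five fares shown in the head() output (not the full dataset)
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="fare")

summary = fare.describe()

# A mean well above the median suggests a right-skewed distribution
mean_over_median = summary["mean"] / summary["50%"]
right_skew_hint = summary["mean"] > summary["50%"]

print(summary)
print(right_skew_hint)
```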

The fastest way to view all univariate distributions is via the hist() method of pandas data frames.

titanic.hist(figsize=(10,6), grid=False, bins=20)
plt.tight_layout()

Although some of these numerical variables are actually categorical, I would leave them in as numbers initially, just to eyeball if there are any obvious relationships to follow up on later.

Exploring relationships between continuous variables

sns.pairplot() shows the pair-wise variable relationships in addition to the single variable distributions. I tend to favor doing this directly instead of hist(), but both have their use cases. The pairplot grid can take quite some time to create for big data sets, so it can be a good idea to use sample() to only plot a subset of the data (but be sure to run it a few times to sample different subsets).

sns.pairplot(titanic)

<seaborn.axisgrid.PairGrid at 0x7faf4472e208>
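
The sample() suggestion above could look roughly like this. The frame `big` is synthetic (random numbers standing in for a large dataset), and the subset size and seeds are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a dataset too large to pairplot comfortably
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "age": rng.normal(30, 14, 10_000),
    "fare": rng.exponential(32, 10_000),
})

# Draw a few different random subsets; a different seed gives a different subset
subsets = [big.sample(n=1_000, random_state=seed) for seed in range(3)]

for subset in subsets:
    # Each subset would be plotted in turn, e.g. sns.pairplot(subset)
    print(subset.shape)
```

If a pattern shows up in one subset but not the others, it is probably sampling noise rather than something worth following up on.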

The hue parameter makes it straightforward to split by variables (variable choice can be guided by nunique() and/or the plot above).

sns.pairplot(titanic, hue='survived')

<seaborn.axisgrid.PairGrid at 0x7faf4333f898>

It is good to keep in mind that the diagonal KDE plots can be a bit misleading for discrete data (especially the categorical columns here). I might do a few more of these hue splits depending on how the data looks, potentially with fewer columns to create smaller plots.

Around this point I would encode the variables with what I believe is their correct data type to facilitate exploring them further.

cols_to_cat = ['survived', 'pclass', 'sibsp', 'parch', 'embarked', 'sex']
titanic[cols_to_cat] = titanic[cols_to_cat].astype('category')
titanic = titanic.set_index('passengerid') # or drop this column
# `pd.cut()` can be used to change a numeric dtype into categorical bins
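
As a small illustration of the pd.cut() comment above, here is one way age could be binned into categories. The bin edges and labels are hypothetical choices, and the ages are a handful of made-up values within the range seen in the data.

```python
import pandas as pd

ages = pd.Series([0.42, 5, 17, 28, 38, 64, 80])

# Hypothetical bin edges and labels; intervals are right-closed by default,
# i.e. (0, 12], (12, 18], (18, 60], (60, 120]
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)

print(age_group.tolist())
print(age_group.dtype)
```

The result has the category dtype, so it can be used for hueing and faceting like the other categorical columns.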

Now the pairplot is focused on the relationship between continuous variables.

numeric_cols = titanic.select_dtypes('number').columns
sns.pairplot(titanic, hue='survived', vars=numeric_cols, plot_kws={'s':6, 'edgecolor':'none'})

<seaborn.axisgrid.PairGrid at 0x7faf401a4780>

Since the scatters are still saturated, I would probably want to investigate these two variables (age and fare) with separate 2D histograms or similar instead of different colors in the scatter plot.
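
A 2D histogram replaces overlapping points with counts per bin. The sketch below computes the counts with NumPy on synthetic data (random numbers, with some ages blanked out to mimic the missing values); the bin numbers are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the age and fare columns
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "age": rng.uniform(0, 80, n),
    "fare": rng.exponential(30, n),
})
toy.loc[::10, "age"] = np.nan  # mimic missing ages

# np.histogram2d cannot handle NaN, so drop incomplete rows first
clean = toy[["age", "fare"]].dropna()

counts, age_edges, fare_edges = np.histogram2d(
    clean["age"].to_numpy(), clean["fare"].to_numpy(), bins=(8, 10)
)

print(counts.shape)
print(int(counts.sum()))
```

To draw it, `plt.pcolormesh(age_edges, fare_edges, counts.T)` works on these arrays, and `plt.hist2d()` does the binning and drawing in one call. Running this once per survival group gives the separate panels.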

What I have done up to this point is what I tend to do most of the time. What follows is more situational for me, so there might be better ways of going about it (such as rectangular area plots), but this approach is quick and needs no additional imports.

Exploring relationships between categorical and continuous variables

To quickly gauge relationships between categorical and continuous variables, I would loop over the columns and subset the data. You could do some of this in a FacetGrid, but in my opinion the melting steps are no easier than the loops. FacetGrids are really meant to display multiple subsets of the data by distributing variable values across the columns and rows of the plot grid while keeping the x and y axes the same throughout.

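cat_cols = titanic.select_dtypes('category').columns.to_list()
num_cols = len(cat_cols)  # number of subplot columns (one per categorical variable), not numeric columns
# One row of violin plots per numeric variable, one panel per categorical variable
for numeric_col in numeric_cols.to_list():
    fig, axes = plt.subplots(1, num_cols, figsize=(num_cols * 3, 3), constrained_layout=True)
    for col, ax in zip(cat_cols, axes.flatten()):
        # I would prefer a swarmplot if there were fewer data points
        # Note: seaborn 0.13+ renames scale= to density_norm=
        sns.violinplot(x=numeric_col, y=col, data=titanic, ax=ax, cut=0, scale='width')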
        # I might add some sort of dotplot here, e.g. sns.stripplot or the fliers only from a boxplot
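
For comparison, the melting step mentioned above might look roughly like this. It is a sketch on a four-row toy frame with invented values; the names `long_form`, `variable` and `level` are my own.

```python
import pandas as pd

# Toy frame: one numeric column and two categorical columns (invented values)
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "sex": ["male", "female", "female", "female"],
    "pclass": ["3", "1", "3", "1"],
})

# Stack the categorical columns so each row holds (age, which variable, its level)
long_form = toy.melt(
    id_vars="age",
    value_vars=["sex", "pclass"],
    var_name="variable",
    value_name="level",
)

print(long_form)
# The long frame could then be faceted by variable, e.g.
# sns.catplot(data=long_form, x="age", y="level", col="variable", kind="violin", sharey=False)
```

The melt has to be repeated for every numeric column, which is why I find the plain loop just as easy.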

sns.countplot() can be used to show counts of categorical variables as barplots without the need for manually plotting value_counts().

fig, axes = plt.subplots(1, num_cols, figsize=(num_cols * 3, 3), constrained_layout=True)
for col, ax in zip(cat_cols, axes.flatten()):
    sns.countplot(x=col, data=titanic, ax=ax)  # optionally pass color='steelblue' for a single color
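
For reference, the manual route that countplot() saves you from starts with value_counts(). A minimal sketch on a made-up column (the values are invented, not the real embarkation counts):

```python
import pandas as pd

# Invented embarkation codes for illustration
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="embarked")

# Counts per level, sorted from most to least frequent
counts = embarked.value_counts()

print(counts)
# The manual barplot would then be counts.plot.bar()
```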

To understand relationships between the categorical variables, I would loop over the categorical columns and subset the data to count occurrences in the subsets. Again, you could do some of this in a FacetGrid, but it is a bit buggy for categorical counting since that is not its intended use.

for hue_col in cat_cols:
    cat_cols_to_plot = [col for col in cat_cols if col != hue_col]
    num_cols = len(cat_cols_to_plot)
    fig, axes = plt.subplots(1, num_cols, figsize=(num_cols * 3, 3), constrained_layout=True)
    for col, ax in zip(cat_cols_to_plot, axes.flatten()):
        sns.countplot(x=col, data=titanic, ax=ax, hue=hue_col)
        # Optional: keep the legend and y-label only on the first subplot
        if ax is not axes.flatten()[0]:
            ax.legend_.remove()
            ax.set_ylabel('')
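
A textual counterpart to these hue-split count plots is pd.crosstab(), which tabulates the same counts. The sketch below uses a six-row toy frame with invented values.

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "female", "male", "male"],
    "survived": [0, 1, 1, 1, 0, 1],
})

# Raw counts for each combination of levels
table = pd.crosstab(toy["sex"], toy["survived"])

# Row-wise proportions, i.e. the share of each outcome within each sex
rates = pd.crosstab(toy["sex"], toy["survived"], normalize="index")

print(table)
print(rates)
```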

The above could also be made into one big subplot grid, but the code would be a bit more verbose, and EDA is ideally done without thinking too much about graphics layouts.

After this initial broad EDA, I would start more targeted EDA by using sns.relplot() to explore relationships between two continuous variables and sns.catplot() to explore relationships between a continuous and a categorical variable. Both of these plotting functions allow the use of small multiples (facets) to break the data into categorical subsets. The seaborn tutorials are a good place to learn more about this. I also created this tutorial as part of UofTCoders.

Output of titanic.head():

	passengerid	survived	pclass	name	sex	age	sibsp	ticket	fare	cabin	embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Output of titanic.describe():

	passengerid	survived	pclass	age	sibsp	parch	fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200