From spreadsheets to pandas

[Graphical abstract image]

Preamble

This is a brief tutorial for anyone who is interested in how Python can facilitate their data analysis. The tutorial is aimed at people who currently use a spreadsheet program as their primary data analysis tool and who have no previous programming experience. If you want to code along, a simple way to install Python is to follow these instructions, but I encourage you to just read through this tutorial on a conceptual level at first.

Motivation

Spreadsheet software is great for viewing and entering small datasets and creating simple visualizations fast. However, it can be tricky to create publication-ready figures, automate reproducible analysis workflows, perform advanced calculations, and clean datasets robustly. Even when using a spreadsheet program to record data, it is often beneficial to pick up some basic programming skills to facilitate the analysis of that data.

Conceptual understanding

Spreadsheet programs, such as MS Excel and Libre/OpenOffice, have their functionality sectioned into menus. In programming languages, all the functionality is accessed by typing the names of functions directly instead of finding them in the menu hierarchy. Initially this might seem intimidating and unintuitive for people who are used to the menu-driven approach.

However, think of it as learning a new natural language. Initially, you will slowly string together sentences by looking up individual words in the dictionary. As you improve, you will only reference the dictionary occasionally since you already know most of the words. Practicing the language whenever you can, and receiving immediate feedback, is often the fastest way to learn. Sitting at home trying to learn every word in the dictionary before engaging in conversation is destined to kill the joy of learning any language, natural or formal.

In my experience, learning programming is similar to learning a foreign language, and you will often learn the most from just trying to do something and receiving feedback from the computer! When there is something you can't wrap your head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language.

Programming basics

Just like in spreadsheet software, the basic installation of Python includes fundamental math operations, e.g. adding numbers together:

In [1]:
4 + 4
Out[1]:
8

Variable assignment

It is possible to assign values to variables:

In [2]:
a = 5
a * 2
Out[2]:
10
In [3]:
my_variable_name = 4
a - my_variable_name
Out[3]:
1

Variables can also hold more data types than just numbers, for example a sequence of characters surrounded by single or double quotation marks (called a string). In Python, strings can be concatenated simply by adding them together:

In [4]:
b = 'Hello'
c = 'universe'
b + c
Out[4]:
'Hellouniverse'

A space can be added to separate the words:

In [5]:
b + ' ' + c
Out[5]:
'Hello universe'
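
As a small aside that goes beyond the original examples, Python also offers f-strings, which embed variable values directly inside the text and are often more readable than adding strings together:

```python
# An f-string (available since Python 3.6) embeds variables directly in text
greeting = 'Hello'
target = 'universe'
message = f'{greeting} {target}!'
print(message)  # Hello universe!
```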

Assigning multiple values to variables

Lists

Variables can also store more than one value, for example in a list of values:

In [6]:
list_of_things = [1, 55, 'Hi']
list_of_things
Out[6]:
[1, 55, 'Hi']

I can index the list to access a specific item. Note that numbering in Python starts at 0, so the third item, 'Hi', has index 2.

In [7]:
list_of_things[2]
Out[7]:
'Hi'
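
Two related indexing tricks, not shown in the original cells but standard Python: negative indices count from the end of the list, and a slice extracts several items at once.

```python
list_of_things = [1, 55, 'Hi']

# Negative indices count from the end of the list: -1 is the last item
last_item = list_of_things[-1]

# A slice start:stop returns a new sub-list (the stop index is excluded)
first_two = list_of_things[0:2]
```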

Dictionaries

In a dictionary, values are paired with names, called keys. Values are looked up by their key name rather than by a numeric position.

In [8]:
fruit_colors = {'tangerine':'orange', 'banana':'yellow', 'apple':['green', 'red']}
fruit_colors['banana']
Out[8]:
'yellow'
In [9]:
fruit_colors['apple']
Out[9]:
['green', 'red']
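
Dictionaries can also be modified after they are created; this sketch (mine, not from the original) shows that assigning to a key either adds a new entry or overwrites an existing one:

```python
fruit_colors = {'tangerine': 'orange', 'banana': 'yellow'}

# Assigning to a new key adds an entry
fruit_colors['kiwi'] = 'brown'

# Assigning to an existing key overwrites its value
fruit_colors['banana'] = 'green'
```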

Comparisons

Python can compare values and assess whether the expression is true or false.

In [10]:
1 == 1
Out[10]:
True
In [11]:
1 == 0
Out[11]:
False
In [12]:
1 > 0
Out[12]:
True
In [13]:
'hey' == 'Hey'
Out[13]:
False
In [14]:
'hey' == 'hey'
Out[14]:
True
In [15]:
a >= 2 * 2
Out[15]:
True

When we start working with spreadsheet-like data, we will see that these comparisons are really useful to extract subsets of data, for example observations from a certain time period.

Using functions

To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing the file, you would type something similar to file.open('<filename>') in a programming language. Don't worry if you forget the exact expression; it is often enough to just type the first few letters and then hit Tab to show the available options, more on that later.

Packages

Since there are so many specialized tools and functions available in Python, it is unnecessary to include all of them with the basics that are loaded by default when you start the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing import <package_name> in Python. You can think of this as telling the program which menu items you want to use (similar to how Excel hides the Developer menu by default since most people rarely use it, and you need to activate it in the settings if you want to access its functionality). Some packages need to be downloaded before they can be used, just like downloading an add-on to a browser or mobile phone.

Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical python module, numpy. I can then access any function by writing numpy.<function_name>.

In [16]:
import numpy

numpy.mean([1, 2, 3, 4, 5])
Out[16]:
3.0

How to get help

Once you start out using Python, you don't know what functions are available within each package. Luckily, in the Jupyter Notebook, you can type numpy.<Tab> (that is numpy + period + Tab key) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy menu' and then going through the list of functions. As I mentioned earlier, there are plenty of available functions, and it can be helpful to filter the menu by typing the initial letters of the function name.

To write this tutorial, I am using a software called Jupyter Notebook, and I took a screenshot of how the tab-completion appears in the notebook. Below I use the Image function from the IPython.display package to display this screenshot.

In [17]:
from IPython.display import Image
Image('./tab-completion.png')
Out[17]:

To get more info on the function you want to use, you can type out the full name and then press Shift + Tab once to bring up a help dialogue, and press it again to expand that dialogue. Below I am displaying a screenshot of how this looks for the numpy.mean function. We can see that to use this function, we need to supply it with the argument a, which should be 'array-like'. An array is essentially just a sequence of numbers. In our previous example, we had the numbers enclosed in brackets [], which in Python means that they are in a list, which is a sequence of numbers just like an array.

We could also have defined a as a tuple: (1, 2, 3, 4, 5), or the specific numpy array format: numpy.array([1, 2, 3, 4, 5]). For now, it is not important to know all the differences between these formats, and I am mainly including them so you get a better understanding of what it means for something to be 'array-like'.
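
To make the 'array-like' idea concrete, here is a short sketch (my addition) showing that a list, a tuple, and a numpy array of the same numbers all give the same mean:

```python
import numpy

# The same numbers stored as a list, a tuple, and a numpy array --
# all three count as 'array-like' for numpy.mean
numbers_list = [1, 2, 3, 4, 5]
numbers_tuple = (1, 2, 3, 4, 5)
numbers_array = numpy.array([1, 2, 3, 4, 5])

mean_from_list = numpy.mean(numbers_list)
mean_from_tuple = numpy.mean(numbers_tuple)
mean_from_array = numpy.mean(numbers_array)
```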

In [18]:
Image('./shift-tab-help.png')
Out[18]:

If you need a more extensive help dialogue, you can press Shift + Tab four times or just type ?numpy.mean.

When you start getting familiar with typing function names, you will notice that this is often faster than looking for functions in menus. It is similar to getting fluent in a language. I know what English words I want to type right now, and it is much easier for me to type them out, than to select each one from a menu. However, sometimes I forget and it is useful to get some hints as described above.

It is common to give packages nicknames, so that it is faster to type. This is not necessary, but can save some work in long files and make code less verbose so that it is easier to read:

In [19]:
import numpy as np

np.std([1, 2, 3, 4, 5])
Out[19]:
1.4142135623730951

Data analysis with pandas

The Python package that is most commonly used to work with spreadsheet-like data is called pandas, the name is derived from "panel data", an econometrics term for multidimensional structured data sets. Data are easily loaded into pandas from .csv or other spreadsheet formats. The format pandas uses to store this data is called a dataframe.

I do not have any good data set lying around, so I will load a public dataset from the web (you can view the data by pasting the url into your browser). I store my dataframe in a variable called iris, that I can use to access the data from now on.

In [20]:
import pandas as pd
pd.options.display.max_rows = 10 # Shortens the output from cells

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

To have a quick peek at the data, I can type iris.head(<number_of_rows>). Notice that I do not have to use the pd.-syntax for this. iris is now a pandas dataframe, and all pandas dataframes have a number of built-in functions (called methods) that can be appended directly to the dataframe instead of being called from pandas separately.

In [21]:
iris.head() # The default is to show 5 rows
Out[21]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

As you can see, the dataframe row numbering starts at 0, which might be confusing at first, but it is standard in many programming languages and you get used to it quickly. I can easily find out how many rows and columns my data set has:

In [22]:
iris.shape
Out[22]:
(150, 5)

And the name of those five columns:

In [23]:
iris.columns
Out[23]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

To select a column we can index the dataframe with the column name.

In [24]:
iris['sepal_length']
Out[24]:
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

The output is here rendered slightly differently from before, because when we are looking only at one column, it is no longer a data frame, but a series. The differences are not important for this lecture, so this is all you need to know about that for now.
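The difference is easy to see on a small made-up dataframe (a toy stand-in for iris, so this runs without the download): single brackets give a Series, while a list of column names in double brackets keeps a dataframe.

```python
import pandas as pd

# A small made-up dataframe (toy stand-in for iris)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

single_brackets = df['a']    # one column -> a Series
double_brackets = df[['a']]  # a list of columns -> still a DataFrame
```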

We could now create a new column if we wanted:

In [25]:
iris['sepal_length_x2'] = iris['sepal_length'] * 2
iris['sepal_length_x2']
Out[25]:
0      10.2
1       9.8
2       9.4
3       9.2
4      10.0
       ... 
145    13.4
146    12.6
147    13.0
148    12.4
149    11.8
Name: sepal_length_x2, Length: 150, dtype: float64

And delete that column again:

In [26]:
iris = iris.drop('sepal_length_x2', axis=1)

There are some built-in methods that make it convenient to perform common operations on dataframe columns.

In [27]:
iris['sepal_length'].mean()
Out[27]:
5.843333333333335
In [28]:
iris['sepal_length'].median()
Out[28]:
5.8
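
Beyond mean and median, columns have many similar methods; this sketch (mine, using a toy Series rather than the iris column) shows a few of the common ones:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

total = values.sum()     # 15
smallest = values.min()  # 1
largest = values.max()   # 5
spread = values.std()    # sample standard deviation
```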

It is also possible to use these methods on all columns at the same time without having to type the same thing over and over again. (In recent pandas versions you may need to pass numeric_only=True, e.g. iris.mean(numeric_only=True), to skip the non-numeric species column.)

In [29]:
iris.mean()
Out[29]:
sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

Similarly, you can get a statistical summary of the data frame:

In [30]:
iris.describe()
Out[30]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Subsetting data

A common task is to subset the data into only those observations that match a criterion. For example, we might be interested in studying only one specific species. First, let's find out how many different species there are in our dataset:

In [31]:
iris['species'].unique()
Out[31]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)

Let's arbitrarily choose setosa as the one to study! To select only observations from this species in the original dataframe, we index the dataframe with a comparison:

In [32]:
iris['species'] == 'setosa'
Out[32]:
0       True
1       True
2       True
3       True
4       True
       ...  
145    False
146    False
147    False
148    False
149    False
Name: species, Length: 150, dtype: bool
In [33]:
iris[iris['species'] == 'setosa']
Out[33]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
45 4.8 3.0 1.4 0.3 setosa
46 5.1 3.8 1.6 0.2 setosa
47 4.6 3.2 1.4 0.2 setosa
48 5.3 3.7 1.5 0.2 setosa
49 5.0 3.3 1.4 0.2 setosa

50 rows × 5 columns
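
Multiple conditions can be combined in the same way; this sketch (my addition, on a toy dataframe so it runs without the download) uses & for "and" and notes that each condition needs its own parentheses:

```python
import pandas as pd

# A toy dataframe instead of downloading the iris data again
df = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica'],
                   'sepal_length': [5.1, 4.9, 6.3]})

# Each condition needs its own parentheses; & means "and", | means "or"
mask = (df['species'] == 'setosa') & (df['sepal_length'] > 5.0)
subset = df[mask]
```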

Now we can easily perform computations on this subset of the data:

In [34]:
iris[iris['species'] == 'setosa'].mean()
Out[34]:
sepal_length    5.006
sepal_width     3.428
petal_length    1.462
petal_width     0.246
dtype: float64
In [54]:
# The below calculates the mean for each row, but that is not very valuable on this data set
# iris[iris['species'] == 'setosa'].mean(axis=1)

We could also compare all groups within the data against each other, by using the split-apply-combine workflow. This splits data into groups, applies an operation on each group, and then combines the results into a table for display.

In pandas, we split into groups with the groupby method and then apply an operation to the grouped data frame, e.g. .mean().

In [35]:
iris.groupby('species').mean()
Out[35]:
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
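
Grouped data can also take several operations at once via .agg (a standard pandas method, shown here on a toy dataframe of my own so the example is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica', 'virginica'],
                   'sepal_length': [5.0, 5.2, 6.0, 6.4]})

# agg applies several operations to each group in one call
summary = df.groupby('species')['sepal_length'].agg(['mean', 'min', 'max'])
```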

We can also easily count the number of observations in each group:

In [36]:
iris.groupby('species').size()
Out[36]:
species
setosa        50
versicolor    50
virginica     50
dtype: int64

Data visualization

We can see that there are clear differences between species, but they might be even clearer if we display them graphically in a chart.

Plotting with pandas

Pandas interfaces with one of Python's most powerful data visualization libraries, matplotlib, to enable simple visualizations at minimal effort.

In [37]:
# Prevent plots from popping up in a new window
%matplotlib inline

species_comparison = iris.groupby('species').mean() # Assign to a variable
species_comparison.plot(kind='bar')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6564da3d68>

Depending on what you are interested in showing, it could be useful to have the species as the different colors and the columns along the x-axis. We can easily achieve this by transposing (.T) our data frame.

In [38]:
species_comparison.T.plot(kind='bar')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f65619dbf98>

Plotting with seaborn

Another plotting library is seaborn, which also builds upon matplotlib and extends it by adding new styles, additional plot types, and some commonly performed statistical measures. (Note that recent seaborn versions require the column names to be passed as keyword arguments, e.g. sns.swarmplot(x='species', y='sepal_length', data=iris).)

In [39]:
import seaborn as sns

sns.swarmplot('species', 'sepal_length', data = iris)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f65591a9d30>

That looks a bit small, so let's change the style we are using for plotting.

In [40]:
sns.set(style='ticks', context='talk', rc={'figure.figsize':(8, 5)}) # This applies to all subsequent plots

sns.swarmplot('species', 'sepal_length', data=iris)
sns.despine() # Remove the right and top axes lines

We can use the same syntax to create many of the common plots in seaborn.

In [41]:
sns.barplot('species', 'sepal_length', data=iris)
sns.despine() # I might set this in sns.set() to not have to type it all the time...

Bar charts are a common, but not very useful, way of presenting data aggregations (e.g. the mean). A better way is to use points as we did above, or a plot that captures the distribution of the data, such as a box plot or a violin plot:

In [42]:
sns.violinplot('species', 'sepal_length', data = iris)
sns.despine()

We can also combine two plots by simply adding the two lines after each other. There is also a more advanced figure interface available in matplotlib to explicitly indicate which figure and axes you want the plot to appear in, but that is outside the scope of this tutorial (more info here and here).

In [43]:
sns.violinplot('species', 'sepal_length', data=iris, inner=None)
sns.swarmplot('species', 'sepal_length', data=iris, color='black', size=4)
sns.despine()

Instead of plotting one categorical variable vs a numerical variable, we can also plot two numerical values against each other to explore potential correlations between these two variables:

In [44]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, size=6)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x7f65673f7e80>

There is a regression line plotted by default to indicate the trend in the data. Let's turn that off for now and look at only the data points.

In [45]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, fit_reg=False, size=6)
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x7f6558aef710>

There appears to be some structure in this data. At least two clusters of points seem to be present. Let's color according to species and see if that explains what we see.

In [46]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, hue='species', fit_reg=False, size=6)
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x7f65586ba0f0>

Now we can add back the regression line, but this time one for each group.

In [47]:
sns.lmplot('sepal_width', 'sepal_length', data=iris, hue='species', fit_reg=True, size=6)
Out[47]:
<seaborn.axisgrid.FacetGrid at 0x7f65586ba3c8>

Instead of creating a plot for each variable against each other, we can easily create a grid of subplots for all variables with a single command:

In [48]:
sns.pairplot(iris, hue="species", size=3.5)
Out[48]:
<seaborn.axisgrid.PairGrid at 0x7f65585cceb8>

More complex visualizations

Many visualizations are easier to make if we first reshape our data frame into the tidy, or long, data format, which is the format seaborn prefers. This is sometimes referred to as changing the data from wide to long format, and it moves some of the information from columns to rows:

In [49]:
Image('./molten-data.png') # Image from http://vita.had.co.nz/papers/tidy-data.pdf
Out[49]:

We can use pandas' built-in melt function to transform the data into the long format. The new columns will be given the names variable and value (see the help of melt if you would like to change these names).

In [50]:
iris_melt = pd.melt(iris, id_vars='species')
iris_melt
Out[50]:
species variable value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0
... ... ... ...
595 virginica petal_width 2.3
596 virginica petal_width 1.9
597 virginica petal_width 2.0
598 virginica petal_width 2.3
599 virginica petal_width 1.8

600 rows × 3 columns
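
For reference, here is a minimal melt sketch (my own toy dataframe, not the iris download) that also shows the var_name and value_name parameters mentioned above for renaming the default columns:

```python
import pandas as pd

wide = pd.DataFrame({'species': ['setosa', 'virginica'],
                     'sepal_length': [5.0, 6.5],
                     'petal_length': [1.4, 5.5]})

# var_name/value_name rename the default 'variable' and 'value' columns
long_format = pd.melt(wide, id_vars='species',
                      var_name='measurement', value_name='cm')
# Each (species, measurement) pair becomes its own row: 4 rows in total
```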

We do not need to call groupby or mean on the long iris-dataframe when plotting with seaborn. Instead we control these options from seaborn with the plot type we chose (barplot = mean automatically) and the hue-parameter, which is analogous to groupby.

In [51]:
sns.set(context='poster', style='white', rc={'figure.figsize':(10, 6)})

sns.swarmplot(x='variable', y='value', hue='species', data=iris_melt, dodge=True, palette='Set2', size=4)
sns.despine() # removes the right and top black border
In [52]:
sns.set(context='poster', style='darkgrid', rc={'figure.figsize':(12, 6)})

sns.boxplot(y='variable', x='value', hue='species', data=iris_melt, color='c')
sns.stripplot(y='variable', x='value', hue='species', data=iris_melt, size=2.5, palette=['k']*3, jitter=True, dodge=True)
sns.despine(trim=True)

It is also possible to get too fancy and accidentally hide important messages in the data. However, the fact that you now have access to several ways to plot your data forces you to consider what is actually important and how you can best communicate that message, rather than always making the same plot without considering its strengths and weaknesses.

Resources to learn more

The documentation for these packages are great resources to learn more.