Home
DSP Statistics
Coding Interaction Information
Click to run the codes in Google colab.

Intro to Pandas Library¶

The pandas library is a library built largely emulating the functionality of R dataframes and the ggplot library. It works very well with numpy and it's very useful to understand how to read-in data into a dataframe, drop or add data to an existing dataframe, cross-reference or subset data, and plot and manipulate your data. This lesson will give you a basic overview.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Numpy arrays to dataframe¶

Like the numpy library, pandas has it's own native data structures:

  • Series is for one-dimensional data. A single series is equivalent to one column.
In [ ]:
apples = np.array([5.5,6.5,7.9,5.7,6.8,8.1])
apples
Out[ ]:
array([5.5, 6.5, 7.9, 5.7, 6.8, 8.1])
In [ ]:
app = pd.Series(apples)
app
Out[ ]:
0    5.5
1    6.5
2    7.9
3    5.7
4    6.8
5    8.1
dtype: float64

A Series can be treated much like a numpy array. It is both iterable but capable of vectorized operations. For instance, we can do arithmetic with scalar values applied accross the entire Series; we can compute sums, means, or custom-defined functions;

In [ ]:
# add one to all values
app + 1
Out[ ]:
0    6.5
1    7.5
2    8.9
3    6.7
4    7.8
5    9.1
dtype: float64
In [ ]:
# compute the Series mean
app.mean()
Out[ ]:
6.75

When we have the equivalent of multi-dimensional numpy arrays, instead of a single column we will need multiple columns. This is the pandas DataFrame object.

We can create a dataframe directly from a 2d numpy array:

In [ ]:
fruits = np.array([[5.5,6.5,7.9,5.7,6.8,8.1],[5.1,3.4,6.5,8.8,6.3,4.9],[4.3,6.7,6.5,4.6,7.2,7.9]])

fruits.shape
Out[ ]:
(3, 6)
In [ ]:
fruits.transpose()
Out[ ]:
array([[5.5, 5.1, 4.3],
       [6.5, 3.4, 6.7],
       [7.9, 6.5, 6.5],
       [5.7, 8.8, 4.6],
       [6.8, 6.3, 7.2],
       [8.1, 4.9, 7.9]])
In [ ]:
# create dataframe from ndarray
f = pd.DataFrame(fruits).transpose()
f
Out[ ]:
0 1 2
0 5.5 5.1 4.3
1 6.5 3.4 6.7
2 7.9 6.5 6.5
3 5.7 8.8 4.6
4 6.8 6.3 7.2
5 8.1 4.9 7.9
In [ ]:
# name the columns
f.columns = (['apples','bananas','oranges'])
In [ ]:
f
Out[ ]:
apples bananas oranges
0 5.5 5.1 4.3
1 6.5 3.4 6.7
2 7.9 6.5 6.5
3 5.7 8.8 4.6
4 6.8 6.3 7.2
5 8.1 4.9 7.9
In [ ]:
type(f)
Out[ ]:
pandas.core.frame.DataFrame

Pandas describe function is a useful function for gathering summary statistics for your columns of data

In [ ]:
f.describe()
Out[ ]:
apples bananas oranges
count 6.000000 6.000000 6.000000
mean 6.750000 5.833333 6.200000
std 1.083974 1.832667 1.442221
min 5.500000 3.400000 4.300000
25% 5.900000 4.950000 5.075000
50% 6.650000 5.700000 6.600000
75% 7.625000 6.450000 7.075000
max 8.100000 8.800000 7.900000

Built-in Plotting Capabilities¶

Pandas uses matplotlib "under the hood" to allow some simple built-in graphing capabilities.

For instance, we can plot the entire dataframe at once. We can plot over indicies (useful if you have time-series data, for example)

In [ ]:
ax1 = plt.subplot(211)
ax2 = plt.subplot(212)
plt.tight_layout()
f.plot(ax=ax1, ylim=(0,9))
f.plot.bar(ax=ax2, rot=0) #By default pandas bar plot will put your x-labels on their sides. Use rot=0 to fix it.
Out[ ]:
<AxesSubplot: >

... we can compare the medians (or means) and standard deviations of columns... (read more about pandas.DataFrame.boxplot here)

In [ ]:
# a boxplot 
f.boxplot()
Out[ ]:
<AxesSubplot: >

As we will see, it is common when evaluating musical datasets to have each row be a separate observation (e.g., a song) and each column represent a different musical feature or partial feature.

Creating new columns¶

It is typically useful when doing modeling to have dataframes that are longer than they are wide. This is called a "long form" dataframe. This is because it is typically useful in both computational and perception work to have features or variables stored as columns. You can have as many columns as you want in a dataframe. It is very easy to create subsets and ignore or "throw out" data later on.

Creating a new column to an existing dataframe is simple so long as the new column has the same number of rows as the preexisting dataframe. (Note that Pandas has many capabilities for joining and merging dataframes that we won't get into here.)

In [ ]:
apples.size
Out[ ]:
6
In [ ]:
# create Series of same length
pears = pd.Series([6.5,6.1,8.2,7.5,7.3,6.4])
In [ ]:
# add new column to dataframe
f['pears'] = pears
In [ ]:
f
Out[ ]:
apples bananas oranges pears
0 5.5 5.1 4.3 6.5
1 6.5 3.4 6.7 6.1
2 7.9 6.5 6.5 8.2
3 5.7 8.8 4.6 7.5
4 6.8 6.3 7.2 7.3
5 8.1 4.9 7.9 6.4

Sometimes you would like to have a single column label for all the data in a dataframe (or subset) because you will later append it to a different dataframe. This is as simple as creating a label for the column and giving a default value:

In [ ]:
f['farm'] = 'McDonald'
f
Out[ ]:
apples bananas oranges pears farm
0 5.5 5.1 4.3 6.5 McDonald
1 6.5 3.4 6.7 6.1 McDonald
2 7.9 6.5 6.5 8.2 McDonald
3 5.7 8.8 4.6 7.5 McDonald
4 6.8 6.3 7.2 7.3 McDonald
5 8.1 4.9 7.9 6.4 McDonald

Notice what happens if we tried now to plot our dataframe, now that each row has both floating point and string data:

In [ ]:
f.plot()
Out[ ]:
<AxesSubplot: >

Dataframes from dictionaries¶

Pandas dataframes can also be read in from python dictionaries. E.g.,

In [ ]:
fr = {'apples':[3.3,4.5,5.7,4.7,3.7,4.4], 'bananas': [4.5,3.9,6.4,5.1,4.8,4.3], 
      'oranges':[4.0,5.9,6.0,6.5,5.2,5.8], 'pears':[6.8,6.6,6.2,6.1,5.9,6.5]}
In [ ]:
fruits2 = pd.DataFrame(fr)
fruits2
Out[ ]:
apples bananas oranges pears
0 3.3 4.5 4.0 6.8
1 4.5 3.9 5.9 6.6
2 5.7 6.4 6.0 6.2
3 4.7 5.1 6.5 6.1
4 3.7 4.8 5.2 5.9
5 4.4 4.3 5.8 6.5

let's say this data came from a different farm:

In [ ]:
fruits2['farm'] = 'FarmerJohn'
fruits2
Out[ ]:
apples bananas oranges pears farm
0 3.3 4.5 4.0 6.8 FarmerJohn
1 4.5 3.9 5.9 6.6 FarmerJohn
2 5.7 6.4 6.0 6.2 FarmerJohn
3 4.7 5.1 6.5 6.1 FarmerJohn
4 3.7 4.8 5.2 5.9 FarmerJohn
5 4.4 4.3 5.8 6.5 FarmerJohn

Perhaps we would like to know which farm produces better crops? In this case, lets imagine that each row represents an average size fruit from each of their plots.

Let's combine these into a single dataframe:

In [ ]:
f = f.append(fruits2)
f
/var/folders/jr/tf9k36ws1956sfj0w8ybnj2m0000gn/T/ipykernel_43618/4084900850.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  f = f.append(fruits2)
Out[ ]:
apples bananas oranges pears farm
0 5.5 5.1 4.3 6.5 McDonald
1 6.5 3.4 6.7 6.1 McDonald
2 7.9 6.5 6.5 8.2 McDonald
3 5.7 8.8 4.6 7.5 McDonald
4 6.8 6.3 7.2 7.3 McDonald
5 8.1 4.9 7.9 6.4 McDonald
0 3.3 4.5 4.0 6.8 FarmerJohn
1 4.5 3.9 5.9 6.6 FarmerJohn
2 5.7 6.4 6.0 6.2 FarmerJohn
3 4.7 5.1 6.5 6.1 FarmerJohn
4 3.7 4.8 5.2 5.9 FarmerJohn
5 4.4 4.3 5.8 6.5 FarmerJohn

We would now like to compare the averages from Farmer John's apples to Farmer McDonald's apples. For this we need to group or subset our data.

In this case, we want to look at all the apples but grouped by farm. Whenever we want to do this and apply a single function (here it's the mean), it is useful to use pandas groupby and apply functionality.

In [ ]:
apps = f['apples'].groupby(f['farm'])
apps
Out[ ]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12fec0dc0>
In [ ]:
apps.apply(np.mean)
Out[ ]:
farm
FarmerJohn    4.383333
McDonald      6.750000
Name: apples, dtype: float64

We can also plot this data by looking at a boxplot:

In [ ]:
# groupby objects have plot capabilities too
apps.plot()
Out[ ]:
farm
FarmerJohn    AxesSubplot(0.125,0.11;0.775x0.77)
McDonald      AxesSubplot(0.125,0.11;0.775x0.77)
Name: apples, dtype: object

Somtimes it is easier for a simple scenario not to use groupby:

In [ ]:
f.boxplot("apples", by="farm")
Out[ ]:
<AxesSubplot: title={'center': 'apples'}, xlabel='farm'>

Read in a dataframe¶

Often you will have preexisting data that you will simply read in. To do this use the appropriate read function for pandas. Most commonly read.csv

In [ ]:
flowers = pd.read_csv('../Datasets/iris.csv')
flowers
Out[ ]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Pandas loc and iloc¶

pandas loc and iloc functions are useful for calling rows and columns by position and name. Generally, iLoc is for index location, while loc is useful with label names.

In [ ]:
# first row of data and 3rd through 5th columns
flowers.iloc[0, 2:5]
Out[ ]:
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 0, dtype: object
In [ ]:
# Ditto but using names instead of numeric index location. Note the colon indicating the range of consecutive columns
flowers.loc[1, 'Petal.Length':'Species']
Out[ ]:
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 1, dtype: object
In [ ]:
# LOC: return all rows from 'Species' column matching query from Sepal.Length column
flowers.loc[(flowers['Sepal.Length'] >= 4.9), 'Species']
Out[ ]:
0         setosa
1         setosa
4         setosa
5         setosa
7         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: Species, Length: 134, dtype: object
In [ ]:
flowers.loc[(flowers['Sepal.Length'] >= 4.9)]
Out[ ]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
7 5.0 3.4 1.5 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

134 rows × 5 columns

In [ ]:
data = pd.read_csv('../Datasets/TitanicSurvival.csv')

#show first N rows (default = 5)
data.head()
Out[ ]:
Unnamed: 0 survived sex age passengerClass
0 Allen, Miss. Elisabeth Walton yes female 29.0000 1st
1 Allison, Master. Hudson Trevor yes male 0.9167 1st
2 Allison, Miss. Helen Loraine no female 2.0000 1st
3 Allison, Mr. Hudson Joshua Crei no male 30.0000 1st
4 Allison, Mrs. Hudson J C (Bessi no female 25.0000 1st

Categorical data & counting¶

When you have categorical data (such as "yes"/"no" or "A","B","C", etc.) it can be useful to make use of a Series function called value_counts()

In [ ]:
data['survived'].value_counts()
Out[ ]:
no     809
yes    500
Name: survived, dtype: int64

You can access these values as proportions by setting the argument "normalize" to True

In [ ]:
data['survived'].value_counts(normalize = True)
Out[ ]:
no     0.618029
yes    0.381971
Name: survived, dtype: float64
In [ ]:
 
Back to top