Python Pandas Cheat Sheet

2020-03-25

Python Development

Being able to look up and use functions fast allows us to achieve a certain flow when writing code. So I’ve created this cheatsheet of functions from python pandas. This is not a comprehensive list but contains the functions I use most, an example, and my insights as to when it’s most useful.

First Step - Load CSV

If you want to run these examples yourself, download the Anime recommendation dataset from Kaggle , unzip and drop it in the same folder as your jupyter notebook.

Next Run these commands and you should be able to replicate my results for any of the below functions.

import pandas as pd
import numpy as np
 
anime = pd.read_csv('anime.csv')                               
rating = pd.read_csv('rating.csv')                               
 
anime_modified = anime.set_index('name')

Importing

Convert a CSV directly into a data frame. Sometimes loading data from a CSV also requires specifying an encoding (ie:encoding='ISO-8859–1'). It’s the first thing you should try if your data frame contains unreadable characters.

Another similar function also exists called pd.read_excel for excel files.

anime = pd.read_csv('anime.csv')

print(anime)
       anime_id                                               name                                              genre     type episodes  rating  members
0         32281                                     Kimi no Na wa.               Drama, Romance, School, Supernatural    Movie        1    9.37   200630
1          5114                   Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...       TV       64    9.26   793665
2         28977                                           Gintama°  Action, Comedy, Historical, Parody, Samurai, S...       TV       51    9.25   114262
3          9253                                        Steins;Gate                                   Sci-Fi, Thriller       TV       24    9.17   673572
...
...
...

Build data frame from inputted data

Useful when you want to manually instantiate simple data so that you can see how it changes as it flows through a pipeline.

df = pd.DataFrame([[1,'USA', 'North America'],
                   [2,'Germany', 'Europe'],
                   [3,'China', 'Asia']], 
                   columns=['ID','Country', 'Continent'])

print(df)
   ID  Country      Continent
0   1      USA  North America
1   2  Germany         Europe
2   3    China           Asia

Copy a data frame

Useful when you want to make changes to a data frame while maintaining a copy of the original. It’s good practise to copy all data frames immediately after loading them.

anime_copy = anime.copy(deep=True)

print(anime_copy)
       anime_id                                               name                                              genre     type episodes  rating  members
0         32281                                     Kimi no Na wa.               Drama, Romance, School, Supernatural    Movie        1    9.37   200630
1          5114                   Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...       TV       64    9.26   793665
2         28977                                           Gintama°  Action, Comedy, Historical, Parody, Samurai, S...       TV       51    9.25   114262
3          9253                                        Steins;Gate                                   Sci-Fi, Thriller       TV       24    9.17   673572
...
...
...

Exporting - Save to CSV

This dumps to the same directory as the notebook. I’m only saving the 1st 5 rows below but you don’t need to do that. Again, df.to_excel() also exists and functions basically the same for excel files.

anime_copy[:5].to_csv('anime_copy.csv', index=False)

~] cat anime_copy.csv 
anime_id,name,genre,type,episodes,rating,members
32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Military, Shounen",TV,64,9.26,793665
28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen",TV,51,9.25,114262
9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen",TV,51,9.16,151266

Viewing and Inspecting

Get top or bottom n records

Display the first n records from a data frame. I often print the top record of a data frame somewhere in my notebook so I can refer back to it if I forget what’s inside.

anime_copy.head(3)
   anime_id                              name                                              genre   type episodes  rating  members
0     32281                    Kimi no Na wa.               Drama, Romance, School, Supernatural  Movie        1    9.37   200630
1      5114  Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   793665
2     28977                          Gintama°  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   114262

anime_copy.tail(2)
       anime_id                                               name   genre   type episodes  rating  members
12292      6133  Violence Gekiga Shin David no Hoshi: Inma Dens...  Hentai    OVA        1    4.98      175
12293     26081                   Yasuji no Pornorama: Yacchimae!!  Hentai  Movie        1    5.46      142

Count rows

This is not a pandas function, but len() counts rows and can be saved to a variable and used elsewhere.

len(df)
3

len(anime_copy)   
12294

Count unique rows

Count unique values in a column:

len(anime_copy['type'].unique())   
7

len(rating['user_id'].unique())    
73515

# or 

rating['user_id'].nunique() 
73515

Get data frame info

Useful for getting some general information like header, number of values and datatype by column. A similar but less useful function is df.dtypes which just gives column data types.

anime_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
anime_id    12294 non-null int64
name        12294 non-null object
genre       12232 non-null object
type        12269 non-null object
episodes    12294 non-null object
rating      12064 non-null float64
members     12294 non-null int64
dtypes: float64(1), int64(2), object(4)
memory usage: 672.4+ KB

Get statistics

Really useful if the data frame has a lot of numeric values. Knowing the mean, min and max of the rating column give us a sense of how the data frame looks overall.

anime_copy.describe()
           anime_id        rating       members
count  12294.000000  12064.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.026746  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.880000  2.250000e+02
50%    10260.500000      6.570000  1.550000e+03
75%    24794.500000      7.180000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06

Get counts of values

anime_copy.type.value_counts()
TV         3787
OVA        3311
Movie      2348
Special    1676
ONA         659
Music       488
Name: type, dtype: int64

Selecting

Get a list or series of values for a column

This works if you need to pull the values in columns into x and y variables so you can fit a machine learning model.

anime_copy['genre']    
0                     Drama, Romance, School, Supernatural
1        Action, Adventure, Drama, Fantasy, Magic, Mili...
2        Action, Comedy, Historical, Parody, Samurai, S...
3                                         Sci-Fi, Thriller
4        Action, Comedy, Historical, Parody, Samurai, S...

anime_copy['genre'].tolist()
['Drama, Romance, School, Supernatural',
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen',
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 
 'Sci-Fi, Thriller',
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen'
 ...
 ...
 ...]

Get a list of index values

Create a list of values from index.

anime_modified.index.tolist()
['Kimi no Na wa.', 
 'Fullmetal Alchemist: Brotherhood',
 'Gintama°', 
 'Steins;Gate',
 ...
 ...
 ...]

Get a list of column values

anime_copy.columns.tolist()
['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

Adding or Dropping

Append new column with a set value

I do this on occasion when I have test and train sets in 2 separate data frames and want to mark which rows are related to what set before combining them.

anime_copy['train set'] = True

anime_copy.columns.tolist()   
['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members', 'train set']

Create new data frame from a subset of columns

Useful when you only want to keep a few columns from a giant data frame and don’t want to specify each that you want to drop.

anime_copy[['name','episodes']]
                                                    name episodes
0                                         Kimi no Na wa.        1
1                       Fullmetal Alchemist: Brotherhood       64
2                                               Gintama°       51
3                                            Steins;Gate       24
4                                          Gintama&#039;       51
...
...
...
12291                     Violence Gekiga David no Hoshi        4
12292  Violence Gekiga Shin David no Hoshi: Inma Dens...        1
12293                   Yasuji no Pornorama: Yacchimae!!        1

[12294 rows x 2 columns]

Drop specified columns

Useful when you only need to drop a few columns. Otherwise, it can be tedious to write them all out and I prefer the previous option.

anime_copy.drop(['anime_id', 'genre', 'members'], axis=1).head(8)
                                                name   type episodes  rating  train set
0                                     Kimi no Na wa.  Movie        1    9.37       True
1                   Fullmetal Alchemist: Brotherhood     TV       64    9.26       True
2                                           Gintama°     TV       51    9.25       True
3                                        Steins;Gate     TV       24    9.17       True
4                                      Gintama&#039;     TV       51    9.16       True
5  Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...     TV       10    9.15       True
6                             Hunter x Hunter (2011)     TV      148    9.13       True
7                               Ginga Eiyuu Densetsu    OVA      110    9.11       True

Add a row with sum of other rows

We’ll manually create a small data frame here because it’s easier to look at. The interesting part here is df.sum(axis=0) which adds the values across rows. Alternatively df.sum(axis=0) adds values across columns.

The same logic applies when calculating counts or means, ie: df.mean(axis=0).

df = pd.DataFrame([[1,'John', 80],
                   [2,'Alice', 60],
                   [3,'Bob', 75]], 
                   columns=['ID','Name', 'weight'])

df.append(df.sum(axis=0), ignore_index=True)       
   ID          Name  weight
0   1          John      80
1   2         Alice      60
2   3           Bob      75
3   6  JohnAliceBob     215

Combining

Concatenate 2 dataframes

Use this if you have 2 data frames with the same columns and want to combine them.

Here we split a data frame in 2 them add them back together.

df1 = anime_copy[0:2]

print(df1)
   anime_id                              name                                              genre   type episodes  rating  members  train set
0     32281                    Kimi no Na wa.               Drama, Romance, School, Supernatural  Movie        1    9.37   200630       True
1      5114  Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   793665       True


df2 = anime_copy[2:4]

print(df2)
   anime_id         name                                              genre type episodes  rating  members  train set
2     28977     Gintama°  Action, Comedy, Historical, Parody, Samurai, S...   TV       51    9.25   114262       True
3      9253  Steins;Gate                                   Sci-Fi, Thriller   TV       24    9.17   673572       True



pd.concat([df1, df2], ignore_index=True)
   anime_id                              name                                              genre   type episodes  rating  members  train set
0     32281                    Kimi no Na wa.               Drama, Romance, School, Supernatural  Movie        1    9.37   200630       True
1      5114  Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   793665       True
2     28977                          Gintama°  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   114262       True
3      9253                       Steins;Gate                                   Sci-Fi, Thriller     TV       24    9.17   673572       True

Merge dataframes

This functions like a SQL left join, when you have 2 data frames and want to join on a column.

rating.merge(anime_copy, left_on='anime_id', right_on='anime_id', suffixes=('_left', '_right'))     
         user_id  anime_id  rating_left                                               name                                              genre     type episodes  rating_right  members
0              1        20           -1                                             Naruto  Action, Comedy, Martial Arts, Shounen, Super P...       TV      220          7.81   683297
1              3        20            8                                             Naruto  Action, Comedy, Martial Arts, Shounen, Super P...       TV      220          7.81   683297
2              5        20            6                                             Naruto  Action, Comedy, Martial Arts, Shounen, Super P...       TV      220          7.81   683297
3              6        20           -1                                             Naruto  Action, Comedy, Martial Arts, Shounen, Super P...       TV      220          7.81   683297
4             10        20           -1                                             Naruto  Action, Comedy, Martial Arts, Shounen, Super P...       TV      220          7.81   683297
...
7813723    69497     33484           10                                         Shiroi Zou                           Action, Historical, Kids    Movie        1          4.71       45
7813724    70463     29481           -1                                    Kakinoki Mokkii                                      Fantasy, Kids  Special        1          4.33       61
7813725    72404     34412           -1                 Hashiri Hajimeta bakari no Kimi ni                                              Music    Music        1          6.76      239
7813726    72800     30738            4                       Gamba: Gamba to Nakama-tachi                                    Adventure, Kids    Movie        1          5.55      185

Filtering

Retrieve rows with matching index values

The index values in anime_modified are the names of the anime. Notice how we’ve used those names to grab specific columns.

anime_modified.loc[['Haikyuu!! Second Season','Gintama']]
                         anime_id                                              genre type episodes  rating  members
name                                                                                                               
Haikyuu!! Second Season     28891             Comedy, Drama, School, Shounen, Sports   TV       25    8.93   179342
Gintama                       918  Action, Comedy, Historical, Parody, Samurai, S...   TV      201    9.04   336376

Retrieve rows by numbered index values

This differs from the previous function. Using iloc, the 1st row has an index of 0, the 2nd row has an index of 1, and so on… even if you’ve modified the data frame and are now using string values in the index column.

Use this is you want the first 3 rows in a data frame.

anime_modified.iloc[0:3]
                                  anime_id                                              genre   type episodes  rating  members
name                                                                                                                          
Kimi no Na wa.                       32281               Drama, Romance, School, Supernatural  Movie        1    9.37   200630
Fullmetal Alchemist: Brotherhood      5114  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   793665
Gintama°                             28977  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   114262

Get rows

Retrieve rows where a column’s value is in a given list. anime[anime['type'] == 'TV'] also works when matching on a single value.

anime_copy[anime_copy['type'].isin(['TV', 'Movie'])] 
       anime_id                                               name                                              genre   type episodes  rating  members
0         32281                                     Kimi no Na wa.               Drama, Romance, School, Supernatural  Movie        1    9.37   200630
1          5114                   Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   793665
2         28977                                           Gintama°  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   114262
3          9253                                        Steins;Gate                                   Sci-Fi, Thriller     TV       24    9.17   673572
4          9969                                      Gintama&#039;  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   151266
...
11805      2755                                          Cleopatra                                 Hentai, Historical  Movie        1    6.06     1709
12244     11141                                       Blue Seagull                                     Action, Hentai  Movie        1    4.60      337
12258     20007                      Hi Gekiga Ukiyoe Senya Ichiya                                     Action, Hentai  Movie        1    1.92      129
12293     26081                   Yasuji no Pornorama: Yacchimae!!                                             Hentai  Movie        1    5.46      142

Slice a dataframe

This is just like slicing a list. Slice a data frame to get all rows before/between/after specified indices.

anime_copy[1:3]
   anime_id                              name                                              genre type episodes  rating  members
1      5114  Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...   TV       64    9.26   793665
2     28977                          Gintama°  Action, Comedy, Historical, Parody, Samurai, S...   TV       51    9.25   114262

Filter by value

Filter data frame for rows that meet a condition. Note this maintains existing index values.

anime_copy[anime_copy['rating'] > 8]
       anime_id                                               name                                              genre   type episodes  rating  members
0         32281                                     Kimi no Na wa.               Drama, Romance, School, Supernatural  Movie        1    9.37   200630
1          5114                   Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   793665
2         28977                                           Gintama°  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   114262
3          9253                                        Steins;Gate                                   Sci-Fi, Thriller     TV       24    9.17   673572
4          9969                                      Gintama&#039;  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   151266
...
10786     26313                    Yakusoku: Africa Mizu to Midori                                        Drama, Kids    OVA        1    9.25       53
10793     28557  Yamete! Writer Asobi: Doubutsu Mura no Shoubou...                                        Drama, Kids    OVA        1    8.67       40
10847     26097                                    Yume no Tsuzuki                                        Drama, Kids    OVA        1    8.67       53
12251     33960  Chou Do M na Hentai Mesu-tachi Otokoton Chouky...                                             Hentai    OVA  Unknown    8.38      161

Sorting

sort_values

Sort data frame by values in a column.

anime_copy.sort_values('rating', ascending=False)
       anime_id                                               name                                              genre     type episodes  rating  members
10464     33662            Taka no Tsume 8: Yoshida-kun no X-Files                                     Comedy, Parody    Movie        1   10.00       13
10400     30120                        Spoon-hime no Swing Kitchen                                    Adventure, Kids       TV  Unknown    9.60       47
9595      23005                                   Mogura no Motoro                                      Slice of Life    Movie        1    9.50       62
0         32281                                     Kimi no Na wa.               Drama, Romance, School, Supernatural    Movie        1    9.37   200630
...
12279     34491                       Sagurare Otome The Animation                                             Hentai      OVA        1     NaN       79
12280     34312                                       Saimin Class                                             Hentai      OVA  Unknown     NaN      240
12282     34388                    Shikkoku no Shaga The Animation                                             Hentai      OVA  Unknown     NaN      195
12285     34399                                   Taimanin Asagi 3                       Demons, Hentai, Supernatural      OVA  Unknown     NaN      485

Aggregating

Groupby and count

Count number of records for each distinct value in a column.

anime_copy.groupby('type').count()
         anime_id  name  genre  episodes  rating  members
type                                                     
Movie        2348  2348   2306      2348    2297     2348
Music         488   488    488       488     488      488
ONA           659   659    655       659     652      659
OVA          3311  3311   3310      3311    3285     3311
Special      1676  1676   1674      1676    1671     1676
TV           3787  3787   3777      3787    3671     3787

Groupby and aggregate columns in different ways

Note I added reset_index() otherwise the type column becomes the index column — I recommend doing the same in most cases.

anime_copy.groupby(["type"]).agg({
                             "rating": "sum",
                             "episodes": "count",
                             "name": "last"
                           }).reset_index()

      type    rating  episodes                                               name
0    Movie  14512.58      2348                   Yasuji no Pornorama: Yacchimae!!
1    Music   2727.43       488                                       Yuu no Mahou
2      ONA   3679.43       659                                     Docchi mo Maid
3      OVA  20942.60      3311  Violence Gekiga Shin David no Hoshi: Inma Dens...
4  Special  10900.77      1676                   Junjou Shoujo Et Cetera Specials
5       TV  25338.34      3787       Yuuki Yuuna wa Yuusha de Aru: Yuusha no Shou

Create a pivot table

Nothing better than a pivot table for pulling a subset of data from a data frame.

Note I’ve heavily filtered the data frame so it’s quicker to build the pivot table.

tmp_df = rating.copy()

tmp_df.head(10)
     user_id  anime_id  rating
156        3        20       8
308        5        24       1
307        5        22       5
306        5        20       6
305        5        18       6
304        5        17       6
303        5        15       6
302        5         6       8
806        7        22       7

tmp_df.sort_values('user_id', ascending=True, inplace=True)
tmp_df = tmp_df[tmp_df.user_id < 10] 
tmp_df = tmp_df[tmp_df.anime_id < 30]
tmp_df = tmp_df[tmp_df.rating != -1]

pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum, fill_value=0)

anime_id  6   15  17  18  20  22  24
user_id                             
3          0   0   0   0   8   0   0
5          8   6   6   6   6   5   1
7          0   0   0   0   0   7   0

Cleaning

Set NaN cells to some value

Set cells with NaN value to 0 . In the example we create the same pivot table as before but without fill_value=0 then use fillna(0) to fill them in afterwards.

pivot = pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum)

print(pivot)
anime_id   6    15   17   18   20   22   24
user_id                                    
3         NaN  NaN  NaN  NaN  8.0  NaN  NaN
5         8.0  6.0  6.0  6.0  6.0  5.0  1.0
7         NaN  NaN  NaN  NaN  NaN  7.0  NaN

pivot.fillna(0)
anime_id   6    15   17   18   20   22   24
user_id                                    
3         0.0  0.0  0.0  0.0  8.0  0.0  0.0
5         8.0  6.0  6.0  6.0  6.0  5.0  1.0
7         0.0  0.0  0.0  0.0  0.0  7.0  0.0

Other

Sample a data frame

I use this all the time taking a small sample from a larger data frame. It allows randomly rearranging rows while maintaining indices if frac=1

anime_copy.sample(frac=0.25)
       anime_id                                               name                                              genre     type episodes  rating  members
5415       3101                            F-Zero: Falcon Densetsu  Action, Adventure, Cars, Mecha, Sci-Fi, Shoune...       TV       51    6.46     3863
169       34240                                            Shelter                                      Music, Sci-Fi    Music        1    8.38    71136
429        1089                     Macross: Do You Remember Love?  Action, Mecha, Military, Music, Romance, Sci-F...    Movie        1    8.09    19240
9770      16393                                       Norakuro-kun                                    Comedy, Mystery       TV       50    7.17      114
...
9367      29493                                    Kyouryou Pikora                                    Adventure, Kids  Special        1    3.93       47
3727       6735  Ginga Tetsudou 999: Kimi wa Haha no You ni Ais...                                      Sci-Fi, Space  Special        1    6.90      730
10995     28613                             Ginga Jinpuu Jinraiger                           Action, Adventure, Mecha      NaN  Unknown     NaN      627
6865      31710                                        Divine Gate                            Action, Fantasy, Sci-Fi       TV       12    5.88    83271

Iterate over row indices

Iterate over index and rows in data frame.

for idx, row in anime_copy[:2].iterrows():
    print(idx, row)

0 anime_id                                   32281
name                              Kimi no Na wa.
genre       Drama, Romance, School, Supernatural
type                                       Movie
episodes                                       1
rating                                      9.37
members                                   200630
Name: 0, dtype: object
1 anime_id                                                 5114
name                         Fullmetal Alchemist: Brotherhood
genre       Action, Adventure, Drama, Fantasy, Magic, Mili...
type                                                       TV
episodes                                                   64
rating                                                   9.26
members                                                793665
Name: 1, dtype: object