Python Pandas Cheat Sheet
Being able to look up and use functions fast allows us to achieve a certain flow when writing code. So I’ve created this cheatsheet of functions from python pandas. This is not a comprehensive list but contains the functions I use most, an example, and my insights as to when it’s most useful.
First Step - Load CSV
If you want to run these examples yourself, download the Anime recommendation dataset from Kaggle , unzip and drop it in the same folder as your jupyter notebook.
Next Run these commands and you should be able to replicate my results for any of the below functions.
import pandas as pd
import numpy as np
anime = pd.read_csv('anime.csv')
rating = pd.read_csv('rating.csv')
anime_modified = anime.set_index('name')
Importing
Convert a CSV directly into a data frame. Sometimes loading data from a CSV also requires specifying an encoding (ie:encoding='ISO-8859–1'). It’s the first thing you should try if your data frame contains unreadable characters.
Another similar function also exists called pd.read_excel for excel files.
anime = pd.read_csv('anime.csv')
print(anime)
anime_id name genre type episodes rating members
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572
...
...
...
Build data frame from inputted data
Useful when you want to manually instantiate simple data so that you can see how it changes as it flows through a pipeline.
df = pd.DataFrame([[1,'USA', 'North America'],
[2,'Germany', 'Europe'],
[3,'China', 'Asia']],
columns=['ID','Country', 'Continent'])
print(df)
ID Country Continent
0 1 USA North America
1 2 Germany Europe
2 3 China Asia
Copy a data frame
Useful when you want to make changes to a data frame while maintaining a copy of the original. It’s good practise to copy all data frames immediately after loading them.
anime_copy = anime.copy(deep=True)
print(anime_copy)
anime_id name genre type episodes rating members
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572
...
...
...
Exporting - Save to CSV
This dumps to the same directory as the notebook. I’m only saving the 1st 5 rows below but you don’t need to do that. Again, df.to_excel() also exists and functions basically the same for excel files.
anime_copy[:5].to_csv('anime_copy.csv', index=False)
~] cat anime_copy.csv
anime_id,name,genre,type,episodes,rating,members
32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Military, Shounen",TV,64,9.26,793665
28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen",TV,51,9.25,114262
9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen",TV,51,9.16,151266
Viewing and Inspecting
Get top or bottom n records
Display the first n records from a data frame. I often print the top record of a data frame somewhere in my notebook so I can refer back to it if I forget what’s inside.
anime_copy.head(3)
anime_id name genre type episodes rating members
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
anime_copy.tail(2)
anime_id name genre type episodes rating members
12292 6133 Violence Gekiga Shin David no Hoshi: Inma Dens... Hentai OVA 1 4.98 175
12293 26081 Yasuji no Pornorama: Yacchimae!! Hentai Movie 1 5.46 142
Count rows
This is not a pandas function, but len() counts rows and can be saved to a variable and used elsewhere.
len(df)
3
len(anime_copy)
12294
Count unique rows
Count unique values in a column:
len(anime_copy['type'].unique())
7
len(rating['user_id'].unique())
73515
# or
rating['user_id'].nunique()
73515
Get data frame info
Useful for getting some general information like header, number of values and datatype by column. A similar but less useful function is df.dtypes which just gives column data types.
anime_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
anime_id 12294 non-null int64
name 12294 non-null object
genre 12232 non-null object
type 12269 non-null object
episodes 12294 non-null object
rating 12064 non-null float64
members 12294 non-null int64
dtypes: float64(1), int64(2), object(4)
memory usage: 672.4+ KB
Get statistics
Really useful if the data frame has a lot of numeric values. Knowing the mean, min and max of the rating column give us a sense of how the data frame looks overall.
anime_copy.describe()
anime_id rating members
count 12294.000000 12064.000000 1.229400e+04
mean 14058.221653 6.473902 1.807134e+04
std 11455.294701 1.026746 5.482068e+04
min 1.000000 1.670000 5.000000e+00
25% 3484.250000 5.880000 2.250000e+02
50% 10260.500000 6.570000 1.550000e+03
75% 24794.500000 7.180000 9.437000e+03
max 34527.000000 10.000000 1.013917e+06
Get counts of values
anime_copy.type.value_counts()
TV 3787
OVA 3311
Movie 2348
Special 1676
ONA 659
Music 488
Name: type, dtype: int64
Selecting
Get a list or series of values for a column
This works if you need to pull the values in columns into x and y variables so you can fit a machine learning model.
anime_copy['genre']
0 Drama, Romance, School, Supernatural
1 Action, Adventure, Drama, Fantasy, Magic, Mili...
2 Action, Comedy, Historical, Parody, Samurai, S...
3 Sci-Fi, Thriller
4 Action, Comedy, Historical, Parody, Samurai, S...
anime_copy['genre'].tolist()
['Drama, Romance, School, Supernatural',
'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen',
'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen',
'Sci-Fi, Thriller',
'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen'
...
...
...]
Get a list of index values
Create a list of values from index.
anime_modified.index.tolist()
['Kimi no Na wa.',
'Fullmetal Alchemist: Brotherhood',
'Gintama°',
'Steins;Gate',
...
...
...]
Get a list of column values
anime_copy.columns.tolist()
['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']
Adding or Dropping
Append new column with a set value
I do this on occasion when I have test and train sets in 2 separate data frames and want to mark which rows are related to what set before combining them.
anime_copy['train set'] = True
anime_copy.columns.tolist()
['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members', 'train set']
Create new data frame from a subset of columns
Useful when you only want to keep a few columns from a giant data frame and don’t want to specify each that you want to drop.
anime_copy[['name','episodes']]
name episodes
0 Kimi no Na wa. 1
1 Fullmetal Alchemist: Brotherhood 64
2 Gintama° 51
3 Steins;Gate 24
4 Gintama' 51
...
...
...
12291 Violence Gekiga David no Hoshi 4
12292 Violence Gekiga Shin David no Hoshi: Inma Dens... 1
12293 Yasuji no Pornorama: Yacchimae!! 1
[12294 rows x 2 columns]
Drop specified columns
Useful when you only need to drop a few columns. Otherwise, it can be tedious to write them all out and I prefer the previous option.
anime_copy.drop(['anime_id', 'genre', 'members'], axis=1).head(8)
name type episodes rating train set
0 Kimi no Na wa. Movie 1 9.37 True
1 Fullmetal Alchemist: Brotherhood TV 64 9.26 True
2 Gintama° TV 51 9.25 True
3 Steins;Gate TV 24 9.17 True
4 Gintama' TV 51 9.16 True
5 Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga... TV 10 9.15 True
6 Hunter x Hunter (2011) TV 148 9.13 True
7 Ginga Eiyuu Densetsu OVA 110 9.11 True
Add a row with sum of other rows
We’ll manually create a small data frame here because it’s easier to look at. The interesting part here is df.sum(axis=0) which adds the values across rows. Alternatively df.sum(axis=0) adds values across columns.
The same logic applies when calculating counts or means, ie: df.mean(axis=0).
df = pd.DataFrame([[1,'John', 80],
[2,'Alice', 60],
[3,'Bob', 75]],
columns=['ID','Name', 'weight'])
df.append(df.sum(axis=0), ignore_index=True)
ID Name weight
0 1 John 80
1 2 Alice 60
2 3 Bob 75
3 6 JohnAliceBob 215
Combining
Concatenate 2 dataframes
Use this if you have 2 data frames with the same columns and want to combine them.
Here we split a data frame in 2 them add them back together.
df1 = anime_copy[0:2]
print(df1)
anime_id name genre type episodes rating members train set
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630 True
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665 True
df2 = anime_copy[2:4]
print(df2)
anime_id name genre type episodes rating members train set
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262 True
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572 True
pd.concat([df1, df2], ignore_index=True)
anime_id name genre type episodes rating members train set
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630 True
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665 True
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262 True
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572 True
Merge dataframes
This functions like a SQL left join, when you have 2 data frames and want to join on a column.
rating.merge(anime_copy, left_on='anime_id', right_on='anime_id', suffixes=('_left', '_right'))
user_id anime_id rating_left name genre type episodes rating_right members
0 1 20 -1 Naruto Action, Comedy, Martial Arts, Shounen, Super P... TV 220 7.81 683297
1 3 20 8 Naruto Action, Comedy, Martial Arts, Shounen, Super P... TV 220 7.81 683297
2 5 20 6 Naruto Action, Comedy, Martial Arts, Shounen, Super P... TV 220 7.81 683297
3 6 20 -1 Naruto Action, Comedy, Martial Arts, Shounen, Super P... TV 220 7.81 683297
4 10 20 -1 Naruto Action, Comedy, Martial Arts, Shounen, Super P... TV 220 7.81 683297
...
7813723 69497 33484 10 Shiroi Zou Action, Historical, Kids Movie 1 4.71 45
7813724 70463 29481 -1 Kakinoki Mokkii Fantasy, Kids Special 1 4.33 61
7813725 72404 34412 -1 Hashiri Hajimeta bakari no Kimi ni Music Music 1 6.76 239
7813726 72800 30738 4 Gamba: Gamba to Nakama-tachi Adventure, Kids Movie 1 5.55 185
Filtering
Retrieve rows with matching index values
The index values in anime_modified are the names of the anime. Notice how we’ve used those names to grab specific columns.
anime_modified.loc[['Haikyuu!! Second Season','Gintama']]
anime_id genre type episodes rating members
name
Haikyuu!! Second Season 28891 Comedy, Drama, School, Shounen, Sports TV 25 8.93 179342
Gintama 918 Action, Comedy, Historical, Parody, Samurai, S... TV 201 9.04 336376
Retrieve rows by numbered index values
This differs from the previous function. Using iloc, the 1st row has an index of 0, the 2nd row has an index of 1, and so on… even if you’ve modified the data frame and are now using string values in the index column.
Use this is you want the first 3 rows in a data frame.
anime_modified.iloc[0:3]
anime_id genre type episodes rating members
name
Kimi no Na wa. 32281 Drama, Romance, School, Supernatural Movie 1 9.37 200630
Fullmetal Alchemist: Brotherhood 5114 Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
Gintama° 28977 Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
Get rows
Retrieve rows where a column’s value is in a given list. anime[anime['type'] == 'TV'] also works when matching on a single value.
anime_copy[anime_copy['type'].isin(['TV', 'Movie'])]
anime_id name genre type episodes rating members
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572
4 9969 Gintama' Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.16 151266
...
11805 2755 Cleopatra Hentai, Historical Movie 1 6.06 1709
12244 11141 Blue Seagull Action, Hentai Movie 1 4.60 337
12258 20007 Hi Gekiga Ukiyoe Senya Ichiya Action, Hentai Movie 1 1.92 129
12293 26081 Yasuji no Pornorama: Yacchimae!! Hentai Movie 1 5.46 142
Slice a dataframe
This is just like slicing a list. Slice a data frame to get all rows before/between/after specified indices.
anime_copy[1:3]
anime_id name genre type episodes rating members
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
Filter by value
Filter data frame for rows that meet a condition. Note this maintains existing index values.
anime_copy[anime_copy['rating'] > 8]
anime_id name genre type episodes rating members
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572
4 9969 Gintama' Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.16 151266
...
10786 26313 Yakusoku: Africa Mizu to Midori Drama, Kids OVA 1 9.25 53
10793 28557 Yamete! Writer Asobi: Doubutsu Mura no Shoubou... Drama, Kids OVA 1 8.67 40
10847 26097 Yume no Tsuzuki Drama, Kids OVA 1 8.67 53
12251 33960 Chou Do M na Hentai Mesu-tachi Otokoton Chouky... Hentai OVA Unknown 8.38 161
Sorting
sort_values
Sort data frame by values in a column.
anime_copy.sort_values('rating', ascending=False)
anime_id name genre type episodes rating members
10464 33662 Taka no Tsume 8: Yoshida-kun no X-Files Comedy, Parody Movie 1 10.00 13
10400 30120 Spoon-hime no Swing Kitchen Adventure, Kids TV Unknown 9.60 47
9595 23005 Mogura no Motoro Slice of Life Movie 1 9.50 62
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
...
12279 34491 Sagurare Otome The Animation Hentai OVA 1 NaN 79
12280 34312 Saimin Class Hentai OVA Unknown NaN 240
12282 34388 Shikkoku no Shaga The Animation Hentai OVA Unknown NaN 195
12285 34399 Taimanin Asagi 3 Demons, Hentai, Supernatural OVA Unknown NaN 485
Aggregating
Groupby and count
Count number of records for each distinct value in a column.
anime_copy.groupby('type').count()
anime_id name genre episodes rating members
type
Movie 2348 2348 2306 2348 2297 2348
Music 488 488 488 488 488 488
ONA 659 659 655 659 652 659
OVA 3311 3311 3310 3311 3285 3311
Special 1676 1676 1674 1676 1671 1676
TV 3787 3787 3777 3787 3671 3787
Groupby and aggregate columns in different ways
Note I added reset_index() otherwise the type column becomes the index column — I recommend doing the same in most cases.
anime_copy.groupby(["type"]).agg({
"rating": "sum",
"episodes": "count",
"name": "last"
}).reset_index()
type rating episodes name
0 Movie 14512.58 2348 Yasuji no Pornorama: Yacchimae!!
1 Music 2727.43 488 Yuu no Mahou
2 ONA 3679.43 659 Docchi mo Maid
3 OVA 20942.60 3311 Violence Gekiga Shin David no Hoshi: Inma Dens...
4 Special 10900.77 1676 Junjou Shoujo Et Cetera Specials
5 TV 25338.34 3787 Yuuki Yuuna wa Yuusha de Aru: Yuusha no Shou
Create a pivot table
Nothing better than a pivot table for pulling a subset of data from a data frame.
Note I’ve heavily filtered the data frame so it’s quicker to build the pivot table.
tmp_df = rating.copy()
tmp_df.head(10)
user_id anime_id rating
156 3 20 8
308 5 24 1
307 5 22 5
306 5 20 6
305 5 18 6
304 5 17 6
303 5 15 6
302 5 6 8
806 7 22 7
tmp_df.sort_values('user_id', ascending=True, inplace=True)
tmp_df = tmp_df[tmp_df.user_id < 10]
tmp_df = tmp_df[tmp_df.anime_id < 30]
tmp_df = tmp_df[tmp_df.rating != -1]
pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum, fill_value=0)
anime_id 6 15 17 18 20 22 24
user_id
3 0 0 0 0 8 0 0
5 8 6 6 6 6 5 1
7 0 0 0 0 0 7 0
Cleaning
Set NaN cells to some value
Set cells with NaN value to 0 . In the example we create the same pivot table as before but without fill_value=0 then use fillna(0) to fill them in afterwards.
pivot = pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum)
print(pivot)
anime_id 6 15 17 18 20 22 24
user_id
3 NaN NaN NaN NaN 8.0 NaN NaN
5 8.0 6.0 6.0 6.0 6.0 5.0 1.0
7 NaN NaN NaN NaN NaN 7.0 NaN
pivot.fillna(0)
anime_id 6 15 17 18 20 22 24
user_id
3 0.0 0.0 0.0 0.0 8.0 0.0 0.0
5 8.0 6.0 6.0 6.0 6.0 5.0 1.0
7 0.0 0.0 0.0 0.0 0.0 7.0 0.0
Other
Sample a data frame
I use this all the time taking a small sample from a larger data frame. It allows randomly rearranging rows while maintaining indices if frac=1
anime_copy.sample(frac=0.25)
anime_id name genre type episodes rating members
5415 3101 F-Zero: Falcon Densetsu Action, Adventure, Cars, Mecha, Sci-Fi, Shoune... TV 51 6.46 3863
169 34240 Shelter Music, Sci-Fi Music 1 8.38 71136
429 1089 Macross: Do You Remember Love? Action, Mecha, Military, Music, Romance, Sci-F... Movie 1 8.09 19240
9770 16393 Norakuro-kun Comedy, Mystery TV 50 7.17 114
...
9367 29493 Kyouryou Pikora Adventure, Kids Special 1 3.93 47
3727 6735 Ginga Tetsudou 999: Kimi wa Haha no You ni Ais... Sci-Fi, Space Special 1 6.90 730
10995 28613 Ginga Jinpuu Jinraiger Action, Adventure, Mecha NaN Unknown NaN 627
6865 31710 Divine Gate Action, Fantasy, Sci-Fi TV 12 5.88 83271
Iterate over row indices
Iterate over index and rows in data frame.
for idx, row in anime_copy[:2].iterrows():
print(idx, row)
0 anime_id 32281
name Kimi no Na wa.
genre Drama, Romance, School, Supernatural
type Movie
episodes 1
rating 9.37
members 200630
Name: 0, dtype: object
1 anime_id 5114
name Fullmetal Alchemist: Brotherhood
genre Action, Adventure, Drama, Fantasy, Magic, Mili...
type TV
episodes 64
rating 9.26
members 793665
Name: 1, dtype: object