Pandas Examples
Introduction
Following is an example of many features of the Pandas library for the Python language. You may think
of it as a cheat sheet of sorts but I'm reluctant to call it that because to be useful as such
a cheat sheet should be much smaller. It was developed from personal notes accumulated over many years
but just released publicaly now for the benefit of my grandson.
You can reach me using the contact from from one of my other sites:
Birdland Contact.
This is also available as a: Markdown file.
Bill Wetzel
24-Apr-2025
Preparatory
Imports
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
import matplotlib.pyplot as plt
Series
Make a Series
s = Series( np.random.rand( 10 ))
0 0.994610
1 0.164102
2 0.254481
3 0.760523
4 0.581324
5 0.887759
6 0.550993
7 0.056151
8 0.553723
9 0.028766
dtype: float64
Show Prior Series
s
0 0.994610
1 0.164102
2 0.254481
3 0.760523
4 0.581324
5 0.887759
6 0.550993
7 0.056151
8 0.553723
9 0.028766
dtype: float64
Make a Series with Named Index
s = Series( np.random.rand( 10 ), ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"] )
a 0.895349
b 0.506245
c 0.959434
d 0.718147
e 0.874011
f 0.432590
g 0.677948
h 0.918245
i 0.032920
j 0.630093
dtype: float64
Index for Series
s.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
Distinct values (rounded) in Series
s.round( 1 ).value_counts()
0.9 3
0.7 2
0.5 1
1.0 1
0.4 1
0.0 1
0.6 1
Name: count, dtype: int64
Dataframes
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.526257 0.082531 0.773950 0.430945
1 0.918965 0.518207 0.625018 0.044193
2 0.303664 0.825669 0.075327 0.693838
3 0.078656 0.777591 0.926702 0.385731
4 0.276900 0.667200 0.392218 0.964382
5 0.428334 0.905876 0.446561 0.364982
Columns in DataFrame
df.columns
[0, 1, 2, 3]
Rows (index) of DataFrame
df.index
[0, 1, 2, 3, 4, 5]
Transpose of Dataframe
df.T
0 1 2 3 4 5
0 0.526257 0.918965 0.303664 0.078656 0.276900 0.428334
1 0.082531 0.518207 0.825669 0.777591 0.667200 0.905876
2 0.773950 0.625018 0.075327 0.926702 0.392218 0.446561
3 0.430945 0.044193 0.693838 0.385731 0.964382 0.364982
Description of Dataframe
df.describe()
0 1 2 3
count 6.000000 6.000000 6.000000 6.000000
mean 0.422130 0.629512 0.539963 0.480679
std 0.286645 0.299927 0.302652 0.314579
min 0.078656 0.082531 0.075327 0.044193
25% 0.283591 0.555455 0.405804 0.370169
50% 0.365999 0.722395 0.535790 0.408338
75% 0.501777 0.813649 0.736717 0.628115
max 0.918965 0.905876 0.926702 0.964382
df.info()
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 6 non-null float64
1 1 6 non-null float64
2 2 6 non-null float64
3 3 6 non-null float64
dtypes: float64(4)
memory usage: 324.0 bytes
Shape of Dataframe
df.shape
(6, 4)
Rounded Dataframe
df.round(1)
0 1 2 3
0 0.5 0.1 0.8 0.4
1 0.9 0.5 0.6 0.0
2 0.3 0.8 0.1 0.7
3 0.1 0.8 0.9 0.4
4 0.3 0.7 0.4 1.0
5 0.4 0.9 0.4 0.4
Value Counts of Rounded Dataframe
df.round(1).value_counts()
0 1 2 3
0.1 0.8 0.9 0.4 1
0.3 0.7 0.4 1.0 1
0.8 0.1 0.7 1
0.4 0.9 0.4 0.4 1
0.5 0.1 0.8 0.4 1
0.9 0.5 0.6 0.0 1
Name: count, dtype: int64
Unique Value Counts of Rounded Dataframe
df.round(1).nunique()
0 5
1 5
2 5
3 4
dtype: int64
Rounded Selected Column
df[2].round(1)
0 0.8
1 0.6
2 0.1
3 0.9
4 0.4
5 0.4
Name: 2, dtype: float64
Value Counts of Rounded Selected Column
df[2].round(1).value_counts()
2
0.4 2
0.8 1
0.6 1
0.1 1
0.9 1
Name: count, dtype: int64
Unique Value Counts of Rounded Selected Column
df[2].round(1).nunique()
5
Get Array from DataFrame for Faster Iteration
df.values
[[0.52625746 0.08253089 0.77394952 0.4309448 ]
[0.91896505 0.51820701 0.62501799 0.04419306]
[0.30366439 0.82566895 0.07532738 0.69383818]
[0.07865641 0.77759054 0.92670187 0.38573085]
[0.27690023 0.66719996 0.39221784 0.96438196]
[0.42833369 0.90587645 0.44656138 0.3649823 ]]
Assign Dataframe Index to Variable
idx = df.index
[0, 1, 2, 3, 4, 5]
The following two examples are explained in detail in a later section.
Drop Rows Meeting Condition in Specified Column
df.drop( df[ df[2] < .5].index, axis=0 )
0 1 2 3
0 0.526257 0.082531 0.773950 0.430945
1 0.918965 0.518207 0.625018 0.044193
3 0.078656 0.777591 0.926702 0.385731
Drop Columns Meeting Condition in Specified Row
df.drop( df.loc[ :, df.iloc[ 2, : ] < .5 ].columns, axis=1 )
1 3
0 0.082531 0.430945
1 0.518207 0.044193
2 0.825669 0.693838
3 0.777591 0.385731
4 0.667200 0.964382
5 0.905876 0.364982
A Few Properties of DataFrame Index
idx.is_monotonic_increasing
True
idx.is_monotonic_decreasing
False
idx.has_duplicates
False
idx.nlevels
1
idx.values
[0 1 2 3 4 5]
idx.tolist()
[0, 1, 2, 3, 4, 5]
idx.min()
0
idx.max()
5
Make a DataFrame from Argument List with Labeled Rows and Columns
df = DataFrame( [[ 1, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R3' ], columns = ['C1', 'C2', 'C3'] )
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9
Assign CSV Data to a Variable
csv_data = """Color, Code, Alt-Code
R1, Red, #ff0000, 255/0/0
R2, Green, #00ff00, 0/255/0
R3, Blue, #0000ff, 0/0/255"""
Make a DataFrame from CSV Data with Labeled Rows and Columns
df = pd.read_csv( StringIO( csv_data ), header = 0, index_col = 0, skipinitialspace = True )
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Save DataFrame for Later Reference
df_colors = df
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Assign Dictionary to a Variable
dict_data = {
'c0' : [ 1, 2, 3, 4, 5 ],
'c1' : [ 2, 3, 4, 5, 6 ],
'c2' : [ 3, 4, 5, 6, 7 ]
}
Make a DataFrame from Dictionary with Labeled Columns
df = DataFrame( dict_data )
c0 c1 c2
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
Assign Dictionary to a Variable
dict_data = {
'r0' : [ 1, 2, 3, 4, 5 ],
'r1' : [ 2, 3, 4, 5, 6 ],
'r2' : [ 3, 4, 5, 6, 7 ]
}
Make a DataFrame from Dictionary with Labeled Rows
df = DataFrame.from_dict( dict_data, orient='index' )
0 1 2 3 4
r0 1 2 3 4 5
r1 2 3 4 5 6
r2 3 4 5 6 7
Assign Dictionary to a Variable
# global dict_data
dict_data = {
'r0' : { 'c0' : 1, 'c1' : 2, 'c2' : 3, 'c3' : 4, 'c4' : 5 },
'r1' : { 'c0' : 2, 'c1' : 3, 'c2' : 4, 'c3' : 5, 'c4' : 6 },
'r2' : { 'c0' : 3, 'c1' : 4, 'c2' : 5, 'c3' : 6, 'c4' : 7 }
}
Make a DataFrame from Dictionary with Labeled Rows and Columns
df = DataFrame.from_dict( dict_data, orient='index' )
c0 c1 c2 c3 c4
r0 1 2 3 4 5
r1 2 3 4 5 6
r2 3 4 5 6 7
Convert DataFrame to String
df.to_string()
c0 c1 c2 c3 c4
r0 1 2 3 4 5
r1 2 3 4 5 6
r2 3 4 5 6 7
Iterate Over Data
Make a DataFrame
df = DataFrame( [[ 1, 2 ], [3, 4]], index = [ 'R1', 'R2' ], columns = ['C1', 'C2'] )
C1 C2
R1 1 2
R2 3 4
Iterate on Columns
for n, s in df.items(): # df.iteritems() deprecated
print("n:", n)
print("s:")
print(s)
print()
n: C1
s:
R1 1
R2 3
Name: C1, dtype: int64
n: C2
s:
R1 2
R2 4
Name: C2, dtype: int64
Iterate on Rows
for n, s in df.iterrows():
print("n:", n)
print("s:")
print(s)
print()
n: R1
s:
C1 1
C2 2
Name: R1, dtype: int64
n: R2
s:
C1 3
C2 4
Name: R2, dtype: int64
Iterate on .values for Fast Access
for row in df.values:
print("row:", row)
for v in row:
print(" v:", v)
row: [1 2]
v: 1
v: 2
row: [3 4]
v: 3
v: 4
Series Indexing
Make a Series
s = Series( np.random.rand( 5 ), index = ['x0', 'x1', 'x2', 'x3', 'x4' ] )
x0 0.891443
x1 0.501428
x2 0.458724
x3 0.411455
x4 0.794857
dtype: float64
Index by Label
s[ 'x2' ]
0.45872423599707834
Index by Attribute
s.x2
0.45872423599707834
Index by Label with .loc
s.loc[ 'x2' ]
0.45872423599707834
Index by Integer with .iloc
s.iloc[ 2 ]
0.45872423599707834
DataFrame Indexing
Fetch Saved DataFrame
df = df_colors
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Index Column by Label
df[ 'Color' ]
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
Index Column by Attribute
df.Color
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
Index Column by Integer with .iloc
df.iloc[ :, 2 ]
R1 255/0/0
R2 0/255/0
R3 0/0/255
Name: Alt-Code, dtype: object
Index Row by Label with .loc
df.loc[ 'R1' ]
Color Red
Code #ff0000
Alt-Code 255/0/0
Name: R1, dtype: object
Index Row by Integer with .iloc
df.iloc[ 2 ]
Color Blue
Code #0000ff
Alt-Code 0/0/255
Name: R3, dtype: object
Index Row/Col by Labels with .loc
df.loc[ 'R1', 'Color' ]
Red
Index Row/Col by Integers with .iloc
df.iloc[ 1, 1 ]
#00ff00
Index by Contents of String
df.query( 'Color == "Green"' )
Color Code Alt-Code
R2 Green #00ff00 0/255/0
Boolean Indexing
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.066228 0.007887 0.410989 0.029063
1 0.067921 0.516712 0.182392 0.618393
2 0.801518 0.915226 0.398700 0.692419
3 0.907892 0.937800 0.667126 0.144018
4 0.407870 0.338500 0.198335 0.195538
5 0.898656 0.418393 0.737517 0.693376
Make Boolean Series on Rows
df[2] < .5
0 True
1 True
2 True
3 False
4 True
5 False
Name: 2, dtype: bool
Index Rows by Boolean Series
df[ df[2] < .5 ]
0 1 2 3
0 0.066228 0.007887 0.410989 0.029063
1 0.067921 0.516712 0.182392 0.618393
2 0.801518 0.915226 0.398700 0.692419
4 0.407870 0.338500 0.198335 0.195538
Index Rows by Boolean Series
df[ df[2] > .5 ]
0 1 2 3
3 0.907892 0.937800 0.667126 0.144018
5 0.898656 0.418393 0.737517 0.693376
Make Boolean Series on Columns
df.iloc[:,2] < .5
0 True
1 True
2 True
3 False
4 True
5 False
Name: 2, dtype: bool
Index Columns by Boolean Series
df[ df.iloc[:,2] < .5 ]
0 1 2 3
0 0.066228 0.007887 0.410989 0.029063
1 0.067921 0.516712 0.182392 0.618393
2 0.801518 0.915226 0.398700 0.692419
4 0.407870 0.338500 0.198335 0.195538
Index Columns by Boolean Series
df[ df.iloc[:,2] > .5 ]
0 1 2 3
3 0.907892 0.937800 0.667126 0.144018
5 0.898656 0.418393 0.737517 0.693376
Operations on Series / Dataframes
Get Series from Saved Dataframe
s = df_colors[ 'Color' ]
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
map() - execute function on each element of series
s.map( lambda x: print( x ))
Red
Green
Blue
apply() - execute function on each column of dataframe
df_colors.apply( lambda x: print( x ), axis=0 )
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
R1 #ff0000
R2 #00ff00
R3 #0000ff
Name: Code, dtype: object
R1 255/0/0
R2 0/255/0
R3 0/0/255
Name: Alt-Code, dtype: object
apply() - execute function on each row of dataframe
df_colors.apply( lambda x: print( x ), axis=1 )
Color Red
Code #ff0000
Alt-Code 255/0/0
Name: R1, dtype: object
Color Green
Code #00ff00
Alt-Code 0/255/0
Name: R2, dtype: object
Color Blue
Code #0000ff
Alt-Code 0/0/255
Name: R3, dtype: object
map() - execute function on each element of dataframe
df_colors.map( lambda x: print( x ))
Red
Green
Blue
#ff0000
#00ff00
#0000ff
255/0/0
0/255/0
0/0/255
Sorting Series
Make Series
s = Series( np.random.rand( 5 ), index=[ 'Ix', 'Ie', 'Ia', 'It', 'Io'])
Ix 0.728740
Ie 0.309253
Ia 0.386008
It 0.428475
Io 0.623328
dtype: float64
Sort by Index
s.sort_index()
Ia 0.386008
Ie 0.309253
Io 0.623328
It 0.428475
Ix 0.728740
dtype: float64
Sort by Value
s.sort_values()
Ie 0.309253
Ia 0.386008
It 0.428475
Io 0.623328
Ix 0.728740
dtype: float64
Sorting Dataframe
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 ), index=['Rc', 'Ra', 'Rx', 'Rj', 'Rw', 'Rn' ], columns = ['Ce', 'Cr', 'Ci', 'Cq' ] )
Ce Cr Ci Cq
Rc 0.756230 0.775280 0.323332 0.599263
Ra 0.313539 0.467441 0.978614 0.485691
Rx 0.106189 0.723605 0.215187 0.588402
Rj 0.379624 0.480167 0.131351 0.820475
Rw 0.549709 0.615262 0.919092 0.012299
Rn 0.294391 0.792651 0.384427 0.482769
Sort by Row Index
df.sort_index()
Ce Cr Ci Cq
Ra 0.313539 0.467441 0.978614 0.485691
Rc 0.756230 0.775280 0.323332 0.599263
Rj 0.379624 0.480167 0.131351 0.820475
Rn 0.294391 0.792651 0.384427 0.482769
Rw 0.549709 0.615262 0.919092 0.012299
Rx 0.106189 0.723605 0.215187 0.588402
Sort by Column Index
df.sort_index( axis=1 )
Ce Ci Cq Cr
Rc 0.756230 0.323332 0.599263 0.775280
Ra 0.313539 0.978614 0.485691 0.467441
Rx 0.106189 0.215187 0.588402 0.723605
Rj 0.379624 0.131351 0.820475 0.480167
Rw 0.549709 0.919092 0.012299 0.615262
Rn 0.294391 0.384427 0.482769 0.792651
Sort by Column Value
df.sort_values( by='Cr' )
Ce Cr Ci Cq
Ra 0.313539 0.467441 0.978614 0.485691
Rj 0.379624 0.480167 0.131351 0.820475
Rw 0.549709 0.615262 0.919092 0.012299
Rx 0.106189 0.723605 0.215187 0.588402
Rc 0.756230 0.775280 0.323332 0.599263
Rn 0.294391 0.792651 0.384427 0.482769
Sort by Row Value
df.sort_values( by='Rj', axis=1 )
Ci Ce Cr Cq
Rc 0.323332 0.756230 0.775280 0.599263
Ra 0.978614 0.313539 0.467441 0.485691
Rx 0.215187 0.106189 0.723605 0.588402
Rj 0.131351 0.379624 0.480167 0.820475
Rw 0.919092 0.549709 0.615262 0.012299
Rn 0.384427 0.294391 0.792651 0.482769
Summarize Functions on Dataframe
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.237400 0.771043 0.308005 0.461401
1 0.418274 0.263856 0.649630 0.544882
2 0.667811 0.586960 0.890119 0.376049
3 0.501345 0.916115 0.269861 0.449142
4 0.887702 0.289208 0.116405 0.751740
5 0.333775 0.750669 0.235345 0.789468
Sum
df.sum()
0 3.046306
1 3.577851
2 2.469365
3 3.372682
dtype: float64
Count
df.count()
0 6
1 6
2 6
3 6
dtype: int64
df.median()
0 0.459809
1 0.668815
2 0.288933
3 0.503142
dtype: float64
Quantile
df.quantile([0.25, 0.5, 0.75])
0 1 2 3
0.25 0.354899 0.363646 0.243974 0.452207
0.50 0.459809 0.668815 0.288933 0.503142
0.75 0.626194 0.765949 0.564223 0.700026
Min
df.min()
0 0.237400
1 0.263856
2 0.116405
3 0.376049
dtype: float64
Max
df.max()
0 0.887702
1 0.916115
2 0.890119
3 0.789468
dtype: float64
Mean
df.mean()
0 0.507718
1 0.596309
2 0.411561
3 0.562114
dtype: float64
Var
df.var()
0 0.056277
1 0.072312
2 0.086933
3 0.029094
dtype: float64
Std
df.std()
0 0.237228
1 0.268908
2 0.294844
3 0.170570
dtype: float64
Variations of Summarize Functions Shown for sum()
Sum Columns
df.sum()
0 3.046306
1 3.577851
2 2.469365
3 3.372682
dtype: float64
Sum Rows
df.sum( axis=1 )
0 1.777849
1 1.876641
2 2.520939
3 2.136463
4 2.045055
5 2.109256
dtype: float64
Sum Entire Dataframe
df.sum( axis=1 ).sum()
12.466203402723599
Histograms
Make Dataframe
df = DataFrame( np.random.normal( loc = 5, scale = 2, size = [1000, 4] ), dtype = float ).round()
0 1 2 3
0 5.0 7.0 3.0 4.0
1 7.0 5.0 4.0 7.0
2 5.0 5.0 3.0 6.0
3 5.0 5.0 3.0 4.0
4 3.0 4.0 6.0 5.0
.. ... ... ... ...
995 6.0 5.0 4.0 4.0
996 8.0 4.0 7.0 9.0
997 6.0 5.0 4.0 7.0
998 6.0 5.0 3.0 2.0
999 4.0 7.0 6.0 2.0
[1000 rows x 4 columns]
Histogram using pd.value_counts
s = df.apply( lambda x: x.value_counts()).fillna(0)
0 1 2 3
-2.0 0.0 0.0 1 0.0
-1.0 3.0 3.0 3 6.0
0.0 6.0 9.0 3 12.0
1.0 19.0 32.0 28 30.0
2.0 70.0 68.0 70 65.0
3.0 129.0 108.0 129 117.0
4.0 164.0 168.0 185 182.0
5.0 181.0 212.0 193 197.0
6.0 174.0 159.0 182 173.0
7.0 138.0 123.0 112 107.0
8.0 77.0 81.0 58 65.0
9.0 25.0 28.0 23 35.0
10.0 11.0 5.0 10 7.0
11.0 2.0 3.0 2 4.0
12.0 1.0 1.0 1 0.0
Histogram Using Pandas .hist()
h = df.hist( sharex=True, sharey=True, linewidth=1, figsize=(Fig_X, Fig_Y) )
Get Series from Column 0 of Dataframe
s = df.iloc[ :, 0].values
[ 5. 7. 5. 5. 3. 3. -1. 4. 6. 7. 7. 4. 4. 6. 7. 4. 2. 5.
5. 6. 7. 3. 4. 9. 6. 5. 5. 4. 6. 8. 4. 7. 3. 0. 8. 7.
5. 2. 5. 5. 7. 7. 4. 6. 5. 5. 7. 6. 5. 6. 3. 3. 7. 4.
7. 3. 6. 7. 3. 3. 8. 10. 0. 3. 3. 4. 5. 7. 6. 7. 6. 3.
5. 7. 5. 8. 6. 0. 5. 6. 3. 8. -1. 7. 2. 6. 6. 4. 7. 6.
3. 6. 7. 5. 6. 2. 8. 4. 7. 8. 5. 5. 7. 8. 10. 7. 4. 7.
6. 3. 6. 6. 2. 6. 4. 3. 8. 5. 4. 5. 1. 7. 3. 5. 4. 7.
2. 4. 4. 5. 5. 3. 7. 6. 7. 8. 2. 5. 7. 8. 2. 7. 4. 4.
5. 5. 4. 5. 4. 10. 6. 8. 6. 2. 9. 5. 7. 6. 6. 5. 6. 3.
5. 4. 9. 2. 5. 6. 3. 5. 4. 5. 8. 4. 6. 8. 3. 6. 1. 5.
8. 4. 2. 4. 8. 7. 5. 8. 6. 6. 5. 6. 8. 7. 8. 7. 1. 5.
4. 2. 7. 6. 8. 8. 4. 3. 4. 8. 6. 7. 4. 6. 6. 3. 7. 4.
7. 5. 4. 5. 3. 3. 5. 3. 2. 5. 7. 6. 4. 5. 6. 4. 2. 3.
3. 7. 5. 6. -0. 7. 7. 3. 6. 4. 7. 6. 1. 5. 7. 5. 6. 7.
2. 8. 5. 11. 5. 2. 5. 5. 4. 5. 2. 7. 6. 8. 3. 7. 3. 6.
5. 2. 8. 3. 5. 4. 7. 5. 7. 4. 7. 5. 8. 6. 5. 6. 6. 8.
8. 7. 3. 4. 7. 4. 2. 5. 9. 7. 4. 5. 7. 6. 4. 5. 5. 5.
5. 4. 7. 3. 7. 4. 5. 2. 4. 6. 3. 3. 3. 3. 5. 7. 5. 6.
5. 8. 3. 5. 6. 7. 5. 3. 3. 5. 10. 3. 2. 7. 7. 8. 4. 9.
5. 2. 3. 5. 3. 8. 1. 8. 6. 6. 5. 4. 4. 7. 2. 1. 6. 3.
5. 6. 5. 3. 6. 8. 3. 5. 5. 5. 9. 6. 6. 3. 5. 3. 4. 1.
6. 8. 4. 8. 6. 4. 9. 6. 4. 3. 4. 4. 4. 3. 8. 5. 5. 6.
8. 6. 6. 4. 6. 5. 7. 3. 5. 2. 1. 7. 5. 4. 6. 7. 1. 6.
7. 6. 5. 8. 4. 3. 7. 5. 9. 4. 3. 5. 7. 3. 6. 3. 5. 9.
4. 5. 4. 8. 3. 3. 3. 4. 3. 7. 7. 4. 4. 7. 3. 3. 5. 6.
5. 5. 4. 8. 8. 2. 8. 4. 6. 2. 3. 2. 8. 5. 7. 7. 8. 7.
4. 7. 4. 3. 6. 8. 7. 4. 7. 7. 6. 5. 3. 7. 5. 1. 6. 7.
7. 2. 6. 5. 5. 4. 3. 6. 4. 7. 8. 5. 7. 4. 6. 6. 3. 3.
7. 4. 6. 4. 4. 5. 6. 6. 3. 6. 6. 8. 6. 7. 8. 4. 8. 7.
3. 3. 4. 9. 3. 4. 2. 7. 4. 3. 5. 4. 8. 6. 4. 5. 7. 5.
2. 6. 2. 4. 5. 3. 4. 6. 2. 8. 4. 2. 2. 6. 4. 3. 10. 6.
5. 6. 5. 3. 3. 7. 4. 4. 5. 2. 2. 3. 6. 4. 6. 6. 6. 6.
4. 2. 4. 7. 7. 3. 3. 6. 2. 4. 7. 4. 5. 5. -0. 5. 7. 4.
6. 5. 7. 5. 3. 4. 2. 3. 6. 7. 4. 7. -1. 6. 5. 6. 6. 7.
4. 4. 6. 9. 8. 6. 2. 6. 5. 2. 6. 7. 12. 8. 7. 3. 5. 6.
2. 9. 4. 4. 2. 6. 4. 7. 2. 4. 2. 4. 4. 2. 5. 5. 7. 6.
4. 4. 6. 4. 5. 7. 4. 5. 4. 8. 3. 8. 5. 2. 6. 6. 5. 6.
10. 5. 5. 7. 6. 5. 3. 2. 5. 6. 2. 5. 7. 4. 5. 7. 5. 5.
9. 8. 7. 10. 5. 4. 4. 4. 2. 7. 6. 1. 9. 3. 3. 3. 3. 4.
9. 3. 9. 7. 7. 4. 5. 8. 4. 4. 7. 4. 5. 7. 9. 3. 2. 7.
6. 4. 4. 3. 4. 4. 2. 5. 5. 9. 7. 5. 5. 5. 4. 2. 3. 6.
9. 4. 2. 4. 5. 5. 9. 5. 5. 3. 5. 3. 7. 3. 5. 2. 4. 4.
5. 5. 8. 6. 3. 6. 7. 5. 3. 3. 5. 8. 6. 6. 3. 4. 6. 5.
7. 3. 4. 4. 6. 2. 8. -0. 4. 3. 2. 7. 4. 3. 5. 10. 5. 8.
7. 4. 6. 6. 3. 5. 5. 8. 6. 5. 4. 7. 1. 5. 7. 10. 4. 5.
6. 5. 6. 4. 4. 8. 4. 6. 6. 10. 3. 8. 7. 6. 6. 3. 8. 6.
9. 3. 5. 5. 3. 5. 6. 6. 5. 4. 7. 5. 6. 6. 2. 5. 2. 7.
2. 6. 1. 8. 3. 11. 3. 4. 6. 8. 6. 6. 7. 5. 6. 5. 1. 2.
4. 4. 1. 4. 8. 3. 6. 3. 7. 4. 4. 5. 3. 6. 7. 3. 2. 6.
9. 4. 3. 1. 7. 6. 5. 6. 5. 6. 6. 6. 5. 6. 6. 1. 4. 7.
6. 2. 3. 5. 6. 5. 8. 6. 9. 4. 6. 3. 7. 3. 8. 3. 9. 4.
4. 3. 2. 5. 3. 5. 2. 4. 7. 2. 5. 3. 6. 6. 3. 8. 6. 8.
7. 2. 7. 10. 6. 4. 8. 9. 8. 4. 2. 1. 1. 3. 6. 6. 5. 4.
5. 3. 3. 6. 4. 4. 7. 6. 6. 3. 4. 8. 5. 7. 7. 7. 5. 5.
4. 4. 6. 7. 7. 6. 4. 7. 5. 2. 5. 2. 4. 3. 4. 6. 6. 8.
5. 3. 4. 5. 7. 6. 8. 6. 6. 4.]
Histogram using np.histogram()
h = np.histogram( s )
(array([ 9, 19, 70, 293, 181, 174, 215, 25, 11, 3]), array([-1. , 0.3, 1.6, 2.9, 4.2, 5.5, 6.8, 8.1, 9.4, 10.7, 12. ]))
Miscellaneous
Fetch Saved DataFrame
df = df_colors
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Get JSON from Dataframe
df.to_json( orient='records', lines=True )
{"Color":"Red","Code":"#ff0000","Alt-Code":"255\/0\/0"}
{"Color":"Green","Code":"#00ff00","Alt-Code":"0\/255\/0"}
{"Color":"Blue","Code":"#0000ff","Alt-Code":"0\/0\/255"}
Series with MultiIndex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
print(s)
first second
bar one -0.356158
two -0.384729
baz one -1.038348
two -0.019466
foo one 0.063635
two -0.238940
qux one -1.485342
two 1.291386
dtype: float64
Dataframe with MultiIndex
df = DataFrame( np.arange(12).reshape((4,3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2 ]],
columns = [['NJ', 'Ohio', 'Colorado' ]] )
NJ Ohio Colorado
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Show Dataframe with MultiIndex
for x, ndf in df.groupby(level=0):
print(ndf)
print("x:", x)
NJ Ohio Colorado
a 1 0 1 2
2 3 4 5
x: a
NJ Ohio Colorado
b 1 6 7 8
2 9 10 11
x: b
Make a Series
s = pd.Series( { 'I1': '1234', 'I2': '2345', 'I3': '3456' } )
I1 1234
I2 2345
I3 3456
dtype: object
Series to DataFrame
df = s.to_frame()
0
I1 1234
I2 2345
I3 3456
DataFrame Transposed
df.T
I1 I2 I3
0 1234 2345 3456
Pivot Tables
Make a Dataframe with Named Columns
df = pd.DataFrame(
[
["Alice", "North", 200 ],
["Alice", "South", 150 ],
["Bob", "East", 100 ],
["Bob", "West", 100 ],
["Charlie", "North", 120 ],
["Charlie", "West", 200 ],
],
columns = [ "Agent", "Region", "Sales" ]
)
Agent Region Sales
0 Alice North 200
1 Alice South 150
2 Bob East 100
3 Bob West 100
4 Charlie North 120
5 Charlie West 200
Pivot
df.pivot(index='Agent', columns='Region', values='Sales')
Region East North South West
Agent
Alice NaN 200.0 150.0 NaN
Bob 100.0 NaN NaN 100.0
Charlie NaN 120.0 NaN 200.0
Pivot
df.pivot_table(index='Agent', columns='Region', values='Sales')
Region East North South West
Agent
Alice NaN 200.0 150.0 NaN
Bob 100.0 NaN NaN 100.0
Charlie NaN 120.0 NaN 200.0
Concatenation
Make Dataframes
df1 = DataFrame( np.random.rand( 2, 3 ))
0 1 2
0 0.670060 0.355179 0.345913
1 0.411077 0.543479 0.693139
df2 = DataFrame( np.random.rand( 4, 3 ))
0 1 2
0 0.757974 0.130799 0.752522
1 0.106363 0.190630 0.539620
2 0.342775 0.085893 0.832798
3 0.222075 0.198671 0.068842
df3 = DataFrame( np.random.rand( 2, 4 ))
0 1 2 3
0 0.637892 0.928158 0.887674 0.583632
1 0.055699 0.694496 0.363749 0.389777
Concatenation Dataframes Vertically
pd.concat( [df1, df2] )
0 1 2
0 0.670060 0.355179 0.345913
1 0.411077 0.543479 0.693139
0 0.757974 0.130799 0.752522
1 0.106363 0.190630 0.539620
2 0.342775 0.085893 0.832798
3 0.222075 0.198671 0.068842
Concatenation Dataframes Horizontally
pd.concat( [df1, df3], axis=1 )
0 1 2 0 1 2 3
0 0.670060 0.355179 0.345913 0.637892 0.928158 0.887674 0.583632
1 0.411077 0.543479 0.693139 0.055699 0.694496 0.363749 0.389777
Make a Dataframe
df = DataFrame( np.random.rand( 2, 3 ))
0 1 2
0 0.047693 0.008885 0.391842
1 0.011640 0.697079 0.176610
Make a Series
s = Series( np.random.rand( 3 ))
0 0.909998
1 0.029004
2 0.108992
dtype: float64
Concat Dataframe and Series
pd.concat( [df, s.to_frame().T ] )
0 1 2
0 0.047693 0.008885 0.391842
1 0.011640 0.697079 0.176610
0 0.909998 0.029004 0.108992
Incremental, Detailed Development of Selected Concepts
With some overlap of other examples and in excruciating detail we show here the development of
selectors for drop and change. All with integer indexing.
Incremental Development of Row Selection
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.937316 0.843650 0.581525 0.469518
1 0.861697 0.648787 0.066774 0.456743
2 0.261171 0.925944 0.987221 0.905516
3 0.180667 0.269428 0.585313 0.780752
4 0.429281 0.546803 0.223258 0.501009
5 0.942636 0.306115 0.958083 0.310338
Make Explicit Boolean Row Index
bi = [True, False, True, False, True, False]
[True, False, True, False, True, False]
Dataframe Matching Boolean Row Index
df[ bi ]
0 1 2 3
0 0.937316 0.843650 0.581525 0.469518
2 0.261171 0.925944 0.987221 0.905516
4 0.429281 0.546803 0.223258 0.501009
Drop Test Column
df[2]
0 0.581525
1 0.066774
2 0.987221
3 0.585313
4 0.223258
5 0.958083
Name: 2, dtype: float64
Drop Row Condition
df[2] < .5
0 False
1 True
2 False
3 False
4 True
5 False
Name: 2, dtype: bool
Dataframe Matching Drop Condition
df[ df[2] < .5 ]
0 1 2 3
1 0.861697 0.648787 0.066774 0.456743
4 0.429281 0.546803 0.223258 0.501009
Dataframe Matching Drop Condition Index
df[ df[2] < .5 ].index
Index([1, 4], dtype='int64')
Dataframe Matching Drop Condition Index with List Expansion
df[ df[2] < .5 ].index
[1, 4]
Drop Rows Meeting Column Condition
df.drop( df[ df[2] < .5].index, axis=0 )
0 1 2 3
0 0.937316 0.843650 0.581525 0.469518
2 0.261171 0.925944 0.987221 0.905516
3 0.180667 0.269428 0.585313 0.780752
5 0.942636 0.306115 0.958083 0.310338
Change Column Values Meeting Condition
df.iloc[ df[2] < .5, 2 ] = 99999
Show Updated Dataframe
df
0 1 2 3
0 0.937316 0.843650 0.581525 0.469518
1 0.861697 0.648787 99999.000000 0.456743
2 0.261171 0.925944 0.987221 0.905516
3 0.180667 0.269428 0.585313 0.780752
4 0.429281 0.546803 99999.000000 0.501009
5 0.942636 0.306115 0.958083 0.310338
Incremental Development of Column Selection
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.287383 0.841523 0.636146 0.898033
1 0.799503 0.258190 0.966333 0.478904
2 0.131248 0.442555 0.361959 0.538689
3 0.841960 0.509757 0.743804 0.792505
4 0.227624 0.837596 0.152987 0.112047
5 0.526719 0.361483 0.140216 0.694413
Make Explicit Boolean Column Index
bi = [True, False, True, False ]
[True, False, True, False]
Dataframe Matching Explicit Boolean Column Index
df.iloc[ :, bi ]
0 2
0 0.287383 0.636146
1 0.799503 0.966333
2 0.131248 0.361959
3 0.841960 0.743804
4 0.227624 0.152987
5 0.526719 0.140216
Drop Test Row
df.iloc[2]
0 0.131248
1 0.442555
2 0.361959
3 0.538689
Name: 2, dtype: float64
Drop Column Condition
df.iloc[ 2, : ] < .5
0 True
1 True
2 True
3 False
Name: 2, dtype: bool
Dataframe Matching Drop Condition
df.loc[ :, df.iloc[ 2, : ] < .5 ]
0 1 2
0 0.287383 0.841523 0.636146
1 0.799503 0.258190 0.966333
2 0.131248 0.442555 0.361959
3 0.841960 0.509757 0.743804
4 0.227624 0.837596 0.152987
5 0.526719 0.361483 0.140216
Dataframe Matching Drop Condition Column Index
df.loc[ :, df.iloc[ 2, : ] < .5 ].columns
Index([0, 1, 2], dtype='int64')
Dataframe Matching Drop Condition Column Index with List Expansion
df.loc[ :, df.iloc[ 2, : ] < .5 ].columns
[0, 1, 2]
Drop Columns Meeting Condition
df.drop( df.loc[ :, df.iloc[ 2, : ] < .5 ].columns, axis=1 )
3
0 0.898033
1 0.478904
2 0.538689
3 0.792505
4 0.112047
5 0.694413
Change Column Values Meeting Condition
df.loc[ :, df.iloc[ 2, : ] < .5 ] = 99999
Show Updated Dataframe
df
0 1 2 3
0 99999.0 99999.0 99999.0 0.898033
1 99999.0 99999.0 99999.0 0.478904
2 99999.0 99999.0 99999.0 0.538689
3 99999.0 99999.0 99999.0 0.792505
4 99999.0 99999.0 99999.0 0.112047
5 99999.0 99999.0 99999.0 0.694413
Arithmetic on Dataframes
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.001642 0.380782 0.337040 0.094192
1 0.433359 0.639956 0.701023 0.246333
2 0.142322 0.805451 0.234232 0.913534
3 0.413745 0.809982 0.597281 0.651065
4 0.525349 0.699731 0.144319 0.271298
5 0.987510 0.513552 0.783160 0.635383
Make a Series
s = Series( np.random.rand( 4 ))
0 0.577031
1 0.677590
2 0.613750
3 0.375769
dtype: float64
Add a Constant to Every Element of Dataframe
df + 3
0 1 2 3
0 3.001642 3.380782 3.337040 3.094192
1 3.433359 3.639956 3.701023 3.246333
2 3.142322 3.805451 3.234232 3.913534
3 3.413745 3.809982 3.597281 3.651065
4 3.525349 3.699731 3.144319 3.271298
5 3.987510 3.513552 3.783160 3.635383
Add a Series to Every Row of Dataframe
df.add( s, axis='columns' )
0 1 2 3
0 0.578673 1.058373 0.950790 0.469961
1 1.010390 1.317546 1.314773 0.622101
2 0.719354 1.483042 0.847982 1.289303
3 0.990776 1.487572 1.211031 1.026834
4 1.102380 1.377322 0.758069 0.647067
5 1.564541 1.191143 1.396910 1.011152
Add a Series to Every Row of Dataframe
df + s
0 1 2 3
0 0.578673 1.058373 0.950790 0.469961
1 1.010390 1.317546 1.314773 0.622101
2 0.719354 1.483042 0.847982 1.289303
3 0.990776 1.487572 1.211031 1.026834
4 1.102380 1.377322 0.758069 0.647067
5 1.564541 1.191143 1.396910 1.011152
Add a Series to Every Column of Dataframe
df.add( s, axis='index' )
0 1 2 3
0 0.578673 0.957813 0.914071 0.671224
1 1.110949 1.317546 1.378614 0.923923
2 0.756073 1.419202 0.847982 1.527285
3 0.789513 1.185750 0.973049 1.026834
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
Add two Dataframes
df + df
0 1 2 3
0 0.003283 0.761564 0.674079 0.188385
1 0.866718 1.279912 1.402046 0.492665
2 0.284645 1.610903 0.468463 1.827069
3 0.827489 1.619963 1.194561 1.302130
4 1.050698 1.399463 0.288637 0.542596
5 1.975019 1.027105 1.566320 1.270766
Missing Data
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.665310 0.973201 0.334828 0.956764
1 0.520282 0.185836 0.601835 0.306442
2 0.929082 0.639267 0.691352 0.639253
3 0.112779 0.889961 0.306980 0.812285
4 0.520858 0.440499 0.145265 0.106547
5 0.986067 0.328122 0.827735 0.313569
Add Missing Data
df.iloc[3,2] = np.nan
df
0 1 2 3
0 0.665310 0.973201 0.334828 0.956764
1 0.520282 0.185836 0.601835 0.306442
2 0.929082 0.639267 0.691352 0.639253
3 0.112779 0.889961 NaN 0.812285
4 0.520858 0.440499 0.145265 0.106547
5 0.986067 0.328122 0.827735 0.313569
Drop Rows Having Missing Data
df.dropna()
0 1 2 3
0 0.665310 0.973201 0.334828 0.956764
1 0.520282 0.185836 0.601835 0.306442
2 0.929082 0.639267 0.691352 0.639253
4 0.520858 0.440499 0.145265 0.106547
5 0.986067 0.328122 0.827735 0.313569
Drop Columns Having Missing Data
df.dropna( axis=1 )
0 1 3
0 0.665310 0.973201 0.956764
1 0.520282 0.185836 0.306442
2 0.929082 0.639267 0.639253
3 0.112779 0.889961 0.812285
4 0.520858 0.440499 0.106547
5 0.986067 0.328122 0.313569
Fill Missing Data
df.fillna( 99 )
0 1 2 3
0 0.665310 0.973201 0.334828 0.956764
1 0.520282 0.185836 0.601835 0.306442
2 0.929082 0.639267 0.691352 0.639253
3 0.112779 0.889961 99.000000 0.812285
4 0.520858 0.440499 0.145265 0.106547
5 0.986067 0.328122 0.827735 0.313569
Replace Values
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 )).round(1)
0 1 2 3
0 0.6 0.3 0.2 0.6
1 0.7 1.0 0.5 0.6
2 0.7 0.6 0.2 0.6
3 0.4 0.4 0.8 0.9
4 0.6 0.5 0.7 0.1
5 0.3 0.4 0.9 0.1
Replace Single Value
df.replace( .3, 99 )
0 1 2 3
0 0.6 99.0 0.2 0.6
1 0.7 1.0 0.5 0.6
2 0.7 0.6 0.2 0.6
3 0.4 0.4 0.8 0.9
4 0.6 0.5 0.7 0.1
5 99.0 0.4 0.9 0.1
Replace Values in List
df.replace( [.2, .3, .4], [99, 98, 97] )
0 1 2 3
0 0.6 98.0 99.0 0.6
1 0.7 1.0 0.5 0.6
2 0.7 0.6 99.0 0.6
3 97.0 97.0 0.8 0.9
4 0.6 0.5 0.7 0.1
5 98.0 97.0 0.9 0.1
Replace Values in Dict
df.replace( {.2 : 97, .3 : 98, .4: 99} )
0 1 2 3
0 0.6 98.0 97.0 0.6
1 0.7 1.0 0.5 0.6
2 0.7 0.6 97.0 0.6
3 99.0 99.0 0.8 0.9
4 0.6 0.5 0.7 0.1
5 98.0 99.0 0.9 0.1
Replace Values with Where
df.where( df < .5, 99 )
0 1 2 3
0 99.0 0.3 0.2 99.0
1 99.0 99.0 99.0 99.0
2 99.0 99.0 0.2 99.0
3 0.4 0.4 99.0 99.0
4 99.0 99.0 99.0 0.1
5 0.3 0.4 99.0 0.1
Adding Columns
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 )).round(1)
0 1 2 3
0 0.5 0.6 0.6 0.7
1 0.4 0.7 0.8 0.6
2 0.6 0.1 0.7 0.7
3 0.0 0.8 0.8 0.7
4 0.6 0.6 0.3 0.7
5 0.4 0.8 0.5 0.6
Add Column with Assign
df.assign( Max = df.max( axis=1 ))
0 1 2 3 Max
0 0.5 0.6 0.6 0.7 0.7
1 0.4 0.7 0.8 0.6 0.8
2 0.6 0.1 0.7 0.7 0.7
3 0.0 0.8 0.8 0.7 0.8
4 0.6 0.6 0.3 0.7 0.7
5 0.4 0.8 0.5 0.6 0.8
Add Column with Explicit Assignment
df['Min'] = df.min( axis=1 )
print(df)
0 1 2 3 Min
0 0.5 0.6 0.6 0.7 0.5
1 0.4 0.7 0.8 0.6 0.4
2 0.6 0.1 0.7 0.7 0.1
3 0.0 0.8 0.8 0.7 0.0
4 0.6 0.6 0.3 0.7 0.3
5 0.4 0.8 0.5 0.6 0.4
Merging Dataframes
Make Dataframes with Labeled Rows and Columns
df1 = DataFrame( [[ 1, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R3' ], columns = ['C1', 'C2', 'C3'] )
print( df1 )
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9
df2 = DataFrame( [[ 10, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R4' ], columns = ['C1', 'C4', 'C5'] )
print( df2 )
C1 C4 C5
R1 10 2 3
R2 4 5 6
R4 7 8 9
Join Using Keys from Left Frame
pd.merge( df1, df2, how='left', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 1 2 3 NaN NaN left_only
1 4 5 6 5.0 6.0 both
2 7 8 9 8.0 9.0 both
Join Using Keys from Right Frame
pd.merge( df1, df2, how='right', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 10 NaN NaN 2 3 right_only
1 4 5.0 6.0 5 6 both
2 7 8.0 9.0 8 9 both
Join Using Union of Keys from Both Frames
pd.merge( df1, df2, how='outer', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 1 2.0 3.0 NaN NaN left_only
1 4 5.0 6.0 5.0 6.0 both
2 7 8.0 9.0 8.0 9.0 both
3 10 NaN NaN 2.0 3.0 right_only
Join Using Intersection of Keys from Both Frames
pd.merge( df1, df2, how='inner', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 4 5 6 5 6 both
1 7 8 9 8 9 both
Appendix
Python
platform.python_version()
3.13.3
Pandas
pd.__version__
2.2.3
Matplotlib
matplotlib.__version__
3.10.1
Internals for the Examples
The examples here use the Python eval() function for expressions and
exec() for statements. Expressions have a return value, which is shown following
the code example. Statements do not have a return value and show results with calls to print(). Both are executed in
the global context.
Here are simple examples of eval() and exec().
Assign value to variable with eval()
var1 = 123
123
Show variable
var1
123
Assign value to variable with exec()
var2 = 234
print( var2 )
234