Pandas Examples
Introduction
Following is an example of many features of the Pandas library for the Python language. You may think
of it as a cheat sheet of sorts but I'm reluctant to call it that because to be useful as such
a cheat sheet should be much smaller. It was developed from personal notes accumulated over many years
but just released publicaly now for the benefit of my grandson.
You can reach me using the contact from from one of my other sites:
Birdland Contact.
This is also available as a: Markdown file.
Bill Wetzel
24-Apr-2025
Preparatory
Imports
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
import matplotlib.pyplot as plt
Series
Make a Series
s = Series( np.random.rand( 10 ))
0 0.026916
1 0.899089
2 0.820587
3 0.379184
4 0.454606
5 0.629850
6 0.444346
7 0.243404
8 0.125773
9 0.825495
dtype: float64
Show Prior Series
s
0 0.026916
1 0.899089
2 0.820587
3 0.379184
4 0.454606
5 0.629850
6 0.444346
7 0.243404
8 0.125773
9 0.825495
dtype: float64
Make a Series with Named Index
s = Series( np.random.rand( 10 ), ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"] )
a 0.206310
b 0.504339
c 0.680141
d 0.010279
e 0.365725
f 0.597950
g 0.130547
h 0.952285
i 0.683855
j 0.810067
dtype: float64
Index for Series
s.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
Distinct values (rounded) in Series
s.round( 1 ).value_counts()
0.7 2
0.2 1
0.5 1
0.0 1
0.4 1
0.6 1
0.1 1
1.0 1
0.8 1
Name: count, dtype: int64
Dataframes
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.722405 0.222979 0.455354 0.728915
1 0.262425 0.625404 0.899462 0.289853
2 0.398877 0.898858 0.480237 0.278750
3 0.686958 0.374376 0.225608 0.708383
4 0.152364 0.652888 0.512709 0.698512
5 0.818334 0.465584 0.550335 0.581760
Columns in DataFrame
df.columns
[0, 1, 2, 3]
Rows (index) of DataFrame
df.index
[0, 1, 2, 3, 4, 5]
Transpose of Dataframe
df.T
0 1 2 3 4 5
0 0.722405 0.262425 0.398877 0.686958 0.152364 0.818334
1 0.222979 0.625404 0.898858 0.374376 0.652888 0.465584
2 0.455354 0.899462 0.480237 0.225608 0.512709 0.550335
3 0.728915 0.289853 0.278750 0.708383 0.698512 0.581760
Description of Dataframe
df.describe()
0 1 2 3
count 6.000000 6.000000 6.000000 6.000000
mean 0.506894 0.540015 0.520617 0.547695
std 0.273125 0.237608 0.217894 0.210419
min 0.152364 0.222979 0.225608 0.278750
25% 0.296538 0.397178 0.461575 0.362829
50% 0.542918 0.545494 0.496473 0.640136
75% 0.713544 0.646017 0.540928 0.705915
max 0.818334 0.898858 0.899462 0.728915
df.info()
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 6 non-null float64
1 1 6 non-null float64
2 2 6 non-null float64
3 3 6 non-null float64
dtypes: float64(4)
memory usage: 324.0 bytes
Shape of Dataframe
df.shape
(6, 4)
Rounded Dataframe
df.round(1)
0 1 2 3
0 0.7 0.2 0.5 0.7
1 0.3 0.6 0.9 0.3
2 0.4 0.9 0.5 0.3
3 0.7 0.4 0.2 0.7
4 0.2 0.7 0.5 0.7
5 0.8 0.5 0.6 0.6
Value Counts of Rounded Dataframe
df.round(1).value_counts()
0 1 2 3
0.2 0.7 0.5 0.7 1
0.3 0.6 0.9 0.3 1
0.4 0.9 0.5 0.3 1
0.7 0.2 0.5 0.7 1
0.4 0.2 0.7 1
0.8 0.5 0.6 0.6 1
Name: count, dtype: int64
Unique Value Counts of Rounded Dataframe
df.round(1).nunique()
0 5
1 6
2 4
3 3
dtype: int64
Rounded Selected Column
df[2].round(1)
0 0.5
1 0.9
2 0.5
3 0.2
4 0.5
5 0.6
Name: 2, dtype: float64
Value Counts of Rounded Selected Column
df[2].round(1).value_counts()
2
0.5 3
0.9 1
0.2 1
0.6 1
Name: count, dtype: int64
Unique Value Counts of Rounded Selected Column
df[2].round(1).nunique()
4
Get Array from DataFrame for Faster Iteration
df.values
[[0.72240536 0.22297917 0.45535418 0.72891476]
[0.26242474 0.62540412 0.8994617 0.28985255]
[0.39887738 0.89885752 0.48023714 0.27875013]
[0.68695812 0.37437587 0.22560782 0.70838271]
[0.15236431 0.65288795 0.51270861 0.69851235]
[0.81833438 0.46558359 0.55033471 0.58176026]]
Assign Dataframe Index to Variable
idx = df.index
[0, 1, 2, 3, 4, 5]
The following two examples are explained in detail in a later section.
Drop Rows Meeting Condition in Specified Column
df.drop( df[ df[2] < .5].index, axis=0 )
0 1 2 3
1 0.262425 0.625404 0.899462 0.289853
4 0.152364 0.652888 0.512709 0.698512
5 0.818334 0.465584 0.550335 0.581760
Drop Columns Meeting Condition in Specified Row
df.drop( df.loc[ :, df.iloc[ 2, : ] < .5 ].columns, axis=1 )
1
0 0.222979
1 0.625404
2 0.898858
3 0.374376
4 0.652888
5 0.465584
A Few Properties of DataFrame Index
idx.is_monotonic_increasing
True
idx.is_monotonic_decreasing
False
idx.has_duplicates
False
idx.nlevels
1
idx.values
[0 1 2 3 4 5]
idx.tolist()
[0, 1, 2, 3, 4, 5]
idx.min()
0
idx.max()
5
Make a DataFrame from Argument List with Labeled Rows and Columns
df = DataFrame( [[ 1, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R3' ], columns = ['C1', 'C2', 'C3'] )
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9
Assign CSV Data to a Variable
csv_data = """Color, Code, Alt-Code
R1, Red, #ff0000, 255/0/0
R2, Green, #00ff00, 0/255/0
R3, Blue, #0000ff, 0/0/255"""
Make a DataFrame from CSV Data with Labeled Rows and Columns
df = pd.read_csv( StringIO( csv_data ), header = 0, index_col = 0, skipinitialspace = True )
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Save DataFrame for Later Reference
df_colors = df
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Assign Dictionary to a Variable
dict_data = {
'c0' : [ 1, 2, 3, 4, 5 ],
'c1' : [ 2, 3, 4, 5, 6 ],
'c2' : [ 3, 4, 5, 6, 7 ]
}
Make a DataFrame from Dictionary with Labeled Columns
df = DataFrame( dict_data )
c0 c1 c2
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
Assign Dictionary to a Variable
dict_data = {
'r0' : [ 1, 2, 3, 4, 5 ],
'r1' : [ 2, 3, 4, 5, 6 ],
'r2' : [ 3, 4, 5, 6, 7 ]
}
Make a DataFrame from Dictionary with Labeled Rows
df = DataFrame.from_dict( dict_data, orient='index' )
0 1 2 3 4
r0 1 2 3 4 5
r1 2 3 4 5 6
r2 3 4 5 6 7
Assign Dictionary to a Variable
# global dict_data
dict_data = {
'r0' : { 'c0' : 1, 'c1' : 2, 'c2' : 3, 'c3' : 4, 'c4' : 5 },
'r1' : { 'c0' : 2, 'c1' : 3, 'c2' : 4, 'c3' : 5, 'c4' : 6 },
'r2' : { 'c0' : 3, 'c1' : 4, 'c2' : 5, 'c3' : 6, 'c4' : 7 }
}
Make a DataFrame from Dictionary with Labeled Rows and Columns
df = DataFrame.from_dict( dict_data, orient='index' )
c0 c1 c2 c3 c4
r0 1 2 3 4 5
r1 2 3 4 5 6
r2 3 4 5 6 7
Convert DataFrame to String
df.to_string()
c0 c1 c2 c3 c4
r0 1 2 3 4 5
r1 2 3 4 5 6
r2 3 4 5 6 7
Iterate Over Data
Make a DataFrame
df = DataFrame( [[ 1, 2 ], [3, 4]], index = [ 'R1', 'R2' ], columns = ['C1', 'C2'] )
C1 C2
R1 1 2
R2 3 4
Iterate on Columns
for n, s in df.items(): # df.iteritems() deprecated
print("n:", n)
print("s:")
print(s)
print()
n: C1
s:
R1 1
R2 3
Name: C1, dtype: int64
n: C2
s:
R1 2
R2 4
Name: C2, dtype: int64
Iterate on Rows
for n, s in df.iterrows():
print("n:", n)
print("s:")
print(s)
print()
n: R1
s:
C1 1
C2 2
Name: R1, dtype: int64
n: R2
s:
C1 3
C2 4
Name: R2, dtype: int64
Iterate on .values for Fast Access
for row in df.values:
print("row:", row)
for v in row:
print(" v:", v)
row: [1 2]
v: 1
v: 2
row: [3 4]
v: 3
v: 4
Series Indexing
Make a Series
s = Series( np.random.rand( 5 ), index = ['x0', 'x1', 'x2', 'x3', 'x4' ] )
x0 0.909917
x1 0.909049
x2 0.627205
x3 0.061164
x4 0.574769
dtype: float64
Index by Label
s[ 'x2' ]
0.6272046954527306
Index by Attribute
s.x2
0.6272046954527306
Index by Label with .loc
s.loc[ 'x2' ]
0.6272046954527306
Index by Integer with .iloc
s.iloc[ 2 ]
0.6272046954527306
DataFrame Indexing
Fetch Saved DataFrame
df = df_colors
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Index Column by Label
df[ 'Color' ]
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
Index Column by Attribute
df.Color
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
Index Column by Integer with .iloc
df.iloc[ :, 2 ]
R1 255/0/0
R2 0/255/0
R3 0/0/255
Name: Alt-Code, dtype: object
Index Row by Label with .loc
df.loc[ 'R1' ]
Color Red
Code #ff0000
Alt-Code 255/0/0
Name: R1, dtype: object
Index Row by Integer with .iloc
df.iloc[ 2 ]
Color Blue
Code #0000ff
Alt-Code 0/0/255
Name: R3, dtype: object
Index Row/Col by Labels with .loc
df.loc[ 'R1', 'Color' ]
Red
Index Row/Col by Integers with .iloc
df.iloc[ 1, 1 ]
#00ff00
Index by Contents of String
df.query( 'Color == "Green"' )
Color Code Alt-Code
R2 Green #00ff00 0/255/0
Boolean Indexing
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.189668 0.005349 0.512844 0.247548
1 0.533455 0.262488 0.751297 0.399315
2 0.169302 0.351072 0.083203 0.037990
3 0.125548 0.468077 0.056051 0.152374
4 0.171021 0.557780 0.603718 0.810708
5 0.606764 0.345117 0.337621 0.250499
Make Boolean Series on Rows
df[2] < .5
0 False
1 False
2 True
3 True
4 False
5 True
Name: 2, dtype: bool
Index Rows by Boolean Series
df[ df[2] < .5 ]
0 1 2 3
2 0.169302 0.351072 0.083203 0.037990
3 0.125548 0.468077 0.056051 0.152374
5 0.606764 0.345117 0.337621 0.250499
Index Rows by Boolean Series
df[ df[2] > .5 ]
0 1 2 3
0 0.189668 0.005349 0.512844 0.247548
1 0.533455 0.262488 0.751297 0.399315
4 0.171021 0.557780 0.603718 0.810708
Make Boolean Series on Columns
df.iloc[:,2] < .5
0 False
1 False
2 True
3 True
4 False
5 True
Name: 2, dtype: bool
Index Columns by Boolean Series
df[ df.iloc[:,2] < .5 ]
0 1 2 3
2 0.169302 0.351072 0.083203 0.037990
3 0.125548 0.468077 0.056051 0.152374
5 0.606764 0.345117 0.337621 0.250499
Index Columns by Boolean Series
df[ df.iloc[:,2] > .5 ]
0 1 2 3
0 0.189668 0.005349 0.512844 0.247548
1 0.533455 0.262488 0.751297 0.399315
4 0.171021 0.557780 0.603718 0.810708
Operations on Series / Dataframes
Get Series from Saved Dataframe
s = df_colors[ 'Color' ]
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
map() - execute function on each element of series
s.map( lambda x: print( x ))
Red
Green
Blue
apply() - execute function on each column of dataframe
df_colors.apply( lambda x: print( x ), axis=0 )
R1 Red
R2 Green
R3 Blue
Name: Color, dtype: object
R1 #ff0000
R2 #00ff00
R3 #0000ff
Name: Code, dtype: object
R1 255/0/0
R2 0/255/0
R3 0/0/255
Name: Alt-Code, dtype: object
apply() - execute function on each row of dataframe
df_colors.apply( lambda x: print( x ), axis=1 )
Color Red
Code #ff0000
Alt-Code 255/0/0
Name: R1, dtype: object
Color Green
Code #00ff00
Alt-Code 0/255/0
Name: R2, dtype: object
Color Blue
Code #0000ff
Alt-Code 0/0/255
Name: R3, dtype: object
map() - execute function on each element of dataframe
df_colors.map( lambda x: print( x ))
Red
Green
Blue
#ff0000
#00ff00
#0000ff
255/0/0
0/255/0
0/0/255
Sorting Series
Make Series
s = Series( np.random.rand( 5 ), index=[ 'Ix', 'Ie', 'Ia', 'It', 'Io'])
Ix 0.413698
Ie 0.571940
Ia 0.670323
It 0.909708
Io 0.307946
dtype: float64
Sort by Index
s.sort_index()
Ia 0.670323
Ie 0.571940
Io 0.307946
It 0.909708
Ix 0.413698
dtype: float64
Sort by Value
s.sort_values()
Io 0.307946
Ix 0.413698
Ie 0.571940
Ia 0.670323
It 0.909708
dtype: float64
Sorting Dataframe
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 ), index=['Rc', 'Ra', 'Rx', 'Rj', 'Rw', 'Rn' ], columns = ['Ce', 'Cr', 'Ci', 'Cq' ] )
Ce Cr Ci Cq
Rc 0.275871 0.886893 0.729471 0.323904
Ra 0.086272 0.128443 0.756100 0.417623
Rx 0.052068 0.286391 0.072795 0.913555
Rj 0.916973 0.653368 0.643975 0.902830
Rw 0.411599 0.386706 0.118761 0.317172
Rn 0.975480 0.375994 0.468117 0.317176
Sort by Row Index
df.sort_index()
Ce Cr Ci Cq
Ra 0.086272 0.128443 0.756100 0.417623
Rc 0.275871 0.886893 0.729471 0.323904
Rj 0.916973 0.653368 0.643975 0.902830
Rn 0.975480 0.375994 0.468117 0.317176
Rw 0.411599 0.386706 0.118761 0.317172
Rx 0.052068 0.286391 0.072795 0.913555
Sort by Column Index
df.sort_index( axis=1 )
Ce Ci Cq Cr
Rc 0.275871 0.729471 0.323904 0.886893
Ra 0.086272 0.756100 0.417623 0.128443
Rx 0.052068 0.072795 0.913555 0.286391
Rj 0.916973 0.643975 0.902830 0.653368
Rw 0.411599 0.118761 0.317172 0.386706
Rn 0.975480 0.468117 0.317176 0.375994
Sort by Column Value
df.sort_values( by='Cr' )
Ce Cr Ci Cq
Ra 0.086272 0.128443 0.756100 0.417623
Rx 0.052068 0.286391 0.072795 0.913555
Rn 0.975480 0.375994 0.468117 0.317176
Rw 0.411599 0.386706 0.118761 0.317172
Rj 0.916973 0.653368 0.643975 0.902830
Rc 0.275871 0.886893 0.729471 0.323904
Sort by Row Value
df.sort_values( by='Rj', axis=1 )
Ci Cr Cq Ce
Rc 0.729471 0.886893 0.323904 0.275871
Ra 0.756100 0.128443 0.417623 0.086272
Rx 0.072795 0.286391 0.913555 0.052068
Rj 0.643975 0.653368 0.902830 0.916973
Rw 0.118761 0.386706 0.317172 0.411599
Rn 0.468117 0.375994 0.317176 0.975480
Summarize Functions on Dataframe
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.269477 0.094294 0.495846 0.815584
1 0.338004 0.827806 0.834389 0.421820
2 0.265565 0.832254 0.895707 0.674183
3 0.895362 0.279115 0.000134 0.884850
4 0.299228 0.042413 0.903832 0.588768
5 0.822910 0.451138 0.526698 0.213542
Sum
df.sum()
0 2.890545
1 2.527020
2 3.656606
3 3.598748
dtype: float64
Count
df.count()
0 6
1 6
2 6
3 6
dtype: int64
df.median()
0 0.318616
1 0.365126
2 0.680543
3 0.631475
dtype: float64
Quantile
df.quantile([0.25, 0.5, 0.75])
0 1 2 3
0.25 0.276915 0.140499 0.503559 0.463557
0.50 0.318616 0.365126 0.680543 0.631475
0.75 0.701684 0.733639 0.880377 0.780234
Min
df.min()
0 0.265565
1 0.042413
2 0.000134
3 0.213542
dtype: float64
Max
df.max()
0 0.895362
1 0.832254
2 0.903832
3 0.884850
dtype: float64
Mean
df.mean()
0 0.481758
1 0.421170
2 0.609434
3 0.599791
dtype: float64
Var
df.var()
0 0.086648
1 0.121145
2 0.122044
3 0.062869
dtype: float64
Std
df.std()
0 0.294360
1 0.348059
2 0.349348
3 0.250736
dtype: float64
Variations of Summarize Functions Shown for sum()
Sum Columns
df.sum()
0 2.890545
1 2.527020
2 3.656606
3 3.598748
dtype: float64
Sum Rows
df.sum( axis=1 )
0 1.675201
1 2.422020
2 2.667709
3 2.059461
4 1.834240
5 2.014288
dtype: float64
Sum Entire Dataframe
df.sum( axis=1 ).sum()
12.6729190770787
Histograms
Make Dataframe
df = DataFrame( np.random.normal( loc = 5, scale = 2, size = [1000, 4] ), dtype = float ).round()
0 1 2 3
0 9.0 7.0 3.0 8.0
1 3.0 6.0 2.0 8.0
2 6.0 4.0 4.0 2.0
3 6.0 2.0 9.0 3.0
4 4.0 7.0 6.0 11.0
.. ... ... ... ...
995 5.0 6.0 4.0 1.0
996 1.0 5.0 4.0 5.0
997 5.0 1.0 4.0 3.0
998 4.0 6.0 6.0 2.0
999 7.0 5.0 5.0 4.0
[1000 rows x 4 columns]
Histogram using pd.value_counts
s = df.apply( lambda x: x.value_counts()).fillna(0)
0 1 2 3
-3.0 0.0 1.0 0.0 0.0
-2.0 1.0 0.0 2.0 0.0
-1.0 6.0 1.0 2.0 6.0
-0.0 8.0 8.0 8.0 5.0
1.0 28.0 25.0 33.0 26.0
2.0 65.0 70.0 67.0 71.0
3.0 128.0 111.0 123.0 112.0
4.0 158.0 158.0 177.0 177.0
5.0 176.0 201.0 202.0 218.0
6.0 174.0 182.0 166.0 157.0
7.0 140.0 121.0 120.0 116.0
8.0 77.0 91.0 65.0 71.0
9.0 32.0 24.0 23.0 30.0
10.0 5.0 5.0 11.0 8.0
11.0 2.0 2.0 1.0 3.0
Histogram Using Pandas .hist()
h = df.hist( sharex=True, sharey=True, linewidth=1, figsize=(Fig_X, Fig_Y) )
Get Series from Column 0 of Dataframe
s = df.iloc[ :, 0].values
[ 9. 3. 6. 6. 4. 2. 6. 3. 2. 7. 6. 9. 6. 5. 4. 9. 8. 7.
4. 2. 3. 5. 8. 6. 3. 6. 3. 6. 5. 2. 6. 4. 6. 2. 7. 4.
6. 6. 7. 8. 8. 4. 8. 9. 4. 5. 3. 5. 8. 5. -1. 5. 7. 5.
3. 6. 4. 7. 5. 3. 4. 5. 5. 4. 7. 4. 4. 2. 4. 2. 1. 8.
7. 5. 4. 5. 7. 4. 3. 2. 4. 8. 6. 4. 4. 8. 8. 7. 7. 5.
3. 2. 6. 5. 7. 2. 3. 6. 7. 4. 3. 4. 6. 8. -0. 6. 8. 6.
3. 4. 9. 8. 7. 5. 6. 8. 6. 6. 2. 2. 3. 4. 4. 6. 7. 7.
2. 7. 2. 4. 7. 2. 6. 9. 4. 6. 6. 7. -0. 4. 6. 7. 6. 5.
5. 3. 6. 5. 7. 5. 6. 6. 3. 7. 3. 5. 3. 4. 4. 8. 4. 9.
7. 1. 3. 7. 2. 1. 2. 6. 3. 5. 6. 6. 4. 3. 7. 4. 6. 4.
6. 5. 6. 6. 5. 8. 7. 5. 6. 3. -1. 6. 5. 5. 6. 6. 7. 6.
6. 6. 8. 7. 6. 6. 7. 2. 4. 7. 1. 5. 8. 1. 3. 7. 5. 3.
2. 3. 2. 5. 5. 5. 3. 5. 8. 8. 7. 4. 7. 7. 3. 5. 10. 9.
8. 7. 6. 9. 7. 5. 5. 6. 4. 6. 6. 4. 5. -1. 6. 4. 5. 4.
3. 4. 1. 3. 4. 7. 3. 9. 5. 6. 5. 6. 3. 3. 4. 7. 5. 4.
5. 4. 4. 7. 5. 5. 6. 6. 7. 5. 5. 6. 5. 4. 6. 6. 8. 1.
1. 9. 4. 2. 4. 7. 6. 6. 5. 6. 6. 5. 6. 5. 4. 5. 2. 3.
2. 8. 6. 5. 6. 6. 9. 0. 4. 5. 6. 3. 6. 5. 7. 9. 3. 2.
3. 6. 5. 4. 7. 4. 7. 7. 6. 4. 9. 5. 5. 6. 8. 5. 8. 5.
11. 8. 5. 7. 2. 5. 5. 6. 5. 2. 3. 9. 8. 7. 3. 4. 3. 3.
6. 3. 3. 5. 7. 5. 5. 3. 7. 2. 7. 7. 6. 2. 7. 6. 5. 3.
5. 4. 3. 7. 5. 3. 5. 8. 5. 3. 4. 6. 6. 5. 4. 4. 3. 7.
7. 4. 4. 7. 7. 5. 8. 3. 8. 4. 3. 4. 2. 7. 5. 6. 5. 4.
7. 5. -1. 8. 9. 8. 8. 7. 4. 3. 8. 5. 6. 4. 4. 3. 4. 7.
4. 4. 7. 6. 8. 4. 5. 3. 4. 7. 5. 4. 4. 8. 2. 5. 5. 7.
8. 3. 3. 8. 6. 4. 3. 9. 4. 5. 4. 4. 3. 7. 9. 2. 4. 7.
6. 6. 6. 3. 5. 2. 4. 4. 5. 6. 8. 8. 0. 3. -1. 4. 2. 3.
5. 7. 9. 7. 8. 4. 6. 4. 6. 4. 3. 1. 3. 4. 4. 5. 4. 5.
5. 3. 5. 3. 3. 8. 3. 5. 3. 5. 5. 7. 1. 7. 7. 3. 5. 3.
7. 5. 4. 5. 3. 6. 4. 6. 6. 6. 3. 6. 10. 5. 7. 5. 6. 3.
6. 4. 1. 6. 4. 7. 2. 6. 6. 6. 4. 4. 6. 3. 5. 6. 1. 3.
5. 5. 3. 7. 8. 2. 4. 8. 7. 4. 4. 5. 4. 3. 5. 6. 6. 6.
5. 7. 3. 7. 3. 2. 5. 8. -1. 4. 6. 4. 5. 3. 6. 7. 4. 4.
6. 8. 4. 2. 8. 4. 7. 3. 2. 4. 7. 3. 6. 2. 7. 5. 6. 6.
7. 5. 5. 3. 7. 5. 4. 5. 4. 3. 7. -0. 7. 5. 5. 4. 8. 2.
2. 4. 2. 3. 4. 6. 4. 2. 5. 7. 6. 7. 8. 8. 6. 7. 5. 8.
4. 3. 2. 5. 9. 8. 6. 7. 5. 8. 5. 3. 2. 4. 5. 3. 5. 2.
7. 5. 5. 2. 7. 6. 6. 7. 3. 7. 5. 4. 8. 4. 8. 9. 9. 10.
6. 1. 6. 5. 0. 5. 7. 3. 4. 3. 8. 5. 4. 4. 4. 3. 10. 3.
4. 4. 11. 7. 2. 5. 7. 5. 7. 4. 1. 3. 5. 3. 6. 3. 9. 2.
5. 3. 6. 7. 6. 8. 8. 6. 2. 6. 3. 2. 4. 5. 7. 0. 4. 8.
5. 2. 6. 9. 7. 7. 2. 4. 2. 9. 3. 4. 6. 6. 3. 5. 2. 8.
-2. 7. 4. 6. 5. 4. 7. 3. 3. 5. 7. 7. 5. 5. 6. 6. 2. 6.
2. 6. 3. 8. 7. 8. 1. 7. 3. 4. 7. 8. 6. 4. 3. 1. 7. 7.
3. 7. 6. 5. 7. 3. 5. 6. 3. 9. 3. 2. 9. 8. 4. 3. 4. 6.
7. 3. 3. 3. 6. 1. 9. 5. 6. 3. 5. 5. 7. 3. 4. 3. 7. 4.
7. 5. 5. 6. 5. 6. 4. 4. 3. 9. 2. 5. 6. 7. 8. 6. 10. 7.
5. 5. 6. 7. 6. 3. 6. 4. 7. 9. 6. 5. 1. 5. 6. 6. 4. 4.
3. 3. 7. 2. 5. 9. 4. 7. 8. 4. 7. 5. 6. 6. 1. 5. 5. 3.
6. 4. 3. 4. 7. 7. 4. 5. 6. 2. 7. 2. 7. 6. 6. 7. 6. 6.
6. 4. 8. 6. 1. 4. 7. 4. 7. 7. 4. 4. 5. 6. 7. 4. 4. 2.
5. 5. 8. 8. 6. 3. 5. 0. 6. 8. 6. 1. 5. 5. 3. 5. 1. 4.
7. 7. 4. 6. 5. 7. 5. 5. 1. 6. 5. 7. 6. 8. 7. 4. 6. 6.
1. 8. 6. 8. 2. 6. 5. 5. 6. 6. 3. 5. 2. 1. 3. 9. 6. 5.
4. 5. 4. 6. 5. 5. 1. 4. 4. 6. 8. 4. 3. 8. 3. 3. 8. 7.
5. 5. 1. 8. 6. 5. 1. 5. 4. 7.]
Histogram using np.histogram()
h = np.histogram( s )
(array([ 7, 8, 28, 193, 158, 176, 314, 77, 32, 7]), array([-2. , -0.7, 0.6, 1.9, 3.2, 4.5, 5.8, 7.1, 8.4, 9.7, 11. ]))
Miscellaneous
Fetch Saved DataFrame
df = df_colors
Color Code Alt-Code
R1 Red #ff0000 255/0/0
R2 Green #00ff00 0/255/0
R3 Blue #0000ff 0/0/255
Get JSON from Dataframe
df.to_json( orient='records', lines=True )
{"Color":"Red","Code":"#ff0000","Alt-Code":"255\/0\/0"}
{"Color":"Green","Code":"#00ff00","Alt-Code":"0\/255\/0"}
{"Color":"Blue","Code":"#0000ff","Alt-Code":"0\/0\/255"}
Series with MultiIndex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
print(s)
first second
bar one 0.082317
two -0.291566
baz one -0.849155
two -0.374036
foo one 3.025427
two -0.671678
qux one -0.698915
two 0.380036
dtype: float64
Dataframe with MultiIndex
df = DataFrame( np.arange(12).reshape((4,3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2 ]],
columns = [['NJ', 'Ohio', 'Colorado' ]] )
NJ Ohio Colorado
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Show Dataframe with MultiIndex
for x, ndf in df.groupby(level=0):
print(ndf)
print("x:", x)
NJ Ohio Colorado
a 1 0 1 2
2 3 4 5
x: a
NJ Ohio Colorado
b 1 6 7 8
2 9 10 11
x: b
Make a Series
s = pd.Series( { 'I1': '1234', 'I2': '2345', 'I3': '3456' } )
I1 1234
I2 2345
I3 3456
dtype: object
Series to DataFrame
df = s.to_frame()
0
I1 1234
I2 2345
I3 3456
DataFrame Transposed
df.T
I1 I2 I3
0 1234 2345 3456
Pivot Tables
Make a Dataframe with Named Columns
df = pd.DataFrame(
[
["Alice", "North", 200 ],
["Alice", "South", 150 ],
["Bob", "East", 100 ],
["Bob", "West", 100 ],
["Charlie", "North", 120 ],
["Charlie", "West", 200 ],
],
columns = [ "Agent", "Region", "Sales" ]
)
Agent Region Sales
0 Alice North 200
1 Alice South 150
2 Bob East 100
3 Bob West 100
4 Charlie North 120
5 Charlie West 200
Pivot
df.pivot(index='Agent', columns='Region', values='Sales')
Region East North South West
Agent
Alice NaN 200.0 150.0 NaN
Bob 100.0 NaN NaN 100.0
Charlie NaN 120.0 NaN 200.0
Pivot
df.pivot_table(index='Agent', columns='Region', values='Sales')
Region East North South West
Agent
Alice NaN 200.0 150.0 NaN
Bob 100.0 NaN NaN 100.0
Charlie NaN 120.0 NaN 200.0
Concatenation
Make Dataframes
df1 = DataFrame( np.random.rand( 2, 3 ))
0 1 2
0 0.440904 0.256521 0.587847
1 0.306891 0.165160 0.803135
df2 = DataFrame( np.random.rand( 4, 3 ))
0 1 2
0 0.130194 0.769418 0.654715
1 0.911438 0.554571 0.296708
2 0.862839 0.067075 0.131169
3 0.836283 0.491386 0.153753
df3 = DataFrame( np.random.rand( 2, 4 ))
0 1 2 3
0 0.494061 0.619880 0.435683 0.047196
1 0.083486 0.893137 0.880059 0.356199
Concatenation Dataframes Vertically
pd.concat( [df1, df2] )
0 1 2
0 0.440904 0.256521 0.587847
1 0.306891 0.165160 0.803135
0 0.130194 0.769418 0.654715
1 0.911438 0.554571 0.296708
2 0.862839 0.067075 0.131169
3 0.836283 0.491386 0.153753
Concatenation Dataframes Horizontally
pd.concat( [df1, df3], axis=1 )
0 1 2 0 1 2 3
0 0.440904 0.256521 0.587847 0.494061 0.619880 0.435683 0.047196
1 0.306891 0.165160 0.803135 0.083486 0.893137 0.880059 0.356199
Make a Dataframe
df = DataFrame( np.random.rand( 2, 3 ))
0 1 2
0 0.576904 0.881436 0.335952
1 0.026668 0.814627 0.499610
Make a Series
s = Series( np.random.rand( 3 ))
0 0.821551
1 0.306187
2 0.408363
dtype: float64
Concat Dataframe and Series
pd.concat( [df, s.to_frame().T ] )
0 1 2
0 0.576904 0.881436 0.335952
1 0.026668 0.814627 0.499610
0 0.821551 0.306187 0.408363
Incremental, Detailed Development of Selected Concepts
With some overlap of other examples and in excruciating detail we show here the development of
selectors for drop and change. All with integer indexing.
Incremental Development of Row Selection
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.589165 0.094179 0.057888 0.285678
1 0.397765 0.302189 0.626236 0.032388
2 0.097558 0.984696 0.597267 0.897242
3 0.985481 0.093285 0.102098 0.879075
4 0.543510 0.030003 0.685510 0.420352
5 0.846434 0.647452 0.306411 0.429671
Make Explicit Boolean Row Index
bi = [True, False, True, False, True, False]
[True, False, True, False, True, False]
Dataframe Matching Boolean Row Index
df[ bi ]
0 1 2 3
0 0.589165 0.094179 0.057888 0.285678
2 0.097558 0.984696 0.597267 0.897242
4 0.543510 0.030003 0.685510 0.420352
Drop Test Column
df[2]
0 0.057888
1 0.626236
2 0.597267
3 0.102098
4 0.685510
5 0.306411
Name: 2, dtype: float64
Drop Row Condition
df[2] < .5
0 True
1 False
2 False
3 True
4 False
5 True
Name: 2, dtype: bool
Dataframe Matching Drop Condition
df[ df[2] < .5 ]
0 1 2 3
0 0.589165 0.094179 0.057888 0.285678
3 0.985481 0.093285 0.102098 0.879075
5 0.846434 0.647452 0.306411 0.429671
Dataframe Matching Drop Condition Index
df[ df[2] < .5 ].index
Index([0, 3, 5], dtype='int64')
Dataframe Matching Drop Condition Index with List Expansion
df[ df[2] < .5 ].index
[0, 3, 5]
Drop Rows Meeting Column Condition
df.drop( df[ df[2] < .5].index, axis=0 )
0 1 2 3
1 0.397765 0.302189 0.626236 0.032388
2 0.097558 0.984696 0.597267 0.897242
4 0.543510 0.030003 0.685510 0.420352
Change Column Values Meeting Condition
df.iloc[ df[2] < .5, 2 ] = 99999
Show Updated Dataframe
df
0 1 2 3
0 0.589165 0.094179 99999.000000 0.285678
1 0.397765 0.302189 0.626236 0.032388
2 0.097558 0.984696 0.597267 0.897242
3 0.985481 0.093285 99999.000000 0.879075
4 0.543510 0.030003 0.685510 0.420352
5 0.846434 0.647452 99999.000000 0.429671
Incremental Development of Column Selection
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.025985 0.756718 0.706298 0.717380
1 0.540794 0.097740 0.087657 0.529131
2 0.316994 0.539271 0.035169 0.362495
3 0.247725 0.168063 0.758273 0.332343
4 0.127355 0.355506 0.956690 0.128359
5 0.189782 0.626763 0.342883 0.288338
Make Explicit Boolean Column Index
bi = [True, False, True, False ]
[True, False, True, False]
Dataframe Matching Explicit Boolean Column Index
df.iloc[ :, bi ]
0 2
0 0.025985 0.706298
1 0.540794 0.087657
2 0.316994 0.035169
3 0.247725 0.758273
4 0.127355 0.956690
5 0.189782 0.342883
Drop Test Row
df.iloc[2]
0 0.316994
1 0.539271
2 0.035169
3 0.362495
Name: 2, dtype: float64
Drop Column Condition
df.iloc[ 2, : ] < .5
0 True
1 False
2 True
3 True
Name: 2, dtype: bool
Dataframe Matching Drop Condition
df.loc[ :, df.iloc[ 2, : ] < .5 ]
0 2 3
0 0.025985 0.706298 0.717380
1 0.540794 0.087657 0.529131
2 0.316994 0.035169 0.362495
3 0.247725 0.758273 0.332343
4 0.127355 0.956690 0.128359
5 0.189782 0.342883 0.288338
Dataframe Matching Drop Condition Column Index
df.loc[ :, df.iloc[ 2, : ] < .5 ].columns
Index([0, 2, 3], dtype='int64')
Dataframe Matching Drop Condition Column Index with List Expansion
df.loc[ :, df.iloc[ 2, : ] < .5 ].columns
[0, 2, 3]
Drop Columns Meeting Condition
df.drop( df.loc[ :, df.iloc[ 2, : ] < .5 ].columns, axis=1 )
1
0 0.756718
1 0.097740
2 0.539271
3 0.168063
4 0.355506
5 0.626763
Change Column Values Meeting Condition
df.loc[ :, df.iloc[ 2, : ] < .5 ] = 99999
Show Updated Dataframe
df
0 1 2 3
0 99999.0 0.756718 99999.0 99999.0
1 99999.0 0.097740 99999.0 99999.0
2 99999.0 0.539271 99999.0 99999.0
3 99999.0 0.168063 99999.0 99999.0
4 99999.0 0.355506 99999.0 99999.0
5 99999.0 0.626763 99999.0 99999.0
Arithmetic on Dataframes
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.028382 0.921065 0.317498 0.942125
1 0.038418 0.089952 0.807904 0.532124
2 0.575442 0.596120 0.466168 0.299828
3 0.525795 0.534371 0.000020 0.763198
4 0.481748 0.024781 0.599480 0.223577
5 0.077146 0.818834 0.655083 0.170712
Make a Series
s = Series( np.random.rand( 4 ))
0 0.754740
1 0.244048
2 0.573210
3 0.644877
dtype: float64
Add a Constant to Every Element of Dataframe
df + 3
0 1 2 3
0 3.028382 3.921065 3.317498 3.942125
1 3.038418 3.089952 3.807904 3.532124
2 3.575442 3.596120 3.466168 3.299828
3 3.525795 3.534371 3.000020 3.763198
4 3.481748 3.024781 3.599480 3.223577
5 3.077146 3.818834 3.655083 3.170712
Add a Series to Every Row of Dataframe
df.add( s, axis='columns' )
0 1 2 3
0 0.783122 1.165113 0.890708 1.587002
1 0.793158 0.334000 1.381114 1.177001
2 1.330182 0.840168 1.039378 0.944705
3 1.280535 0.778419 0.573230 1.408075
4 1.236488 0.268829 1.172690 0.868454
5 0.831886 1.062883 1.228293 0.815589
Add a Series to Every Row of Dataframe
df + s
0 1 2 3
0 0.783122 1.165113 0.890708 1.587002
1 0.793158 0.334000 1.381114 1.177001
2 1.330182 0.840168 1.039378 0.944705
3 1.280535 0.778419 0.573230 1.408075
4 1.236488 0.268829 1.172690 0.868454
5 0.831886 1.062883 1.228293 0.815589
Add a Series to Every Column of Dataframe
df.add( s, axis='index' )
0 1 2 3
0 0.783122 1.675805 1.072238 1.696865
1 0.282467 0.334000 1.051952 0.776172
2 1.148652 1.169330 1.039378 0.873038
3 1.170672 1.179248 0.644897 1.408075
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
Add two Dataframes
df + df
0 1 2 3
0 0.056763 1.842130 0.634996 1.884250
1 0.076837 0.179903 1.615808 1.064248
2 1.150884 1.192240 0.932336 0.599656
3 1.051590 1.068742 0.000041 1.526397
4 0.963495 0.049562 1.198960 0.447154
5 0.154291 1.637669 1.310166 0.341424
Missing Data
Make a Dataframe
df = DataFrame( np.random.rand( 6, 4 ))
0 1 2 3
0 0.080150 0.082561 0.835291 0.020266
1 0.059493 0.217728 0.871861 0.909545
2 0.795321 0.711423 0.560041 0.984722
3 0.462916 0.819765 0.917924 0.137840
4 0.467936 0.623530 0.952643 0.583052
5 0.331254 0.313015 0.729224 0.525719
Add Missing Data
df.iloc[3,2] = np.nan
df
0 1 2 3
0 0.080150 0.082561 0.835291 0.020266
1 0.059493 0.217728 0.871861 0.909545
2 0.795321 0.711423 0.560041 0.984722
3 0.462916 0.819765 NaN 0.137840
4 0.467936 0.623530 0.952643 0.583052
5 0.331254 0.313015 0.729224 0.525719
Drop Rows Having Missing Data
df.dropna()
0 1 2 3
0 0.080150 0.082561 0.835291 0.020266
1 0.059493 0.217728 0.871861 0.909545
2 0.795321 0.711423 0.560041 0.984722
4 0.467936 0.623530 0.952643 0.583052
5 0.331254 0.313015 0.729224 0.525719
Drop Columns Having Missing Data
df.dropna( axis=1 )
0 1 3
0 0.080150 0.082561 0.020266
1 0.059493 0.217728 0.909545
2 0.795321 0.711423 0.984722
3 0.462916 0.819765 0.137840
4 0.467936 0.623530 0.583052
5 0.331254 0.313015 0.525719
Fill Missing Data
df.fillna( 99 )
0 1 2 3
0 0.080150 0.082561 0.835291 0.020266
1 0.059493 0.217728 0.871861 0.909545
2 0.795321 0.711423 0.560041 0.984722
3 0.462916 0.819765 99.000000 0.137840
4 0.467936 0.623530 0.952643 0.583052
5 0.331254 0.313015 0.729224 0.525719
Replace Values
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 )).round(1)
0 1 2 3
0 0.5 0.5 0.8 0.8
1 0.6 0.7 0.1 0.2
2 0.9 0.1 0.2 0.7
3 1.0 0.1 0.6 0.5
4 1.0 0.7 0.6 1.0
5 0.4 1.0 0.1 0.4
Replace Single Value
df.replace( .3, 99 )
0 1 2 3
0 0.5 0.5 0.8 0.8
1 0.6 0.7 0.1 0.2
2 0.9 0.1 0.2 0.7
3 1.0 0.1 0.6 0.5
4 1.0 0.7 0.6 1.0
5 0.4 1.0 0.1 0.4
Replace Values in List
df.replace( [.2, .3, .4], [99, 98, 97] )
0 1 2 3
0 0.5 0.5 0.8 0.8
1 0.6 0.7 0.1 99.0
2 0.9 0.1 99.0 0.7
3 1.0 0.1 0.6 0.5
4 1.0 0.7 0.6 1.0
5 97.0 1.0 0.1 97.0
Replace Values in Dict
df.replace( {.2 : 97, .3 : 98, .4: 99} )
0 1 2 3
0 0.5 0.5 0.8 0.8
1 0.6 0.7 0.1 97.0
2 0.9 0.1 97.0 0.7
3 1.0 0.1 0.6 0.5
4 1.0 0.7 0.6 1.0
5 99.0 1.0 0.1 99.0
Replace Values with Where
df.where( df < .5, 99 )
0 1 2 3
0 99.0 99.0 99.0 99.0
1 99.0 99.0 0.1 0.2
2 99.0 0.1 0.2 99.0
3 99.0 0.1 99.0 99.0
4 99.0 99.0 99.0 99.0
5 0.4 99.0 0.1 0.4
Adding Columns
Make Dataframe
df = DataFrame( np.random.rand( 6, 4 )).round(1)
0 1 2 3
0 0.5 0.2 0.6 0.4
1 0.2 1.0 0.4 0.2
2 0.0 0.9 0.9 0.9
3 0.8 0.8 0.4 0.8
4 0.1 0.3 0.3 0.9
5 0.6 0.2 0.1 0.8
Add Column with Assign
df.assign( Max = df.max( axis=1 ))
0 1 2 3 Max
0 0.5 0.2 0.6 0.4 0.6
1 0.2 1.0 0.4 0.2 1.0
2 0.0 0.9 0.9 0.9 0.9
3 0.8 0.8 0.4 0.8 0.8
4 0.1 0.3 0.3 0.9 0.9
5 0.6 0.2 0.1 0.8 0.8
Add Column with Explicit Assignment
df['Min'] = df.min( axis=1 )
print(df)
0 1 2 3 Min
0 0.5 0.2 0.6 0.4 0.2
1 0.2 1.0 0.4 0.2 0.2
2 0.0 0.9 0.9 0.9 0.0
3 0.8 0.8 0.4 0.8 0.4
4 0.1 0.3 0.3 0.9 0.1
5 0.6 0.2 0.1 0.8 0.1
Merging Dataframes
Make Dataframes with Labeled Rows and Columns
df1 = DataFrame( [[ 1, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R3' ], columns = ['C1', 'C2', 'C3'] )
print( df1 )
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9
df2 = DataFrame( [[ 10, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R4' ], columns = ['C1', 'C4', 'C5'] )
print( df2 )
C1 C4 C5
R1 10 2 3
R2 4 5 6
R4 7 8 9
Join Using Keys from Left Frame
pd.merge( df1, df2, how='left', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 1 2 3 NaN NaN left_only
1 4 5 6 5.0 6.0 both
2 7 8 9 8.0 9.0 both
Join Using Keys from Right Frame
pd.merge( df1, df2, how='right', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 10 NaN NaN 2 3 right_only
1 4 5.0 6.0 5 6 both
2 7 8.0 9.0 8 9 both
Join Using Union of Keys from Both Frames
pd.merge( df1, df2, how='outer', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 1 2.0 3.0 NaN NaN left_only
1 4 5.0 6.0 5.0 6.0 both
2 7 8.0 9.0 8.0 9.0 both
3 10 NaN NaN 2.0 3.0 right_only
Join Using Intersection of Keys from Both Frames
pd.merge( df1, df2, how='inner', on='C1', indicator=True)
C1 C2 C3 C4 C5 _merge
0 4 5 6 5 6 both
1 7 8 9 8 9 both
Appendix
Python
platform.python_version()
3.14.4
Pandas
pd.__version__
2.3.3
Matplotlib
matplotlib.__version__
3.10.9
Internals for the Examples
The examples here use the Python eval() function for expressions and
exec() for statements. Expressions have a return value, which is shown following
the code example. Statements do not have a return value and show results with calls to print(). Both are executed in
the global context.
Here are simple examples of eval() and exec().
Assign value to variable with eval()
var1 = 123
123
Show variable
var1
123
Assign value to variable with exec()
var2 = 234
print( var2 )
234