Pandas Examples

Introduction

Following is an example of many features of the Pandas library for the Python language. You may think of it as a cheat sheet of sorts but I'm reluctant to call it that because to be useful as such a cheat sheet should be much smaller. It was developed from personal notes accumulated over many years but just released publicaly now for the benefit of my grandson.

You can reach me using the contact from from one of my other sites: Birdland Contact.
This is also available as a: Markdown file.

Bill Wetzel
24-Apr-2025

Preparatory

Imports

import numpy as np import pandas as pd from pandas import DataFrame, Series import matplotlib import matplotlib.pyplot as plt

Series

Make a Series

s = Series( np.random.rand( 10 ))

0 0.994610 1 0.164102 2 0.254481 3 0.760523 4 0.581324 5 0.887759 6 0.550993 7 0.056151 8 0.553723 9 0.028766 dtype: float64

Show Prior Series

0 0.994610 1 0.164102 2 0.254481 3 0.760523 4 0.581324 5 0.887759 6 0.550993 7 0.056151 8 0.553723 9 0.028766 dtype: float64

Make a Series with Named Index

s = Series( np.random.rand( 10 ), ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"] )

a 0.895349 b 0.506245 c 0.959434 d 0.718147 e 0.874011 f 0.432590 g 0.677948 h 0.918245 i 0.032920 j 0.630093 dtype: float64

Index for Series

s.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

Distinct values (rounded) in Series

s.round( 1 ).value_counts()

0.9 3 0.7 2 0.5 1 1.0 1 0.4 1 0.0 1 0.6 1 Name: count, dtype: int64

Dataframes

Make a Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.526257 0.082531 0.773950 0.430945 1 0.918965 0.518207 0.625018 0.044193 2 0.303664 0.825669 0.075327 0.693838 3 0.078656 0.777591 0.926702 0.385731 4 0.276900 0.667200 0.392218 0.964382 5 0.428334 0.905876 0.446561 0.364982

Columns in DataFrame

df.columns

[0, 1, 2, 3]

Rows (index) of DataFrame

df.index

[0, 1, 2, 3, 4, 5]

Transpose of Dataframe

df.T

0 1 2 3 4 5 0 0.526257 0.918965 0.303664 0.078656 0.276900 0.428334 1 0.082531 0.518207 0.825669 0.777591 0.667200 0.905876 2 0.773950 0.625018 0.075327 0.926702 0.392218 0.446561 3 0.430945 0.044193 0.693838 0.385731 0.964382 0.364982

Description of Dataframe

df.describe()

0 1 2 3 count 6.000000 6.000000 6.000000 6.000000 mean 0.422130 0.629512 0.539963 0.480679 std 0.286645 0.299927 0.302652 0.314579 min 0.078656 0.082531 0.075327 0.044193 25% 0.283591 0.555455 0.405804 0.370169 50% 0.365999 0.722395 0.535790 0.408338 75% 0.501777 0.813649 0.736717 0.628115 max 0.918965 0.905876 0.926702 0.964382

Information about Dataframe

df.info()

RangeIndex: 6 entries, 0 to 5 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 0 6 non-null float64 1 1 6 non-null float64 2 2 6 non-null float64 3 3 6 non-null float64 dtypes: float64(4) memory usage: 324.0 bytes

Shape of Dataframe

df.shape

(6, 4)

Rounded Dataframe

df.round(1)

0 1 2 3 0 0.5 0.1 0.8 0.4 1 0.9 0.5 0.6 0.0 2 0.3 0.8 0.1 0.7 3 0.1 0.8 0.9 0.4 4 0.3 0.7 0.4 1.0 5 0.4 0.9 0.4 0.4

Value Counts of Rounded Dataframe

df.round(1).value_counts()

0 1 2 3 0.1 0.8 0.9 0.4 1 0.3 0.7 0.4 1.0 1 0.8 0.1 0.7 1 0.4 0.9 0.4 0.4 1 0.5 0.1 0.8 0.4 1 0.9 0.5 0.6 0.0 1 Name: count, dtype: int64

Unique Value Counts of Rounded Dataframe

df.round(1).nunique()

0 5 1 5 2 5 3 4 dtype: int64

Rounded Selected Column

df[2].round(1)

0 0.8 1 0.6 2 0.1 3 0.9 4 0.4 5 0.4 Name: 2, dtype: float64

Value Counts of Rounded Selected Column

df[2].round(1).value_counts()

2 0.4 2 0.8 1 0.6 1 0.1 1 0.9 1 Name: count, dtype: int64

Unique Value Counts of Rounded Selected Column

df[2].round(1).nunique()

Get Array from DataFrame for Faster Iteration

df.values

[[0.52625746 0.08253089 0.77394952 0.4309448 ] [0.91896505 0.51820701 0.62501799 0.04419306] [0.30366439 0.82566895 0.07532738 0.69383818] [0.07865641 0.77759054 0.92670187 0.38573085] [0.27690023 0.66719996 0.39221784 0.96438196] [0.42833369 0.90587645 0.44656138 0.3649823 ]]

Assign Dataframe Index to Variable

idx = df.index

[0, 1, 2, 3, 4, 5]

The following two examples are explained in detail in a later section.

Drop Rows Meeting Condition in Specified Column

df.drop( df[ df[2] < .5].index, axis=0 )

0 1 2 3 0 0.526257 0.082531 0.773950 0.430945 1 0.918965 0.518207 0.625018 0.044193 3 0.078656 0.777591 0.926702 0.385731

Drop Columns Meeting Condition in Specified Row

df.drop( df.loc[ :, df.iloc[ 2, : ] < .5 ].columns, axis=1 )

1 3 0 0.082531 0.430945 1 0.518207 0.044193 2 0.825669 0.693838 3 0.777591 0.385731 4 0.667200 0.964382 5 0.905876 0.364982

A Few Properties of DataFrame Index

idx.is_monotonic_increasing

True

idx.is_monotonic_decreasing

False

idx.has_duplicates

False

idx.nlevels

idx.values

[0 1 2 3 4 5]

idx.tolist()

[0, 1, 2, 3, 4, 5]

idx.min()

idx.max()

Make a DataFrame from Argument List with Labeled Rows and Columns

df = DataFrame( [[ 1, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R3' ], columns = ['C1', 'C2', 'C3'] )

C1 C2 C3 R1 1 2 3 R2 4 5 6 R3 7 8 9

Assign CSV Data to a Variable

csv_data = """Color, Code, Alt-Code R1, Red, #ff0000, 255/0/0 R2, Green, #00ff00, 0/255/0 R3, Blue, #0000ff, 0/0/255"""

Make a DataFrame from CSV Data with Labeled Rows and Columns

df = pd.read_csv( StringIO( csv_data ), header = 0, index_col = 0, skipinitialspace = True )

Color Code Alt-Code R1 Red #ff0000 255/0/0 R2 Green #00ff00 0/255/0 R3 Blue #0000ff 0/0/255

Save DataFrame for Later Reference

df_colors = df

Color Code Alt-Code R1 Red #ff0000 255/0/0 R2 Green #00ff00 0/255/0 R3 Blue #0000ff 0/0/255

Assign Dictionary to a Variable

dict_data = { 'c0' : [ 1, 2, 3, 4, 5 ], 'c1' : [ 2, 3, 4, 5, 6 ], 'c2' : [ 3, 4, 5, 6, 7 ] }

Make a DataFrame from Dictionary with Labeled Columns

df = DataFrame( dict_data )

c0 c1 c2 0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6 4 5 6 7

Assign Dictionary to a Variable

dict_data = { 'r0' : [ 1, 2, 3, 4, 5 ], 'r1' : [ 2, 3, 4, 5, 6 ], 'r2' : [ 3, 4, 5, 6, 7 ] }

Make a DataFrame from Dictionary with Labeled Rows

df = DataFrame.from_dict( dict_data, orient='index' )

0 1 2 3 4 r0 1 2 3 4 5 r1 2 3 4 5 6 r2 3 4 5 6 7

Assign Dictionary to a Variable

# global dict_data dict_data = { 'r0' : { 'c0' : 1, 'c1' : 2, 'c2' : 3, 'c3' : 4, 'c4' : 5 }, 'r1' : { 'c0' : 2, 'c1' : 3, 'c2' : 4, 'c3' : 5, 'c4' : 6 }, 'r2' : { 'c0' : 3, 'c1' : 4, 'c2' : 5, 'c3' : 6, 'c4' : 7 } }

Make a DataFrame from Dictionary with Labeled Rows and Columns

df = DataFrame.from_dict( dict_data, orient='index' )

c0 c1 c2 c3 c4 r0 1 2 3 4 5 r1 2 3 4 5 6 r2 3 4 5 6 7

Convert DataFrame to String

df.to_string()

c0 c1 c2 c3 c4 r0 1 2 3 4 5 r1 2 3 4 5 6 r2 3 4 5 6 7

Iterate Over Data

Make a DataFrame

df = DataFrame( [[ 1, 2 ], [3, 4]], index = [ 'R1', 'R2' ], columns = ['C1', 'C2'] )

C1 C2 R1 1 2 R2 3 4

Iterate on Columns

for n, s in df.items(): # df.iteritems() deprecated print("n:", n) print("s:") print(s) print()

n: C1 s: R1 1 R2 3 Name: C1, dtype: int64 n: C2 s: R1 2 R2 4 Name: C2, dtype: int64

Iterate on Rows

for n, s in df.iterrows(): print("n:", n) print("s:") print(s) print()

n: R1 s: C1 1 C2 2 Name: R1, dtype: int64 n: R2 s: C1 3 C2 4 Name: R2, dtype: int64

Iterate on .values for Fast Access

for row in df.values: print("row:", row) for v in row: print(" v:", v)

row: [1 2] v: 1 v: 2 row: [3 4] v: 3 v: 4

Series Indexing

Make a Series

s = Series( np.random.rand( 5 ), index = ['x0', 'x1', 'x2', 'x3', 'x4' ] )

x0 0.891443 x1 0.501428 x2 0.458724 x3 0.411455 x4 0.794857 dtype: float64

Index by Label

s[ 'x2' ]

0.45872423599707834

Index by Attribute

s.x2

0.45872423599707834

Index by Label with .loc

s.loc[ 'x2' ]

0.45872423599707834

Index by Integer with .iloc

s.iloc[ 2 ]

0.45872423599707834

DataFrame Indexing

Fetch Saved DataFrame

df = df_colors

Color Code Alt-Code R1 Red #ff0000 255/0/0 R2 Green #00ff00 0/255/0 R3 Blue #0000ff 0/0/255

Index Column by Label

df[ 'Color' ]

R1 Red R2 Green R3 Blue Name: Color, dtype: object

Index Column by Attribute

df.Color

R1 Red R2 Green R3 Blue Name: Color, dtype: object

Index Column by Integer with .iloc

df.iloc[ :, 2 ]

R1 255/0/0 R2 0/255/0 R3 0/0/255 Name: Alt-Code, dtype: object

Index Row by Label with .loc

df.loc[ 'R1' ]

Color Red Code #ff0000 Alt-Code 255/0/0 Name: R1, dtype: object

Index Row by Integer with .iloc

df.iloc[ 2 ]

Color Blue Code #0000ff Alt-Code 0/0/255 Name: R3, dtype: object

Index Row/Col by Labels with .loc

df.loc[ 'R1', 'Color' ]

Red

Index Row/Col by Integers with .iloc

df.iloc[ 1, 1 ]

#00ff00

Index by Contents of String

df.query( 'Color == "Green"' )

Color Code Alt-Code R2 Green #00ff00 0/255/0

Boolean Indexing

Make a Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.066228 0.007887 0.410989 0.029063 1 0.067921 0.516712 0.182392 0.618393 2 0.801518 0.915226 0.398700 0.692419 3 0.907892 0.937800 0.667126 0.144018 4 0.407870 0.338500 0.198335 0.195538 5 0.898656 0.418393 0.737517 0.693376

Make Boolean Series on Rows

df[2] < .5

0 True 1 True 2 True 3 False 4 True 5 False Name: 2, dtype: bool

Index Rows by Boolean Series

df[ df[2] < .5 ]

0 1 2 3 0 0.066228 0.007887 0.410989 0.029063 1 0.067921 0.516712 0.182392 0.618393 2 0.801518 0.915226 0.398700 0.692419 4 0.407870 0.338500 0.198335 0.195538

Index Rows by Boolean Series

df[ df[2] > .5 ]

0 1 2 3 3 0.907892 0.937800 0.667126 0.144018 5 0.898656 0.418393 0.737517 0.693376

Make Boolean Series on Columns

df.iloc[:,2] < .5

0 True 1 True 2 True 3 False 4 True 5 False Name: 2, dtype: bool

Index Columns by Boolean Series

df[ df.iloc[:,2] < .5 ]

0 1 2 3 0 0.066228 0.007887 0.410989 0.029063 1 0.067921 0.516712 0.182392 0.618393 2 0.801518 0.915226 0.398700 0.692419 4 0.407870 0.338500 0.198335 0.195538

Index Columns by Boolean Series

df[ df.iloc[:,2] > .5 ]

0 1 2 3 3 0.907892 0.937800 0.667126 0.144018 5 0.898656 0.418393 0.737517 0.693376

Operations on Series / Dataframes

Get Series from Saved Dataframe

s = df_colors[ 'Color' ]

R1 Red R2 Green R3 Blue Name: Color, dtype: object

map() - execute function on each element of series

s.map( lambda x: print( x ))

Red Green Blue

apply() - execute function on each column of dataframe

df_colors.apply( lambda x: print( x ), axis=0 )

R1 Red R2 Green R3 Blue Name: Color, dtype: object R1 #ff0000 R2 #00ff00 R3 #0000ff Name: Code, dtype: object R1 255/0/0 R2 0/255/0 R3 0/0/255 Name: Alt-Code, dtype: object

apply() - execute function on each row of dataframe

df_colors.apply( lambda x: print( x ), axis=1 )

Color Red Code #ff0000 Alt-Code 255/0/0 Name: R1, dtype: object Color Green Code #00ff00 Alt-Code 0/255/0 Name: R2, dtype: object Color Blue Code #0000ff Alt-Code 0/0/255 Name: R3, dtype: object

map() - execute function on each element of dataframe

df_colors.map( lambda x: print( x ))

Red Green Blue #ff0000 #00ff00 #0000ff 255/0/0 0/255/0 0/0/255

Sorting Series

Make Series

s = Series( np.random.rand( 5 ), index=[ 'Ix', 'Ie', 'Ia', 'It', 'Io'])

Ix 0.728740 Ie 0.309253 Ia 0.386008 It 0.428475 Io 0.623328 dtype: float64

Sort by Index

s.sort_index()

Ia 0.386008 Ie 0.309253 Io 0.623328 It 0.428475 Ix 0.728740 dtype: float64

Sort by Value

s.sort_values()

Ie 0.309253 Ia 0.386008 It 0.428475 Io 0.623328 Ix 0.728740 dtype: float64

Sorting Dataframe

Make Dataframe

df = DataFrame( np.random.rand( 6, 4 ), index=['Rc', 'Ra', 'Rx', 'Rj', 'Rw', 'Rn' ], columns = ['Ce', 'Cr', 'Ci', 'Cq' ] )

Ce Cr Ci Cq Rc 0.756230 0.775280 0.323332 0.599263 Ra 0.313539 0.467441 0.978614 0.485691 Rx 0.106189 0.723605 0.215187 0.588402 Rj 0.379624 0.480167 0.131351 0.820475 Rw 0.549709 0.615262 0.919092 0.012299 Rn 0.294391 0.792651 0.384427 0.482769

Sort by Row Index

df.sort_index()

Ce Cr Ci Cq Ra 0.313539 0.467441 0.978614 0.485691 Rc 0.756230 0.775280 0.323332 0.599263 Rj 0.379624 0.480167 0.131351 0.820475 Rn 0.294391 0.792651 0.384427 0.482769 Rw 0.549709 0.615262 0.919092 0.012299 Rx 0.106189 0.723605 0.215187 0.588402

Sort by Column Index

df.sort_index( axis=1 )

Ce Ci Cq Cr Rc 0.756230 0.323332 0.599263 0.775280 Ra 0.313539 0.978614 0.485691 0.467441 Rx 0.106189 0.215187 0.588402 0.723605 Rj 0.379624 0.131351 0.820475 0.480167 Rw 0.549709 0.919092 0.012299 0.615262 Rn 0.294391 0.384427 0.482769 0.792651

Sort by Column Value

df.sort_values( by='Cr' )

Ce Cr Ci Cq Ra 0.313539 0.467441 0.978614 0.485691 Rj 0.379624 0.480167 0.131351 0.820475 Rw 0.549709 0.615262 0.919092 0.012299 Rx 0.106189 0.723605 0.215187 0.588402 Rc 0.756230 0.775280 0.323332 0.599263 Rn 0.294391 0.792651 0.384427 0.482769

Sort by Row Value

df.sort_values( by='Rj', axis=1 )

Ci Ce Cr Cq Rc 0.323332 0.756230 0.775280 0.599263 Ra 0.978614 0.313539 0.467441 0.485691 Rx 0.215187 0.106189 0.723605 0.588402 Rj 0.131351 0.379624 0.480167 0.820475 Rw 0.919092 0.549709 0.615262 0.012299 Rn 0.384427 0.294391 0.792651 0.482769

Summarize Functions on Dataframe

Make Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.237400 0.771043 0.308005 0.461401 1 0.418274 0.263856 0.649630 0.544882 2 0.667811 0.586960 0.890119 0.376049 3 0.501345 0.916115 0.269861 0.449142 4 0.887702 0.289208 0.116405 0.751740 5 0.333775 0.750669 0.235345 0.789468

Sum

df.sum()

0 3.046306 1 3.577851 2 2.469365 3 3.372682 dtype: float64

Count

df.count()

0 6 1 6 2 6 3 6 dtype: int64

Median

df.median()

0 0.459809 1 0.668815 2 0.288933 3 0.503142 dtype: float64

Quantile

df.quantile([0.25, 0.5, 0.75])

0 1 2 3 0.25 0.354899 0.363646 0.243974 0.452207 0.50 0.459809 0.668815 0.288933 0.503142 0.75 0.626194 0.765949 0.564223 0.700026

Min

df.min()

0 0.237400 1 0.263856 2 0.116405 3 0.376049 dtype: float64

Max

df.max()

0 0.887702 1 0.916115 2 0.890119 3 0.789468 dtype: float64

Mean

df.mean()

0 0.507718 1 0.596309 2 0.411561 3 0.562114 dtype: float64

Var

df.var()

0 0.056277 1 0.072312 2 0.086933 3 0.029094 dtype: float64

Std

df.std()

0 0.237228 1 0.268908 2 0.294844 3 0.170570 dtype: float64

Variations of Summarize Functions Shown for sum()

Sum Columns

df.sum()

0 3.046306 1 3.577851 2 2.469365 3 3.372682 dtype: float64

Sum Rows

df.sum( axis=1 )

0 1.777849 1 1.876641 2 2.520939 3 2.136463 4 2.045055 5 2.109256 dtype: float64

Sum Entire Dataframe

df.sum( axis=1 ).sum()

12.466203402723599

Histograms

Make Dataframe

df = DataFrame( np.random.normal( loc = 5, scale = 2, size = [1000, 4] ), dtype = float ).round()

0 1 2 3 0 5.0 7.0 3.0 4.0 1 7.0 5.0 4.0 7.0 2 5.0 5.0 3.0 6.0 3 5.0 5.0 3.0 4.0 4 3.0 4.0 6.0 5.0 .. ... ... ... ... 995 6.0 5.0 4.0 4.0 996 8.0 4.0 7.0 9.0 997 6.0 5.0 4.0 7.0 998 6.0 5.0 3.0 2.0 999 4.0 7.0 6.0 2.0 [1000 rows x 4 columns]

Histogram using pd.value_counts

s = df.apply( lambda x: x.value_counts()).fillna(0)

0 1 2 3 -2.0 0.0 0.0 1 0.0 -1.0 3.0 3.0 3 6.0 0.0 6.0 9.0 3 12.0 1.0 19.0 32.0 28 30.0 2.0 70.0 68.0 70 65.0 3.0 129.0 108.0 129 117.0 4.0 164.0 168.0 185 182.0 5.0 181.0 212.0 193 197.0 6.0 174.0 159.0 182 173.0 7.0 138.0 123.0 112 107.0 8.0 77.0 81.0 58 65.0 9.0 25.0 28.0 23 35.0 10.0 11.0 5.0 10 7.0 11.0 2.0 3.0 2 4.0 12.0 1.0 1.0 1 0.0

Histogram Using Pandas .hist()

h = df.hist( sharex=True, sharey=True, linewidth=1, figsize=(Fig_X, Fig_Y) )

[[ ] [ ]]

Get Series from Column 0 of Dataframe

s = df.iloc[ :, 0].values

[ 5. 7. 5. 5. 3. 3. -1. 4. 6. 7. 7. 4. 4. 6. 7. 4. 2. 5. 5. 6. 7. 3. 4. 9. 6. 5. 5. 4. 6. 8. 4. 7. 3. 0. 8. 7. 5. 2. 5. 5. 7. 7. 4. 6. 5. 5. 7. 6. 5. 6. 3. 3. 7. 4. 7. 3. 6. 7. 3. 3. 8. 10. 0. 3. 3. 4. 5. 7. 6. 7. 6. 3. 5. 7. 5. 8. 6. 0. 5. 6. 3. 8. -1. 7. 2. 6. 6. 4. 7. 6. 3. 6. 7. 5. 6. 2. 8. 4. 7. 8. 5. 5. 7. 8. 10. 7. 4. 7. 6. 3. 6. 6. 2. 6. 4. 3. 8. 5. 4. 5. 1. 7. 3. 5. 4. 7. 2. 4. 4. 5. 5. 3. 7. 6. 7. 8. 2. 5. 7. 8. 2. 7. 4. 4. 5. 5. 4. 5. 4. 10. 6. 8. 6. 2. 9. 5. 7. 6. 6. 5. 6. 3. 5. 4. 9. 2. 5. 6. 3. 5. 4. 5. 8. 4. 6. 8. 3. 6. 1. 5. 8. 4. 2. 4. 8. 7. 5. 8. 6. 6. 5. 6. 8. 7. 8. 7. 1. 5. 4. 2. 7. 6. 8. 8. 4. 3. 4. 8. 6. 7. 4. 6. 6. 3. 7. 4. 7. 5. 4. 5. 3. 3. 5. 3. 2. 5. 7. 6. 4. 5. 6. 4. 2. 3. 3. 7. 5. 6. -0. 7. 7. 3. 6. 4. 7. 6. 1. 5. 7. 5. 6. 7. 2. 8. 5. 11. 5. 2. 5. 5. 4. 5. 2. 7. 6. 8. 3. 7. 3. 6. 5. 2. 8. 3. 5. 4. 7. 5. 7. 4. 7. 5. 8. 6. 5. 6. 6. 8. 8. 7. 3. 4. 7. 4. 2. 5. 9. 7. 4. 5. 7. 6. 4. 5. 5. 5. 5. 4. 7. 3. 7. 4. 5. 2. 4. 6. 3. 3. 3. 3. 5. 7. 5. 6. 5. 8. 3. 5. 6. 7. 5. 3. 3. 5. 10. 3. 2. 7. 7. 8. 4. 9. 5. 2. 3. 5. 3. 8. 1. 8. 6. 6. 5. 4. 4. 7. 2. 1. 6. 3. 5. 6. 5. 3. 6. 8. 3. 5. 5. 5. 9. 6. 6. 3. 5. 3. 4. 1. 6. 8. 4. 8. 6. 4. 9. 6. 4. 3. 4. 4. 4. 3. 8. 5. 5. 6. 8. 6. 6. 4. 6. 5. 7. 3. 5. 2. 1. 7. 5. 4. 6. 7. 1. 6. 7. 6. 5. 8. 4. 3. 7. 5. 9. 4. 3. 5. 7. 3. 6. 3. 5. 9. 4. 5. 4. 8. 3. 3. 3. 4. 3. 7. 7. 4. 4. 7. 3. 3. 5. 6. 5. 5. 4. 8. 8. 2. 8. 4. 6. 2. 3. 2. 8. 5. 7. 7. 8. 7. 4. 7. 4. 3. 6. 8. 7. 4. 7. 7. 6. 5. 3. 7. 5. 1. 6. 7. 7. 2. 6. 5. 5. 4. 3. 6. 4. 7. 8. 5. 7. 4. 6. 6. 3. 3. 7. 4. 6. 4. 4. 5. 6. 6. 3. 6. 6. 8. 6. 7. 8. 4. 8. 7. 3. 3. 4. 9. 3. 4. 2. 7. 4. 3. 5. 4. 8. 6. 4. 5. 7. 5. 2. 6. 2. 4. 5. 3. 4. 6. 2. 8. 4. 2. 2. 6. 4. 3. 10. 6. 5. 6. 5. 3. 3. 7. 4. 4. 5. 2. 2. 3. 6. 4. 6. 6. 6. 6. 4. 2. 4. 7. 7. 3. 3. 6. 2. 4. 7. 4. 5. 5. -0. 5. 7. 4. 6. 5. 7. 5. 3. 4. 2. 3. 6. 7. 4. 7. -1. 6. 5. 6. 6. 7. 4. 4. 6. 9. 8. 6. 2. 6. 5. 2. 6. 7. 12. 8. 7. 3. 5. 6. 2. 9. 4. 4. 2. 6. 4. 7. 2. 4. 2. 4. 4. 2. 5. 5. 7. 6. 4. 4. 6. 4. 5. 7. 4. 5. 4. 8. 3. 8. 5. 2. 6. 6. 5. 6. 10. 5. 5. 7. 6. 5. 3. 2. 5. 6. 2. 5. 7. 4. 5. 7. 5. 5. 9. 8. 7. 10. 5. 4. 4. 4. 2. 7. 6. 1. 9. 3. 3. 3. 3. 4. 9. 3. 9. 7. 7. 4. 5. 8. 4. 4. 7. 4. 5. 7. 9. 3. 2. 7. 6. 4. 4. 3. 4. 4. 2. 5. 5. 9. 7. 5. 5. 5. 4. 2. 3. 6. 9. 4. 2. 4. 5. 5. 9. 5. 5. 3. 5. 3. 7. 3. 5. 2. 4. 4. 5. 5. 8. 6. 3. 6. 7. 5. 3. 3. 5. 8. 6. 6. 3. 4. 6. 5. 7. 3. 4. 4. 6. 2. 8. -0. 4. 3. 2. 7. 4. 3. 5. 10. 5. 8. 7. 4. 6. 6. 3. 5. 5. 8. 6. 5. 4. 7. 1. 5. 7. 10. 4. 5. 6. 5. 6. 4. 4. 8. 4. 6. 6. 10. 3. 8. 7. 6. 6. 3. 8. 6. 9. 3. 5. 5. 3. 5. 6. 6. 5. 4. 7. 5. 6. 6. 2. 5. 2. 7. 2. 6. 1. 8. 3. 11. 3. 4. 6. 8. 6. 6. 7. 5. 6. 5. 1. 2. 4. 4. 1. 4. 8. 3. 6. 3. 7. 4. 4. 5. 3. 6. 7. 3. 2. 6. 9. 4. 3. 1. 7. 6. 5. 6. 5. 6. 6. 6. 5. 6. 6. 1. 4. 7. 6. 2. 3. 5. 6. 5. 8. 6. 9. 4. 6. 3. 7. 3. 8. 3. 9. 4. 4. 3. 2. 5. 3. 5. 2. 4. 7. 2. 5. 3. 6. 6. 3. 8. 6. 8. 7. 2. 7. 10. 6. 4. 8. 9. 8. 4. 2. 1. 1. 3. 6. 6. 5. 4. 5. 3. 3. 6. 4. 4. 7. 6. 6. 3. 4. 8. 5. 7. 7. 7. 5. 5. 4. 4. 6. 7. 7. 6. 4. 7. 5. 2. 5. 2. 4. 3. 4. 6. 6. 8. 5. 3. 4. 5. 7. 6. 8. 6. 6. 4.]

Histogram using np.histogram()

h = np.histogram( s )

(array([ 9, 19, 70, 293, 181, 174, 215, 25, 11, 3]), array([-1. , 0.3, 1.6, 2.9, 4.2, 5.5, 6.8, 8.1, 9.4, 10.7, 12. ]))

Miscellaneous

Fetch Saved DataFrame

df = df_colors

Color Code Alt-Code R1 Red #ff0000 255/0/0 R2 Green #00ff00 0/255/0 R3 Blue #0000ff 0/0/255

Get JSON from Dataframe

df.to_json( orient='records', lines=True )

{"Color":"Red","Code":"#ff0000","Alt-Code":"255\/0\/0"} {"Color":"Green","Code":"#00ff00","Alt-Code":"0\/255\/0"} {"Color":"Blue","Code":"#0000ff","Alt-Code":"0\/0\/255"}

Series with MultiIndex

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] tuples = list(zip(*arrays)) index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) s = pd.Series(np.random.randn(8), index=index) print(s)

first second bar one -0.356158 two -0.384729 baz one -1.038348 two -0.019466 foo one 0.063635 two -0.238940 qux one -1.485342 two 1.291386 dtype: float64

Dataframe with MultiIndex

df = DataFrame( np.arange(12).reshape((4,3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2 ]], columns = [['NJ', 'Ohio', 'Colorado' ]] )

NJ Ohio Colorado a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11

Show Dataframe with MultiIndex

for x, ndf in df.groupby(level=0): print(ndf) print("x:", x)

NJ Ohio Colorado a 1 0 1 2 2 3 4 5 x: a NJ Ohio Colorado b 1 6 7 8 2 9 10 11 x: b

Make a Series

s = pd.Series( { 'I1': '1234', 'I2': '2345', 'I3': '3456' } )

I1 1234 I2 2345 I3 3456 dtype: object

Series to DataFrame

df = s.to_frame()

0 I1 1234 I2 2345 I3 3456

DataFrame Transposed

df.T

I1 I2 I3 0 1234 2345 3456

Pivot Tables

Make a Dataframe with Named Columns

df = pd.DataFrame( [ ["Alice", "North", 200 ], ["Alice", "South", 150 ], ["Bob", "East", 100 ], ["Bob", "West", 100 ], ["Charlie", "North", 120 ], ["Charlie", "West", 200 ], ], columns = [ "Agent", "Region", "Sales" ] )

Agent Region Sales 0 Alice North 200 1 Alice South 150 2 Bob East 100 3 Bob West 100 4 Charlie North 120 5 Charlie West 200

Pivot

df.pivot(index='Agent', columns='Region', values='Sales')

Region East North South West Agent Alice NaN 200.0 150.0 NaN Bob 100.0 NaN NaN 100.0 Charlie NaN 120.0 NaN 200.0

Pivot

df.pivot_table(index='Agent', columns='Region', values='Sales')

Region East North South West Agent Alice NaN 200.0 150.0 NaN Bob 100.0 NaN NaN 100.0 Charlie NaN 120.0 NaN 200.0

Concatenation

Make Dataframes

df1 = DataFrame( np.random.rand( 2, 3 ))

0 1 2 0 0.670060 0.355179 0.345913 1 0.411077 0.543479 0.693139

df2 = DataFrame( np.random.rand( 4, 3 ))

0 1 2 0 0.757974 0.130799 0.752522 1 0.106363 0.190630 0.539620 2 0.342775 0.085893 0.832798 3 0.222075 0.198671 0.068842

df3 = DataFrame( np.random.rand( 2, 4 ))

0 1 2 3 0 0.637892 0.928158 0.887674 0.583632 1 0.055699 0.694496 0.363749 0.389777

Concatenation Dataframes Vertically

pd.concat( [df1, df2] )

0 1 2 0 0.670060 0.355179 0.345913 1 0.411077 0.543479 0.693139 0 0.757974 0.130799 0.752522 1 0.106363 0.190630 0.539620 2 0.342775 0.085893 0.832798 3 0.222075 0.198671 0.068842

Concatenation Dataframes Horizontally

pd.concat( [df1, df3], axis=1 )

0 1 2 0 1 2 3 0 0.670060 0.355179 0.345913 0.637892 0.928158 0.887674 0.583632 1 0.411077 0.543479 0.693139 0.055699 0.694496 0.363749 0.389777

Make a Dataframe

df = DataFrame( np.random.rand( 2, 3 ))

0 1 2 0 0.047693 0.008885 0.391842 1 0.011640 0.697079 0.176610

Make a Series

s = Series( np.random.rand( 3 ))

0 0.909998 1 0.029004 2 0.108992 dtype: float64

Concat Dataframe and Series

pd.concat( [df, s.to_frame().T ] )

0 1 2 0 0.047693 0.008885 0.391842 1 0.011640 0.697079 0.176610 0 0.909998 0.029004 0.108992

Incremental, Detailed Development of Selected Concepts

With some overlap of other examples and in excruciating detail we show here the development of selectors for drop and change. All with integer indexing.

Incremental Development of Row Selection

Make a Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.937316 0.843650 0.581525 0.469518 1 0.861697 0.648787 0.066774 0.456743 2 0.261171 0.925944 0.987221 0.905516 3 0.180667 0.269428 0.585313 0.780752 4 0.429281 0.546803 0.223258 0.501009 5 0.942636 0.306115 0.958083 0.310338

Make Explicit Boolean Row Index

bi = [True, False, True, False, True, False]

[True, False, True, False, True, False]

Dataframe Matching Boolean Row Index

df[ bi ]

0 1 2 3 0 0.937316 0.843650 0.581525 0.469518 2 0.261171 0.925944 0.987221 0.905516 4 0.429281 0.546803 0.223258 0.501009

Drop Test Column

df[2]

0 0.581525 1 0.066774 2 0.987221 3 0.585313 4 0.223258 5 0.958083 Name: 2, dtype: float64

Drop Row Condition

df[2] < .5

0 False 1 True 2 False 3 False 4 True 5 False Name: 2, dtype: bool

Dataframe Matching Drop Condition

df[ df[2] < .5 ]

0 1 2 3 1 0.861697 0.648787 0.066774 0.456743 4 0.429281 0.546803 0.223258 0.501009

Dataframe Matching Drop Condition Index

df[ df[2] < .5 ].index

Index([1, 4], dtype='int64')

Dataframe Matching Drop Condition Index with List Expansion

df[ df[2] < .5 ].index

[1, 4]

Drop Rows Meeting Column Condition

df.drop( df[ df[2] < .5].index, axis=0 )

0 1 2 3 0 0.937316 0.843650 0.581525 0.469518 2 0.261171 0.925944 0.987221 0.905516 3 0.180667 0.269428 0.585313 0.780752 5 0.942636 0.306115 0.958083 0.310338

Change Column Values Meeting Condition

df.iloc[ df[2] < .5, 2 ] = 99999

Show Updated Dataframe

0 1 2 3 0 0.937316 0.843650 0.581525 0.469518 1 0.861697 0.648787 99999.000000 0.456743 2 0.261171 0.925944 0.987221 0.905516 3 0.180667 0.269428 0.585313 0.780752 4 0.429281 0.546803 99999.000000 0.501009 5 0.942636 0.306115 0.958083 0.310338

Incremental Development of Column Selection

Make a Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.287383 0.841523 0.636146 0.898033 1 0.799503 0.258190 0.966333 0.478904 2 0.131248 0.442555 0.361959 0.538689 3 0.841960 0.509757 0.743804 0.792505 4 0.227624 0.837596 0.152987 0.112047 5 0.526719 0.361483 0.140216 0.694413

Make Explicit Boolean Column Index

bi = [True, False, True, False ]

[True, False, True, False]

Dataframe Matching Explicit Boolean Column Index

df.iloc[ :, bi ]

0 2 0 0.287383 0.636146 1 0.799503 0.966333 2 0.131248 0.361959 3 0.841960 0.743804 4 0.227624 0.152987 5 0.526719 0.140216

Drop Test Row

df.iloc[2]

0 0.131248 1 0.442555 2 0.361959 3 0.538689 Name: 2, dtype: float64

Drop Column Condition

df.iloc[ 2, : ] < .5

0 True 1 True 2 True 3 False Name: 2, dtype: bool

Dataframe Matching Drop Condition

df.loc[ :, df.iloc[ 2, : ] < .5 ]

0 1 2 0 0.287383 0.841523 0.636146 1 0.799503 0.258190 0.966333 2 0.131248 0.442555 0.361959 3 0.841960 0.509757 0.743804 4 0.227624 0.837596 0.152987 5 0.526719 0.361483 0.140216

Dataframe Matching Drop Condition Column Index

df.loc[ :, df.iloc[ 2, : ] < .5 ].columns

Index([0, 1, 2], dtype='int64')

Dataframe Matching Drop Condition Column Index with List Expansion

df.loc[ :, df.iloc[ 2, : ] < .5 ].columns

[0, 1, 2]

Drop Columns Meeting Condition

df.drop( df.loc[ :, df.iloc[ 2, : ] < .5 ].columns, axis=1 )

3 0 0.898033 1 0.478904 2 0.538689 3 0.792505 4 0.112047 5 0.694413

Change Column Values Meeting Condition

df.loc[ :, df.iloc[ 2, : ] < .5 ] = 99999

Show Updated Dataframe

0 1 2 3 0 99999.0 99999.0 99999.0 0.898033 1 99999.0 99999.0 99999.0 0.478904 2 99999.0 99999.0 99999.0 0.538689 3 99999.0 99999.0 99999.0 0.792505 4 99999.0 99999.0 99999.0 0.112047 5 99999.0 99999.0 99999.0 0.694413

Arithmetic on Dataframes

Make a Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.001642 0.380782 0.337040 0.094192 1 0.433359 0.639956 0.701023 0.246333 2 0.142322 0.805451 0.234232 0.913534 3 0.413745 0.809982 0.597281 0.651065 4 0.525349 0.699731 0.144319 0.271298 5 0.987510 0.513552 0.783160 0.635383

Make a Series

s = Series( np.random.rand( 4 ))

0 0.577031 1 0.677590 2 0.613750 3 0.375769 dtype: float64

Add a Constant to Every Element of Dataframe

df + 3

0 1 2 3 0 3.001642 3.380782 3.337040 3.094192 1 3.433359 3.639956 3.701023 3.246333 2 3.142322 3.805451 3.234232 3.913534 3 3.413745 3.809982 3.597281 3.651065 4 3.525349 3.699731 3.144319 3.271298 5 3.987510 3.513552 3.783160 3.635383

Add a Series to Every Row of Dataframe

df.add( s, axis='columns' )

0 1 2 3 0 0.578673 1.058373 0.950790 0.469961 1 1.010390 1.317546 1.314773 0.622101 2 0.719354 1.483042 0.847982 1.289303 3 0.990776 1.487572 1.211031 1.026834 4 1.102380 1.377322 0.758069 0.647067 5 1.564541 1.191143 1.396910 1.011152

Add a Series to Every Row of Dataframe

df + s

Add a Series to Every Column of Dataframe

df.add( s, axis='index' )

0 1 2 3 0 0.578673 0.957813 0.914071 0.671224 1 1.110949 1.317546 1.378614 0.923923 2 0.756073 1.419202 0.847982 1.527285 3 0.789513 1.185750 0.973049 1.026834 4 NaN NaN NaN NaN 5 NaN NaN NaN NaN

Add two Dataframes

df + df

0 1 2 3 0 0.003283 0.761564 0.674079 0.188385 1 0.866718 1.279912 1.402046 0.492665 2 0.284645 1.610903 0.468463 1.827069 3 0.827489 1.619963 1.194561 1.302130 4 1.050698 1.399463 0.288637 0.542596 5 1.975019 1.027105 1.566320 1.270766

Missing Data

Make a Dataframe

df = DataFrame( np.random.rand( 6, 4 ))

0 1 2 3 0 0.665310 0.973201 0.334828 0.956764 1 0.520282 0.185836 0.601835 0.306442 2 0.929082 0.639267 0.691352 0.639253 3 0.112779 0.889961 0.306980 0.812285 4 0.520858 0.440499 0.145265 0.106547 5 0.986067 0.328122 0.827735 0.313569

Add Missing Data

df.iloc[3,2] = np.nan

0 1 2 3 0 0.665310 0.973201 0.334828 0.956764 1 0.520282 0.185836 0.601835 0.306442 2 0.929082 0.639267 0.691352 0.639253 3 0.112779 0.889961 NaN 0.812285 4 0.520858 0.440499 0.145265 0.106547 5 0.986067 0.328122 0.827735 0.313569

Drop Rows Having Missing Data

df.dropna()

0 1 2 3 0 0.665310 0.973201 0.334828 0.956764 1 0.520282 0.185836 0.601835 0.306442 2 0.929082 0.639267 0.691352 0.639253 4 0.520858 0.440499 0.145265 0.106547 5 0.986067 0.328122 0.827735 0.313569

Drop Columns Having Missing Data

df.dropna( axis=1 )

0 1 3 0 0.665310 0.973201 0.956764 1 0.520282 0.185836 0.306442 2 0.929082 0.639267 0.639253 3 0.112779 0.889961 0.812285 4 0.520858 0.440499 0.106547 5 0.986067 0.328122 0.313569

Fill Missing Data

df.fillna( 99 )

0 1 2 3 0 0.665310 0.973201 0.334828 0.956764 1 0.520282 0.185836 0.601835 0.306442 2 0.929082 0.639267 0.691352 0.639253 3 0.112779 0.889961 99.000000 0.812285 4 0.520858 0.440499 0.145265 0.106547 5 0.986067 0.328122 0.827735 0.313569

Replace Values

Make Dataframe

df = DataFrame( np.random.rand( 6, 4 )).round(1)

0 1 2 3 0 0.6 0.3 0.2 0.6 1 0.7 1.0 0.5 0.6 2 0.7 0.6 0.2 0.6 3 0.4 0.4 0.8 0.9 4 0.6 0.5 0.7 0.1 5 0.3 0.4 0.9 0.1

Replace Single Value

df.replace( .3, 99 )

0 1 2 3 0 0.6 99.0 0.2 0.6 1 0.7 1.0 0.5 0.6 2 0.7 0.6 0.2 0.6 3 0.4 0.4 0.8 0.9 4 0.6 0.5 0.7 0.1 5 99.0 0.4 0.9 0.1

Replace Values in List

df.replace( [.2, .3, .4], [99, 98, 97] )

0 1 2 3 0 0.6 98.0 99.0 0.6 1 0.7 1.0 0.5 0.6 2 0.7 0.6 99.0 0.6 3 97.0 97.0 0.8 0.9 4 0.6 0.5 0.7 0.1 5 98.0 97.0 0.9 0.1

Replace Values in Dict

df.replace( {.2 : 97, .3 : 98, .4: 99} )

0 1 2 3 0 0.6 98.0 97.0 0.6 1 0.7 1.0 0.5 0.6 2 0.7 0.6 97.0 0.6 3 99.0 99.0 0.8 0.9 4 0.6 0.5 0.7 0.1 5 98.0 99.0 0.9 0.1

Replace Values with Where

df.where( df < .5, 99 )

0 1 2 3 0 99.0 0.3 0.2 99.0 1 99.0 99.0 99.0 99.0 2 99.0 99.0 0.2 99.0 3 0.4 0.4 99.0 99.0 4 99.0 99.0 99.0 0.1 5 0.3 0.4 99.0 0.1

Adding Columns

Make Dataframe

df = DataFrame( np.random.rand( 6, 4 )).round(1)

0 1 2 3 0 0.5 0.6 0.6 0.7 1 0.4 0.7 0.8 0.6 2 0.6 0.1 0.7 0.7 3 0.0 0.8 0.8 0.7 4 0.6 0.6 0.3 0.7 5 0.4 0.8 0.5 0.6

Add Column with Assign

df.assign( Max = df.max( axis=1 ))

0 1 2 3 Max 0 0.5 0.6 0.6 0.7 0.7 1 0.4 0.7 0.8 0.6 0.8 2 0.6 0.1 0.7 0.7 0.7 3 0.0 0.8 0.8 0.7 0.8 4 0.6 0.6 0.3 0.7 0.7 5 0.4 0.8 0.5 0.6 0.8

Add Column with Explicit Assignment

df['Min'] = df.min( axis=1 ) print(df)

0 1 2 3 Min 0 0.5 0.6 0.6 0.7 0.5 1 0.4 0.7 0.8 0.6 0.4 2 0.6 0.1 0.7 0.7 0.1 3 0.0 0.8 0.8 0.7 0.0 4 0.6 0.6 0.3 0.7 0.3 5 0.4 0.8 0.5 0.6 0.4

Merging Dataframes

Make Dataframes with Labeled Rows and Columns

df1 = DataFrame( [[ 1, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R3' ], columns = ['C1', 'C2', 'C3'] ) print( df1 )

C1 C2 C3 R1 1 2 3 R2 4 5 6 R3 7 8 9

df2 = DataFrame( [[ 10, 2, 3 ], [4, 5, 6], [7, 8, 9]], index = [ 'R1', 'R2', 'R4' ], columns = ['C1', 'C4', 'C5'] ) print( df2 )

C1 C4 C5 R1 10 2 3 R2 4 5 6 R4 7 8 9

Join Using Keys from Left Frame

pd.merge( df1, df2, how='left', on='C1', indicator=True)

C1 C2 C3 C4 C5 _merge 0 1 2 3 NaN NaN left_only 1 4 5 6 5.0 6.0 both 2 7 8 9 8.0 9.0 both

Join Using Keys from Right Frame

pd.merge( df1, df2, how='right', on='C1', indicator=True)

C1 C2 C3 C4 C5 _merge 0 10 NaN NaN 2 3 right_only 1 4 5.0 6.0 5 6 both 2 7 8.0 9.0 8 9 both

Join Using Union of Keys from Both Frames

pd.merge( df1, df2, how='outer', on='C1', indicator=True)

C1 C2 C3 C4 C5 _merge 0 1 2.0 3.0 NaN NaN left_only 1 4 5.0 6.0 5.0 6.0 both 2 7 8.0 9.0 8.0 9.0 both 3 10 NaN NaN 2.0 3.0 right_only

Join Using Intersection of Keys from Both Frames

pd.merge( df1, df2, how='inner', on='C1', indicator=True)

C1 C2 C3 C4 C5 _merge 0 4 5 6 5 6 both 1 7 8 9 8 9 both

Appendix

Version Information

Python

platform.python_version()

3.13.3

Pandas

pd.__version__

2.2.3

Matplotlib

matplotlib.__version__

3.10.1

Internals for the Examples

The examples here use the Python eval() function for expressions and exec() for statements. Expressions have a return value, which is shown following the code example. Statements do not have a return value and show results with calls to print(). Both are executed in the global context. Here are simple examples of eval() and exec().

Assign value to variable with eval()

var1 = 123

123

Show variable

var1

123

Assign value to variable with exec()

var2 = 234 print( var2 )

234