Pandas DataFrames
Last updated on 2024-11-10 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How can I do statistical analysis of tabular data?
Objectives
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Select a subset of a dataframe by a single Boolean criterion.
Note about Pandas DataFrames/Series
A DataFrame is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.
Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.
Selecting values
To access a value at the position [i,j]
of a DataFrame,
we have two options, depending on what is the meaning of i
in use. Remember that a DataFrame provides an index as a way to
identify the rows of the table; a row, then, has a position
inside the table as well as a label, which uniquely identifies
its entry in the DataFrame.
Use DataFrame.iloc[..., ...]
to select values by their
(entry) position
- Can specify location by numerical index analogously to 2D version of character selection in strings.
PYTHON
import pandas as pd
data_penguins_named = pd.read_csv('data/data-penguins-named.csv',index_col='name')
print(data_penguins_named.iloc[0, 0])
OUTPUT
Adelie
Use DataFrame.loc[..., ...]
to select values by their
(entry) label (aka index name)
- Can specify location by row and/or column name.
OUTPUT
Adelie
Use :
on its own to mean all columns or all rows.
- Just like Python’s usual slicing notation.
OUTPUT
species Adelie
island Torgersen
bill_length_mm 39.1
bill_depth_mm 18.7
flipper_length_mm 181.0
body_mass_g 3750.0
sex Male
Name: 0, dtype: object
OUTPUT
species Adelie
island Torgersen
bill_length_mm 39.1
bill_depth_mm 18.7
flipper_length_mm 181.0
body_mass_g 3750.0
sex Male
Name: 0, dtype: object
OUTPUT
name
Adelie_Torgersen_0 39.1
Adelie_Torgersen_1 39.5
Adelie_Torgersen_2 40.3
Adelie_Torgersen_3 36.7
Adelie_Torgersen_4 39.3
...
Gentoo_Biscoe_328 47.2
Gentoo_Biscoe_329 46.8
Gentoo_Biscoe_330 50.4
Gentoo_Biscoe_331 45.2
Gentoo_Biscoe_332 49.9
Name: bill_length_mm, Length: 333, dtype: float64
Use comparisons to select data based on value.
- Comparison is applied element by element.
- Returns a similarly-shaped dataframe of
True
andFalse
.
PYTHON
# Use a subset of data.
subset = data_penguins_named.loc[:, 'bill_length_mm':'flipper_length_mm']
print('Subset of data:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 200)
OUTPUT
Where are values large?
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 False False True True
1 False False True True
2 False False True True
3 False False True True
4 False False True True
.. ... ... ... ...
328 False False True True
329 False False True True
330 False False True True
331 False False True True
332 False False True True
Select values or NaN using a Boolean mask.
- A frame full of Booleans is sometimes called a mask because of how it can be used.
OUTPUT
bill_length_mm bill_depth_mm flipper_length_mm
0 NaN NaN 181.0
1 NaN NaN 186.0
2 NaN NaN 195.0
3 NaN NaN 193.0
4 NaN NaN 190.0
.. ... ... ...
328 NaN NaN 214.0
329 NaN NaN 215.0
330 NaN NaN 222.0
331 NaN NaN 212.0
332 NaN NaN 213.0
- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
OUTPUT
bill_length_mm bill_depth_mm flipper_length_mm
count 0.0 0.0 144.000000
mean NaN NaN 215.034722
std NaN NaN 7.819121
min NaN NaN 201.000000
25% NaN NaN 210.000000
50% NaN NaN 215.000000
75% NaN NaN 220.000000
max NaN NaN 231.000000
Group By: split-apply-combine
Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyse their data.
OUTPUT
species
Adelie 3706.164384
Chinstrap 3733.088235
Gentoo 5092.436975
Name: body_mass_g, dtype: float64
OUTPUT
species island
Adelie Biscoe 38.975000
Dream 38.520000
Torgersen 39.038298
Chinstrap Dream 48.833824
Gentoo Biscoe 47.568067
Name: bill_length_mm, dtype: float64
A different way in which you could get the same output is by using an
additional function .agg()
. Here is an example of a single
aggregation on one column:
OUTPUT
| body_mass_g
species |
Adelie | 3706.164384
Chinstrap | 3733.088235
Gentoo | 5092.436975
And again with a second layer to groupby of island:
There are some other useful ways in which we can use
groupby()
and agg()
. Here we are preforming
multiple aggregation on single column to see mean, median and standard
deviation of the body mass in different species:
OUTPUT
body_mass_g
| mean | median | std
species | | |
Adelie | 3706.164384 | 3700.0 | 458.620135
Chinstrap | 3733.088235 | 3700.0 | 384.335081
Gentoo | 5092.436975 | 5050.0 | 501.476154
In case we want to explore two different columns and see how the mean
of body mass and bill length are different between penguins of different
species, based on the island they live on, we can apply specific
aggregations to each column (in this case both are mean). Use
.reset_index()
at the end if you would like to result in a
dataframe:
PYTHON
new_penguin_data = data_penguins_named.groupby(["species", "island"]).agg({'body_mass_g': 'mean', 'bill_length_mm': 'mean'}).reset_index()
print(new_penguin_data)
OUTPUT
species island body_mass_g bill_length_mm
0 Adelie Biscoe 3709.659091 38.975000
1 Adelie Dream 3701.363636 38.520000
2 Adelie Torgersen 3708.510638 39.038298
3 Chinstrap Dream 3733.088235 48.833824
4 Gentoo Biscoe 5092.436975 47.568067
Looking at unique values
Look at pandas documentation and see what method can be used to get unique values.
To get the count of unique values you can use value_counts method:
The output is
OUTPUT
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie Biscoe 34.5 18.1 187.0 2900.0 Female 1
Gentoo Biscoe 44.0 13.6 208.0 4350.0 Female 1
43.6 13.9 217.0 4900.0 Female 1
43.5 15.2 213.0 4650.0 Female 1
14.2 220.0 4700.0 Female 1
..
Adelie Torgersen 36.6 17.8 185.0 3700.0 Female 1
36.2 17.2 187.0 3150.0 Female 1
16.1 187.0 3550.0 Female 1
35.9 16.6 190.0 3050.0 Female 1
Gentoo Biscoe 59.6 17.0 230.0 6050.0 Male 1
Filtering
- How can you filter the DataFrame data to get all rows where the species is ‘Adelie’ and the island is ‘Torgersen’?
- What code would you use to filter the DataFrame data to find all entries where the body mass is greater than 4000 grams and the flipper length is greater than 200 mm?
PYTHON
data_penguins[(data_penguins['species'] == 'Adelie') & (data_penguins['island'] == 'Torgersen')]
OUTPUT
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
4 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
5 Adelie Torgersen 38.9 17.8 181.0 3625.0 Female
... ... ... ... ... ... ... ...
120 Adelie Torgersen 38.8 17.6 191.0 3275.0 Female
121 Adelie Torgersen 41.5 18.3 195.0 4300.0 Male
122 Adelie Torgersen 39.0 17.1 191.0 3050.0 Female
123 Adelie Torgersen 44.1 18.0 210.0 4000.0 Male
124 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female
125 Adelie Torgersen 43.1 19.2 197.0 3500.0 Male
PYTHON
data_penguins[(data_penguins['body_mass_g'] > 4000) & (data_penguins['flipper_length_mm'] > 200)]
OUTPUT
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
85 Adelie Dream 41.1 18.1 205.0 4300.0 Male
89 Adelie Dream 40.8 18.9 208.0 4300.0 Male
95 Adelie Biscoe 41.0 20.0 203.0 4725.0 Male
159 Chinstrap Dream 52.0 18.1 201.0 4050.0 Male
161 Chinstrap Dream 50.5 19.6 201.0 4050.0 Male
... ... ... ... ... ... ... ...
328 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
329 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
330 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
331 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
332 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male
Many Ways of Access
There are at least two ways of accessing a value or slice of a
DataFrame: by name or index. However, there are many others. For
example, a single column or row can be accessed either as a
DataFrame
or a Series
object.
Suggest different ways of doing the following operations on a DataFrame:
- Access a single column
- Access a single row
- Access an individual DataFrame element
- Access several columns
- Access several rows
- Access a subset of specific rows and columns
- Access a subset of row and column ranges
1. Access a single column:
PYTHON
# by name
data["col_name"] # as a Series
data[["col_name"]] # as a DataFrame
# by name using .loc
data.T.loc["col_name"] # as a Series
data.T.loc[["col_name"]].T # as a DataFrame
# Dot notation (Series)
data.col_name
# by index (iloc)
data.iloc[:, col_index] # as a Series
data.iloc[:, [col_index]] # as a DataFrame
# using a mask
data.T[data.T.index == "col_name"].T
2. Access a single row:
PYTHON
# by name using .loc
data.loc["row_name"] # as a Series
data.loc[["row_name"]] # as a DataFrame
# by name
data.T["row_name"] # as a Series
data.T[["row_name"]].T # as a DataFrame
# by index
data.iloc[row_index] # as a Series
data.iloc[[row_index]] # as a DataFrame
# using mask
data[data.index == "row_name"]
3. Access an individual DataFrame element:
PYTHON
# by column/row names
data["column_name"]["row_name"] # as a Series
data[["col_name"]].loc["row_name"] # as a Series
data[["col_name"]].loc[["row_name"]] # as a DataFrame
data.loc["row_name"]["col_name"] # as a value
data.loc[["row_name"]]["col_name"] # as a Series
data.loc[["row_name"]][["col_name"]] # as a DataFrame
data.loc["row_name", "col_name"] # as a value
data.loc[["row_name"], "col_name"] # as a Series. Preserves index. Column name is moved to `.name`.
data.loc["row_name", ["col_name"]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.loc[["row_name"], ["col_name"]] # as a DataFrame (preserves original index and column name)
# by column/row names: Dot notation
data.col_name.row_name
# by column/row indices
data.iloc[row_index, col_index] # as a value
data.iloc[[row_index], col_index] # as a Series. Preserves index. Column name is moved to `.name`
data.iloc[row_index, [col_index]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.iloc[[row_index], [col_index]] # as a DataFrame (preserves original index and column name)
# column name + row index
data["col_name"][row_index]
data.col_name[row_index]
data["col_name"].iloc[row_index]
# column index + row name
data.iloc[:, [col_index]].loc["row_name"] # as a Series
data.iloc[:, [col_index]].loc[["row_name"]] # as a DataFrame
# using masks
data[data.index == "row_name"].T[data.T.index == "col_name"].T
4. Access several columns:
PYTHON
# by name
data[["col1", "col2", "col3"]]
data.loc[:, ["col1", "col2", "col3"]]
# by index
data.iloc[:, [col1_index, col2_index, col3_index]]
5. Access several rows
PYTHON
# by name
data.loc[["row1", "row2", "row3"]]
# by index
data.iloc[[row1_index, row2_index, row3_index]]
6. Access a subset of specific rows and columns
PYTHON
# by names
data.loc[["row1", "row2", "row3"], ["col1", "col2", "col3"]]
# by indices
data.iloc[[row1_index, row2_index, row3_index], [col1_index, col2_index, col3_index]]
# column names + row indices
data[["col1", "col2", "col3"]].iloc[[row1_index, row2_index, row3_index]]
# column indices + row names
data.iloc[:, [col1_index, col2_index, col3_index]].loc[["row1", "row2", "row3"]]
7. Access a subset of row and column ranges
PYTHON
# by name
data.loc["row1":"row2", "col1":"col2"]
# by index
data.iloc[row1_index:row2_index, col1_index:col2_index]
# column names + row indices
data.loc[:, "col1_name":"col2_name"].iloc[row1_index:row2_index]
# column indices + row names
data.iloc[:, col1_index:col2_index].loc["row1":"row2"]
Exploring available methods using the
dir()
function
Python includes a dir()
function that can be used to
display all the available methods (functions) that are built into a data
object. In Episode 4, we used some methods with a string. But we can see
many more are available by using dir()
:
This command returns:
PYTHON
['__add__',
...
'__subclasshook__',
'capitalize',
'casefold',
'center',
...
'upper',
'zfill']
You can use help()
or Shift+Tab to
get more information about what these methods do.
Assume Pandas has been imported and the penguins data has been loaded
as data_penguins
. Then, use dir()
to find the
function that prints out the count of data entries across all
columns.
Interpretation
Interpolation is estimation based of known data. Imagine some measurement in the dataset are missing. How would you fill in missing numerical values? What factor would you take into account?
Key Points
- Use
DataFrame.iloc[..., ...]
to select values by integer location. - Use
:
on its own to mean all columns or all rows. - Select multiple columns or rows using
DataFrame.loc
and a named slice. - Result of slicing can be used in further operations.
- Use comparisons to select data based on value.
- Select values or NaN using a Boolean mask.