Using lambda functions with pandas in Python
October 2024
I've been using Python for around 5 years, and it quickly became my favourite day-to-day programming language. Due to its versatility, Python is frequently used in modern tech, including data analytics, data engineering, and data science.
One of the most common libraries used for data analysis in Python is pandas. It allows for the scripting and automation of reading and transforming tabular data in Python, making it quicker and easier to perform frequent actions on data.
When performing data analysis it is almost always necessary to transform data. This might be adding custom fields (e.g. "if column X contains Y then create column Z"), combining two fields, performing calculations, adding suffixes/prefixes, etc.
Depending on what you want to achieve, you might write a custom function and pass it to pandas.DataFrame.apply. For a very simple calculation, though, a lambda function can be used in place of a custom function.
There are pros and cons to using lambda functions. Lambda functions only handle one expression, so if you have a simple "if x then y else z" scenario, a single line of code can be a nice clean solution. If you start getting complicated, a custom function might be more readable (which is a big benefit of Python in the first place). But when you need to do the same simple calculation across all rows in a dataframe, lambda functions are a great tool.
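To make the trade-off concrete, here is a minimal sketch of the same "if x then y else z" rule written both ways. The names label and label_named and the positive/not positive rule are made up for illustration and are not part of the examples in this post.

```python
# A lambda holds exactly one expression, so a simple either/or rule fits on one line.
# (Binding a lambda to a name is done here only so the two versions can be compared.)
label = lambda x: 'positive' if x > 0 else 'not positive'

# The same rule as a named function: more lines, but room to grow
def label_named(x):
    if x > 0:
        return 'positive'
    return 'not positive'

# Both versions give the same answer for the same input
for value in [3, 0, -2]:
    print(value, label(value), label_named(value))
```

For a rule this small the two are interchangeable. The named version starts to pay off once more branches or intermediate steps are needed.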
Let's start with a simple tabular dataset which shows students' grades in an assignment:
Student | Grade |
---|---|
Joseph | 90 |
Sally | 92 |
Henry | 37 |
Lisa | 52 |
Alex | 43 |
Michael | 65 |
Code to create this table in Python:
import pandas as pd

# Build the example table of students and their grades
dataset = {
    'Student': ['Joseph', 'Sally', 'Henry', 'Lisa', 'Alex', 'Michael'],
    'Grade': [90, 92, 37, 52, 43, 65],
}
df = pd.DataFrame(data=dataset)
Let's assume that students do not get their total grade returned to them, but instead get either a "Pass" or a "Fail" as their outcome, where a grade of 50 or higher counts as a "Pass". We need to add a column named "Outcome" which will show either "Pass" or "Fail" depending on the value of their grade.
We can use a lambda function to do this as follows:
df['Outcome'] = df['Grade'].apply(lambda x: 'Pass' if x >= 50 else 'Fail')
In this code:
- The new column "Outcome" will be created based on the column "Grade";
- The column "Grade" will have the lambda function applied to it;
- The lambda function is called once for each value in "Grade", which it receives as the variable x. It returns "Pass" if x is greater than or equal to 50, and "Fail" otherwise.
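The steps in this list can also be spelled out by hand. The sketch below rebuilds the same table and compares the one-line apply call with an explicit loop. The names with_lambda and outcomes are only for illustration, and the loop is there to show what the lambda does for each value, not as a recommended way to write it.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Joseph', 'Sally', 'Henry', 'Lisa', 'Alex', 'Michael'],
    'Grade': [90, 92, 37, 52, 43, 65],
})

# One-line version: apply calls the lambda once per grade
with_lambda = df['Grade'].apply(lambda x: 'Pass' if x >= 50 else 'Fail')

# Spelled out: visit each grade in turn and collect the outcomes
outcomes = []
for grade in df['Grade']:
    if grade >= 50:
        outcomes.append('Pass')
    else:
        outcomes.append('Fail')

# Both approaches produce the same sequence of outcomes
print(with_lambda.tolist() == outcomes)
```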
This one line of code will return the following updated dataframe:
Student | Grade | Outcome |
---|---|---|
Joseph | 90 | Pass |
Sally | 92 | Pass |
Henry | 37 | Fail |
Lisa | 52 | Pass |
Alex | 43 | Fail |
Michael | 65 | Pass |
Lambda functions can also be applied to multiple columns in a single dataframe. Let's assume that the class takes two grades from the students, and a student will only pass if their average over both grades is at least 50.
import pandas as pd

# Same students, now with two grades each
dataset = {
    'Student': ['Joseph', 'Sally', 'Henry', 'Lisa', 'Alex', 'Michael'],
    'Grade1': [90, 92, 37, 52, 43, 65],
    'Grade2': [87, 43, 52, 64, 57, 53],
}
df = pd.DataFrame(data=dataset)
df['Outcome'] = df.apply(lambda x: 'Pass' if (x['Grade1'] + x['Grade2'])/2 >= 50 else 'Fail', axis=1)
In this code the variable 'x' is no longer a single value. Because apply is called on the whole dataframe with "axis=1", x is one full row at a time, so we need to specify which columns we are taking values from (e.g. df['Grade1'] becomes x['Grade1']).
By default, DataFrame.apply uses "axis=0" and passes each column to the function, so "axis=1" needs to be passed whenever the intention is to apply the function to every row.
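To see what the axis argument changes, the sketch below runs apply over the two grade columns in both directions. The names grades, row_means and col_means are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Joseph', 'Sally', 'Henry', 'Lisa', 'Alex', 'Michael'],
    'Grade1': [90, 92, 37, 52, 43, 65],
    'Grade2': [87, 43, 52, 64, 57, 53],
})
grades = df[['Grade1', 'Grade2']]

# axis=1: x is one row at a time, so we get one result per student
row_means = grades.apply(lambda x: (x['Grade1'] + x['Grade2']) / 2, axis=1)

# axis=0 (the default): x is one column at a time, so we get one result per column
col_means = grades.apply(lambda x: x.mean())

print(row_means.tolist())      # six averages, one for each student
print(list(col_means.index))   # ['Grade1', 'Grade2']
```

With "axis=1" the result has one entry per row, which is why it can be assigned straight back to the dataframe as a new column.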
This code produces the following result:
Student | Grade1 | Grade2 | Outcome |
---|---|---|---|
Joseph | 90 | 87 | Pass |
Sally | 92 | 43 | Pass |
Henry | 37 | 52 | Fail |
Lisa | 52 | 64 | Pass |
Alex | 43 | 57 | Pass |
Michael | 65 | 53 | Pass |
Lambda functions can be a very useful way to minimise code for relatively simple calculations, but whether to use one should depend on the use case. When calculations become more complicated, a custom function might be more readable.
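As a rough illustration of where a custom function starts to read better, suppose the class also awarded a "Distinction" for an average of 70 or more. That extra band and its threshold are hypothetical and not part of the example above, and the function name band is made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Joseph', 'Sally', 'Henry', 'Lisa', 'Alex', 'Michael'],
    'Grade1': [90, 92, 37, 52, 43, 65],
    'Grade2': [87, 43, 52, 64, 57, 53],
})

def band(row):
    """Return a grade band based on the average of both grades."""
    average = (row['Grade1'] + row['Grade2']) / 2
    if average >= 70:  # hypothetical threshold for a third outcome
        return 'Distinction'
    if average >= 50:
        return 'Pass'
    return 'Fail'

# A named function is passed to apply in the same way as a lambda
df['Band'] = df.apply(band, axis=1)
print(df)
```

The same logic could be squeezed into a lambda with a nested conditional, but the named function gives the average its own line and leaves room for a docstring.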
For more information, check out the pandas documentation on DataFrame.apply.