How to group by multiple columns in a pandas DataFrame and compute multiple aggregations? groupby() can take a list of column names to group by multiple columns, and the aggregate functions can apply one or more aggregations at the same time.
1. Quick Examples of GroupBy Multiple Columns
Following are quick examples of how to group by multiple columns and apply multiple aggregations.
# Quick Examples

# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)

# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)

# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
2. Pandas GroupBy Multiple Columns Example
Most of the time when you are working on a real-world project in pandas, you are required to group by multiple columns. You can do so by passing a list of column names to the DataFrame.groupby() function. Let's create a DataFrame to understand this with examples.
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee':[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.

# Output:
# Create DataFrame:
#     Courses    Fee Duration  Discount
# 0    Spark  20000    30day      1000
# 1  PySpark  25000   40days      2300
# 2   Hadoop  26000   35days      1200
# 3   Python  22000   40days      2500
# 4  PySpark  25000   60days      2000
# 5    Spark  20000   60days      2000
# 6    Spark  35000   70days      3000
Now let’s group by multiple columns and then calculate the count aggregation.
# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print("After grouping by multiple columns:\n", result)
Yields below output. When you apply count() on the entire DataFrame, pretty much all columns will have the same values.

# Output:
# After grouping by multiple columns:
#                 Duration  Discount
# Courses Fee                       
# Hadoop  26000          1         1
# PySpark 25000          2         2
# Python  22000          1         1
# Spark   20000          2         2
#         35000          1         1
So, when you just want a count, select a single column; you can even select one of the grouping columns.
# Group by multiple columns and get
# count of one of the grouping columns
result = df.groupby(['Courses','Fee'])['Courses'].count()
print("Get count of one of the grouping column:\n", result)

# Output:
# Get count of one of the grouping column:
# Courses  Fee  
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Courses, dtype: int64
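Note that the result above is a Series with a MultiIndex built from the group keys. If you prefer the group keys as regular columns, a minimal reset_index() sketch (the `Count` column name here is my own choice, not from the article):

```python
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee':[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

# Count per (Courses, Fee) group as a MultiIndex Series
result = df.groupby(['Courses','Fee'])['Courses'].count()

# Turn the group keys back into regular columns;
# name= gives the counted values their own column name
flat = result.reset_index(name='Count')
print(flat)
```

This yields a plain DataFrame with Courses, Fee, and Count columns, which is often easier to work with downstream.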
2.1 Using aggregate()
Alternatively, you can also use the aggregate() function. This takes the aggregation function name, for example count, as a string parameter.
# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print("After grouping by multiple columns:\n", result)
Yields below output.
# Output:
# After grouping by multiple columns:
# Courses  Fee  
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Duration, dtype: int64
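Besides a string name, aggregate() also accepts a plain Python callable; a minimal sketch of the same count done with a lambda (the lambda is my own illustration, not from the article):

```python
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee':[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

# A callable receives each group's column as a Series;
# s.count() counts its non-NA values, same as the string 'count'
result = df.groupby(['Courses','Fee'])['Duration'].aggregate(lambda s: s.count())
print(result)
```

The string form is preferred when a built-in name exists, since pandas can dispatch it to an optimized implementation; a callable is useful for custom aggregations.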
3. Pandas Multiple Aggregations Example
You can also compute multiple aggregations at the same time in pandas by passing a list of function names to aggregate().
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations on multiple group columns:\n", result)
Yields below output.
# Output:
# After applying multiple aggregations on multiple group columns:
#            min    max
# Courses              
# Hadoop   26000  26000
# PySpark  25000  25000
# Python   22000  22000
# Spark    20000  35000
The above example calculates min and max on the Fee column. Let’s extend this to compute different aggregations on different columns.
Note that applying multiple aggregations to a single column in a pandas DataFrame will result in a MultiIndex.
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on single group column:\n", result)
Yields below output. Notice that this creates a MultiIndex on the columns. Working with multi-indexed columns is not easy, so I’d recommend flattening them by renaming the columns.
# Output:
# After applying multiple aggregations on single group column:
#          Duration    Fee       
#             count    min    max
# Courses                        
# Hadoop          1  26000  26000
# PySpark         2  25000  25000
# Python          1  22000  22000
# Spark           3  20000  35000
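A minimal sketch of the flattening mentioned above, joining the two column levels with an underscore (the separator and resulting names like Fee_min are my own choice, not from the article):

```python
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee':[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})

# Each MultiIndex column is a (level0, level1) tuple,
# e.g. ('Fee', 'min'); join the levels into a single flat name
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```

After this, the columns are Duration_count, Fee_min, and Fee_max, so they can be selected with ordinary single-level indexing like result['Fee_min'].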
Conclusion
In this article, you have learned how to group DataFrame rows by multiple columns and how to compute different aggregations on one or more columns.