Pandas GroupBy Multiple Columns Explained - Spark By {Examples} (2024)

How do you group by multiple columns in a pandas DataFrame and compute multiple aggregations? groupby() accepts a list of column names to group by several columns at once, and the aggregate functions let you apply one or more aggregations at the same time.

1. Quick Examples of GroupBy Multiple Columns

Following are examples of how to groupby on multiple columns & apply multiple aggregations.

# Quick Examples

# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)

# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)

# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

2. Pandas GroupBy Multiple Columns Example

When working on real-world projects with a Pandas DataFrame, you often need to group by multiple columns. You can do so by passing a list of column names to the DataFrame.groupby() function. Let's create a DataFrame to understand this with examples.

import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee':[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

# Output:
# Create DataFrame:
#     Courses    Fee Duration  Discount
# 0    Spark  20000    30day      1000
# 1  PySpark  25000   40days     2300
# 2   Hadoop  26000   35days     1200
# 3   Python  22000   40days     2500
# 4  PySpark  25000   60days     2000
# 5    Spark  20000   60days     2000
# 6    Spark  35000   70days     3000

Now let’s do a group on multiple columns and then calculate count aggregation.

# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print("After grouping by multiple columns:\n", result)

Yields below output. Since count() is applied to every remaining column, all of the result columns hold the same values.

# Output:
# After grouping by multiple columns:
#                Duration  Discount
# Courses Fee
# Hadoop  26000         1         1
# PySpark 25000         2         2
# Python  22000         1         1
# Spark   20000         2         2
#         35000         1         1

So when you only need a count, it is enough to select a single column; you can even select one of the grouping columns.

# Group by multiple columns and get
# count of one of the grouping columns
result = df.groupby(['Courses','Fee'])['Courses'].count()
print("Get count of one of the grouping columns:\n", result)

# Output:
# Get count of one of the grouping columns:
# Courses  Fee
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Courses, dtype: int64
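If you prefer a flat DataFrame over a MultiIndex result, one common approach is size() followed by reset_index(), which produces a named count column. This is a small sketch using the same sample data; the column name Count is my own choice:

```python
import pandas as pd

# Same sample data as in the article (Courses and Fee columns only)
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
})

# size() counts rows per group; reset_index(name=...) turns the
# MultiIndex result into a flat DataFrame with a named count column
result = df.groupby(['Courses', 'Fee']).size().reset_index(name='Count')
print(result)
```

Note that size() counts all rows in each group (including NaN), while count() counts only non-null values.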

3. Using aggregate()

Alternatively, you can use the aggregate() function, which takes the name of the aggregation function as a string parameter.

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Courses'].aggregate('count')
print("After grouping by multiple columns:\n", result)

Yields below output.

# Output:
# After grouping by multiple columns:
# Courses  Fee
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Courses, dtype: int64
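Besides a string name, aggregate() also accepts a callable. As a quick hedged sketch, the built-in len gives the same per-group sizes as 'count' here because this sample data has no missing values:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
})

# 'count' passed as a string, and the built-in len passed as a callable;
# with no missing values both give the same per-group sizes
by_string = df.groupby(['Courses', 'Fee'])['Courses'].aggregate('count')
by_callable = df.groupby(['Courses', 'Fee'])['Courses'].aggregate(len)
```

Keep in mind the two differ when NaN values are present: 'count' skips them, while len counts every row in the group.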

4. Pandas Multiple Aggregations Example

You can also compute multiple aggregations at the same time in Pandas by passing a list of functions to aggregate().

# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations on multiple group columns:\n", result)

Yields below output.

# Output:
# After applying multiple aggregations on multiple group columns:
#            min    max
# Courses
# Hadoop   26000  26000
# PySpark  25000  25000
# Python   22000  22000
# Spark    20000  35000
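If you would rather name the result columns yourself, pandas also supports named aggregation (available since pandas 0.25), which produces flat columns instead of a MultiIndex. A minimal sketch computing the same min/max values; the names min_fee and max_fee are my own choice:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
})

# Each keyword names an output column; the value is (source column, function)
result = df.groupby('Courses').agg(
    min_fee=('Fee', 'min'),
    max_fee=('Fee', 'max'),
)
print(result)
```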

The above example calculates min and max on the Fee column. Let’s extend this to compute different aggregations on different columns.

Note that applying multiple aggregations to a single column in a pandas DataFrame results in a MultiIndex on the columns.

# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on single group column:\n", result)

Yields below output. Notice that this creates MultiIndex columns. Working with multi-indexed columns is not easy, so I'd recommend flattening them by renaming the columns.

# Output:
# After applying multiple aggregations on single group column:
#          Duration    Fee
#             count    min    max
# Courses
# Hadoop          1  26000  26000
# PySpark         2  25000  25000
# Python          1  22000  22000
# Spark           3  20000  35000
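One common way to flatten the MultiIndex columns is to join each (column, function) pair into a single name. This is a sketch under that assumption; the joined names such as Fee_min are my own convention:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '60days', '70days'],
})

result = df.groupby('Courses').aggregate({'Duration': 'count', 'Fee': ['min', 'max']})

# Join each ('column', 'function') pair into one name, e.g. ('Fee', 'min') -> 'Fee_min';
# rstrip('_') guards against an empty second level
result.columns = ['_'.join(col).rstrip('_') for col in result.columns]
print(result)
```

After this rename, the result behaves like an ordinary single-level DataFrame, which makes selecting, sorting, and merging much simpler.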

Conclusion

In this article, you have learned how to group DataFrame rows by multiple columns and also learned how to compute different aggregations on a column.


