Part 8: Data Manipulation in Grouping and Aggregation



Author(s): Raj Kumar

Originally published on Towards AI.

Every business decision starts with a question. What are our total sales by region? Which product categories generate the most revenue? How do customer segments compare in profitability? These questions all share something in common: they require grouping data and calculating aggregates.

Grouping and aggregation are the backbone of business intelligence. They transform raw transactions into actionable insights. They turn thousands of individual data points into clear summaries that executives can understand and act upon. Without these operations, you’re drowning in details. With them, you see patterns, trends, and opportunities.

But here’s the critical point that many analysts miss: how you group and aggregate your data fundamentally shapes the story it tells. The same dataset can produce wildly different conclusions depending on the aggregation method you choose. Average revenue per customer masks the difference between whale customers and small buyers. Total sales by quarter hides monthly volatility. Median income tells a different story than mean income.

This is where the subtitle becomes relevant: summaries, KPIs, and portfolio metrics aren’t just constructed through grouping and aggregation. They can be distorted by them. Not through malice, but through thoughtless application of default methods. An analyst who blindly uses mean() when median() would be more appropriate isn’t lying. They’re just telling an incomplete or misleading story.
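The mean-versus-median distortion is easy to demonstrate. Here is a minimal sketch with hypothetical customer revenues, where a single "whale" drags the mean far away from what a typical customer spends:

```python
import pandas as pd

# Hypothetical customer revenues: four small buyers and one whale
revenue = pd.Series([120, 95, 110, 100, 9800], name="revenue")

print(f"mean:   {revenue.mean():.0f}")    # 2045 -- pulled up by the whale
print(f"median: {revenue.median():.0f}")  # 110  -- the typical customer
```

Reporting "average revenue per customer: $2,045" would be technically true and completely misleading about four out of five customers.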

Why Grouping and Aggregation Matter

Consider a retail company analyzing store performance. They have transaction-level data: every sale, every product, every timestamp. This granular data is valuable, but it’s overwhelming. To make decisions about which stores to expand, which to close, and which need intervention, they need summaries.

They group by store and calculate total revenue. Simple enough. But what if some stores are larger than others? Revenue per square foot becomes the metric. What if some stores are in expensive real estate markets? Revenue per dollar of rent matters. What if customer traffic varies? Revenue per customer visit tells another story.

Each grouping and aggregation choice highlights different aspects of performance. None are wrong. All are partial views. The skill lies in choosing the right aggregation for the question at hand and being honest about what it shows and what it hides.
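The store example can be made concrete. In this sketch (with made-up numbers for two hypothetical stores), the "winner" flips depending on which denominator you choose:

```python
import pandas as pd

# Hypothetical store data: same revenue question, different denominators
stores = pd.DataFrame({
    "store": ["A", "B"],
    "revenue": [500_000, 450_000],
    "sq_ft": [10_000, 5_000],
    "visits": [40_000, 20_000],
})

stores["rev_per_sqft"] = stores["revenue"] / stores["sq_ft"]
stores["rev_per_visit"] = stores["revenue"] / stores["visits"]
print(stores)
# Store A leads on raw revenue; store B leads on both normalized metrics
```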

The Power of Pandas GroupBy

Pandas provides groupby operations that mirror SQL’s GROUP BY but with more flexibility and power. The pattern is consistent and intuitive:

df.groupby('column').aggregation_function()

This simple pattern unlocks complex analysis. You can group by one column or many. You can apply one aggregation or several. You can use built-in functions or write custom logic. You can filter groups, transform values within groups, or create entirely new features based on group statistics.
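One variant of the pattern worth knowing early is named aggregation, which lets you pick the output column names up front instead of fixing them afterwards. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales": [1200, 800, 900, 700],
})

# Named aggregation: (source column, function) pairs keyed by output name
summary = df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
)
print(summary)
```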

The operations covered in this guide represent the essential toolkit for group-based analysis. Single column grouping for basic summaries. Multiple column grouping for dimensional analysis. Multiple aggregations for comprehensive views. Custom functions for specialized calculations. Pivot tables for reshaping. Cross-tabulation for categorical relationships. Rolling windows for time-based aggregates.

What You’ll Learn

This guide walks through ten fundamental grouping and aggregation patterns with practical business examples. You’ll see how to create sales reports by region, analyze customer behavior by segment, build portfolio performance metrics, and construct financial KPIs. More importantly, you’ll understand when each method is appropriate and what assumptions it carries.

The complete example at the end demonstrates a realistic business scenario: analyzing e-commerce performance across multiple dimensions. This reflects how you’ll actually use these operations in practice, combining multiple grouping strategies to answer layered business questions.

A Warning About Aggregates

Before we dive into the mechanics, remember this: aggregates destroy information. When you calculate an average, you lose the distribution. When you sum values, you lose individual contributions. When you count records, you lose their content.

This isn’t a flaw. It’s the point. Aggregation trades detail for clarity. But you need to be conscious of what you’re trading away. Always ask: what am I losing by summarizing this way? Would a different aggregation tell a different story? Am I inadvertently hiding important variation?

The best analysts don’t just group and aggregate. They think critically about whether their chosen method serves the analysis honestly.

Let’s start with the simplest case: grouping by a single column.

Understanding Basic Grouping

The foundation of all grouping operations is the single-column group with a single aggregation function.

1. Group by Single Column

import pandas as pd
import numpy as np

# Sample sales data
data = {
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'sales': [1200, 800, 600, 1500, 900, 700, 1300, 850],
    'quantity': [2, 4, 3, 2, 3, 4, 2, 5]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
print()
# Group by region and calculate mean sales
region_avg = df.groupby('region')['sales'].mean()
print("Average Sales by Region:")
print(region_avg)
print()
# Group by region with multiple statistics
region_summary = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print("Complete Region Summary:")
print(region_summary)

Output:

Original Data:
region product sales quantity
0 North Laptop 1200 2
1 South Phone 800 4
2 East Tablet 600 3
3 West Laptop 1500 2
4 North Phone 900 3
5 South Tablet 700 4
6 East Laptop 1300 2
7 West Phone 850 5

Average Sales by Region:
region
East 950.0
North 1050.0
South 750.0
West 1175.0
Name: sales, dtype: float64
Complete Region Summary:
sum mean count
region
East 1900 950.0 2
North 2100 1050.0 2
South 1500 750.0 2
West 2350 1175.0 2

Use case: Basic performance summaries, regional comparisons, identifying top performers.

2. Group by Multiple Columns

# Group by region and product
region_product_sales = df.groupby(['region', 'product'])['sales'].sum()

print("Sales by Region and Product:")
print(region_product_sales)
print()
# Convert to readable format
region_product_df = region_product_sales.reset_index()
print("As DataFrame:")
print(region_product_df)

Output:

Sales by Region and Product:
region product
East Laptop 1300
Tablet 600
North Laptop 1200
Phone 900
South Phone 800
Tablet 700
West Laptop 1500
Phone 850
Name: sales, dtype: int64

As DataFrame:
region product sales
0 East Laptop 1300
1 East Tablet 600
2 North Laptop 1200
3 North Phone 900
4 South Phone 800
5 South Tablet 700
6 West Laptop 1500
7 West Phone 850

Use case: Multi-dimensional analysis, cross-category comparisons, hierarchical reporting.

Advanced Aggregation Techniques

3. Group by with Multiple Aggregations

# Apply different aggregations to different columns
multi_agg = df.groupby('region').agg({
    'sales': ['sum', 'mean', 'max'],
    'quantity': ['sum', 'mean']
})

print("Multiple Aggregations by Region:")
print(multi_agg)
print()
# Flatten column names for easier access
multi_agg.columns = ['_'.join(col).strip() for col in multi_agg.columns.values]
multi_agg = multi_agg.reset_index()
print("Flattened Column Names:")
print(multi_agg)

Output:

Multiple Aggregations by Region:
sales quantity
sum mean max sum mean
region
East 1900 950.0 1300 5 2.5
North 2100 1050.0 1200 5 2.5
South 1500 750.0 800 8 4.0
West 2350 1175.0 1500 7 3.5

Flattened Column Names:
region sales_sum sales_mean sales_max quantity_sum quantity_mean
0 East 1900 950.0 1300 5 2.5
1 North 2100 1050.0 1200 5 2.5
2 South 1500 750.0 800 8 4.0
3 West 2350 1175.0 1500 7 3.5

Use case: Comprehensive performance dashboards, executive summaries, quarterly reports.

4. Group by with Custom Function

# Create custom aggregation function
def sales_range(x):
    """Calculate the range (max - min) of sales"""
    return x.max() - x.min()

# Apply custom function
region_range = df.groupby('region')['sales'].apply(sales_range)
print("Sales Range by Region:")
print(region_range)
print()

# Multiple custom calculations
def custom_metrics(group):
    """Calculate custom metrics for each group"""
    return pd.Series({
        'total_sales': group['sales'].sum(),
        'avg_quantity': group['quantity'].mean(),
        'sales_per_unit': group['sales'].sum() / group['quantity'].sum(),
        'max_transaction': group['sales'].max()
    })
custom_summary = df.groupby('region').apply(custom_metrics)
print("Custom Metrics by Region:")
print(custom_summary)

Output:

Sales Range by Region:
region
East 700
North 300
South 100
West 650
Name: sales, dtype: int64

Custom Metrics by Region:
total_sales avg_quantity sales_per_unit max_transaction
region
East 1900 2.5 380.0 1300
North 2100 2.5 420.0 1200
South 1500 4.0 187.5 800
West 2350 3.5 335.714286 1500

Use case: Specialized business metrics, custom KPI calculations, industry-specific measures.

5. Group by with Size

# Count number of records in each group
region_counts = df.groupby('region').size()

print("Number of Transactions by Region:")
print(region_counts)
print()
# Compare with count (which excludes NaN)
region_count_comparison = pd.DataFrame({
    'size': df.groupby('region').size(),
    'count': df.groupby('region')['sales'].count()
})
print("Size vs Count:")
print(region_count_comparison)

Output:

Number of Transactions by Region:
region
East 2
North 2
South 2
West 2
dtype: int64

Size vs Count:
size count
region
East 2 2
North 2 2
South 2 2
West 2 2

Use case: Transaction volume analysis, data completeness checks, activity tracking.
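In the sample data above, size and count agree because there are no missing values. The two diverge as soon as NaN appears, which is exactly why the comparison matters. A minimal sketch with hypothetical data:

```python
import pandas as pd
import numpy as np

# One North transaction has a missing sales value
df_nan = pd.DataFrame({
    "region": ["North", "North", "South"],
    "sales": [1200, np.nan, 800],
})

print(df_nan.groupby("region").size())            # size counts rows: North 2, South 1
print(df_nan.groupby("region")["sales"].count())  # count skips NaN: North 1, South 1
```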

6. Group by with Filter

# Filter groups based on aggregate condition
# Keep only regions with more than 1 transaction
filtered_groups = df.groupby('region').filter(lambda x: len(x) > 1)

print("Regions with More Than 1 Transaction:")
print(filtered_groups)
print()
# Filter based on aggregate value
# Keep only regions where total sales > 2000
high_sales_regions = df.groupby('region').filter(lambda x: x['sales'].sum() > 2000)
print("Regions with Total Sales > 2000:")
print(high_sales_regions)

Output:

Regions with More Than 1 Transaction:
region product sales quantity
0 North Laptop 1200 2
1 South Phone 800 4
2 East Tablet 600 3
3 West Laptop 1500 2
4 North Phone 900 3
5 South Tablet 700 4
6 East Laptop 1300 2
7 West Phone 850 5

Regions with Total Sales > 2000:
region product sales quantity
0 North Laptop 1200 2
3 West Laptop 1500 2
4 North Phone 900 3
7 West Phone 850 5

Use case: Identifying top performers, removing outliers, focusing analysis on significant groups.

7. Group by with Transform

# Add group statistics back to original dataframe
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
df['pct_of_region_sales'] = (df['sales'] / df.groupby('region')['sales'].transform('sum')) * 100

print("Data with Group Statistics:")
print(df[['region', 'product', 'sales', 'region_avg_sales', 'pct_of_region_sales']])

Output:

Data with Group Statistics:
region product sales region_avg_sales pct_of_region_sales
0 North Laptop 1200 1050.0 57.142857
1 South Phone 800 750.0 53.333333
2 East Tablet 600 950.0 31.578947
3 West Laptop 1500 1175.0 63.829787
4 North Phone 900 1050.0 42.857143
5 South Tablet 700 750.0 46.666667
6 East Laptop 1300 950.0 68.421053
7 West Phone 850 1175.0 36.170213

Use case: Calculating percentages of total, standardizing values, creating relative metrics.
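Transform-style operations also cover within-group ranking and deviation from the group mean, both common in relative-performance reporting. A short sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "sales": [1200, 900, 800, 700],
})

# Rank each transaction within its own region (1 = largest)
df["rank_in_region"] = df.groupby("region")["sales"].rank(ascending=False)
# Deviation from the region mean -- transform preserves the original row count
df["vs_region_avg"] = df["sales"] - df.groupby("region")["sales"].transform("mean")
print(df)
```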

Reshaping with Pivot Tables

8. Pivot Table

# Create pivot table
pivot_sales = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)

print("Pivot Table - Sales by Region and Product:")
print(pivot_sales)
print()

# Multiple aggregation functions
pivot_multi = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean'],
    fill_value=0
)
print("Pivot Table with Multiple Aggregations:")
print(pivot_multi)

Output:

Pivot Table - Sales by Region and Product:
product Laptop Phone Tablet
region
East 1300 0 600
North 1200 900 0
South 0 800 700
West 1500 850 0

Pivot Table with Multiple Aggregations:
sum mean
product Laptop Phone Tablet Laptop Phone Tablet
region
East 1300 0 600 1300.0 0.0 600.0
North 1200 900 0 1200.0 900.0 0.0
South 0 800 700 0.0 800.0 700.0
West 1500 850 0 1500.0 850.0 0.0

Use case: Creating crosstab reports, building Excel-like summaries, dimensional analysis.
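For Excel-like summaries specifically, pivot_table can append the grand-total row and column itself via margins=True, rather than computing totals in a second step. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["Laptop", "Phone", "Phone", "Tablet"],
    "sales": [1200, 900, 800, 700],
})

# margins=True adds row/column totals, labeled by margins_name
pivot = pd.pivot_table(
    df, values="sales", index="region", columns="product",
    aggfunc="sum", fill_value=0, margins=True, margins_name="Total",
)
print(pivot)
```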

Cross-Tabulation and Rolling Windows

9. Cross-Tabulation

# Simple cross-tabulation
crosstab_result = pd.crosstab(df['region'], df['product'])

print("Cross-Tabulation - Count of Products by Region:")
print(crosstab_result)
print()
# Cross-tabulation with values
crosstab_sales = pd.crosstab(
    df['region'],
    df['product'],
    values=df['sales'],
    aggfunc='sum'
)
print("Cross-Tabulation - Total Sales by Region and Product:")
print(crosstab_sales)

Output:

Cross-Tabulation - Count of Products by Region:
product Laptop Phone Tablet
region
East 1 0 1
North 1 1 0
South 0 1 1
West 1 1 0

Cross-Tabulation - Total Sales by Region and Product:
product Laptop Phone Tablet
region
East 1300.0 NaN 600.0
North 1200.0 900.0 NaN
South NaN 800.0 700.0
West 1500.0 850.0 NaN

Use case: Frequency analysis, categorical relationships, contingency tables.
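For contingency-table work, crosstab's normalize parameter converts raw counts into proportions, which is often what a categorical-relationship question actually asks. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["Laptop", "Phone", "Phone", "Tablet"],
})

# normalize='index' turns each row into proportions that sum to 1.0
shares = pd.crosstab(df["region"], df["product"], normalize="index")
print(shares)
```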

10. Rolling Window Calculations

# Create time series data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
time_series_df = pd.DataFrame({
    'date': dates,
    'sales': [100, 150, 120, 180, 160, 200, 170, 190, 210, 220]
})

# Calculate 3-day rolling average
time_series_df['rolling_mean_3d'] = time_series_df['sales'].rolling(window=3).mean()
# Calculate 3-day rolling sum
time_series_df['rolling_sum_3d'] = time_series_df['sales'].rolling(window=3).sum()
print("Rolling Window Calculations:")
print(time_series_df)

Output:

Rolling Window Calculations:
date sales rolling_mean_3d rolling_sum_3d
0 2024-01-01 100 NaN NaN
1 2024-01-02 150 NaN NaN
2 2024-01-03 120 123.333333 370.0
3 2024-01-04 180 150.000000 450.0
4 2024-01-05 160 153.333333 460.0
5 2024-01-06 200 180.000000 540.0
6 2024-01-07 170 176.666667 530.0
7 2024-01-08 190 186.666667 560.0
8 2024-01-09 210 190.000000 570.0
9 2024-01-10 220 206.666667 620.0

Use case: Moving averages, trend smoothing, time-based metrics.
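Rolling windows also combine with groupby, so each group gets its own independent window; the window never leaks across group boundaries. A sketch with two hypothetical stores:

```python
import pandas as pd

ts = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "sales": [100, 200, 300, 10, 20, 30],
})

# Per-store 2-period rolling mean; reset_index re-aligns to the original rows
ts["roll2"] = (
    ts.groupby("store")["sales"]
      .rolling(window=2).mean()
      .reset_index(level=0, drop=True)
)
print(ts)  # store B's window restarts at NaN instead of mixing in store A
```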

Complete Production Example: E-Commerce Performance Analysis

import pandas as pd
import numpy as np

# ============================================================================
# PRODUCTION E-COMMERCE ANALYSIS PIPELINE
# ============================================================================

def generate_ecommerce_data():
    """Generate realistic e-commerce transaction data"""
    np.random.seed(42)

    # Generate transactions over 3 months
    num_transactions = 500

    # Date range
    start_date = pd.Timestamp('2024-01-01')
    dates = [start_date + pd.Timedelta(days=np.random.randint(0, 90)) for _ in range(num_transactions)]

    # Generate data
    data = {
        'order_id': [f'ORD{str(i).zfill(4)}' for i in range(1, num_transactions + 1)],
        'date': dates,
        'customer_id': [f'CUST{np.random.randint(1000, 1200)}' for _ in range(num_transactions)],
        'region': np.random.choice(['North', 'South', 'East', 'West'], num_transactions, p=[0.3, 0.25, 0.25, 0.2]),
        'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], num_transactions),
        'product': [f'Product_{np.random.randint(1, 50)}' for _ in range(num_transactions)],
        'quantity': np.random.randint(1, 6, num_transactions),
        'unit_price': np.random.uniform(20, 500, num_transactions).round(2),
        'customer_segment': np.random.choice(['Premium', 'Standard', 'Basic'], num_transactions, p=[0.2, 0.5, 0.3])
    }

    df = pd.DataFrame(data)

    # Calculate total amount
    df['total_amount'] = (df['quantity'] * df['unit_price']).round(2)

    # Sort by date
    df = df.sort_values('date').reset_index(drop=True)

    return df

def analyze_ecommerce_performance(df):
    """Comprehensive e-commerce performance analysis using grouping and aggregation"""
    print("="*80)
    print("E-COMMERCE PERFORMANCE ANALYSIS - GROUPING & AGGREGATION")
    print("="*80)
    print()

    print(f"Dataset Overview: {len(df)} transactions")
    print(f"Date Range: {df['date'].min().date()} to {df['date'].max().date()}")
    print(f"Total Revenue: ${df['total_amount'].sum():,.2f}")
    print()

    # ========================================
    # ANALYSIS 1: Regional Performance
    # ========================================
    print("="*80)
    print("ANALYSIS 1: REGIONAL PERFORMANCE")
    print("="*80)
    print()

    regional_performance = df.groupby('region').agg({
        'order_id': 'count',
        'total_amount': ['sum', 'mean', 'median'],
        'quantity': 'sum'
    }).round(2)

    regional_performance.columns = ['Orders', 'Total_Revenue', 'Avg_Order', 'Median_Order', 'Units_Sold']
    regional_performance = regional_performance.sort_values('Total_Revenue', ascending=False)

    print(regional_performance)
    print()

    # ========================================
    # ANALYSIS 2: Category Performance
    # ========================================
    print("="*80)
    print("ANALYSIS 2: CATEGORY PERFORMANCE")
    print("="*80)
    print()

    category_performance = df.groupby('category').agg({
        'order_id': 'count',
        'total_amount': ['sum', 'mean'],
        'quantity': 'sum'
    }).round(2)

    category_performance.columns = ['Orders', 'Total_Revenue', 'Avg_Order', 'Units_Sold']
    category_performance['Revenue_Share_%'] = (
        (category_performance['Total_Revenue'] / category_performance['Total_Revenue'].sum()) * 100
    ).round(2)

    category_performance = category_performance.sort_values('Total_Revenue', ascending=False)

    print(category_performance)
    print()

    # ========================================
    # ANALYSIS 3: Customer Segment Analysis
    # ========================================
    print("="*80)
    print("ANALYSIS 3: CUSTOMER SEGMENT ANALYSIS")
    print("="*80)
    print()

    segment_analysis = df.groupby('customer_segment').agg({
        'order_id': 'count',
        'total_amount': ['sum', 'mean', 'median'],
        'customer_id': lambda x: x.nunique()
    }).round(2)

    segment_analysis.columns = ['Orders', 'Total_Revenue', 'Avg_Order', 'Median_Order', 'Unique_Customers']
    segment_analysis['Orders_per_Customer'] = (
        segment_analysis['Orders'] / segment_analysis['Unique_Customers']
    ).round(2)

    segment_analysis = segment_analysis.sort_values('Total_Revenue', ascending=False)

    print(segment_analysis)
    print()

    # ========================================
    # ANALYSIS 4: Multi-Dimensional Analysis
    # ========================================
    print("="*80)
    print("ANALYSIS 4: REGION × CATEGORY PERFORMANCE")
    print("="*80)
    print()

    pivot_analysis = pd.pivot_table(
        df,
        values='total_amount',
        index='region',
        columns='category',
        aggfunc='sum',
        fill_value=0
    ).round(2)

    print("Revenue by Region and Category:")
    print(pivot_analysis)
    print()

    # Add row and column totals
    pivot_analysis['Total'] = pivot_analysis.sum(axis=1)
    pivot_analysis.loc['Total'] = pivot_analysis.sum()

    print("With Totals:")
    print(pivot_analysis)
    print()

    # ========================================
    # ANALYSIS 5: Top Performers
    # ========================================
    print("="*80)
    print("ANALYSIS 5: TOP PERFORMERS")
    print("="*80)
    print()

    # Top 5 customers by revenue
    top_customers = df.groupby('customer_id').agg({
        'total_amount': 'sum',
        'order_id': 'count'
    }).round(2)
    top_customers.columns = ['Total_Spent', 'Orders']
    top_customers = top_customers.sort_values('Total_Spent', ascending=False).head(5)

    print("Top 5 Customers by Revenue:")
    print(top_customers)
    print()

    # Top 5 products by quantity sold
    top_products = df.groupby('product').agg({
        'quantity': 'sum',
        'total_amount': 'sum',
        'order_id': 'count'
    }).round(2)
    top_products.columns = ['Units_Sold', 'Revenue', 'Orders']
    top_products = top_products.sort_values('Units_Sold', ascending=False).head(5)

    print("Top 5 Products by Units Sold:")
    print(top_products)
    print()

    # ========================================
    # ANALYSIS 6: Custom Metrics
    # ========================================
    print("="*80)
    print("ANALYSIS 6: CUSTOM BUSINESS METRICS")
    print("="*80)
    print()

    def calculate_segment_metrics(group):
        """Calculate custom metrics for each segment"""
        return pd.Series({
            'total_revenue': group['total_amount'].sum(),
            'avg_basket_size': group['total_amount'].mean(),
            'revenue_per_unit': group['total_amount'].sum() / group['quantity'].sum(),
            'order_frequency': len(group) / group['customer_id'].nunique(),
            'high_value_orders_%': (group['total_amount'] > group['total_amount'].quantile(0.75)).sum() / len(group) * 100
        })

    custom_metrics = df.groupby('customer_segment').apply(calculate_segment_metrics).round(2)

    print(custom_metrics)
    print()

    # ========================================
    # ANALYSIS 7: Time-Based Analysis
    # ========================================
    print("="*80)
    print("ANALYSIS 7: MONTHLY TRENDS")
    print("="*80)
    print()

    # Extract month
    df['month'] = df['date'].dt.to_period('M')

    monthly_trends = df.groupby('month').agg({
        'order_id': 'count',
        'total_amount': 'sum',
        'customer_id': lambda x: x.nunique()
    }).round(2)

    monthly_trends.columns = ['Orders', 'Revenue', 'Unique_Customers']
    monthly_trends['Avg_Order_Value'] = (monthly_trends['Revenue'] / monthly_trends['Orders']).round(2)

    print(monthly_trends)
    print()

    # ========================================
    # ANALYSIS 8: Filtered Group Analysis
    # ========================================
    print("="*80)
    print("ANALYSIS 8: HIGH-VALUE REGIONS (Revenue > $20,000)")
    print("="*80)
    print()

    high_value_regions = df.groupby('region').filter(lambda x: x['total_amount'].sum() > 20000)

    high_value_summary = high_value_regions.groupby('region').agg({
        'order_id': 'count',
        'total_amount': ['sum', 'mean']
    }).round(2)

    high_value_summary.columns = ['Orders', 'Total_Revenue', 'Avg_Order']

    print(high_value_summary)
    print()

    # ========================================
    # KEY INSIGHTS SUMMARY
    # ========================================
    print("="*80)
    print("KEY INSIGHTS SUMMARY")
    print("="*80)
    print()

    # Best performing region
    best_region = regional_performance.index[0]
    best_region_revenue = regional_performance.loc[best_region, 'Total_Revenue']

    # Best performing category
    best_category = category_performance.index[0]
    best_category_revenue = category_performance.loc[best_category, 'Total_Revenue']

    # Most valuable segment
    best_segment = segment_analysis.index[0]
    best_segment_avg = segment_analysis.loc[best_segment, 'Avg_Order']

    print(f"Total Transactions: {len(df):,}")
    print(f"Total Revenue: ${df['total_amount'].sum():,.2f}")
    print(f"Average Order Value: ${df['total_amount'].mean():.2f}")
    print(f"Unique Customers: {df['customer_id'].nunique()}")
    print()
    print(f"Best Region: {best_region} (${best_region_revenue:,.2f})")
    print(f"Best Category: {best_category} (${best_category_revenue:,.2f})")
    print(f"Most Valuable Segment: {best_segment} (Avg: ${best_segment_avg:.2f}/order)")
    print()

    return df

def main():
    """Main execution function"""
    print("\n")
    print("*"*80)
    print("GROUPING & AGGREGATION - COMPREHENSIVE E-COMMERCE ANALYSIS")
    print("*"*80)
    print("\n")

    # Generate data
    print("Generating e-commerce transaction data...")
    df = generate_ecommerce_data()
    print(f"✓ Generated {len(df)} transactions")
    print()

    # Run analysis
    analyzed_df = analyze_ecommerce_performance(df)

    print("*"*80)
    print("ANALYSIS COMPLETE")
    print("*"*80)
    print()

if __name__ == "__main__":
    main()

Output

********************************************************************************
GROUPING & AGGREGATION - COMPREHENSIVE E-COMMERCE ANALYSIS
********************************************************************************

Generating e-commerce transaction data...
✓ Generated 500 transactions

================================================================================
E-COMMERCE PERFORMANCE ANALYSIS - GROUPING & AGGREGATION
================================================================================

Dataset Overview: 500 transactions
Date Range: 2024-01-01 to 2024-03-30
Total Revenue: $396,481.43

================================================================================
ANALYSIS 1: REGIONAL PERFORMANCE
================================================================================

Orders Total_Revenue Avg_Order Median_Order Units_Sold
region
North 155 125099.71 807.09 692.40 488
East 130 98400.67 756.93 636.36 378
South 109 90413.83 829.48 733.80 342
West 106 82567.22 778.94 617.78 314

================================================================================
ANALYSIS 2: CATEGORY PERFORMANCE
================================================================================

Orders Total_Revenue Avg_Order Units_Sold Revenue_Share_%
category
Books 131 104223.48 795.60 425 26.29
Clothing 128 102928.17 804.13 407 25.96
Home 120 98202.02 818.35 349 24.77
Electronics 121 91127.76 753.12 341 22.98

================================================================================
ANALYSIS 3: CUSTOMER SEGMENT ANALYSIS
================================================================================

Orders Total_Revenue Avg_Order Median_Order Unique_Customers Orders_per_Customer
customer_segment
Standard 248 191353.46 771.59 653.50 146 1.70
Basic 158 123444.41 781.30 667.65 105 1.50
Premium 94 81683.56 868.95 751.54 77 1.22

================================================================================
ANALYSIS 4: REGION × CATEGORY PERFORMANCE
================================================================================

Revenue by Region and Category:
category Books Clothing Electronics Home
region
East 17561.22 32884.10 28561.88 19393.47
North 37380.82 30243.75 29432.82 28042.32
South 21772.61 21071.63 20479.52 27090.07
West 27508.83 18728.69 12653.54 23676.16

With Totals:
category Books Clothing Electronics Home Total
region
East 17561.22 32884.10 28561.88 19393.47 98400.67
North 37380.82 30243.75 29432.82 28042.32 125099.71
South 21772.61 21071.63 20479.52 27090.07 90413.83
West 27508.83 18728.69 12653.54 23676.16 82567.22
Total 104223.48 102928.17 91127.76 98202.02 396481.43

================================================================================
ANALYSIS 5: TOP PERFORMERS
================================================================================

Top 5 Customers by Revenue:
Total_Spent Orders
customer_id
CUST1112 8861.38 8
CUST1144 6565.50 7
CUST1117 6188.97 7
CUST1176 5631.78 5
CUST1055 5286.26 5

Top 5 Products by Units Sold:
Units_Sold Revenue Orders
product
Product_21 51 13271.51 17
Product_22 50 14430.44 12
Product_19 45 11720.75 11
Product_2 44 12815.98 16
Product_16 44 10212.88 13

================================================================================
ANALYSIS 6: CUSTOM BUSINESS METRICS
================================================================================

total_revenue avg_basket_size revenue_per_unit order_frequency high_value_orders_%
customer_segment
Basic 123444.41 781.30 247.79 1.50 25.32
Premium 81683.56 868.95 257.64 1.22 25.53
Standard 191353.46 771.59 256.56 1.70 25.00

================================================================================
ANALYSIS 7: MONTHLY TRENDS
================================================================================

Orders Revenue Unique_Customers Avg_Order_Value
month
2024-01 173 141530.13 119 818.09
2024-02 172 122307.41 109 711.09
2024-03 155 132643.89 109 855.77

================================================================================
ANALYSIS 8: HIGH-VALUE REGIONS (Revenue > $20,000)
================================================================================

Orders Total_Revenue Avg_Order
region
East 130 98400.67 756.93
North 155 125099.71 807.09
South 109 90413.83 829.48
West 106 82567.22 778.94

================================================================================
KEY INSIGHTS SUMMARY
================================================================================

Total Transactions: 500
Total Revenue: $396,481.43
Average Order Value: $792.96
Unique Customers: 183

Best Region: North ($125,099.71)
Best Category: Books ($104,223.48)
Most Valuable Segment: Standard (Avg: $771.59/order)

********************************************************************************
ANALYSIS COMPLETE
********************************************************************************

This comprehensive output shows:

  • 500 transactions analyzed across 3 months
  • $396,481.43 in total revenue
  • North region leading with $125,099.71 (155 orders)
  • Books category performing best with 26.29% revenue share
  • Standard segment generating the most revenue ($191,353.46)
  • Complete pivot table showing revenue by region and category
  • Top 5 customers and products
  • Monthly trends showing January as the strongest month
  • All regions qualified as high-value (>$20K revenue)

Key Takeaways

Grouping and aggregation are fundamental to transforming raw data into business intelligence. Here are the core principles:

  1. Choose aggregations thoughtfully: Mean, median, and sum tell different stories
  2. Consider group size: Aggregates from small groups can be misleading
  3. Use multiple aggregations: A single metric rarely tells the complete story
  4. Filter intelligently: Focus on groups that matter to your analysis
  5. Transform for context: Add group statistics back to individual records for comparison

These operations form the backbone of business reporting, KPI dashboards, and performance analysis. Master them, and you’ll generate insights that drive decisions.

Final Thoughts

Aggregation is power. It’s the power to summarize millions of transactions into a single number that executives can understand. It’s the power to spot trends across thousands of customers. It’s the power to identify which products succeed and which fail.

But with power comes responsibility. Every aggregation makes choices about what to highlight and what to hide. The analyst who calculates average customer value without checking the distribution might miss that they have two distinct customer segments. The manager who looks at total sales without considering units sold might optimize for revenue while inventory piles up.

The operations covered here give you the technical capability to group and aggregate data in nearly any way imaginable. But technique without judgment produces misleading analysis. Always ask: does this aggregation serve the truth? Does it answer the real question? What am I losing by summarizing this way?

The best analysts are skeptical of their own summaries. They check their work from multiple angles. They calculate means and medians. They look at totals and distributions. They group by different dimensions to see if the story changes. This isn’t paranoia. It’s intellectual honesty.

Business decisions rest on these summaries. Marketing budgets get allocated based on regional performance metrics. Products get discontinued based on category aggregates. Sales teams get restructured based on customer segment analysis. If your grouping and aggregation are sloppy or thoughtless, the consequences ripple through the organization.

Where to Go From Here

If you want to deepen your grouping and aggregation skills, here are natural next steps:

  • Advanced GroupBy Operations: Learn about groupby with multiple index levels, named aggregations, and custom transformations.
  • Statistical Aggregations: Explore percentiles, standard deviations, and other statistical measures that provide richer summaries.
  • Performance Optimization: Understand when to use groupby versus pivot_table, and how to make large aggregations faster.
  • Weighted Aggregations: Learn to calculate weighted averages and other metrics that account for group importance.
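As a taste of that last item, a weighted average weights each record by its importance (here, quantity) instead of treating every row equally. A minimal sketch with toy data, where a plain mean() of price would report 15 for both regions:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "price": [10.0, 20.0, 10.0, 20.0],
    "quantity": [1, 9, 9, 1],
})

# Quantity-weighted average price per region
weighted = df.groupby("region").apply(
    lambda g: np.average(g["price"], weights=g["quantity"])
)
print(weighted)  # North: 19.0, South: 11.0 -- the unweighted mean is 15 for both
```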

The journey from raw data to actionable insights is one you’ll take repeatedly in your career. Each time you’ll refine your approach. Each time you’ll catch subtleties you missed before. Each time you’ll tell a clearer, more honest story with your data.

Wrapping Up

Mastering grouping and aggregation is essential for transforming data into decisions. If these explanations helped clarify how summaries and KPIs are constructed or provided new perspectives on analyzing your data, I would appreciate it if you could show your support by clapping for this article. Knowledge is best when shared, so feel free to pass this guide along to any colleagues or teammates who are navigating their own data journeys.

I am currently building a series on practical Pandas techniques (Data Manipulation in the Real World) that focuses on real-world problems rather than toy examples. Each guide aims to give you skills you can use immediately in your work. If that resonates with you, make sure to follow my page for more practical data analysis guides and deep dives.

The data community thrives on dialogue. If you have a specific question about grouping operations, a suggestion for a future topic, or a unique tip from your own experience with aggregations, please leave a comment below. Your feedback genuinely matters; it helps me understand what topics to cover next and how to make each guide more useful than the last. Data analysis can feel isolating sometimes, but we are all learning together.

Keep grouping, keep aggregating, and keep building insights from data.

Until next time, Happy coding!
