Day 9: Pandas GroupBy, Merge, Join & Concat | Python ML & Data Science
Goal of Day 9: Learn how to aggregate, combine, and manipulate multiple datasets using groupby()
, merge()
, join()
, and concat()
in Pandas. These are essential tools for data preprocessing and feature engineering in Machine Learning & Data Science.
1. Pandas groupby()
What is it?
groupby()
is used to split data into groups based on column values, then apply functions like sum()
, mean()
, count()
, etc., and finally combine results back into a DataFrame.
Where to use?
- When you want to calculate aggregated statistics (e.g., average sales per region).
- When summarizing data for reporting and analysis.
How to use?
import pandas as pd
data = {'Region': ['East','West','East','South','West'],
'Sales': [100, 200, 150, 300, 250]}
df = pd.DataFrame(data)
# Group by 'Region' and calculate mean sales
print(df.groupby('Region')['Sales'].mean())
Why use it?
Because it allows you to analyze data at group level instead of row level, making it easier to find trends and insights (e.g., best-performing region).
agg()
to apply multiple functions at once.df.groupby('Region')['Sales'].agg(['mean', 'sum', 'max'])
2. Pandas merge()
(SQL-style joins)
What is it?
merge()
is used to combine two DataFrames based on one or more common columns, similar to SQL joins.
Where to use?
- When combining customer and order data using a customer ID.
- When you need SQL-like control over how data is joined.
How to use?
df1 = pd.DataFrame({'ID': [1,2,3], 'Name': ['A','B','C']})
df2 = pd.DataFrame({'ID': [1,2,4], 'Order': ['Book','Pen','Pencil']})
# Inner join (only matching IDs)
print(pd.merge(df1, df2, on='ID', how='inner'))
# Left join (all from df1, match from df2)
print(pd.merge(df1, df2, on='ID', how='left'))
Why use it?
Because datasets are often stored in different files or tables, and you need to bring them together for analysis (like combining sales & customer data).
suffixes=('_left','_right')
to handle duplicate column names.3. Pandas join()
What is it?
join()
is a simpler way to combine two DataFrames based on their index.
Where to use?
- When both DataFrames share the same index (e.g., User ID as index).
- When you want a more concise syntax compared to
merge()
.
How to use?
df1 = pd.DataFrame({'Name': ['A','B','C']}, index=[1,2,3])
df2 = pd.DataFrame({'Age': [25,30,35]}, index=[1,2,3])
print(df1.join(df2)) # Join based on index
Why use it?
It is faster and cleaner than merge()
when indexes are already aligned.
4. Pandas concat()
What is it?
concat()
is used to combine multiple DataFrames either row-wise (stacking vertically) or column-wise (side by side).
Where to use?
- When appending monthly sales reports into one dataset.
- When combining datasets that don’t share a common key.
How to use?
df1 = pd.DataFrame({'A': [1,2], 'B': [3,4]})
df2 = pd.DataFrame({'A': [5,6], 'B': [7,8]})
# Row-wise concat
print(pd.concat([df1, df2], axis=0))
# Column-wise concat
print(pd.concat([df1, df2], axis=1))
Why use it?
Because it allows you to easily combine multiple datasets without needing a join key.
keys=['Q1','Q2']
while concatenating to create hierarchical indexes.5. Mini Project / Hands-On Practice
Dataset suggestion: Use a retail sales dataset (like Kaggle’s Superstore dataset).
- Group sales by region & category using
groupby()
. - Merge customer data with orders using
merge()
. - Join user details with location info using
join()
. - Concatenate quarterly data into one dataframe using
concat()
.
6. Key Takeaways
groupby()
→ Aggregation & summarization.merge()
→ SQL-style joins on keys.join()
→ Index-based joining.concat()
→ Simple stacking of DataFrames.
By mastering these, you can handle real-world data that is split across multiple sources, making it ready for Machine Learning and deeper analysis.