Day 8 — Pandas Data Cleaning (Missing data & Duplicates)
рдпрд╣ рдкूрд░ा static, Blogger-friendly chapter рд╣ै — рд╣рд░ section рдоें theory + explaination + Python (pandas) code snippets + expected output tables рджिрдП рдЧрдП рд╣ैं, рддाрдХि рдЖрдк рдЗрд╕рдХो рд╕ीрдзे рдЕрдкрдиे рдм्рд▓ॉрдЧ рдоें paste рдХрд░ рдХे рдкрдв़ा/рдк्рд░рд╢िрдХ्рд╖िрдд рдХрд░ рд╕рдХें।
рд╕ैрдж्рдзाрди्рддिрдХ рдкрд░िрдЪрдп — Pandas Data Cleaning рдХ्рдпा рд╣ै?
рдкाрдЗрдерди рдоें pandas рдПрдХ рд╢рдХ्рддिрд╢ाрд▓ी рд▓ाрдЗрдм्рд░ेрд░ी рд╣ै рдЬिрд╕рдХा рдк्рд░рдпोрдЧ table-like data (DataFrame) рдХे рд╕ाрде рдХिрдпा рдЬाрддा рд╣ै।Data Cleaning рдХा рдорддрд▓рдм рд╣ै: raw/real-world datasets рдоें рдЖрдиे рд╡ाрд▓ी рдЧрд▓рддिрдпों рдФрд░ inconsistencies рдХो рдаीрдХ рдХрд░рдиा рддाрдХि analysis, visualization рдФрд░ machine learning models рдкрд░ рднрд░ोрд╕ा рдХिрдпा рдЬा рд╕рдХे।
- Missing values (NaN, None, blanks): detect рдХрд░рдиे рдХे рд▓िрдП
isnull(), рд╣рдЯाрдиे рдХे рд▓िрдПdropna(), рднрд░рдиे рдХे рд▓िрдПfillna(). - Duplicates: duplicate rows рдпा duplicate identifiers — detect рдХрд░рдиे рдХे рд▓िрдП
duplicated(), рд╣рдЯाрдиे рдХे рд▓िрдПdrop_duplicates(). - Consistency: data types рд╕рд╣ी рдХрд░рдиा (
astype()), strings clean рдХрд░рдиा (str.strip(), lower/upper), рдФрд░ invalid values рдХो replace рдХрд░рдиा।
рдПрдХ рд╕्рдкрд╖्рдЯ рдЙрджाрд╣рд░рдг: рдпрджि sales data рдоें рдХुрдЫ rows рдХी price missing рд╣ै, рддो total revenue рдЧрд▓рдд рдиिрдХрд▓ेрдЧा; рдпा рдпрджि рдПрдХ customer рдХी рд░िрдХॉрд░्рдб duplicate рд╣ै рддो analysis рдоें overcount рд╣ोрдЧा।
рдХ्рдпों рд╕ीрдЦрдиा рдЪाрд╣िрдП? (Importance)
- Real-world data рдХрднी рднी perfect рдирд╣ीं рд╣ोрддा — cleaning рдЬрд░ूрд░ी рд╣ै।
- Dirty data рд╕े рдЧрд▓рдд insights рдФрд░ рдЧрд▓рдд business decisions рд╣ो рд╕рдХрддे рд╣ैं।
- Machine learning models clean data рдкрд░ рд╣ी рд╕рд╣ी рд╕ीрдЦрддे рд╣ैं — missing/duplicate data accuracy рдШрдЯाрддे рд╣ैं।
- рдк्рд░ैрдХ्рдЯिрдХрд▓ data scientist рдмрдирдиे рдХे рд▓िрдП рдпрд╣ рд╕рдмрд╕े рдкрд╣рд▓ी skill рд╣ै।
- Data cleaning рд╕े reporting, visualization рдФрд░ storage рдмेрд╣рддрд░ рд╣ोрддे рд╣ैं।
рдХрд╣ाँ рдФрд░ рдХैрд╕े рдЗрд╕्рддेрдоाрд▓ рдХрд░ेंрдЧे?
Data cleaning рд╣рд░ рдЬрдЧрд╣ рдХाрдо рдЖрддा рд╣ै — e-commerce, banking, healthcare, survey data, education рдЖрджि। рд╕ाрдоाрди्рдп workflow:
- Data load:
pd.read_csv(),pd.read_excel() - Inspect: shape,
head(),info(),describe(), missing counts (isnull().sum()) - Handle missing: drop рдпा fill (mean/median/0/forward-fill/ custom)
- Handle duplicates: identify using
duplicated()рдФрд░drop_duplicates() - Data type conversions, trimming strings, replace invalid values
- Final check рдФрд░ save
рдЕрдм рдЖрдкрдХे рд╣рд░ рдк्рд░рд╢्рди рдХे рд╡िрд╕्рддृрдд рдЬрд╡ाрдм — рд╣рд░ рдПрдХ рдХे 5 рдЙрджाрд╣рд░рдг (Step-by-step + code + expected output)
Q1: "Pandas Data Cleaning рдХ्рдпा рд╣ै?" — 5 Examples
рдиीрдЪे рд╣рд░ example рдЫोрдЯे dataframes рдХे рд╕ाрде рджिрдЦाрдпा рдЧрдпा рд╣ै — code рдкрдв़िрдП рдФрд░ рдЙрд╕рдХे рдиीрдЪे expected output table рдоौрдЬूрдж рд╣ै рддाрдХि beginner рднी рд╕рдордЭ рд╕рдХे рдХि рдХ्рдпा рд╣ोрддा рд╣ै।
Example 1 — Missing values detection (identify рдХрд░рдиा)
import pandas as pd
df = pd.DataFrame({ 'Name': ['Amit', 'Ravi', 'Sita', None], 'Age': [25, None, 30, 28] })
missing counts
print(df) print(' Missing counts:') print(df.isnull().sum())Expected output (table):
| Name | Age |
|---|---|
| Amit | 25.0 |
| Ravi | |
| Sita | 30.0 |
| 28.0 |
Missing counts: Name: 1, Age: 1 — рдпрд╣ рдмрддाрддा рд╣ै рдХिрди columns рдоें NaN рд╣ै।
Example 2 — Fill missing with mean
# same df as above fill missing age with mean age mean_age = df['Age'].mean() df['Age'] = df['Age'].fillna(mean_age) print(df)
Expected: Age column рдХे missing рдХो average рд╕े replace рдХрд░ рджेंрдЧे; Name рдХा missing рдЕрднी рднी рд░рд╣ेрдЧा рдЕрдЧрд░ рд╣рдо рдЙрд╕े drop/replace рдирд╣ीं рдХрд░рддे।
Example 3 — Drop missing rows
# drop any row with at least one NaN clean = df.dropna() print(clean)
Expected: рд╡ो рд╕ाрд░ी rows рд╣рдЯेंрдЧी рдЬिрдирдоें рдХोрдИ NaN рд╣ै — рдХेрд╡рд▓ fully complete rows рдмрдЪेंрдЧी।
Example 4 — Detect duplicates
df2 = pd.DataFrame({'id':[1,1,2,3],'val':[10,10,20,30]})
print(df2) print('duplicated mask:') print(df2.duplicated())Expected: duplicated() boolean series рджेрдЧा — True рд╡рд╣ां рдЬрд╣ां same row рдкрд╣рд▓े рдЖ рдЪुрдХा рд╣ै।
Example 5 — Remove duplicates
df2_unique = df2.drop_duplicates() print(df2_unique)
Expected: duplicate row remove рд╣ो рдЬाрдПрдЧी рдФрд░ unique rows рдмрдЪेंрдЧी।
Q2: Pandas Data Cleaning рдХा Use рдХ्рдпा рд╣ै? — 5 Examples
Use-cases рдЫोрдЯे real-world contexts рдоें — рд╣рд░ рдПрдХ рдХे рд╕ाрде code + expected result:
Example 1 — Sales dataset: missing price -> fill with mean price
df_sales = pd.DataFrame({'product':['A','B','C','D'],'price':[100, None, 150, None]})
fill missing
df_sales['price'] = df_sales['price'].fillna(df_sales['price'].mean()) print(df_sales)Expected: B рдФрд░ D рдХी price average рд╕े replace рд╣ो рдЬाрдПрдЧी — рдЬिрд╕рд╕े total revenue calculate рдХрд░рдиा meaningful рд╣ोрдЧा।
Example 2 — Hospital: duplicate patient records рд╣рдЯाрдиा
df_pat = pd.DataFrame({'patient_id':[101,102,101,103],'name':['R','S','R','T'],'age':[30,25,30,40]})
remove duplicates based on patient_id
df_pat_unique = df_pat.drop_duplicates(subset=['patient_id']) print(df_pat_unique)Expected: patient_id 101 duplicate entry remove рд╣ोрдЧी — accurate patient count рдоिрд▓ेрдЧा।
Example 3 — School: missing marks -> fill with 0 (absent)
df_marks = pd.DataFrame({'student':['A','B','C'],'marks':[80, None, 90]})
df_marks['marks'] = df_marks['marks'].fillna(0) print(df_marks)Expected: B рдХा marks 0 set рд╣ोрдЧा, рдЬिрд╕рд╕े class average calculate рдХрд░рдиा straightforward рд╣ोрдЧा।
Example 4 — Banking: missing transaction amounts -> drop those rows
df_tx = pd.DataFrame({'tx_id':[1,2,3,4],'amount':[200,None,500,None]})
df_tx_clean = df_tx.dropna(subset=['amount']) print(df_tx_clean)Expected: рдХेрд╡рд▓ valid transactions рд░рд╣ेंрдЧी; null amounts ignore рд╣ो рдЬाрдПँрдЧी।
Example 5 — E-commerce: duplicate orders remove by order_id
df_orders = pd.DataFrame({'order_id':[1001,1002,1001,1003],'customer':['X','Y','X','Z'],'value':[500,300,500,400]})
df_orders = df_orders.drop_duplicates(subset=['order_id']) print(df_orders)Expected: duplicate order 1001 remove — correct order count рдФрд░ total revenue рдоिрд▓ेрдЧी।
Q3: рдЗрд╕े рдХ्рдпों рд╕ीрдЦрдиा рдЪाрд╣िрдП? (5 Examples — consequences of NOT cleaning)
рдпрд╣ाँ рд╣рдо рджिрдЦाрдПँрдЧे рдХी рдЕрдЧрд░ data clean рди рдХिрдпा рдЬाрдП рддो рдХ्рдпा рдЧрд▓рдд рд╣ोрдЧा (simple, clear examples)।
Example 1 — Missing values change statistics (mean wrong)
df = pd.DataFrame({'val':[10, None, 30]})
print('mean (skips NaN):', df['val'].mean())
рдЕрдЧрд░ рд╣рдо NaN рдХो 0 рдоाрдирдХрд░ add рдХрд░ें рддो result рдмрджрд▓ेрдЧा
print('sum skipping NaN:', df['val'].sum())Explanation: Statistics calculation default behavior NaN рдХे рд╕ाрде рдЕрд▓рдЧ рд╣ोрддी рд╣ै — рдЕрдЧрд░ logically NaN рдХो 0 рдоाрдирдиा рд╣ै рддो fillna(0) рдХрд░ें, рд╡рд░рдиा leave рдХрд░ें।
Example 2 — Duplicates cause overcount
dfdup = pd.DataFrame({'item':['A','A','B'],'qty':[1,1,2]})
print('rows before:', len(dfdup)) print(dfdup) dfdup2 = dfdup.drop_duplicates() print('rows after:', len(dfdup2))Explanation: Duplicate rows рд╣ोрдиे рд╕े aggregates (рдЬैрд╕े total items) рдЧрд▓рдд рд╣ोंрдЧे।
Example 3 — Arithmetic with NaN propagates NaN
dfA = pd.DataFrame({'a':[1, None, 3]})
print((dfA['a'] + 2).tolist())Explanation: NaN arithmetic рдоें рд░рд╣े рддो final results рдоें NaN рдЖ рд╕рдХрддे рд╣ैं — preprocessing рдоें fill рдпा drop рдХрд░рдиा рдЪाрд╣िрдП।
Example 4 — Healthcare: duplicate patient distorts counts
df_pat = pd.DataFrame({'pid':[1,2,1,3]})
print('unique patients:', df_pat['pid'].nunique()) print('rows before:', len(df_pat)) print(df_pat.drop_duplicates())Explanation: Duplicate patient rows рд╕े patient count рдФрд░ per-patient stats рдЧрд▓рдд рд╣ोंрдЧे।
Example 5 — Survey completeness percent
df_s = pd.DataFrame({'q1':[1,None,0], 'q2':[None,1,1]})
print('overall completeness (%) =', df_s.notnull().mean().mean()*100)Explanation: рдЕрдЧрд░ completeness рдХрдо рд╣ै рддो survey results reliable рдирд╣ीं рд╣ोंрдЧे — cleaning рдФрд░ imputation рдХैрд╕े рдХрд░ेंрдЧे рдпрд╣ design decision рд╣ै।
Q4: рдЗрд╕े рдХрд╣ाँ рдФрд░ рдХैрд╕े use рдХрд░ेंрдЧे? (5 Practical examples)
Small, real preprocessing patterns рдЬो рдЖрдк data pipelines рдоें рдмाрд░-рдмाрд░ рджेрдЦेंрдЧे:
Example 1 — E-commerce: dedupe customer table for recommender
df = pd.DataFrame({'cust_id':[1,2,1,3],'visits':[5,2,5,1]})
dedupe keeping first
df_u = df.drop_duplicates(subset=['cust_id']) print(df_u)Use: Recommendation models per-user stats рдмрдиाрддे рд╕рдордп duplicate rows problem рд╣ोрдЧा — dedupe рдХрд░рдХे fix рдХрд░ें।
Example 2 — Banking: remove null transaction amounts before fraud model
df = pd.DataFrame({'tx':[1,2,3],'amt':[100,None,200]})
df_clean = df.dropna(subset=['amt']) print(df_clean)Use: ML model рдХो numeric amount рдЪाрд╣िрдП — null rows рдоॉрдбрд▓ рдоें confuse рдХрд░ेंрдЧी।
Example 3 — Healthcare: impute missing age with median
df = pd.DataFrame({'pid':[1,2,3],'age':[25,None,45]})
df['age'] = df['age'].fillna(df['age'].median()) print(df)Use: Median imputation outliers рд╕े рдХрдо рдк्рд░рднाрд╡िрдд рд╣ोрддा рд╣ै।
Example 4 — Education: forward-fill attendance for continuous days
df = pd.DataFrame({'day':[1,2,3,4],'present':[1,None,None,1]})
df['present'] = df['present'].fillna(method='ffill') print(df)Use: Continuous time-series рдоें forward/backward fill practical рд╣ोрддा рд╣ै।
Example 5 — Govt survey: drop respondents with too many blanks
df = pd.DataFrame({'r':[1,2,3],'a':[1,None,None],'b':[None,2,None],'c':[3,4,None]})
keep rows with fewer than 2 blanks
df_clean = df[df.isnull().sum(axis=1) < 2] print(df_clean)Use: Low-quality respondents remove рдХрд░рдХे overall dataset quality improve рдХрд░ें।
Q5: рдЗрд╕े рдХ्рдпों use рдХрд░рдиा рдЪाрд╣िрдП? (5 Examples — benefits)
Benefits рдХो рдЫोрдЯे рдЙрджाрд╣рд░рдгों рдоें рджिрдЦा рд░рд╣े рд╣ैं:
Example 1 — Stable statistics after cleaning
df = pd.DataFrame({'v':[10,None,30,40]})
df2 = df.copy() df2['v'] = df2['v'].fillna(df2['v'].mean()) print('mean before:', df['v'].mean()) print('mean after:', df2['v'].mean()) print(df2)Benefit: Imputation рд╕े statistics consistent рд╣ोंрдЧे — reporting reliable рдмрдиेрдЧा।
Example 2 — Remove duplicates saves rows and storage
df = pd.DataFrame({'id':[1,1,2,3]})
print('before rows:', len(df)) df_u = df.drop_duplicates() print('after rows:', len(df_u))Benefit: Deduplication рд╕े storage рдЧिрд░ेрдЧा рдФрд░ analysis рд╕рд╣ी рд╣ोрдЧा।
Example 3 — Arithmetic works after fill
df = pd.DataFrame({'a':[1,None,3]})
res = (df['a'].fillna(0) + 2).tolist() print(res)Benefit: Calculations predictable рд╣ोंрдЧी; NaN рд╕े unexpected errors рдирд╣ीं рдЖрдПँрдЧे।
Example 4 — Visualization ready (no gaps)
df = pd.DataFrame({'x':[1,2,3],'y':[10,None,30]})
df['y'] = df['y'].fillna(method='ffill') print(df)Benefit: Visual plots рдоें gaps рдирд╣ीं рджिрдЦेंрдЧे — trends рд╕ाрдл рджिрдЦेंрдЧे।
Example 5 — Business totals correct
df = pd.DataFrame({'sale':[100,None,200]})
print('sum default (NaN ignored):', df['sale'].sum()) df['sale'] = df['sale'].fillna(0) print('sum after fillna(0):', df['sale'].sum())Benefit: Total sales calc рдХрд░рдиे рдХे рд▓िрдП NaN рдХो 0 рдоाрдирдиा рдЬрд░ूрд░ी рд╣ो рд╕рдХрддा рд╣ै — cleaning рд╕े business metrics рд╕рд╣ी рдоिрд▓ेंрдЧे।