рдХोрдИ рдкрд░िрдгाрдо рдирд╣ीं рдоिрд▓ा

    ЁЯУШ Day 8: Pandas Data Cleaning (Missing Data & Duplicates) - python in ml and data scientist

    Day 8 — Pandas Data Cleaning (Missing data & Duplicates)

    рдпрд╣ рдкूрд░ा static, Blogger-friendly chapter рд╣ै — рд╣рд░ section рдоें theory + explaination + Python (pandas) code snippets + expected output tables рджिрдП рдЧрдП рд╣ैं, рддाрдХि рдЖрдк рдЗрд╕рдХो рд╕ीрдзे рдЕрдкрдиे рдм्рд▓ॉрдЧ рдоें paste рдХрд░ рдХे рдкрдв़ा/рдк्рд░рд╢िрдХ्рд╖िрдд рдХрд░ рд╕рдХें।

    рдиोрдЯ: рдпрд╣ Static HTML рд╣ै — рдоैंрдиे interactive (Pyodide) script рд╣рдЯा рджी рд╣ै рддाрдХि Blogger рдЬैрд╕े рдк्рд▓ेрдЯрдл़ॉрд░्рдо рдкрд░ рдмिрдиा external scripts рдХे рднी рдпрд╣ рд╕рд╣ी рд╕े рджिрдЦे।

    рд╕ैрдж्рдзाрди्рддिрдХ рдкрд░िрдЪрдп — Pandas Data Cleaning рдХ्рдпा рд╣ै?

    рдкाрдЗрдерди рдоें pandas рдПрдХ рд╢рдХ्рддिрд╢ाрд▓ी рд▓ाрдЗрдм्рд░ेрд░ी рд╣ै рдЬिрд╕рдХा рдк्рд░рдпोрдЧ table-like data (DataFrame) рдХे рд╕ाрде рдХिрдпा рдЬाрддा рд╣ै।Data Cleaning рдХा рдорддрд▓рдм рд╣ै: raw/real-world datasets рдоें рдЖрдиे рд╡ाрд▓ी рдЧрд▓рддिрдпों рдФрд░ inconsistencies рдХो рдаीрдХ рдХрд░рдиा рддाрдХि analysis, visualization рдФрд░ machine learning models рдкрд░ рднрд░ोрд╕ा рдХिрдпा рдЬा рд╕рдХे।

    • Missing values (NaN, None, blanks): detect рдХрд░рдиे рдХे рд▓िрдП isnull(), рд╣рдЯाрдиे рдХे рд▓िрдП dropna(), рднрд░рдиे рдХे рд▓िрдП fillna().
    • Duplicates: duplicate rows рдпा duplicate identifiers — detect рдХрд░рдиे рдХे рд▓िрдП duplicated(), рд╣рдЯाрдиे рдХे рд▓िрдП drop_duplicates().
    • Consistency: data types рд╕рд╣ी рдХрд░рдиा (astype()), strings clean рдХрд░рдиा (str.strip(), lower/upper), рдФрд░ invalid values рдХो replace рдХрд░рдиा।

    рдПрдХ рд╕्рдкрд╖्рдЯ рдЙрджाрд╣рд░рдг: рдпрджि sales data рдоें рдХुрдЫ rows рдХी price missing рд╣ै, рддो total revenue рдЧрд▓рдд рдиिрдХрд▓ेрдЧा; рдпा рдпрджि рдПрдХ customer рдХी рд░िрдХॉрд░्рдб duplicate рд╣ै рддो analysis рдоें overcount рд╣ोрдЧा।

    рдХ्рдпों рд╕ीрдЦрдиा рдЪाрд╣िрдП? (Importance)

    1. Real-world data рдХрднी рднी perfect рдирд╣ीं рд╣ोрддा — cleaning рдЬрд░ूрд░ी рд╣ै।
    2. Dirty data рд╕े рдЧрд▓рдд insights рдФрд░ рдЧрд▓рдд business decisions рд╣ो рд╕рдХрддे рд╣ैं।
    3. Machine learning models clean data рдкрд░ рд╣ी рд╕рд╣ी рд╕ीрдЦрддे рд╣ैं — missing/duplicate data accuracy рдШрдЯाрддे рд╣ैं।
    4. рдк्рд░ैрдХ्рдЯिрдХрд▓ data scientist рдмрдирдиे рдХे рд▓िрдП рдпрд╣ рд╕рдмрд╕े рдкрд╣рд▓ी skill рд╣ै।
    5. Data cleaning рд╕े reporting, visualization рдФрд░ storage рдмेрд╣рддрд░ рд╣ोрддे рд╣ैं।

    рдХрд╣ाँ рдФрд░ рдХैрд╕े рдЗрд╕्рддेрдоाрд▓ рдХрд░ेंрдЧे?

    Data cleaning рд╣рд░ рдЬрдЧрд╣ рдХाрдо рдЖрддा рд╣ै — e-commerce, banking, healthcare, survey data, education рдЖрджि। рд╕ाрдоाрди्рдп workflow:

    1. Data load: pd.read_csv(), pd.read_excel()
    2. Inspect: shape, head(), info(), describe(), missing counts (isnull().sum())
    3. Handle missing: drop рдпा fill (mean/median/0/forward-fill/ custom)
    4. Handle duplicates: identify using duplicated() рдФрд░ drop_duplicates()
    5. Data type conversions, trimming strings, replace invalid values
    6. Final check рдФрд░ save

    рдЕрдм рдЖрдкрдХे рд╣рд░ рдк्рд░рд╢्рди рдХे рд╡िрд╕्рддृрдд рдЬрд╡ाрдм — рд╣рд░ рдПрдХ рдХे 5 рдЙрджाрд╣рд░рдг (Step-by-step + code + expected output)

    Q1: "Pandas Data Cleaning рдХ्рдпा рд╣ै?" — 5 Examples

    рдиीрдЪे рд╣рд░ example рдЫोрдЯे dataframes рдХे рд╕ाрде рджिрдЦाрдпा рдЧрдпा рд╣ै — code рдкрдв़िрдП рдФрд░ рдЙрд╕рдХे рдиीрдЪे expected output table рдоौрдЬूрдж рд╣ै рддाрдХि beginner рднी рд╕рдордЭ рд╕рдХे рдХि рдХ्рдпा рд╣ोрддा рд╣ै।

    Example 1 — Missing values detection (identify рдХрд░рдиा)

    import pandas as pd
    
    df = pd.DataFrame({ 'Name': ['Amit', 'Ravi', 'Sita', None], 'Age': [25, None, 30, 28] })
    
    missing counts
    
    print(df) print(' Missing counts:') print(df.isnull().sum())

    Expected output (table):

    NameAge
    Amit25.0
    Ravi
    Sita30.0
    28.0

    Missing counts: Name: 1, Age: 1 — рдпрд╣ рдмрддाрддा рд╣ै рдХिрди columns рдоें NaN рд╣ै।

    Example 2 — Fill missing with mean

    # same df as above
    
    fill missing age with mean age
    
    mean_age = df['Age'].mean() df['Age'] = df['Age'].fillna(mean_age) print(df)

    Expected: Age column рдХे missing рдХो average рд╕े replace рдХрд░ рджेंрдЧे; Name рдХा missing рдЕрднी рднी рд░рд╣ेрдЧा рдЕрдЧрд░ рд╣рдо рдЙрд╕े drop/replace рдирд╣ीं рдХрд░рддे।

    Example 3 — Drop missing rows

    # drop any row with at least one NaN
    
    clean = df.dropna() print(clean)

    Expected: рд╡ो рд╕ाрд░ी rows рд╣рдЯेंрдЧी рдЬिрдирдоें рдХोрдИ NaN рд╣ै — рдХेрд╡рд▓ fully complete rows рдмрдЪेंрдЧी।

    Example 4 — Detect duplicates

    df2 = pd.DataFrame({'id':[1,1,2,3],'val':[10,10,20,30]})
    
    print(df2) print('duplicated mask:') print(df2.duplicated())

    Expected: duplicated() boolean series рджेрдЧा — True рд╡рд╣ां рдЬрд╣ां same row рдкрд╣рд▓े рдЖ рдЪुрдХा рд╣ै।

    Example 5 — Remove duplicates

    df2_unique = df2.drop_duplicates()
    
    print(df2_unique)

    Expected: duplicate row remove рд╣ो рдЬाрдПрдЧी рдФрд░ unique rows рдмрдЪेंрдЧी।


    Q2: Pandas Data Cleaning рдХा Use рдХ्рдпा рд╣ै? — 5 Examples

    Use-cases рдЫोрдЯे real-world contexts рдоें — рд╣рд░ рдПрдХ рдХे рд╕ाрде code + expected result:

    Example 1 — Sales dataset: missing price -> fill with mean price

    df_sales = pd.DataFrame({'product':['A','B','C','D'],'price':[100, None, 150, None]})
    
    fill missing
    
    df_sales['price'] = df_sales['price'].fillna(df_sales['price'].mean()) print(df_sales)

    Expected: B рдФрд░ D рдХी price average рд╕े replace рд╣ो рдЬाрдПрдЧी — рдЬिрд╕рд╕े total revenue calculate рдХрд░рдиा meaningful рд╣ोрдЧा।

    Example 2 — Hospital: duplicate patient records рд╣рдЯाрдиा

    df_pat = pd.DataFrame({'patient_id':[101,102,101,103],'name':['R','S','R','T'],'age':[30,25,30,40]})
    
    remove duplicates based on patient_id
    
    df_pat_unique = df_pat.drop_duplicates(subset=['patient_id']) print(df_pat_unique)

    Expected: patient_id 101 duplicate entry remove рд╣ोрдЧी — accurate patient count рдоिрд▓ेрдЧा।

    Example 3 — School: missing marks -> fill with 0 (absent)

    df_marks = pd.DataFrame({'student':['A','B','C'],'marks':[80, None, 90]})
    
    df_marks['marks'] = df_marks['marks'].fillna(0) print(df_marks)

    Expected: B рдХा marks 0 set рд╣ोрдЧा, рдЬिрд╕рд╕े class average calculate рдХрд░рдиा straightforward рд╣ोрдЧा।

    Example 4 — Banking: missing transaction amounts -> drop those rows

    df_tx = pd.DataFrame({'tx_id':[1,2,3,4],'amount':[200,None,500,None]})
    
    df_tx_clean = df_tx.dropna(subset=['amount']) print(df_tx_clean)

    Expected: рдХेрд╡рд▓ valid transactions рд░рд╣ेंрдЧी; null amounts ignore рд╣ो рдЬाрдПँрдЧी।

    Example 5 — E-commerce: duplicate orders remove by order_id

    df_orders = pd.DataFrame({'order_id':[1001,1002,1001,1003],'customer':['X','Y','X','Z'],'value':[500,300,500,400]})
    
    df_orders = df_orders.drop_duplicates(subset=['order_id']) print(df_orders)

    Expected: duplicate order 1001 remove — correct order count рдФрд░ total revenue рдоिрд▓ेрдЧी।


    Q3: рдЗрд╕े рдХ्рдпों рд╕ीрдЦрдиा рдЪाрд╣िрдП? (5 Examples — consequences of NOT cleaning)

    рдпрд╣ाँ рд╣рдо рджिрдЦाрдПँрдЧे рдХी рдЕрдЧрд░ data clean рди рдХिрдпा рдЬाрдП рддो рдХ्рдпा рдЧрд▓рдд рд╣ोрдЧा (simple, clear examples)।

    Example 1 — Missing values change statistics (mean wrong)

    df = pd.DataFrame({'val':[10, None, 30]})
    
    print('mean (skips NaN):', df['val'].mean())
    
    рдЕрдЧрд░ рд╣рдо NaN рдХो 0 рдоाрдирдХрд░ add рдХрд░ें рддो result рдмрджрд▓ेрдЧा
    
    print('sum skipping NaN:', df['val'].sum())

    Explanation: Statistics calculation default behavior NaN рдХे рд╕ाрде рдЕрд▓рдЧ рд╣ोрддी рд╣ै — рдЕрдЧрд░ logically NaN рдХो 0 рдоाрдирдиा рд╣ै рддो fillna(0) рдХрд░ें, рд╡рд░рдиा leave рдХрд░ें।

    Example 2 — Duplicates cause overcount

    dfdup = pd.DataFrame({'item':['A','A','B'],'qty':[1,1,2]})
    
    print('rows before:', len(dfdup)) print(dfdup) dfdup2 = dfdup.drop_duplicates() print('rows after:', len(dfdup2))

    Explanation: Duplicate rows рд╣ोрдиे рд╕े aggregates (рдЬैрд╕े total items) рдЧрд▓рдд рд╣ोंрдЧे।

    Example 3 — Arithmetic with NaN propagates NaN

    dfA = pd.DataFrame({'a':[1, None, 3]})
    
    print((dfA['a'] + 2).tolist())

    Explanation: NaN arithmetic рдоें рд░рд╣े рддो final results рдоें NaN рдЖ рд╕рдХрддे рд╣ैं — preprocessing рдоें fill рдпा drop рдХрд░рдиा рдЪाрд╣िрдП।

    Example 4 — Healthcare: duplicate patient distorts counts

    df_pat = pd.DataFrame({'pid':[1,2,1,3]})
    
    print('unique patients:', df_pat['pid'].nunique()) print('rows before:', len(df_pat)) print(df_pat.drop_duplicates())

    Explanation: Duplicate patient rows рд╕े patient count рдФрд░ per-patient stats рдЧрд▓рдд рд╣ोंрдЧे।

    Example 5 — Survey completeness percent

    df_s = pd.DataFrame({'q1':[1,None,0], 'q2':[None,1,1]})
    
    print('overall completeness (%) =', df_s.notnull().mean().mean()*100)

    Explanation: рдЕрдЧрд░ completeness рдХрдо рд╣ै рддो survey results reliable рдирд╣ीं рд╣ोंрдЧे — cleaning рдФрд░ imputation рдХैрд╕े рдХрд░ेंрдЧे рдпрд╣ design decision рд╣ै।


    Q4: рдЗрд╕े рдХрд╣ाँ рдФрд░ рдХैрд╕े use рдХрд░ेंрдЧे? (5 Practical examples)

    Small, real preprocessing patterns рдЬो рдЖрдк data pipelines рдоें рдмाрд░-рдмाрд░ рджेрдЦेंрдЧे:

    Example 1 — E-commerce: dedupe customer table for recommender

    df = pd.DataFrame({'cust_id':[1,2,1,3],'visits':[5,2,5,1]})
    
    dedupe keeping first
    
    df_u = df.drop_duplicates(subset=['cust_id']) print(df_u)

    Use: Recommendation models per-user stats рдмрдиाрддे рд╕рдордп duplicate rows problem рд╣ोрдЧा — dedupe рдХрд░рдХे fix рдХрд░ें।

    Example 2 — Banking: remove null transaction amounts before fraud model

    df = pd.DataFrame({'tx':[1,2,3],'amt':[100,None,200]})
    
    df_clean = df.dropna(subset=['amt']) print(df_clean)

    Use: ML model рдХो numeric amount рдЪाрд╣िрдП — null rows рдоॉрдбрд▓ рдоें confuse рдХрд░ेंрдЧी।

    Example 3 — Healthcare: impute missing age with median

    df = pd.DataFrame({'pid':[1,2,3],'age':[25,None,45]})
    
    df['age'] = df['age'].fillna(df['age'].median()) print(df)

    Use: Median imputation outliers рд╕े рдХрдо рдк्рд░рднाрд╡िрдд рд╣ोрддा рд╣ै।

    Example 4 — Education: forward-fill attendance for continuous days

    df = pd.DataFrame({'day':[1,2,3,4],'present':[1,None,None,1]})
    
    df['present'] = df['present'].fillna(method='ffill') print(df)

    Use: Continuous time-series рдоें forward/backward fill practical рд╣ोрддा рд╣ै।

    Example 5 — Govt survey: drop respondents with too many blanks

    df = pd.DataFrame({'r':[1,2,3],'a':[1,None,None],'b':[None,2,None],'c':[3,4,None]})
    
    keep rows with fewer than 2 blanks
    
    df_clean = df[df.isnull().sum(axis=1) < 2] print(df_clean)

    Use: Low-quality respondents remove рдХрд░рдХे overall dataset quality improve рдХрд░ें।


    Q5: рдЗрд╕े рдХ्рдпों use рдХрд░рдиा рдЪाрд╣िрдП? (5 Examples — benefits)

    Benefits рдХो рдЫोрдЯे рдЙрджाрд╣рд░рдгों рдоें рджिрдЦा рд░рд╣े рд╣ैं:

    Example 1 — Stable statistics after cleaning

    df = pd.DataFrame({'v':[10,None,30,40]})
    
    df2 = df.copy() df2['v'] = df2['v'].fillna(df2['v'].mean()) print('mean before:', df['v'].mean()) print('mean after:', df2['v'].mean()) print(df2)

    Benefit: Imputation рд╕े statistics consistent рд╣ोंрдЧे — reporting reliable рдмрдиेрдЧा।

    Example 2 — Remove duplicates saves rows and storage

    df = pd.DataFrame({'id':[1,1,2,3]})
    
    print('before rows:', len(df)) df_u = df.drop_duplicates() print('after rows:', len(df_u))

    Benefit: Deduplication рд╕े storage рдЧिрд░ेрдЧा рдФрд░ analysis рд╕рд╣ी рд╣ोрдЧा।

    Example 3 — Arithmetic works after fill

    df = pd.DataFrame({'a':[1,None,3]})
    
    res = (df['a'].fillna(0) + 2).tolist() print(res)

    Benefit: Calculations predictable рд╣ोंрдЧी; NaN рд╕े unexpected errors рдирд╣ीं рдЖрдПँрдЧे।

    Example 4 — Visualization ready (no gaps)

    df = pd.DataFrame({'x':[1,2,3],'y':[10,None,30]})
    
    df['y'] = df['y'].fillna(method='ffill') print(df)

    Benefit: Visual plots рдоें gaps рдирд╣ीं рджिрдЦेंрдЧे — trends рд╕ाрдл рджिрдЦेंрдЧे।

    Example 5 — Business totals correct

    df = pd.DataFrame({'sale':[100,None,200]})
    
    print('sum default (NaN ignored):', df['sale'].sum()) df['sale'] = df['sale'].fillna(0) print('sum after fillna(0):', df['sale'].sum())

    Benefit: Total sales calc рдХрд░рдиे рдХे рд▓िрдП NaN рдХो 0 рдоाрдирдиा рдЬрд░ूрд░ी рд╣ो рд╕рдХрддा рд╣ै — cleaning рд╕े business metrics рд╕рд╣ी рдоिрд▓ेंрдЧे।


    Provided for learners of beepship.blogspot.com — Day 8 Pandas Data Cleaning chapter python in machine learning and data scientist.

    рдПрдХ рдЯिрдк्рдкрдгी рднेрдЬें

    рдФрд░ рдирдпा рдкुрд░ाрдиे

    ┘Ж┘Е┘И╪░╪м ╪з┘Д╪з╪к╪╡╪з┘Д