Python for Machine Learning & Data Science: Day 3 Complete Practice Guide


Aaj hum DataFrame, cleaning, preprocessing aur EDA ko bahut hi simple way mein samjhenge — thoda fun, bahut practice. Har code block ke saath copy button hai — directly copy kar ke try karo.

Reading time: ~15–22 min • Level: Beginner → Intermediate • Includes examples & solutions
Related: Day 1 • Day 2

Introduction — short & simple

Namaste! Agar aapne Day 1 aur Day 2 follow kiya hai, toh great. Agar nahi kiya — koi problem nahi. Yeh Day 3 guide aise likha gaya hai ki har ek cheez aap step-by-step samajh jao. Language friendly hai — thoda English, thoda Hindi (Hinglish) — taaki learning mazedaar rahe.

Tip: Har code block ke top-right me "Copy" button hai. Paste karo apne Jupyter notebook ya VS Code mein aur run karo. Agar kuch error aaye toh error message copy karke mujhe bhejo — I’ll help.

1) Pandas Basics — kyun aur kaise?

Pandas ek library hai jo table-like data handle karne mein best hai. DataFrame ko spreadsheet samjho. Rows aur columns, chhota table, bahut powerful operations.

Example — DataFrame banana & basic commands


import pandas as pd

data = {
  'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
  'Employee':   ['A', 'B', 'C', 'D', 'E'],
  'Salary':     [60000, 45000, 65000, 70000, 48000]
}
df = pd.DataFrame(data)
print(df.head())           # top rows
df.info()                  # column types, non-null counts (yeh khud hi print karta hai)
print(df.describe())       # numeric summary (count, mean, std, etc.)
        

Yeh commands bahut useful hain — head(), info(), describe(). Jab bhi naya dataset mile, pehle ye teen dekh lo.

Illustration: DataFrame (Department, Employee, Salary) aur groupby('Department') se average Salary — Finance 70000, HR 46500, IT 62500. Ek hi line mein aggregate nikal lo.

GroupBy ka simple example


# Average salary by Department
avg = df.groupby('Department')['Salary'].mean().reset_index()
print(avg)
# Output shows average salary per department
        

Ye bahut powerful hai — group by karke summary statistics nikal lo (mean, sum, count, median).
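Ek se zyada statistics ek saath chahiye toh .agg() use karo. Yeh ek chhota sketch hai, upar wale df jaisa data assume karke:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary':     [60000, 45000, 65000, 70000, 48000]
})

# Ek pass mein multiple summaries per group
summary = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print(summary)
```

Output mein har department ke liye teeno statistics ek hi table mein aa jaate hain.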

2) Data Cleaning — sabse important

Real life data usually messy hota hai. Missing values, duplicates, wrong types, stray spaces — sab aa jaata hai. Clean karna zaroori hai warna model galat seekhega.
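Stray spaces wali problem (jaise ' Delhi' vs 'Delhi') str.strip() se fix hoti hai. Ek minimal sketch, hypothetical 'City' column assume karke:

```python
import pandas as pd

df_messy = pd.DataFrame({'City': [' Delhi', 'Mumbai ', 'Delhi', '  Mumbai']})

# Leading/trailing spaces hatao, warna 'Delhi' aur ' Delhi' alag categories ban jayengi
df_messy['City'] = df_messy['City'].str.strip()
print(df_messy['City'].nunique())   # ab sirf 2 unique cities
```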

Check missing values


# Check missing values
print(df.isna().sum())   # kaunse columns me kitne NA hain
        

Fill missing values — examples & solutions

Numeric column: median ya mean se fill karo. Categorical: mode se. Kabhi-kabhi domain knowledge important hai.


# Example: numeric and categorical imputation
import numpy as np

df2 = pd.DataFrame({
  'Name': ['Tom','Nick','John','Sam','Anu'],
  'Age':  [25, np.nan, 30, 28, np.nan],
  'City': ['Delhi','Mumbai', None, 'Delhi', 'Mumbai']
})

# Fill Age with median
df2['Age'] = df2['Age'].fillna(df2['Age'].median())

# Fill City with mode (most common)
df2['City'] = df2['City'].fillna(df2['City'].mode()[0])

print(df2)
        

Solution: Age ke NA ko median se replace kiya, City ke NA ko most common city se replace kiya. Simple and safe.
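Fill karne ke alawa kabhi-kabhi rows drop karna better hota hai. Yeh dropna() ke common options ka ek sketch hai:

```python
import pandas as pd
import numpy as np

df3 = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [10, 20, np.nan, np.nan]
})

print(df3.dropna())             # koi bhi NA ho toh row drop
print(df3.dropna(how='all'))    # sirf woh rows drop jinme saare values NA hain
print(df3.dropna(thresh=2))     # kam se kam 2 non-NA values chahiye row bachane ke liye
```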

Remove duplicates


# Duplicate rows
df_dup = pd.DataFrame({
  'ID': [1,2,2,3,4,4],
  'Name': ['A','B','B','C','D','D']
})
print(df_dup.duplicated())   # shows True for duplicates
df_dup = df_dup.drop_duplicates()
print(df_dup)
        

Important: drop_duplicates() ke baad original index mein gaps reh jaate hain (dropped rows ke numbers skip ho jaate hain) — continuous index chahiye toh reset_index(drop=True) chain kar do.
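drop_duplicates() ke subset aur keep parameters bhi useful hain: subset se sirf specific columns pe duplicate check hota hai. Ek chhota sketch:

```python
import pandas as pd

df_dup = pd.DataFrame({
    'ID':   [1, 2, 2, 3],
    'Name': ['A', 'B', 'B2', 'C']
})

# Sirf 'ID' ke basis pe duplicate check — default keep='first'
first_kept = df_dup.drop_duplicates(subset=['ID'])
# keep='last' se har ID ki last occurrence bachti hai
last_kept = df_dup.drop_duplicates(subset=['ID'], keep='last')

print(first_kept)
print(last_kept)
```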

Fix data types


# Example: convert column to numeric (errors='coerce' puts NaN for invalid values)
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
# Convert a date string column to datetime ('join_date' yahan sirf ek example column hai)
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
        

3) Preprocessing — model-ready banana

ML models ko numbers chahiye. Categorical data ko convert karna aur features ko scale karna zaroori hai.

Label encoding & One-hot encoding


# Label map (simple binary) — 'Gender' yahan ek example column hai
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

# One-hot encoding for multi-category columns
df = pd.get_dummies(df, columns=['City'], prefix='city')
        

One-hot se new columns ban jaate hain — jaise city_Delhi, city_Mumbai — 1/0 values.
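get_dummies mein drop_first=True se ek redundant column hat jaata hai (dummy variable trap avoid hota hai, kyunki last category baaki columns se infer ho jaati hai). Ek sketch:

```python
import pandas as pd

df4 = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']})

full = pd.get_dummies(df4, columns=['City'], prefix='city')
reduced = pd.get_dummies(df4, columns=['City'], prefix='city', drop_first=True)

print(full.columns.tolist())      # 3 city columns
print(reduced.columns.tolist())   # 2 city columns, baaki infer ho jaata hai
```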

Feature scaling — kyun?

Distance-based aur gradient-based algorithms (KNN, SVM, neural nets) feature scale ke liye sensitive hote hain: jo feature bade numbers mein hai, woh distance ya gradient ko dominate kar leta hai. Scaling sab features ko comparable range mein laata hai.


from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = MinMaxScaler()         # scales between 0 and 1
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# Ya phir StandardScaler (mean 0, std 1) — dono mein se ek hi choose karo
std = StandardScaler()
df[['Age', 'Salary']] = std.fit_transform(df[['Age', 'Salary']])
        

Tip: Fit scaler on training data only. Use the same scaler to transform validation/test sets to avoid data leakage.
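Upar wali tip ko code mein aise likh sakte ho. Yeh ek sketch hai, synthetic data ke saath:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit SIRF train data pe
X_test_scaled = scaler.transform(X_test)         # test pe sirf transform, no leakage

print(X_train_scaled.mean().round(6))   # train mean ~0 hoga
```

Test set pe fit_transform() call karna hi data leakage hai: test ki statistics model pipeline mein ghus jaati hain.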

4) EDA — Visualize and understand

EDA se pata chalta hai data ka shape: distribution, outliers, correlation. Thoda plotting practice karo.

Histogram example


import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of Age
sns.histplot(df['Age'], bins=20, kde=True)
plt.title('Age distribution')
plt.show()
        

Scatter & correlation


# Scatter: Salary vs Age
plt.scatter(df['Age'], df['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Salary vs Age')
plt.show()

# Correlation heatmap
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
        
Illustration: Line plot, Histogram aur Scatter — basic EDA tools.

Practice: Kisi bhi dataset pe 3 charts banao: histogram, scatter aur heatmap. Har chart ka ek-sentence insight likho (e.g., "Age distribution is right skewed").
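Outliers dekhne ke liye boxplot bhi try karo. Yeh ek minimal sketch hai, synthetic values ke saath; quartiles se IQR rule ka numeric check bhi saath mein hai:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

values = np.array([22, 25, 27, 28, 30, 31, 33, 35, 90])  # 90 ek outlier hai

sns.boxplot(x=values)
plt.title('Outlier check with boxplot')
plt.show()

# IQR rule: Q1 - 1.5*IQR se chhote ya Q3 + 1.5*IQR se bade points outliers
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)   # [90]
```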

5) Mini Project: Titanic — step-by-step

Titanic dataset famous hai — chhota, understandable aur insights bhara. Aaj hum simple EDA aur basics karenge — survival ke patterns dekhte hain.

Load the dataset


import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

titanic = sns.load_dataset('titanic')
print(titanic.head())
print(titanic.info())
        

Cleaning (age & embarked)


# Fill missing values
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Drop columns not needed for simple analysis
# ('deck' mein bahut NAs hain; 'alive' aur 'class' survived/pclass ke duplicates hain)
titanic = titanic.drop(columns=['deck', 'alive', 'class'])
print(titanic.isna().sum())
        

Survival by gender & class


# Survival by gender
sns.countplot(x='sex', hue='survived', data=titanic)
plt.title('Survival by Sex')
plt.show()

# Survival by passenger class
sns.countplot(x='pclass', hue='survived', data=titanic)
plt.title('Survival by Pclass')
plt.show()
        
Illustration (conceptual): Titanic survival by sex — females ka survived vs not-survived ratio males se clearly better dikhta hai.

Short analysis — kya seekha?

  • Females survive karne ke chance zyada the — social priority during rescue.
  • Higher class (1st) had better survival than 3rd class — cabins, access, priority.
  • Age bhi matter karta hai — bachchon ko rescue mein priority milti thi.
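Age wale point ko numbers mein bhi check kar sakte ho: pd.cut se age bins banao, phir groupby se survival rate nikalo. Pattern yahan ek chhote hypothetical sample pe dikhaya hai; real analysis titanic DataFrame pe karo:

```python
import pandas as pd

# Chhota hypothetical sample — sirf pattern dikhane ke liye
demo = pd.DataFrame({
    'age':      [5, 10, 16, 25, 40, 70, 8, 30],
    'survived': [1, 1, 0, 0, 1, 0, 1, 0]
})

# Age ko bins mein baanto: child / teen / adult / senior
bins = [0, 12, 18, 60, 100]
labels = ['child', 'teen', 'adult', 'senior']
demo['age_group'] = pd.cut(demo['age'], bins=bins, labels=labels)

# Har age group ka survival rate
rate = demo.groupby('age_group', observed=True)['survived'].mean()
print(rate)
```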

Solution idea: Agar aap model banana chahte ho (Logistic Regression), to encode, scale, split data, fir model fit karo. We will cover exact model steps on Day 4.

6) Practice exercises + Solutions (must do)

Yeh exercises roz karne chahiye — practice se confidence aata hai.

Exercise 1: Iris dataset pairplot


import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species')
plt.show()
# Observe which features separate species best (petal length/width usually)
        

Solution tip: Pairplot me petal length & width often species ko clearly separate karte hain. Ek sentence likho har plot ke liye.

Exercise 2: GroupBy practice


# Titanic: group by sex and calculate survival rate
group = titanic.groupby('sex')['survived'].mean().reset_index()
print(group)
# 'survived' mean gives survival probability per group
        

Solution: Output se pata chalega ki females ka survival rate higher hai — numeric form me dikhai dega.
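Aage badhna ho toh sex aur pclass dono ek saath dekho: pivot_table se ek hi table mein har combination ka survival rate mil jaata hai. Yahan ek chhote hypothetical sample pe dikhaya hai; real run titanic DataFrame pe karo:

```python
import pandas as pd

# Hypothetical mini-sample — sirf pivot_table ka pattern dikhane ke liye
mini = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'pclass':   [1, 3, 1, 3, 1, 3],
    'survived': [1, 0, 1, 0, 1, 0]
})

# Rows: sex, columns: pclass, values: average survival
pivot = mini.pivot_table(values='survived', index='sex', columns='pclass', aggfunc='mean')
print(pivot)
```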

Exercise 3: Clean a messy CSV


# Load your CSV
df = pd.read_csv('sales.csv')

# Quick cleaning steps
df = df.drop_duplicates()
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['amount'] = df['amount'].fillna(df['amount'].median())
        

Solution approach: Always convert types, handle NA, and remove duplicates. Fir EDA karo.
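In steps ko ek reusable function mein bhi daal sakte ho. Yeh ek minimal sketch hai; column names ('date', 'amount') hypothetical hain, apne CSV ke hisaab se badlo:

```python
import pandas as pd

def basic_clean(df, date_cols=(), numeric_cols=()):
    """Duplicates hatao, types fix karo, numeric NAs median se bharo."""
    df = df.drop_duplicates().copy()
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df[col] = df[col].fillna(df[col].median())
    return df

# Messy sample: ek invalid date, ek invalid amount, ek duplicate row
raw = pd.DataFrame({
    'date':   ['2024-01-01', 'not-a-date', '2024-01-03', '2024-01-03'],
    'amount': ['100', 'oops', '300', '300']
})
clean = basic_clean(raw, date_cols=['date'], numeric_cols=['amount'])
print(clean)
```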

7) Wrap-up & Next steps

Aaj aapne kya kya dekha (short recap): Pandas basics, cleaning, encoding, scaling, EDA, Titanic mini-project, aur practice problems. Bahut achha — ab aap model banana start kar sakte ho (Day 4 will cover Logistic Regression & Decision Trees).

Agar chaho toh main abhi se ek ready-to-run Jupyter Notebook (.ipynb) bana ke de sakta hoon jisme yeh saare code cells ho — ek click se run kar paoge. Bol do.

