Python for Machine Learning & Data Science — Day 4 (Complete Practice Guide)

A complete practice guide with worked examples for each topic, written in simple language.

Overview

Day 4 covers preprocessing, feature engineering, model building, pipelines, hyperparameter tuning, and evaluation. Each topic is taught step by step with multiple examples so that you can practice and implement a complete machine learning workflow.

1) Data Preprocessing

Data preprocessing is the first step before applying any machine learning model, and a crucial one. It involves cleaning the data, handling missing values, scaling numeric features, and encoding categorical variables. Careful preprocessing usually improves model performance, and many estimators cannot run at all on raw data that contains missing values or text categories.

1.1 Handling Missing Values

Many real-world datasets contain missing values (NaN). They can reduce model accuracy, and many scikit-learn estimators raise an error when they encounter them. Common techniques:
  • Drop the rows or columns that contain missing values
  • Fill missing values with the mean, median, or mode
  • Use model-based imputation such as KNN

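# These examples assume df is a pandas DataFrame that is already loaded (for example the Titanic data).
# Examples 1-5 are alternatives: choose one per column instead of running them all in sequence.

# Example 1: drop every row that contains a missing value
df = df.dropna()

# Example 2: fill missing values with the mean
# Assign the result back. df['age'].fillna(..., inplace=True) triggers a warning in pandas 2.x
# and does not update df when copy-on-write is enabled.
df['age'] = df['age'].fillna(df['age'].mean())

# Example 3: fill missing values with the median
df['age'] = df['age'].fillna(df['age'].median())

# Example 4: fill missing categorical values with the mode (most frequent value)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Example 5: use SimpleImputer from scikit-learn
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df[['age']] = imputer.fit_transform(df[['age']])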
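
The list above mentions KNN-based imputation, but none of the five examples shows it. Here is a minimal sketch using scikit-learn's KNNImputer. The toy DataFrame and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: 'age' has one missing value (row 2).
toy = pd.DataFrame({
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Each missing value is replaced by the mean of that column over the
# n_neighbors rows that are closest on the columns that are present.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Because KNN works with distances, columns with large values can dominate the neighbor search, so it often makes sense to scale the features before imputing this way.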

📝 Your Turn

Try changing the SimpleImputer strategy to 'median' or 'most_frequent' (the mode) and see how the dataset changes.

1.2 Scaling Numeric Features

Scaling puts numeric features on a comparable range so that features with large values do not dominate training. This matters most for distance-based and gradient-based models such as KNN, SVM, and logistic regression; tree-based models such as random forests are largely insensitive to it. Common choices are StandardScaler (z-score: mean 0, standard deviation 1) and MinMaxScaler (range 0 to 1).

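from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Assumption: num_cols lists your numeric columns; adjust it to your dataset
num_cols = ['age', 'fare']

# Print the raw values first so there is something to compare against
print(df[num_cols].head())

# Example 1: StandardScaler (mean 0, standard deviation 1)
scaler = StandardScaler()
df_std = df.copy()
df_std[num_cols] = scaler.fit_transform(df[num_cols])

# Example 2: MinMaxScaler (range 0 to 1), applied to the raw data, not on top of Example 1
minmax = MinMaxScaler()
df_mm = df.copy()
df_mm[num_cols] = minmax.fit_transform(df[num_cols])

# Example 3: scale only 'age' and 'fare'
df_std[['age', 'fare']] = scaler.fit_transform(df[['age', 'fare']])

# Example 4: check the values after scaling and compare with the raw values printed above
print(df_std[['age', 'fare']].head())

# Example 5: save the fitted scaler for future use
import joblib
joblib.dump(scaler, 'scaler.save')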
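
The snippets above fit the scaler on the whole DataFrame. When you also hold out a test set, a common practice is to fit the scaler on the training rows only and reuse it on the test rows, so that no information from the test set leaks into training. A minimal sketch on made-up numeric data (the array and its two columns are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 100 rows, 2 features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(30, 10, 100),      # something like an age
    rng.normal(5000, 2000, 100),  # something like a balance
])

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the same mean and std on test rows

print(X_tr_scaled.mean(axis=0).round(3))  # close to 0 for each column
print(X_tr_scaled.std(axis=0).round(3))   # close to 1 for each column
```

The test columns will not have a mean of exactly 0, and that is expected: they are scaled with the statistics learned from the training data.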

📝 Your Turn

Try scaling with MinMaxScaler and compare the numeric ranges before and after scaling.

1.3 Encoding Categorical Variables

Most ML models require numeric input. Categorical variables such as 'sex' or 'embarked' must be converted to numbers, for example with OneHotEncoder (one 0/1 column per category) or LabelEncoder (one integer per category). Note that scikit-learn intends LabelEncoder for target labels; for input features, OrdinalEncoder does the same job. One-hot encoding is usually the safer choice for unordered categories, because integer codes imply an order.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Example 1: LabelEncoder maps each category to an integer
le = LabelEncoder()
df['sex_encoded'] = le.fit_transform(df['sex'])

# Example 2: OneHotEncoder creates one 0/1 column per category
# (the argument is sparse_output in scikit-learn 1.2 and later; older versions used sparse=False)
ohe = OneHotEncoder(sparse_output=False)
encoded_cols = ohe.fit_transform(df[['embarked']])
df[list(ohe.get_feature_names_out())] = encoded_cols

# Example 3: encode multiple categorical columns, with a fresh encoder for each column
cat_cols = ['sex', 'embarked']
for col in cat_cols:
    df[col + '_encoded'] = LabelEncoder().fit_transform(df[col])

# Example 4: drop the original categorical columns
df = df.drop(cat_cols, axis=1)

# Example 5: check the encoded values
print(df.head())

📝 Your Turn

Try one-hot encoding only the 'sex' column and check the shape of the new DataFrame.

2) Feature Engineering

Feature engineering is the process of creating new input features from existing ones to improve model performance. Examples include mathematical transformations, combining features, and extracting date/time components.

2.1 Log Transform

Log transformation reduces the skewness of right-skewed numeric features and helps stabilize variance. np.log1p computes log(1 + x), so it is safe for zero values.

import numpy as np

# Example 1 — log transform
df['log_fare'] = np.log1p(df['fare'])

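# Example 2: log transform with a condition
# log1p already handles zeros (log1p(0) = 0), so zeros need no replacement;
# clip negative values to 0 first, because log1p is undefined for values <= -1
df['log_fare'] = np.log1p(df['fare'].clip(lower=0))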

# Example 3 — visualize before/after
import matplotlib.pyplot as plt
plt.hist(df['fare'], bins=20)
plt.show()
plt.hist(df['log_fare'], bins=20)
plt.show()

# Example 4 — log transform another numeric column
df['log_age'] = np.log1p(df['age'])

# Example 5 — check transformed features
print(df[['fare','log_fare','age','log_age']].head())
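
Histograms show the effect visually. To put a number on it, you can compare the skewness before and after the transform. A small sketch on synthetic right-skewed data (the generated series only stands in for a column like 'fare'):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data standing in for a column like 'fare'.
rng = np.random.default_rng(42)
fare = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

log_fare = np.log1p(fare)

# Skewness near 0 means a roughly symmetric distribution.
print("skew before:", round(fare.skew(), 2))
print("skew after: ", round(log_fare.skew(), 2))

# np.expm1 is the inverse of np.log1p, so the original values can be recovered.
recovered = np.expm1(log_fare)
```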

📝 Your Turn

Create a new log-transformed feature for 'total_value' in a sales dataset and check its distribution.

2.2 Extracting Date/Time Features

Convert date columns into features such as day, month, and weekday so that the model can pick up seasonal or weekly patterns.

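import pandas as pd

# Convert the 'date' column to datetime (assumes df has a 'date' column with no missing dates)
df['date'] = pd.to_datetime(df['date'])

# Example 1: extract year, month, day
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Example 2: extract weekday (Monday=0, Sunday=6)
df['weekday'] = df['date'].dt.weekday

# Example 3: extract ISO week number, cast from UInt32 to a plain integer
df['week_no'] = df['date'].dt.isocalendar().week.astype(int)

# Example 4: combine features
# month * 100 keeps the month and week digits apart, e.g. month 3, week 12 -> 312
df['month_week'] = df['month'] * 100 + df['week_no']

# Example 5: drop the original date column
df = df.drop(['date'], axis=1)

print(df.head())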
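
The snippets above need an existing df with a 'date' column. Here is a self-contained version you can run as is, with one extra derived feature, a weekend flag. The three dates and the 'is_weekend' column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with dates stored as strings.
toy = pd.DataFrame({"date": ["2024-01-05", "2024-01-06", "2024-02-12"]})
toy["date"] = pd.to_datetime(toy["date"])

toy["month"] = toy["date"].dt.month
toy["weekday"] = toy["date"].dt.weekday   # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["weekday"] >= 5   # True for Saturday or Sunday

print(toy)
```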

📝 Your Turn

Extract hour and minute features from a datetime column and check the results.

3) Modeling & Validation

After preprocessing and feature engineering, you can train models. Use a train/test split and cross-validation to evaluate them, and hyperparameter tuning to improve them.

3.1 Train/Test Split

Split the dataset into training and test sets so that model performance can be validated on unseen data.

from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)
y = df['target']

# Example 1 — 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Example 2 — 80-20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Example 3: stratified split for classification (keeps the class proportions of y in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Example 4 — check shapes
print(X_train.shape, X_test.shape)

# Example 5 — random_state reproducibility
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, random_state=1)
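
To see what stratify does, compare the class proportions in each part of the split. A minimal sketch with made-up imbalanced labels (80 zeros and 20 ones):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify, train and test keep the same share of positives as the full data.
print("positive share overall: ", y_demo.mean())
print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Without stratify, a small test set can end up with noticeably more or fewer positives than the full data, which makes the test score less representative.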

📝 Your Turn

Try splitting the dataset with test_size=0.25 and check the train/test sizes.

3.2 Cross-Validation

Cross-validation gives a more reliable estimate of model performance than a single split: the data is divided into k folds, and the model is trained k times, each time validated on a different fold.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)  # fixed seed so the scores are reproducible
scores = cross_val_score(rf, X, y, cv=5)

print("Average CV Score:", scores.mean())
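
The average hides how much the score varies from fold to fold. The sketch below prints each fold's score and the spread. It runs on synthetic data from make_classification, so the numbers say nothing about any real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic classification data, used only so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold scores:", scores.round(3))
print("Mean: %.3f  Std: %.3f" % (scores.mean(), scores.std()))
```

A large standard deviation suggests that the estimate depends heavily on which rows land in which fold.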

📝 Your Turn

Try cross-validation with cv=10 and compare scores.

3.3 Hyperparameter Tuning (GridSearchCV)

GridSearchCV tries every combination of the hyperparameter values you list and uses cross-validation to pick the combination with the best score.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors':[3,5,7]}
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=5)
grid.fit(X, y)

print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)
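
KNN is distance-based, so it usually benefits from scaled features. One way to combine scaling with a grid search is to put both steps in a Pipeline, so the scaler is refit inside each cross-validation fold. This is a sketch on synthetic data; the step names "scale" and "knn" are arbitrary labels chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": [3, 5, 7]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_demo, y_demo)

print("Best params:", grid.best_params_)
print("Best CV score:", round(grid.best_score_, 3))
```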

📝 Your Turn

Try adding 'weights': ['uniform', 'distance'] to param_grid and see which parameters come out best.

4) End-to-End Example — Customer Churn

This example shows preprocessing, model training, hyperparameter tuning, and evaluation together in one pipeline on a sample churn dataset.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Sample dataset
df = pd.read_csv('customer_churn.csv')

# Preprocessing
num_cols = ['age','balance']
cat_cols = ['gender','region']
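num_transformer = StandardScaler()
# handle_unknown='ignore' avoids an error if the test data has a category not seen during fit
cat_transformer = OneHotEncoder(handle_unknown='ignore')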

preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols)
])

# Pipeline
clf = Pipeline([
    ('prep', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))  # fixed seed for reproducible results
])

# Train/test split
X = df.drop('churn', axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Hyperparameter tuning
param_grid = {'classifier__n_estimators':[50,100],'classifier__max_depth':[5,10]}
grid = GridSearchCV(clf, param_grid, cv=5)
grid.fit(X_train, y_train)

# Evaluation
y_pred = grid.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Best Parameters:", grid.best_params_)
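
Accuracy alone can be misleading when one class is much rarer than the other, which is often the case for churn. A confusion matrix and a per-class report give a fuller picture. The sketch below runs on synthetic imbalanced data, not on customer_churn.csv, so its numbers are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for a churn dataset (about 20% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_te, y_pred))
```

For churn, the recall of the positive class is often the number to watch, because it tells you what share of the customers who actually left the model managed to flag.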

📝 Your Turn

Try changing test_size to 0.2 and see how the accuracy changes. Also, add another numeric column to num_cols and retrain.

5) Practice Exercises

Try these exercises to reinforce today's topics.

1. Impute Missing Values

# Fill missing 'age' in the Titanic dataset with the mean and with the median, then compare the results

2. Scale Features and Compare Accuracy

# Scale 'age' and 'fare' using StandardScaler and MinMaxScaler, then compare LogisticRegression accuracy

3. Create New Features

# Generate 3 new features from a sales dataset: total_value, discounted_price, value_per_item

4. Cross-Validation

# Compare RandomForestClassifier and SVC using 5-fold cross-validation

5. GridSearchCV

# Find the best 'k' for a KNN classifier using GridSearchCV on the Titanic dataset

📝 Your Turn

Try modifying any parameter, feature, or preprocessing step in each exercise and observe how the model accuracy or outputs change. Copy, paste, run, and experiment!
