Python for Machine Learning & Data Science — Day 6 Full Guide

Topic-packed: MLOps basics, deployment, monitoring, time series, unsupervised learning, recommenders, interpretability, and fairness, with plain-language explanations and runnable examples.

1) MLOps & Model Deployment

Explanation: Once your model performs well, the next step is to put it into production, that is, to make it available to users or services. This process is called deployment. Deployment combined with monitoring and repeatable pipelines is what people call MLOps. Here we look at simple options: a Flask API, FastAPI, Docker, and a quick look at serverless and cloud deployment concepts.

1.1 Option A — Simple Flask API (local)

Why: Flask is minimal and good for learning. This approach works best for local testing before containerization.

# Example 1 — simple Flask app serving a scikit-learn model
# Save this as app.py and run `flask run` (or python app.py)
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load('model.joblib')  # save model earlier with joblib.dump

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()  # expect {"features": [f1, f2, ...]}
    X = np.array(data['features']).reshape(1, -1)
    pred = model.predict(X).tolist()
    proba = model.predict_proba(X).tolist() if hasattr(model,'predict_proba') else None
    return jsonify({'prediction': pred, 'probability': proba})

if __name__ == '__main__':
    # debug=True is for local development only; disable it before exposing the service
    app.run(debug=True, host='0.0.0.0', port=5000)

Practice:

Train a simple model (e.g., RandomForest on Iris), save with joblib.dump, then create this Flask app and test POST requests using curl or Postman.

1.2 Option B — FastAPI (better for async & docs)

# Example 2 — FastAPI example (automatic docs with /docs)
# Save as main.py and run: uvicorn main:app --reload
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

class InFeatures(BaseModel):
    features: list[float]  # flat list of numeric feature values (Python 3.9+ syntax)

app = FastAPI()
model = joblib.load('model.joblib')

@app.post('/predict')
def predict(payload: InFeatures):
    X = np.array(payload.features).reshape(1, -1)
    pred = model.predict(X).tolist()
    return {'prediction': pred}

Practice:

Run FastAPI and open http://localhost:8000/docs to see interactive docs. Try sending sample JSON.

1.3 Containerize with Docker

# Example 3 — sample Dockerfile for Flask app
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]

Practice:

Build and run the Docker image: `docker build -t mymodel .` then `docker run -p 5000:5000 mymodel`. Test the endpoint locally.

1.4 Option C — Serverless / Cloud Notes

Short: AWS Lambda (with API Gateway), GCP Cloud Functions, or Azure Functions can serve small models. Use serverless when its scaling behavior and pay-per-use cost model suit your traffic. For bigger models, deploy to managed services (SageMaker, Vertex AI).
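To make the serverless option concrete, here is a minimal sketch of a handler in the AWS Lambda style. It assumes an API Gateway proxy event, where the request body arrives as a JSON string. `ThresholdModel` is a hypothetical stand-in so the snippet runs without a saved artifact; in a real function you would load your own model (for example with joblib) at import time instead.

```python
import json

class ThresholdModel:
    """Hypothetical stand-in for a model loaded with joblib.load('model.joblib')."""
    def predict(self, rows):
        return [int(sum(row) > 10) for row in rows]

# Created once at import time, so warm invocations reuse the same object.
MODEL = ThresholdModel()

def handler(event, context):
    """Parse {"features": [...]} from the request body and return a prediction."""
    try:
        payload = json.loads(event.get('body') or '{}')
        features = payload['features']
    except (KeyError, TypeError, json.JSONDecodeError):
        return {'statusCode': 400,
                'body': json.dumps({'error': 'expected {"features": [...]}'})}
    prediction = MODEL.predict([features])
    return {'statusCode': 200, 'body': json.dumps({'prediction': prediction})}

# Local smoke test with a hand-built event
ok = handler({'body': json.dumps({'features': [5.1, 3.5, 1.4, 0.2]})}, None)
bad = handler({}, None)
print(ok['statusCode'], ok['body'])
print(bad['statusCode'])
```

Loading the model outside the handler is the usual design choice here: it keeps per-request work small, at the cost of a slower first (cold) invocation.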

1.5 Example: Save & Load model (joblib)

# Example 5 — train, save, and load
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import joblib
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)
joblib.dump(clf, 'model.joblib')
clf2 = joblib.load('model.joblib')
print("Loaded model predicts:", clf2.predict(X[:2]))

Practice:

Train a model and serve it locally using any of the above methods. Note latency and throughput for a few requests.
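One way to note latency and throughput is a small timing loop. This is a rough sketch using only the standard library; `fake_predict` is a hypothetical placeholder that you would swap for a real `model.predict` call or an HTTP request to your endpoint.

```python
import statistics
import time

def fake_predict(features):
    # Placeholder for model.predict(...) or an HTTP call to /predict
    return sum(features) > 10

def measure_latency(predict_fn, payloads):
    """Return per-request latency stats (ms) and overall throughput (requests/sec)."""
    latencies = []
    start = time.perf_counter()
    for payload in payloads:
        t0 = time.perf_counter()
        predict_fn(payload)
        latencies.append((time.perf_counter() - t0) * 1000)
    total = time.perf_counter() - start
    latencies.sort()
    p95_index = max(0, int(round(0.95 * len(latencies))) - 1)
    return {
        'n': len(latencies),
        'mean_ms': statistics.mean(latencies),
        'p95_ms': latencies[p95_index],
        'throughput_rps': len(latencies) / total if total > 0 else float('inf'),
    }

stats = measure_latency(fake_predict, [[5.1, 3.5, 1.4, 0.2]] * 100)
print(stats)
```

Reporting a high percentile (p95 here) next to the mean helps because a few slow requests can hide behind a good average.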

2) Model Monitoring & CI/CD

Explanation: After you deploy a model to production, you need to monitor it: latency, error rates, prediction distribution, data drift, and model drift. Also adopt CI/CD for models and pipelines so that deployments are repeatable and testable.

2.1 What to monitor?

Latency & request errors
Prediction distribution drift (population stats)
Data quality (missing features, out-of-bound values)
Model performance on labelled subset (if available)
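As an illustration of the data quality item above, the sketch below checks one incoming request for missing or out-of-bound values. The feature names and the `BOUNDS` table are hypothetical; in practice you would derive the limits from your training data.

```python
import math

# Hypothetical per-feature bounds, e.g. taken from the training set
BOUNDS = {'age': (18, 100), 'income': (0, 1_000_000)}

def check_record(record, bounds=BOUNDS):
    """Return a list of data-quality problems found in one request."""
    problems = []
    for name, (low, high) in bounds.items():
        value = record.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f'{name}: missing')
        elif not low <= value <= high:
            problems.append(f'{name}: {value} outside [{low}, {high}]')
    return problems

clean = check_record({'age': 35, 'income': 54000})
dirty = check_record({'age': 150})
print(clean)
print(dirty)
```

Counting these problems per hour or per day gives a simple data-quality metric to put on a dashboard.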

2.2 Quick code: log predictions to file for offline monitoring

# Example 1 — simple logging of requests for later analysis
import json
from datetime import datetime, timezone

def log_prediction(request_id, features, prediction, proba=None):
    # One JSON object per line (JSON Lines), so pandas can read it back with lines=True
    record = {
        'ts': datetime.now(timezone.utc).isoformat(),
        'request_id': request_id,
        'features': features,
        'prediction': prediction,
        'probability': proba
    }
    with open('predictions.log', 'a') as f:
        f.write(json.dumps(record) + '\n')

Practice:

Run a few predictions, then parse predictions.log with pandas and look at distribution of predictions and input features over time.
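A possible starting point for this exercise is sketched below. It reads JSON Lines records with pandas and summarizes predictions and features per day. The four records are made-up sample data held in memory so the snippet runs on its own; with a real log you would pass 'predictions.log' to `pd.read_json` instead.

```python
import io
import pandas as pd

# Made-up records in the same shape that log_prediction writes
sample_log = io.StringIO(
    '{"ts": "2024-01-01T10:00:00", "request_id": "r1", "features": [5.1, 3.5], "prediction": [0]}\n'
    '{"ts": "2024-01-01T15:00:00", "request_id": "r2", "features": [4.9, 3.0], "prediction": [0]}\n'
    '{"ts": "2024-01-02T09:00:00", "request_id": "r3", "features": [6.0, 2.9], "prediction": [1]}\n'
    '{"ts": "2024-01-02T18:00:00", "request_id": "r4", "features": [6.5, 3.0], "prediction": [2]}\n'
)
df = pd.read_json(sample_log, lines=True)
df['ts'] = pd.to_datetime(df['ts'])
df['pred'] = df['prediction'].apply(lambda p: p[0])  # unwrap the one-element list

# Share of each predicted class
pred_share = df['pred'].value_counts(normalize=True)

# Expand the feature lists into columns and average them per day
features = pd.DataFrame(df['features'].tolist(), columns=['f0', 'f1'], index=df['ts'])
daily_mean = features.resample('D').mean()

print(pred_share)
print(daily_mean)
```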

2.3 Simple drift detection (feature mean change)

# Example 2 — naive drift detection by comparing means
import pandas as pd

# assume predictions.log converted to CSV or JSON lines as df
# baseline_stats is a dict saved during offline validation
baseline_stats = {'age_mean': 35.2, 'income_mean': 54000}
df = pd.read_json('predictions.log', lines=True)
current_age_mean = df['features'].apply(lambda x: x[0]).mean()
if abs(current_age_mean - baseline_stats['age_mean']) > 5:
    print("Age mean drift detected")

Practice:

Create a baseline using holdout or training data, then simulate new requests with a shifted feature and observe drift detection.
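The exercise can be simulated without any real traffic. The sketch below draws a synthetic baseline and two batches of "new requests" (one unchanged, one with the mean shifted by 7), then flags drift when the batch mean sits more than a few standard errors from the baseline mean. The distributions and the threshold of 3 are illustrative assumptions, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(loc=35, scale=10, size=5000)  # stand-in for training-set ages
stable = rng.normal(loc=35, scale=10, size=500)     # new traffic, same distribution
shifted = rng.normal(loc=42, scale=10, size=500)    # new traffic, mean moved by +7

def mean_drift(baseline, current, z_threshold=3.0):
    """Compare means in units of the standard error of the current batch mean."""
    standard_error = baseline.std(ddof=1) / np.sqrt(len(current))
    z = (current.mean() - baseline.mean()) / standard_error
    return bool(abs(z) > z_threshold), float(z)

stable_flag, stable_z = mean_drift(baseline, stable)
shifted_flag, shifted_z = mean_drift(baseline, shifted)
print('stable batch :', stable_flag, round(stable_z, 2))
print('shifted batch:', shifted_flag, round(shifted_z, 2))
```

Scaling by the standard error makes the check sensitive to batch size, unlike a fixed absolute cutoff such as the "> 5" used in the example above.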

2.4 CI/CD for ML pipelines (concept)

Short: CI/CD pipeline for ML includes: data tests, model tests (unit tests + performance tests), container build, integration tests, and automated deployment. Tools: GitHub Actions, Jenkins, GitLab CI, MLflow, TFX.

2.5 Quick example: GitHub Actions snippet (sketch)

# .github/workflows/ci.yml (sketch)
name: ml-ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'  # quoted so YAML does not read it as the number 3.1
    - name: Install
      run: pip install -r requirements.txt
    - name: Run tests
      run: pytest

Practice:

Add a small unit test for your data preprocessing and run it on push using GitHub Actions (free for public repos).
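A unit test for preprocessing can be very small. The sketch below defines a hypothetical helper, `fill_missing_with_median`, and a pytest-style test for it. pytest collects functions whose names start with `test_`, so saving something like this in a file such as test_preprocessing.py (an assumed name) is enough for the `pytest` step in the workflow to pick it up.

```python
import math

def fill_missing_with_median(values):
    """Replace None/NaN entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None and not math.isnan(v))
    if not observed:
        raise ValueError('no observed values to compute a median from')
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None or math.isnan(v) else v for v in values]

def test_fill_missing_with_median():
    assert fill_missing_with_median([1.0, None, 3.0]) == [1.0, 2.0, 3.0]
    assert fill_missing_with_median([4.0, float('nan'), 2.0, 9.0]) == [4.0, 4.0, 2.0, 9.0]

test_fill_missing_with_median()
print('preprocessing test passed')
```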

3) Time Series Forecasting

Explanation: In time series data, observations are ordered by time. Common tasks are forecasting future values and anomaly detection. It is important to account for seasonality, trend, and autocorrelation. We'll cover ARIMA, Prophet (sketch), and simple ML approaches with lag features.

3.1 Classical approach: ARIMA (statsmodels)

# Example 1 — ARIMA using statsmodels (install statsmodels)
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# sample data: daily series in df['y']
# df = pd.read_csv('daily.csv', parse_dates=['ds'], index_col='ds')
# For the demo, create a synthetic series: linear trend plus a sine wave
rng = pd.date_range('2021-01-01', periods=200)
ts = pd.Series(10 + 0.1 * np.arange(200) + 2 * np.sin(np.linspace(0, 20, 200)), index=rng)
model = ARIMA(ts, order=(2,1,2))
res = model.fit()
print(res.summary())
print("Forecast next 5:", res.forecast(5))

Practice:

Fit ARIMA on a time series and plot residuals to check assumptions (white noise residuals desirable).

3.2 Prophet (Facebook/Meta Prophet)

# Example 2 — Prophet (install prophet package)
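import pandas as pd
from prophet import Prophet
# Reuses rng and ts from Example 1; Prophet expects columns named 'ds' and 'y'
df = pd.DataFrame({'ds': rng, 'y': ts.values})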
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[['ds','yhat','yhat_lower','yhat_upper']].tail())

Practice:

Try Prophet with holidays or weekly seasonality settings and observe changes in forecasts.

3.3 ML with lag features

# Example 3 — use lags + RandomForest
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
# create lag features (reuses ts from Example 1)
df_ts = pd.DataFrame({'y': ts})
for lag in range(1, 8):
    df_ts[f'lag_{lag}'] = df_ts['y'].shift(lag)
df_ts.dropna(inplace=True)
X = df_ts.drop('y', axis=1).values
y = df_ts['y'].values
tscv = TimeSeriesSplit(n_splits=5)
rf = RandomForestRegressor(n_estimators=50, random_state=0)
# cross-validate with time-ordered splits so each fold trains only on the past
scores = cross_val_score(rf, X, y, cv=tscv, scoring='neg_mean_absolute_error')
print('MAE (cv):', -scores.mean())

Practice:

Create rolling-window features like moving averages and include them as predictors. Compare model error.
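A sketch of rolling-window features on the same kind of synthetic series is shown below. The one detail that matters is the `shift(1)` before `rolling(...)`: it makes each window end at the previous time step, so the feature never contains the value being predicted. The window length of 7 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=200)
ts = pd.Series(10 + 0.1 * np.arange(200) + 2 * np.sin(np.linspace(0, 20, 200)), index=idx)

feats = pd.DataFrame({'y': ts})
feats['lag_1'] = feats['y'].shift(1)
# shift(1) first: the 7-step window covers t-7 .. t-1 and excludes the target y[t]
feats['roll_mean_7'] = feats['y'].shift(1).rolling(7).mean()
feats['roll_std_7'] = feats['y'].shift(1).rolling(7).std()
feats = feats.dropna()

print(feats.head())
print('rows after dropping warm-up period:', len(feats))
```

These columns can be appended to the lag features from Example 3 and scored with the same `TimeSeriesSplit` cross-validation to compare error.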

3.4 Anomaly detection in time series

# Example 4 — simple z-score anomaly detection
import numpy as np
# reuses ts from Example 1 in Section 3.1
rolling_mean = ts.rolling(7).mean()
rolling_std = ts.rolling(7).std()
zscore = (ts - rolling_mean) / rolling_std
anomalies = ts[np.abs(zscore) > 2]
print('Anomalies found:', anomalies.head())

Practice:

Simulate spikes in the series and check detection sensitivity by changing z-score threshold.
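The sketch below injects three artificial spikes into a synthetic series and counts detections at several thresholds. It differs from Example 4 in one respect, which is an assumption of this sketch rather than part of the original method: the rolling statistics are computed on past values only (`shift(1)`), so a spike does not inflate its own baseline. The spike size and positions are arbitrary.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=200)
ts = pd.Series(10 + 0.1 * np.arange(200) + 2 * np.sin(np.linspace(0, 20, 200)), index=idx)

# Inject artificial spikes at known positions
spiky = ts.copy()
spike_positions = [50, 120, 180]
spiky.iloc[spike_positions] += 15

def detect(series, window=7, threshold=2.0):
    """Flag points whose z-score against the previous `window` values exceeds threshold."""
    past_mean = series.shift(1).rolling(window).mean()
    past_std = series.shift(1).rolling(window).std()
    z = (series - past_mean) / past_std
    return series[np.abs(z) > threshold]

for thr in (2.0, 3.0, 5.0):
    found = detect(spiky, threshold=thr)
    hits = sum(spiky.index[p] in found.index for p in spike_positions)
    print(f'threshold={thr}: {len(found)} flagged, {hits} of 3 injected spikes found')
```

Lower thresholds catch every spike but also flag ordinary points on a smooth, trending series; raising the threshold trades those false alarms against the risk of missing smaller spikes.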

3.5 Metrics for forecast evaluation

# Example 5 — compute MAE, MAPE for forecast
import numpy as np
from sklearn.metrics import mean_absolute_error

def mape(y_true, y_pred):
    # Mean absolute percentage error; undefined when any y_true value is 0
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
y_true = np.array([100,120,130])
y_pred = np.array([90,115,140])
print('MAE:', mean_absolute_error(y_true,y_pred))
print('MAPE:', mape(y_true,y_pred))

Practice:

Use rolling-origin evaluation (walk-forward) when evaluating forecasts for realistic results.
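Rolling-origin evaluation can be sketched in a few lines of NumPy. The function below forecasts from each origin using only the data before it, then moves the origin forward. The default forecaster is a naive "repeat the last value" baseline, chosen here only to keep the example dependency-free; you would pass in your own model's fit-and-forecast function.

```python
import numpy as np

def walk_forward_mae(series, initial_train, horizon=1, forecaster=None):
    """Rolling-origin evaluation: forecast from each origin, then advance by `horizon`."""
    if forecaster is None:
        # Naive baseline: repeat the last observed value
        forecaster = lambda history, h: np.repeat(history[-1], h)
    errors = []
    for origin in range(initial_train, len(series) - horizon + 1, horizon):
        history = series[:origin]                 # everything known at forecast time
        actual = series[origin:origin + horizon]  # what actually happened next
        forecast = forecaster(history, horizon)
        errors.append(np.mean(np.abs(actual - forecast)))
    return float(np.mean(errors)), len(errors)

y = np.arange(20, dtype=float)  # simple upward trend: 0, 1, ..., 19
mae, n_folds = walk_forward_mae(y, initial_train=10)
print('MAE:', mae, 'folds:', n_folds)
```

On this toy trend the naive forecast is always one step behind, so every fold has an absolute error of exactly 1.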

4) Unsupervised Learning: PCA & Clustering

Explanation: When labels are not available, unsupervised methods help you explore structure in the data: dimensionality reduction with PCA, and clustering with k-means, hierarchical clustering, or DBSCAN. Use cases include anomaly detection, customer segmentation, and visualization.

4.1 PCA — Principal Component Analysis

# Example 1 — PCA on iris features
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
print('Explained variance ratio:', pca.explained_variance_ratio_)
plt.scatter(Xp[:,0], Xp[:,1], c=y)
plt.title('PCA 2D')
plt.show()

Practice:

Check how many components are required to capture 90% of the variance using pca.explained_variance_ratio_.
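One way to answer this is to fit PCA with all components and take the cumulative sum of the explained variance ratios. The sketch below does that on the raw (unscaled) Iris features; standardizing the features first would change the numbers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca_full = PCA().fit(X)  # keep every component
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.90) + 1)  # first index reaching 90%, as a count
print('Cumulative explained variance:', cumulative.round(4))
print('Components needed for 90%:', n_needed)

# scikit-learn can also pick the count directly when n_components is a fraction
pca_90 = PCA(n_components=0.90).fit(X)
print('Chosen by PCA(n_components=0.90):', pca_90.n_components_)
```

On unscaled Iris the first component alone carries over 90% of the variance, largely because petal length has a much wider range than the other features.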

4.2 k-Means Clustering

# Example 2 — k-means clustering and elbow method
from sklearn.cluster import KMeans
inertia = []
for k in range(1,8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertia.append(km.inertia_)
print('Inertia:', inertia)
# pick k by elbow and plot clusters
km = KMeans(n_clusters=3, random_state=0).fit(X)
labels = km.labels_
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title('k-means clusters (first two features)')
plt.show()

Practice:

Try silhouette scores for different k to select cluster count (sklearn.metrics.silhouette_score).
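A sketch of that comparison on Iris follows. The silhouette score needs at least two clusters, so the loop starts at k=2; `n_init=10` and `random_state=0` are set only to make runs repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 (poor) to 1 (well separated)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f'k={k}: silhouette={score:.3f}')
print('Best k by silhouette:', best_k)
```

Note that the silhouette favors k=2 on Iris even though the dataset has three species, because two of them overlap. Treat the score as one input alongside the elbow plot and domain knowledge, not as the final answer.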

4.3 DBSCAN (density-based)

# Example 3 — DBSCAN for clusters of varying shape
from sklearn.cluster import DBSCAN
import numpy as np
# create sample with noise (use sklearn.datasets.make_moons or make_blobs)
from sklearn.datasets import make_moons
Xm, _ = make_moons(n_samples=300, noise=0.05)
db = DBSCAN(eps=0.2, min_samples=5).fit(Xm)
plt.scatter(Xm[:,0], Xm[:,1], c=db.labels_)
plt.title('DBSCAN Clusters')
plt.show()

Practice:

Change eps and min_samples to see dense vs sparse cluster assignments and noise (-1 label).

4.4 Hierarchical Clustering

# Example 4 — hierarchical dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X[:50], method='ward')
dendrogram(Z)
plt.title('Dendrogram (first 50 samples)')
plt.show()

Practice:

Cut dendrogram at different heights and inspect cluster membership.
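Cutting the tree can be done with SciPy's `fcluster`, either by asking for a maximum number of clusters or by giving a distance (height) threshold. The sketch below shows both on the same 50 samples; the heights 1.0 and 2.0 are arbitrary values for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
Z = linkage(X[:50], method='ward')

# Cut so that at most 3 flat clusters remain
labels_3 = fcluster(Z, t=3, criterion='maxclust')
print('Cluster sizes (maxclust=3):', np.bincount(labels_3)[1:])

# Cut at fixed heights: a lower cut leaves more, smaller clusters
n_low = len(np.unique(fcluster(Z, t=1.0, criterion='distance')))
n_high = len(np.unique(fcluster(Z, t=2.0, criterion='distance')))
print('Clusters when cutting at height 1.0:', n_low)
print('Clusters when cutting at height 2.0:', n_high)
```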

4.5 Use unsupervised learning for feature engineering

# Example 5 — use cluster label as feature
km = KMeans(n_clusters=4).fit(X)
cluster_feat = km.predict(X).reshape(-1,1)
# append cluster_feat to original dataset for downstream supervised model
import numpy as np
X_new = np.hstack([X, cluster_feat])
print('New feature shape:', X_new.shape)

Practice:

Train a supervised model with and without cluster feature and compare performance.

5) Recommender Systems

Explanation: Recommenders suggest items to users. Basic approaches are popularity-based ranking, collaborative filtering (user-item matrix), content-based methods, and hybrid systems. We'll show simple matrix factorization and nearest-neighbour methods.

5.1 Popularity-based baseline

# Example 1 — Top-N popular items
import pandas as pd
# sample ratings
ratings = pd.DataFrame({'user':[1,2,1,3,2],'item':[10,10,11,11,12],'rating':[5,4,5,4,3]})
top_items = ratings.groupby('item')['rating'].count().sort_values(ascending=False)
print('Top items by count:\n', top_items)

Practice:

Implement top-N by average rating and compare to count-based ranking.
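A sketch of the comparison on the same toy ratings is below. The `min_count` cutoff is an assumption added for illustration: without it, an item with a single 5-star rating would outrank everything.

```python
import pandas as pd

ratings = pd.DataFrame({'user': [1, 2, 1, 3, 2],
                        'item': [10, 10, 11, 11, 12],
                        'rating': [5, 4, 5, 4, 3]})

# Count and mean rating per item in one pass
stats = ratings.groupby('item')['rating'].agg(['count', 'mean'])

top_by_count = stats.sort_values('count', ascending=False)

min_count = 2  # hypothetical cutoff so one lone rating cannot top the list
top_by_mean = (stats[stats['count'] >= min_count]
               .sort_values(['mean', 'count'], ascending=False))

print('By count:\n', top_by_count)
print('By mean (at least', min_count, 'ratings):\n', top_by_mean)
```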

5.2 Collaborative Filtering — user-based KNN

# Example 2 — user-user collaborative filtering (sketch)
from sklearn.metrics.pairwise import cosine_similarity
user_item = ratings.pivot_table(index='user', columns='item', values='rating').fillna(0)
sim = cosine_similarity(user_item)
sim_df = pd.DataFrame(sim, index=user_item.index, columns=user_item.index)
print(sim_df)
# recommend items liked by similar users but not by target user

Practice:

Pick a target user and recommend top items using weighted scores from neighbors.
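A minimal sketch of neighbour-weighted scoring on the toy ratings follows. It treats unrated items as 0, as the example above does with `fillna(0)`, which is a simplification: a real system would separate "not rated" from "rated low". The function name `recommend` is just a label for this sketch.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame({'user': [1, 2, 1, 3, 2],
                        'item': [10, 10, 11, 11, 12],
                        'rating': [5, 4, 5, 4, 3]})
user_item = ratings.pivot_table(index='user', columns='item', values='rating').fillna(0)
sim = pd.DataFrame(cosine_similarity(user_item),
                   index=user_item.index, columns=user_item.index)

def recommend(target, top_n=2):
    """Score unseen items by the similarity-weighted average of other users' ratings."""
    weights = sim[target].drop(target)         # similarity to every other user
    neighbours = user_item.drop(index=target)  # their rating rows
    scores = neighbours.mul(weights, axis=0).sum() / (weights.abs().sum() + 1e-9)
    unseen = user_item.loc[target] == 0
    return scores[unseen].sort_values(ascending=False).head(top_n)

print('For user 1:\n', recommend(1))
print('For user 3:\n', recommend(3))
```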

5.3 Matrix Factorization (SVD)

# Example 3 — SVD using sklearn TruncatedSVD on user-item matrix
from sklearn.decomposition import TruncatedSVD
U = user_item.values
svd = TruncatedSVD(n_components=2)
latent = svd.fit_transform(U)
reconstructed = svd.inverse_transform(latent)
print('Reconstructed matrix shape:', reconstructed.shape)

Practice:

Compute predicted score for missing entries and recommend highest predicted items for a user.
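The sketch below continues the TruncatedSVD idea: the reconstructed matrix is read as predicted scores, and for each user only items they have not rated are ranked. With a 3x3 toy matrix the numbers are not meaningful; the point is the mechanics. `recommend_svd` is a name made up for this sketch.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

ratings = pd.DataFrame({'user': [1, 2, 1, 3, 2],
                        'item': [10, 10, 11, 11, 12],
                        'rating': [5, 4, 5, 4, 3]})
user_item = ratings.pivot_table(index='user', columns='item', values='rating').fillna(0)

svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(user_item.values)
# Low-rank reconstruction, read here as predicted scores for every user-item pair
predicted = pd.DataFrame(svd.inverse_transform(latent),
                         index=user_item.index, columns=user_item.columns)

def recommend_svd(user, top_n=1):
    """Rank the items this user has not rated by their reconstructed score."""
    unseen = user_item.loc[user] == 0
    return predicted.loc[user][unseen].sort_values(ascending=False).head(top_n)

print(predicted.round(2))
print('For user 1:\n', recommend_svd(1))
```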

5.4 Content-based recommendations

# Example 4 — content-based using TF-IDF for item descriptions
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["action sci-fi", "romantic drama", "sci-fi adventure", "action thriller"]
tfidf = TfidfVectorizer().fit_transform(docs)
from sklearn.metrics.pairwise import linear_kernel
cos_sim = linear_kernel(tfidf[0:1], tfidf).flatten()
print(cos_sim)

Practice:

For a given item description, recommend similar items using cosine similarity on TF-IDF vectors.
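One way to turn the similarity vector into recommendations is to sort it and drop the query item itself. The sketch below wraps that in a helper, `similar_items` (a hypothetical name), using the same four descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = ["action sci-fi", "romantic drama", "sci-fi adventure", "action thriller"]
tfidf = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

def similar_items(item_idx, top_n=2):
    """Return (index, description, similarity) for the items closest to item_idx."""
    # On L2-normalized rows the dot product equals cosine similarity
    sims = linear_kernel(tfidf[item_idx:item_idx + 1], tfidf).flatten()
    sims[item_idx] = -1.0  # never recommend the query item itself
    ranked = sims.argsort()[::-1][:top_n]
    return [(int(i), docs[i], float(sims[i])) for i in ranked]

for idx, text, score in similar_items(0):
    print(f'{idx}: {text!r} (similarity {score:.3f})')
```

For "action sci-fi", the closest item is "sci-fi adventure" (two shared tokens, because the default tokenizer splits "sci-fi" into "sci" and "fi"), followed by "action thriller".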

5.5 Hybrid & evaluation metrics for recommenders

Explanation: A hybrid recommender combines collaborative and content-based signals. Evaluate with precision@k, recall@k, MAP, and NDCG. Offline evaluation needs careful train-test splitting (leave-one-out or time-based).

# Example 5 — precision@k (sketch)
def precision_at_k(true_items, pred_items, k):
    return len(set(true_items) & set(pred_items[:k])) / k
print('Precision@3:', precision_at_k([10,11], [11,10,12], 3))

Practice:

Implement precision@k for your recommender and report for a test set.
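Alongside precision@k, the other metrics named above can be sketched the same way. The code below adds recall@k and a binary-relevance NDCG@k, then averages a metric over a small test set. The `test_set` contents are made up purely to show the call pattern.

```python
import math

def recall_at_k(true_items, pred_items, k):
    """Fraction of the relevant items that appear in the top k."""
    relevant = set(true_items)
    if not relevant:
        return 0.0
    return len(relevant & set(pred_items[:k])) / len(relevant)

def ndcg_at_k(true_items, pred_items, k):
    """NDCG with binary relevance: hits near the top count more."""
    relevant = set(true_items)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(pred_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def mean_metric(metric, test_set, k):
    """Average a per-user metric over (true_items, pred_items) pairs."""
    return sum(metric(t, p, k) for t, p in test_set) / len(test_set)

# Made-up test set: (held-out items, ranked recommendations) per user
test_set = [([10, 11], [11, 10, 12]), ([10, 11], [12, 10, 11])]
print('Recall@3:', mean_metric(recall_at_k, test_set, 3))
print('NDCG@3:', round(mean_metric(ndcg_at_k, test_set, 3), 4))
```

Both lists contain the same relevant items, so recall@3 is identical for them, while NDCG penalizes the second list for placing an irrelevant item first.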

6) Model Interpretability & Fairness

Explanation: Interpretability means explaining why a model made a prediction. Tools include feature importances (tree models), coefficients (linear models), SHAP, and LIME. Fairness means ensuring the model doesn't discriminate across sensitive groups. We'll show SHAP usage (sketch) and fairness checks.

6.1 Feature importance & coefficients

# Example 1 — tree feature importances
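from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
print('Feature importances:', rf.feature_importances_)
# Linear model coefficients (one row per class; max_iter raised so the solver converges)
lr = LogisticRegression(max_iter=1000).fit(X, y)
print('Coefficients:', lr.coef_)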

Practice:

Use feature importances to remove weak features and retrain. Observe effect on performance.

6.2 SHAP (sketch)

# Example 2 — SHAP quick sketch (install shap)
# import shap
# explainer = shap.TreeExplainer(rf)
# shap_values = explainer.shap_values(X[:50])
# shap.summary_plot(shap_values, X[:50])
print("SHAP example requires shap installation; use on local machine/Colab.")

Practice:

Install shap and run summary and force plots to explain specific predictions.

6.3 Fairness checks — group metrics

# Example 3 — check accuracy across groups
import numpy as np
# assume df has 'sensitive' column (e.g., gender) and predictions in pred
# simulate
sensitive = np.random.choice(['A','B'], size=len(y))
pred = rf.predict(X)
from sklearn.metrics import accuracy_score
acc_A = accuracy_score(y[sensitive=='A'], pred[sensitive=='A'])
acc_B = accuracy_score(y[sensitive=='B'], pred[sensitive=='B'])
print('Acc A:', acc_A, 'Acc B:', acc_B)

Practice:

If performance differs substantially between groups, consider reweighing, collecting more data, or fairness-aware algorithms (worth a quick literature search before choosing one).
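Accuracy is only one lens. Two other common group metrics are the selection rate (share of positive predictions) and the true positive rate per group. The sketch below computes both for a binary classifier on a tiny hand-made example; the arrays are illustrative, not real data.

```python
import numpy as np

def group_rates(y_true, y_pred, sensitive):
    """Per-group selection rate and true positive rate for a binary classifier."""
    rates = {}
    for group in np.unique(sensitive):
        mask = sensitive == group
        positives = mask & (y_true == 1)
        rates[str(group)] = {
            'selection_rate': float(y_pred[mask].mean()),
            'tpr': float(y_pred[positives].mean()) if positives.any() else float('nan'),
        }
    return rates

# Hand-made example: 4 people in group A, 4 in group B
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
sensitive = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])

rates = group_rates(y_true, y_pred, sensitive)
parity_gap = abs(rates['A']['selection_rate'] - rates['B']['selection_rate'])
print(rates)
print('Selection-rate gap (demographic parity difference):', parity_gap)
```

A large selection-rate gap or TPR gap is a prompt to investigate, not proof of unfairness on its own; which metric matters depends on the application.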

6.4 Counterfactual explanations (concept)

Short: Counterfactual = minimal change to input that flips prediction. Useful in model transparency (e.g., "If your income were X higher, you would be approved").
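To make the idea concrete, here is a toy counterfactual search. It assumes a hypothetical linear approval rule with made-up weights and changes only one feature (income) in fixed steps until the decision flips. Real counterfactual methods search over several features and add plausibility constraints; this sketch only shows the core loop.

```python
import numpy as np

# Hypothetical linear approval model: approve when weights . x + bias >= 0
weights = np.array([0.00005, 0.02])  # [income, years_employed]
bias = -3.0

def approve(x):
    return float(weights @ x + bias) >= 0

def income_counterfactual(x, step=1000.0, max_steps=200):
    """Smallest income increase (in multiples of `step`) that flips a rejection."""
    if approve(x):
        return 0.0
    candidate = np.array(x, dtype=float)
    for i in range(1, max_steps + 1):
        candidate[0] = x[0] + i * step
        if approve(candidate):
            return i * step
    return None  # no flip found within the search range

applicant = np.array([50000.0, 4.0])
needed = income_counterfactual(applicant)
print('Approved now?', approve(applicant))
print(f'If your income were {needed:.0f} higher, you would be approved.')
```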

6.5 Model cards & documentation

Practice: Create a small model card with intended use, data summary, metrics by group, known limitations — store with model artifacts for governance.
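A model card can start as a plain JSON file saved next to the model. The sketch below shows one possible layout covering the fields listed above. Every value is a placeholder to fill in from your own evaluation, and the field names are a suggestion for this exercise, not a standard schema.

```python
import json
from pathlib import Path

model_card = {
    'model_artifact': 'churn_pipeline.joblib',
    'intended_use': 'TODO: who should use this model and for which decisions',
    'out_of_scope_use': 'TODO: uses the model was not validated for',
    'data_summary': {
        'source': 'TODO: dataset name and date range',
        'n_rows': None,           # fill in after loading the data
        'target': 'churn',
    },
    'metrics_overall': {'roc_auc': None},  # fill in from the test set
    'metrics_by_group': {
        # one entry per sensitive group, e.g. from the fairness check in 6.3
        'group_A': {'accuracy': None},
        'group_B': {'accuracy': None},
    },
    'known_limitations': ['TODO: list known failure modes and data gaps'],
}

card_json = json.dumps(model_card, indent=2)
Path('model_card.json').write_text(card_json)
print(card_json)
```

Keeping the card in the same directory (or registry entry) as the artifact makes it harder for the documentation and the model to drift apart.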

7) End-to-End Mini Project & Practice

Explanation: Now let's build a small real-world-style mini project: load a dataset, preprocess it, engineer features, train a model (and compare alternatives in 7.3), evaluate, and deploy locally (sketch). Running the project step by step will make the full pipeline clear.

7.1 Dataset & Problem

Use a customer churn or credit dataset. This template is generic; just adapt the CSV file name and the column names.

7.2 Full pipeline (runnable template)


# Mini project pipeline template — place customer_churn.csv in working dir
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import joblib

# Load
df = pd.read_csv('customer_churn.csv')
df = df.dropna(subset=['churn'])
df['age'] = df['age'].fillna(df['age'].median())
num_cols = ['age','balance']
cat_cols = ['gender','region']

# Train-test split
X = df.drop(['customer_id','churn'], axis=1)
y = df['churn'].map({'No':0, 'Yes':1}) if df['churn'].dtype=='object' else df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Preprocessor and pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)  # sparse_output needs scikit-learn >= 1.2; older versions use sparse=False
])

pipe = Pipeline([('pre', preprocessor), ('clf', RandomForestClassifier(random_state=42))])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print(classification_report(y_test, pred))
print('ROC-AUC:', roc_auc_score(y_test, pipe.predict_proba(X_test)[:,1]))

# Save model artifact
joblib.dump(pipe, 'churn_pipeline.joblib')
print("Saved pipeline as churn_pipeline.joblib")

# Quick local deploy: test curl (see Flask/FastAPI snippets earlier)

Practice:

Run this template end-to-end. After saving the pipeline, create a simple FastAPI endpoint that loads churn_pipeline.joblib and returns a prediction for a sample input.
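The main difference from the earlier API examples is that this pipeline expects a DataFrame with named columns, not a flat list. The sketch below shows the scoring function an endpoint could call. To stay self-contained it fits the pipeline on a few made-up rows instead of loading churn_pipeline.joblib; `predict_one` and the sample values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for customer_churn.csv
train = pd.DataFrame({
    'age': [25, 40, 33, 58, 47, 29, 61, 36],
    'balance': [1200.0, 300.0, 5600.0, 150.0, 980.0, 4300.0, 75.0, 2100.0],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'region': ['north', 'south', 'north', 'east', 'south', 'east', 'north', 'south'],
    'churn': [0, 1, 0, 1, 1, 0, 1, 0],
})
num_cols, cat_cols = ['age', 'balance'], ['gender', 'region']

# In the real service this would be: pipe = joblib.load('churn_pipeline.joblib')
pipe = Pipeline([
    ('pre', ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ])),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=42)),
]).fit(train[num_cols + cat_cols], train['churn'])

def predict_one(payload: dict) -> dict:
    """Turn one JSON-style request into a one-row DataFrame and score it."""
    row = pd.DataFrame([payload], columns=num_cols + cat_cols)
    return {
        'prediction': int(pipe.predict(row)[0]),
        'churn_probability': float(pipe.predict_proba(row)[0, 1]),
    }

# 'west' never appears in training; handle_unknown='ignore' encodes it as all zeros
result = predict_one({'age': 45, 'balance': 500.0, 'gender': 'F', 'region': 'west'})
print(result)
```

Inside a FastAPI route, a Pydantic model with these four fields could be converted to a dict and passed straight to `predict_one`.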

7.3 Additional experiments (5 ideas)

Try LightGBM/XGBoost instead of RandomForest and compare ROC-AUC.
Add interaction features (age * balance) and re-evaluate.
Implement K-fold target encoding for high-cardinality categories and compare leakage-aware scores.
Build a simple dashboard (Streamlit) that sends sample inputs to your model and shows prediction + SHAP explanation.
Simulate production traffic and log predictions — then run drift detection code from Section 2.

Resources & Next Steps

Scikit-learn docs: model selection, pipelines, decomposition, clustering.
Statsmodels and Prophet: time-series tools.
Docker docs and cloud provider tutorials (AWS, GCP, Azure) for deployment.
SHAP and LIME for interpretability.
MLflow for model registry, tracking, and simple deployment helpers.
