Python for Machine Learning & Data Science — Day 6
Topic-packed: MLOps basics, deployment, monitoring, time-series, unsupervised learning, recommenders, interpretability, fairness — with Hinglish explanations + runnable examples.
1) MLOps & Model Deployment
1.1 Option A — Simple Flask API (local)
# Example 1 — simple Flask app serving a scikit-learn model
# Save this as app.py and run `flask run` (or python app.py)
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load('model.joblib')  # save model earlier with joblib.dump

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()  # expect {"features": [f1, f2, ...]}
    X = np.array(data['features']).reshape(1, -1)
    pred = model.predict(X).tolist()
    proba = model.predict_proba(X).tolist() if hasattr(model, 'predict_proba') else None
    return jsonify({'prediction': pred, 'probability': proba})

if __name__ == '__main__':
    # debug=True is for local development only; turn it off before exposing the app
    app.run(debug=True, host='0.0.0.0', port=5000)
Practice:
Train a simple model (e.g., RandomForest on Iris), save it with joblib.dump, then create this Flask app and test POST requests using curl or Postman.
1.2 Option B — FastAPI (better for async & docs)
# Example 2 — FastAPI example (automatic docs with /docs)
# Save as main.py and run: uvicorn main:app --reload
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

class InFeatures(BaseModel):
    features: list[float]  # request body: {"features": [f1, f2, ...]}

app = FastAPI()
model = joblib.load('model.joblib')

@app.post('/predict')
def predict(payload: InFeatures):
    X = np.array(payload.features).reshape(1, -1)
    pred = model.predict(X).tolist()
    return {'prediction': pred}
Practice:
Run FastAPI and open http://localhost:8000/docs to see the interactive docs. Try sending sample JSON.
1.3 Containerize with Docker
# Example 3 — sample Dockerfile for Flask app
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
Practice:
Build the Docker image with `docker build -t mymodel .` and run it with `docker run -p 5000:5000 mymodel`. Test the endpoint locally.
1.4 Option C — Serverless / Cloud Notes
1.5 Example: Save & Load model (joblib)
# Example 5 — train, save, and load
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import joblib
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)
joblib.dump(clf, 'model.joblib')
clf2 = joblib.load('model.joblib')
print("Loaded model predicts:", clf2.predict(X[:2]))
Practice:
Train a model and serve it locally using any of the above methods. Note latency and throughput for a few requests.
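One way to note latency and throughput is to time a batch of calls with the standard library. The sketch below is a minimal illustration: fake_predict is a hypothetical stand-in for a real model call or HTTP request, and the sample features are made up.

```python
import statistics
import time

def fake_predict(features):
    """Hypothetical stand-in for model.predict or an HTTP call; swap in the real thing."""
    return sum(features) > 10

def measure(predict_fn, sample_requests):
    """Return per-request latencies (seconds) and overall throughput (requests/s)."""
    latencies = []
    start = time.perf_counter()
    for features in sample_requests:
        t0 = time.perf_counter()
        predict_fn(features)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return latencies, len(sample_requests) / elapsed

sample_requests = [[5.1, 3.5, 1.4, 0.2]] * 200
latencies, throughput = measure(fake_predict, sample_requests)
p95 = statistics.quantiles(latencies, n=20)[-1]  # last of 19 cut points = 95th percentile
print(f"median latency: {statistics.median(latencies) * 1e6:.1f} us")
print(f"p95 latency:    {p95 * 1e6:.1f} us")
print(f"throughput:     {throughput:.0f} requests/s")
```

Reporting a percentile alongside the median is a deliberate choice: a few slow requests can hide behind a good average.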
2) Model Monitoring & CI/CD
2.1 What to monitor?
- Latency & request errors
- Prediction distribution drift (population stats)
- Data quality (missing features, out-of-bound values)
- Model performance on labelled subset (if available)
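The data-quality item above can be sketched as a small per-request validator. The feature names and allowed ranges here are assumptions for illustration; a real schema would come from your training data.

```python
# Hypothetical schema: feature name -> (min, max) allowed range.
EXPECTED_RANGES = {"age": (0, 120), "income": (0, 1_000_000)}

def check_record(record, expected_ranges=EXPECTED_RANGES):
    """Return a list of data-quality problems found in one request."""
    problems = []
    for name, (low, high) in expected_ranges.items():
        value = record.get(name)
        if value is None:
            problems.append(f"missing:{name}")
        elif not low <= value <= high:
            problems.append(f"out_of_range:{name}")
    return problems

print(check_record({"age": 35, "income": 54000}))  # []
print(check_record({"age": 150}))  # ['out_of_range:age', 'missing:income']
```

Counting these problem tags over time gives a simple data-quality metric to chart next to latency and error rate.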
2.2 Quick code: log predictions to file for offline monitoring
# Example 1 — simple logging of requests for later analysis
import json
from datetime import datetime, timezone

def log_prediction(request_id, features, prediction, proba=None):
    record = {
        'ts': datetime.now(timezone.utc).isoformat(),
        'request_id': request_id,
        'features': features,
        'prediction': prediction,
        'probability': proba
    }
    # append one JSON object per line (JSON Lines) so pandas can read it back
    with open('predictions.log', 'a') as f:
        f.write(json.dumps(record) + '\n')
Practice:
Run a few predictions, then parse predictions.log with pandas and look at the distribution of predictions and input features over time.
2.3 Simple drift detection (feature mean change)
# Example 2 — naive drift detection by comparing means
import pandas as pd
# predictions.log is in JSON Lines format (one record per line)
# baseline_stats is a dict saved during offline validation
baseline_stats = {'age_mean': 35.2, 'income_mean': 54000}
df = pd.read_json('predictions.log', lines=True)
# assumes age is the first entry in each features list
current_age_mean = df['features'].apply(lambda x: x[0]).mean()
if abs(current_age_mean - baseline_stats['age_mean']) > 5:
    print("Age mean drift detected")
Practice:
Create a baseline using holdout or training data, then simulate new requests with a shifted feature and observe drift detection.
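The simulation described in this practice could look like the sketch below. It uses simulated ages rather than real logs, and measures the mean shift in baseline standard deviations instead of a fixed absolute gap. The 0.5 threshold is an assumed alert level, not a recommended value.

```python
import random
import statistics

random.seed(0)

# Baseline: values seen during offline validation (simulated here).
baseline = [random.gauss(35, 8) for _ in range(1000)]
base_mean = statistics.mean(baseline)
base_std = statistics.stdev(baseline)

def mean_shift(current, base_mean, base_std):
    """Shift of the current mean, measured in baseline standard deviations."""
    return abs(statistics.mean(current) - base_mean) / base_std

same = [random.gauss(35, 8) for _ in range(500)]     # no drift
shifted = [random.gauss(43, 8) for _ in range(500)]  # mean moved by +8

THRESHOLD = 0.5  # assumed alert level; tune for your data
for name, batch in [("same", same), ("shifted", shifted)]:
    score = mean_shift(batch, base_mean, base_std)
    print(name, round(score, 2), "DRIFT" if score > THRESHOLD else "ok")
```

Scaling by the baseline standard deviation lets one threshold work for features with very different units, such as age and income.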
2.4 CI/CD for ML pipelines (concept)
2.5 Quick example: GitHub Actions snippet (sketch)
# .github/workflows/ci.yml (sketch)
name: ml-ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'  # quoted so YAML does not read it as 3.1
      - name: Install
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest
Practice:
Add a small unit test for your data preprocessing and run it on push using GitHub Actions (free for public repos).
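A unit test for preprocessing can be as small as the sketch below. The file name and the fill_missing_with_median helper are hypothetical; in a real project the helper would live in your own module and the test file would import it. pytest collects any function whose name starts with test_.

```python
# test_preprocessing.py (hypothetical file name)
import statistics

def fill_missing_with_median(values):
    """Replace None entries with the median of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values to compute a median from")
    median = statistics.median(present)
    return [median if v is None else v for v in values]

def test_fills_missing_with_median():
    assert fill_missing_with_median([1, None, 3]) == [1, 2.0, 3]

def test_leaves_complete_data_unchanged():
    assert fill_missing_with_median([4, 5, 6]) == [4, 5, 6]

def test_rejects_all_missing():
    try:
        fill_missing_with_median([None, None])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Running pytest in the repository root picks this file up, and the workflow's "Run tests" step does the same on every push.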
3) Time Series Forecasting
3.1 Classical approach: ARIMA (statsmodels)
# Example 1 — ARIMA using statsmodels (install statsmodels)
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# sample data: daily series in df['y']
# df = pd.read_csv('daily.csv', parse_dates=['ds'], index_col='ds')
# For the demo, create a synthetic series: linear trend plus a sine wave
rng = pd.date_range('2021-01-01', periods=200)
ts = pd.Series(10 + 0.1 * np.arange(200) + 2 * np.sin(np.linspace(0, 20, 200)), index=rng)
model = ARIMA(ts, order=(2,1,2))
res = model.fit()
print(res.summary())
print("Forecast next 5:", res.forecast(5))
Practice:
Fit ARIMA on a time series and plot residuals to check assumptions (white noise residuals desirable).
3.2 Prophet (Facebook/Meta Prophet)
# Example 2 — Prophet (install prophet package)
from prophet import Prophet
df = pd.DataFrame({'ds': rng, 'y': ts})
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[['ds','yhat','yhat_lower','yhat_upper']].tail())
Practice:
Try Prophet with holidays or weekly seasonality settings and observe changes in forecasts.
3.3 ML with lag features
# Example 3 — use lags + RandomForest
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
# create lag features (ts is the series from Example 1)
df_ts = pd.DataFrame({'y': ts})
for lag in range(1, 8):
    df_ts[f'lag_{lag}'] = df_ts['y'].shift(lag)
df_ts.dropna(inplace=True)
X = df_ts.drop('y', axis=1).values
y = df_ts['y'].values
tscv = TimeSeriesSplit(n_splits=5)
rf = RandomForestRegressor(n_estimators=50)
# cross-validate with time-ordered splits
print('MAE (cv):', -cross_val_score(rf, X, y, cv=tscv, scoring='neg_mean_absolute_error').mean())
Practice:
Create rolling-window features like moving averages and include them as predictors. Compare model error.
3.4 Anomaly detection in time series
# Example 4 — simple z-score anomaly detection
rolling_mean = ts.rolling(7).mean()
rolling_std = ts.rolling(7).std()
zscore = (ts - rolling_mean) / rolling_std
anomalies = ts[np.abs(zscore) > 2]
print('Anomalies found:', anomalies.head())
Practice:
Simulate spikes in the series and check detection sensitivity by changing z-score threshold.
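A dependency-free sketch of this experiment follows. It injects three spikes into a synthetic series (the positions and the spike size are arbitrary assumptions) and counts detections at several thresholds. One point worth knowing: when the 7-point window includes the point being scored, as ts.rolling(7) does, |z| can never exceed 6/√7 ≈ 2.27, so a threshold of 2.5 or more never fires.

```python
import math
import random
import statistics

random.seed(1)
WINDOW = 7

# Synthetic series: sine wave plus small noise, with three injected spikes.
series = [10 + 2 * math.sin(i / 10) + random.gauss(0, 0.1) for i in range(200)]
spike_positions = [50, 120, 180]
for pos in spike_positions:
    series[pos] += 8

def rolling_zscores(values, window):
    """z-score of each point against the trailing window that ends at it."""
    scores = [None] * len(values)
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        std = statistics.stdev(chunk)  # sample std, like the pandas default
        scores[i] = (values[i] - statistics.mean(chunk)) / std if std else 0.0
    return scores

scores = rolling_zscores(series, WINDOW)
for threshold in (1.5, 2.0, 2.5):
    flagged = [i for i, z in enumerate(scores) if z is not None and abs(z) > threshold]
    caught = sum(pos in flagged for pos in spike_positions)
    print(f"threshold={threshold}: {len(flagged)} flagged, {caught}/3 spikes caught")
```

To use stricter thresholds, either widen the window or compute the mean and standard deviation from the points before the current one only.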
3.5 Metrics for forecast evaluation
# Example 5 — compute MAE, MAPE for forecast
import numpy as np
from sklearn.metrics import mean_absolute_error

def mape(y_true, y_pred):
    # mean absolute percentage error; undefined when y_true contains zeros
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
y_true = np.array([100,120,130])
y_pred = np.array([90,115,140])
print('MAE:', mean_absolute_error(y_true,y_pred))
print('MAPE:', mape(y_true,y_pred))
Practice:
Use rolling-origin evaluation (walk-forward) when evaluating forecasts for realistic results.
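Rolling-origin evaluation can be sketched in a few lines: at each step the model sees only the history before time t, forecasts one step ahead, and the window then grows. The two forecast functions below are simple baselines chosen for illustration, and the series is made-up data.

```python
def walk_forward_mae(series, initial_train, forecast_fn):
    """Rolling-origin evaluation: one-step-ahead forecasts on a growing window."""
    errors = []
    for t in range(initial_train, len(series)):
        history = series[:t]  # only data available before time t
        prediction = forecast_fn(history)
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)

def naive_last_value(history):
    """Baseline forecast: repeat the most recent observation."""
    return history[-1]

def moving_average(history, window=3):
    """Baseline forecast: mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

series = [100, 102, 101, 105, 107, 110, 108, 112, 115, 117]
print("naive MAE:", walk_forward_mae(series, 5, naive_last_value))  # 2.8
print("moving-average MAE:", round(walk_forward_mae(series, 5, moving_average), 3))  # 4.067
```

Any model with a fit-then-predict step can replace forecast_fn. The key property is that no observation at or after time t leaks into the forecast for t.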
4) Unsupervised Learning — PCA & Clustering
4.1 PCA — Principal Component Analysis
# Example 1 — PCA on iris features
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
print('Explained variance ratio:', pca.explained_variance_ratio_)
plt.scatter(Xp[:,0], Xp[:,1], c=y)
plt.title('PCA 2D')
plt.show()
Practice:
Check how many components are required to capture 90% of the variance using the cumulative sum of pca.explained_variance_ratio_ (fit PCA without n_components to see all components).
4.2 k-Means Clustering
# Example 2 — k-means clustering and elbow method
from sklearn.cluster import KMeans
inertia = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertia.append(km.inertia_)
print('Inertia:', inertia)
# pick k by elbow and plot clusters
km = KMeans(n_clusters=3, random_state=0).fit(X)
labels = km.labels_
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title('k-means clusters (first two features)')
plt.show()
Practice:
Try silhouette scores for different k to select the cluster count (sklearn.metrics.silhouette_score).
4.3 DBSCAN (density-based)
# Example 3 — DBSCAN for clusters of varying shape
from sklearn.cluster import DBSCAN
import numpy as np
# create sample with noise (use sklearn.datasets.make_moons or make_blobs)
from sklearn.datasets import make_moons
Xm, _ = make_moons(n_samples=300, noise=0.05)
db = DBSCAN(eps=0.2, min_samples=5).fit(Xm)
plt.scatter(Xm[:,0], Xm[:,1], c=db.labels_)
plt.title('DBSCAN Clusters')
plt.show()
Practice:
Change eps and min_samples to see dense vs sparse cluster assignments and noise (-1 label).
4.4 Hierarchical Clustering
# Example 4 — hierarchical dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X[:50], method='ward')
dendrogram(Z)
plt.title('Dendrogram (first 50 samples)')
plt.show()
Practice:
Cut dendrogram at different heights and inspect cluster membership.
4.5 Use unsupervised learning for feature engineering
# Example 5 — use cluster label as feature
km = KMeans(n_clusters=4).fit(X)
cluster_feat = km.predict(X).reshape(-1,1)
# append cluster_feat to original dataset for downstream supervised model
import numpy as np
X_new = np.hstack([X, cluster_feat])
print('New feature shape:', X_new.shape)
Practice:
Train a supervised model with and without cluster feature and compare performance.
5) Recommender Systems
5.1 Popularity-based baseline
# Example 1 — Top-N popular items
import pandas as pd
# sample ratings
ratings = pd.DataFrame({'user':[1,2,1,3,2],'item':[10,10,11,11,12],'rating':[5,4,5,4,3]})
top_items = ratings.groupby('item')['rating'].count().sort_values(ascending=False)
print('Top items by count:\n', top_items)
Practice:
Implement top-N by average rating and compare to count-based ranking.
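A dependency-free sketch of ranking by average rating is shown below, using the same toy ratings as plain tuples. The min_count parameter is an added assumption: it guards against an item with a single 5-star rating topping the list.

```python
from collections import defaultdict

# Same toy ratings as the example above: (user, item, rating).
rating_rows = [(1, 10, 5), (2, 10, 4), (1, 11, 5), (3, 11, 4), (2, 12, 3)]

def top_n_by_average(rows, n, min_count=1):
    """Rank items by mean rating, ignoring items with fewer than min_count ratings."""
    by_item = defaultdict(list)
    for _user, item, rating in rows:
        by_item[item].append(rating)
    averages = {
        item: sum(vals) / len(vals)
        for item, vals in by_item.items()
        if len(vals) >= min_count
    }
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)[:n]

print(top_n_by_average(rating_rows, n=3))               # [(10, 4.5), (11, 4.5), (12, 3.0)]
print(top_n_by_average(rating_rows, n=3, min_count=2))  # [(10, 4.5), (11, 4.5)]
```

On this tiny dataset the count-based and average-based rankings agree. They usually diverge on real data, where popular items collect many middling ratings.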
5.2 Collaborative Filtering — user-based KNN
# Example 2 — user-user collaborative filtering (sketch)
from sklearn.metrics.pairwise import cosine_similarity
user_item = ratings.pivot_table(index='user', columns='item', values='rating').fillna(0)
sim = cosine_similarity(user_item)
sim_df = pd.DataFrame(sim, index=user_item.index, columns=user_item.index)
print(sim_df)
# recommend items liked by similar users but not by target user
Practice:
Pick a target user and recommend top items using weighted scores from neighbors.
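The neighbor-weighted scoring in this practice could be sketched as follows, in plain Python on the same toy ratings. The predicted score for an unseen item is the similarity-weighted average of the neighbors' ratings. This is one common variant, not the only way to combine neighbors.

```python
import math

# Toy ratings (same data as above) as {user: {item: rating}}.
ratings_by_user = {1: {10: 5, 11: 5}, 2: {10: 4, 12: 3}, 3: {11: 4}}

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (missing = 0)."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(target, ratings, top_n=2):
    """Score items the target has not rated by similarity-weighted neighbor ratings."""
    scores, weights = {}, {}
    for other, their_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], their_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, rating in their_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = {item: scores[item] / weights[item] for item in scores}
    return sorted(ranked.items(), key=lambda pair: pair[1], reverse=True)[:top_n]

print(recommend(3, ratings_by_user))  # item 10, via user 1
print(recommend(1, ratings_by_user))  # item 12, via user 2
```

User 3 shares item 11 with user 1 only, so user 1's other item (10) is the sole candidate. Users with no overlapping items contribute nothing.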
5.3 Matrix Factorization (SVD)
# Example 3 — SVD using sklearn TruncatedSVD on user-item matrix
from sklearn.decomposition import TruncatedSVD
U = user_item.values
svd = TruncatedSVD(n_components=2)
latent = svd.fit_transform(U)
reconstructed = svd.inverse_transform(latent)
print('Reconstructed matrix shape:', reconstructed.shape)
Practice:
Compute predicted score for missing entries and recommend highest predicted items for a user.
5.4 Content-based recommendations
# Example 4 — content-based using TF-IDF for item descriptions
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["action sci-fi", "romantic drama", "sci-fi adventure", "action thriller"]
tfidf = TfidfVectorizer().fit_transform(docs)
from sklearn.metrics.pairwise import linear_kernel
cos_sim = linear_kernel(tfidf[0:1], tfidf).flatten()
print(cos_sim)
Practice:
For a given item description, recommend similar items using cosine similarity on TF-IDF vectors.
5.5 Hybrid & evaluation metrics for recommenders
# Example 5 — precision@k (sketch)
def precision_at_k(true_items, pred_items, k):
    return len(set(true_items) & set(pred_items[:k])) / k
print('Precision@3:', precision_at_k([10,11], [11,10,12], 3))
Practice:
Implement precision@k for your recommender and report for a test set.
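Reporting on a test set usually means averaging the metric over users. The sketch below does that for precision@k and adds recall@k for comparison. The held-out items and recommendation lists are hypothetical data.

```python
def precision_at_k(true_items, pred_items, k):
    """Fraction of the top-k predictions that are relevant."""
    return len(set(true_items) & set(pred_items[:k])) / k

def recall_at_k(true_items, pred_items, k):
    """Fraction of the relevant items that appear in the top-k predictions."""
    if not true_items:
        return 0.0
    return len(set(true_items) & set(pred_items[:k])) / len(set(true_items))

# Hypothetical test set: held-out items and ranked recommendations per user.
held_out = {"u1": [10, 11], "u2": [12], "u3": [10]}
recommended = {"u1": [11, 10, 12], "u2": [10, 11, 12], "u3": [11, 12, 13]}

k = 2
precisions = [precision_at_k(held_out[u], recommended[u], k) for u in held_out]
recalls = [recall_at_k(held_out[u], recommended[u], k) for u in held_out]
mean_precision = sum(precisions) / len(precisions)
mean_recall = sum(recalls) / len(recalls)
print("mean precision@2:", round(mean_precision, 3))  # 0.333
print("mean recall@2:", round(mean_recall, 3))        # 0.333
```

Only u1 gets any hits in the top 2, so both averages come out to one third.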
6) Model Interpretability & Fairness
6.1 Feature importance & coefficients
# Example 1 — tree feature importances
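# X, y are the iris features and labels loaded in Section 4.1
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=20).fit(X, y)
print('Feature importances:', rf.feature_importances_)
# Linear model coefficients
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000).fit(X, y)  # higher max_iter avoids a convergence warning
print('Coefficients:', lr.coef_)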
Practice:
Use feature importances to remove weak features and retrain. Observe effect on performance.
6.2 SHAP (sketch)
# Example 2 — SHAP quick sketch (install shap)
# import shap
# explainer = shap.TreeExplainer(rf)
# shap_values = explainer.shap_values(X[:50])
# shap.summary_plot(shap_values, X[:50])
print("SHAP example requires shap installation; use on local machine/Colab.")
Practice:
Install shap and run summary and force plots to explain specific predictions.
6.3 Fairness checks — group metrics
# Example 3 — check accuracy across groups
import numpy as np
# assume df has 'sensitive' column (e.g., gender) and predictions in pred
# simulate
sensitive = np.random.choice(['A','B'], size=len(y))
pred = rf.predict(X)
from sklearn.metrics import accuracy_score
acc_A = accuracy_score(y[sensitive=='A'], pred[sensitive=='A'])
acc_B = accuracy_score(y[sensitive=='B'], pred[sensitive=='B'])
print('Acc A:', acc_A, 'Acc B:', acc_B)
Practice:
If performance differs substantially between groups, consider reweighing, collecting more data, or fairness-aware algorithms (a quick literature search is a good first step).
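Accuracy is one group metric. Another common one is the selection rate, the share of positive predictions per group. The gap between groups is often called the demographic parity difference. The sketch below computes both on hypothetical binary labels and predictions.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group accuracy and selection rate (share of positive predictions)."""
    stats = {}
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        positive = sum(y_pred[i] == 1 for i in idx)
        stats[g] = {
            "accuracy": correct / len(idx),
            "selection_rate": positive / len(idx),
        }
    return stats

# Hypothetical binary labels, predictions, and group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

stats = group_rates(y_true, y_pred, groups)
gap = abs(stats["A"]["selection_rate"] - stats["B"]["selection_rate"])
print(stats)
print("demographic parity difference:", gap)  # 0.25
```

Here group A has accuracy 0.75 and selection rate 0.5, while group B has 0.5 and 0.25. Which gap matters more depends on the application.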
6.4 Counterfactual explanations (concept)
6.5 Model cards & documentation
7) End-to-End Mini Project & Practice
7.1 Dataset & Problem
7.2 Full pipeline (runnable template)
# Mini project pipeline template — place customer_churn.csv in working dir
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import joblib
# Load
df = pd.read_csv('customer_churn.csv')
df = df.dropna(subset=['churn'])
df['age'] = df['age'].fillna(df['age'].median())
num_cols = ['age','balance']
cat_cols = ['gender','region']
# Train-test split
X = df.drop(['customer_id','churn'], axis=1)
y = df['churn'].map({'No':0, 'Yes':1}) if df['churn'].dtype=='object' else df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# Preprocessor and pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # sparse_output replaced the `sparse` argument in scikit-learn 1.2
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)
])
pipe = Pipeline([('pre', preprocessor), ('clf', RandomForestClassifier(random_state=42))])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print(classification_report(y_test, pred))
print('ROC-AUC:', roc_auc_score(y_test, pipe.predict_proba(X_test)[:,1]))
# Save model artifact
joblib.dump(pipe, 'churn_pipeline.joblib')
print("Saved pipeline as churn_pipeline.joblib")
# Quick local deploy: test curl (see Flask/FastAPI snippets earlier)
Practice:
Run this template end-to-end. After saving the pipeline, create a simple FastAPI endpoint that loads churn_pipeline.joblib and returns a prediction for a sample input.
7.3 Additional experiments (5 ideas)
- Try LightGBM/XGBoost instead of RandomForest and compare ROC-AUC.
- Add interaction features (age * balance) and re-evaluate.
- Implement K-fold target encoding for high-cardinality categories and compare leakage-aware scores.
- Build a simple dashboard (Streamlit) that sends sample inputs to your model and shows prediction + SHAP explanation.
- Simulate production traffic and log predictions — then run drift detection code from Section 2.
Resources & Next Steps
- Scikit-learn docs: model selection, pipelines, decomposition, clustering.
- Statsmodels and Prophet: time-series tools.
- Docker docs and cloud provider tutorials (AWS, GCP, Azure) for deployment.
- SHAP and LIME for interpretability.
- MLflow for model registry, tracking, and simple deployment helpers.