Python for Machine Learning & Data Science — Day 6 Full Guide & Practice (Unsupervised Learning, Clustering, Dimensionality Reduction)

Topic-packed: MLOps basics, deployment, monitoring, time series, unsupervised learning, recommenders, interpretability, and fairness, with short explanations and runnable examples.

1) MLOps & Model Deployment

Explanation: Once your model performs well, the next step is to put it into production, that is, to make it available to users or services. This process is called deployment. Deployment combined with monitoring and repeatable pipelines is what is called MLOps. Here we look at a few simple options: a Flask API, FastAPI, Docker, and a quick look at serverless and cloud deployment concepts.

1.1 Option A — Simple Flask API (local)

Why: Flask is minimal and good for learning. This approach works well for local testing before containerization.
# Example 1 — simple Flask app serving a scikit-learn model
# Save this as app.py and run `flask run` (or python app.py)
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load('model.joblib')  # save model earlier with joblib.dump

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()  # expect {"features": [f1, f2, ...]}
    X = np.array(data['features']).reshape(1, -1)
    pred = model.predict(X).tolist()
    proba = model.predict_proba(X).tolist() if hasattr(model,'predict_proba') else None
    return jsonify({'prediction': pred, 'probability': proba})

if __name__ == '__main__':
    # debug=True is for local development only; turn it off before exposing the app
    app.run(debug=True, host='0.0.0.0', port=5000)

Practice:

Train a simple model (e.g., RandomForest on Iris), save with joblib.dump, then create this Flask app and test POST requests using curl or Postman.

1.2 Option B — FastAPI (better for async & docs)

# Example 2 — FastAPI example (automatic docs with /docs)
# Save as main.py and run: uvicorn main:app --reload
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

class InFeatures(BaseModel):
    features: list

app = FastAPI()
model = joblib.load('model.joblib')

@app.post('/predict')
def predict(payload: InFeatures):
    X = np.array(payload.features).reshape(1, -1)
    pred = model.predict(X).tolist()
    return {'prediction': pred}

Practice:

Run FastAPI and open http://localhost:8000/docs to see interactive docs. Try sending sample JSON.

1.3 Containerize with Docker

# Example 3 — sample Dockerfile for Flask app
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]

Practice:

Build & run Docker image: docker build -t mymodel . then docker run -p 5000:5000 mymodel. Test endpoint locally.

1.4 Option C — Serverless / Cloud Notes

Short: AWS Lambda (with API Gateway), GCP Cloud Functions, or Azure Functions can serve small models. Use when scale or cost model suits serverless. Bigger models: deploy to managed services (SageMaker, Vertex AI).

1.5 Example: Save & Load model (joblib)

# Example 5 — train, save, and load
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import joblib
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)
joblib.dump(clf, 'model.joblib')
clf2 = joblib.load('model.joblib')
print("Loaded model predicts:", clf2.predict(X[:2]))

Practice:

Train a model and serve it locally using any of the above methods. Note latency and throughput for a few requests.

2) Model Monitoring & CI/CD

Explanation: After putting a model into production, you need to monitor it: latency, error rates, prediction distribution, data drift, and model drift. Also adopt CI/CD for models and pipelines so that deployments are repeatable and testable.

2.1 What to monitor?

  • Latency & request errors
  • Prediction distribution drift (population stats)
  • Data quality (missing features, out-of-bound values)
  • Model performance on labelled subset (if available)
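The data-quality item in the list above can be sketched as a small per-request check. This is only an illustration: the feature names and the `BOUNDS` values are hypothetical and would normally be derived from your training data.

```python
import math

# Hypothetical per-feature bounds (assumption: derived offline from training data)
BOUNDS = {"age": (18, 100), "income": (0, 1_000_000)}

def check_quality(row, bounds=BOUNDS):
    """Return a list of data-quality issues found in one request payload."""
    issues = []
    for name, (low, high) in bounds.items():
        value = row.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            issues.append(f"{name}: missing")
        elif not low <= value <= high:
            issues.append(f"{name}: {value} outside [{low}, {high}]")
    return issues

good = check_quality({"age": 35, "income": 54000})
bad = check_quality({"age": 140, "income": None})
print(good)  # no issues
print(bad)   # one out-of-bound value, one missing value
```

Counting how often each issue appears per hour or per day gives a simple data-quality signal to put next to latency and error rates.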

2.2 Quick code: log predictions to file for offline monitoring

# Example 1 — simple logging of requests for later analysis
import json
from datetime import datetime, timezone

def log_prediction(request_id, features, prediction, proba=None):
    record = {
        'ts': datetime.now(timezone.utc).isoformat(),  # timezone-aware UTC timestamp
        'request_id': request_id,
        'features': features,
        'prediction': prediction,
        'probability': proba
    }
    # append one JSON object per line (JSON Lines) so pandas can read the file back
    with open('predictions.log', 'a') as f:
        f.write(json.dumps(record) + '\n')

Practice:

Run a few predictions, then parse predictions.log with pandas and look at distribution of predictions and input features over time.
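One possible way to do this parsing is sketched below. It assumes the log is in JSON Lines form with the same keys that log_prediction writes; the sample rows and the feature names `age` and `income` are made up for the demo, and an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

# Made-up log lines in the shape written by log_prediction (one JSON object per line)
sample_log = io.StringIO(
    '{"ts": "2024-01-01T10:00:00+00:00", "request_id": "r1", "features": [34, 52000], "prediction": [0], "probability": null}\n'
    '{"ts": "2024-01-01T11:00:00+00:00", "request_id": "r2", "features": [36, 58000], "prediction": [0], "probability": null}\n'
    '{"ts": "2024-01-02T09:00:00+00:00", "request_id": "r3", "features": [50, 61000], "prediction": [1], "probability": null}\n'
    '{"ts": "2024-01-02T12:00:00+00:00", "request_id": "r4", "features": [52, 60000], "prediction": [0], "probability": null}\n'
)
log_df = pd.read_json(sample_log, lines=True)
log_df["ts"] = pd.to_datetime(log_df["ts"], utc=True)

# Each prediction is stored as a one-element list, so unwrap it
log_df["pred"] = log_df["prediction"].apply(lambda p: p[0])
pred_share = log_df["pred"].value_counts(normalize=True)

# Expand the feature lists into named columns (names are assumptions)
feature_df = pd.DataFrame(log_df["features"].tolist(), columns=["age", "income"])
daily_age = feature_df["age"].groupby(log_df["ts"].dt.date).mean()

print(pred_share)
print(daily_age)
```

For a real log, replace `sample_log` with the path `'predictions.log'`.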

2.3 Simple drift detection (feature mean change)

# Example 2 — naive drift detection by comparing means
import pandas as pd

# assume predictions.log converted to CSV or JSON lines as df
# baseline_stats is a dict saved during offline validation
baseline_stats = {'age_mean': 35.2, 'income_mean': 54000}
df = pd.read_json('predictions.log', lines=True)
current_age_mean = df['features'].apply(lambda x: x[0]).mean()
if abs(current_age_mean - baseline_stats['age_mean']) > 5:
    print("Age mean drift detected")

Practice:

Create a baseline using holdout or training data, then simulate new requests with a shifted feature and observe drift detection.
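A minimal simulation of this exercise might look as follows. The distributions, sample sizes, and the threshold of 5 are assumptions chosen to mirror the naive mean comparison above; it is not a statistical test.

```python
import numpy as np

gen = np.random.default_rng(0)
baseline_age = gen.normal(35, 8, size=1000)  # stands in for training/holdout data
same_age = gen.normal(35, 8, size=500)       # new traffic, same distribution
shifted_age = gen.normal(42, 8, size=500)    # new traffic, mean shifted by +7

def mean_drift(baseline, current, threshold=5.0):
    """Return (absolute difference in means, whether it exceeds the threshold)."""
    diff = abs(float(np.mean(current)) - float(np.mean(baseline)))
    return diff, diff > threshold

diff_same, drift_same = mean_drift(baseline_age, same_age)
diff_shifted, drift_shifted = mean_drift(baseline_age, shifted_age)
print("same distribution:", round(diff_same, 2), drift_same)
print("shifted distribution:", round(diff_shifted, 2), drift_shifted)
```

A fixed threshold in raw units is easy to read but depends on the feature's scale; dividing the difference by the baseline standard deviation is one common way to make it comparable across features.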

2.4 CI/CD for ML pipelines (concept)

Short: CI/CD pipeline for ML includes: data tests, model tests (unit tests + performance tests), container build, integration tests, and automated deployment. Tools: GitHub Actions, Jenkins, GitLab CI, MLflow, TFX.

2.5 Quick example: GitHub Actions snippet (sketch)

# .github/workflows/ci.yml (sketch)
name: ml-ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    - name: Install
      run: pip install -r requirements.txt
    - name: Run tests
      run: pytest

Practice:

Add a small unit test for your data preprocessing and run it on push using GitHub Actions (free for public repos).

3) Time Series Forecasting

Explanation: In time series data, observations are ordered by time. Common tasks are forecasting future values and anomaly detection. It is important to account for seasonality, trend, and autocorrelation. We'll cover ARIMA, Prophet (sketch), and simple ML approaches with lag features.

3.1 Classical approach: ARIMA (statsmodels)

# Example 1 — ARIMA using statsmodels (install statsmodels)
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# sample data: daily series in df['y']
# df = pd.read_csv('daily.csv', parse_dates=['ds'], index_col='ds')
# For the demo, create a synthetic series: linear trend + sine wave
rng = pd.date_range('2021-01-01', periods=200)
ts = pd.Series(10 + 0.1 * np.arange(200) + 2 * np.sin(np.linspace(0, 20, 200)), index=rng)
model = ARIMA(ts, order=(2,1,2))
res = model.fit()
print(res.summary())
print("Forecast next 5:", res.forecast(5))

Practice:

Fit ARIMA on a time series and plot residuals to check assumptions (white noise residuals desirable).
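Besides plotting, one rough numeric check for "white noise" is to look at the sample autocorrelation of the residuals. The sketch below uses only NumPy and two synthetic series (pure noise and an AR(1) process) as stand-ins; `sample_acf` is a helper written for this illustration, not a library function.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom for k in range(nlags + 1)])

gen = np.random.default_rng(0)
n = 500
white = gen.normal(size=n)            # what good residuals should resemble
ar1 = np.zeros(n)                     # residuals with leftover structure
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

bound = 1.96 / np.sqrt(n)             # approximate 95% band for white noise
acf_white = sample_acf(white, 10)
acf_ar1 = sample_acf(ar1, 10)
print("band: +/-", round(bound, 3))
print("white noise lags 1-3:", np.round(acf_white[1:4], 3))
print("AR(1) lags 1-3:", np.round(acf_ar1[1:4], 3))
```

To apply it to the ARIMA fit, pass `res.resid` in place of the synthetic arrays. Autocorrelations far outside the band suggest the model has not captured all the structure.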

3.2 Prophet (Facebook/Meta Prophet)

# Example 2 — Prophet (install prophet package)
from prophet import Prophet
df = pd.DataFrame({'ds': rng, 'y': ts.values})  # Prophet expects columns named 'ds' and 'y'
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[['ds','yhat','yhat_lower','yhat_upper']].tail())

Practice:

Try Prophet with holidays or weekly seasonality settings and observe changes in forecasts.

3.3 ML with lag features

# Example 3 — use lags + RandomForest
import pandas as pd, numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
# create lag features: lag_k holds the value from k steps earlier
df_ts = pd.DataFrame({'y': ts})
for lag in range(1, 8):
    df_ts[f'lag_{lag}'] = df_ts['y'].shift(lag)
df_ts.dropna(inplace=True)
X = df_ts.drop('y', axis=1).values
y = df_ts['y'].values
tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past and tests on the future
rf = RandomForestRegressor(n_estimators=50, random_state=0)
# cross-validate (scores are negative MAE, so flip the sign)
print('MAE (cv):', -cross_val_score(rf, X, y, cv=tscv, scoring='neg_mean_absolute_error').mean())

Practice:

Create rolling-window features like moving averages and include them as predictors. Compare model error.
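A small sketch of such rolling features is shown below on a toy series. The column names `ma_3` and `std_3` are just illustrative. The key detail is shifting by one step before rolling, so each row only uses values that were known before the target.

```python
import numpy as np
import pandas as pd

y = pd.Series(np.arange(1.0, 11.0))   # toy series 1, 2, ..., 10
feats = pd.DataFrame({"y": y})

past = feats["y"].shift(1)            # shift first: row t only sees values before t
feats["ma_3"] = past.rolling(3).mean()
feats["std_3"] = past.rolling(3).std()
feats = feats.dropna()

print(feats.head())
```

Without the shift, the moving average would include the value being predicted, which leaks the target into the features and makes the error look better than it is.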

3.4 Anomaly detection in time series

# Example 4 — simple z-score anomaly detection
rolling_mean = ts.rolling(7).mean()
rolling_std = ts.rolling(7).std()
zscore = (ts - rolling_mean) / rolling_std
anomalies = ts[np.abs(zscore) > 2]
print('Anomalies found:', anomalies.head())

Practice:

Simulate spikes in the series and check detection sensitivity by changing z-score threshold.
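The sketch below injects one spike into a synthetic series and counts flagged points at several thresholds. It deliberately differs from the example above in one respect: the rolling statistics are computed on the previous 7 points (via `shift(1)`), which is an assumption made here so that a spike cannot inflate its own mean and standard deviation. When the window includes the point itself, |z| with a window of 7 can never exceed 6/√7 ≈ 2.27, so thresholds of 3 or more would flag nothing.

```python
import numpy as np
import pandas as pd

gen = np.random.default_rng(0)
values = 10 + 2 * np.sin(np.linspace(0, 20, 200)) + gen.normal(0, 0.1, 200)
series = pd.Series(values)
series.iloc[100] += 10                # injected spike at position 100

# Rolling stats from the 7 points before each observation
prev = series.shift(1)
z = (series - prev.rolling(7).mean()) / prev.rolling(7).std()

# Number of flagged points for each threshold
counts = {thr: int((z.abs() > thr).sum()) for thr in (2, 3, 4)}
flagged_at_3 = z[z.abs() > 3].index.tolist()

print("flag counts by threshold:", counts)
print("spike flagged at threshold 3:", 100 in flagged_at_3)
```

Lower thresholds catch more true spikes but also flag more ordinary noise, so the counts give a quick feel for the trade-off.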

3.5 Metrics for forecast evaluation

# Example 5 — compute MAE, MAPE for forecast
import numpy as np
from sklearn.metrics import mean_absolute_error
def mape(y_true, y_pred):
    # mean absolute percentage error; undefined when y_true contains zeros
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
y_true = np.array([100,120,130])
y_pred = np.array([90,115,140])
print('MAE:', mean_absolute_error(y_true,y_pred))
print('MAPE:', mape(y_true,y_pred))

Practice:

Use rolling-origin evaluation (walk-forward) when evaluating forecasts for realistic results.
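Rolling-origin evaluation can be sketched in a few lines: at each step, fit or forecast using only the data seen so far, predict one step ahead, and record the error. The helper `walk_forward_mae` and the two toy forecasters are illustrative names, not library functions.

```python
import numpy as np

def walk_forward_mae(series, min_train, forecast_fn):
    """One-step-ahead MAE with an expanding training window."""
    errors = []
    for t in range(min_train, len(series)):
        train = series[:t]             # everything observed before time t
        pred = forecast_fn(train)      # forecast for time t
        errors.append(abs(series[t] - pred))
    return float(np.mean(errors))

def naive_last(train):
    return train[-1]                   # repeat the last observed value

def mean_last_3(train):
    return float(np.mean(train[-3:]))  # average of the last three values

series = np.arange(20, dtype=float)    # toy series with a pure linear trend
mae_naive = walk_forward_mae(series, 5, naive_last)
mae_mean3 = walk_forward_mae(series, 5, mean_last_3)
print("naive MAE:", mae_naive)
print("3-point mean MAE:", mae_mean3)
```

On this trending toy series the naive forecast is always off by one step and the 3-point mean lags by two, which is why its error is larger. With a real model, `forecast_fn` would refit (or update) the model on `train` at each step.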

4) Unsupervised Learning: PCA & Clustering

Explanation: When there are no labels, unsupervised methods help explore structure: dimensionality reduction with PCA, and clustering with k-means, hierarchical clustering, or DBSCAN. Use cases include anomaly detection, customer segmentation, and visualization.

4.1 PCA — Principal Component Analysis

# Example 1 — PCA on iris features
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
print('Explained variance ratio:', pca.explained_variance_ratio_)
plt.scatter(Xp[:,0], Xp[:,1], c=y)
plt.title('PCA 2D')
plt.show()

Practice:

Check how many components are required to capture 90% of the variance using pca.explained_variance_ratio_.
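One way to answer this is to fit PCA with all components and take the cumulative sum of the explained variance ratios. The sketch below does this on the unscaled Iris features; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca_full = PCA().fit(X)                           # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# First position where the cumulative ratio reaches 0.90 (+1 for a count)
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print("Cumulative variance:", np.round(cumulative, 4))
print("Components for 90%:", n_components_90)
```

On unscaled Iris a single component already passes 90%, largely because petal length has the widest spread. Standardizing the features first (for example with StandardScaler) usually changes this picture, so it is worth trying both.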

4.2 k-Means Clustering

# Example 2 — k-means clustering and elbow method
from sklearn.cluster import KMeans
inertia = []
for k in range(1,8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertia.append(km.inertia_)
print('Inertia:', inertia)
# pick k by elbow and plot clusters
km = KMeans(n_clusters=3, random_state=0).fit(X)
labels = km.labels_
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title('k-means clusters (first two features)')
plt.show()

Practice:

Try silhouette scores for different k to select cluster count (sklearn.metrics.silhouette_score).
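A possible version of this comparison on Iris is sketched below. The range of k and the `n_init` and `random_state` settings are choices made for the demo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

scores = {}
for k in range(2, 7):                  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

On Iris the silhouette score tends to favour k=2 even though there are three species, because two of the species overlap heavily. It is a useful reminder that the score measures cluster separation, not agreement with known labels.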

4.3 DBSCAN (density-based)

# Example 3 — DBSCAN for clusters of varying shape
from sklearn.cluster import DBSCAN
import numpy as np
# create sample with noise (use sklearn.datasets.make_moons or make_blobs)
from sklearn.datasets import make_moons
Xm, _ = make_moons(n_samples=300, noise=0.05)
db = DBSCAN(eps=0.2, min_samples=5).fit(Xm)
plt.scatter(Xm[:,0], Xm[:,1], c=db.labels_)
plt.title('DBSCAN Clusters')
plt.show()

Practice:

Change eps and min_samples to see dense vs sparse cluster assignments and noise (-1 label).
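A quick way to run this experiment without plotting is to sweep eps and count clusters and noise points. The eps values and the fixed `random_state` below are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

Xm, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

results = []
for eps in (0.05, 0.1, 0.2, 0.3):
    labels = DBSCAN(eps=eps, min_samples=5).fit(Xm).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    results.append((eps, n_clusters, n_noise))
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

With min_samples held fixed, a larger eps can only turn noise points into cluster members, so the noise count never increases as eps grows. Too small an eps fragments the data into many tiny clusters plus noise; too large an eps merges everything.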

4.4 Hierarchical Clustering

# Example 4 — hierarchical dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X[:50], method='ward')
dendrogram(Z)
plt.title('Dendrogram (first 50 samples)')
plt.show()

Practice:

Cut dendrogram at different heights and inspect cluster membership.
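Cutting the tree can be done programmatically with SciPy's fcluster. The sketch below rebuilds the same linkage as the example above and cuts it at a few heights; the particular heights are arbitrary.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
Z = linkage(X[:50], method="ward")

# Cut at a given height: every merge above this distance is undone
cluster_counts = []
for height in (0.5, 1.0, 2.0):
    labels = fcluster(Z, t=height, criterion="distance")
    cluster_counts.append(len(set(labels)))
    print(f"height={height}: {len(set(labels))} clusters")

# Or ask directly for at most 3 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", {int(c): int((labels_3 == c).sum()) for c in set(labels_3)})
```

A higher cut gives fewer, larger clusters. The `maxclust` criterion is convenient when you know roughly how many groups you want and do not want to read the height off the plot.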

4.5 Use unsupervised learning for feature engineering

# Example 5 — use cluster label as feature
km = KMeans(n_clusters=4).fit(X)
cluster_feat = km.predict(X).reshape(-1,1)
# append cluster_feat to original dataset for downstream supervised model
import numpy as np
X_new = np.hstack([X, cluster_feat])
print('New feature shape:', X_new.shape)

Practice:

Train a supervised model with and without cluster feature and compare performance.

5) Recommender Systems

Explanation: Recommenders suggest items to users. Basic approaches: popularity-based, collaborative filtering (user-item matrix), content-based, and hybrid systems. We'll show simple matrix factorization and nearest-neighbour methods.

5.1 Popularity-based baseline

# Example 1 — Top-N popular items
import pandas as pd
# sample ratings
ratings = pd.DataFrame({'user':[1,2,1,3,2],'item':[10,10,11,11,12],'rating':[5,4,5,4,3]})
top_items = ratings.groupby('item')['rating'].count().sort_values(ascending=False)
print('Top items by count:\n', top_items)

Practice:

Implement top-N by average rating and compare to count-based ranking.
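A sketch of this comparison on the same toy ratings follows. The `min_ratings` cutoff is an added assumption, included because an average over one or two ratings is not very trustworthy.

```python
import pandas as pd

ratings = pd.DataFrame({
    "user": [1, 2, 1, 3, 2],
    "item": [10, 10, 11, 11, 12],
    "rating": [5, 4, 5, 4, 3],
})

# Count and mean rating per item in one pass
stats = ratings.groupby("item")["rating"].agg(["count", "mean"])
by_count = stats.sort_values("count", ascending=False)
by_mean = stats.sort_values("mean", ascending=False)

# Optional: only rank items with enough ratings (threshold is an assumption)
min_ratings = 2
reliable = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)

print(by_count)
print(by_mean)
print(reliable)
```

On real data the two rankings usually differ: count favours widely seen items, while the mean favours niche items with a few enthusiastic ratings.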

5.2 Collaborative Filtering — user-based KNN

# Example 2 — user-user collaborative filtering (sketch)
from sklearn.metrics.pairwise import cosine_similarity
user_item = ratings.pivot_table(index='user', columns='item', values='rating').fillna(0)
sim = cosine_similarity(user_item)
sim_df = pd.DataFrame(sim, index=user_item.index, columns=user_item.index)
print(sim_df)
# recommend items liked by similar users but not by target user
    

Practice:

Pick a target user and recommend top items using weighted scores from neighbors.
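The recommendation step could be sketched as below, scoring each item by a similarity-weighted average of the neighbours' ratings. The `recommend` helper is written for this illustration. Note that it follows the example above in treating 0 as "not rated", which is a simplification: unrated items pull the weighted average down.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame({
    "user": [1, 2, 1, 3, 2],
    "item": [10, 10, 11, 11, 12],
    "rating": [5, 4, 5, 4, 3],
})
user_item = ratings.pivot_table(index="user", columns="item", values="rating").fillna(0)
sim_df = pd.DataFrame(cosine_similarity(user_item),
                      index=user_item.index, columns=user_item.index)

def recommend(target, top_n=2):
    """Rank items the target has not rated by similarity-weighted neighbour ratings."""
    neighbours = sim_df[target].drop(target)              # similarity to every other user
    weighted = user_item.loc[neighbours.index].mul(neighbours, axis=0).sum()
    scores = weighted / max(neighbours.sum(), 1e-9)       # guard against all-zero similarity
    unseen = user_item.loc[target] == 0                   # 0 marks "not rated" here
    return scores[unseen].sort_values(ascending=False).head(top_n)

recs = recommend(3)
print(recs)
```

For user 3, the only neighbour with non-zero similarity is user 1 (both rated item 11), so item 10, which user 1 rated highly, comes out on top.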

5.3 Matrix Factorization (SVD)

# Example 3 — SVD using sklearn TruncatedSVD on user-item matrix
from sklearn.decomposition import TruncatedSVD
U = user_item.values
svd = TruncatedSVD(n_components=2)
latent = svd.fit_transform(U)
reconstructed = svd.inverse_transform(latent)
print('Reconstructed matrix shape:', reconstructed.shape)

Practice:

Compute predicted score for missing entries and recommend highest predicted items for a user.

5.4 Content-based recommendations

# Example 4 — content-based using TF-IDF for item descriptions
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["action sci-fi", "romantic drama", "sci-fi adventure", "action thriller"]
tfidf = TfidfVectorizer().fit_transform(docs)
from sklearn.metrics.pairwise import linear_kernel
cos_sim = linear_kernel(tfidf[0:1], tfidf).flatten()
print(cos_sim)

Practice:

For a given item description, recommend similar items using cosine similarity on TF-IDF vectors.
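A minimal version of this lookup is sketched below using the same four toy descriptions. The `similar_items` helper is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = ["action sci-fi", "romantic drama", "sci-fi adventure", "action thriller"]
tfidf = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalised by default

def similar_items(item_idx, top_n=2):
    """Return (index, similarity) of the items most similar to item_idx."""
    # On normalised TF-IDF rows, the linear kernel equals cosine similarity
    sims = linear_kernel(tfidf[item_idx:item_idx + 1], tfidf).flatten()
    order = sims.argsort()[::-1]
    return [(int(i), float(sims[i])) for i in order if i != item_idx][:top_n]

recs = similar_items(0)
for idx, score in recs:
    print(f"{docs[idx]!r}: {score:.3f}")
```

For "action sci-fi", the closest item is "sci-fi adventure" (two shared tokens, since the default tokenizer splits "sci-fi" into "sci" and "fi"), followed by "action thriller" (one shared token).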

5.5 Hybrid & evaluation metrics for recommenders

Explanation: A hybrid recommender combines collaborative and content-based signals. Evaluate with precision@k, recall@k, MAP, and NDCG. Offline evaluation needs careful train-test splitting (leave-one-out or time-based).
# Example 5 — precision@k (sketch)
def precision_at_k(true_items, pred_items, k):
    return len(set(true_items) & set(pred_items[:k])) / k
print('Precision@3:', precision_at_k([10,11], [11,10,12], 3))

Practice:

Implement precision@k for your recommender and report for a test set.
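Reporting for a test set usually means averaging the metric over users. The sketch below adds recall@k next to precision@k and averages both; the `held_out` and `predicted` dictionaries are made-up stand-ins for your test split and your recommender's output.

```python
def precision_at_k(true_items, pred_items, k):
    """Fraction of the top-k predictions that are relevant."""
    return len(set(true_items) & set(pred_items[:k])) / k

def recall_at_k(true_items, pred_items, k):
    """Fraction of the relevant items that appear in the top-k predictions."""
    if not true_items:
        return 0.0
    return len(set(true_items) & set(pred_items[:k])) / len(set(true_items))

# Hypothetical test data: held-out items per user and ranked predictions per user
held_out = {1: [12], 2: [11], 3: [10, 12]}
predicted = {1: [12, 13, 14], 2: [13, 14, 15], 3: [10, 11, 12]}

k = 3
mean_precision = sum(precision_at_k(held_out[u], predicted[u], k) for u in held_out) / len(held_out)
mean_recall = sum(recall_at_k(held_out[u], predicted[u], k) for u in held_out) / len(held_out)
print(f"Mean precision@{k}: {mean_precision:.3f}")
print(f"Mean recall@{k}: {mean_recall:.3f}")
```

Precision@k is capped by how many held-out items each user has (a user with one held-out item can score at most 1/k), so it helps to report recall@k alongside it.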

6) Model Interpretability & Fairness

Explanation: Interpretability means explaining why a model made a prediction. Tools include feature importances (tree models), coefficients (linear models), SHAP, and LIME. Fairness means ensuring the model doesn't discriminate across sensitive groups. We'll show SHAP usage (sketch) and fairness checks.

6.1 Feature importance & coefficients

# Example 1 — tree feature importances
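# X, y: the Iris data loaded in Section 4
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
print('Feature importances:', rf.feature_importances_)
# Linear model coefficients (one row per class; max_iter raised so the solver converges)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000).fit(X, y)
print('Coefficients:', lr.coef_)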

Practice:

Use feature importances to remove weak features and retrain. Observe effect on performance.

6.2 SHAP (sketch)

# Example 2 — SHAP quick sketch (install shap)
# import shap
# explainer = shap.TreeExplainer(rf)
# shap_values = explainer.shap_values(X[:50])
# shap.summary_plot(shap_values, X[:50])
print("SHAP example requires shap installation; use on local machine/Colab.")

Practice:

Install shap and run summary and force plots to explain specific predictions.

6.3 Fairness checks — group metrics

# Example 3 — check accuracy across groups
import numpy as np
# assume df has 'sensitive' column (e.g., gender) and predictions in pred
# simulate
sensitive = np.random.choice(['A','B'], size=len(y))
pred = rf.predict(X)
from sklearn.metrics import accuracy_score
acc_A = accuracy_score(y[sensitive=='A'], pred[sensitive=='A'])
acc_B = accuracy_score(y[sensitive=='B'], pred[sensitive=='B'])
print('Acc A:', acc_A, 'Acc B:', acc_B)

Practice:

If performance differs substantially between groups, consider reweighing samples, collecting more data for the weaker group, or trying fairness-aware algorithms (worth a quick round of research before you pick one).
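Accuracy alone can hide group differences, so it can help to look at a few metrics side by side. The sketch below uses a tiny hand-made dataset; `group_report` is an illustrative helper, and selection rate and true positive rate are shown as two commonly used group metrics.

```python
import numpy as np

# Hand-made labels, predictions, and group membership for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

def group_report(y_true, y_pred, group):
    """Per-group selection rate, true positive rate, and accuracy."""
    report = {}
    for g in np.unique(group):
        in_group = group == g
        positives = in_group & (y_true == 1)
        report[str(g)] = {
            "selection_rate": float(y_pred[in_group].mean()),  # share predicted positive
            "tpr": float(y_pred[positives].mean()),            # recall within the group
            "accuracy": float((y_pred[in_group] == y_true[in_group]).mean()),
        }
    return report

report = group_report(y_true, y_pred, group)
parity_gap = abs(report["A"]["selection_rate"] - report["B"]["selection_rate"])
tpr_gap = abs(report["A"]["tpr"] - report["B"]["tpr"])
print(report)
print("Selection-rate gap:", parity_gap, "TPR gap:", round(tpr_gap, 3))
```

In this toy data both groups have the same accuracy and selection rate, yet group A's true positive rate is lower, which an accuracy-only check would miss.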

6.4 Counterfactual explanations (concept)

Short: Counterfactual = minimal change to input that flips prediction. Useful in model transparency (e.g., "If your income were X higher, you would be approved").
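As a toy illustration of the idea, the sketch below searches for the smallest income increase that flips a rejection into an approval. The `approve` rule is a hypothetical stand-in for a trained model, and the single-feature grid search is a simplification: real counterfactual methods search over several features with constraints on plausibility.

```python
def approve(income, debt):
    """Hypothetical stand-in model: approve when a simple integer score is positive."""
    return income - 2 * debt - 40_000 > 0

def income_counterfactual(income, debt, step=500, max_steps=200):
    """Smallest income increase (in multiples of `step`) that gets approved, or None."""
    for n in range(max_steps + 1):
        if approve(income + n * step, debt):
            return n * step
    return None

needed = income_counterfactual(income=40_000, debt=5_000)
unreachable = income_counterfactual(income=0, debt=100_000)
print("Extra income needed:", needed)
print("No counterfactual within search range:", unreachable)
```

Here the applicant needs an income above 50,000, so the first step that crosses the boundary is an increase of 10,500. Returning None when no change within the search range works is useful in practice, since not every rejection has a realistic counterfactual.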

6.5 Model cards & documentation

Practice: Create a small model card with intended use, data summary, metrics by group, known limitations — store with model artifacts for governance.

7) End-to-End Mini Project & Practice

Explanation: Now let's do a small real-world-style mini project: load a dataset, preprocess, engineer features, train two models, evaluate, and deploy locally (sketch). Running this project step by step should make the full pipeline clear.

7.1 Dataset & Problem

Use a customer churn or credit dataset. This template is generic; just adapt the CSV name and the column names.

7.2 Full pipeline (runnable template)


# Mini project pipeline template — place customer_churn.csv in working dir
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import joblib

# Load
df = pd.read_csv('customer_churn.csv')
df = df.dropna(subset=['churn'])
df['age'] = df['age'].fillna(df['age'].median())  # assign back; inplace on a column is deprecated
num_cols = ['age','balance']
cat_cols = ['gender','region']

# Train-test split
X = df.drop(['customer_id','churn'], axis=1)
y = df['churn'].map({'No':0, 'Yes':1}) if df['churn'].dtype=='object' else df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Preprocessor and pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # sparse_output replaces the old sparse argument (scikit-learn >= 1.2)
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)
])

pipe = Pipeline([('pre', preprocessor), ('clf', RandomForestClassifier(random_state=42))])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print(classification_report(y_test, pred))
print('ROC-AUC:', roc_auc_score(y_test, pipe.predict_proba(X_test)[:,1]))

# Save model artifact
joblib.dump(pipe, 'churn_pipeline.joblib')
print("Saved pipeline as churn_pipeline.joblib")

# Quick local deploy: test curl (see Flask/FastAPI snippets earlier)
    

Practice:

Run this template end-to-end. After saving pipeline, create simple FastAPI endpoint that loads churn_pipeline.joblib and returns prediction for sample input.

7.3 Additional experiments (5 ideas)

  1. Try LightGBM/XGBoost instead of RandomForest and compare ROC-AUC.
  2. Add interaction features (age * balance) and re-evaluate.
  3. Implement K-fold target encoding for high-cardinality categories and compare leakage-aware scores.
  4. Build a simple dashboard (Streamlit) that sends sample inputs to your model and shows prediction + SHAP explanation.
  5. Simulate production traffic and log predictions — then run drift detection code from Section 2.

Resources & Next Steps

  • Scikit-learn docs: model selection, pipelines, decomposition, clustering.
  • Statsmodels and Prophet: time-series tools.
  • Docker docs and cloud provider tutorials (AWS, GCP, Azure) for deployment.
  • SHAP and LIME for interpretability.
  • MLflow for model registry, tracking, and simple deployment helpers.
Thank you. Learn free and earn more.

Python for Machine Learning & Data Science — Day 6 • beepship.blogspot.com • Practice, Build, Deploy
