A full end-to-end machine learning and Flask web application that predicts diabetes risk and visualises explainability insights for individual or batch predictions. Built with Python, scikit-learn, pandas, SHAP, and Flask, this project demonstrates both data science excellence and software engineering maturity — from raw data ingestion to interactive model deployment.
⚙️ Two integrated components:
- End-to-End Machine Learning Pipeline — data processing → training → evaluation → explainability.
- Interactive Flask Dashboard — real-time single & batch prediction app powered by the trained model.
- Automated ML Pipeline: Modular scripts for loading, preprocessing, EDA, training, evaluation, and model explainability.
- Interactive Web Dashboard: Built with Flask + Bootstrap + Chart.js, so that clinicians and analysts can interactively explore model predictions.
- Explainability: Integrated SHAP/LIME interpretability tools, visualising local and global feature contributions.
- Production-ready structure: Logging, testing, CI, and pre-commit hooks aligned with professional ML engineering standards.
- Collaborative foundation: Code documented with Javadoc-style docstrings, pytest coverage, and a clean file architecture.
```
Diabetes-Risk-Prediction-Project/
├── data/                     # raw dataset (diabetes.csv)
├── models/                   # saved ML models (.joblib)
├── reports/                  # generated plots, explainability & metrics
│   ├── explain/              # SHAP/LIME visual outputs
│   ├── models/               # trained model artifacts
│   └── figures/              # EDA & evaluation plots
├── src/
│   ├── data_loading.py
│   ├── data_processing.py
│   ├── data_exploration.py
│   ├── data_visualisation.py
│   ├── statistical_analysis.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   └── dashboard/            # Flask dashboard app
│       ├── app.py
│       ├── predict.py
│       ├── routes.py
│       ├── templates/
│       │   └── index.html
│       └── static/
├── tests/                    # unit, integration, and dashboard tests
├── main.py                   # unified pipeline runner
├── requirements.txt
├── pyproject.toml
└── README.md
```
This component implements a complete data science workflow — from ingestion to explainability — using the Pima Indians Diabetes dataset.
1. **Data Loading**: Load `data/diabetes.csv` into pandas and validate its structure.

   ```bash
   python src/data_loading.py --data ./data/diabetes.csv
   ```

2. **Data Preprocessing**: Handle missing values, normalize numerical features, and encode categorical variables.

   ```bash
   python src/data_processing.py --data ./data/diabetes.csv --out reports
   ```

3. **Exploratory Data Analysis (EDA)**: Generate descriptive statistics, correlations, and visualizations (BMI, glucose, etc.).

   ```bash
   python src/data_exploration.py --data ./data/diabetes.csv --out reports
   ```

4. **Statistical Analysis**: Run hypothesis tests and feature-significance analysis.

   ```bash
   python src/statistical_analysis.py --data ./data/diabetes.csv --out reports
   ```

5. **Model Training**: Train Logistic Regression, Random Forest, Gradient Boosting, or XGBoost models; save the best model to `reports/models/`.

   ```bash
   python src/model_training.py --data ./data/diabetes.csv --model rf --out_dir reports
   ```

6. **Model Evaluation**: Evaluate accuracy, ROC AUC, and the confusion matrix; save plots.

   ```bash
   python src/model_evaluation.py --model reports/models/rf_best.joblib --out reports
   ```

7. **Explainability & Feature Importance**: Generate SHAP plots and local explanations, stored under `reports/explain/`.

8. **Run the Entire Pipeline Automatically**:

   ```bash
   python main.py
   ```
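As a rough sketch of what the training step above does, the snippet below trains and scores a Random Forest in scikit-learn. It uses a synthetic stand-in for the eight Pima features (the real script reads `data/diabetes.csv`), and the variable names are illustrative rather than taken from `src/model_training.py`:

```python
# Minimal sketch of the model-training step, assuming a standard
# scikit-learn workflow; synthetic data stands in for diabetes.csv.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 Pima features (glucose, BMI, age, ...).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ROC AUC on held-out data; the real pipeline would also persist the
# best model to reports/models/ with joblib.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```

The real pipeline additionally serialises the winning model with `joblib` so the dashboard can load it later.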
An interactive dashboard that loads the trained model from `reports/models/` and enables both single and batch predictions.

```bash
# 1. From the repo root
python -m pip install -r requirements.txt

# 2. Run the Flask app
PYTHONPATH=src python src/dashboard/app.py
```

Then visit `http://127.0.0.1:5000`, or explicitly specify a model path:

```bash
python src/dashboard/app.py --model reports/models/rf_best.joblib
```

| Panel | Description |
|---|---|
| Quick Single Prediction | Enter medical features manually → get predicted probability and SHAP explanation. |
| Batch CSV Upload | Upload a .csv file with multiple patients → get batch summary, visualized histogram, and downloadable explainability artifacts. |
| Notes & Guidance | Practical interpretation guide for clinicians and data scientists. |
All predictions, explanations, and generated files are timestamped and stored in `reports/explain/`.
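The single-prediction flow can be sketched as: order the submitted features into the model's expected vector, then score with `predict_proba`. The helper name `predict_single` and the toy model below are illustrative assumptions, not the dashboard's actual API; the real app loads its model from `reports/models/`:

```python
# Hedged sketch of the single-prediction path: a dict of patient features
# is ordered into the model's feature vector and scored with predict_proba.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Standard Pima feature order (also the expected batch-CSV column order).
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Toy model standing in for the artifact loaded from reports/models/.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.integers(0, 2, size=200)
model = LogisticRegression().fit(X, y)

def predict_single(patient: dict) -> float:
    """Return the predicted diabetes-risk probability for one patient."""
    row = np.array([[patient[f] for f in FEATURES]])
    return float(model.predict_proba(row)[0, 1])

risk = predict_single({f: 0.5 for f in FEATURES})
print(f"Predicted risk: {risk:.2%}")
```

A batch upload works the same way, applying `predict_proba` row-by-row to the uploaded CSV.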
- Single Prediction No Data
- Single Prediction With Data
- Batch Prediction No Data
- Batch Prediction With Data
Run tests locally before pushing:
```bash
PYTHONPATH=src pytest -q
```

The repository includes:

- Unit tests: `ModelWrapper`, preprocessing, and the data loaders.
- API tests: Flask routes and endpoints (`/predict`, `/predict_batch`).
- Integration tests: pipeline execution to ensure end-to-end consistency.
CI/CD integration (via GitHub Actions) ensures tests run automatically on every push.
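A unit test in this suite might look like the sketch below. The `predict_stub` helper is a hypothetical stand-in (the real tests exercise `ModelWrapper` and the Flask routes); the point is the shape of a pytest-style check on prediction output:

```python
# Illustrative pytest-style unit test; predict_stub is a hypothetical
# helper, not part of the actual codebase.
def predict_stub(features):
    # Pretend model: risk grows with glucose, clipped to [0, 1].
    return max(0.0, min(1.0, features["Glucose"] / 300))

def test_probability_is_bounded():
    p = predict_stub({"Glucose": 140})
    assert 0.0 <= p <= 1.0

def test_high_glucose_raises_risk():
    assert predict_stub({"Glucose": 250}) > predict_stub({"Glucose": 90})

# pytest would discover these automatically; called directly here for demo.
test_probability_is_bounded()
test_high_glucose_raises_risk()
print("ok")
```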
- Language: Python 3.11+
- Core Libraries: pandas, numpy, scikit-learn, joblib, shap, matplotlib, seaborn
- Web Framework: Flask + Bootstrap + Chart.js
- Testing: pytest, pre-commit, black, isort, ruff
- Tools: VS Code, GitHub Actions CI, pre-commit hooks, reportlab for PDF export
| Folder | Description |
|---|---|
| `reports/models/` | Trained model artifacts (`.joblib`) |
| `reports/explain/` | SHAP local & global explanations |
| `reports/` | EDA visuals, evaluation plots, and logs |
| `data/` | Input dataset (`diabetes.csv`) |
| `tests/` | Pytest suite |
For production:
- Replace `app.secret_key` with an environment variable.
- Serve via Gunicorn or Waitress instead of Flask's dev server.
- Mount static files via Nginx.
- Optionally containerize using Docker with health checks.
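The first two points might look like the following, assuming the Flask instance is exposed as `app` in `src/dashboard/app.py` (an assumption based on the layout above, not a verified entry point):

```shell
# Hedged deployment sketch: module path dashboard.app:app is an assumption.
export SECRET_KEY="replace-me"   # app.secret_key should read this from the environment
PYTHONPATH=src gunicorn --workers 4 --bind 0.0.0.0:8000 "dashboard.app:app"
```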
This dashboard enables clinicians or data scientists to:
- Instantly assess diabetes risk for new patients.
- Interpret which medical features contribute most to the prediction.
- Batch-evaluate risk profiles for large datasets.
- Export explainability artifacts (HTML/PNG) for audit and reporting.
Special thanks to:
- The National Institute of Diabetes and Digestive and Kidney Diseases — for the original dataset.
- OpenAI’s ChatGPT (GPT-5) — for advanced assistance in refactoring, debugging, and structuring production-ready code, documentation, and CI integration.
- The open-source community for continuous innovation in Python, Flask, and ML tooling.
Adrian Adewunmi