Home Credit Default Risk
📊 Predict Credit Default Risk using Machine Learning
Welcome to the Home Credit Default Risk project! This repository showcases a complete end-to-end pipeline for predicting credit default risk using structured data provided by Home Credit. The goal is to identify potential loan defaulters to assist financial institutions in minimizing risk while maximizing customer satisfaction.
- Data Loading & Preprocessing:
  - Efficient loading of multiple datasets.
  - Comprehensive preprocessing: handling missing values, encoding categorical features, and scaling numeric features.
- Feature Aggregation & Engineering:
  - Aggregation of POS, credit card, and installment payment records.
  - Merging of the aggregated datasets into a single feature table.
- Machine Learning Pipeline:
  - Supports multiple models: XGBoost, Random Forest, Logistic Regression.
  - Configurable hyperparameters for tuning each model.
  - Evaluation with ROC-AUC and a confusion matrix.
- Visualization:
  - Feature importance.
  - Correlation matrices.
  - ROC and precision-recall curves.
- Streamlit Application:
  - Interactive UI for exploring the pipeline.
  - Upload datasets, preprocess data, train models, and evaluate performance.
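
As a rough illustration of the feature aggregation step listed above, the sketch below collapses installment rows to one row per applicant and joins them onto the application table. It is not the code in `src/join.py`. The tiny DataFrames, the `aggregate_installments` helper, and the output feature names are hypothetical, and the column names (`SK_ID_CURR`, `AMT_INSTALMENT`, `AMT_PAYMENT`) are assumed to match the public Home Credit data.

```python
import pandas as pd

# Toy stand-ins for the application and installment tables.
# Column names are assumed to follow the public Home Credit data.
applications = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [100000.0, 250000.0, 50000.0],
})
installments = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_INSTALMENT": [500.0, 500.0, 1200.0, 1200.0, 1200.0],
    "AMT_PAYMENT": [500.0, 450.0, 1200.0, 0.0, 1200.0],
})


def aggregate_installments(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse installment rows to one row per applicant (hypothetical helper)."""
    # Amount owed minus amount actually paid, per installment.
    df = df.assign(SHORTFALL=df["AMT_INSTALMENT"] - df["AMT_PAYMENT"])
    agg = df.groupby("SK_ID_CURR").agg(
        INST_COUNT=("AMT_PAYMENT", "count"),
        INST_PAYMENT_MEAN=("AMT_PAYMENT", "mean"),
        INST_SHORTFALL_SUM=("SHORTFALL", "sum"),
    )
    return agg.reset_index()


# A left join keeps applicants with no installment history;
# their aggregated columns are NaN and can be imputed later.
features = applications.merge(
    aggregate_installments(installments), on="SK_ID_CURR", how="left"
)
print(features)
```

Applicant 3 has no installment rows, so its aggregated columns come out as missing values, which is one reason the preprocessing step handles missing data after the join.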
├── data/ # Raw and processed datasets
├── src/ # Core scripts for pipeline
│ ├── load_data.py # Data loading
│ ├── join.py # Aggregation and merging
│ ├── preprocessing.py # Data preprocessing
│ ├── train_model.py # Model training
│ ├── evaluate_model.py # Model evaluation
│ └── visualize.py # Visualizations
├── main.py # Pipeline orchestration
├── streamlit_app.py # Streamlit interactive app
├── README.md # Project documentation
├── requirements.txt # Python dependencies
- Load Data: Load multiple datasets like application, credit card, and POS balance data.
- Join & Aggregate: Combine datasets and engineer new features.
- Preprocess: Handle missing values, encode features, and scale data.
- Train Models: Experiment with XGBoost, Random Forest, and Logistic Regression.
- Evaluate & Visualize: Assess model performance and visualize insights.
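
The preprocess, train, and evaluate steps above can be sketched with scikit-learn as follows. This is a minimal example on synthetic data, not the repository's `preprocessing.py` or `train_model.py`. The column names `income` and `contract_type`, the synthetic default rule, and the choice of Logistic Regression as the model are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; columns are illustrative, not the repo's schema.
rng = np.random.default_rng(0)
n = 400
income = rng.normal(50_000, 15_000, n)
contract = rng.choice(["cash", "revolving"], n)
# Purely synthetic rule: default is more likely at lower income.
p_default = 1 / (1 + np.exp((income - 45_000) / 10_000))
y = pd.Series((rng.random(n) < p_default).astype(int), name="TARGET")

X = pd.DataFrame({"income": income, "contract_type": contract})
X.loc[rng.choice(n, 20, replace=False), "income"] = np.nan  # simulate missing values

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["income"]),
    # Categorical columns: one-hot encode, ignoring unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract_type"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)

# ROC-AUC uses predicted probabilities; the confusion matrix uses hard labels.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, model.predict(X_test))
print(f"ROC-AUC: {auc:.3f}")
print(cm)
```

Keeping the preprocessing inside the `Pipeline` means the imputer and scaler are fit on the training split only, which avoids leaking test-set statistics into training.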
# Clone the repository
git clone https://github.com/yourusername/home-credit-default-risk.git
cd home-credit-default-risk
# Install dependencies
pip install -r requirements.txt
# Run the pipeline
python main.py
# Run Streamlit app
streamlit run streamlit_app.py
- Python
- Pandas and NumPy: Data manipulation.
- Scikit-learn: Preprocessing and evaluation.
- XGBoost: Gradient-boosted tree models.
- Matplotlib and Seaborn: Data visualization.
- Streamlit: Interactive app.
- Incorporate additional gradient boosting libraries (e.g., CatBoost, LightGBM).
- Add automated hyperparameter tuning.
- Extend visualization capabilities.
- Include time-series analysis for sequential datasets.
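
For the automated hyperparameter tuning item above, one possible starting point is scikit-learn's `GridSearchCV`. The sketch below is only an assumption about how this could look, not something the project implements today. The dataset comes from `make_classification`, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic data as a stand-in for the real feature table.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Small, arbitrary grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric the pipeline reports
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```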
This project is licensed under the MIT License.
Contributions are welcome! If you’d like to improve the project or fix a bug:
- Fork the repository.
- Create your feature branch:
git checkout -b feature-name
- Commit your changes:
git commit -m 'Add feature-name'
- Push to the branch:
git push origin feature-name
- Open a pull request.
For any questions, feel free to reach out via [email protected] or create an issue.