Rossmann operates over 3,000 drug stores across 7 European countries. Store managers are tasked with predicting daily sales up to six weeks in advance. Sales are influenced by factors including promotions, competition, school and state holidays, seasonality, and store locality.
This project builds a regression-based machine learning model to forecast sales for 1,115 Rossmann stores using historical data, helping the business make data-driven decisions on budgets, hiring, incentives, and growth plans.
- Stores: 1,115 Rossmann stores across Europe
- Features: Store type, assortment, promotions, competition distance, school/state holidays, day of week, and more
- Target: Daily sales revenue
- Download Links:
- Exploratory Data Analysis (EDA) — Analyzed sales trends, seasonality, promotional impact, and store-level patterns
- Feature Engineering — Extracted and transformed features from promotions, competition, holidays, and temporal attributes
- Model Benchmarking — Trained and compared 5 regression models to identify the best performer
- Evaluation — Used MAE, MAPE, and RMSE as evaluation metrics
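
As a minimal sketch of how the three evaluation metrics can be computed, the functions below use plain Python. The sample values are made up for illustration, and skipping zero-sales days in MAPE is an assumption (a closed store would otherwise cause a division by zero), not something confirmed by the notebooks.

```python
import math

def mae(actual, predicted):
    """Mean absolute error, in the same units as sales."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error (%). Assumption: zero-sales days are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical daily sales for one store (illustrative values only)
actual = [5000.0, 6200.0, 0.0, 4800.0]
predicted = [5200.0, 6000.0, 150.0, 4500.0]

print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"MAPE: {mape(actual, predicted):.2f}%")
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

MAE and RMSE are scale-dependent, so high-volume stores dominate them, while MAPE expresses error relative to each day's sales.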
| Model | MAE | MAPE (%) | RMSE |
|---|---|---|---|
| Random Forest Regressor | 383.06 | 5.46 | 577.59 |
| SARIMA | 365.87 | 12.66 | 434.03 |
| XGBoost Regressor | 509.39 | 7.27 | 739.63 |
| Linear Regression | 1045.57 | 15.05 | 1458.30 |
| LR Lasso | 1107.31 | 15.66 | 1582.54 |
Random Forest Regressor achieved the best balance of performance with the lowest MAPE of 5.46%, meaning predictions deviate from actual sales by only ~5.5% on average.
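
To make the comparison concrete, the short sketch below ranks the models on each metric using the figures from the table above. The `results` dictionary and the `best_by` helper are illustrative names, not part of the project's code.

```python
# Scores copied from the benchmark table above
results = {
    "Random Forest Regressor": {"MAE": 383.06, "MAPE": 5.46, "RMSE": 577.59},
    "SARIMA": {"MAE": 365.87, "MAPE": 12.66, "RMSE": 434.03},
    "XGBoost Regressor": {"MAE": 509.39, "MAPE": 7.27, "RMSE": 739.63},
    "Linear Regression": {"MAE": 1045.57, "MAPE": 15.05, "RMSE": 1458.30},
    "LR Lasso": {"MAE": 1107.31, "MAPE": 15.66, "RMSE": 1582.54},
}

def best_by(metric):
    """Return the model with the lowest value for the given error metric."""
    return min(results, key=lambda name: results[name][metric])

for metric in ("MAE", "MAPE", "RMSE"):
    print(f"Lowest {metric}: {best_by(metric)}")
```

The ranking depends on the metric: Random Forest is lowest on MAPE, while SARIMA is lowest on MAE and RMSE.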
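from sklearn.ensemble import RandomForestRegressor

RandomForestRegressor(
    n_estimators=30,
    random_state=42,
    criterion='squared_error',  # regression criterion ('mse' before scikit-learn 1.0); 'gini' is classifier-only
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features=1.0,  # use all features; what the deprecated 'auto' meant for regressors
    bootstrap=True,
    oob_score=False
)  # class_weight is a classifier-only parameter, so it is omitted here

- Language: Python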
- Libraries: Pandas, NumPy, Scikit-learn, XGBoost, Statsmodels (SARIMA), Matplotlib, Seaborn
- Models: Random Forest, SARIMA, XGBoost, Linear Regression, Lasso Regression
- Clone the repository:

      git clone https://github.com/varshil009/Rossmann-Regression.git
      cd Rossmann-Regression

- Install dependencies:

      pip install pandas numpy scikit-learn xgboost statsmodels matplotlib seaborn

- Download the dataset using the links above and place the files in the project directory.

- Run the notebook:

      jupyter notebook "Rossman Regression.ipynb"

Rossmann-Regression/
├── ML_process.ipynb # ML pipeline and model training
├── Rossman Regression.ipynb # EDA and data preprocessing
└── README.md # Project documentation
- Random Forest outperformed all other models on MAPE (5.46%), the most business-relevant metric for sales forecasting
- SARIMA achieved the lowest MAE (365.87) and RMSE (434.03) but a much higher MAPE (12.66%), indicating inconsistent percentage-wise accuracy across stores
- Feature engineering on temporal and promotional features was critical to improving model performance
- Forecasting at store level enables targeted business decisions for budgets, staffing, and inventory management
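
As a rough illustration of the temporal and promotional features mentioned above, the sketch below derives a few of them from a single record using only the standard library. The field names (`date`, `promo`, `competition_open_since`) and the `engineer_features` helper are hypothetical and do not reflect the notebooks' actual column names or logic.

```python
from datetime import date

def engineer_features(row):
    """Derive model inputs from one raw record (a dict with hypothetical keys)."""
    d = row["date"]
    _, iso_week, iso_weekday = d.isocalendar()

    # Months since the nearest competitor opened; 0 if unknown or not yet open
    comp = row.get("competition_open_since")
    if comp is None:
        months_open = 0
    else:
        months_open = max(0, (d.year - comp.year) * 12 + (d.month - comp.month))

    return {
        "day_of_week": iso_weekday,          # 1 = Monday ... 7 = Sunday
        "month": d.month,
        "week_of_year": iso_week,
        "is_weekend": int(iso_weekday >= 6),
        "promo": int(row["promo"]),          # 1 if a promotion ran that day
        "competition_months_open": months_open,
    }

# Illustrative record, not taken from the dataset
sample = {
    "date": date(2015, 7, 31),
    "promo": True,
    "competition_open_since": date(2008, 9, 1),
}
features = engineer_features(sample)
print(features)
```

Encoding the calendar position as separate numeric columns lets tree-based models such as Random Forest pick up weekly and seasonal patterns without an explicit time-series structure.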