This project aims to predict house prices in Austin, Texas, using various machine learning models. The project includes comprehensive data preprocessing, feature engineering, model training, and evaluation. The final model is applied to a holdout dataset for predictions.
- Data Import and Exploration: Loading and initial exploration of the dataset to understand its structure and characteristics.
- Data Manipulation:
- Added a new column
logLatestPrice
by taking the logarithm oflatestPrice
. - Derived the age of the house and categorized it into different age groups.
- Converted binary columns to numeric values and created a new feature
total_amenities
. - Extracted and converted sale date components into year, month, day, and season.
- Added a new column
- Geospatial Features:
- Performed clustering based on latitude and longitude to create a new feature
crossCluster
.
- Performed clustering based on latitude and longitude to create a new feature
- Modeling:
- Trained and evaluated multiple models, including Regression Tree, Pruned Tree, Bagging with Random Forest, XGBoost, BART, Ridge, and Lasso Regression.
- Calculated and compared MSE and RMSE for each model to identify the best performer.
- Model Comparison and Conclusion:
- Summarized the performance of different models and provided insights and recommendations based on the model performance.
- Application on Holdout Set:
- Applied the best model on the holdout dataset for final predictions.
The dataset used in this project consists of various features, including geospatial data, sale dates, and property characteristics. The main target variable is the logarithm of the latest sale price (logLatestPrice
).
- Geospatial Data: Latitude, Longitude, and Clustering (
crossCluster
). - Temporal Data: Sale Year, Sale Month, and Season.
- Property Characteristics: Number of Bedrooms, Bathrooms, Stories, Garage Spaces, etc.
- Aggregated Amenities: Total number of amenities based on binary columns like
hasAssociation
,hasGarage
, etc.
- Regression Tree: A simple decision tree model.
- Pruned Tree: A decision tree model pruned for better generalization.
- Bagging with Random Forest: An ensemble method using multiple decision trees.
- Random Forest: A popular ensemble learning method that builds multiple decision trees and merges them together to get a more accurate and stable prediction.
- XGBoost: An efficient implementation of gradient boosting that is widely used for machine learning competitions.
- BART (Bayesian Additive Regression Trees): A non-parametric Bayesian model that provides flexible and robust predictions.
- Ridge Regression: A linear regression model with L2 regularization.
- Lasso Regression: A linear regression model with L1 regularization, which also performs feature selection.
The models were evaluated based on their Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Below is a summary of the performance:
Model | MSE | RMSE |
---|---|---|
Single Tree | 107,407.4 | 327.73 |
Pruned Tree | 91,419.22 | 302.36 |
Bagging | 67,584.75 | 259.97 |
Random Forest | 67,067.45 | 258.97 |
XGBoost | 26,151.38 | 161.71 |
BART | 19,995.7 | 141.41 |
Ridge Regression | 60,583.55 | 246.14 |
Lasso Regression | 57,363.87 | 239.51 |
The BART model outperformed all other models, achieving the lowest RMSE, indicating its superior ability to capture complex relationships in the data.
BART and XGBoost emerged as the top performers, with BART having a slight edge in terms of RMSE. These models are well-suited for capturing complex, non-linear relationships in the data.
- Further hyperparameter tuning for models like XGBoost and Random Forest.
- Exploring more advanced feature engineering techniques.
- Considering additional ensemble methods like stacking to further improve predictions.
To replicate this project, follow these steps:
- Clone the repository.
- Run Rmd file and Run the Jupyter notebook to see the analysis and model training process.
- The predictions on the holdout set are saved as
austinhouses_holdout_predictions.csv
.
For any questions or suggestions, feel free to contact Pranav Garg.
You can copy and paste this structure into a README.md
file for your GitHub repository. Adjust the contact information and any other details as needed.