The objective of our simulation project is to compare a random forest model with a logistic regression model on a binary classification problem. The simulation will be modelled after the one conducted by Kirasich et al. (2018), who carried out a similar comparison.
We will differentiate our study by adding scenarios not examined in theirs: the impact of missing values on each model, and a comparison of two methods for imputing those missing values, random forest imputation and mode imputation.
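To make the simpler of the two imputation methods concrete, here is a minimal sketch of mode imputation for a single column, using only the Python standard library. The function name `mode_impute`, the use of `None` as the missing-value marker, and the toy column are assumptions made for illustration; they are not taken from Kirasich et al. (2018). Random forest imputation is not shown here, since it requires fitting a model on the observed values.

```python
from collections import Counter


def mode_impute(column, missing=None):
    """Return a copy of `column` with missing entries replaced by the mode.

    `missing` is the marker used for missing values (assumed to be None).
    Ties between equally frequent values are broken by first occurrence.
    """
    observed = [value for value in column if value is not missing]
    if not observed:
        raise ValueError("column has no observed values to take the mode of")
    # most_common(1) returns a list holding one (value, count) pair
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if value is missing else value for value in column]


# Toy categorical column with two missing entries (hypothetical data)
example = ["a", None, "b", "a", None]
imputed = mode_impute(example)
print(imputed)
```

Because every missing entry receives the same value, mode imputation leaves the observed values untouched but inflates the frequency of the most common category, which is one reason to compare it against a model-based method.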
To measure the performance of our models and to address our research objective, we will use the misclassification rate (the complement of accuracy) together with the ROC curve, its AUC, and cumulative lift curves to visualize the results of the simulation. We will additionally use the ROC curve to compare the sensitivity (true positive rate) of the competing models, to see whether either model performs better on this specific metric.
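The metrics named above can be sketched in plain Python as follows. This is an illustrative sketch, not the project's implementation: the function names, the 0.5 classification threshold, and the toy labels and scores are all assumptions. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case), and cumulative lift is evaluated at a chosen fraction of the highest-scored cases.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cumulative_lift(y_true, scores, fraction):
    """Positive rate in the top `fraction` of scores over the overall rate."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(t for _, t in ranked[:k]) / k
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate


# Toy labels and predicted probabilities (hypothetical data)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.8, 0.35, 0.3, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

error = misclassification_rate(y_true, y_pred)
accuracy = 1 - error
tpr = sensitivity(y_true, y_pred)
area = auc(y_true, scores)
lift_top_quarter = cumulative_lift(y_true, scores, 0.25)
print(error, accuracy, tpr, area, lift_top_quarter)
```

Note that the misclassification rate and sensitivity depend on the chosen threshold, whereas the AUC and the lift curve depend only on how the scores rank the cases.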
Kirasich, Kaitlin; Smith, Trace; and Sadler, Bivin (2018). "Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets." SMU Data Science Review, Vol. 1, No. 3, Article 9. Available at: https://scholar.smu.edu/datasciencereview/vol1/iss3/9