The files for the challenge can be found here. Four CSV files are to be downloaded:
- dengue_features_train.csv
- dengue_labels_train.csv
- dengue_features_test.csv
- submission_format.csv
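The features and labels files can be joined into one training frame. A minimal sketch, using tiny in-memory stand-ins for the real CSVs (in practice these come from `pd.read_csv("dengue_features_train.csv")` and `pd.read_csv("dengue_labels_train.csv")`) and assuming the two files share the `city` / `year` / `weekofyear` key columns:

```python
import pandas as pd

# Tiny stand-ins for the challenge files; the real frames are loaded with
# pd.read_csv(...) and have many more rows and climate feature columns.
features_train = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "ndvi_ne": [0.12, 0.17, 0.20],   # hypothetical feature column
})
labels_train = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "total_cases": [4, 5, 0],
})

# The features and labels share the city / year / weekofyear keys,
# so the target column can be attached with a merge.
train = features_train.merge(labels_train, on=["city", "year", "weekofyear"])
print(train.shape)  # one row per (city, year, weekofyear)
```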
Two regression models are commonly used for count data:
- Poisson Regression
- Negative Binomial Regression

Poisson regression is used when the mean and variance of the data are equal or nearly so. Negative Binomial (Pascal) regression is used when they differ, in particular when the variance is larger than the mean.
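The mean-variance distinction can be seen by simulation. A quick sketch with NumPy (the distribution parameters here are illustrative, not from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson: variance equals the mean by construction
pois = rng.poisson(lam=10.0, size=100_000)
print(pois.mean(), pois.var())   # both close to 10

# Negative binomial: variance exceeds the mean (overdispersion);
# here mean = n*(1-p)/p = 10 and variance = n*(1-p)/p**2 = 30
nb = rng.negative_binomial(n=5, p=1/3, size=100_000)
print(nb.mean(), nb.var())
```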
As this dataset is large, some features may be strongly correlated with the target variable, i.e. total_cases, and some weakly correlated. Our job is to drop the features that have a low correlation factor. As the dataset contains data for two different cities, we need to separate them and treat them as two different datasets.

train_features_sj.total_cases.mean()
> 34.18
train_features_sj.total_cases.var()
> 2640.04
train_features_iq.total_cases.mean()
> 7.56
train_features_iq.total_cases.var()
> 115.89
As the variance of total_cases is far larger than its mean for both cities, we will use Negative Binomial Regression.
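The per-city mean/variance comparison above can be sketched as follows; a tiny stand-in frame is used here, but in practice `train` is the merged training data, which has a `city` column with values `sj` and `iq`:

```python
import pandas as pd

# Stand-in for the merged training data (values are illustrative only)
train = pd.DataFrame({
    "city": ["sj"] * 5 + ["iq"] * 5,
    "total_cases": [4, 60, 5, 120, 2, 0, 3, 1, 25, 2],
})

# Separate the two cities and compare mean vs variance of the target
train_features_sj = train[train.city == "sj"]
train_features_iq = train[train.city == "iq"]

for name, df in [("sj", train_features_sj), ("iq", train_features_iq)]:
    m, v = df.total_cases.mean(), df.total_cases.var()
    print(f"{name}: mean={m:.2f} var={v:.2f} overdispersed={v > m}")
```

Variance far above the mean in both cities is what motivates the Negative Binomial choice.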
On plotting the correlations of the features with the target variable we get the following result:
We can see that a few features have a negative correlation. It is wise to remove those features from the dataset, along with those whose correlation is close to zero.
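The filtering step can be sketched with `DataFrame.corr`. The feature names and the 0.05 threshold below are assumptions for illustration, not the article's actual choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Hypothetical feature columns standing in for the climate features
train = pd.DataFrame({
    "reanalysis_specific_humidity": rng.normal(size=n),
    "station_avg_temp": rng.normal(size=n),
    "noise_feature": rng.normal(size=n),
})
# Make the target depend on one feature so it correlates strongly
train["total_cases"] = (2.0 * train["reanalysis_specific_humidity"]
                        + rng.normal(scale=0.5, size=n))

# Correlation of every feature with the target
corr = train.corr()["total_cases"].drop("total_cases")

# Keep only features whose correlation is positive and not close to zero
threshold = 0.05  # assumed cutoff
keep = corr[corr > threshold].index.tolist()
print(keep)
```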
The training set for both cities is split into a training set and a validation set.

train_features_sj.shape
> (936, 16)
train_features_iq.shape
> (520, 17)
So for SJ a good split will be 800-136 and for IQ it will be 400-120.
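One way to realize the 800-136 and 400-120 splits is a positional split, which respects the time ordering of the rows (a random shuffle would leak future weeks into training). A sketch with stand-in frames of the right sizes:

```python
import pandas as pd

# Stand-ins with the real row counts: 936 rows (sj) and 520 rows (iq)
train_features_sj = pd.DataFrame({"total_cases": range(936)})
train_features_iq = pd.DataFrame({"total_cases": range(520)})

# Split by position: the first chunk trains the model, the tail validates it
sj_train, sj_val = train_features_sj.iloc[:800], train_features_sj.iloc[800:]
iq_train, iq_val = train_features_iq.iloc[:400], train_features_iq.iloc[400:]

print(len(sj_train), len(sj_val))  # 800 136
print(len(iq_train), len(iq_val))  # 400 120
```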
We build the model using the Negative Binomial family of the Generalized Linear Model (GLM) class. Refer to the code for further understanding. After making predictions on the validation set, we plot them.

Libraries used:
- Numpy
- Pandas
- Matplotlib
- Seaborn
- Statsmodels
- Scikit-learn
- The mean absolute error (meanabs()) for SJ: 20.58
- The mean absolute error (meanabs()) for IQ: 10.26
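The scores above use `meanabs`, Statsmodels' mean absolute error helper. A quick illustration on made-up numbers:

```python
import numpy as np
from statsmodels.tools.eval_measures import meanabs

# meanabs averages the absolute differences between the two arrays
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])
print(meanabs(y_true, y_pred))  # (2 + 2 + 3) / 3
```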