DengAI Challenge 2018

The files for the challenge can be found here. There are four CSV files to download:

dengue_features_train.csv
dengue_labels_train.csv
dengue_features_test.csv
submission_format.csv

Missing Values (NaNs)

The dataset contains many missing values (NaNs), which have to be replaced. As the data changes with time, I chose the ffill() function to replace each missing value with the last known value.
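To make the forward fill concrete, here is a minimal sketch on a small made-up frame. The values and the frame name `weekly` are illustrative assumptions, not rows from the challenge files.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly readings with gaps (illustrative values only).
weekly = pd.DataFrame(
    {
        "week_start_date": pd.date_range("2000-01-02", periods=5, freq="W"),
        "station_avg_temp_c": [26.4, np.nan, np.nan, 27.1, np.nan],
    }
)

# Forward fill: each NaN takes the most recent earlier non-missing value.
filled = weekly.ffill()

print(filled)
```

Note that a NaN in the very first row has no earlier value to copy, so it would stay missing after ffill(). If both cities live in one frame, filling per city (for example with groupby("city").ffill()) avoids carrying one city's last value into the other.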

Type of Regression

As the target variable total_cases is always a non-negative integer, this is a count regression problem. That leaves two choices:

Poisson Regression
Negative Binomial Regression

Poisson regression is used when the mean and variance of the data are equal or close to equal. Negative Binomial (Pascal) regression is used when the variance is larger than the mean (overdispersion).

Growth of Dengue in the two cities over time

Correlations

As this dataset has many features, some may be strongly correlated with the target variable (i.e. total_cases) and some only weakly. So our job is to drop the features with a low correlation. As the dataset contains data for two different cities, it makes sense to separate them and treat them as two different datasets.

train_features_sj.total_cases.mean()
> 34.18
train_features_sj.total_cases.var()
> 2640.04
train_features_iq.total_cases.mean()
> 7.56
train_features_iq.total_cases.var()
> 115.89

As the variance of total_cases is far larger than the mean for both cities, we will use Negative Binomial Regression.
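The comparison can be reduced to a single number per city, the variance-to-mean ratio. The sketch below uses only the four figures printed above; the dictionary layout and variable names are assumptions for illustration.

```python
# Mean and variance of total_cases as reported above for each city.
summary = {
    "sj": {"mean": 34.18, "var": 2640.04},
    "iq": {"mean": 7.56, "var": 115.89},
}

# A ratio near 1 would be consistent with Poisson; a ratio well above 1
# indicates overdispersion.
ratios = {city: s["var"] / s["mean"] for city, s in summary.items()}
overdispersed = {city: ratio > 1 for city, ratio in ratios.items()}

for city, ratio in ratios.items():
    print(f"{city}: variance/mean = {ratio:.1f}")
```

Both ratios come out far above 1, which is the situation where Negative Binomial Regression is usually preferred over Poisson.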

On plotting the correlations of the features with the target variable we get the following result:

We can see that a few features have a negative correlation with total_cases. So I remove those features from the dataset, along with those whose correlation is weak (close to 0).
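A minimal sketch of this kind of filter, on synthetic data rather than the challenge files. The column names `humidity` and `unrelated`, the frame name `city`, and the 0.3 cut-off are all assumptions chosen for illustration; the actual threshold used in model.py is not stated here.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that drives the target, one that does not.
rng = np.random.default_rng(0)
n = 200
humidity = rng.normal(size=n)
unrelated = rng.normal(size=n)
cases = np.clip(np.round(20 + 5 * humidity + rng.normal(size=n)), 0, None)

city = pd.DataFrame(
    {"humidity": humidity, "unrelated": unrelated, "total_cases": cases}
)

# Pearson correlation of every feature with the target.
corr = city.corr()["total_cases"].drop("total_cases")

# Hypothetical cut-off: keep only features with a clearly positive correlation.
threshold = 0.3
kept = corr[corr > threshold].index.tolist()
dropped = corr[corr <= threshold].index.tolist()

print(corr.sort_values())
print("kept:", kept, "dropped:", dropped)
```

One design note: a strongly negative correlation can carry as much signal as a strongly positive one, so filtering on the absolute value (corr.abs() > threshold) is an alternative worth comparing on the validation set.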

Train-Test Split

The training set for both cities is split into a training set and a validation set.

train_features_sj.shape
> (936, 16)
train_features_iq.shape
> (520, 17)

So for SJ a good split is 800 rows for training and 136 for validation, and for IQ it is 400 and 120.
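The split sizes above can be sketched as follows, assuming the rows are already in time order and the split is taken by position (earlier weeks for training, later weeks for validation). The helper `split_by_position` and the placeholder frames are hypothetical.

```python
import pandas as pd


def split_by_position(df, n_train):
    """Return the first n_train rows as training data and the rest as validation."""
    return df.iloc[:n_train], df.iloc[n_train:]


# Placeholder frames with the same row counts as the two city datasets.
sj = pd.DataFrame({"total_cases": range(936)})
iq = pd.DataFrame({"total_cases": range(520)})

sj_train, sj_val = split_by_position(sj, 800)
iq_train, iq_val = split_by_position(iq, 400)

print(len(sj_train), len(sj_val), len(iq_train), len(iq_val))
```

Splitting by position rather than at random keeps the validation weeks strictly after the training weeks, which is closer to how the test set is used.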

Final Model

We build the model using the Negative Binomial family of the Generalized Linear Model (GLM) class. Refer to the code for further details. After making the predictions on the validation set, we plot them.

Requirements

Numpy
Pandas
Matplotlib
Seaborn
Statsmodels
Scikit-learn

To run the model: python model.py

Results

The meanabs() error for SJ: 20.58
The meanabs() error for IQ: 10.26
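The metric reported above is the mean absolute error; meanabs() is presumably the helper of that name in statsmodels.tools.eval_measures. A minimal NumPy sketch of the same quantity, with a hypothetical function name and made-up numbers:

```python
import numpy as np


def mean_abs_error(predicted, actual):
    """Average of the absolute differences between predictions and true counts."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))


# Made-up predictions and true counts: absolute errors are 2, 0, 3, 0.
mae = mean_abs_error([3, 0, 10, 6], [5, 0, 7, 6])
print(mae)
```

Read this way, a value of 20.58 for SJ means the predictions are off by about 20 cases per week on average.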
