
Achieved a top 4% leaderboard finish by combining feature engineering with an XGBoost model on a large-scale educational dataset, reaching an F1 score of 0.68.


jojo142/MoneyInMotionPerformancefromGamePlay


MoneyInMotionPerformancefromGamePlay

kaggle

Feature Engineering:

The first section of the code handles feature engineering. A function called feature_engineer takes the train dataset as input and applies grouping and aggregation to its categorical and numerical variables to generate new features. It returns a processed dataframe containing the engineered features.
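The grouping-and-aggregation idea can be sketched as follows. This is a minimal illustration, not the repository's actual function: the column names (session_id, level_group, event_name, elapsed_time), the event list, and the choice of aggregates are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical column and event names, chosen for illustration only.
CATEGORICAL = ["event_name"]
NUMERICAL = ["elapsed_time"]
EVENTS = ["click", "hover"]
KEYS = ["session_id", "level_group"]


def feature_engineer(train: pd.DataFrame) -> pd.DataFrame:
    """Collapse raw event rows into one feature row per (session_id, level_group)."""
    grouped = train.groupby(KEYS)
    parts = []

    # Categorical columns: number of distinct values per group.
    for col in CATEGORICAL:
        parts.append(grouped[col].nunique().rename(f"{col}_nunique"))

    # Numerical columns: mean and standard deviation per group.
    for col in NUMERICAL:
        parts.append(grouped[col].mean().rename(f"{col}_mean"))
        parts.append(grouped[col].std().rename(f"{col}_std"))

    # Per-event features: occurrence count, binary "seen" flag, summed elapsed time.
    group_cols = [train[k] for k in KEYS]
    for event in EVENTS:
        is_event = (train["event_name"] == event).astype(int)
        count = is_event.groupby(group_cols).sum()
        parts.append(count.rename(f"{event}_count"))
        parts.append((count > 0).astype(int).rename(f"{event}_seen"))
        elapsed = (train["elapsed_time"] * is_event).groupby(group_cols).sum()
        parts.append(elapsed.rename(f"{event}_elapsed_time_sum"))

    # Single-row groups have an undefined std; fill those gaps with 0.
    return pd.concat(parts, axis=1).fillna(0).reset_index()


raw = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "level_group": ["0-4", "0-4", "0-4", "0-4"],
    "event_name": ["click", "hover", "click", "click"],
    "elapsed_time": [10, 20, 30, 5],
})
features = feature_engineer(raw)
```

Building each aggregate as a named Series and concatenating once at the end keeps the loops over features and events short and makes every output column name explicit.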

Data Preparation and Model Training:

The next section covers data preparation and model training. It begins by importing the necessary modules: sklearn.model_selection, xgboost, and sklearn.metrics. The code sets the number of cross-validation splits for the GroupKFold class, then initializes an empty dataframe (oof) to store out-of-fold predictions and a dictionary (models) to store trained models.

The code then loops over the cross-validation folds. In each iteration, gkf.split() generates the train_index and test_index for the current fold. For each fold, the code defines the parameters for an XGBoost classifier and iterates over the question numbers, filtering the training and validation data by question number and level group. The XGBoost classifier is trained on the filtered data and evaluated on the validation set. The trained model is stored in the models dictionary, and its predictions on the validation set are stored in the oof dataframe.

A note on the feature engineering step: it uses loops to iterate over different features and data groups. It also generates binary features for specific events and sums event occurrences and elapsed time for each group.
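The cross-validation loop described above might look roughly like the sketch below. Everything data-related is synthetic and hypothetical: the feature names, the level group to question mapping, the labels, and the hyperparameters are placeholders, not the repository's values. The import falls back to scikit-learn's gradient boosting only so the sketch still runs where xgboost is not installed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

try:
    from xgboost import XGBClassifier as Classifier
except ImportError:  # fallback so the sketch runs without xgboost
    from sklearn.ensemble import GradientBoostingClassifier as Classifier

rng = np.random.default_rng(0)
sessions = np.arange(40)
QUESTIONS = range(1, 5)
level_groups = {"A": (1, 2), "B": (3, 4)}  # hypothetical: level group -> (first, last) question
FEATURES = ["f1", "f2"]                     # hypothetical feature names

# Synthetic features: one row per (session, level group), indexed by session.
df = pd.DataFrame(
    [(s, g, rng.normal(), rng.normal()) for s in sessions for g in level_groups],
    columns=["session_id", "level_group"] + FEATURES,
).set_index("session_id")

# Deterministic toy labels, so every training fold contains both classes.
targets = pd.DataFrame(
    [(s, q, int((s + q) % 3 != 0)) for s in sessions for q in QUESTIONS],
    columns=["session", "q", "correct"],
)

gkf = GroupKFold(n_splits=5)
oof = pd.DataFrame(0.0, index=sessions, columns=list(QUESTIONS))  # out-of-fold probabilities
models = {}
params = {"n_estimators": 20, "max_depth": 3, "learning_rate": 0.1}

# Grouping by session keeps all rows of a session on the same side of each split.
for fold, (train_index, test_index) in enumerate(gkf.split(df, groups=df.index)):
    train_sessions = df.index[train_index].unique()
    valid_sessions = df.index[test_index].unique()
    for grp, (first_q, last_q) in level_groups.items():
        grp_rows = df[df["level_group"] == grp]
        train_x = grp_rows.loc[train_sessions]
        valid_x = grp_rows.loc[valid_sessions]
        for q in range(first_q, last_q + 1):
            labels = targets[targets["q"] == q].set_index("session")["correct"]
            clf = Classifier(**params)
            clf.fit(train_x[FEATURES], labels.loc[train_x.index])
            # In this sketch the key is reused across folds, so the last fold's model is kept.
            models[(grp, q)] = clf
            oof.loc[valid_x.index, q] = clf.predict_proba(valid_x[FEATURES])[:, 1]
```

After the loop, every cell of oof holds a prediction made by a model that never saw that session during training, which is what makes the later threshold search an honest estimate.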

Evaluation and Threshold Optimization:

The next section covers evaluation and threshold optimization. It first creates a copy of the oof dataframe (true) to hold the true labels. A while loop then steps through candidate threshold values. For each threshold, the code calculates the F1 score from the predicted labels in the oof dataframe and the true labels in the true dataframe. The F1 score and threshold value are stored in separate lists (listA and listB). The loop also keeps track of the best F1 score and its corresponding threshold. After the loop, the code calculates the overall F1 score at the best threshold and prints it.
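A threshold sweep of this kind can be sketched as below. The search range, the step size, and the use of macro-averaged F1 are assumptions for illustration; the function name find_best_threshold and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score


def find_best_threshold(y_true, y_prob, start=0.40, stop=0.80, step=0.01):
    """Scan thresholds with a while loop and return the one with the highest F1."""
    scores, thresholds = [], []
    best_score, best_threshold = 0.0, start
    threshold = start
    while threshold <= stop:
        preds = (y_prob > threshold).astype(int)
        score = f1_score(y_true, preds, average="macro")  # macro averaging is an assumption
        scores.append(score)
        thresholds.append(threshold)
        if score > best_score:  # strict '>' keeps the lowest threshold among ties
            best_score, best_threshold = score, threshold
        # Rounding avoids floating-point drift from repeated addition.
        threshold = round(threshold + step, 4)
    return best_threshold, best_score, thresholds, scores


# Toy data: flattened true labels and out-of-fold probabilities.
y_true = np.array([0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.55, 0.65, 0.85, 0.90, 0.95])

best_threshold, best_score, thresholds, scores = find_best_threshold(y_true, y_prob)
print(f"best threshold {best_threshold:.2f}, F1 {best_score:.3f}")
```

Returning the full lists of thresholds and scores makes it easy to plot the curve and check that the chosen threshold sits on a plateau rather than a narrow spike.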

Testing and Prediction:

The final section of the code handles testing and prediction. It defines a dictionary (limits) that specifies the lower and upper question numbers for each level group. The code then loops over the test data and sample_submission. In each iteration, it performs feature engineering on the test data, retrieves the level group, and looks up the question number limits for that group in the limits dictionary. It then iterates over the question numbers within those limits, retrieves the trained model for the level group and question number, and predicts the probability of a correct answer for the test data. The 'correct' column in the sample_submission dataframe is set by comparing the predicted probability with the best threshold found during evaluation. Finally, the code submits the updated sample_submission dataframe through the environment by calling env.predict().
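The per-batch prediction step can be sketched as below. The environment that serves test batches and accepts env.predict() is not reproduced here, so the sketch only shows how one batch might be filled in. The limits values, the question column in sample_submission, the stand-in ConstantModel class, and the placeholder feature_engineer are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical limits: level group -> (lower, upper) question numbers, upper exclusive.
limits = {"A": (1, 3), "B": (3, 5)}
best_threshold = 0.6  # in practice, the value found during threshold optimization


class ConstantModel:
    """Stand-in for a trained classifier; always returns the same probability."""

    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        return np.tile([1 - self.p, self.p], (len(X), 1))


# Stand-in models keyed by (level group, question number).
models = {
    ("A", 1): ConstantModel(0.9), ("A", 2): ConstantModel(0.3),
    ("B", 3): ConstantModel(0.7), ("B", 4): ConstantModel(0.5),
}


def feature_engineer(test):
    # Placeholder: a single aggregate feature per session.
    return test.groupby("session_id")[["elapsed_time"]].mean()


def predict_batch(test, sample_submission):
    """Fill the 'correct' column for one batch (one session, one level group)."""
    features = feature_engineer(test)
    grp = test["level_group"].iloc[0]
    lower, upper = limits[grp]
    out = sample_submission.copy()
    for q in range(lower, upper):
        clf = models[(grp, q)]
        p = clf.predict_proba(features)[0, 1]  # probability of a correct answer
        out.loc[out["question"] == q, "correct"] = int(p > best_threshold)
    return out


test = pd.DataFrame({"session_id": [7, 7], "level_group": ["A", "A"], "elapsed_time": [1.0, 3.0]})
sample_submission = pd.DataFrame({"question": [1, 2], "correct": [0, 0]})
submission = predict_batch(test, sample_submission)
```

In the loop described above, each frame returned by a function like predict_batch would be the one passed to env.predict().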
