Thursday 9 a.m.–12:20 p.m.
Winning Machine Learning Competitions With Scikit-Learn
Ben Hamner
- Audience level: Intermediate
- Category: Science
Description
This tutorial will offer an introduction to machine learning and how to apply it to a Kaggle competition. We will cover methodologies that have worked well across a diverse set of problems, and then work on a current Kaggle competition together using IPython Notebook and scikit-learn. We will cover concepts including feature extraction, feature selection, model evaluation, and data visualization.
Abstract
Machine learning forms the core of many intelligent services we use today, including language translation, web search, movie recommendation, and spam detection. Python's ecosystem provides a high-quality array of tools for exploring these use cases and applying machine learning in production.
In this tutorial, we will provide a hands-on introduction to the concepts of machine learning and the process of applying these concepts in a competition setting. We'll start out with an overview of machine learning applications and how computers can learn from data. Then, we'll look at algorithms and methodologies that have been demonstrated to work well in a wide variety of applications, and what makes these algorithms tick.
For the bulk of the tutorial, we'll focus on a live Kaggle competition. We'll load the data into IPython Notebook for interactive exploration and visualization, and use this to gain a basic understanding of what's in the data. From there, we'll extract features and train a model using scikit-learn. This will bring us to our first Kaggle submission.
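As a rough illustration of the path from data to a first submission, here is a minimal sketch. It is not the tutorial's actual code: the competition data is replaced by a synthetic dataset from `make_classification`, the random forest is just one reasonable baseline, and the `id,prediction` submission layout is a hypothetical format (each Kaggle competition defines its own).

```python
import csv
import io

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data stands in for the competition's train/test files.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train = X[:200], y[:200]
X_test = X[200:]  # labels for these rows are treated as hidden

# Train a baseline model on the labeled rows.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predicted probability of the positive class for each unlabeled row.
test_probs = model.predict_proba(X_test)[:, 1]

# Hypothetical submission layout: a header plus one "id,prediction" row
# per test example, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "prediction"])
for row_id, prob in enumerate(test_probs):
    writer.writerow([row_id, f"{prob:.4f}"])
submission_csv = buffer.getvalue()
```

In a real competition, the synthetic arrays would be replaced by features extracted from the provided files, and the buffer contents would be written to disk in the format the competition specifies.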
Next, we'll switch out of IPython Notebook and start structuring the code for repeatability, using git for version control and Make for an explicit dependency graph. We'll learn how to structure the problem for offline evaluation and then use scikit-learn's clean model API to train many models simultaneously and perform feature selection and hyperparameter optimization.
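One way to combine offline evaluation, feature selection, and hyperparameter optimization is sketched below. It assumes a recent scikit-learn release (the `sklearn.model_selection` module) and uses synthetic data; the choice of `SelectKBest`, logistic regression, the parameter grid, and ROC AUC as the metric are illustrative assumptions, not the tutorial's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: synthetic data stands in for the competition's training set.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Feature selection and the classifier live in one pipeline, so selection
# is re-fit inside every cross-validation fold and does not leak
# information from the held-out rows.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: number of features kept and regularization strength.
param_grid = {
    "select__k": [5, 10, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

# 5-fold cross-validation serves as the offline evaluation: each of the
# 9 parameter combinations is scored on held-out folds.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

`GridSearchCV` also accepts an `n_jobs` argument for fitting candidate models in parallel, which is one way to train many models at once.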
At this point, we'll suggest ways to improve further on the problem and then finish with an hour-long lab, in which tutorial participants work individually or in groups to improve their methodologies and get advice as needed.
By the end of this tutorial, participants will have a basic understanding of how to identify problems where machine learning can add value, along with how to use machine learning and the Python ecosystem to address these problems. They will be able to apply these techniques to their work, hobbies, Kaggle competitions, and research.
Student Handout
No handouts have been provided yet for this tutorial.