When it comes to machine learning, what most people care about is getting the most out of models, and that usually means putting them into production. MLOps, which stands for Machine Learning Operations, is a set of processes that make it possible to design, train, evaluate, and deploy models.
This page serves as a comprehensive guide to MLOps. It gives a brief introduction to MLOps and why it is an important area of study, then lists relevant learning resources (such as courses, books, and papers), tools, and various MLOps communities. Here is a rough outline:
- Introduction to MLOps
- MLOps learning resources
  - Courses
  - Books
  - Papers
  - Blogs, people and communities
- MLOps tools landscape
- Conclusion
Machine learning evolves faster than other fields. New resources will appear, new tools will arrive, and existing tools will become irrelevant. This page will be updated continuously (that is my hope) to ensure you get a practical blueprint for learning MLOps.
What is MLOps and why is everybody talking about MLOps these days? As mentioned at the beginning of this guide, MLOps stands for Machine Learning Operations. It is a relatively new field (or a nascent field, according to Shankar et al.) that deals with operationalizing machine learning. MLOps roughly refers to a set of processes or methodologies designed to ensure reliable and efficient deployment of machine learning models in production.
An MLOps workflow contains at least six stages, which are discussed briefly below:
-
Data collection and cleaning: This is the first task in any typical machine learning project. Data collection refers to gathering data from various sources, such as databases, web scraping, and APIs. Once the data is collected, the next step is usually cleaning it, which can involve labeling and other data wrangling activities.
-
Feature engineering: Once data is collected and cleaned, you are not done with it yet. You may want to create new features, transform existing features, or extract new features from existing ones. That's what feature engineering is all about: using domain knowledge to create or transform features. Feature engineering is typically done after data cleaning and before building a model.
-
Building and training models: Model building is the step that follows data preprocessing. The kind of model and the tools you use roughly depend on the dataset and the problem. For instance, when working with structured data and performing discriminative tasks, your model may be something like a random forest, a support vector machine, or a linear model. When performing image classification on a large-scale dataset, your model may be based on convolutional neural networks (CNNs). When doing sentiment analysis, your model may be based on recurrent networks or a fine-tuned BERT. After designing the model, the next task is to train it. Model building and training are iterative processes that involve a lot of experimentation.
-
Model evaluation: During training, a model is evaluated on a validation set (which can be a portion of the training data that is held out from training) to compute performance metrics such as accuracy. Performance metrics depend on the problem and the dataset. For classification tasks, your metric can be accuracy. For regression, it can be mean squared error (MSE). Validation data are different from test data: validation data are used to evaluate the model during training, while test data are used to evaluate the final model after training. When the model is good enough, the next step is to deploy it.
-
Model deployment: Model deployment is the process of putting a model into production to make it easily accessible to users, developers, or other applications. It is one of the last stages in the machine learning life-cycle. A model can be deployed in the cloud (via cloud services) or on edge devices.
-
Model monitoring: This step involves watching the performance of the deployed model over time. The real world is messy and many things can go wrong. Over time, the input data can change (data drift) or the relationship between inputs and targets can change (concept drift), and either can cause model performance to decay. It's important to monitor the relevant pipelines (data or model related) and track key metrics so you know when things break or when a bug is causing mispredictions.
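To make the evaluation and monitoring stages above a bit more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not a production recipe: the function names (`split_dataset`, `accuracy`, `mean_squared_error`, `mean_shift_alert`), the 60/20/20 split, and the drift threshold are all hypothetical choices made for this example. In particular, the drift check is a crude heuristic that only compares the mean of one numeric feature; real monitoring setups typically use richer statistical tests and dedicated tooling.

```python
import random
import statistics


def split_dataset(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_shift_alert(reference, live, threshold=3.0):
    """Flag possible data drift for a single numeric feature.

    Assumption (hypothetical heuristic): drift is approximated by how many
    reference standard deviations the live mean has moved away from the
    reference mean. The threshold of 3.0 is an arbitrary example value.
    """
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    live_mean = statistics.mean(live)
    if ref_std == 0:
        # Constant reference feature: any change in the mean counts as a shift.
        return live_mean != ref_mean
    return abs(live_mean - ref_mean) / ref_std > threshold


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))

    # Validation metrics are computed during training, test metrics at the end.
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))

    # Compare a feature's training-time values against values seen in production.
    print(mean_shift_alert([10, 11, 9, 10, 10], [15, 16, 14]))
```

The validation split is what you would look at repeatedly while iterating on a model, whereas the test split is touched once for the final model. The drift check would run periodically on production data, with the training data serving as the reference.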
Designing machine learning models is an iterative process. The ML life-cycle stages discussed above give a rough overview of the tasks involved in shipping ML models, but the list is not exhaustive and some stages are not straightforward. While most of us spend our time building models, the model is a tiny part of the entire ML life-cycle, and there are other things that are not necessarily related to machine learning. Nothing depicts the complexity of MLOps better than the picture below.
The following is a list of learning resources. The list highlights courses, books, papers, blogs, and active MLOps communities.
-
Machine Learning Engineering for Production (MLOps) Specialization: This is inarguably one of the best MLOps courses out there. It is taught by Andrew Ng, Laurence Moroney, and Robert Crowe. The first course of the specialization walks you through the life-cycle of machine learning projects, while the remaining courses focus on designing data pipelines, designing model pipelines, and deploying models. The only caveat of the specialization (which might not be a caveat if you are working within the TF ecosystem) is that the last 3 courses are all about TensorFlow Extended. The entire specialization is available on Coursera, and its first course, Introduction to Machine Learning in Production, is available for free on YouTube.
-
Full Stack Deep Learning (FSDL): FSDL is inarguably the most practical of all the MLOps courses. Quoting the course website, "FSDL brings people together to learn and share best practices for the full stack: from problem selection, data management, and picking a GPU to web deployment, monitoring, and retraining." The course website provides more information about the course. The 2022 iteration of the course can be found here, and the course's YouTube channel contains the lectures (see the 2022 playlist).
-
CS 329S: Machine Learning Systems Design: CS 329S provides an iterative framework for developing and deploying reliable, scalable machine learning systems. The course covers a wide range of topics such as data management, data engineering, feature engineering, model selection approaches, training, scaling, deploying and monitoring ML systems, and the human side of ML projects. The lecture notes and slides are publicly available.
-
Effective MLOps: Model Development: This is a free course from the awesome WandB that teaches how to build end-to-end machine learning pipelines. The course can be found here. A similar course on CI/CD for Machine Learning (GitOps) can be found here.
-
MIT Introduction to Data-Centric AI (DCAI): This is the first-ever course on data-centric AI. While DCAI is a relatively new field too, the practices followed in DCAI are the same as those used in MLOps when working with data. The course materials can be found on the course website and the lecture videos here.
-
Made With ML: Made With ML contains resources for learning ML foundations and MLOps, all through intuitive explanations, clean code and visualizations. The Made With ML repository can be found here. A dedicated MLOps repository for learning "how to combine machine learning with software engineering to develop, deploy and maintain production ML applications" can be found here.
The following are a few popular books in the MLOps world. These books cover almost anything you'd want to know about MLOps. The books are not listed in any particular order, but if you are to pick one, take Chip Huyen's book on designing ML systems, MLE by Burkov, or Kleppmann's book.
-
Machine Learning Engineering by Andriy Burkov, free to read!
-
Rules of Machine Learning: Best Practices for ML Engineering, free to read!
As mentioned at the beginning of this guide, MLOps is a new field in industry and an even newer one in academia, which means there is not much academic literature on the topic. Below, we list a few papers that are worth reading. If there is a paper we missed, feel free to contact me on Twitter.
-
A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners: ArXiv | Mar 2023
-
Operationalizing Machine Learning: An Interview Study: ArXiv | Tweet | Video, Transcript | Sep 2022
-
Machine Learning Operations (MLOps): Overview, Definition, and Architecture: ArXiv | May 2022
-
Adoption and Effects of Software Engineering Best Practices in Machine Learning: ArXiv | Jul 2020
-
Hidden Technical Debt in Machine Learning Systems: Paper | 2015
There are way too many blogs about MLOps. In this section, instead of listing individual blog titles, we provide a short list of popular blogs (and authors) that write about MLOps or related topics. Related communities are also included.
-
Chip Huyen Blog: Chip Huyen writes a lot about designing machine learning systems and putting them in production. Her popular posts related to MLOps include Machine Learning Tools Landscape v2 (+84 new tools), Real-time machine learning: challenges and solutions, and Data Distribution Shifts and Monitoring.
-
Eugene Yan: Eugene writes about designing and operating machine learning systems. His popular writings about ML in production can be found here. He also maintains a list of papers and tech blogs about applied ML and is the creator of applying ML.
-
Lj Miranda Notebook: Miranda documents his experiments and shares study notes on different topics. His popular posts about MLOps are How to improve software engineering skills as a researcher and Navigating the MLOps tooling landscape (Part 1: The Lifecycle, Part 2: The Ecosystem, Part 3: The Strategies).
-
MLOps Community Blog: MLOps Community is one of the best MLOps communities out there, with exclusive blog posts on the topic. A few examples of their posts: The Minimum Set of Must-Haves for MLOps, MLOps is 98% Data Engineering, and A Practitioner’s Guide to Monitoring Machine Learning Applications. In addition to the blog, MLOps Community has a great podcast.
-
Software Engineering for Machine Learning: SE ML collects, validates and shares machine learning engineering best practices. You can check their Engineering best practices for Machine Learning and The 2020 State of Engineering Practices for Machine Learning.
-
Awesome MLOps provides a curated list of awesome MLOps tools.
Mature fields tend to have standard tools that every developer can readily point to when asked. MLOps is not like that yet. There are way too many tools since everybody is trying to contribute, and it's almost impossible to list every one. We hope the tools will mature over time. In the meantime, I think it's good to follow the tools that people in MLOps recommend. You can find them in the blogs mentioned above, such as Machine Learning Tools Landscape v2 (+84 new tools) and Navigating the MLOps tooling landscape (Part 1: The Lifecycle, Part 2: The Ecosystem, Part 3: The Strategies). Shankar et al. and Kreuzberger et al. also provide lists of MLOps tools in their papers Operationalizing Machine Learning: An Interview Study and Machine Learning Operations (MLOps): Overview, Definition, and Architecture, respectively.
Thanks for checking out this MLOps guide. MLOps is a huge and interdisciplinary field that combines best practices from machine learning, software engineering, and data engineering. We shared lots of resources, and we understand you can't go through them all. Taking one course can get you started. Reading one book can get you started. Doing one project can get you started. Reading one blog can help you learn something new.
If there is a course or paper or blog or book that you think should be added to this guide, feel free to reach out on Twitter.
This guide is part of the Complete Machine Learning Package.