Local stack for starting ML model and data development
The Machine Learning Local Stack (MLLS) is designed for Data Scientists and Machine Learning Engineers to develop and prototype data and models, in a manner that parallels practices in a professional setting.
In particular, this manner of development includes:
- Orchestration of ELT (extract, load, transform) operations with Dagster & dbt
- Training and inference of models using Sagemaker in local Docker containers
- Analytics Dashboarding with Metabase
- Local database implementation with duckdb
- Separated Python environments using Pipenv
Working Data Scientists and Machine Learning Engineers may find this tool useful for prototyping models and then breaking down the requirements for their Data Engineering and Software Development teams to implement. Unlike a Jupyter notebook, which may conflate the requirements of data processing, model training, inference, and analysis, this process keeps these steps separate so that each can be implemented elsewhere as desired.
Students may also find this tool useful for education and for familiarizing themselves with the process and technical tools used to develop data and a machine learning model.
The MLLS is built using Visual Studio Code's dev container feature. This feature uses Docker containers to standardize the environment that runs everything you do in that VS Code context, so the experience does not depend on your particular OS and hardware.
The particular dev container used by MLLS is called "Docker in Docker" and allows us to run Docker containers within the Docker container that is actually running our dev container (a la the movie Inception). This feature is utilized to run local Sagemaker containers for training and inference, as well as a Metabase container for analytics and visualizations.
- Install Visual Studio Code
- Install the Visual Studio Code extension - Dev Containers
- Install Docker
- Confirm Docker is installed by executing the command docker -v
- Open VSCode to a New Window and open this repository's directory
- You may see a notification that the folder contains a Dev Container configuration file. If so, click on "Reopen in Container"
- If you do not see this notification, press the F1 key and begin typing "Dev Containers: Rebuild and Reopen in Container" until that option appears, then select it.
- This action will reopen VSCode within a Docker container suitable for developing and locally running the application.
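The docker -v check in the steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `tool_version` is hypothetical and is not part of this repository.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(command: str, flag: str = "-v") -> Optional[str]:
    """Return the first line printed by `<command> <flag>`, or None if unavailable.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(command) is None:
        return None
    try:
        result = subprocess.run(
            [command, flag],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None
```

Calling `tool_version("docker")` returns the version line when Docker is installed and on the PATH, and `None` otherwise.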
New to dagster-dbt?
- Check out the tutorial
- A dagster-dbt project has already been scaffolded in this repo - dagster-dbt
- Dagster project containing the code for dagster orchestration, scaffolded to automatically import the dbt project code
- dbt project scaffolded with some example SQL models (note that "model" has a distinct meaning in a Data Engineering context).
- The Python environment used for dagster-dbt is defined by a Pipfile and can be updated via pipenv
- To begin local dagster development, execute ./start_dagster.sh, which starts the dagster webserver in development mode within the dev container and forwards a port locally to http://localhost:3000/, where you can execute dagster processes and view their logs.
- All data will be stored in a duckdb file, dev.duckdb, which is not checked into source control and is ignored by git.
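After starting the webserver, it can take a moment before http://localhost:3000/ responds. The sketch below polls a URL until it answers, using only the Python standard library; the function name `wait_for_http` and its default timings are assumptions for illustration and are not part of this repository.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll `url` until it returns a successful response or `timeout` seconds pass.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # urlopen raises for refused connections and for HTTP error statuses.
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except (urllib.error.URLError, OSError):
            pass  # Not reachable yet; retry until the deadline.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_http("http://localhost:3000/")` returns `True` once the dagster webserver is serving requests, and `False` if it does not respond within the timeout.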
Metabase is an analytics tool that you can run locally. To start up your local Metabase, run ./start_metabase.sh
First time running
- Navigate through the on-screen setup process. The details you enter are used to register you as a user of your local Metabase instance (stored in metabase-data)
- When you get to "Add your data", select "DuckDb"
- Display name: dev.duckdb
- Database file: /database/dev.duckdb
- Enable: Establish a read-only connection
- Choose "Connect Database"
- The data created by dagster-dbt is now accessible in your metabase instance.
The development container uses a SageMaker-specific pipenv Python environment which contains the sagemaker and sagemaker-local Python packages. This environment allows us to use a Python script to run specified local Docker containers combined with the relevant model Python code and data.
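As a rough illustration of what such a script can look like, the sketch below uses the SageMaker Python SDK's TensorFlow estimator with `instance_type="local"`, which runs training and serving in local Docker containers. It is not the script shipped in this repository: the function names, entry point, role ARN, framework and Python versions, and data URI are all assumptions to be supplied by the caller.

```python
from typing import Any, Dict


def local_estimator_kwargs(
    entry_point: str,
    source_dir: str,
    role: str,
    framework_version: str,
    py_version: str,
) -> Dict[str, Any]:
    """Build keyword arguments for a SageMaker estimator running in local mode.

    Hypothetical helper; all argument values are chosen by the caller.
    """
    return {
        "entry_point": entry_point,      # training script (assumed name)
        "source_dir": source_dir,        # directory containing the script
        "role": role,                    # local mode accepts a placeholder role ARN
        "instance_count": 1,
        "instance_type": "local",        # run in a local Docker container
        "framework_version": framework_version,
        "py_version": py_version,
    }


def train_and_serve_locally(kwargs: Dict[str, Any], training_data_uri: str):
    """Train in a local container, then deploy a local endpoint.

    Requires the sagemaker package and a running Docker daemon.
    """
    # Imported lazily so the helper above works without sagemaker installed.
    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(**kwargs)
    # In local mode, input channels can point at file:// URIs.
    estimator.fit({"training": training_data_uri})
    return estimator.deploy(initial_instance_count=1, instance_type="local")
```

A caller would pass something like `local_estimator_kwargs("train.py", "code", "<placeholder role ARN>", "<framework version>", "py39")` and a `file://` URI for the training data, then call `delete_endpoint()` on the returned predictor when finished.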
The sagemaker_example is based on https://github.com/aws-samples/amazon-sagemaker-local-mode/tree/3ff2ac5f687db27c17c7046e4d4a7d6f5f3323ea/tensorflow_script_mode_local_training_and_serving