
feat: onnx runtime shared sessions #430

Merged · 27 commits · Nov 1, 2024
Conversation

kallebysantos (Contributor)

This PR is an adapted part of the #368 work, which was closed due to a change in proposal.

What kind of change does this PR introduce?

Feature, Enhancement

What is the current behavior?

Model sessions are eagerly evaluated and do not survive across worker life-cycles.

What is the new behavior?


This PR introduces shared-session logic along with other ort improvements, such as GPU support and optimizations.

👷 It's also a foundation for an owned ONNX runtime that can be integrated directly with the huggingface/transformers.js library, which will allow better inference without the need to couple models like gte-small into the edge-runtime image.

Tester Docker image:

You can pull a Docker image of this PR from Docker Hub:

# default runtime
docker pull kallebysantos/edge-runtime:latest

# gpu with cuda provider
docker pull kallebysantos/edge-runtime:latest-cuda

Session lifecycle:

This PR introduces a lazy map of ort::Sessions: each session is loaded once and then shared between worker cycles.
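
For illustration, here is a minimal Rust sketch of how such a lazy shared map could look. This is not the PR's actual code: the names sessions and get_or_create_session are hypothetical, and the Session builder methods follow the ort 2.x crate API, which may differ between versions.

use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

use ort::Session;

// Global, lazily-initialized map of model path -> shared session.
static SESSIONS: OnceLock<Mutex<HashMap<String, Arc<Session>>>> = OnceLock::new();

fn sessions() -> &'static Mutex<HashMap<String, Arc<Session>>> {
    SESSIONS.get_or_init(|| Mutex::new(HashMap::new()))
}

// The first caller loads the model; later callers clone the same Arc,
// so the session survives across worker life-cycles.
fn get_or_create_session(model_path: &str) -> ort::Result<Arc<Session>> {
    let mut map = sessions().lock().unwrap();
    if let Some(session) = map.get(model_path) {
        return Ok(Arc::clone(session));
    }
    let session = Arc::new(Session::builder()?.commit_from_file(model_path)?);
    map.insert(model_path.to_owned(), Arc::clone(&session));
    Ok(session)
}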

Cleaning up sessions:
Each ort::Session is held behind an Arc smart pointer and will only be dropped when no consumer is attached to it; to trigger this, users must explicitly call the EdgeRuntime.ai.tryCleanupUnusedSession() method.

NOTE: This method is only available for the main worker

// cleanup unused sessions every 30s
setInterval(async () => {
  const { activeUserWorkersCount } = await EdgeRuntime.getRuntimeMetrics();
  if (activeUserWorkersCount > 0) {
    return;
  }
  try {
    const cleanupCount = await EdgeRuntime.ai.tryCleanupUnusedSession();
    if (cleanupCount == 0) {
      return;
    }
    console.log('EdgeRuntime.ai.tryCleanupUnusedSession', cleanupCount);
  } catch (e) {
    console.error(e.toString());
  }
}, 30 * 1000);
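
On the Rust side, the cleanup can be pictured as dropping every map entry that no worker still references. Continuing the hypothetical sketch above (the real tryCleanupUnusedSession implementation may differ):

// A session whose Arc strong count is 1 is referenced only by the
// map itself; no worker is using it, so it can be removed safely.
fn try_cleanup_unused_sessions() -> usize {
    let mut map = sessions().lock().unwrap();
    let before = map.len();
    map.retain(|_, session| std::sync::Arc::strong_count(session) > 1);
    before - map.len()
}

Dropping the last Arc frees the underlying ort session and its memory.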

GPU Support:

GPU support allows session inference on specialized hardware and is backed by CUDA. There is nothing for the end user to configure; just request the Session for gte-small. To enable GPU inference, however, the Dockerfile now has two main build stages (the target must be specified during docker build):

edge-runtime (CPU only):
This stage builds the default edge-runtime, where ort::Sessions are loaded using the CPU.

docker build --target "edge-runtime" .

Resulting image size: ~150 MB

edge-runtime-cuda (GPU/CPU):
This stage builds the default edge-runtime on an nvidia/cuda base image, allowing sessions to load on the GPU or on the CPU (as fallback).

docker build --target "edge-runtime-cuda" .

Resulting image size: ~2.20 GB
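
For reference, registering the CUDA provider with a CPU fallback might look like this with the ort crate. This is a hedged sketch, not code from this PR; the provider types and builder methods follow the ort 2.x API and may vary by version.

use ort::{CPUExecutionProvider, CUDAExecutionProvider, Session};

fn build_session(model_path: &str) -> ort::Result<Session> {
    // Providers are tried in registration order: CUDA first,
    // falling back to CPU when no usable GPU is present.
    Session::builder()?
        .with_execution_providers([
            CUDAExecutionProvider::default().build(),
            CPUExecutionProvider::default().build(),
        ])?
        .commit_from_file(model_path)
}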

Each stage needs to install the appropriate onnx-runtime. To that end, install_onnx.sh has been updated with a 4th parameter, the --gpu flag, which downloads a CUDA build from the official microsoft/onnxruntime repository.

Using GPU image:

To use the GPU image, the docker-compose file must include the following properties for the functions service:

services:
  functions:
    # Built as described above
    image: supabase/edge-runtime:latest-cuda
    # or built directly by compose
    build:
      context: .
      dockerfile: Dockerfile
      target: edge-runtime-cuda

    # Required to use the GPU inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Change here if more devices are installed
              capabilities: [gpu]

IMPORTANT NOTE: The target infrastructure must have the NVIDIA Container Toolkit installed to allow GPU support inside Docker.


Final considerations:

As described above, this is adapted work from #368, from which we split out only the core features that improve ort support for edge-runtime.

Finally, thanks to @nyannyacha, who helped me a lot 🙏

@nyannyacha (Collaborator)

👷 It's also a foundation for an owned ONNX runtime that can be integrated directly with the huggingface/transformers.js#947 library, which will allow better inference without the need to couple models like gte-small into the edge-runtime image.

If so, are you also preparing another PR after this one is merged?
Overall, the PR looks good. 😁

@nyannyacha (Collaborator)

Anyway, I'll be testing this locally with k6 soon.
If there are any issues, I'll let you know. 😋

@kallebysantos force-pushed the ai branch 2 times, most recently from 57d6ccd to 8601949 (October 31, 2024 16:33)
kallebysantos and others added 23 commits October 31, 2024 16:35
`install_onnx` script now supports `--gpu` flag to download runtime with
cuda provider

Signed-off-by: kallebysantos <[email protected]>
Signed-off-by: kallebysantos <[email protected]>
- Using `HashMap` allows reusing sessions between requests.

Signed-off-by: kallebysantos <[email protected]>
@nyannyacha (Collaborator) left a comment
LGTM 👍

I've load-tested locally with k6 and it seems to be working fine.

Many thanks for your time and effort in bringing this feature to us!
I look forward to seeing you again in your follow-up PR.

cc @laktek

@laktek merged commit fc80ebb into supabase:main on Nov 1, 2024
3 checks passed