Default AMI fails to detect NVIDIA driver on AWS g6e #1480

Open
@OLSecret

Description

The job container fails to start with this error:

Digest: sha256:5e8ed922ecacdb1071096eebef5af11563fd0c2c8bce9143ea3898768994080f
  Status: Downloaded newer image for iterativeai/cml:0-dvc3-base1-gpu
  docker.io/iterativeai/cml:0-dvc3-base1-gpu
  /usr/bin/docker create --name 41bde5f6557b4c82bb0400b08e5ca5b0_iterativeaicml0dvc3base1gpu_78f5fb --label 380bf3 --workdir /__w/SecretModels/SecretModels --network github_network_5168857de2994b2fabc54139db02ee1f --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work":"/__w" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/externals":"/__e":ro -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp":"/__w/_temp" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_actions":"/__w/_actions" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_tool":"/__w/_tool" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_home":"/github/home" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" iterativeai/cml:0-dvc
  215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  /usr/bin/docker start 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
  Error: failed to start containers: 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  Error: Docker start fail with exit code 1

It happens with a workflow like this:

name: model-style-train-on-manual_call

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Hugging Face model name to use for training'
        required: true
        default: 'euclaise/gpt-neox-122m-minipile-digits'

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v2
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.CML_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CML_AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
              --cloud=aws \
              --cloud-hdd-size=256 \
              --cloud-region=us-west-2 \
              --cloud-type=g6e.xlarge \
              --cloud-gpu=v100 \
              --labels=cml-gpu

  run:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    container:
      image: docker://iterativeai/cml:0-dvc3-base1-gpu
      options: --gpus all
    timeout-minutes: 40000
    permissions:
      contents: read
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/checkout@v3
      - uses: robinraju/release-downloader@v1
        with:
          tag: 'style'
          fileName: '*.jsonl'
      - name: Train models
        env:
          GITHUB_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          REPO_TOKEN: ${{ github.token }}
          DEBIAN_FRONTEND: noninteractive
          MODEL_NAME: ${{ github.event.inputs.model_name }}
        run: |
          echo $NODE_OPTIONS
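
The `nvml error: driver not loaded` message is raised by `nvidia-container-cli` on the runner host, before the container starts, which suggests the driver is missing or not loaded on the instance itself rather than inside the image. A minimal sketch for checking this, assuming shell access to the EC2 instance; the function name `check_nvidia_driver` is hypothetical and not part of CML:

```shell
#!/bin/sh
# Hypothetical helper (not part of CML): reports whether the NVIDIA
# kernel driver appears to be loaded on the host.
check_nvidia_driver() {
  # /proc/driver/nvidia/version is only present while the nvidia
  # kernel module is loaded.
  if [ -r /proc/driver/nvidia/version ]; then
    echo "driver loaded"
    cat /proc/driver/nvidia/version
  else
    echo "driver not loaded"
  fi

  # Print the nvidia-smi summary only when the tool is installed.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed"
  else
    echo "nvidia-smi not found"
  fi
}

check_nvidia_driver
```

If this prints `driver not loaded` on the host, `--gpus all` cannot work in any container on that instance, whichever image is used.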
