Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add CUDA 12.5 images #689

Merged
merged 5 commits into from
Jul 20, 2024
Merged

Conversation

jameslamb
Copy link
Member

Contributes to rapidsai/build-planning#73

Proposes the following:

  • adding CUDA 12.5 images
  • PR builds:
    • +1 test job covering (cuda=12.5, arch=arm64)
  • branch builds:
    • +1 test job covering (cuda=12.5, arch=x86_64)
    • -1 test job covering (cuda=12.2, arch=arm64)

context: rapidsai/build-planning#73 (comment)

Notes for Reviewers

Per offline discussion with @raydouglass , this would be accompanied by a deprecation notice in the RAPIDS 24.08 release stating that the CUDA 12.2 images will be removed in some future release (future release not yet determined).

@jameslamb jameslamb added the 2 - In Progress Currenty a work in progress label Jul 16, 2024
@jameslamb jameslamb requested a review from a team as a code owner July 16, 2024 20:30
Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks James! 🙏

Am curious what you think about swapping where the x86_64 and aarch64 jobs are

For more context CUDA 12.2 added some interesting things like cuFile (for ARM) and HMM, which are helpful to test on CUDA 12.2 ARM specifically. Am not aware of a similar change in CUDA 12.5 (though could be missing it)

Given there are fewer ARM machines than x86_64, it may make sense to use x86_64 primarily in PRs and run ARM more on branches. That would mean less ARM jobs are needed during PR development

Anyways please let me know what you think 🙂

matrix-test.yaml Outdated Show resolved Hide resolved
matrix-test.yaml Outdated Show resolved Hide resolved
matrix-test.yaml Outdated Show resolved Hide resolved
@jameslamb
Copy link
Member Author

Thanks @jakirkham ! That's helpful context I didn't have. I just applied your suggestions.

@jakirkham
Copy link
Member

Updating to pull in PR ( #690 ) now that is in branch-24.08

@jameslamb
Copy link
Member Author

See this

9.818 Running dfg on cuspatial
9.939 Traceback (most recent call last):
9.939   File "/opt/conda/bin/rapids-dependency-file-generator", line 10, in <module>
9.939     sys.exit(main())
9.939              ^^^^^^
9.939   File "/opt/conda/lib/python3.11/site-packages/rapids_dependency_file_generator/_cli.py", line 121, in main
9.939     make_dependency_files(
9.939   File "/opt/conda/lib/python3.11/site-packages/rapids_dependency_file_generator/_rapids_dependency_file_generator.py", line 421, in make_dependency_files
9.940     raise ValueError(f"No matching matrix found in '{include}' for: {matrix_combo}")
9.940 ValueError: No matching matrix found in 'cuda_version' for: {'cuda': '12.5', 'arch': 'x86_64', 'py': '3.9'}

(build link)

This will be broken until rapidsai/cuspatial#1405 is merged. And that is blocked until CI in cuspatial is fixed (which could be done by merging rapidsai/cuspatial#1407).

Copy link

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one quick question. Approved so it doesn't block merging.

Dockerfile Show resolved Hide resolved
Comment on lines 5 to 6
ARG LINUX_DISTRO=ubuntu
ARG LINUX_DISTRO_VER=22.04
Copy link
Member

@jakirkham jakirkham Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This need not be done in this PR. Just the discussion above ( #689 (comment) ) made me wonder

Should we do the same thing for the Linux version?

Suggested change
ARG LINUX_DISTRO=ubuntu
ARG LINUX_DISTRO_VER=22.04
ARG LINUX_DISTRO=unset
ARG LINUX_DISTRO_VER=unset

If this sounds reasonable, happy to send a 2nd PR with this change. Also ok not doing this if we don't think it makes sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes in my opinion, this is a pattern that should be used in any Dockerfile where it's unacceptable to silently fall back to default values if configuration is not passed.

I just didn't want to go any further with it in this PR and grow the diff here.

I'd support a 2nd PR with this change.

@jakirkham
Copy link
Member

jakirkham commented Jul 19, 2024

One of the 11.8.0 tests failed on CI. Restarted it

That said, we are not changing 11.8.0 builds here and all other builds pass

Edit: Actually that won't work since the images get deleted after

@jakirkham
Copy link
Member

Going to try restarting from scratch. If that doesn't work, we can discuss next steps on Monday

@raydouglass
Copy link
Member

/merge

@rapids-bot rapids-bot bot merged commit d240fc3 into rapidsai:branch-24.08 Jul 20, 2024
36 of 46 checks passed
@jakirkham
Copy link
Member

Thanks all! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currenty a work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants