cjac commented Jul 26, 2024

gpu/README.md:

  • Clarified which configurations are supported

gpu/install_gpu_driver.sh

  • Uses OS comparison functions instead of string comparisons
  • Makes the updates needed to get tests passing again; verified and updated a number of URLs and installation instructions

gpu/test_gpu.py:

  • Tests which confirm successful runs of nvidia-smi now pass
  • Tests which confirm GPU monitoring agent installation and health now pass
  • Tests to confirm installation of the full CUDA 11 and 12 stacks now pass
  • Tests to confirm MIG functionality are disabled
  • Tests to confirm launching of spark jobs with access to gpu are disabled
  • Tests to confirm xgboost functionality are disabled
  • Tests are no longer skipped on Rocky
  • Tests no longer exercise CUDA 10 use cases

cloudbuild/presubmit.sh:

  • The full test suite run is currently disabled so that I can modify shared resources while running only the GPU tests. The continue statement should be removed from this change before submitting.

integration_tests/dataproc_test_case.py:

  • switching to pd-ssd
  • disabling secure boot for gpu-related actions
  • max-age changes from 2h to 30m
  • max-idle set to 5m
  • boot disk size reduced to 50GB
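As a rough illustration of the settings listed above, the sketch below builds a flag list for `gcloud dataproc clusters create`. It is not the code in integration_tests/dataproc_test_case.py: the function name `build_cluster_args` and its parameters are hypothetical, and applying the boot disk settings to both master and worker nodes, and placing `--max-idle` under the same cleanup condition as `--max-age`, are assumptions made for the example.

```python
from typing import List


def build_cluster_args(skip_cleanup: bool = False, gpu_action: bool = True) -> List[str]:
    """Return illustrative cluster-creation flags (hypothetical helper)."""
    args = [
        # pd-ssd boot disks, reduced to 50GB
        "--master-boot-disk-type=pd-ssd",
        "--worker-boot-disk-type=pd-ssd",
        "--master-boot-disk-size=50GB",
        "--worker-boot-disk-size=50GB",
    ]
    if gpu_action:
        # Secure boot disabled for GPU-related actions
        args.append("--no-shielded-secure-boot")
    if not skip_cleanup:
        # Clusters delete themselves after 30m, or after 5m idle
        args.append("--max-age=30m")
        args.append("--max-idle=5m")
    return args


if __name__ == "__main__":
    print(" ".join(build_cluster_args()))
```

Passing `skip_cleanup=True` leaves out both lifetime flags, so a cluster kept for debugging would not be deleted automatically.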

gpu/install_gpu_driver.sh
* replace add_nonfree_components with add_contrib_components
* perform the apt-get update outside of the update function
* using OS comparison functions instead of string comparisons
* small grammar fix

gpu/test_gpu.py
* only testing 11.8 and 12.4 CUDA variants
* removed conditional test skips
* using g2 machine type and GPU_L4 when testing mig
@cjac cjac marked this pull request as draft July 26, 2024 21:29

cjac commented Jul 26, 2024

Manually tested:

✅ 2.2-rocky9 with secure boot DISabled, CUDA=11.8
✅ 2.2-rocky9 with secure boot ENabled, CUDA=12.4
✅ 2.1-rocky8
✅ 2.0-rocky8
✅ 2.2-ubuntu22 with secure boot DISabled, CUDA=11.8
✅ 2.2-ubuntu22 with secure boot ENabled, CUDA=12.4
✅ 2.1-ubuntu20
✅ 2.0-ubuntu18
✅ 2.2-debian12 with secure boot DISabled, CUDA=11.8
✅ 2.2-debian12 with secure boot ENabled, CUDA=12.4,11.8
✅ 2.1-debian11 with CUDA=12.4,11.8
✅ 2.0-debian10 with CUDA=12.4,11.8

cjac added 3 commits July 26, 2024 15:28
* redirecting to file descriptor 2 instead of the file named 2
* removing nvidia module if it is installed
* adding safety checks around the use of nvidia-smi
* using appropriate mok file paths
* removed uname -r suffix from kernel-devel package name for rocky

cjac commented Jul 27, 2024

/gcbrun

cjac commented Jul 27, 2024

It looks like 2.2-ubuntu22 is failing to build the 520 driver
gs://dataproc-108de5de-43c2-4a4b-979a-adebc15a58a8-us-central1/google-cloud-dataproc-metainfo/51abc585-86ef-4ec2-b84c-8b224f4bb899/test-gpu-standard-2-2-20240727-003840-g5sl-m/dataproc-initialization-script-0_output

cjac commented Jul 27, 2024

/gcbrun

cjac commented Jul 27, 2024

/gcbrun

cjac commented Jul 27, 2024

/gcbrun

cjac added 5 commits July 27, 2024 17:07
modified:   gpu/install_gpu_driver.sh
* added is_debian10
* corrected codename ; it should be oldoldstable
* using is_$distname() functions a bunch
function install_nvidia_cudnn
* installing libcudnn9 on rocky9
function install_nvidia_gpu_driver
* redirect output of module build to log files
function hadoop_properties
* using /usr/local/bin/bdconfig for bdconfig command so this script
  works under sudo -i on a repro cluster
modified:   gpu/test_gpu.py
* skipping all tests for now to ease debugging with ci infrastructure
* removed logic to skip rocky builds
* reduced disk size from 200G to 50G since we're using pd-ssd for the
  ci project now, and we want to be careful with that expensive
  resource
modified:   gpu/install_gpu_driver.sh
* cleaned up variable names
* broke out nccl shortname from normal shortname
* specifying versions when installing driver and cudatools on rocky

cjac commented Jul 28, 2024

/gcbrun

cjac commented Jul 28, 2024

/gcbrun

cjac commented Jul 28, 2024

/gcbrun

cjac commented Jul 28, 2024

Nice. GPU tests are passing again. Kind of. :-)

cjac commented Jul 28, 2024

/gcbrun

added some whitespace to spark arguments to clarify what is happening

cjac commented Jul 31, 2024

/gcbrun

cjac commented Aug 1, 2024

Next round of manual tests

✅ 2.0-debian10 with CUDA=11.8,12.4
✅ 2.0-ubuntu18 with CUDA=12.4,11.8
✅ 2.0-rocky8 with CUDA=12.4
✅ 2.1-debian11 with CUDA=12.4,11.8
✅ 2.1-ubuntu20 with CUDA=11.8,12.4
✅ 2.1-rocky8 with CUDA=12.4,11.8
✅ 2.2-debian12 with CUDA=12.4,11.8
✅ 2.2-ubuntu22 with CUDA=12.4,11.8
❓ 2.2-rocky9 - kernel-devel was removed from the repos again

cjac commented Aug 1, 2024

/gcbrun

cjac commented Aug 2, 2024

/gcbrun


  if not FLAGS.skip_cleanup:
-   args.append("--max-age=2h")
+   args.append("--max-age=30m")
Member

How long do the GPU tests take to finish? I thought cluster creation itself takes a long time, around 30m.

@cjac cjac Aug 2, 2024

The startup script and init action take about 10-16 minutes from gcloud command issued to cluster in active state. But I'm either merging changes to files outside of gpu/ into Prince's changes or discarding them. I'm discarding this one, but would keep it in gpu/test_gpu.py if I could.

@cjac cjac Aug 2, 2024

Oh, I see what you're saying. The tests don't take long to run. The delays come mostly from gcloud compute ssh. Once we turn the MIG tests back on, we need to account for nvidia-smi being very slow on systems with 8 attached H100s. The commands are simple and should not take more than 30s each to execute.

The longest measured duration of fetching, building, installing, and signing kernel modules, plus installing about 10GB of binary support libraries, is about 16 minutes. Adding 30s for each of about 10 tests gives 21 minutes. We could try reducing it to 25m and see if all tests pass, but I think leaving five minutes of headroom for congested networks or retries will give us a more reliable pass rate.
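The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures quoted in the comment (16 minutes of setup, about 10 tests at 30s each); the constant and function names are made up for the example.

```python
SETUP_MINUTES = 16      # longest measured driver build and install time
TEST_COUNT = 10         # approximate number of tests
SECONDS_PER_TEST = 30   # upper bound assumed per test command


def estimated_minutes(setup_minutes: float, test_count: int, seconds_per_test: float) -> float:
    """Setup time plus total test time, in minutes."""
    return setup_minutes + test_count * seconds_per_test / 60


estimate = estimated_minutes(SETUP_MINUTES, TEST_COUNT, SECONDS_PER_TEST)
print(f"estimated run time: {estimate:g} minutes")

# Compare the estimate against the two --max-age values under discussion.
for max_age in (25, 30):
    print(f"--max-age={max_age}m leaves {max_age - estimate:g} minutes of headroom")
```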

@cjac cjac Aug 2, 2024

Here are stats from the most recent run

Step #2 - "dataproc-2.0-debian10-tests": 2024-08-02T19:27:53.262772599Z //gpu:test_gpu                                                           PASSED in 2313.8s
Step #3 - "dataproc-2.0-rocky8-tests": 2024-08-02T18:58:06.475604281Z //gpu:test_gpu                                                           PASSED in 376.8s
Step #5 - "dataproc-2.1-debian11-tests": 2024-08-02T19:29:09.794418578Z //gpu:test_gpu                                                           PASSED in 2401.5s
Step #6 - "dataproc-2.1-rocky8-tests": 2024-08-02T18:55:34.482352277Z //gpu:test_gpu                                                           PASSED in 268.7s
Step #7 - "dataproc-2.1-ubuntu20-tests": 2024-08-02T19:28:33.297038722Z //gpu:test_gpu                                                           PASSED in 2473.7s
Step #8 - "dataproc-2.2-debian12-tests": 2024-08-02T19:19:00.625695199Z //gpu:test_gpu                                                           PASSED in 1917.9s
Step #9 - "dataproc-2.2-rocky9-tests": 2024-08-02T18:55:36.599559802Z //gpu:test_gpu                                                           PASSED in 264.5s
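To compare these durations more easily, the log lines can be parsed and converted to minutes. The snippet below is a sketch that assumes the line layout shown above (step name in quotes, a timestamp, the test target, then a status and a duration in seconds); the regular expression and function name are illustrative, not part of the test tooling.

```python
import re

LOG_LINES = [
    'Step #2 - "dataproc-2.0-debian10-tests": 2024-08-02T19:27:53.262772599Z //gpu:test_gpu   PASSED in 2313.8s',
    'Step #3 - "dataproc-2.0-rocky8-tests": 2024-08-02T18:58:06.475604281Z //gpu:test_gpu   PASSED in 376.8s',
    'Step #5 - "dataproc-2.1-debian11-tests": 2024-08-02T19:29:09.794418578Z //gpu:test_gpu   PASSED in 2401.5s',
    'Step #6 - "dataproc-2.1-rocky8-tests": 2024-08-02T18:55:34.482352277Z //gpu:test_gpu   PASSED in 268.7s',
    'Step #7 - "dataproc-2.1-ubuntu20-tests": 2024-08-02T19:28:33.297038722Z //gpu:test_gpu   PASSED in 2473.7s',
    'Step #8 - "dataproc-2.2-debian12-tests": 2024-08-02T19:19:00.625695199Z //gpu:test_gpu   PASSED in 1917.9s',
    'Step #9 - "dataproc-2.2-rocky9-tests": 2024-08-02T18:55:36.599559802Z //gpu:test_gpu   PASSED in 264.5s',
]

# Quoted step name, timestamp, test target, status, duration in seconds.
LINE_PATTERN = re.compile(
    r'"(?P<step>[^"]+)":\s+\S+\s+(?P<target>//\S+)\s+'
    r'(?P<status>[A-Z]+) in (?P<seconds>\d+(?:\.\d+)?)s'
)


def parse_durations(lines):
    """Map each step name to its reported duration in seconds."""
    durations = {}
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:
            durations[match.group("step")] = float(match.group("seconds"))
    return durations


durations = parse_durations(LOG_LINES)
# Print the slowest steps first, in minutes.
for step, seconds in sorted(durations.items(), key=lambda item: item[1], reverse=True):
    print(f"{step}: {seconds / 60:.1f} min")
```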

cjac commented Aug 2, 2024

/gcbrun

@cjac cjac marked this pull request as ready for review August 2, 2024 20:23
@cjac cjac requested a review from prince-cs August 2, 2024 20:23
@cjac cjac marked this pull request as draft August 2, 2024 20:23
…ion_tests/ and cloudbuild/ into prince's branch before squash/merge
@cjac cjac marked this pull request as ready for review August 2, 2024 20:35
@prince-cs
Collaborator

@cjac cjac merged commit 5ced7a6 into GoogleCloudDataproc:master Aug 2, 2024
@cjac cjac self-assigned this May 22, 2025