Description
System Info
- `transformers` version: 4.47.0
- Platform: Linux-5.15.0-1073-azure-x86_64-with-glibc2.35
- Python version: 3.11.0rc1
- Huggingface_hub version: 0.26.5
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Users can leverage the powerful `pad_to_multiple_of` parameter in the `DataCollatorForSeq2Seq` class (and other data collator classes), whose documentation reads:
"This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta)."
This seems to be slightly erroneous: the first NVIDIA GPU architecture to introduce Tensor Cores was indeed Volta, but Volta has a compute capability of 7.0, not 7.5. NVIDIA's compute capability 7.5 chips were introduced with the Turing architecture (e.g., the T4). Presumably the documentation is only slightly erroneous in its numeric definition - V100 (Volta) chips should indeed see a large benefit in performance, but so should T4 (Turing) chips.
Interestingly, the Volta chips, despite being older and having a lower compute capability, actually have twice as many Tensor Cores and CUDA cores (640 and 5,120) as the Turing chips (320 and 2,560). In this sense, training on Volta should give a larger improvement than training on Turing when passing this argument. Nonetheless, users can still leverage the Tensor Cores on the newer, less powerful T4 if they so choose. Accurately relabeling the compute capability of Volta as 7.0 removes any confusion here.
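As a quick illustrative check (not part of the documentation itself), one can query the compute capability of the local GPU with PyTorch and compare it against the corrected 7.0 threshold:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    # Volta (V100) reports (7, 0); Turing (T4) reports (7, 5).
    has_tensor_cores = (major, minor) >= (7, 0)
    print(f"{name}: compute capability {major}.{minor}, Tensor Cores: {has_tensor_cores}")
else:
    print("No CUDA device available.")
```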
A link to Nvidia's directory of CUDA compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus
Expected behavior
The docstrings of the `DataCollatorWithPadding`, `DataCollatorForTokenClassification`, and `DataCollatorForSeq2Seq` classes, which make use of the `pad_to_multiple_of` argument, should be updated with only a slight adjustment to the numeric definition of the compute capability. They should read:
"This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).",
as opposed to the current and slightly erroneous:
"This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta)."
Lastly, the `DataCollatorForLanguageModeling` class also allows users to pass the `pad_to_multiple_of` argument, but makes no reference to the added benefit of using this argument on GPUs with Tensor Cores. Presumably this docstring should read the same as those of the other data collator classes; see the sketch below.
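For reference, a minimal sketch showing that the same argument is accepted there as well (the checkpoint name is a placeholder):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The argument is accepted here too, but the current docstring says nothing
# about the Tensor Core benefit mentioned for the other data collators.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    pad_to_multiple_of=8,
)
```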
I will fork the repo and make these changes myself before linking the PR to this issue. @johngrahamreynolds