Description
System Info
- `transformers` version: 4.47.0
- Platform: Linux-5.15.0-1073-azure-x86_64-with-glibc2.35
- Python version: 3.11.0rc1
- Huggingface_hub version: 0.26.5
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Users can leverage the powerful `pad_to_multiple_of` parameter in the `DataCollatorForSeq2Seq` class (and other data collator classes), whose documentation reads:
"This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta)."
This seems to be slightly erroneous: the first NVIDIA GPU architecture to introduce Tensor Cores was indeed Volta, but Volta has a compute capability of 7.0, not 7.5. NVIDIA's compute capability 7.5 chips were introduced with the Turing architecture (e.g., the T4). Presumably the documentation is only slightly erroneous in its numeric definition - V100 (Volta) chips should indeed see a large benefit in performance, but so should T4 (Turing) chips.
Interestingly, the Volta chips, despite being older and having a lower compute capability, actually have twice as many Tensor Cores and CUDA cores (640 and 5,120) as the Turing chips (320 and 2,560). In this sense, training on Volta should give a larger improvement than training on Turing when passing this argument. Nonetheless, users can still leverage the Tensor Cores on the newer, less powerful T4 if they so choose. Accurately relabeling the compute capability of Volta as 7.0 removes any confusion here.
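As a quick illustrative check (not part of the documentation itself), one can query the compute capability of the local GPU with PyTorch and compare it against the corrected 7.0 threshold:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    # Volta (V100) reports (7, 0); Turing (T4) reports (7, 5).
    has_tensor_cores = (major, minor) >= (7, 0)
    print(f"{name}: compute capability {major}.{minor}, Tensor Cores: {has_tensor_cores}")
else:
    print("No CUDA device available.")
```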
A link to Nvidia's directory of CUDA compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus
Expected behavior
The docstrings of the `DataCollatorWithPadding`, `DataCollatorForTokenClassification`, and `DataCollatorForSeq2Seq` classes, which make use of the `pad_to_multiple_of` argument, should be updated with only a slight adjustment to the numeric definition of the compute capability. They should read:
"This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).",
as opposed to the current and slightly erroneous:
"This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta)."
Lastly, the `DataCollatorForLanguageModeling` class also allows users to pass the `pad_to_multiple_of` argument, but makes no reference to the added benefit of using this argument on GPUs with Tensor Cores. Presumably this docstring should read the same as those of the other data collator classes; see the sketch below.
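For reference, a minimal sketch showing that the same argument is accepted there as well (the checkpoint name is a placeholder):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The argument is accepted here too, but the current docstring says nothing
# about the Tensor Core benefit mentioned for the other data collators.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    pad_to_multiple_of=8,
)
```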
I will fork the repo and make these changes myself before linking the PR to this issue. @johngrahamreynolds