
run_mlm_flax on TPU v5 pods #35205

@peregilk

System Info

Latest versions of both transformers and jax

Who can help?

@ArthurZucker I am trying to use run_mlm_flax.py to train a RoBERTa model on a v5-256 pod. However, while a single v3-8 is capable of running with per_device_batch_size=128, the v5-256 is only able to run with per_device_batch_size=2. Any ideas?
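
For context, a minimal sketch (assuming the v5-256 exposes 256 chips in total, that the quoted batch sizes are per device, and plain data parallelism as in the example scripts) of how the per-device batch size combines with the device count into the global batch size:

```python
import jax

# On a multi-host pod each process only sees its local chips, while
# jax.device_count() reports the global chip count across all hosts.
print("process:", jax.process_index(),
      "local devices:", jax.local_device_count(),
      "global devices:", jax.device_count())

per_device_batch_size = 2  # value that fits on the v5-256 in this report
global_batch_size = per_device_batch_size * jax.device_count()

# Assuming 256 chips on the v5-256, this gives 2 * 256 = 512 examples per step,
# versus 128 * 8 = 1024 on a single v3-8.
print("global batch size:", global_batch_size)
```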

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Using the default example code, unmodified.
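
A rough, hypothetical sketch of the data-parallel pattern the example script follows (not the actual script; shapes and names here are illustrative): each local device receives its own per_device_batch_size slice, so per-device memory use should in principle be independent of pod size.

```python
import jax
import jax.numpy as jnp
import numpy as np

per_device_batch_size = 2
seq_len = 512  # illustrative sequence length

# Each local device gets one slice of the batch along the leading axis.
batch = np.zeros((jax.local_device_count(), per_device_batch_size, seq_len),
                 dtype=np.int32)

@jax.pmap
def forward(x):
    # Placeholder standing in for the model's forward pass.
    return jnp.sum(x, axis=-1)

out = forward(batch)
print(out.shape)  # (local_device_count, per_device_batch_size)
```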

Expected behavior

I would expect a v5-256 pod to run a lot faster here than a single v3-8.
