
Grant required IAM roles to Compute Engine default SA when --managed-mldiagnostics is passed during xpk cluster create #1187

Open
rapatchi wants to merge 1 commit into AI-Hypercomputer:main from rapatchi:permission_fix

Conversation

rapatchi commented Jun 9, 2026

Contributor

When provisioning clusters with --managed-mldiagnostics, XLA ML diagnostics requires roles/hypercomputecluster.editor, roles/storage.objectUser, and roles/logging.logWriter to be bound to the Compute Engine default service account.

This commit:

  1. Automatically resolves projectNumber and grants these 3 required IAM roles via gcloud projects add-iam-policy-binding during cluster create when --managed-mldiagnostics is enabled.
  2. Updates user documentation (permissions.md, clusters.md) and unit test coverage accordingly.
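
The grant step in item 1 can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper names `default_compute_sa` and `build_grant_commands` are hypothetical, the project ID and number in the usage lines are placeholders, and the commands are only assembled as strings (in the same shape as the ones in the cluster creation logs), not executed.

```python
# Roles named in this PR's description.
REQUIRED_ROLES = (
    "roles/hypercomputecluster.editor",
    "roles/storage.objectUser",
    "roles/logging.logWriter",
)


def default_compute_sa(project_number: str) -> str:
    """Return the Compute Engine default service account email for a project number."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def build_grant_commands(project_id: str, project_number: str) -> list[str]:
    """Build one add-iam-policy-binding command per required role (hypothetical helper)."""
    member = f"serviceAccount:{default_compute_sa(project_number)}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id}"
        f' --member="{member}" --role="{role}" --condition=None'
        for role in REQUIRED_ROLES
    ]


if __name__ == "__main__":
    # Placeholder project ID and number, for illustration only.
    for cmd in build_grant_commands("my-project", "123456789012"):
        print(cmd)
```

The project number would come from `gcloud projects describe PROJECT_ID --format="value(projectNumber)"`, as the `Get Project Number` task in the logs shows.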

Issue

Without this change, these permissions must be granted manually for mldiagnostics to work.

Testing

Have you performed any manual testing on your change?

Prior IAM Bindings:
image

Cluster Creation Logs:

(xpk_local_venv) rapatchi@rapatchi2:~/xpk_fork/xpk_sa$ xpk cluster create --cluster=maxtest-cluster1 --tpu-type=v5litepod-8 --project=rapatchiconsumer --zone=us-central1-b --num-nodes=2 --spot --managed-mldiagnostics
[XPK] Starting xpk v0.1.dev903+g2b0dc6334
...
[XPK] Task: `Get Project Number` is implemented by `gcloud projects describe rapatchiconsumer --format="value(projectNumber)"`
[XPK] Granting necessary roles to [email protected]
[XPK] Task: `Grant roles/hypercomputecluster.editor` is implemented by `gcloud projects add-iam-policy-binding rapatchiconsumer --member="serviceAccount:[email protected]" --role="roles/hypercomputecluster.editor" --condition=None`
[XPK] Task: `Grant roles/hypercomputecluster.editor` succeeded.
[XPK] Task: `Grant roles/storage.objectUser` is implemented by `gcloud projects add-iam-policy-binding rapatchiconsumer --member="serviceAccount:[email protected]" --role="roles/storage.objectUser" --condition=None`
[XPK] Task: `Grant roles/storage.objectUser` succeeded.
[XPK] Task: `Grant roles/logging.logWriter` is implemented by `gcloud projects add-iam-policy-binding rapatchiconsumer --member="serviceAccount:[email protected]" --role="roles/logging.logWriter" --condition=None`
[XPK] Task: `Grant roles/logging.logWriter` succeeded.
[XPK] Task: `Determine server supported GKE versions for default gke version` is implemented by `gcloud container get-server-config --project=rapatchiconsumer --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.defaultVersion)"`
...

Post Creation:

image
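
One way to confirm the post-creation bindings from the command line is to inspect the project's IAM policy. The sketch below assumes the policy JSON comes from `gcloud projects get-iam-policy PROJECT_ID --format=json`; `missing_roles` is a hypothetical helper and not part of xpk, and the member and policy in the usage lines are made-up sample data.

```python
import json


def missing_roles(policy_json: str, member: str, required: tuple[str, ...]) -> list[str]:
    """Return the required roles that `member` does not hold in the given policy JSON.

    Any binding that lists the member counts as granted, including
    conditional bindings (hypothetical helper, for illustration only).
    """
    policy = json.loads(policy_json)
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return [role for role in required if role not in granted]


if __name__ == "__main__":
    # Sample data only; a real policy would come from gcloud.
    member = "serviceAccount:123456789012-compute@developer.gserviceaccount.com"
    sample = json.dumps(
        {"bindings": [{"role": "roles/logging.logWriter", "members": [member]}]}
    )
    required = (
        "roles/hypercomputecluster.editor",
        "roles/storage.objectUser",
        "roles/logging.logWriter",
    )
    print(missing_roles(sample, member, required))
```

An empty list means all three roles are bound to the service account.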

Have you verified use cases affected by goldens? Yes

Comment thread src/xpk/commands/cluster.py
Comment thread src/xpk/commands/cluster.py Outdated
…agnostics

When provisioning clusters with --managed-mldiagnostics, XLA ML diagnostics
requires roles/hypercomputecluster.editor, roles/storage.objectUser, and
roles/logging.logWriter to be bound to the Compute Engine default service account.

This commit:
1. Automatically resolves projectNumber and grants these 3 IAM roles via
   gcloud projects add-iam-policy-binding during cluster create when
   --managed-mldiagnostics is enabled.
2. Explicitly specifies --condition=None to ensure non-interactive command
   compatibility when existing IAM policies contain conditional bindings.
3. Updates user documentation (permissions.md, clusters.md) and unit
   test coverage accordingly.