Skip to content

sandbox: atomic state writes, multi-gateway SSH config, smarter post-register hint#5537

Closed
akshaysingla-db wants to merge 3 commits into
databricks:mainfrom
akshaysingla-db:akshay/sandbox-followups
Closed

sandbox: atomic state writes, multi-gateway SSH config, smarter post-register hint#5537
akshaysingla-db wants to merge 3 commits into
databricks:mainfrom
akshaysingla-db:akshay/sandbox-followups

Conversation

@akshaysingla-db

Copy link
Copy Markdown
Collaborator

Summary

Three followups from the post-merge review of the gateway-host resolution path, stacked on #5536.

1. Atomic saveState

os.WriteFile is open-truncate-write — a concurrent reader (another CLI invocation racing this one) can see a half-written sandbox.json and loadState returns a parse error, which getDefault/getGatewayHost silently swallow as "". The acceptance suite documents the symptom in its script.prepare (per-test HOME isolation). Production code now gets the real atomic write via tmp + Rename, matching writeManagedConfig in sshconfig.go.

2. One Host stanza per cached gateway

Previously, registering against a workspace in region B after one in region A overwrote the SSH config block, silently breaking IDE Remote-SSH for workspace A. Now register reads the deduplicated set of gateway hosts across every cached profile (new allGatewayHosts helper) and emits one Host stanza per unique gateway:

# Managed by `databricks sandbox register`. Manual edits will be overwritten.
Host us-west-2.service-direct.dev.databricks.com
    Port 2222
    IdentityFile ~/.ssh/sandbox_ed25519
    IdentitiesOnly yes
Host us-east-1.service-direct.prod.databricks.com
    Port 2222
    IdentityFile ~/.ssh/sandbox_ed25519
    IdentitiesOnly yes

buildSSHConfigBlock takes []string instead of a single host; maybeWriteSSHConfig no longer takes a gateway parameter at all and reads from state.

3. Post-register hint branches on getDefault

sandbox register used to unconditionally suggest databricks sandbox ssh, even when the user had no default sandbox (so ssh would error). It now suggests databricks sandbox create when there's no default, and ssh only when there's something to connect to.

Test plan

  • go test ./cmd/sandbox/... passes
  • go test ./acceptance -run TestAccept/cmd/sandbox passes
  • go build ./... clean
  • ./task lint clean
  • New: TestBuildSSHConfigBlockMultipleGateways pins per-gateway repetition
  • Renamed: TestMaybeWriteSSHConfigSkipsWhenNoGatewaysCached (was …SkipsOnEmptyGateway) — the function no longer takes a gateway parameter

Stacking note

Stacked on #5536 (drop heuristic + --gateway flag). The diff against main currently includes that work too; collapses to just these three changes once #5536 merges.

This pull request and its description were written by Isaac.

Now that the server stamps `gateway_host` on every SshKey and Sandbox
response, `register` always populates the cache, and every subsequent
operation that touches the API gets the workspace's real gateway
directly. Two things become dead weight:

1. The hardcoded heuristic `resolveGatewayHost(workspaceHost)` that
   returned `uw2.dbrx.dev` for non-staging workspaces and
   `ue1.s.dbrx.dev` for staging. It produced a wrong answer for any
   workspace whose gateway didn't match those two values.

2. The `--gateway` flag on `sandbox ssh`. It was a manual override
   that made sense only when the CLI couldn't learn the gateway
   itself; with the cache always populated post-register, the only
   thing the flag now does is let a user typo their way into dialing
   a random host. The footgun outweighs the override value.

Removes:

- `defaultGatewayHost = "uw2.dbrx.dev"`
- `stagingDefaultGatewayHost = "ue1.s.dbrx.dev"`
- `resolveGatewayHost(workspaceHost string) string`
- `--gateway` flag and the `gatewayHost` local var in `newSSHCommand`

New resolution chain in `sandbox ssh`: fresh API response (populated
by the `api.get` of the resolved sandbox ID) → cached value (set by
any prior register / list / create / get / start / stop). If both
empty, error out with a pointer to `databricks sandbox register`.

Tests added:

- `acceptance/cmd/sandbox/register/success/` — pins the new contract:
  registerKey returns SshKey with gateway_host, register caches it in
  ~/.databricks/sandbox.json, prints the consent-skip notice in the
  non-interactive acceptance path. Parent `test.toml` now Ignores
  `.ssh` so the generated key + managed config don't appear as
  unexpected output files.
- `TestMaybeWriteSSHConfigSkipsOnEmptyGateway` — pins the
  empty-gateway short-circuit so a future refactor can't silently
  drop the "skip when server didn't stamp a host" branch.

Co-authored-by: Isaac
…register hint

Three followups from the post-merge review of the gateway-host
resolution path.

1. **Atomic saveState** (state.go). `os.WriteFile` is
   open-truncate-write, which means a concurrent reader (another CLI
   invocation racing this one) can see a half-written sandbox.json
   and fail to parse it — `loadState` then silently returns "" for
   `getDefault`/`getGatewayHost`. The acceptance suite documents the
   symptom in its script.prepare; production code now gets the real
   atomic write via tmp + Rename, mirroring `writeManagedConfig` in
   sshconfig.go.

2. **One Host stanza per cached gateway** (sshconfig.go, state.go).
   Previously, registering against a workspace in region B after one
   in region A overwrote the SSH config block, silently breaking IDE
   Remote-SSH for workspace A. Now `register` reads the deduplicated
   set of gateway hosts across every cached profile (new
   `allGatewayHosts` helper) and emits one Host stanza per unique
   gateway. `buildSSHConfigBlock` takes `[]string` instead of a
   single host; `maybeWriteSSHConfig` no longer takes a gateway
   parameter at all and reads from state.

   The same-gateway case (multiple profiles in the same region) is
   still trivially handled — the dedup collapses them.

3. **Post-register hint branches on `getDefault`** (register.go). The
   tail of `sandbox register` used to unconditionally suggest
   `databricks sandbox ssh`, even when the user had no default
   sandbox configured (so `ssh` would error). It now suggests
   `databricks sandbox create` when there's no default, and `ssh`
   only when there's one to connect to.

Tests:

- `TestBuildSSHConfigBlockMultipleGateways` pins the per-gateway
  repetition (Host line, Port, IdentityFile) in the new block shape.
- `TestMaybeWriteSSHConfigSkipsOnEmptyGateway` is renamed and
  refocused to `TestMaybeWriteSSHConfigSkipsWhenNoGatewaysCached` —
  the function no longer takes a gateway parameter, so the guarded
  branch is now "state is empty".
- The `acceptance/cmd/sandbox/register/success` golden picks up the
  new "Run databricks sandbox create" suggestion and the pluralised
  "add the sandbox gateway block(s)" consent message.

Stacked on databricks#5536 (drop heuristic + --gateway flag). The diff against
main currently includes that work too; collapses to the changes
above once databricks#5536 merges.

Co-authored-by: Isaac
@github-actions

Copy link
Copy Markdown
Contributor

Approval status: pending

/acceptance/cmd/sandbox/ - needs approval

5 files changed
Suggested: @pietern
Also eligible: @shuochen0311

/cmd/sandbox/ - needs approval

5 files changed
Suggested: @pietern
Also eligible: @shuochen0311

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@github-actions

Copy link
Copy Markdown
Contributor

An authorized user can trigger integration tests manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 5537
  • Commit SHA: 90942ef5517a62b0ca4db3b81d4b134e388d5e6e

Checks will be approved automatically on success.

@akshaysingla-db

Copy link
Copy Markdown
Collaborator Author

Superseded by #5543 — same diff pushed to an upstream branch so CI can pull JFrog tokens (fork PRs can't, per GitHub OIDC permission rules).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant