Skip to content

Simplify lookup sync da_checker oracle#9428

Merged
mergify[bot] merged 3 commits into
sigp:unstablefrom
dapplion:simplify-lookup-sync-dachecker-oracle
Jun 5, 2026
Merged

Simplify lookup sync da_checker oracle#9428
mergify[bot] merged 3 commits into
sigp:unstablefrom
dapplion:simplify-lookup-sync-dachecker-oracle

Conversation

@dapplion

@dapplion dapplion commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Issue Addressed

Implementing gloas lookup sync is currently incompatible with the GossipBlockProcessResult mechanism.

Today it's implemented such that if we receive a sucessful GossipBlockProcessResult we directly mark the lookup as Complete and delete it. In Gloas we can't delete a lookup after block import, as we may still have FULL child awaiting the payload.

IMO this GossipBlockProcessResult brings a lot of headache and edge cases that we can just live without. Also the reset_request business is nasty and can easily leave the lookup in a bad state.

Proposed Changes

If we get rid of GossipBlockProcessResult we only pay the following performance penalty:

  • Lookup is created exactly while the block's payload is being execution validated
  • (new degradation) we download the block again
  • send the block for processing but the duplicate cache prevents double execution

So in the worst case we spend a few KBs of extra download bandwidth. Remember each block is downloaded 8x times through gossip in the happy case.

@dapplion dapplion requested a review from jxs as a code owner June 5, 2026 20:47
@dapplion dapplion added ready-for-review The code is ready for review syncing labels Jun 5, 2026
// Instead we always re-download the block eagerly and de-duplicate the processing. So in
// the happy case we just download the block again if the lookup is created while execution
// processing the block.
BlockProcessStatus::NotValidated(..) => {}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change: No more setting the request in Pending state. We just continue and download the block.

@mergify

mergify Bot commented Jun 5, 2026

Copy link
Copy Markdown

Some required checks have failed. Could you please take a look @dapplion? 🙏

@mergify mergify Bot added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Jun 5, 2026
@dapplion

dapplion commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

Manual lookup-sync testing on mainnet

Ran this branch against live mainnet to confirm block lookups still fire and recover after the GossipBlockProcessResult / on_external_processing_result removal.

Setup: branch built and checkpoint-synced from a recent finalized checkpoint (slot 14,489,088), peered on real mainnet (post-Fulu — actual data-column custody lookups), with a mock always-VALID Engine-API EL. Node reached head, fully validated. Debug logs grepped for block_lookups events.

1. Restart / fall-behind → lookups fire. Stopped the BN for ~4 slots, restarted. It walked the unknown-parent chain block-by-block — Created block lookupSync RPC request sent method: BlocksByRootParentUnknown { parent_root } → lookup for the parent, ~8 deep — then imported and returned to head. Overwhelmingly lookup-driven (≫ range-sync lines).

2. Internet cut mid-request → in-flight requests fail, lookups recover. Severed only the node's connectivity (loopback/EL preserved). In-flight lookup/custody/range RPCs errored immediately (RpcError(Disconnected), RpcError(IoError("connection lost"))), peers dropped 31→5. On restore, lookups were recreated and each reached Imported. No stuck lookups.

3. EL down mid parent-chain → clean defer + recover. Killed the EL, forced a gap, restarted with the EL dead. With the EL unreachable the node held (el_offline:true, no optimistic import) and sync deferred the lookups rather than failing them — Ignoring unknown parent request … reason: "execution engine offline". Restarting the EL flipped Sync's view on execution engine state … Offline → Online; the parent-chain lookups resumed and the node caught the ~19-slot gap back to head.

Across all three, every lookup reached a terminal Imported result and the node returned to head, with 0 occurrences of stuck / TooManyAttempts / MissingComponentsAfterAllProcessed — i.e. the eager re-download + duplicate-cache dedup recovers cleanly without the old reset path.

Caveats: the mock EL means execution payloads aren't really validated, so this exercises the CL sync/lookup machinery and the EL-offline gating, not EL consensus; and these are manual observational runs, not deterministic tests.

Testing performed with AI assistance.

@pawanjay176 pawanjay176 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. I agree that the scenario is not that common and the uncommon case also isn't too bad (just one extra download).

@pawanjay176 pawanjay176 added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Jun 5, 2026
@mergify mergify Bot added the queued label Jun 5, 2026
@mergify

mergify Bot commented Jun 5, 2026

Copy link
Copy Markdown

Merge Queue Status

This pull request spent 28 minutes 59 seconds in the queue, including 27 minutes 42 seconds running CI.

Required conditions to merge

mergify Bot added a commit that referenced this pull request Jun 5, 2026
@mergify mergify Bot merged commit 8e4df4a into sigp:unstable Jun 5, 2026
38 checks passed
@mergify mergify Bot removed the queued label Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-merge This PR is ready to merge. syncing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants