Simplify lookup sync da_checker oracle#9428
Conversation
| // Instead we always re-download the block eagerly and de-duplicate the processing. So in | ||
| // the happy case we just download the block again if the lookup is created while execution | ||
| // processing the block. | ||
| BlockProcessStatus::NotValidated(..) => {} |
There was a problem hiding this comment.
This is the key change: No more setting the request in Pending state. We just continue and download the block.
|
Some required checks have failed. Could you please take a look @dapplion? 🙏 |
Manual lookup-sync testing on mainnetRan this branch against live mainnet to confirm block lookups still fire and recover after the Setup: branch built and checkpoint-synced from a recent finalized checkpoint (slot 14,489,088), peered on real mainnet (post-Fulu — actual data-column custody lookups), with a mock always- 1. Restart / fall-behind → lookups fire. Stopped the BN for ~4 slots, restarted. It walked the unknown-parent chain block-by-block — 2. Internet cut mid-request → in-flight requests fail, lookups recover. Severed only the node's connectivity (loopback/EL preserved). In-flight lookup/custody/range RPCs errored immediately ( 3. EL down mid parent-chain → clean defer + recover. Killed the EL, forced a gap, restarted with the EL dead. With the EL unreachable the node held ( Across all three, every lookup reached a terminal Caveats: the mock EL means execution payloads aren't really validated, so this exercises the CL sync/lookup machinery and the EL-offline gating, not EL consensus; and these are manual observational runs, not deterministic tests. Testing performed with AI assistance. |
pawanjay176
left a comment
There was a problem hiding this comment.
I like this. I agree that the scenario is not that common and the uncommon case also isn't too bad (just one extra download).
Merge Queue Status
This pull request spent 28 minutes 59 seconds in the queue, including 27 minutes 42 seconds running CI. Required conditions to merge
|
Issue Addressed
Implementing gloas lookup sync is currently incompatible with the
GossipBlockProcessResultmechanism.Today it's implemented such that if we receive a sucessful
GossipBlockProcessResultwe directly mark the lookup as Complete and delete it. In Gloas we can't delete a lookup after block import, as we may still have FULL child awaiting the payload.IMO this
GossipBlockProcessResultbrings a lot of headache and edge cases that we can just live without. Also thereset_requestbusiness is nasty and can easily leave the lookup in a bad state.Proposed Changes
If we get rid of
GossipBlockProcessResultwe only pay the following performance penalty:So in the worst case we spend a few KBs of extra download bandwidth. Remember each block is downloaded 8x times through gossip in the happy case.