Propagate snapshot load failure during IoTConsensus AddPeer#17935
Conversation
During region migration, when the target peer failed to load the transferred snapshot, the failure was silently swallowed: the target's IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator activated the new peer and marked AddRegionPeerProcedure / RegionMigrateProcedure successful. The migration was reported complete while the destination replica actually had no data, leading to silent data loss once the source replica was dropped. The coordinator side already handles a non-SUCCESS triggerSnapshotLoad response correctly (it throws ConsensusGroupModifyPeerException, which fails the AddPeer task and rolls the procedure back without deleting the source replica). The only broken link was that snapshot-load failure was never reportable, because IStateMachine.loadSnapshot returned void and the implementations swallowed errors. Change IStateMachine.loadSnapshot to return boolean (true on success): - DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionState Machine return false when loading fails (and SchemaRegionStateMachine now guards its body so an exception is reported rather than thrown). - IoTConsensusServerImpl.loadSnapshot returns false if loading any receive folder fails (removing the long-standing TODO). - IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non- SUCCESS status when loadSnapshot fails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding. - SimpleConsensusServerImpl forwards the boolean; the Ratis ApplicationStateMachineProxy logs a failure (its behavior is otherwise unchanged). Test state machines updated accordingly. Add AddPeerSnapshotLoadFailureTest: a real two-node IoTConsensus group where the target's loadSnapshot is forced to fail; it verifies that addRemotePeer reaches the load step, throws ConsensusException, and does not leave the target peer active. The test fails against the old code and passes with the fix.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17935 +/- ##
============================================
+ Coverage 41.07% 41.15% +0.08%
Complexity 318 318
============================================
Files 5257 5257
Lines 365010 365159 +149
Branches 47180 47207 +27
============================================
+ Hits 149918 150277 +359
+ Misses 215092 214882 -210 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Thanks for fixing the snapshot-load failure propagation. The main AddPeer control flow looks reasonable, and the new test covers the case where the target state machine reports load failure. I found two concerns:
|
There was a problem hiding this comment.
Pull request overview
This PR makes snapshot-load failures observable and propagated during IoTConsensus AddPeer, preventing migrations from being marked successful when the destination replica failed to load the transferred snapshot (a scenario that can lead to silent data loss).
Changes:
- Change
IStateMachine.loadSnapshot(...)to returnbooleanand update state machine implementations to report success/failure. - Propagate snapshot-load failure through IoTConsensus RPC (
triggerSnapshotLoad) and server-side snapshot loading. - Add a regression test covering AddPeer behavior when the target’s snapshot load fails.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/schemaregion/SchemaRegionStateMachine.java | Return boolean from loadSnapshot and report failures instead of silently swallowing exceptions. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/dataregion/DataRegionStateMachine.java | Return boolean from loadSnapshot and propagate loader failure via return value. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java | Return boolean from loadSnapshot for ConfigRegion state machine. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/IStateMachine.java | API change: loadSnapshot now returns boolean with updated contract docs. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/IoTConsensusServerImpl.java | Make snapshot-load result reportable to callers by returning boolean. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/service/IoTConsensusRPCServiceProcessor.java | Return non-success RPC status when snapshot loading fails. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/simple/SimpleConsensusServerImpl.java | Forward loadSnapshot boolean result. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/ApplicationStateMachineProxy.java | Log application state machine snapshot load failures (now detectable via boolean). |
| iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/AddPeerSnapshotLoadFailureTest.java | New regression test validating AddPeer fails (and doesn’t activate target) when snapshot load fails. |
| iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/simple/SimpleConsensusTest.java | Update test state machine to match new loadSnapshot signature. |
| iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/ratis/TestUtils.java | Update test state machine to match new loadSnapshot signature and return result. |
| iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/util/TestStateMachine.java | Update test state machine to match new loadSnapshot signature. |
| iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/EmptyStateMachine.java | Update test state machine to match new loadSnapshot signature. |
Comments suppressed due to low confidence (1)
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/dataregion/DataRegionStateMachine.java:131
loadSnapshotnow returns a boolean, but this implementation can still report success even when the snapshot directory has nosequence/...andunsequence/...content.SnapshotLoader#createLinksFromSnapshotDirToDataDirWithoutLogonly logs and returns when both are missing, so this method may incorrectly returntrueand allow activation with empty data. Consider validating that at least one of these expected directories exists underlatestSnapshotRootDirand returningfalseotherwise (before invokingSnapshotLoader).
public boolean loadSnapshot(File latestSnapshotRootDir) {
String databaseName = region.getDatabaseName();
String dataRegionIdString = region.getDataRegionIdString();
DataRegionId regionId = new DataRegionId(Integer.parseInt(dataRegionIdString));
try {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…pshot Two fixes following review feedback on snapshot-load-failure propagation: 1. IoTConsensusServerImpl.loadSnapshot: a DataRegion configures one receive folder per local data dir and the FolderManager spreads snapshot fragments across them, so a given snapshot only materializes under the folders that actually received fragments. Reporting failure when *any* folder lacks the snapshot turned a healthy multi-data-dir transfer into a spurious failure. Now only load from folders that contain the snapshot and skip the rest, while still failing when a folder that has it fails to load or when no folder received the snapshot at all. 2. ApplicationStateMachineProxy.loadSnapshot (Ratis): previously logged and then advanced lastAppliedTermIndex even when the application state machine rejected the snapshot, letting Ratis proceed on incomplete data. Now fail (re)initialization instead of advancing the applied index. Adds a multi-recvSnapshotDir regression test asserting a healthy transfer that spreads across a subset of receive folders still succeeds (verified to fail against the previous load-from-every-folder behavior).
ConfigPlanExecutor.loadSnapshot was void: it returned early on a missing snapshot dir and tracked per-processor failures only to decide whether to log success, so ConfigRegionStateMachine.loadSnapshot always returned true even when the load failed. Make ConfigPlanExecutor.loadSnapshot return a boolean (false on missing dir or any processor failure) and have ConfigRegionStateMachine.loadSnapshot propagate it, so AddPeer and other callers observe a failed ConfigRegion snapshot load.
|
Thanks for the careful review, @Caideyipi! Both concerns are addressed: 1. Multi-data-dir false failure in You're right. A DataRegion sets
I also added the regression test you suggested — 2.
Separately, Copilot flagged that All consensus tests pass ( |
Two follow-ups on the IoTConsensus snapshot-load path: - IoTConsensusRPCServiceProcessor.triggerSnapshotLoad now reports MIGRATE_REGION_ERROR (the snapshot load runs as part of the AddPeer flow, and this is the code the region-migration handlers already use) instead of the generic INTERNAL_SERVER_ERROR. - IoTConsensusServerImpl.loadSnapshot now returns on the first failing receive folder instead of continuing. Once one folder fails the replica's snapshot is already broken, and loading the remaining folders is not only pointless but harmful, since each DataRegion load wipes the data dirs before relinking. The other consensus protocols were audited and need no change: Simple and Ratis load a snapshot with a single state-machine call (already fail-fast), and IoTConsensusV2 replicates via pipe rather than a snapshot-load loop.
ConfigRegionStateMachine.loadSnapshot wrapped both the actual snapshot data load and the best-effort pipe-listener recomputation in one try/catch, and my previous change made the catch return false. That meant a recoverable pipe-listener IOException (which the original code deliberately logged and swallowed) would now be reported as a snapshot-load failure, which in turn aborts ConfigNode (re)initialization via the Ratis loadSnapshot contract. Scope the boolean to the data load (executor.loadSnapshot) only, and keep the pipe-listener recomputation as non-fatal post-processing: a failure there is still logged but no longer flips the load result to false.
An empty region produces a snapshot with zero fragments, so the source transmits 0 files and the target materializes no snapshot under any receive folder. My earlier change reported "no folder contained the snapshot" as a failure, which broke AddPeer for empty regions: migrating an empty data region (e.g. while removing a DataNode) failed at DO_ADD_REGION_PEER and the removal timed out. This reproduced in the Cluster IT - 1C3D job (IoTDBRemoveDataNodeNormalIT.success1C4DIoTTestUseTableSQL). loadSnapshot now loads only from the receive folders that actually contain the snapshot, fails on the first folder whose load fails (preserving the original V2-987 failure propagation), and treats an absent snapshot (empty region) as a no-op success. Adds addRemotePeerSucceedsWhenSnapshotIsEmpty, verified to fail against the previous behavior.
|
I found one remaining contract gap in the snapshot-load failure propagation. SchemaRegionStateMachine.loadSnapshot(...) now returns rue after calling schemaRegion.loadSnapshot(...), but the underlying ISchemaRegion implementations still swallow load failures:
Because those methods are still �oid and do not rethrow or return a result, SchemaRegionStateMachine.loadSnapshot(...) can report success even when the schema snapshot was not actually loaded. This weakens the new IStateMachine.loadSnapshot boolean contract. It also matters for Ratis, since ApplicationStateMachineProxy now relies on a alse return to avoid advancing lastAppliedTermIndex after a rejected snapshot. Suggested fix: make the schema-region load path report success/failure as well, or make SchemaRegionStateMachine.loadSnapshot(...) verify that the underlying schema snapshot was actually loaded before returning rue. A focused regression test around a failing schema snapshot load would help keep this contract from regressing. |
SchemaRegionStateMachine.loadSnapshot returned true after calling schemaRegion.loadSnapshot(...), but the underlying ISchemaRegion implementations swallowed load failures: both SchemaRegionPBTreeImpl and SchemaRegionMemoryImpl catch the load exception, log FAILED_TO_LOAD_SNAPSHOT, fall back to init(), and return void. So a failed schema-snapshot load was reported as success, weakening the IStateMachine.loadSnapshot boolean contract (and letting the Ratis snapshot-install path advance the applied index after a rejected snapshot). Make ISchemaRegion.loadSnapshot return boolean (false when the load failed and the region fell back to an empty state), have both implementations report it, and have SchemaRegionStateMachine propagate the data-load result while keeping the pipe-listener recomputation as non-fatal post-processing. Adds testLoadSnapshotReportsFailureWhenSnapshotIsMissing and asserts the success path returns true in the existing snapshot tests; verified the new test fails across all schema-region modes against the previous behavior.
|
Good catch — fixed in ddeb9f8. You're right: I made the schema-region load path report success/failure:
Added a focused regression test — |
|
Caideyipi
left a comment
There was a problem hiding this comment.
LGTM. The snapshot-load failure propagation gaps raised during review have been addressed: AddPeer now observes load failures, multi receive-directory and empty-snapshot cases are covered, ConfigRegion/SchemaRegion propagate the boolean contract, and the Ratis path no longer advances applied index after a rejected snapshot.



Problem
During region migration (AddPeer), if the target peer failed to load the transferred snapshot, the failure was silently swallowed. The target's IoTConsensus RPC handler returned
SUCCESSregardless, so the coordinator activated the new peer andAddRegionPeerProcedure/RegionMigrateProcedurewere both marked successful. The control plane reported the migration complete while the destination replica actually held no data — and once the source replica was dropped, the data was silently lost (queries returnedcount=0/max_time=null).Observed (from the report): the destination DataNode logged
Exception occurs when loading snapshot ... Cannot find .../sequence/root.test/1 or .../unsequence/...andFail to load snapshot, yet immediately afterwards set the peeractive status to true, and the ConfigNode logged[AddRegion] successand[MigrateRegion] success.Root cause
The coordinator side is already correct:
IoTConsensusServerImpl.triggerSnapshotLoad(the RPC sender) checks the response status and throwsConsensusGroupModifyPeerExceptionon failure, which causesaddRemotePeerto fail, the AddPeer task to be markedFAIL, and the procedure to roll back without deleting the source replica.The only broken link was that a snapshot-load failure was never reportable in the first place:
IStateMachine.loadSnapshotreturnedvoid.DataRegionStateMachine.loadSnapshotcaught the exception (ornullregion) and just logged it.IoTConsensusServerImpl.loadSnapshotignored the result (it carried a long-standing// TODO: throw exception if the snapshot load failed).IoTConsensusRPCServiceProcessor.triggerSnapshotLoadtherefore returnedSUCCESSunconditionally.Fix
Make snapshot-load failure reportable by changing
IStateMachine.loadSnapshotto returnboolean(trueon success):DataRegionStateMachine/SchemaRegionStateMachine/ConfigRegionStateMachinereturnfalsewhen loading fails (SchemaRegionStateMachinenow guards its body so a failure is reported rather than thrown).IoTConsensusServerImpl.loadSnapshotreturnsfalseif loading any receive folder fails (removing the TODO).IoTConsensusRPCServiceProcessor.triggerSnapshotLoadreturns a non-SUCCESSstatus whenloadSnapshotfails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding.SimpleConsensusServerImplforwards the boolean; the RatisApplicationStateMachineProxylogs a load failure (its behavior is otherwise unchanged). Test state machines are updated accordingly.Test
AddPeerSnapshotLoadFailureTestbuilds a real two-node IoTConsensus group and forces the target peer'sloadSnapshotto fail. It verifies thataddRemotePeer:ConsensusExceptioninstead of silently succeeding,The test fails against the old code and passes with the fix. Existing IoTConsensus tests (
ReplicateTest,StabilityTest) still pass.🤖 Generated with Claude Code