Propagate snapshot load failure during IoTConsensus AddPeer by CRZbulabula · Pull Request #17935 · apache/iotdb

CRZbulabula · 2026-06-12T12:25:36Z

Problem

During region migration (AddPeer), if the target peer failed to load the transferred snapshot, the failure was silently swallowed. The target's IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator activated the new peer and AddRegionPeerProcedure / RegionMigrateProcedure were both marked successful. The control plane reported the migration complete while the destination replica actually held no data — and once the source replica was dropped, the data was silently lost (queries returned count=0 / max_time=null).

Observed (from the report): the destination DataNode logged Exception occurs when loading snapshot ... Cannot find .../sequence/root.test/1 or .../unsequence/... and Fail to load snapshot, yet immediately afterwards set the peer active status to true, and the ConfigNode logged [AddRegion] success and [MigrateRegion] success.

Root cause

The coordinator side is already correct: IoTConsensusServerImpl.triggerSnapshotLoad (the RPC sender) checks the response status and throws ConsensusGroupModifyPeerException on failure, which causes addRemotePeer to fail, the AddPeer task to be marked FAIL, and the procedure to roll back without deleting the source replica.

The only broken link was that a snapshot-load failure was never reportable in the first place:

IStateMachine.loadSnapshot returned void.
DataRegionStateMachine.loadSnapshot caught the exception (or null region) and just logged it.
IoTConsensusServerImpl.loadSnapshot ignored the result (it carried a long-standing // TODO: throw exception if the snapshot load failed).
IoTConsensusRPCServiceProcessor.triggerSnapshotLoad therefore returned SUCCESS unconditionally.

Fix

Make snapshot-load failure reportable by changing IStateMachine.loadSnapshot to return boolean (true on success):

DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionStateMachine return false when loading fails (SchemaRegionStateMachine now guards its body so a failure is reported rather than thrown).
IoTConsensusServerImpl.loadSnapshot returns false if loading any receive folder fails (removing the TODO).
IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non-SUCCESS status when loadSnapshot fails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding.
SimpleConsensusServerImpl forwards the boolean; the Ratis ApplicationStateMachineProxy logs a load failure (its behavior is otherwise unchanged). Test state machines are updated accordingly.

Test

AddPeerSnapshotLoadFailureTest builds a real two-node IoTConsensus group and forces the target peer's loadSnapshot to fail. It verifies that addRemotePeer:

actually reaches the snapshot-load step (so the failure under test is the right one, not an earlier step),
throws ConsensusException instead of silently succeeding,
does not leave the target peer active with an incompletely-loaded snapshot.

The test fails against the old code and passes with the fix. Existing IoTConsensus tests (ReplicateTest, StabilityTest) still pass.

🤖 Generated with Claude Code

During region migration, when the target peer failed to load the transferred snapshot, the failure was silently swallowed: the target's IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator activated the new peer and marked AddRegionPeerProcedure / RegionMigrateProcedure successful. The migration was reported complete while the destination replica actually had no data, leading to silent data loss once the source replica was dropped. The coordinator side already handles a non-SUCCESS triggerSnapshotLoad response correctly (it throws ConsensusGroupModifyPeerException, which fails the AddPeer task and rolls the procedure back without deleting the source replica). The only broken link was that snapshot-load failure was never reportable, because IStateMachine.loadSnapshot returned void and the implementations swallowed errors. Change IStateMachine.loadSnapshot to return boolean (true on success): - DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionState Machine return false when loading fails (and SchemaRegionStateMachine now guards its body so an exception is reported rather than thrown). - IoTConsensusServerImpl.loadSnapshot returns false if loading any receive folder fails (removing the long-standing TODO). - IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non- SUCCESS status when loadSnapshot fails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding. - SimpleConsensusServerImpl forwards the boolean; the Ratis ApplicationStateMachineProxy logs a failure (its behavior is otherwise unchanged). Test state machines updated accordingly. Add AddPeerSnapshotLoadFailureTest: a real two-node IoTConsensus group where the target's loadSnapshot is forced to fail; it verifies that addRemotePeer reaches the load step, throws ConsensusException, and does not leave the target peer active. The test fails against the old code and passes with the fix.

codecov · 2026-06-12T13:13:32Z

Codecov Report

❌ Patch coverage is 51.28205% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.15%. Comparing base (abb9ef9) to head (ddeb9f8).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
...machine/schemaregion/SchemaRegionStateMachine.java	0.00%	9 Missing ⚠️
...tatemachine/dataregion/DataRegionStateMachine.java	0.00%	3 Missing ⚠️
...nsensus/statemachine/ConfigRegionStateMachine.java	0.00%	2 Missing ⚠️
...gnode/persistence/executor/ConfigPlanExecutor.java	0.00%	2 Missing ⚠️
.../consensus/ratis/ApplicationStateMachineProxy.java	33.33%	2 Missing ⚠️
...db/consensus/simple/SimpleConsensusServerImpl.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17935      +/-   ##
============================================
+ Coverage     41.07%   41.15%   +0.08%     
  Complexity      318      318              
============================================
  Files          5257     5257              
  Lines        365010   365159     +149     
  Branches      47180    47207      +27     
============================================
+ Hits         149918   150277     +359     
+ Misses       215092   214882     -210

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Caideyipi · 2026-06-15T02:11:17Z

Thanks for fixing the snapshot-load failure propagation. The main AddPeer control flow looks reasonable, and the new test covers the case where the target state machine reports load failure.

I found two concerns:

In IoTConsensusServerImpl.loadSnapshot(), the code now calls stateMachine.loadSnapshot() for every recvSnapshotDir and treats any false result as overall failure. However, snapshot fragments are written via FolderManager selection, so a given snapshot may not exist under every receive root. DataRegion also configures recvSnapshotDirs from all local data dirs, and SnapshotLoader fails when the given snapshot root does not exist. This can make a successful transfer fail in normal multi-data-dir deployments. We should either load only from the actual root that contains the received snapshot, or ensure all files of one snapshot are always placed under one selected root. A regression test with multiple recvSnapshotDirs and real snapshot files would help cover this.
ApplicationStateMachineProxy now logs when applicationStateMachine.loadSnapshot() returns false, but it still updates lastAppliedTermIndex afterward. That means Ratis may continue as if the snapshot was installed even though the application state machine rejected it. Since loadSnapshot now has an explicit success/failure contract, this path should avoid advancing the applied index, or fail initialization/reinitialization, when the load fails.

Copilot

Pull request overview

This PR makes snapshot-load failures observable and propagated during IoTConsensus AddPeer, preventing migrations from being marked successful when the destination replica failed to load the transferred snapshot (a scenario that can lead to silent data loss).

Changes:

Change IStateMachine.loadSnapshot(...) to return boolean and update state machine implementations to report success/failure.
Propagate snapshot-load failure through IoTConsensus RPC (triggerSnapshotLoad) and server-side snapshot loading.
Add a regression test covering AddPeer behavior when the target’s snapshot load fails.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/schemaregion/SchemaRegionStateMachine.java	Return boolean from `loadSnapshot` and report failures instead of silently swallowing exceptions.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/dataregion/DataRegionStateMachine.java	Return boolean from `loadSnapshot` and propagate loader failure via return value.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java	Return boolean from `loadSnapshot` for ConfigRegion state machine.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/IStateMachine.java	API change: `loadSnapshot` now returns `boolean` with updated contract docs.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/IoTConsensusServerImpl.java	Make snapshot-load result reportable to callers by returning `boolean`.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/service/IoTConsensusRPCServiceProcessor.java	Return non-success RPC status when snapshot loading fails.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/simple/SimpleConsensusServerImpl.java	Forward `loadSnapshot` boolean result.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/ApplicationStateMachineProxy.java	Log application state machine snapshot load failures (now detectable via boolean).
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/AddPeerSnapshotLoadFailureTest.java	New regression test validating AddPeer fails (and doesn’t activate target) when snapshot load fails.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/simple/SimpleConsensusTest.java	Update test state machine to match new `loadSnapshot` signature.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/ratis/TestUtils.java	Update test state machine to match new `loadSnapshot` signature and return result.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/util/TestStateMachine.java	Update test state machine to match new `loadSnapshot` signature.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/EmptyStateMachine.java	Update test state machine to match new `loadSnapshot` signature.

Comments suppressed due to low confidence (1)

iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/dataregion/DataRegionStateMachine.java:131

loadSnapshot now returns a boolean, but this implementation can still report success even when the snapshot directory has no sequence/... and unsequence/... content. SnapshotLoader#createLinksFromSnapshotDirToDataDirWithoutLog only logs and returns when both are missing, so this method may incorrectly return true and allow activation with empty data. Consider validating that at least one of these expected directories exists under latestSnapshotRootDir and returning false otherwise (before invoking SnapshotLoader).

  public boolean loadSnapshot(File latestSnapshotRootDir) {
    String databaseName = region.getDatabaseName();
    String dataRegionIdString = region.getDataRegionIdString();
    DataRegionId regionId = new DataRegionId(Integer.parseInt(dataRegionIdString));
    try {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…pshot Two fixes following review feedback on snapshot-load-failure propagation: 1. IoTConsensusServerImpl.loadSnapshot: a DataRegion configures one receive folder per local data dir and the FolderManager spreads snapshot fragments across them, so a given snapshot only materializes under the folders that actually received fragments. Reporting failure when *any* folder lacks the snapshot turned a healthy multi-data-dir transfer into a spurious failure. Now only load from folders that contain the snapshot and skip the rest, while still failing when a folder that has it fails to load or when no folder received the snapshot at all. 2. ApplicationStateMachineProxy.loadSnapshot (Ratis): previously logged and then advanced lastAppliedTermIndex even when the application state machine rejected the snapshot, letting Ratis proceed on incomplete data. Now fail (re)initialization instead of advancing the applied index. Adds a multi-recvSnapshotDir regression test asserting a healthy transfer that spreads across a subset of receive folders still succeeds (verified to fail against the previous load-from-every-folder behavior).

ConfigPlanExecutor.loadSnapshot was void: it returned early on a missing snapshot dir and tracked per-processor failures only to decide whether to log success, so ConfigRegionStateMachine.loadSnapshot always returned true even when the load failed. Make ConfigPlanExecutor.loadSnapshot return a boolean (false on missing dir or any processor failure) and have ConfigRegionStateMachine.loadSnapshot propagate it, so AddPeer and other callers observe a failed ConfigRegion snapshot load.

CRZbulabula · 2026-06-15T03:54:27Z

Thanks for the careful review, @Caideyipi! Both concerns are addressed:

1. Multi-data-dir false failure in IoTConsensusServerImpl.loadSnapshot (1c96b5b)

You're right. A DataRegion sets recvSnapshotDirs to all local data dirs (DataRegionConsensusImpl), and the FolderManager spreads snapshot fragments across them, so a given snapshot only materializes under the folders that actually received fragments — loading from a folder that never received this snapshot would (correctly) fail and turn a healthy transfer into a spurious failure. I went with the "load only from the roots that contain the received snapshot" option:

Skip receive folders whose snapshot root does not exist, load from the ones that do, and still report failure if a present snapshot fails to load or if no folder received the snapshot at all.
The actual complete load happens from whichever root holds the snapshot log, since SnapshotLoader#createLinksFromSnapshotDirToDataDirWithLog gathers the fragments across all data dirs and verifies the logged file count, so skipping the empty roots is safe.

I also added the regression test you suggested — AddPeerSnapshotLoadFailureTest#addRemotePeerSucceedsWhenSnapshotSpansSubsetOfRecvDirs — which builds a two-node group with multiple recvSnapshotDirs, transfers a real snapshot that lands in only a subset of the target's receive folders, and asserts AddPeer still succeeds. I verified it fails against the previous load-from-every-folder behavior and passes with the fix.

2. ApplicationStateMachineProxy advancing the applied index on load failure (1c96b5b)

loadSnapshot now throws IOException when the application state machine returns false, so lastAppliedTermIndex is no longer advanced and initialize/reinitialize fail instead of reporting the snapshot as applied. reinitialize() now declares throws IOException, which the Ratis StateMachine interface already permits.

Separately, Copilot flagged that ConfigRegionStateMachine.loadSnapshot always returned true because ConfigPlanExecutor.loadSnapshot was void and swallowed failures — fixed in 7fe8e9c by making it return a boolean and propagating it.

All consensus tests pass (AddPeerSnapshotLoadFailureTest, ReplicateTest/StabilityTest, RatisConsensusTest/RecoverReadTest/SnapshotTest).

Two follow-ups on the IoTConsensus snapshot-load path: - IoTConsensusRPCServiceProcessor.triggerSnapshotLoad now reports MIGRATE_REGION_ERROR (the snapshot load runs as part of the AddPeer flow, and this is the code the region-migration handlers already use) instead of the generic INTERNAL_SERVER_ERROR. - IoTConsensusServerImpl.loadSnapshot now returns on the first failing receive folder instead of continuing. Once one folder fails the replica's snapshot is already broken, and loading the remaining folders is not only pointless but harmful, since each DataRegion load wipes the data dirs before relinking. The other consensus protocols were audited and need no change: Simple and Ratis load a snapshot with a single state-machine call (already fail-fast), and IoTConsensusV2 replicates via pipe rather than a snapshot-load loop.

ConfigRegionStateMachine.loadSnapshot wrapped both the actual snapshot data load and the best-effort pipe-listener recomputation in one try/catch, and my previous change made the catch return false. That meant a recoverable pipe-listener IOException (which the original code deliberately logged and swallowed) would now be reported as a snapshot-load failure, which in turn aborts ConfigNode (re)initialization via the Ratis loadSnapshot contract. Scope the boolean to the data load (executor.loadSnapshot) only, and keep the pipe-listener recomputation as non-fatal post-processing: a failure there is still logged but no longer flips the load result to false.

An empty region produces a snapshot with zero fragments, so the source transmits 0 files and the target materializes no snapshot under any receive folder. My earlier change reported "no folder contained the snapshot" as a failure, which broke AddPeer for empty regions: migrating an empty data region (e.g. while removing a DataNode) failed at DO_ADD_REGION_PEER and the removal timed out. This reproduced in the Cluster IT - 1C3D job (IoTDBRemoveDataNodeNormalIT.success1C4DIoTTestUseTableSQL). loadSnapshot now loads only from the receive folders that actually contain the snapshot, fails on the first folder whose load fails (preserving the original V2-987 failure propagation), and treats an absent snapshot (empty region) as a no-op success. Adds addRemotePeerSucceedsWhenSnapshotIsEmpty, verified to fail against the previous behavior.

Caideyipi · 2026-06-15T10:15:46Z

I found one remaining contract gap in the snapshot-load failure propagation.

SchemaRegionStateMachine.loadSnapshot(...) now returns rue after calling schemaRegion.loadSnapshot(...), but the underlying ISchemaRegion implementations still swallow load failures:

SchemaRegionPBTreeImpl.loadSnapshot(...) catches IOException | MetadataException, logs FAILED_TO_LOAD_SNAPSHOT, tries init(), and then returns normally.
SchemaRegionMemoryImpl.loadSnapshot(...) has the same pattern for Exception.

Because those methods are still �oid and do not rethrow or return a result, SchemaRegionStateMachine.loadSnapshot(...) can report success even when the schema snapshot was not actually loaded. This weakens the new IStateMachine.loadSnapshot boolean contract. It also matters for Ratis, since ApplicationStateMachineProxy now relies on a alse return to avoid advancing lastAppliedTermIndex after a rejected snapshot.

Suggested fix: make the schema-region load path report success/failure as well, or make SchemaRegionStateMachine.loadSnapshot(...) verify that the underlying schema snapshot was actually loaded before returning rue. A focused regression test around a failing schema snapshot load would help keep this contract from regressing.

SchemaRegionStateMachine.loadSnapshot returned true after calling schemaRegion.loadSnapshot(...), but the underlying ISchemaRegion implementations swallowed load failures: both SchemaRegionPBTreeImpl and SchemaRegionMemoryImpl catch the load exception, log FAILED_TO_LOAD_SNAPSHOT, fall back to init(), and return void. So a failed schema-snapshot load was reported as success, weakening the IStateMachine.loadSnapshot boolean contract (and letting the Ratis snapshot-install path advance the applied index after a rejected snapshot). Make ISchemaRegion.loadSnapshot return boolean (false when the load failed and the region fell back to an empty state), have both implementations report it, and have SchemaRegionStateMachine propagate the data-load result while keeping the pipe-listener recomputation as non-fatal post-processing. Adds testLoadSnapshotReportsFailureWhenSnapshotIsMissing and asserts the success path returns true in the existing snapshot tests; verified the new test fails across all schema-region modes against the previous behavior.

CRZbulabula · 2026-06-15T11:50:28Z

Good catch — fixed in ddeb9f8.

You're right: SchemaRegionStateMachine.loadSnapshot(...) returned true unconditionally because the underlying load swallowed failures. Both SchemaRegionPBTreeImpl.loadSnapshot(...) (catching IOException | MetadataException) and SchemaRegionMemoryImpl.loadSnapshot(...) (catching Exception) logged FAILED_TO_LOAD_SNAPSHOT, fell back to init(), and returned void, so a failed schema-snapshot load was reported as success — weakening the new IStateMachine.loadSnapshot contract and, as you note, allowing the Ratis path to advance lastAppliedTermIndex after a rejected snapshot.

I made the schema-region load path report success/failure:

ISchemaRegion.loadSnapshot(...) now returns boolean (false when the load failed and the region fell back to an empty re-initialized state).
Both SchemaRegionPBTreeImpl and SchemaRegionMemoryImpl return true on the success path and false from the fallback/catch path.
SchemaRegionStateMachine.loadSnapshot(...) propagates the data-load result, while keeping the pipe-listener recomputation (schemaListener(...).loadSnapshot + listen2Snapshot4PipeListener) as non-fatal post-processing (consistent with how I scoped the ConfigRegion fix).

Added a focused regression test — SchemaRegionManagementTest#testLoadSnapshotReportsFailureWhenSnapshotIsMissing — which asserts that loading from a directory without a snapshot reports failure rather than silently succeeding. I also added assertTrue(...) to the existing successful-load assertions in testRatisModeSnapshot/testEmptySnapshot. The new test was verified to fail across all four schema-region modes (MemoryMode, PBTree-Full/Partial/Non-Memory) against the previous always-true behavior, and passes with the fix.

sonarqubecloud · 2026-06-15T11:55:23Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Caideyipi

LGTM. The snapshot-load failure propagation gaps raised during review have been addressed: AddPeer now observes load failures, multi receive-directory and empty-snapshot cases are covered, ConfigRegion/SchemaRegion propagate the boolean contract, and the Ratis path no longer advances applied index after a rejected snapshot.

CRZbulabula requested a review from Copilot June 15, 2026 02:25

Copilot started reviewing on behalf of CRZbulabula June 15, 2026 02:26 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread ...e/consensus/src/main/java/org/apache/iotdb/consensus/ratis/ApplicationStateMachineProxy.java

Comment thread ...c/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java

CRZbulabula added 2 commits June 15, 2026 11:13

CRZbulabula added 3 commits June 15, 2026 12:30

Caideyipi approved these changes Jun 16, 2026

View reviewed changes

CRZbulabula merged commit a4ed6dd into master Jun 16, 2026
42 of 43 checks passed

CRZbulabula deleted the fix_v2_987_snapshot_load_failed_false_success branch June 16, 2026 03:31

CRZbulabula mentioned this pull request Jun 16, 2026

Respect pre-set env vars and add cgroup memory detection in env scripts #17765

Open

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Propagate snapshot load failure during IoTConsensus AddPeer#17935

Propagate snapshot load failure during IoTConsensus AddPeer#17935
CRZbulabula merged 7 commits into
masterfrom
fix_v2_987_snapshot_load_failed_false_success

CRZbulabula commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Caideyipi commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

Caideyipi commented Jun 15, 2026

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

sonarqubecloud Bot commented Jun 15, 2026

Uh oh!

Caideyipi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

CRZbulabula commented Jun 12, 2026

Problem

Root cause

Fix

Test

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Caideyipi commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

Caideyipi commented Jun 15, 2026

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

sonarqubecloud Bot commented Jun 15, 2026

Quality Gate passed

Uh oh!

Caideyipi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Jun 12, 2026 •

edited

Loading