Skip to content

Propagate snapshot load failure during IoTConsensus AddPeer#17935

Merged
CRZbulabula merged 7 commits into
masterfrom
fix_v2_987_snapshot_load_failed_false_success
Jun 16, 2026
Merged

Propagate snapshot load failure during IoTConsensus AddPeer#17935
CRZbulabula merged 7 commits into
masterfrom
fix_v2_987_snapshot_load_failed_false_success

Conversation

@CRZbulabula

Copy link
Copy Markdown
Contributor

Problem

During region migration (AddPeer), if the target peer failed to load the transferred snapshot, the failure was silently swallowed. The target's IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator activated the new peer and AddRegionPeerProcedure / RegionMigrateProcedure were both marked successful. The control plane reported the migration complete while the destination replica actually held no data — and once the source replica was dropped, the data was silently lost (queries returned count=0 / max_time=null).

Observed (from the report): the destination DataNode logged Exception occurs when loading snapshot ... Cannot find .../sequence/root.test/1 or .../unsequence/... and Fail to load snapshot, yet immediately afterwards set the peer active status to true, and the ConfigNode logged [AddRegion] success and [MigrateRegion] success.

Root cause

The coordinator side is already correct: IoTConsensusServerImpl.triggerSnapshotLoad (the RPC sender) checks the response status and throws ConsensusGroupModifyPeerException on failure, which causes addRemotePeer to fail, the AddPeer task to be marked FAIL, and the procedure to roll back without deleting the source replica.

The only broken link was that a snapshot-load failure was never reportable in the first place:

  • IStateMachine.loadSnapshot returned void.
  • DataRegionStateMachine.loadSnapshot caught the exception (or null region) and just logged it.
  • IoTConsensusServerImpl.loadSnapshot ignored the result (it carried a long-standing // TODO: throw exception if the snapshot load failed).
  • IoTConsensusRPCServiceProcessor.triggerSnapshotLoad therefore returned SUCCESS unconditionally.

Fix

Make snapshot-load failure reportable by changing IStateMachine.loadSnapshot to return boolean (true on success):

  • DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionStateMachine return false when loading fails (SchemaRegionStateMachine now guards its body so a failure is reported rather than thrown).
  • IoTConsensusServerImpl.loadSnapshot returns false if loading any receive folder fails (removing the TODO).
  • IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non-SUCCESS status when loadSnapshot fails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding.
  • SimpleConsensusServerImpl forwards the boolean; the Ratis ApplicationStateMachineProxy logs a load failure (its behavior is otherwise unchanged). Test state machines are updated accordingly.

Test

AddPeerSnapshotLoadFailureTest builds a real two-node IoTConsensus group and forces the target peer's loadSnapshot to fail. It verifies that addRemotePeer:

  • actually reaches the snapshot-load step (so the failure under test is the right one, not an earlier step),
  • throws ConsensusException instead of silently succeeding,
  • does not leave the target peer active with an incompletely-loaded snapshot.

The test fails against the old code and passes with the fix. Existing IoTConsensus tests (ReplicateTest, StabilityTest) still pass.

🤖 Generated with Claude Code

During region migration, when the target peer failed to load the
transferred snapshot, the failure was silently swallowed: the target's
IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator
activated the new peer and marked AddRegionPeerProcedure /
RegionMigrateProcedure successful. The migration was reported complete
while the destination replica actually had no data, leading to silent
data loss once the source replica was dropped.

The coordinator side already handles a non-SUCCESS triggerSnapshotLoad
response correctly (it throws ConsensusGroupModifyPeerException, which
fails the AddPeer task and rolls the procedure back without deleting the
source replica). The only broken link was that snapshot-load failure was
never reportable, because IStateMachine.loadSnapshot returned void and
the implementations swallowed errors.

Change IStateMachine.loadSnapshot to return boolean (true on success):
- DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionState
  Machine return false when loading fails (and SchemaRegionStateMachine
  now guards its body so an exception is reported rather than thrown).
- IoTConsensusServerImpl.loadSnapshot returns false if loading any
  receive folder fails (removing the long-standing TODO).
- IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non-
  SUCCESS status when loadSnapshot fails, so the coordinator's existing
  error path fires and AddPeer fails instead of falsely succeeding.
- SimpleConsensusServerImpl forwards the boolean; the Ratis
  ApplicationStateMachineProxy logs a failure (its behavior is otherwise
  unchanged). Test state machines updated accordingly.

Add AddPeerSnapshotLoadFailureTest: a real two-node IoTConsensus group
where the target's loadSnapshot is forced to fail; it verifies that
addRemotePeer reaches the load step, throws ConsensusException, and does
not leave the target peer active. The test fails against the old code
and passes with the fix.
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 51.28205% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.15%. Comparing base (abb9ef9) to head (ddeb9f8).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...machine/schemaregion/SchemaRegionStateMachine.java 0.00% 9 Missing ⚠️
...tatemachine/dataregion/DataRegionStateMachine.java 0.00% 3 Missing ⚠️
...nsensus/statemachine/ConfigRegionStateMachine.java 0.00% 2 Missing ⚠️
...gnode/persistence/executor/ConfigPlanExecutor.java 0.00% 2 Missing ⚠️
.../consensus/ratis/ApplicationStateMachineProxy.java 33.33% 2 Missing ⚠️
...db/consensus/simple/SimpleConsensusServerImpl.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17935      +/-   ##
============================================
+ Coverage     41.07%   41.15%   +0.08%     
  Complexity      318      318              
============================================
  Files          5257     5257              
  Lines        365010   365159     +149     
  Branches      47180    47207      +27     
============================================
+ Hits         149918   150277     +359     
+ Misses       215092   214882     -210     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Caideyipi

Copy link
Copy Markdown
Collaborator

Thanks for fixing the snapshot-load failure propagation. The main AddPeer control flow looks reasonable, and the new test covers the case where the target state machine reports load failure.

I found two concerns:

  1. In IoTConsensusServerImpl.loadSnapshot(), the code now calls stateMachine.loadSnapshot() for every recvSnapshotDir and treats any false result as overall failure. However, snapshot fragments are written via FolderManager selection, so a given snapshot may not exist under every receive root. DataRegion also configures recvSnapshotDirs from all local data dirs, and SnapshotLoader fails when the given snapshot root does not exist. This can make a successful transfer fail in normal multi-data-dir deployments. We should either load only from the actual root that contains the received snapshot, or ensure all files of one snapshot are always placed under one selected root. A regression test with multiple recvSnapshotDirs and real snapshot files would help cover this.

  2. ApplicationStateMachineProxy now logs when applicationStateMachine.loadSnapshot() returns false, but it still updates lastAppliedTermIndex afterward. That means Ratis may continue as if the snapshot was installed even though the application state machine rejected it. Since loadSnapshot now has an explicit success/failure contract, this path should avoid advancing the applied index, or fail initialization/reinitialization, when the load fails.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes snapshot-load failures observable and propagated during IoTConsensus AddPeer, preventing migrations from being marked successful when the destination replica failed to load the transferred snapshot (a scenario that can lead to silent data loss).

Changes:

  • Change IStateMachine.loadSnapshot(...) to return boolean and update state machine implementations to report success/failure.
  • Propagate snapshot-load failure through IoTConsensus RPC (triggerSnapshotLoad) and server-side snapshot loading.
  • Add a regression test covering AddPeer behavior when the target’s snapshot load fails.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/schemaregion/SchemaRegionStateMachine.java Return boolean from loadSnapshot and report failures instead of silently swallowing exceptions.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/dataregion/DataRegionStateMachine.java Return boolean from loadSnapshot and propagate loader failure via return value.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java Return boolean from loadSnapshot for ConfigRegion state machine.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/IStateMachine.java API change: loadSnapshot now returns boolean with updated contract docs.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/IoTConsensusServerImpl.java Make snapshot-load result reportable to callers by returning boolean.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/service/IoTConsensusRPCServiceProcessor.java Return non-success RPC status when snapshot loading fails.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/simple/SimpleConsensusServerImpl.java Forward loadSnapshot boolean result.
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/ApplicationStateMachineProxy.java Log application state machine snapshot load failures (now detectable via boolean).
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/AddPeerSnapshotLoadFailureTest.java New regression test validating AddPeer fails (and doesn’t activate target) when snapshot load fails.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/simple/SimpleConsensusTest.java Update test state machine to match new loadSnapshot signature.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/ratis/TestUtils.java Update test state machine to match new loadSnapshot signature and return result.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/util/TestStateMachine.java Update test state machine to match new loadSnapshot signature.
iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/EmptyStateMachine.java Update test state machine to match new loadSnapshot signature.
Comments suppressed due to low confidence (1)

iotdb-core/datanode/src/main/java/org/apache/iotdb/db/consensus/statemachine/dataregion/DataRegionStateMachine.java:131

  • loadSnapshot now returns a boolean, but this implementation can still report success even when the snapshot directory has no sequence/... and unsequence/... content. SnapshotLoader#createLinksFromSnapshotDirToDataDirWithoutLog only logs and returns when both are missing, so this method may incorrectly return true and allow activation with empty data. Consider validating that at least one of these expected directories exists under latestSnapshotRootDir and returning false otherwise (before invoking SnapshotLoader).
  public boolean loadSnapshot(File latestSnapshotRootDir) {
    String databaseName = region.getDatabaseName();
    String dataRegionIdString = region.getDataRegionIdString();
    DataRegionId regionId = new DataRegionId(Integer.parseInt(dataRegionIdString));
    try {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…pshot

Two fixes following review feedback on snapshot-load-failure propagation:

1. IoTConsensusServerImpl.loadSnapshot: a DataRegion configures one receive
   folder per local data dir and the FolderManager spreads snapshot fragments
   across them, so a given snapshot only materializes under the folders that
   actually received fragments. Reporting failure when *any* folder lacks the
   snapshot turned a healthy multi-data-dir transfer into a spurious failure.
   Now only load from folders that contain the snapshot and skip the rest,
   while still failing when a folder that has it fails to load or when no
   folder received the snapshot at all.

2. ApplicationStateMachineProxy.loadSnapshot (Ratis): previously logged and
   then advanced lastAppliedTermIndex even when the application state machine
   rejected the snapshot, letting Ratis proceed on incomplete data. Now fail
   (re)initialization instead of advancing the applied index.

Adds a multi-recvSnapshotDir regression test asserting a healthy transfer that
spreads across a subset of receive folders still succeeds (verified to fail
against the previous load-from-every-folder behavior).
ConfigPlanExecutor.loadSnapshot was void: it returned early on a missing
snapshot dir and tracked per-processor failures only to decide whether to log
success, so ConfigRegionStateMachine.loadSnapshot always returned true even
when the load failed. Make ConfigPlanExecutor.loadSnapshot return a boolean
(false on missing dir or any processor failure) and have
ConfigRegionStateMachine.loadSnapshot propagate it, so AddPeer and other
callers observe a failed ConfigRegion snapshot load.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Thanks for the careful review, @Caideyipi! Both concerns are addressed:

1. Multi-data-dir false failure in IoTConsensusServerImpl.loadSnapshot (1c96b5b)

You're right. A DataRegion sets recvSnapshotDirs to all local data dirs (DataRegionConsensusImpl), and the FolderManager spreads snapshot fragments across them, so a given snapshot only materializes under the folders that actually received fragments — loading from a folder that never received this snapshot would (correctly) fail and turn a healthy transfer into a spurious failure. I went with the "load only from the roots that contain the received snapshot" option:

  • Skip receive folders whose snapshot root does not exist, load from the ones that do, and still report failure if a present snapshot fails to load or if no folder received the snapshot at all.
  • The actual complete load happens from whichever root holds the snapshot log, since SnapshotLoader#createLinksFromSnapshotDirToDataDirWithLog gathers the fragments across all data dirs and verifies the logged file count, so skipping the empty roots is safe.

I also added the regression test you suggested — AddPeerSnapshotLoadFailureTest#addRemotePeerSucceedsWhenSnapshotSpansSubsetOfRecvDirs — which builds a two-node group with multiple recvSnapshotDirs, transfers a real snapshot that lands in only a subset of the target's receive folders, and asserts AddPeer still succeeds. I verified it fails against the previous load-from-every-folder behavior and passes with the fix.

2. ApplicationStateMachineProxy advancing the applied index on load failure (1c96b5b)

loadSnapshot now throws IOException when the application state machine returns false, so lastAppliedTermIndex is no longer advanced and initialize/reinitialize fail instead of reporting the snapshot as applied. reinitialize() now declares throws IOException, which the Ratis StateMachine interface already permits.

Separately, Copilot flagged that ConfigRegionStateMachine.loadSnapshot always returned true because ConfigPlanExecutor.loadSnapshot was void and swallowed failures — fixed in 7fe8e9c by making it return a boolean and propagating it.

All consensus tests pass (AddPeerSnapshotLoadFailureTest, ReplicateTest/StabilityTest, RatisConsensusTest/RecoverReadTest/SnapshotTest).

Two follow-ups on the IoTConsensus snapshot-load path:

- IoTConsensusRPCServiceProcessor.triggerSnapshotLoad now reports
  MIGRATE_REGION_ERROR (the snapshot load runs as part of the AddPeer flow,
  and this is the code the region-migration handlers already use) instead of
  the generic INTERNAL_SERVER_ERROR.

- IoTConsensusServerImpl.loadSnapshot now returns on the first failing receive
  folder instead of continuing. Once one folder fails the replica's snapshot is
  already broken, and loading the remaining folders is not only pointless but
  harmful, since each DataRegion load wipes the data dirs before relinking.

The other consensus protocols were audited and need no change: Simple and Ratis
load a snapshot with a single state-machine call (already fail-fast), and
IoTConsensusV2 replicates via pipe rather than a snapshot-load loop.
ConfigRegionStateMachine.loadSnapshot wrapped both the actual snapshot data
load and the best-effort pipe-listener recomputation in one try/catch, and my
previous change made the catch return false. That meant a recoverable
pipe-listener IOException (which the original code deliberately logged and
swallowed) would now be reported as a snapshot-load failure, which in turn
aborts ConfigNode (re)initialization via the Ratis loadSnapshot contract.

Scope the boolean to the data load (executor.loadSnapshot) only, and keep the
pipe-listener recomputation as non-fatal post-processing: a failure there is
still logged but no longer flips the load result to false.
An empty region produces a snapshot with zero fragments, so the source
transmits 0 files and the target materializes no snapshot under any receive
folder. My earlier change reported "no folder contained the snapshot" as a
failure, which broke AddPeer for empty regions: migrating an empty data region
(e.g. while removing a DataNode) failed at DO_ADD_REGION_PEER and the removal
timed out. This reproduced in the Cluster IT - 1C3D job
(IoTDBRemoveDataNodeNormalIT.success1C4DIoTTestUseTableSQL).

loadSnapshot now loads only from the receive folders that actually contain the
snapshot, fails on the first folder whose load fails (preserving the original
V2-987 failure propagation), and treats an absent snapshot (empty region) as a
no-op success.

Adds addRemotePeerSucceedsWhenSnapshotIsEmpty, verified to fail against the
previous behavior.
@Caideyipi

Copy link
Copy Markdown
Collaborator

I found one remaining contract gap in the snapshot-load failure propagation.

SchemaRegionStateMachine.loadSnapshot(...) now returns rue after calling schemaRegion.loadSnapshot(...), but the underlying ISchemaRegion implementations still swallow load failures:

  • SchemaRegionPBTreeImpl.loadSnapshot(...) catches IOException | MetadataException, logs FAILED_TO_LOAD_SNAPSHOT, tries init(), and then returns normally.
  • SchemaRegionMemoryImpl.loadSnapshot(...) has the same pattern for Exception.

Because those methods are still �oid and do not rethrow or return a result, SchemaRegionStateMachine.loadSnapshot(...) can report success even when the schema snapshot was not actually loaded. This weakens the new IStateMachine.loadSnapshot boolean contract. It also matters for Ratis, since ApplicationStateMachineProxy now relies on a alse return to avoid advancing lastAppliedTermIndex after a rejected snapshot.

Suggested fix: make the schema-region load path report success/failure as well, or make SchemaRegionStateMachine.loadSnapshot(...) verify that the underlying schema snapshot was actually loaded before returning rue. A focused regression test around a failing schema snapshot load would help keep this contract from regressing.

SchemaRegionStateMachine.loadSnapshot returned true after calling
schemaRegion.loadSnapshot(...), but the underlying ISchemaRegion
implementations swallowed load failures: both SchemaRegionPBTreeImpl and
SchemaRegionMemoryImpl catch the load exception, log FAILED_TO_LOAD_SNAPSHOT,
fall back to init(), and return void. So a failed schema-snapshot load was
reported as success, weakening the IStateMachine.loadSnapshot boolean contract
(and letting the Ratis snapshot-install path advance the applied index after a
rejected snapshot).

Make ISchemaRegion.loadSnapshot return boolean (false when the load failed and
the region fell back to an empty state), have both implementations report it,
and have SchemaRegionStateMachine propagate the data-load result while keeping
the pipe-listener recomputation as non-fatal post-processing.

Adds testLoadSnapshotReportsFailureWhenSnapshotIsMissing and asserts the
success path returns true in the existing snapshot tests; verified the new
test fails across all schema-region modes against the previous behavior.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Good catch — fixed in ddeb9f8.

You're right: SchemaRegionStateMachine.loadSnapshot(...) returned true unconditionally because the underlying load swallowed failures. Both SchemaRegionPBTreeImpl.loadSnapshot(...) (catching IOException | MetadataException) and SchemaRegionMemoryImpl.loadSnapshot(...) (catching Exception) logged FAILED_TO_LOAD_SNAPSHOT, fell back to init(), and returned void, so a failed schema-snapshot load was reported as success — weakening the new IStateMachine.loadSnapshot contract and, as you note, allowing the Ratis path to advance lastAppliedTermIndex after a rejected snapshot.

I made the schema-region load path report success/failure:

  • ISchemaRegion.loadSnapshot(...) now returns boolean (false when the load failed and the region fell back to an empty re-initialized state).
  • Both SchemaRegionPBTreeImpl and SchemaRegionMemoryImpl return true on the success path and false from the fallback/catch path.
  • SchemaRegionStateMachine.loadSnapshot(...) propagates the data-load result, while keeping the pipe-listener recomputation (schemaListener(...).loadSnapshot + listen2Snapshot4PipeListener) as non-fatal post-processing (consistent with how I scoped the ConfigRegion fix).

Added a focused regression test — SchemaRegionManagementTest#testLoadSnapshotReportsFailureWhenSnapshotIsMissing — which asserts that loading from a directory without a snapshot reports failure rather than silently succeeding. I also added assertTrue(...) to the existing successful-load assertions in testRatisModeSnapshot/testEmptySnapshot. The new test was verified to fail across all four schema-region modes (MemoryMode, PBTree-Full/Partial/Non-Memory) against the previous always-true behavior, and passes with the fix.

@sonarqubecloud

Copy link
Copy Markdown

@Caideyipi Caideyipi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The snapshot-load failure propagation gaps raised during review have been addressed: AddPeer now observes load failures, multi receive-directory and empty-snapshot cases are covered, ConfigRegion/SchemaRegion propagate the boolean contract, and the Ratis path no longer advances applied index after a rejected snapshot.

@CRZbulabula CRZbulabula merged commit a4ed6dd into master Jun 16, 2026
42 of 43 checks passed
@CRZbulabula CRZbulabula deleted the fix_v2_987_snapshot_load_failed_false_success branch June 16, 2026 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants