fix: Improve disk usage in HA by eagerly deleting received WAL file on the replica#3300

Merged
as51340 merged 5 commits into master from improve-durability-retention
Oct 1, 2025

as51340 commented Sep 25, 2025

For a brief period, the system holds snapshot_retention_count + 1 snapshots, because a new snapshot is created first and only then is the oldest one deleted. This matters for capacity planning on the K8s side.
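
For capacity planning, the transient extra snapshot can be folded into a worst-case estimate. The sketch below is only a back-of-the-envelope calculation: the function name and the assumption that no snapshot exceeds `max_snapshot_bytes` are illustrative and not taken from the Memgraph code base. WAL files need to be budgeted separately.

```python
def peak_snapshot_bytes(retention_count: int, max_snapshot_bytes: int) -> int:
    """Worst-case disk space used by snapshots alone.

    Assumption (hypothetical): no single snapshot exceeds max_snapshot_bytes.
    The +1 accounts for the new snapshot that already exists before the
    oldest one is deleted.
    """
    if retention_count < 1 or max_snapshot_bytes < 0:
        raise ValueError("retention_count must be >= 1 and size must be non-negative")
    return (retention_count + 1) * max_snapshot_bytes


# Example: retention count 3, snapshots of up to 2 GiB each.
print(peak_snapshot_bytes(3, 2 * 1024**3))  # 8589934592 bytes, i.e. 8 GiB
```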

The replica doesn't manage disk space efficiently:

  • Its WAL files, and possibly received snapshots, never get deleted under normal circumstances (replica up, main up), because WAL files are cleaned only when a snapshot is created, and replicas don't create snapshots. On the replica they are moved to .old files if SnapshotRpc or WalFilesRpc is received as part of the force reset operation.

WAL files get cleared on the current main only when there are exactly storage_snapshot_retention_count snapshots in the system, including the current one.
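
The interaction between snapshot retention and WAL cleanup on the main can be illustrated with a small simulation. This is a sketch under assumptions, not the actual retention code: the function name is hypothetical, snapshots are modeled as a list ordered from oldest to newest, and the WAL check is assumed to run after old snapshots have been pruned.

```python
from typing import List, Tuple


def take_snapshot(snapshots: List[str], new_name: str,
                  retention_count: int) -> Tuple[List[str], int, bool]:
    """Simulate one snapshot cycle on the main.

    Returns (remaining snapshots, peak snapshot count during the cycle,
    whether WAL cleanup would run).
    """
    snapshots = snapshots + [new_name]        # 1. create the new snapshot first
    peak = len(snapshots)                     # can reach retention_count + 1
    while len(snapshots) > retention_count:   # 2. then delete the oldest ones
        snapshots = snapshots[1:]
    # 3. WAL files are cleared only with exactly retention_count snapshots,
    #    the current one included.
    clean_wal = len(snapshots) == retention_count
    return snapshots, peak, clean_wal


state: List[str] = []
for name in ["s1", "s2", "s3", "s4"]:
    state, peak, clean_wal = take_snapshot(state, name, retention_count=3)
    print(name, state, peak, clean_wal)
```

With a retention count of 3, the first two cycles leave WAL files untouched, and the fourth cycle briefly holds four snapshots before the oldest one is removed.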

This PR makes the following changes:

  • The replica now deletes the WAL file it received from the current main.
  • A snapshot found to be corrupted is no longer deleted.
  • The retention code for WAL files and snapshots was tested and refactored.
  • The GetRecoverySteps code was refactored.
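
The first change above, eager deletion of the received WAL file, follows a common pattern: write the payload to a temporary file, apply it, and remove the file right away. The following is a minimal sketch of that pattern, not Memgraph's implementation. The function name and the `apply` callback are hypothetical, and whether the real code also deletes the file when applying fails is not stated in this PR.

```python
import os
import tempfile
from typing import Callable


def receive_and_apply_wal(payload: bytes, apply: Callable[[str], None],
                          directory: str) -> str:
    """Write a received WAL payload to a temporary file, apply it, then delete it.

    Returns the path that was used, so the caller can verify that it is gone.
    """
    fd, path = tempfile.mkstemp(suffix=".wal", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        apply(path)  # hypothetical: replay the WAL contents into storage
    finally:
        # Eager cleanup (assumption): remove the file even if apply() raises.
        os.remove(path)
    return path
```

Because the file is removed in the `finally` block, the replica's disk usage does not grow with the number of WAL files it has received.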

as51340 self-assigned this Sep 25, 2025
as51340 added the following labels Sep 25, 2025:

  • CI -build=coverage -test=core (Run coverage build and core tests on push)
  • CI -build=jepsen -test=core (Run jepsen build and core tests on push)
  • CI -build=debug -test=core (Run debug build and core tests on push)
  • CI -build=release -test=core (Run release build and core tests on push)
  • CI -build=release -test=e2e (Run release build and e2e tests on push)
  • CI -build=coverage -test=clang_tidy

as51340 commented Sep 25, 2025

Tracking

  • [Link to Epic/Issue]

Standard development

CI Testing Labels

  • Select the appropriate CI test labels (CI -build=build-name -test=test-suite)

Documentation checklist

  • Add the documentation label
  • Add the bug / feature label
  • Add the milestone for which this feature is intended
    • If not known, set for a later milestone
  • Write a release note, including added/changed clauses
    • Corrupted snapshot files are no longer deleted. They won't be used either, but we found it incorrect to delete durability files that may contain valid data for a user and that are only temporarily unusable in the current release. #3300
    • The replica now deletes the temporary file it creates to receive a WAL file replicated from the main. #3330
  • Documentation PR link memgraph/documentation
    • Is back linked to this development PR
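
The first release note above, about corrupted snapshots, can be pictured as a selection loop that skips a bad file without removing it. This is only an illustrative sketch: the function name, the `is_valid` callback, and the assumption that file names sort chronologically are all hypothetical and do not describe how GetRecoverySteps actually works.

```python
from pathlib import Path
from typing import Callable, Optional


def pick_recovery_snapshot(snapshot_dir: Path,
                           is_valid: Callable[[Path], bool]) -> Optional[Path]:
    """Return the newest snapshot that passes validation.

    Corrupted files are skipped but left on disk.
    Assumption (hypothetical): file names sort in chronological order.
    """
    for candidate in sorted(snapshot_dir.iterdir(), reverse=True):
        if is_valid(candidate):
            return candidate
        # Corrupted or unreadable: skip it, but do not unlink it.
    return None
```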

as51340 force-pushed the improve-durability-retention branch from 7b2e023 to f642567 September 26, 2025 08:36
as51340 changed the title from "refactor: Snapshot retention code" to "feat: Improve disk usage in HA" Sep 26, 2025
as51340 force-pushed the improve-durability-retention branch from f642567 to 59a3241 September 26, 2025 08:39
as51340 added this to the mg-v3.6.0 milestone Sep 29, 2025
as51340 added the bug label and removed the feature label Sep 29, 2025
as51340 requested a review from andrejtonev September 29, 2025 09:05
as51340 marked this pull request as ready for review September 29, 2025 09:05
as51340 changed the title from "feat: Improve disk usage in HA" to "bugfix: Improve disk usage in HA by eagerly deleting received WAL file on the replica" Sep 29, 2025
as51340 force-pushed the improve-durability-retention branch from b4f5e4f to 71e5fd6 September 29, 2025 09:08
@sonarqubecloud

Please retry analysis of this Pull-Request directly on SonarQube Cloud

matea16 mentioned this pull request Sep 29, 2025
50 tasks
as51340 changed the title from "bugfix: Improve disk usage in HA by eagerly deleting received WAL file on the replica" to "fix: Improve disk usage in HA by eagerly deleting received WAL file on the replica" Sep 30, 2025
as51340 added this pull request to the merge queue Sep 30, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 30, 2025
as51340 added this pull request to the merge queue Oct 1, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 1, 2025
as51340 added this pull request to the merge queue Oct 1, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 1, 2025
as51340 added this pull request to the merge queue Oct 1, 2025
Merged via the queue into master with commit 1293ccf Oct 1, 2025
36 checks passed
as51340 deleted the improve-durability-retention branch October 1, 2025 11:58
as51340 added a commit that referenced this pull request Oct 24, 2025
…n the replica (#3300)

Labels

  • bug
  • Capability - high-availability
  • CI -build=coverage -test=clang_tidy
  • CI -build=coverage -test=core (Run coverage build and core tests on push)
  • CI -build=debug -test=core (Run debug build and core tests on push)
  • CI -build=jepsen -test=core (Run jepsen build and core tests on push)
  • CI -build=release -test=core (Run release build and core tests on push)
  • CI -build=release -test=e2e (Run release build and e2e tests on push)
  • Docs needed