
Conversation

@jbajic (Contributor) commented Nov 19, 2025

Scope & Purpose

Fixes issues with the existing shard metrics on DBServers:

  • arangodb_shards_number
  • arangodb_shards_leader_number
  • arangodb_shards_out_of_sync
  • arangodb_shards_not_replicated

These metrics now persist across maintenance runs, and we keep them for every database separately: when iterating the dirty databases, we only update the statistics of the databases that actually contain changes.
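For illustration, here is a minimal self-contained sketch of that bookkeeping. The struct fields and the free function below are placeholders, not the exact code in this PR:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the per-database counters kept by the MaintenanceFeature.
struct ShardStatistics {
  uint64_t shards = 0;               // all local shards of this database
  uint64_t leaderShards = 0;         // shards led by this DBServer
  uint64_t outOfSyncShards = 0;      // counted for arangodb_shards_out_of_sync
  uint64_t notReplicatedShards = 0;  // counted for arangodb_shards_not_replicated
  uint64_t followersOutOfSync = 0;   // local follower shards that are out of sync
};

// One entry per database; entries of databases that were not dirty in this
// maintenance run keep their previous values.
std::unordered_map<std::string, ShardStatistics> databaseShardsStats;

void reportDirtyDatabases(std::vector<std::string> const& dirty) {
  for (auto const& dbName : dirty) {
    ShardStatistics stats{};  // recompute from scratch for a dirty database
    // ... walk the local shards of dbName and increment the counters above ...
    databaseShardsStats[dbName] = stats;
  }
}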

This PR also introduces two new metrics:

  • arangodb_shards_follower_number => how many follower shards are on this DBServer (calculated as arangodb_shards_number - arangodb_shards_leader_number)
  • arangodb_shard_followers_out_of_sync_number => how many follower shards on this DBServer are out of sync

The new metrics cover follower shards, unlike the existing ones, which only report the state of leaders.

  • 💩 Bugfix
  • 🍕 New feature
  • 🔥 Performance improvement
  • 🔨 Refactoring/simplification

Checklist

  • Tests
    • Regression tests
    • C++ Unit tests
    • integration tests
    • resilience tests
  • 📖 CHANGELOG entry made
  • 📚 documentation written (release notes, API changes, ...)
  • Backports
    • Backport for 3.12.0: (Please link PR)
    • Backport for 3.11: (Please link PR)
    • Backport for 3.10: (Please link PR)

Related Information

(Please reference tickets / specification / other PRs etc)


Note

Track per-database shard stats and update only dirty DBs; add follower shard count and follower out-of-sync metrics with docs and tests.

  • Maintenance/Cluster:
    • Track per-database shard statistics during reportInCurrent; update only for dirty databases.
    • Aggregate via MaintenanceFeature::updateDatabaseStatistics() to set gauges; compute follower count as shards - leaderShards (see the sketch after this list).
    • Introduce maintenance::ShardStatistics in MaintenanceFeature; remove old in-function counters; adjust reportInCurrent/phaseTwo signatures and logic.
    • Detect and count followers out of sync on DBServers.
  • Metrics:
    • Add gauges: arangodb_shards_follower_number, arangodb_shard_followers_out_of_sync_number; initialize in MaintenanceFeature.
    • Documentation for both new metrics.
  • Tests:
    • Extend cluster metrics tests and add shell-cluster-dbserver-shard-metrics.js covering stability, out-of-sync, not-replicated, and shard move scenarios.
  • Changelog: note improved shard metrics and new follower-related metrics.
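A rough, self-contained sketch of that aggregation step; the gauge helper below is a stand-in for the actual metrics objects owned by the MaintenanceFeature:

#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>
#include <unordered_map>

// Same placeholder counters as in the description above.
struct ShardStatistics {
  uint64_t shards = 0, leaderShards = 0, followersOutOfSync = 0;
};

// Stand-in for publishing a gauge value; the real code stores into gauge
// members instead of printing.
void setGauge(std::string_view name, uint64_t value) {
  std::cout << name << " = " << value << "\n";
}

void updateDatabaseStatistics(
    std::unordered_map<std::string, ShardStatistics> const& perDatabase) {
  ShardStatistics total{};
  for (auto const& [dbName, stats] : perDatabase) {
    total.shards += stats.shards;
    total.leaderShards += stats.leaderShards;
    total.followersOutOfSync += stats.followersOutOfSync;
  }
  setGauge("arangodb_shards_number", total.shards);
  setGauge("arangodb_shards_leader_number", total.leaderShards);
  // the follower count is derived rather than counted separately
  setGauge("arangodb_shards_follower_number", total.shards - total.leaderShards);
  setGauge("arangodb_shard_followers_out_of_sync_number", total.followersOutOfSync);
}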

Written by Cursor Bugbot for commit 3c79299.

@jbajic jbajic self-assigned this Nov 19, 2025
@cla-bot cla-bot bot added the cla-signed label Nov 19, 2025
@jbajic jbajic changed the title from "[COR-8] Improve the statistics on shards on DBservers" to "[COR-8] Improve the statistics of shards on DBservers" Nov 19, 2025
@jbajic jbajic marked this pull request as ready for review November 26, 2025 15:53
@jbajic jbajic requested a review from a team as a code owner November 26, 2025 15:53

@goedderz (Member) left a comment

Thanks for the PR! I'm mostly, but not quite, finished with the review. The implementation is good, and thanks for the very extensive test suite!

I've added two noteworthy comments on the implementation, and a bunch of secondary thoughts about the tests.

for (auto const& dbName : dirty) {
  // initialize database statistics for this database, resetting whatever was
  // previously
  feature._databaseShardsStats[dbName] = ShardStatistics{};
Member

I am not sure (just because I don't know and haven't looked up the details) if a database is guaranteed to stay in dirty when an exception is thrown during a maintenance run.

If not, this could leave the statistics in an incomplete state, when the next call to reportInCurrent completes without this database, and calls updateDatabaseStatistics() at the end.

You could check how the handling of dirty databases works and figure out whether this could be a problem (unless you know it already). Or work with a local variable first, and only assign it to feature._databaseShardsStats[dbName] at the end of the loop's body.
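For illustration, a small self-contained sketch of that suggested pattern (the counting in the loop body is elided):

#include <string>
#include <unordered_map>
#include <vector>

struct ShardStatistics { /* counters as in the PR */ };

void reportDirty(std::vector<std::string> const& dirty,
                 std::unordered_map<std::string, ShardStatistics>& dbStats) {
  for (auto const& dbName : dirty) {
    ShardStatistics stats{};
    // ... count shards, leaders and out-of-sync followers for dbName;
    //     this part may throw ...
    // Publish only after the body has completed, so an exception cannot
    // leave a freshly reset, half-filled entry behind.
    dbStats[dbName] = stats;
  }
}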

Contributor Author

I am not sure either, but I think assigning it at the end is still a good idea, even if that assumption changes in the future; this makes the metrics more robust.

CHANGELOG Outdated
Comment on lines 4 to 5
* COR-8: Improved performance of shard statistics on DBservers by updating only
metrics for databases with changes, rather than recalculating all metrics.
Member

This doesn't improve the performance: previously, already only the dirty databases were counted.

It does have a much more significant effect: previously the metrics did not account for non-dirty databases, so they were generally completely off, except in moments where all databases had been dirty during the last maintenance run.

Comment on lines +16 to +21
troubleshoot: |
  If this metric shows a non-zero value for an extended period, it indicates that
  some follower shards on this DBServer are lagging behind their leaders. This could be due to:
  - Network issues between this DBServer and the leaders
  - This DBServer being overloaded or slow
  - Large amounts of data being replicated
Member

Suggested change (adds one more possible cause):

troubleshoot: |
  If this metric shows a non-zero value for an extended period, it indicates that
  some follower shards on this DBServer are lagging behind their leaders. This could be due to:
  - Network issues between this DBServer and the leaders
  - This DBServer being overloaded or slow
  - Large amounts of data being replicated
  - Collection locks (write or exclusive) preventing the follower from getting in sync

Comment on lines +651 to +652
std::unordered_map<std::string, maintenance::ShardStatistics>
    _databaseShardsStats;
Member

Unless I've overlooked something, you need to make sure that deleting a database also deletes its entry here, or it will be erroneously added to the metrics until the next reboot.
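A minimal sketch of such a cleanup; the function name and the place it would be called from are assumptions, not existing API:

#include <string>
#include <unordered_map>

struct ShardStatistics { /* counters as in the PR */ };

// Hypothetical hook, to be invoked when the DBServer learns that a database
// has been dropped, so the aggregated gauges stop including its counters.
void dropDatabaseStatistics(
    std::string const& dbName,
    std::unordered_map<std::string, ShardStatistics>& dbStats) {
  dbStats.erase(dbName);
}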

Member

This file is just reformatted, or did I overlook any relevant change or addition?

Contributor Author

Yeah, I am not sure how that snuck in here. I will remove this file from the diff.

internal.wait(1);
const shardsNumMetricValue = getDBServerMetricSum(onlineServers, shardsNumMetric);
if (shardsNumMetricValue !== 21) {
  print(`The metric ${shardsNumMetric} has value ${shardsNumMetricValue} should have been 21`);
Member

As above, I suggest removing the print statements.

try {
  // Create a collection with shards
  db._create("ConsistencyTestCollection", {numberOfShards: 5, replicationFactor: 2}, undefined, {waitForSyncReplication: true});
  require("internal").wait(5); // Wait for maintenance to update metrics
Member

Do we have to wait for 5 seconds here? I know I'm a little bit naggy about sleeps in the tests, but we have >>10k tests, and it really quickly adds up whether tests run 1ms or 5s. And additionally, waiting is usually not reliable, so tests that rely on it tend to be flaky.

Comment on lines +112 to +113
for(let i = 0; i < 100; i++) {
  internal.wait(1);
Member

I'd prefer a shorter sleep time (maybe 1ms); this means you'll want to increase the number of iterations (or incrementally increase the sleep time).

Comment on lines +341 to +344
// Shutdown followers
dbServersWithoutLeader.forEach(server => {
  server.suspend();
});
Member

Just for me to understand: there is only 1 follower, but all other (non-leader, non-follower) servers are suspended as well to prevent a failover from messing up the results?

Comment on lines +353 to +361
let followersOutOfSyncNumMetricValue;
for(let i = 0; i < 500; i++) {
  followersOutOfSyncNumMetricValue = getDBServerMetricSum(onlineServers, followersOutOfSyncNumMetric);
  if (followersOutOfSyncNumMetricValue === 0) {
    continue;
  }

  break;
}
Member

I understand this loop should wait for the follower to come in sync again. But the 500 iterations seem an unreliable criterion to wait for; shouldn't this rather be a maximum time to wait, and possibly some (short) sleeps between iterations?

@goedderz (Member) left a comment

In accordance with the previous comments, we should also add one test for deleting a collection, and one for deleting a database (and checking the metrics after that).

Comment on lines +2309 to +2316
bool shardInSync{false};
auto const plannedServers = shardMap.at(shName);
for (const auto& it : VPackArrayIterator(s)) {
  if (it.stringView() == serverId) {
    shardInSync = true;
    break;
  }
}
Member

As discussed: shardInSync should stay false if the local database has its leader set to LEADER_NOT_YET_KNOWN, in order to account for resyncs after a reboot.

Member

Compare with this snippet from syncReplicatedShardsWithLeaders:

        bool needsResyncBecauseOfRestart = false;
        if (lshard.isObject()) {  // just in case
          theLeader = lshard.get(THE_LEADER);
          if (theLeader.isString() &&
              theLeader.stringView() ==
                  maintenance::ResignShardLeadership::LeaderNotYetKnownString) {
            needsResyncBecauseOfRestart = true;
          }
        }

shardInSync = true;
break;
}
}
cursor bot

Bug: Follower out-of-sync detection ignores LEADER_NOT_YET_KNOWN state

The follower out-of-sync detection marks a shard as in-sync if the server appears in Current's server list. However, after a reboot, the local shard may have THE_LEADER set to LEADER_NOT_YET_KNOWN (or NOT_YET_TOUCHED), indicating it needs to resync even if Current lists it as in-sync. The code fails to check for this special leader state and incorrectly reports followers as in-sync when they actually require resyncing. This is the same pattern used in syncReplicatedShardsWithLeaders which correctly handles needsResyncBecauseOfRestart.
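Sketched as a fragment in the style of the quoted snippets above (lshard, THE_LEADER and LeaderNotYetKnownString are taken from syncReplicatedShardsWithLeaders); this illustrates the idea rather than the final fix:

bool needsResyncBecauseOfRestart = false;
if (lshard.isObject()) {  // local shard entry, as in syncReplicatedShardsWithLeaders
  auto theLeader = lshard.get(THE_LEADER);
  if (theLeader.isString() &&
      theLeader.stringView() ==
          maintenance::ResignShardLeadership::LeaderNotYetKnownString) {
    needsResyncBecauseOfRestart = true;
  }
}

bool shardInSync{false};
if (!needsResyncBecauseOfRestart) {
  // only trust Current's in-sync server list if the local leader is known
  for (auto const& it : VPackArrayIterator(s)) {
    if (it.stringView() == serverId) {
      shardInSync = true;
      break;
    }
  }
}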


const dbFreeDBServer = dbServers.filter(server => server.id !== leaderServer && server.id !== fromServer);
const toServer = dbFreeDBServer[0].id;
assertEqual(dbFreeDBServer.length, 1);
assertNotEqual(fromServer, dbFreeDBServer);
cursor bot

Bug: Test assertion compares string to array incorrectly

The assertion assertNotEqual(fromServer, dbFreeDBServer) compares a server ID string (fromServer) with an array of server objects (dbFreeDBServer). In JavaScript, comparing a string to an array always returns not-equal, making this assertion meaningless - it will always pass regardless of the actual values. The likely intended comparison was assertNotEqual(fromServer, toServer) to verify the source and destination servers are different before moving a shard.


@goedderz goedderz mentioned this pull request Dec 15, 2025
feature._databaseShardsStats[dbName]
.increaseNumberOfFollowersOutOfSync();
}
}
cursor bot

Bug: Follower out-of-sync metric never tracked for replication2

The new follower out-of-sync tracking code at lines 2306-2321 is unreachable for replication2. At lines 2069-2072, for replication2, all non-leader shards hit continue and skip processing entirely. Leaders then enter the if branch at line 2076 (since THE_LEADER is empty for leaders), never reaching the else block at line 2306 where increaseNumberOfFollowersOutOfSync() is called. This means the new arangodb_shard_followers_out_of_sync_number metric will always be 0 for databases using replication2.


