
Conversation

@jbajic (Contributor) commented Nov 19, 2025

Scope & Purpose

Fixes issues with the existing shard metrics on DBServers:

  • arangodb_shards_number
  • arangodb_shards_leader_number
  • arangodb_shards_out_of_sync
  • arangodb_shards_not_replicated

These metrics now persist across maintenance runs, and we keep them for every database separately: when iterating the dirty databases, we only update the statistics of the databases that actually contain changes.
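For illustration, here is a minimal self-contained sketch of that bookkeeping. The struct fields and the free function below are placeholders, not the exact code in this PR:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the per-database counters kept by the MaintenanceFeature.
struct ShardStatistics {
  uint64_t shards = 0;               // all local shards of this database
  uint64_t leaderShards = 0;         // shards led by this DBServer
  uint64_t outOfSyncShards = 0;      // counted for arangodb_shards_out_of_sync
  uint64_t notReplicatedShards = 0;  // counted for arangodb_shards_not_replicated
  uint64_t followersOutOfSync = 0;   // local follower shards that are out of sync
};

// One entry per database; entries of databases that were not dirty in this
// maintenance run keep their previous values.
std::unordered_map<std::string, ShardStatistics> databaseShardsStats;

void reportDirtyDatabases(std::vector<std::string> const& dirty) {
  for (auto const& dbName : dirty) {
    ShardStatistics stats{};  // recompute from scratch for a dirty database
    // ... walk the local shards of dbName and increment the counters above ...
    databaseShardsStats[dbName] = stats;
  }
}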

This PR also introduces two new metrics:

  • arangodb_shards_follower_number => how many follower shards are on this DBServer (calculated as arangodb_shards_number - arangodb_shards_leader_number)
  • arangodb_shard_followers_out_of_sync_number => how many follower shards on this DBServer are out of sync

The new metrics cover follower shards, unlike the existing ones, which only report the state of leaders.

  • 💩 Bugfix
  • 🍕 New feature
  • 🔥 Performance improvement
  • 🔨 Refactoring/simplification

Checklist

  • Tests
    • Regression tests
    • C++ Unit tests
    • integration tests
    • resilience tests
  • 📖 CHANGELOG entry made
  • 📚 documentation written (release notes, API changes, ...)
  • Backports
    • Backport for 3.12.0: (Please link PR)
    • Backport for 3.11: (Please link PR)
    • Backport for 3.10: (Please link PR)

Related Information

(Please reference tickets / specification / other PRs etc)


Note

Track per-database shard stats and update only dirty DBs; add follower shard count and follower out-of-sync metrics with docs and tests.

  • Maintenance/Cluster:
    • Track per-database shard statistics during reportInCurrent; update only for dirty databases.
    • Aggregate via MaintenanceFeature::updateDatabaseStatistics() to set gauges; compute follower count as shards - leaderShards (see the sketch after this list).
    • Introduce maintenance::ShardStatistics in MaintenanceFeature; remove old in-function counters; adjust reportInCurrent/phaseTwo signatures and logic.
    • Detect and count followers out of sync on DBServers.
  • Metrics:
    • Add gauges: arangodb_shards_follower_number, arangodb_shard_followers_out_of_sync_number; initialize in MaintenanceFeature.
    • Documentation for both new metrics.
  • Tests:
    • Extend cluster metrics tests and add shell-cluster-dbserver-shard-metrics.js covering stability, out-of-sync, not-replicated, and shard move scenarios.
  • Changelog: note improved shard metrics and new follower-related metrics.
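A rough, self-contained sketch of that aggregation step; the gauge helper below is a stand-in for the actual metrics objects owned by the MaintenanceFeature:

#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>
#include <unordered_map>

// Same placeholder counters as in the description above.
struct ShardStatistics {
  uint64_t shards = 0, leaderShards = 0, followersOutOfSync = 0;
};

// Stand-in for publishing a gauge value; the real code stores into gauge
// members instead of printing.
void setGauge(std::string_view name, uint64_t value) {
  std::cout << name << " = " << value << "\n";
}

void updateDatabaseStatistics(
    std::unordered_map<std::string, ShardStatistics> const& perDatabase) {
  ShardStatistics total{};
  for (auto const& [dbName, stats] : perDatabase) {
    total.shards += stats.shards;
    total.leaderShards += stats.leaderShards;
    total.followersOutOfSync += stats.followersOutOfSync;
  }
  setGauge("arangodb_shards_number", total.shards);
  setGauge("arangodb_shards_leader_number", total.leaderShards);
  // the follower count is derived rather than counted separately
  setGauge("arangodb_shards_follower_number", total.shards - total.leaderShards);
  setGauge("arangodb_shard_followers_out_of_sync_number", total.followersOutOfSync);
}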

Written by Cursor Bugbot for commit 3c79299.

@jbajic jbajic self-assigned this Nov 19, 2025
@cla-bot cla-bot bot added the cla-signed label Nov 19, 2025
@jbajic jbajic changed the title from "[COR-8] Improve the statistics on shards on DBservers" to "[COR-8] Improve the statistics of shards on DBservers" Nov 19, 2025
@jbajic jbajic marked this pull request as ready for review November 26, 2025 15:53
@jbajic jbajic requested a review from a team as a code owner November 26, 2025 15:53

@goedderz (Member) left a comment

Thanks for the PR! I'm mostly, but not quite, finished with the review. The implementation is good, and thanks for the very extensive test suite!

I've added two noteworthy comments on the implementation, and a bunch of secondary thoughts about the tests.

for (auto const& dbName : dirty) {
  // initialize database statistics for this database, resetting whatever was
  // previously
  feature._databaseShardsStats[dbName] = ShardStatistics{};
Member

I am not sure (just because I don't know and haven't looked up the details) if a database is guaranteed to stay in dirty when an exception is thrown during a maintenance run.

If not, this could leave the statistics in an incomplete state, when the next call to reportInCurrent completes without this database, and calls updateDatabaseStatistics() at the end.

You could check how the handling of dirty databases works and figure out whether this could be a problem (unless you know it already). Or work with a local variable first, and only assign it to feature._databaseShardsStats[dbName] at the end of the loop's body.
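For illustration, a small self-contained sketch of that suggested pattern (the counting in the loop body is elided):

#include <string>
#include <unordered_map>
#include <vector>

struct ShardStatistics { /* counters as in the PR */ };

void reportDirty(std::vector<std::string> const& dirty,
                 std::unordered_map<std::string, ShardStatistics>& dbStats) {
  for (auto const& dbName : dirty) {
    ShardStatistics stats{};
    // ... count shards, leaders and out-of-sync followers for dbName;
    //     this part may throw ...
    // Publish only after the body has completed, so an exception cannot
    // leave a freshly reset, half-filled entry behind.
    dbStats[dbName] = stats;
  }
}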

Contributor Author

I am not sure either, but I think assigning it at the end is still a good idea, even if that assumption changes in the future; this makes the metrics more robust.

CHANGELOG Outdated
Comment on lines 4 to 5
* COR-8: Improved performance of shard statistics on DBservers by updating only
metrics for databases with changes, rather than recalculating all metrics.
Member

This doesn't improve the performance: previously, already only the dirty databases were counted.

It does have a much more significant effect: previously the metrics did not account for non-dirty databases, so they were generally completely off, except in moments where all databases had been dirty during the last maintenance run.

Comment on lines +16 to +21
troubleshoot: |
  If this metric shows a non-zero value for an extended period, it indicates that
  some follower shards on this DBServer are lagging behind their leaders. This could be due to:
  - Network issues between this DBServer and the leaders
  - This DBServer being overloaded or slow
  - Large amounts of data being replicated
Member

Suggested change (adds one more possible cause):

troubleshoot: |
  If this metric shows a non-zero value for an extended period, it indicates that
  some follower shards on this DBServer are lagging behind their leaders. This could be due to:
  - Network issues between this DBServer and the leaders
  - This DBServer being overloaded or slow
  - Large amounts of data being replicated
  - Collection locks (write or exclusive) preventing the follower from getting in sync

Comment on lines +651 to +652
std::unordered_map<std::string, maintenance::ShardStatistics>
    _databaseShardsStats;
Member

Unless I've overlooked something, you need to make sure that deleting a database also deletes its entry here, or it will be erroneously added to the metrics until the next reboot.
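A minimal sketch of such a cleanup; the function name and the place it would be called from are assumptions, not existing API:

#include <string>
#include <unordered_map>

struct ShardStatistics { /* counters as in the PR */ };

// Hypothetical hook, to be invoked when the DBServer learns that a database
// has been dropped, so the aggregated gauges stop including its counters.
void dropDatabaseStatistics(
    std::string const& dbName,
    std::unordered_map<std::string, ShardStatistics>& dbStats) {
  dbStats.erase(dbName);
}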

Member

This file is just reformatted, or did I overlook any relevant change or addition?

Contributor Author

Yeah, I am not sure how that snuck in here. I will remove this file from the diff.

internal.wait(1);
const shardsNumMetricValue = getDBServerMetricSum(onlineServers, shardsNumMetric);
if (shardsNumMetricValue !== 21) {
  print(`The metric ${shardsNumMetric} has value ${shardsNumMetricValue} should have been 21`);
Member

As above, I suggest removing the print statements.

try {
  // Create a collection with shards
  db._create("ConsistencyTestCollection", {numberOfShards: 5, replicationFactor: 2}, undefined, {waitForSyncReplication: true});
  require("internal").wait(5); // Wait for maintenance to update metrics
Member

Do we have to wait for 5 seconds here? I know I'm a little bit naggy about sleeps in the tests, but we have >>10k tests, and it really quickly adds up whether tests run 1ms or 5s. And additionally, waiting is usually not reliable, so tests that rely on it tend to be flaky.

Comment on lines +112 to +113
for(let i = 0; i < 100; i++) {
  internal.wait(1);
Member

I'd prefer a shorter sleep time (maybe 1ms); this means you'll want to increase the number of iterations (or incrementally increase the sleep time).

Comment on lines +341 to +344
// Shutdown followers
dbServersWithoutLeader.forEach(server => {
  server.suspend();
});
Member

Just for me to understand: there is only 1 follower, but all other (non-leader, non-follower) servers are suspended as well to prevent a failover from messing up the results?

Comment on lines +353 to +361
let followersOutOfSyncNumMetricValue;
for(let i = 0; i < 500; i++) {
  followersOutOfSyncNumMetricValue = getDBServerMetricSum(onlineServers, followersOutOfSyncNumMetric);
  if (followersOutOfSyncNumMetricValue === 0) {
    continue;
  }

  break;
}
Member

I understand this loop should wait for the follower to come in sync again. But the 500 iterations seem an unreliable criterion to wait for; shouldn't this rather be a maximum time to wait, and possibly some (short) sleeps between iterations?

@goedderz (Member) left a comment

In accordance with the previous comments, we should also add one test for deleting a collection, and one for deleting a database (and checking the metrics after that).

Comment on lines +2309 to +2316
bool shardInSync{false};
auto const plannedServers = shardMap.at(shName);
for (const auto& it : VPackArrayIterator(s)) {
  if (it.stringView() == serverId) {
    shardInSync = true;
    break;
  }
}
Member

As discussed: shardInSync should stay false if the local database has its leader set to LEADER_NOT_YET_KNOWN, in order to account for resyncs after a reboot.

Member

Compare with this snippet from syncReplicatedShardsWithLeaders:

        bool needsResyncBecauseOfRestart = false;
        if (lshard.isObject()) {  // just in case
          theLeader = lshard.get(THE_LEADER);
          if (theLeader.isString() &&
              theLeader.stringView() ==
                  maintenance::ResignShardLeadership::LeaderNotYetKnownString) {
            needsResyncBecauseOfRestart = true;
          }
        }

shardInSync = true;
break;
}
}
cursor bot

Bug: Follower out-of-sync detection ignores LEADER_NOT_YET_KNOWN state

The follower out-of-sync detection marks a shard as in-sync if the server appears in Current's server list. However, after a reboot, the local shard may have THE_LEADER set to LEADER_NOT_YET_KNOWN (or NOT_YET_TOUCHED), indicating it needs to resync even if Current lists it as in-sync. The code fails to check for this special leader state and incorrectly reports followers as in-sync when they actually require resyncing. This is the same pattern used in syncReplicatedShardsWithLeaders which correctly handles needsResyncBecauseOfRestart.
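Sketched as a fragment in the style of the quoted snippets above (lshard, THE_LEADER and LeaderNotYetKnownString are taken from syncReplicatedShardsWithLeaders); this illustrates the idea rather than the final fix:

bool needsResyncBecauseOfRestart = false;
if (lshard.isObject()) {  // local shard entry, as in syncReplicatedShardsWithLeaders
  auto theLeader = lshard.get(THE_LEADER);
  if (theLeader.isString() &&
      theLeader.stringView() ==
          maintenance::ResignShardLeadership::LeaderNotYetKnownString) {
    needsResyncBecauseOfRestart = true;
  }
}

bool shardInSync{false};
if (!needsResyncBecauseOfRestart) {
  // only trust Current's in-sync server list if the local leader is known
  for (auto const& it : VPackArrayIterator(s)) {
    if (it.stringView() == serverId) {
      shardInSync = true;
      break;
    }
  }
}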


const dbFreeDBServer = dbServers.filter(server => server.id !== leaderServer && server.id !== fromServer);
const toServer = dbFreeDBServer[0].id;
assertEqual(dbFreeDBServer.length, 1);
assertNotEqual(fromServer, dbFreeDBServer);
cursor bot

Bug: Test assertion compares string to array incorrectly

The assertion assertNotEqual(fromServer, dbFreeDBServer) compares a server ID string (fromServer) with an array of server objects (dbFreeDBServer). In JavaScript, comparing a string to an array always returns not-equal, making this assertion meaningless - it will always pass regardless of the actual values. The likely intended comparison was assertNotEqual(fromServer, toServer) to verify the source and destination servers are different before moving a shard.


@goedderz goedderz mentioned this pull request Dec 15, 2025
feature._databaseShardsStats[dbName]
.increaseNumberOfFollowersOutOfSync();
}
}
cursor bot

Bug: Follower out-of-sync metric never tracked for replication2

The new follower out-of-sync tracking code at lines 2306-2321 is unreachable for replication2. At lines 2069-2072, for replication2, all non-leader shards hit continue and skip processing entirely. Leaders then enter the if branch at line 2076 (since THE_LEADER is empty for leaders), never reaching the else block at line 2306 where increaseNumberOfFollowersOutOfSync() is called. This means the new arangodb_shard_followers_out_of_sync_number metric will always be 0 for databases using replication2.


