[COR-8] Improve the statistics of shards on DBservers #22121
base: devel
Conversation
goedderz
left a comment
Thanks for the PR! I'm mostly, but not quite, finished with the review. The implementation is good, and thanks for the very extensive test suite!
I've added two noteworthy comments on the implementation, and a bunch of secondary thoughts about the tests.
| for (auto const& dbName : dirty) {
| // initialize database statistics for this database, resetting whatever was
| // previously
| feature._databaseShardsStats[dbName] = ShardStatistics{};
I am not sure (just because I don't know and haven't looked up the details) if a database is guaranteed to stay in dirty when an exception is thrown during a maintenance run.
If not, this could leave the statistics in an incomplete state, when the next call to reportInCurrent completes without this database, and calls updateDatabaseStatistics() at the end.
You could check how the handling of dirty databases works and figure out whether this could be a problem (unless you know it already). Or work with a local variable first, and only assign it to feature._databaseShardsStats[dbName] at the end of the loop's body.
I am not sure either, but I think assigning it at the end is still a good idea, even if that assumption changes in the future; this makes the metrics more robust.
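To make the suggestion concrete, here is a minimal sketch of the local-variable approach, using hypothetical stand-ins for the feature and the counting step (only the names ShardStatistics and _databaseShardsStats are taken from the diff above; everything else is illustrative, not the PR's actual code):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the PR's types.
struct ShardStatistics {
  std::uint64_t shards = 0;
  std::uint64_t leaderShards = 0;
  std::uint64_t followersOutOfSync = 0;
};

struct MaintenanceFeatureSketch {
  std::unordered_map<std::string, ShardStatistics> _databaseShardsStats;
};

// Placeholder for the real counting logic, which walks the shards of one
// database and may throw along the way.
ShardStatistics countShardsOf(std::string const& /*dbName*/) {
  return ShardStatistics{};
}

void reportDirtyDatabases(MaintenanceFeatureSketch& feature,
                          std::vector<std::string> const& dirty) {
  for (auto const& dbName : dirty) {
    // Accumulate into a local object first; nothing is published yet.
    ShardStatistics stats = countShardsOf(dbName);
    // Publish only at the end of the loop body, so an exception thrown while
    // counting cannot leave a freshly reset, half-filled entry in the map.
    feature._databaseShardsStats[dbName] = stats;
  }
}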
CHANGELOG (Outdated)
| * COR-8: Improved performance of shard statistics on DBservers by updating only
| metrics for databases with changes, rather than recalculating all metrics.
This doesn't improve the performance: previously, only the dirty databases were counted anyway.
It does have a much more significant effect: previously the metrics did not account for non-dirty databases, so the metrics were generally completely off, except in moments where all databases had been dirty during the last maintenance run.
| troubleshoot: |
| If this metric shows a non-zero value for an extended period, it indicates that
| some follower shards on this DBServer are lagging behind their leaders. This could be due to:
| - Network issues between this DBServer and the leaders
| - This DBServer being overloaded or slow
| - Large amounts of data being replicated
Suggested change:
| troubleshoot: |
| If this metric shows a non-zero value for an extended period, it indicates that
| some follower shards on this DBServer are lagging behind their leaders. This could be due to:
| - Network issues between this DBServer and the leaders
| - This DBServer being overloaded or slow
| - Large amounts of data being replicated
| - Collection locks (write or exclusive) preventing the follower from getting in sync
| std::unordered_map<std::string, maintenance::ShardStatistics>
| _databaseShardsStats;
Unless I've overlooked something, you need to make sure that deleting a database also deletes its entry here, or it will be erroneously added to the metrics until the next reboot.
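A hedged sketch of the cleanup this asks for, assuming a hook that runs on the database-drop path; the registry type, hook name, and mutex are illustrative, not the PR's actual API:

#include <mutex>
#include <string>
#include <unordered_map>

struct ShardStatistics { /* per-database counters, as in the diff above */ };

class ShardStatsRegistrySketch {
 public:
  // Illustrative hook: when a database is dropped, erase its entry so the
  // dropped database no longer contributes to the aggregated metrics.
  void databaseDropped(std::string const& dbName) {
    std::lock_guard<std::mutex> guard(_mutex);
    _databaseShardsStats.erase(dbName);
  }

 private:
  std::mutex _mutex;
  std::unordered_map<std::string, ShardStatistics> _databaseShardsStats;
};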
Is this file just reformatted, or did I overlook any relevant change or addition?
Yeah, I am not sure how that snuck in here. I will remove this file from the diff.
| internal.wait(1);
| const shardsNumMetricValue = getDBServerMetricSum(onlineServers, shardsNumMetric);
| if (shardsNumMetricValue !== 21) {
| print(`The metric ${shardsNumMetric} has value ${shardsNumMetricValue} should have been 21`);
As above, I suggest removing the print statements.
| try {
| // Create a collection with shards
| db._create("ConsistencyTestCollection", {numberOfShards: 5, replicationFactor: 2}, undefined, {waitForSyncReplication: true});
| require("internal").wait(5); // Wait for maintenance to update metrics
Do we have to wait for 5 seconds here? I know I'm a little naggy about sleeps in the tests, but we have >>10k tests, and it quickly adds up whether a test runs for 1ms or 5s. In addition, waiting is usually not reliable, so tests that rely on it tend to be flaky.
| for(let i = 0; i < 100; i++) {
| internal.wait(1);
I'd prefer a shorter sleep time (maybe 1ms); this means you'll want to increase the number of iterations (or incrementally increase the sleep time).
| // Shutdown followers
| dbServersWithoutLeader.forEach(server => {
| server.suspend();
| });
Just for me to understand: there is only 1 follower, but all other (non-leader, non-follower) servers are suspended as well to prevent a failover from messing up the results?
| let followersOutOfSyncNumMetricValue;
| for(let i = 0; i < 500; i++) {
| followersOutOfSyncNumMetricValue = getDBServerMetricSum(onlineServers, followersOutOfSyncNumMetric);
| if (followersOutOfSyncNumMetricValue === 0) {
| continue;
| }
| break;
| }
I understand this loop should wait for the follower to come in sync again. But the 500 iterations seem an unreliable criterion to wait for; shouldn't this rather be a maximum time to wait, and possibly some (short) sleeps between iterations?
goedderz
left a comment
In accordance with the previous comments, we should also add one test for deleting a collection, and one for deleting a database (and checking the metrics after that).
| bool shardInSync{false};
| auto const plannedServers = shardMap.at(shName);
| for (const auto& it : VPackArrayIterator(s)) {
| if (it.stringView() == serverId) {
| shardInSync = true;
| break;
| }
| }
As discussed: shardInSync should stay false if the local database has its leader set to LEADER_NOT_YET_KNOWN, in order to account for resyncs after a reboot.
Compare with this snippet from syncReplicatedShardsWithLeaders:
bool needsResyncBecauseOfRestart = false;
if (lshard.isObject()) { // just in case
theLeader = lshard.get(THE_LEADER);
if (theLeader.isString() &&
theLeader.stringView() ==
maintenance::ResignShardLeadership::LeaderNotYetKnownString) {
needsResyncBecauseOfRestart = true;
}
}
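One hedged way to fold that check into the loop under review, assuming it lives in the same translation unit as the quoted code (so THE_LEADER, maintenance::ResignShardLeadership::LeaderNotYetKnownString, and the velocypack aliases are available); the function boundary and the lshard/currentServers wiring are illustrative, not the PR's actual code:

bool followerInSync(VPackSlice lshard, VPackSlice currentServers,
                    std::string const& serverId) {
  // After a restart the local shard may still carry LEADER_NOT_YET_KNOWN as
  // its leader; treat it as out of sync regardless of what Current says.
  if (lshard.isObject()) {
    VPackSlice theLeader = lshard.get(THE_LEADER);
    if (theLeader.isString() &&
        theLeader.stringView() ==
            maintenance::ResignShardLeadership::LeaderNotYetKnownString) {
      return false;
    }
  }
  // Otherwise keep the existing criterion: this server is listed in
  // Current's servers array for the shard.
  for (auto const& it : VPackArrayIterator(currentServers)) {
    if (it.stringView() == serverId) {
      return true;
    }
  }
  return false;
}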
| shardInSync = true;
| break;
| }
| }
Bug: Follower out-of-sync detection ignores LEADER_NOT_YET_KNOWN state
The follower out-of-sync detection marks a shard as in-sync if the server appears in Current's server list. However, after a reboot, the local shard may have THE_LEADER set to LEADER_NOT_YET_KNOWN (or NOT_YET_TOUCHED), indicating it needs to resync even if Current lists it as in-sync. The code fails to check for this special leader state and incorrectly reports followers as in-sync when they actually require resyncing. This is the same pattern used in syncReplicatedShardsWithLeaders which correctly handles needsResyncBecauseOfRestart.
| const dbFreeDBServer = dbServers.filter(server => server.id !== leaderServer && server.id !== fromServer);
| const toServer = dbFreeDBServer[0].id;
| assertEqual(dbFreeDBServer.length, 1);
| assertNotEqual(fromServer, dbFreeDBServer);
Bug: Test assertion compares string to array incorrectly
The assertion assertNotEqual(fromServer, dbFreeDBServer) compares a server ID string (fromServer) with an array of server objects (dbFreeDBServer). In JavaScript, comparing a string to an array always returns not-equal, making this assertion meaningless - it will always pass regardless of the actual values. The likely intended comparison was assertNotEqual(fromServer, toServer) to verify the source and destination servers are different before moving a shard.
| feature._databaseShardsStats[dbName]
| .increaseNumberOfFollowersOutOfSync();
| }
| }
Bug: Follower out-of-sync metric never tracked for replication2
The new follower out-of-sync tracking code at lines 2306-2321 is unreachable for replication2. At lines 2069-2072, for replication2, all non-leader shards hit continue and skip processing entirely. Leaders then enter the if branch at line 2076 (since THE_LEADER is empty for leaders), never reaching the else block at line 2306 where increaseNumberOfFollowersOutOfSync() is called. This means the new arangodb_shard_followers_out_of_sync_number metric will always be 0 for databases using replication2.
Scope & Purpose
Fixes the issues with the existing shard DBServer metrics.
The affected metrics are:
- arangodb_shards_number
- arangodb_shards_leader_number
- arangodb_shards_out_of_sync
- arangodb_shards_not_replicated

Now these metrics persist throughout Maintenance, and we keep them for every database separately: when we iterate over the dirty databases, we only update the metrics for the databases that contain changes.
This PR also introduces two new metrics:
- arangodb_shards_follower_number => how many follower shards are on this DBServer (calculated as arangodb_shards_number - arangodb_shards_leader_number)
- arangodb_followers_out_of_sync_number => how many follower shards are out of sync on this DBServer

The new metrics are related to follower shards, unlike the existing ones, which only report the state of leaders.
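To illustrate how the per-database bookkeeping relates to the exported totals (and where the derived follower count comes from), here is a hedged sketch; all struct and member names are assumptions based on this description, not the PR's actual code:

#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-database counters, as maintained per dirty database.
struct ShardStatistics {
  std::uint64_t numShards = 0;
  std::uint64_t numLeaderShards = 0;
  std::uint64_t numOutOfSyncShards = 0;
  std::uint64_t numNotReplicatedShards = 0;
  std::uint64_t numFollowersOutOfSync = 0;
};

// Hypothetical aggregate that would be written into the gauges.
struct AggregatedShardMetrics {
  std::uint64_t shards = 0;
  std::uint64_t leaderShards = 0;
  std::uint64_t followerShards = 0;
  std::uint64_t outOfSyncShards = 0;
  std::uint64_t notReplicatedShards = 0;
  std::uint64_t followersOutOfSync = 0;
};

// Sum the per-database statistics. Because every database keeps its own
// entry, the totals stay correct even when only the dirty databases were
// re-counted during the last maintenance run.
AggregatedShardMetrics aggregate(
    std::unordered_map<std::string, ShardStatistics> const& perDatabase) {
  AggregatedShardMetrics total;
  for (auto const& [dbName, stats] : perDatabase) {
    total.shards += stats.numShards;
    total.leaderShards += stats.numLeaderShards;
    total.outOfSyncShards += stats.numOutOfSyncShards;
    total.notReplicatedShards += stats.numNotReplicatedShards;
    total.followersOutOfSync += stats.numFollowersOutOfSync;
  }
  // The follower count is derived: every shard on this DBServer that is not
  // a leader is a follower.
  total.followerShards = total.shards - total.leaderShards;
  return total;
}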
Note
Track per-database shard stats and update only dirty DBs; add follower shard count and follower out-of-sync metrics with docs and tests.
- Track per-database shard statistics in reportInCurrent; update only for dirty databases.
- Add MaintenanceFeature::updateDatabaseStatistics() to set gauges; compute follower count as shards - leaderShards.
- Introduce maintenance::ShardStatistics in MaintenanceFeature; remove old in-function counters; adjust reportInCurrent/phaseTwo signatures and logic.
- New metrics arangodb_shards_follower_number and arangodb_shard_followers_out_of_sync_number; initialize in MaintenanceFeature.
- New test shell-cluster-dbserver-shard-metrics.js covering stability, out-of-sync, not-replicated, and shard move scenarios.

Written by Cursor Bugbot for commit 3c79299. This will update automatically on new commits.