Add metric yaml #22168
base: devel
Conversation
```yaml
name: arangodb_shards_follower_number
introducedIn: "3.12.7"
```
Isn't this the same as arangodb_metadata_number_follower_shards.yaml?
> Isn't this the same as arangodb_metadata_number_follower_shards.yaml?
I added a separate file for the coordinator metrics, as we previously agreed. If you prefer to reuse the existing `arangodb_metadata_number_follower_shards.yaml`, we need to update it by adding `coordinator` to its `exposedBy` section.
I guess this file (`arangodb_shards_follower_number.yaml`) was taken from PR #22121 (I assume as an example to work with). It can be deleted in this branch, as it will be added when the other one is merged. The same goes for `arangodb_shard_followers_out_of_sync_number.yaml`.
Deleting files.
KVS85 left a comment:
See review.
```
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_follower_shards
introducedIn: "3.12.7"
```
Please put 3.12.8 here instead.
```
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_not_replicated_shards
introducedIn: "3.12.7"
```
Please put 3.12.8 here instead.
```
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_out_of_sync_shards
introducedIn: "3.12.7"
```
Please put 3.12.8 here instead.
```
@@ -0,0 +1,36 @@
name: arangodb_metadata_total_number_of_shards
introducedIn: "3.12.7"
```
Please put 3.12.8 here instead.
```
@@ -0,0 +1,23 @@
name: arangodb_shard_followers_out_of_sync_number
introducedIn: "3.12.7"
```
Please put 3.12.8 here instead.
```cpp
updateMetadataMetrics();

// Update coordinator specific shard metrics computed from Plan
if (_metadataMetrics.has_value()) {
  updateCoordinatorPlanMetrics();
}
```
Until now, updateMetadataMetrics() has been the (only) method that updates _metadataMetrics. Now you needed to add another one to be called during loadCurrent() (updateCoordinatorCurrentMetrics()), and so I also get the sibling updateCoordinatorPlanMetrics() here.
Nevertheless, the current situation with these three methods is a little confusing, at least because updateMetadataMetrics() no longer updates (all the) metadata metrics.
To mitigate that, you could do one or both of the following (see the sketch after this list):

- change the names to clarify their respective responsibilities
- merge `updateCoordinatorPlanMetrics()` with `updateMetadataMetrics()`

Later, when we follow through with loading Plan and Current synchronously, it can all be merged into one call.
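For illustration, a minimal sketch of the merge option; the structure is hypothetical, only updateMetadataMetrics(), updateCoordinatorPlanMetrics(), and _metadataMetrics are names from this PR:

```cpp
// Sketch only: a single Plan-side entry point, so that loadPlan() has
// one call covering all Plan-derived metadata metrics.
void ClusterInfo::updateMetadataMetrics() {
  if (!_metadataMetrics.has_value()) {
    return;  // metrics are not enabled on this instance
  }
  // ... existing metadata metric updates ...

  // ... plus the Plan-derived shard counters that previously lived in
  //     updateCoordinatorPlanMetrics() ...
}
```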
```cpp
if (it == _shardsToPlanServers.end() || !it->second) {
  // no plan servers recorded
  ++notReplicated;
  continue;
}
```
This case should be impossible; _shards and _shardsToPlanServers have to be consistent here.
It's surely sensible to be defensive and handle it safely regardless. But as it hints at a bug at a different place, this should get a maintainer-mode assertion (e.g. TRI_ASSERT(it != _shardsToPlanServers.end() && it->second)), and possibly a log message with an error in production, though that doesn't feel necessary to me.
FYI, maintainer-mode assertions (TRI_ASSERT) are only compiled into maintainer builds, and are not part of release builds.
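For illustration, combining the assertion with the kept defensive check might look like this (shardId, the map lookup, and the counter are taken from the quoted diff):

```cpp
auto it = _shardsToPlanServers.find(shardId);
// Plan bookkeeping must be consistent here; a violation points at a bug
// elsewhere, so fail loudly in maintainer builds ...
TRI_ASSERT(it != _shardsToPlanServers.end() && it->second);
if (it == _shardsToPlanServers.end() || !it->second) {
  // ... but stay defensive in release builds, where TRI_ASSERT is a no-op.
  ++notReplicated;
  continue;
}
```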
```cpp
} else {
  ++notReplicated;
}
```
As far as I can tell, this case should not be possible, either. Let's also add a maintainer-mode assertion (e.g. TRI_ASSERT(!servers.empty())) to cement this.
```cpp
uint64_t notReplicated = 0;

// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);
```
This is possibly dangerous, as we're already holding a write lock on _currentProt.lock. We need to be certain this can't introduce deadlocks: e.g. if another code path acquires a write lock on _planProt.lock and then tries to get a read lock on _currentProt.lock, it could deadlock with this code.
AFAIR, the ClusterInfo generally only uses one of those locks at a time. But I am not certain; I haven't checked the current code. If there is already code that takes both, we must follow a common pattern that avoids deadlocks. And regardless of whether such code already exists, this should be documented at both locks.
Alternatively, we could change the caller to hold only a read lock instead of a write lock on _currentProt.lock: as long as we only ever hold both locks as read locks at the same time, this prevents deadlocks as well.
I think I prefer the latter solution (only holding read locks), as it also means holding the write lock for a shorter amount of time, and it reduces contention with readers, which can otherwise be detrimental. Note that this holds not only for updateCoordinatorPlanMetrics, but also for the existing function updateMetadataMetrics(): it would be preferable to execute all of them under read locks only. Note that loadPlan and loadCurrent also hold _planProt.mutex and _currentProt.mutex, respectively, for their entire duration: this can be relied on to prevent concurrent calls to the update*Metrics functions. See the sketch below.
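A minimal sketch of the read-locks-only variant, assuming the members shown in the diffs (the method body is elided):

```cpp
// Sketch: take both protectors as read locks only. This cannot deadlock
// as long as no code path ever holds one of them as a write lock while
// acquiring the other; that rule should be documented at both locks.
// Concurrent calls are already prevented by _currentProt.mutex, which
// loadCurrent() holds for its entire duration.
void ClusterInfo::updateCoordinatorCurrentMetrics() {
  READ_LOCKER(currentLocker, _currentProt.lock);
  READ_LOCKER(planLocker, _planProt.lock);
  // ... compute the shard counters from _shardsToPlanServers and
  //     _shardsToCurrentServers, then store them into _metadataMetrics ...
}
```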
```cpp
auto const& shardId = it.first;
auto const& serversPtr = it.second;
++totalShards;
size_t currentN = (serversPtr ? serversPtr->size() : 0);
```
serversPtr must never be null, or there's a bug somewhere else. Double-checking it is absolutely fine, but I'd also add a maintainer-mode assertion to spot possible bugs during testing (e.g. TRI_ASSERT(serversPtr != nullptr)).
```cpp
if (currentN > 0) {
  followers += (currentN - 1);
}
```
Similarly: In addition to serversPtr not being null, I think *serversPtr must never be empty. I'd also add a maintainer-mode assertion to that end. Though in this case, I am not 100% certain, so please do keep the check. :-)
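For illustration, both assertions could sit right next to the checks that stay in place (a sketch using the variables from the quoted diff):

```cpp
// Maintainer builds: flag inconsistent bookkeeping immediately.
TRI_ASSERT(serversPtr != nullptr);
TRI_ASSERT(serversPtr == nullptr || !serversPtr->empty());
// Release builds compile TRI_ASSERT away, so keep the defensive checks.
size_t currentN = (serversPtr ? serversPtr->size() : 0);
if (currentN > 0) {
  followers += (currentN - 1);
}
```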
```cpp
// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);

for (auto const& it : _shardsToCurrentServers) {
```
Note that currently, Current can have stale entries, i.e. for collections/shards that have already been deleted. This could cause miscounts. It may help to think of the Plan as the actual cluster-wide metadata for e.g. a collection, and the corresponding Current entries as a last reported state.
I suggest flipping it around: loop over _shardsToPlanServers instead, and look up the corresponding entries in _shardsToCurrentServers (see the sketch below).
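A sketch of the flipped loop; the map members are those from the quoted diffs, the local names are placeholders:

```cpp
// Plan is authoritative, so iterate it and look up the last reported
// Current state; stale Current entries for already-deleted shards are
// then skipped automatically.
for (auto const& [shardId, plannedServers] : _shardsToPlanServers) {
  TRI_ASSERT(plannedServers != nullptr);
  ++totalShards;
  auto cur = _shardsToCurrentServers.find(shardId);
  size_t currentN = (cur != _shardsToCurrentServers.end() && cur->second)
                        ? cur->second->size()
                        : 0;
  if (currentN > 1) {
    followers += currentN - 1;  // all reported replicas except the leader
  }
}
```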
```cpp
_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
                                                  std::memory_order_relaxed);
```
These two metrics should only be updated in updateCoordinatorCurrentMetrics(). Suggested change (remove these lines):

```diff
-_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
-_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
-                                                  std::memory_order_relaxed);
```
Which also means we don't need to calculate notReplicated here (and we can't, really).
```cpp
_metadataMetrics->totalNumberOfShards.store(totalShards,
                                            std::memory_order_relaxed);
_metadataMetrics->numberFollowerShards.store(followers,
                                             std::memory_order_relaxed);
```
These two metrics are already being reported by updateCoordinatorCurrentMetrics(), and should not be reported here as well. Suggested change (remove these lines):

```diff
-_metadataMetrics->totalNumberOfShards.store(totalShards,
-                                            std::memory_order_relaxed);
-_metadataMetrics->numberFollowerShards.store(followers,
-                                             std::memory_order_relaxed);
```
```cpp
// out of sync: leader changed or missing
if (plannedN == 0) {
  // No plan present: treat as out-of-sync
  ++outOfSync;
} else if (currentLeader.empty()) {
  ++outOfSync;
} else if (!plannedLeader.empty() && plannedLeader != currentLeader) {
  ++outOfSync;
}

if (plannedN > currentN) {
  ++notReplicated;
}
```
IMO outOfSync and notReplicated don't quite count the right thing; let me try to explain my idea of them (high-level, skipping over some details):

- If there is only one in-sync replica (the leader), the shard should be counted as notReplicated. This is also the case for shards that are planned with replicationFactor: 1 (which means the Plan already contains the leader as the only replica).
- If any follower from Plan is missing from the list in Current, the shard is regarded as outOfSync.

Note that a shard can be counted as both notReplicated and outOfSync, as either one of them, or as neither; see the sketch after this list.
But note that there are some additional edge cases to consider, e.g. during a leader failover or a move-shard operation. Let's discuss these separately.
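A sketch of that counting, evaluated once per planned shard; plannedServers and inSyncServers are placeholder names for the replicas listed in Plan and the in-sync replicas reported in Current:

```cpp
size_t plannedN = plannedServers->size();            // replicas in Plan
size_t inSyncN = inSyncServers ? inSyncServers->size() : 0;

// Only the leader is in sync; this also covers shards planned with
// replicationFactor: 1, where Plan lists the leader as the only replica.
if (inSyncN <= 1) {
  ++notReplicated;
}
// At least one follower from Plan is missing from Current's in-sync list.
if (inSyncN < plannedN) {
  ++outOfSync;
}
// A shard can thus land in both buckets, in one of them, or in neither.
```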
Note
Add documentation for six new shard-related metrics (totals, not replicated/out-of-sync, follower counts) across coordinator and DBServer.
- Documentation/Metrics/arangodb_metadata_total_number_of_shards.yaml: arangodb_metadata_total_number_of_shards
- Documentation/Metrics/arangodb_metadata_number_follower_shards.yaml: arangodb_metadata_number_follower_shards
- Documentation/Metrics/arangodb_metadata_number_not_replicated_shards.yaml: arangodb_metadata_number_not_replicated_shards
- Documentation/Metrics/arangodb_metadata_number_out_of_sync_shards.yaml: arangodb_metadata_number_out_of_sync_shards
- Documentation/Metrics/arangodb_shards_follower_number.yaml: arangodb_shards_follower_number
- Documentation/Metrics/arangodb_shard_followers_out_of_sync_number.yaml: arangodb_shard_followers_out_of_sync_number

All six are gauges with introducedIn: "3.12.7", including descriptions and troubleshooting guidance.

Written by Cursor Bugbot for commit 35cf940.