Support mutable metadata for endpoints #3814
Conversation
This change will force a rebuild in updateDynamicHostList(), if metadata has changed. The concrete use-case this fixes is when an EDS service does not emit removals for endpoints that had metadata changes (e.g.: version changed). If metadata doesn't get updated, the subset load balancer won't be able to create new subsets nor remove inactive ones. Fixes envoyproxy#3810 Signed-off-by: Raul Gutierrez Segales <[email protected]>
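For reference, a minimal sketch of the comparison this adds to updateDynamicHostList() (simplified, not the exact diff; `existing_host` and `hosts_changed` are illustrative names):

```cpp
// Sketch only: compare the incoming endpoint's metadata with the host we
// already track, and latch the new values if they differ.
const bool metadata_changed = !Protobuf::util::MessageDifferencer::Equivalent(
    host->metadata(), existing_host->metadata());
if (metadata_changed) {
  // Latch the new metadata (and the canary flag derived from it) onto the
  // existing host so the subset load balancer can rebuild its subsets.
  existing_host->metadata(host->metadata());
  existing_host->canary(host->canary());
  hosts_changed = true; // signal the caller that the host set must be rebuilt
}
```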
I'm a little unsure that we want the canary/metadata mutation methods in HostDescription, but otherwise this looks good modulo a couple of small things.
(*i)->canary(host->canary());

// If metadata changed, we need to rebuild. See github issue #3810.
health_changed = true;
Probably should rename this var to reflect that it's tracking changes to more than just health.
@@ -109,6 +109,11 @@ parseClusterSocketOptions(const envoy::api::v2::Cluster& config,
  return cluster_options;
}

bool metadataEq(const envoy::api::v2::core::Metadata& lhs,
                const envoy::api::v2::core::Metadata& rhs) {
  return Protobuf::util::MessageDifferencer::Equivalent(lhs, rhs);
I would just inline the call since it's only used once.
* rename `health_changed` to `hosts_changed`, since we track more than health now.
* inline call to Equivalent()

Signed-off-by: Raul Gutierrez Segales <[email protected]>
@zuercher addressed — thanks! Hmm, I don't have a better idea for avoiding the mutation methods in HostDescription... would moving them to Host help?
I don't feel strongly about the mutators and I also don't have a better place to put them.
@@ -684,16 +684,22 @@ bool BaseDynamicClusterImpl::updateDynamicHostList(const HostVector& new_hosts,
                                                   HostVector& hosts_added,
                                                   HostVector& hosts_removed) {
  uint64_t max_host_weight = 1;

  // Did hosts changed?
nit: s/changed/change
    !Protobuf::util::MessageDifferencer::Equivalent(host->metadata(), (*i)->metadata());
if (metadata_changed) {
  (*i)->metadata(host->metadata());
  (*i)->canary(host->canary());
Seems odd here to do the canary update inside the metadata check. Should this be an independent check for people that use canary but not metadata?
@mattklein123 my thinking is that canary is encoded under envoy.lb/canary, so for it to change, metadata has to change — no?
E.g.: https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/endpoint/endpoint.proto#L64
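(For context, a hedged sketch of how the canary flag is read out of that metadata, assuming the usual well-known-names helpers; this is not necessarily the exact call site touched by this PR:)

```cpp
// Sketch: canary is just a value under the "envoy.lb" metadata filter, so a
// canary flip always shows up as a metadata change.
bool canary = Config::Metadata::metadataValue(
                  lb_endpoint.metadata(), Config::MetadataFilters::get().ENVOY_LB,
                  Config::MetadataEnvoyLbKeys::get().CANARY)
                  .bool_value();
```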
I see. Hmm. 2 questions:
- What does setting canary actually do then? Is that just for debugging output?
- If we are setting these inline, this is not thread safe unfortunately. Canary is easy to fix by making it atomic, but metadata is not and would need a lock (probably a reader/writer lock if we go that direction).
Thoughts? Whatever we decide we should add more comments here.
So for:
Question 1:
I was setting canary blindly just in case it had changed, given the metadata as a whole changed. This is not (just) for debugging, but to address the case in which an endpoint (e.g.: 10.0.0.2:80) had the following transitions in EDS:
# eds snapshot v1
10.0.0.2:80 canary:false
# eds snapshot v2
10.0.0.2:80 canary:true
That is, it's a metadata change like any other, and it needs to be updated so that subsets can be properly recreated. Was this what you were asking? I suspect maybe not...
Question 2:
Ah, I thought this was all computed for each worker and thus thread safe. I'll add locking then.
And yeah, I'll add more comments too. Let's clear up question 1 before we move forward, though.
Should the metadata shared ptr be TLS and we post updates to it, which will then be applied on individual worker threads outside other processing? I think we would still want to latch where you suggest, but we could avoid having to take mutexes and worry about overlap with other execution by doing this.
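A rough sketch of what that could look like (entirely hypothetical, and `setThreadLocalMetadata()` is a stand-in, not a real API): the main thread builds the new metadata once and posts the shared_ptr to each worker's dispatcher, so each worker swaps its local copy without taking a lock.

```cpp
// Hypothetical sketch of the TLS/post alternative.
auto new_metadata =
    std::make_shared<const envoy::api::v2::core::Metadata>(updated_metadata);
for (Event::Dispatcher* dispatcher : worker_dispatchers) {
  dispatcher->post([host, new_metadata]() {
    // Runs on the worker thread, outside other processing that might be
    // reading the old metadata snapshot.
    host->setThreadLocalMetadata(new_metadata); // hypothetical setter
  });
}
```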
Another alternative might be to treat metadata differences as different hosts -- push a new host with the new metadata into hosts_added and have the old one appear in hosts_removed? I think then we don't have to mess with locking?
Another alternative might be to treat metadata differences as different hosts -- push a new host with the new metadata into hosts_added and have the old one appear in hosts_removed? I think then we don't have to mess with locking?
Yeah, this occurred to me also. If this ends up getting super complicated (which it seems to be) I think we might want to consider this approach. Basically just consider this a new host and do a remove/add operation.
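Roughly (illustrative only, not a real diff; `existing_host` and `final_hosts` are placeholder names), the add/remove variant inside updateDynamicHostList() would look like:

```cpp
// Sketch of the add/remove alternative: a metadata change makes the incoming
// endpoint a brand-new host instead of mutating the existing one in place.
if (!Protobuf::util::MessageDifferencer::Equivalent(host->metadata(),
                                                    existing_host->metadata())) {
  hosts_added.push_back(host);            // new host carrying the new metadata
  hosts_removed.push_back(existing_host); // old host with the stale metadata
} else {
  final_hosts.push_back(existing_host);   // unchanged: keep reusing the existing host
}
```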
Hmm, intuitively I'd rather not treat this as remove/add... it sounds more heavy-handed than needed.
So do you really think it's too complicated as it stands now? Shall we explore the remove/add route then?
FWIW, I am testing the current version of this patch now since we really really really need mutable metadata for our setup.
I will review your current iteration in a bit. Let's hold here for now until we can review. Thanks for iterating on this. :)
Signed-off-by: Raul Gutierrez Segales <[email protected]>
Address Matt's review and the previous lack of thread-safety:
* the canary attribute is now wrapped in a std::atomic
* the Metadata attribute is wrapped in a shared_ptr and lock protected for updates

Signed-off-by: Raul Gutierrez Segales <[email protected]>
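In outline, the write side of that change looks roughly like this (a sketch with assumed member names, not the verbatim diff):

```cpp
// Sketch only: canary_ becomes atomic, metadata_ becomes a shared_ptr that is
// replaced under a mutex so the main-thread writer can't race readers.
void canary(bool is_canary) override { canary_ = is_canary; } // std::atomic<bool> canary_

void metadata(const envoy::api::v2::core::Metadata& new_metadata) override {
  std::unique_lock<std::shared_timed_mutex> lock(metadata_mutex_);
  metadata_ = std::make_shared<envoy::api::v2::core::Metadata>(new_metadata);
}
```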
@mattklein123 something like 15259b9?
const envoy::api::v2::core::Metadata& metadata() const override { return metadata_; }
void canary(bool is_canary) override { canary_ = is_canary; }
const std::shared_ptr<envoy::api::v2::core::Metadata> metadata() const override {
  return metadata_;
This is not thread safe on read, since shared_ptr needs to both read the value and increment the ref count. This will need to be guarded with a read lock on the shared mutex. Two additional things:
- Can you look into adding lock annotations for this lock? IIRC this doesn't work correctly for reader/writer locks last I checked so it might not be possible, but I know @jmarantz did some work in this area so maybe it is now? Also, are we supposed to use absl locks now? Check https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread.h. If there is no R/W lock there now just add a TODO comment here about adding that later.
- Can you add some comment here about perf? If lock contention becomes an issue we can refactor later to use TLS and posting from the main thread.
Yup — will do. Thanks for the review!
* protect read access to the shared_ptr with a shared_lock
* add comment wrt perf, moving to absl locks and using lock annotations

I am still researching if lock annotations are currently working correctly with R/W locks.

Signed-off-by: Raul Gutierrez Segales <[email protected]>
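The read side from that commit, sketched (again with assumed names, not the verbatim diff):

```cpp
// Sketch of the reader-locked getter: copying a shared_ptr is not atomic, so
// the read also takes the (shared) lock.
const std::shared_ptr<envoy::api::v2::core::Metadata> metadata() const override {
  // TODO: move to an absl-based R/W lock wrapper and add lock annotations once
  // they work correctly for reader/writer locks.
  // Perf note: if contention here becomes a problem, this can be reworked to
  // keep the metadata in TLS and post updates from the main thread.
  std::shared_lock<std::shared_timed_mutex> lock(metadata_mutex_);
  return metadata_;
}
```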
Signed-off-by: Raul Gutierrez Segales <[email protected]>
Nice work. LGTM.
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)

Signed-off-by: Snow Pettersen <[email protected]>
In `SubsetLoadBalancer::HostSubsetImpl::update()` we have repeated calls to `predicate()`. These calls are expensive mostly because of `Envoy::Config::Metadata::metadataValue()` calls. Furthermore, metadata is guarded by a lock since envoyproxy#3814 — so there is some contention too.

This change improves things by doing two things:

* avoid use of `predicate()` to derive `healthy_hosts_per_locality`
* cache `predicate()` calls for the special case of only metadata changing (current hosts == hosts added)

Signed-off-by: Raul Gutierrez Segales <[email protected]>
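The caching part could be as simple as memoizing `predicate()` per host for that special case (a sketch with illustrative names, not the actual follow-up patch):

```cpp
// Sketch: when only metadata changed (current hosts == hosts added), each
// predicate() result can be computed once per host and reused.
std::unordered_map<const Host*, bool> predicate_cache;
const auto cached_predicate = [&](const Host& host) {
  auto it = predicate_cache.find(&host);
  if (it == predicate_cache.end()) {
    it = predicate_cache.emplace(&host, predicate(host)).first;
  }
  return it->second;
};
```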
This change will force a rebuild in updateDynamicHostList(), if metadata
has changed.
The concrete use-case this fixes is when an EDS service does not emit
removals for endpoints that had metadata changes (e.g.: version
changed). If metadata doesn't get updated, the subset load balancer
won't be able to create new subsets nor remove inactive ones.
Fixes #3810
Signed-off-by: Raul Gutierrez Segales <[email protected]>