
Support mutable metadata for endpoints #3814

Merged: 11 commits, Jul 10, 2018

Conversation

@rgs1 (Member) commented Jul 9, 2018

This change forces a rebuild in updateDynamicHostList() if metadata
has changed.

The concrete use-case this fixes is when an EDS service does not emit
removals for endpoints that had metadata changes (e.g. a version
change). If metadata doesn't get updated, the subset load balancer
won't be able to create new subsets or remove inactive ones.
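
For context, the rebuild decision comes down to a proto comparison. A minimal sketch of that check, assuming Envoy's Metadata proto from envoy/api/v2/core/base.proto (the helper name is illustrative; the diff below calls MessageDifferencer directly):

#include "google/protobuf/util/message_differencer.h"

// Two Metadata protos are "equivalent" when their field values match,
// treating unset fields as equal to their default values.
bool metadataChanged(const envoy::api::v2::core::Metadata& current,
                     const envoy::api::v2::core::Metadata& updated) {
  return !google::protobuf::util::MessageDifferencer::Equivalent(current, updated);
}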

Fixes #3810

Signed-off-by: Raul Gutierrez Segales <[email protected]>

@zuercher (Member) left a comment:

I'm a little unsure that we want the canary/metadata mutation methods in HostDescription, but otherwise this looks good modulo a couple of small things.

(*i)->canary(host->canary());

// If metadata changed, we need to rebuild. See github issue #3810.
health_changed = true;
@zuercher:

Probably should rename this var to reflect that it's tracking changes to more than just health.

@@ -109,6 +109,11 @@ parseClusterSocketOptions(const envoy::api::v2::Cluster& config,
  return cluster_options;
}

bool metadataEq(const envoy::api::v2::core::Metadata& lhs,
                const envoy::api::v2::core::Metadata& rhs) {
  return Protobuf::util::MessageDifferencer::Equivalent(lhs, rhs);
@zuercher:

I would just inline the call since it's only used once.
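
Inlined, the check would read roughly like this (a sketch of the suggestion; it matches the snippet reviewed further down):

const bool metadata_changed =
    !Protobuf::util::MessageDifferencer::Equivalent(host->metadata(), (*i)->metadata());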

* rename `health_changed` to `hosts_changed`, since we track
  more than health now.
* inline call to Equivalent()

Signed-off-by: Raul Gutierrez Segales <[email protected]>
@rgs1 (Member Author) commented Jul 9, 2018

@zuercher addressed — thanks! Hmm, I don't have a better idea for avoiding the mutation methods in HostDescription... would moving them to Host help?

zuercher previously approved these changes Jul 9, 2018
@zuercher (Member) left a comment:

I don't feel strongly about the mutators and I also don't have a better place to put them.

@@ -684,16 +684,22 @@ bool BaseDynamicClusterImpl::updateDynamicHostList(const HostVector& new_hosts,
                                                   HostVector& hosts_added,
                                                   HostVector& hosts_removed) {
  uint64_t max_host_weight = 1;

  // Did hosts changed?
Member:

nit: s/changed/change

    !Protobuf::util::MessageDifferencer::Equivalent(host->metadata(), (*i)->metadata());
if (metadata_changed) {
  (*i)->metadata(host->metadata());
  (*i)->canary(host->canary());
@mattklein123 (Member):

Seems odd here to do the canary update inside the metadata check. Should this be an independent check for people that use canary but not metadata?

Copy link
Member Author

@rgs1 rgs1 Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 my thinking is that canary is encoded under envoy.lb/canary, so for it to change, the metadata has to change — no?

E.g.: https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/endpoint/endpoint.proto#L64
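
For reference, a sketch of how the canary flag could be read out of endpoint metadata, per the filter ("envoy.lb") and key ("canary") documented in the proto linked above; the isCanary helper is hypothetical:

// Metadata.filter_metadata is a map<string, google.protobuf.Struct>;
// the canary flag lives under filter "envoy.lb", key "canary".
bool isCanary(const envoy::api::v2::core::Metadata& metadata) {
  const auto& filters = metadata.filter_metadata();
  const auto filter_it = filters.find("envoy.lb");
  if (filter_it == filters.end()) {
    return false;
  }
  const auto& fields = filter_it->second.fields();
  const auto field_it = fields.find("canary");
  return field_it != fields.end() && field_it->second.bool_value();
}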

@mattklein123 (Member):

I see. Hmm. 2 questions:

  1. What does setting canary actually do then? Is that just for debugging output?
  2. If we are setting these inline, this is not thread safe unfortunately. Canary is easy to fix by making it atomic, but metadata is not and would need a lock (probably a reader/writer lock if we go that direction).

Thoughts? Whatever we decide we should add more comments here.


@rgs1 (Member Author), Jul 9, 2018:

So:

Question 1:
I was setting canary blindly just in case it had changed, given that the metadata as a whole changed. This is not (just) for debugging, but to address the case in which an endpoint (e.g. 10.0.0.2:80) had the following transition in EDS:

# eds snapshot v1
10.0.0.2:80 canary:false

# eds snapshot v2
10.0.0.2:80 canary:true

That is, it's just a metadata change like any other, and it needs to be applied so that subsets can be properly recreated. Was this what you were asking? I suspect maybe not...

Question 2:
Ah, I thought this was all computed for each worker and thus thread safe. I'll add locking then.

And yeah, I'll add more comments too. Let's clear up question 1 before we move forward though.

Member:

Should the metadata shared ptr be TLS and we post updates to it, which will then be applied on individual worker threads outside other processing? I think we would still want to latch where you suggest, but we could avoid having to take mutexes and worry about overlap with other execution by doing this.
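
A rough sketch of that alternative, with Envoy's event-loop abstraction simplified (the dispatcher list and the update body here are illustrative):

#include <memory>
#include <vector>

// The main thread posts the new metadata to every worker's event loop, so
// each worker-local pointer is only ever touched on its owning thread and
// no lock is needed.
void postMetadataUpdate(
    const std::vector<Event::Dispatcher*>& worker_dispatchers,
    std::shared_ptr<const envoy::api::v2::core::Metadata> new_metadata) {
  for (Event::Dispatcher* dispatcher : worker_dispatchers) {
    dispatcher->post([new_metadata]() {
      // Swap the worker-local metadata pointer here; this runs on the
      // worker's own thread, outside other processing.
    });
  }
}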

Member:

Another alternative might be to treat metadata differences as different hosts -- push a new host with the new metadata into hosts_added and have the old one appear in hosts_removed? I think then we don't have to mess with locking?

Member:

> Another alternative might be to treat metadata differences as different hosts -- push a new host with the new metadata into hosts_added and have the old one appear in hosts_removed? I think then we don't have to mess with locking?

Yeah, this occurred to me also. If this ends up getting super complicated (which it seems to be) I think we might want to consider this approach. Basically just consider this a new host and do a remove/add operation.
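
In code, that alternative would look roughly like this inside updateDynamicHostList() (createHost() and new_host_config are hypothetical names):

// Sketch: treat a metadata difference as a host replacement rather than an
// in-place mutation, which sidesteps locking entirely.
if (metadata_changed) {
  hosts_added.push_back(createHost(new_host_config)); // fresh host with the new metadata
  hosts_removed.push_back(*i);                        // old host gets retired
}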

@rgs1 (Member Author):

Hmm, intuitively I'd rather not treat this as remove/add... it sounds more heavy-handed than needed.

So do you really think it's too complicated as is now? Shall we explore the remove/add route then?

FWIW, I am testing the current version of this patch now since we really really really need mutable metadata for our setup.

Member:

I will review your current iteration in a bit. Let's hold here for now until we can review. Thanks for iterating on this. :)

Signed-off-by: Raul Gutierrez Segales <[email protected]>
Address Matt's review and the previous lack of thread-safety:

* the canary attribute is now wrapped in a std::atomic
* the Metadata attribute is wrapped in a shared_ptr and lock
  protected for updates

Signed-off-by: Raul Gutierrez Segales <[email protected]>
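
A simplified sketch of the shape this commit describes (not the actual HostDescriptionImpl; member names are illustrative):

#include <atomic>
#include <memory>
#include <mutex>

class HostDescriptionSketch {
public:
  // canary is a plain flag, so an atomic makes updates safe on its own.
  bool canary() const { return canary_; }
  void canary(bool is_canary) { canary_ = is_canary; }

  // Metadata updates swap a shared_ptr while holding the lock.
  void metadata(const envoy::api::v2::core::Metadata& new_metadata) {
    std::lock_guard<std::mutex> lock(metadata_mutex_);
    metadata_ = std::make_shared<envoy::api::v2::core::Metadata>(new_metadata);
  }

private:
  std::atomic<bool> canary_{false};
  mutable std::mutex metadata_mutex_;
  std::shared_ptr<envoy::api::v2::core::Metadata> metadata_;
};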
@rgs1 (Member Author) commented Jul 9, 2018

@mattklein123 something like 15259b9?

Raul Gutierrez Segales added 4 commits July 9, 2018 17:16
Signed-off-by: Raul Gutierrez Segales <[email protected]>
Signed-off-by: Raul Gutierrez Segales <[email protected]>
Signed-off-by: Raul Gutierrez Segales <[email protected]>
Signed-off-by: Raul Gutierrez Segales <[email protected]>
  const envoy::api::v2::core::Metadata& metadata() const override { return metadata_; }
  void canary(bool is_canary) override { canary_ = is_canary; }
  const std::shared_ptr<envoy::api::v2::core::Metadata> metadata() const override {
    return metadata_;
Member:

This not thread safe on read, since shared_ptr needs to both read the value and increment the ref count. This will need to be guarded with a read lock on the shared mutex. Two additional things:

  1. Can you look into adding lock annotations for this lock? IIRC this doesn't work correctly for reader/writer locks last I checked so it might not be possible, but I know @jmarantz did some work in this area so maybe it is now? Also, are we supposed to use absl locks now? Check https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread.h. If there is no R/W lock there now just add a TODO comment here about adding that later.

  2. Can you add some comment here about perf? If lock contention becomes an issue we can refactor later to use TLS and posting from the main thread.
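
A sketch of the guarded read being asked for, extending the earlier sketch and using a standard reader/writer mutex as a stand-in (per the comment, absl locks or lock annotations may be preferable):

#include <memory>
#include <shared_mutex>

class MetadataHolder {
public:
  std::shared_ptr<const envoy::api::v2::core::Metadata> metadata() const {
    // Shared (reader) lock: the pointer copy and refcount bump happen
    // under the lock, and concurrent readers don't serialize.
    std::shared_lock<std::shared_timed_mutex> lock(mutex_);
    return metadata_;
  }
  void metadata(const envoy::api::v2::core::Metadata& new_metadata) {
    // Exclusive (writer) lock for the swap.
    std::unique_lock<std::shared_timed_mutex> lock(mutex_);
    metadata_ = std::make_shared<const envoy::api::v2::core::Metadata>(new_metadata);
  }

private:
  mutable std::shared_timed_mutex mutex_;
  std::shared_ptr<const envoy::api::v2::core::Metadata> metadata_;
};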

@rgs1 (Member Author):

Yup — will do. Thanks for the review!

Raul Gutierrez Segales added 3 commits July 10, 2018 12:55
* protect read access to the shared_ptr with a shared_lock
* add comment wrt perf, moving to absl locks and using lock
  annotations

I am still researching if lock annotations are currently working
correctly with R/W locks.

Signed-off-by: Raul Gutierrez Segales <[email protected]>
Signed-off-by: Raul Gutierrez Segales <[email protected]>
@mattklein123 (Member) left a comment:

Nice work. LGTM.

@mattklein123 mattklein123 merged commit 810ec2e into envoyproxy:master Jul 10, 2018
snowp added a commit to snowp/envoy that referenced this pull request Jul 12, 2018
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)

Signed-off-by: Snow Pettersen <[email protected]>
rgs1 pushed a commit to rgs1/envoy that referenced this pull request Jul 16, 2018
In `SubsetLoadBalancer::HostSubsetImpl::update()` we have repeated calls
to `predicate()`. These calls are expensive mostly because of
`Envoy::Config::Metadata::metadataValue()` calls. Furthermore, metadata
is guarded by a lock since envoyproxy#3814 — so there is some contention too.

This change improves things in two ways:
* avoid use of `predicate()` to derive `healthy_hosts_per_locality`
* cache `predicate()` calls for the special case of only metadata
  changing (current hosts == hosts added)

Signed-off-by: Raul Gutierrez Segales <[email protected]>
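
A rough sketch of the caching described in this commit (the factory and container here are illustrative, not the actual implementation; Host and predicate come from the subset load balancer):

#include <functional>
#include <memory>
#include <unordered_map>

// For a metadata-only update (current hosts == hosts added), a host's
// predicate() result can't change between the repeated calls, so memoize
// it per host instead of re-evaluating metadataValue() each time.
auto makeCachedPredicate(std::function<bool(const Host&)> predicate) {
  auto cache = std::make_shared<std::unordered_map<const Host*, bool>>();
  return [predicate = std::move(predicate), cache](const Host& host) {
    auto it = cache->find(&host);
    if (it == cache->end()) {
      it = cache->emplace(&host, predicate(host)).first;
    }
    return it->second;
  };
}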
mattklein123 pushed a commit that referenced this pull request Jul 18, 2018