-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702
Labels
Comments
This was referenced May 30, 2023
htuch
pushed a commit
that referenced
this issue
Jun 2, 2023
Clarification for a workaround to #27702. Signed-off-by: Ali Beyad <[email protected]>
reskin89
pushed a commit
to reskin89/envoy
that referenced
this issue
Jul 11, 2023
…y#27706) Clarification for a workaround to envoyproxy#27702. Signed-off-by: Ali Beyad <[email protected]> Signed-off-by: Ryan Eskin <[email protected]>
abeyad
added a commit
to abeyad/envoy
that referenced
this issue
Oct 14, 2023
This commit addresses an issue outlined in envoyproxy#27702 where, when configuring xDS to use an ADS gRPC mux connecting to an Envoy cluster (via EnvoyGrpc), if the cluster requires asynchronous initialization (e.g. a DNS cluster that must asynchronously resolve the endpoints before it can be used), then the ADS mux attempted to initialize before the cluster was ready. This caused the underlying GrpcStream to fail to establish the connection and set off a retry timer for [0,1s). Instead, this commit ensures that we don't try initializing the ADS mux and connecting to the xDS backend until the Envoy cluster for the xDS backend is fully initiliazed and ready. As part of testing this change, we had to move some of the ClusterManagerImpl initialization functionality to a separate init() method, instead of running it all in the constructor. Running all of the ClusterManagerImpl's lifecycle from the constructor prevents derived classes from overriding any of the functionality (e.g. to override certain functions in tests), because when running in the base class's constructor, only base class methods are executed, not the derived class's. Signed-off-by: Ali Beyad <[email protected]>
htuch
pushed a commit
that referenced
this issue
Oct 22, 2023
…30215) This commit addresses an issue outlined in #27702 where, when configuring xDS to use an ADS gRPC mux connecting to an Envoy cluster (via EnvoyGrpc), if the cluster requires asynchronous initialization (e.g. a DNS cluster that must asynchronously resolve the endpoints before it can be used), then the ADS mux attempted to initialize before the cluster was ready. This caused the underlying GrpcStream to fail to establish the connection and set off a retry timer for [0,1s). Instead, this commit ensures that we don't try initializing the ADS mux and connecting to the xDS backend until the Envoy cluster for the xDS backend is fully initiliazed and ready. As part of testing this change, we had to move some of the ClusterManagerImpl initialization functionality to a separate init() method, instead of running it all in the constructor. Running all of the ClusterManagerImpl's lifecycle from the constructor prevents derived classes from overriding any of the functionality (e.g. to override certain functions in tests), because when running in the base class's constructor, only base class methods are executed, not the derived class's. It turns out that several tests in cluster_manager_impl_test.cc depended on a derived class of ClusterManagerImpl which offered mocked behavior, but the overridden methods in the derived class were actually never getting invoked (because they were called from the base ClusterManagerImpl's constructor). We fix those tests in this commit too. Risk Level: Low Testing: unit test Docs Changes: N/A Release Notes: Fixed a bug (#27702) that caused ADS initialization to fail on the first attempt and set a back-off retry interval of up to 1 second, if ADS is using an Envoy Cluster for the backend. The issue was fixed to ensure that ADS initialization happens after the Envoy Cluster it depends upon has been properly initialized. ADS that does not depend on an Envoy Cluster (i.e. GoogleGrpc) is not affected by this change. Platform Specific Features: N/A Signed-off-by: Ali Beyad <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
When the Bootstrap config has xDS configured to use Envoy gRPC, we can get into a situation where we get a
Cluster not available
error when trying to establish a new gRPC stream.This happens when the xDS cluster is not yet initialized, and the cluster that depends on it (e.g. for ADS or SDS using Envoy gRPC) gets initialized first, and attempts to create a connection to a not-yet-initialized cluster where the cluster is not found in the thread-local storage for the main thread.
When the cluster is not found, the gRPC stream schedules a retry attempt for some jittered back-off delay of up to 1 second. Typically, the xDS cluster is initialized by then, so the subsequent retry attempt to establish the gRPC stream, which gets triggered when the timer expires, ends up being successful.
This is not ideal, especially for Envoy Mobile, where we wouldn't want to wait for up to 1 second just for xDS initialization in the app.
This scenario can be triggered under two circumstances:
static_resources
repeated field ordering.static_resources
field) and the cluster upon which ADS depends is not guaranteed to be fully warmed during initialization. The ClusterManagerImpl callsads_mux_->start()
after adding primary clusters, but depending on the cluster type, it may not yet be warmed when the ClusterManagerInitHelper::addCluster() call finishes executing. Examples of cluster types that will call the ADS mux'sstart()
, even if wait_for_warm_on_init is set, include the STRICT_DNS cluster, which has to make asynchronous calls to resolve the DNS entries for the destination hosts.For scenario 1, there are a few options to solve the issue:
static_resources
field.For scenario 2, there are a couple options to "solve" the issue:
ads_mux_->start()
call to after the initialization is complete for the cluster that ADS is configured with. This will require a first pass to determine cluster dependencies, and adding the ADS mux initialization to the post-cluster-init callback if ADS depends on it.Another option that applies to both scenarios is to tighten the backoff timer interval (e.g. to 500ms max instead of 1s).
The text was updated successfully, but these errors were encountered: