Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

Open
abeyad opened this issue May 30, 2023 · 1 comment
Assignees
Labels
area/xds bug no stalebot Disables stalebot from closing an issue

Comments

@abeyad
Copy link
Contributor

abeyad commented May 30, 2023

When the Bootstrap config has xDS configured to use Envoy gRPC, we can get into a situation where we get a Cluster not available error when trying to establish a new gRPC stream.

This happens when the xDS cluster is not yet initialized, and the cluster that depends on it (e.g. for ADS or SDS using Envoy gRPC) gets initialized first, and attempts to create a connection to a not-yet-initialized cluster where the cluster is not found in the thread-local storage for the main thread.

When the cluster is not found, the gRPC stream schedules a retry attempt for some jittered back-off delay of up to 1 second. Typically, the xDS cluster is initialized by then, so the subsequent retry attempt to establish the gRPC stream, which gets triggered when the timer expires, ends up being successful.

This is not ideal, especially for Envoy Mobile, where we wouldn't want to wait for up to 1 second just for xDS initialization in the app.

This scenario can be triggered under two circumstances:

  1. The static xDS cluster appears after the cluster that depends on it in the Bootstrap config's static_resources repeated field ordering.
  2. We use ADS, where the gRPC service is using Envoy gRPC (i.e. an Envoy cluster in the static_resources field) and the cluster upon which ADS depends is not guaranteed to be fully warmed during initialization. The ClusterManagerImpl calls ads_mux_->start() after adding primary clusters, but depending on the cluster type, it may not yet be warmed when the ClusterManagerInitHelper::addCluster() call finishes executing. Examples of cluster types that will call the ADS mux's start(), even if wait_for_warm_on_init is set, include the STRICT_DNS cluster, which has to make asynchronous calls to resolve the DNS entries for the destination hosts.

For scenario 1, there are a few options to solve the issue:

  1. Have the ClusterManagerImpl go through a first pass where it figures out the dependencies between clusters, then initializes them based on that ordering.
    • This is the most sensible in terms of API experience
    • But adds quite a bit of complexity to already complex code and will likely slow down the time it takes to get all the clusters initialized since we'll have to go through a first pass to figure out dependencies (and at that point, maybe it's not so much more efficient than just the backoff retry mechanism?)
  2. Have a way for clusters to declare their dependencies on each other in the Bootstrap config.
    • But this requires a new configuration knob in the Bootstrap and still requires the API user to do something to declare dependencies, which isn't much different than the third option which is...
  3. Require that clusters that depend on a static xDS cluster are ordered after the cluster they depend upon in the Bootstrap's static_resources field.

For scenario 2, there are a couple options to "solve" the issue:

  1. Move the ads_mux_->start() call to after the initialization is complete for the cluster that ADS is configured with. This will require a first pass to determine cluster dependencies, and adding the ADS mux initialization to the post-cluster-init callback if ADS depends on it.
  2. Just have a caveat for the time being that ADS configured with an Envoy cluster that requires asynchronous initialization will result in a backoff retry that could delay app initialization. For the ADS use cases with Google's Traffic Director that we are thinking of, it will depend on Google gRPC having a target URI which doesn't have the same retry backoff problem that Envoy gRPC does.

Another option that applies to both scenarios is to tighten the backoff timer interval (e.g. to 500ms max instead of 1s).

@abeyad
Copy link
Contributor Author

abeyad commented May 30, 2023

cc @adisuissa @alyssawilk

@abeyad abeyad added the no stalebot Disables stalebot from closing an issue label May 31, 2023
htuch pushed a commit that referenced this issue Jun 2, 2023
Clarification for a workaround to #27702.

Signed-off-by: Ali Beyad <[email protected]>
reskin89 pushed a commit to reskin89/envoy that referenced this issue Jul 11, 2023
…y#27706)

Clarification for a workaround to envoyproxy#27702.

Signed-off-by: Ali Beyad <[email protected]>
Signed-off-by: Ryan Eskin <[email protected]>
@abeyad abeyad self-assigned this Oct 3, 2023
abeyad added a commit to abeyad/envoy that referenced this issue Oct 14, 2023
This commit addresses an issue outlined in
envoyproxy#27702 where, when configuring
xDS to use an ADS gRPC mux connecting to an Envoy cluster (via
EnvoyGrpc), if the cluster requires asynchronous initialization (e.g. a
DNS cluster that must asynchronously resolve the endpoints before it can
be used), then the ADS mux attempted to initialize before the cluster
was ready.  This caused the underlying GrpcStream to fail to establish
the connection and set off a retry timer for [0,1s).

Instead, this commit ensures that we don't try initializing the ADS mux
and connecting to the xDS backend until the Envoy cluster for the xDS
backend is fully initiliazed and ready.

As part of testing this change, we had to move some of the
ClusterManagerImpl initialization functionality to a separate init()
method, instead of running it all in the constructor. Running all of the
ClusterManagerImpl's lifecycle from the constructor prevents derived
classes from overriding any of the functionality (e.g. to override
certain functions in tests), because when running in the base class's
constructor, only base class methods are executed, not the derived
class's.

Signed-off-by: Ali Beyad <[email protected]>
htuch pushed a commit that referenced this issue Oct 22, 2023
…30215)

This commit addresses an issue outlined in
#27702 where, when configuring xDS to use an ADS gRPC mux connecting to an Envoy cluster (via EnvoyGrpc), if the cluster requires asynchronous initialization (e.g. a DNS cluster that must asynchronously resolve the endpoints before it can be used), then the ADS mux attempted to initialize before the cluster was ready. This caused the underlying GrpcStream to fail to establish the connection and set off a retry timer for [0,1s).

Instead, this commit ensures that we don't try initializing the ADS mux and connecting to the xDS backend until the Envoy cluster for the xDS backend is fully initiliazed and ready.

As part of testing this change, we had to move some of the ClusterManagerImpl initialization functionality to a separate init() method, instead of running it all in the constructor. Running all of the ClusterManagerImpl's lifecycle from the constructor prevents derived classes from overriding any of the functionality (e.g. to override certain functions in tests), because when running in the base class's constructor, only base class methods are executed, not the derived class's.

It turns out that several tests in cluster_manager_impl_test.cc depended on a derived class of ClusterManagerImpl which offered mocked behavior, but the overridden methods in the derived class were actually never getting invoked (because they were called from the base ClusterManagerImpl's constructor). We fix those tests in this commit too.

Risk Level: Low
Testing: unit test
Docs Changes: N/A
Release Notes: Fixed a bug (#27702) that caused ADS initialization
to fail on the first attempt and set a back-off retry interval of up to 1 second, if ADS is
using an Envoy Cluster for the backend. The issue was fixed to ensure that ADS initialization
happens after the Envoy Cluster it depends upon has been properly initialized. ADS that does
not depend on an Envoy Cluster (i.e. GoogleGrpc) is not affected by this change.
Platform Specific Features: N/A

Signed-off-by: Ali Beyad <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/xds bug no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

No branches or pull requests

1 participant