Azure cloudprovider retry using flowcontrol #46660
Conversation
An initial attempt at engaging exponential backoff for API error responses. Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE cloudprovider backoff.
Hi @jackfrancis. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I know very little about azure API behavior. /unassign
@k8s-bot ok to test
/cc @brendandburns
@@ -0,0 +1,149 @@
/*
Copyright 2016 The Kubernetes Authors.
nit: 2017
if err != nil {
	return true
}
if resp.StatusCode == 429 || resp.StatusCode == 500 {
Should this be 5xx instead of just 500?
Also, please use http.StatusInternalServerError etc. rather than hard-coded constants.
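The suggestion above can be sketched as follows. This is an illustrative predicate only (the function name is hypothetical, and the PR's real code takes an `autorest.Response`): it retries on 429 plus the whole 5xx family, using `net/http` constants instead of bare numbers.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryStatusCode reports whether an HTTP status code suggests the
// request is worth retrying: throttling (429) or any server-side error (5xx).
func shouldRetryStatusCode(code int) bool {
	if code == http.StatusTooManyRequests { // 429: throttled, retry after backoff
		return true
	}
	// Any 5xx server error is a candidate for retry.
	return code >= http.StatusInternalServerError && code <= 599
}

func main() {
	for _, code := range []int{200, 429, 500, 503} {
		fmt.Printf("%d retry=%v\n", code, shouldRetryStatusCode(code))
	}
}
```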
// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
func isSuccessHTTPResponse(resp autorest.Response) bool {
	// TODO determine the complete set of success conditions
	if resp.StatusCode == 200 || resp.StatusCode == 201 || resp.StatusCode == 202 {
Use http.StatusOK etc. instead of hard-coded constants.
@@ -177,6 +179,9 @@ func NewCloud(configReader io.Reader) (cloudprovider.Interface, error) {
az.StorageAccountClient = storage.NewAccountsClientWithBaseURI(az.Environment.ResourceManagerEndpoint, az.SubscriptionID)
az.StorageAccountClient.Authorizer = servicePrincipalToken

// 1 qps, up to 5 burst when in flowcontrol; i.e., aggressive backoff enforcement
az.operationPollRateLimiter = flowcontrol.NewTokenBucketRateLimiter(1, 5)
You don't ever use this as far as I can tell.
I think we really want this to be used.
Let's make this configurable (and disable-able) via config file.
@brendandburns @colemickens would the --cloud-config → Config struct be the best configuration vector for this?
yes.
// CreateOrUpdateSGWithRetry invokes az.SecurityGroupsClient.CreateOrUpdate with exponential backoff retry
func (az *Cloud) CreateOrUpdateSGWithRetry(sg network.SecurityGroup) error {
	return wait.Poll(operationPollInterval, operationPollTimeoutDuration, func() (bool, error) {
How is this exponential backoff? Poll always just waits for Interval
- corrected Copyright copy/paste
- now actually implementing exponential backoff instead of regular interval retries
- using more general HTTP response code success/failure determination (e.g., 5xx for retry)
- net/http constants ftw
@brendandburns thanks for keeping me honest, review notes incorporated
	return true
}
// HTTP 5xx suggests we should retry
r, err := regexp.Compile(`^5\d\d$`)
Use numeric comparisons, e.g. code >= http.StatusInternalServerError. Do this for the 2xx below too.
// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
func isSuccessHTTPResponse(resp autorest.Response) bool {
	// HTTP 2xx suggests a successful response
	r, err := regexp.Compile(`^2\d\d$`)
As above
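Compiling a regexp per call to classify a status code is both slower and less readable than a range check. A sketch of the numeric-comparison version the reviewer asks for (function name illustrative; the real code works on an `autorest.Response`):

```go
package main

import (
	"fmt"
	"net/http"
)

// isSuccessStatusCode replaces the regexp `^2\d\d$` with a numeric range
// check on the 2xx family: 200 <= code < 300.
func isSuccessStatusCode(code int) bool {
	return code >= http.StatusOK && code < http.StatusMultipleChoices
}

func main() {
	fmt.Println(isSuccessStatusCode(200), isSuccessStatusCode(202), isSuccessStatusCode(404))
}
```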
if shouldRetryAPIRequest(resp, err) {
	retryErr := az.CreateOrUpdateSGWithRetry(sg)
	if retryErr != nil {
		return nil, retryErr
Do we want to return here? Or just set err = retryErr and let the code below handle both err cases? I think I prefer that approach.
Yeah, your approach is better. The purpose of the separate retryErr reference (as opposed to reusing the existing err reference prior to checking for retry-ability) was to facilitate logging/debugging in the retry execution branch. We can still do that by reusing the err reference and eliminating an unnecessary return.
resp, err := az.SecurityGroupsClient.CreateOrUpdate(az.ResourceGroup, *reconciledSg.Name, reconciledSg, nil)
if shouldRetryAPIRequest(resp, err) {
	retryErr := az.CreateOrUpdateSGWithRetry(reconciledSg)
	if retryErr != nil {
As above
_, err = az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
resp, err := az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
if shouldRetryAPIRequest(resp, err) {
	retryErr := az.CreateOrUpdateLBWithRetry(lb)
As above
resp, err := az.LoadBalancerClient.Delete(az.ResourceGroup, lbName, nil)
if shouldRetryAPIRequest(resp, err) {
	retryErr := az.DeleteLBWithRetry(lbName)
	if retryErr != nil {
Here too
resp, err := az.PublicIPAddressesClient.CreateOrUpdate(az.ResourceGroup, *pip.Name, pip, nil)
if shouldRetryAPIRequest(resp, err) {
	retryErr := az.CreateOrUpdatePIPWithRetry(pip)
	if retryErr != nil {
And here
resp, err := az.InterfacesClient.CreateOrUpdate(az.ResourceGroup, *nic.Name, nic, nil)
if shouldRetryAPIRequest(resp, err) {
	retryErr := az.CreateOrUpdateInterfaceWithRetry(nic)
	if retryErr != nil {
You get the point, I'm going to stop commenting on all of them
I was enjoying how many variations you could come up with. :)
- removed unnecessary return statements
- optimized HTTP response code evaluations as numeric comparisons
@k8s-bot pull-kubernetes-e2e-kops-aws test this
Looks like you need to run /lgtm Thanks!
we don’t need to rate limit the calls _to_ it
@jackfrancis this looks like it has some You can run
not waiting to rate limit until we get an error response from the API, doing so on initial request for all API requests
@brendandburns thx, addressed
/retest
/sig azure
@grodrigues3 @wojtek-t this should have the 1.7 milestone attached
@brendandburns LGTM needed
/lgtm
/approve #47048
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: brendandburns, jackfrancis
Associated issue: 47048
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Looks good, just a few comments.
	return true
}
// HTTP 4xx or 5xx suggests we should retry
if 399 < resp.StatusCode && resp.StatusCode < 600 {
400 is not a retry-able error in Azure; I think we should only retry status code == 429 or status code >= 500.
@brendandburns I'm happy to incorporate this feedback if you're willing to go through the lgtm approval obstacle course all over again. @yangl900 the other two remarks I believe we can justify tackling later on as part of a more holistic effort to pair k8s cloudprovider code with specific API idioms.
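yangl900's narrowing of the blanket 4xx/5xx check can be sketched as a one-line predicate (name illustrative). Reading the suggestion as the usual "429 or any server error":

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryAzureStatusCode retries only on throttling (429) or server
// errors (>= 500); other 4xx responses are treated as permanent failures,
// since a bad request will not succeed on retry.
func shouldRetryAzureStatusCode(code int) bool {
	return code == http.StatusTooManyRequests || code >= http.StatusInternalServerError
}

func main() {
	fmt.Println(shouldRetryAzureStatusCode(400), shouldRetryAzureStatusCode(429), shouldRetryAzureStatusCode(503))
}
```

Compared with the merged `399 < resp.StatusCode && resp.StatusCode < 600` check, this stops the provider from uselessly retrying, e.g., a 404 for a resource that was deleted.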
const (
	// CloudProviderName is the value used for the --cloud-provider flag
	CloudProviderName = "azure"
	rateLimitQPSDefault = 1.0
Read quota is much higher than write quota; I don't know if you want to differentiate in this change. That would potentially give you a higher quota. We can improve this later, too.
var retryErr error
machine, exists, retryErr = az.getVirtualMachine(name)
if retryErr != nil {
	glog.Errorf("backoff: failure, will retry, err=%v", retryErr)
there is a Retry-After header returned from 429 requests, probably helpful if we trace that.
@jackfrancis: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Automatic merge from submit-queue (batch tested with PRs 43005, 46660, 46385, 46991, 47103)
@jackfrancis can you address @yangl900's comments in a follow-on PR? Thanks!
An initial attempt at engaging exponential backoff for API error responses.

Addresses #47048

Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE cloudprovider backoff.

What this PR does / why we need it:
The existing azure cloudprovider implementation has no guard rails in place to adapt to unexpected underlying operational conditions (i.e., clogs in resource plumbing between the k8s runtime and the cloud API). The purpose of these changes is to support exponential backoff wrapping around API calls, and to support targeted rate limiting. Both of these options are configurable via `--cloud-config`.

Inspired by GCE's use of `k8s.io/client-go/util/flowcontrol` and `k8s.io/apimachinery/pkg/util/wait`, this PR likewise uses `flowcontrol` for rate limiting, and `wait` to thinly wrap backoff retry attempts to the API.

Special notes for your reviewer:
Pay special note to the declaration of retry-able conditions from an unsuccessful HTTP request (`4xx` and `5xx` HTTP responses), and to the declaration of retry success conditions (`2xx` HTTP responses). Those may be incomplete, or in other ways non-representative.

Tests updated to include additions to `Config`.

Release note: