Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure cloudprovider retry using flowcontrol #46660

Merged

Conversation

jackfrancis
Copy link
Contributor

@jackfrancis jackfrancis commented May 31, 2017

An initial attempt at engaging exponential backoff for API error responses.

Addresses #47048

Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE
cloudprovider backoff.

What this PR does / why we need it:

The existing azure cloudprovider implementation has no guard rails in place to adapt to unexpected underlying operational conditions (i.e., clogs in resource plumbing between k8s runtime and the cloud API). The purpose of these changes is to support exponential backoff wrapping around API calls; and to support targeted rate limiting. Both of these options are configurable via --cloud-config.

Implementation inspired by the GCE's use of k8s.io/client-go/util/flowcontrol and k8s.io/apimachinery/pkg/util/wait, this PR likewise uses flowcontrol for rate limiting; and wait to thinly wrap backoff retry attempts to the API.

Special notes for your reviewer:

Pay especial note to the declaration of retry-able conditions from an unsuccessful HTTP request:

  • all 4xx and 5xx HTTP responses
  • non-nil error responses

And the declaration of retry success conditions:

  • 2xx HTTP responses

Tests updated to include additions to Config.

Those may be incomplete, or in other ways non-representative.

Release note:

Added exponential backoff to Azure cloudprovider

An initial attempt at engaging exponential backoff for API error responses.

Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE
cloudprovider backoff.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 31, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @jackfrancis. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels May 31, 2017
@jsafrane
Copy link
Member

I know very little about azure API behavior.

/unassign
/assign @colemickens

@k8s-ci-robot k8s-ci-robot assigned colemickens and unassigned jsafrane May 31, 2017
@jsafrane
Copy link
Member

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 31, 2017
@k8s-github-robot k8s-github-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels May 31, 2017
@jdumars
Copy link
Member

jdumars commented May 31, 2017

/cc @brendandburns

@@ -0,0 +1,149 @@
/*
Copyright 2016 The Kubernetes Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 2017

if err != nil {
return true
}
if resp.StatusCode == 429 || resp.StatusCode == 500 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 5xx instead of just 500?

Also, please use http.StatusServerError etc. rather than hard constants.

// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
func isSuccessHTTPResponse(resp autorest.Response) bool {
// TODO determine the complete set of success conditions
if resp.StatusCode == 200 || resp.StatusCode == 201 || resp.StatusCode == 202 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use http.StatusOK etc instead of constants.

@@ -177,6 +179,9 @@ func NewCloud(configReader io.Reader) (cloudprovider.Interface, error) {
az.StorageAccountClient = storage.NewAccountsClientWithBaseURI(az.Environment.ResourceManagerEndpoint, az.SubscriptionID)
az.StorageAccountClient.Authorizer = servicePrincipalToken

// 1 qps, up to 5 burst when in flowcontrol; i.e., aggressive backoff enforcement
az.operationPollRateLimiter = flowcontrol.NewTokenBucketRateLimiter(1, 5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't ever use this as far as I can tell.

I think we really want this to be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this configurable (and disable-able) via config file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brendandburns @colemickens would the --cloud-config --> Config struct be the best configuration vector for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.


// CreateOrUpdateSGWithRetry invokes az.SecurityGroupsClient.CreateOrUpdate with exponential backoff retry
func (az *Cloud) CreateOrUpdateSGWithRetry(sg network.SecurityGroup) error {
return wait.Poll(operationPollInterval, operationPollTimeoutDuration, func() (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this exponential backoff? Poll always just waits for Interval

- corrected Copyright copy/paste
- now actually implementing exponential backoff instead of regular interval retries
- using more general HTTP response code success/failure determination (e.g., 5xx for retry)
- net/http constants ftw
@jackfrancis
Copy link
Contributor Author

@brendandburns thanks for keeping me honest, review notes incorporated

arg cruft in CreateOrUpdateSGWithRetry function declaration
return true
}
// HTTP 5xx suggests we should retry
r, err := regexp.Compile(`^5\d\d$`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use numeric comparisons

eg

code >= http.StatusInternalError

Do this for the 2xx below too.

// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
func isSuccessHTTPResponse(resp autorest.Response) bool {
// HTTP 2xx suggests a successful response
r, err := regexp.Compile(`^2\d\d$`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateSGWithRetry(sg)
if retryErr != nil {
return nil, retryErr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to return here? Or just set err = retryErr and let the code below handle both err cases?

I think I prefer that approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, your approach is better. The purpose of the novel retryErr reference (as opposed to reusing the existing err reference prior to checking for retry-ability is to be able to facilitate logging/debug in the retry execution branch. We can still do that by reusing the err reference and eliminate an unnecessary return.

resp, err := az.SecurityGroupsClient.CreateOrUpdate(az.ResourceGroup, *reconciledSg.Name, reconciledSg, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateSGWithRetry(reconciledSg)
if retryErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

_, err = az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
resp, err := az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateLBWithRetry(lb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

resp, err := az.LoadBalancerClient.Delete(az.ResourceGroup, lbName, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.DeleteLBWithRetry(lbName)
if retryErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too

resp, err := az.PublicIPAddressesClient.CreateOrUpdate(az.ResourceGroup, *pip.Name, pip, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdatePIPWithRetry(pip)
if retryErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here

resp, err := az.InterfacesClient.CreateOrUpdate(az.ResourceGroup, *nic.Name, nic, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateInterfaceWithRetry(nic)
if retryErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You get the point, I'm going to stop commenting on all of them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was enjoying how many variations you could come up with. :)

- removed unnecessary return statements
- optimized HTTP response code evaluations as numeric comparisons
@fejta
Copy link
Contributor

fejta commented Jun 1, 2017

@k8s-bot pull-kubernetes-e2e-kops-aws test this
ref: kubernetes/test-infra#2932

@brendandburns
Copy link
Contributor

brendandburns commented Jun 2, 2017

Looks like you need to run ./hack/update-bazel.sh otherwise, LGTM.

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2017
@calebamiles calebamiles modified the milestone: v1.7 Jun 2, 2017
@k8s-github-robot k8s-github-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 6, 2017
@brendandburns
Copy link
Contributor

@jackfrancis this looks like it has some go vet errors.

You can run go vet k8s.io/kubernetes/pkg/cloudprovider/providers/azure to generate the errors.

not waiting to rate limit until we get an error response from the API, doing so on initial request for all API requests
@jackfrancis
Copy link
Contributor Author

@brendandburns thx, addressed

@jdumars
Copy link
Member

jdumars commented Jun 7, 2017

/retest

@jdumars
Copy link
Member

jdumars commented Jun 7, 2017

/sig azure

@jdumars
Copy link
Member

jdumars commented Jun 7, 2017

@grodrigues3 @wojtek-t this should have the 1.7 milestone attached

@jdumars
Copy link
Member

jdumars commented Jun 7, 2017

@brendandburns LGTM needed

@brendandburns
Copy link
Contributor

brendandburns commented Jun 7, 2017 via email

@brendandburns
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2017
@brendandburns
Copy link
Contributor

/approve #47048

@brendandburns
Copy link
Contributor

/approve

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, jackfrancis

Associated issue: 47048

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2017
@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

Copy link
Contributor

@yangl900 yangl900 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few comments.

return true
}
// HTTP 4xx or 5xx suggests we should retry
if 399 < resp.StatusCode && resp.StatusCode < 600 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

400 is not a retry-able error in Azure, I think should only retry status code == 429 or status code > 500

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brendandburns I'm happy to incorporate this feedback if you're willing to go through the lgtm approval obstacle course all over agin. @yangl900 the other two remarks I believe we can justify tackling later on as part of a more holistic effort to pair k8s cloudprovider code with specific API idioms

const (
// CloudProviderName is the value used for the --cloud-provider flag
CloudProviderName = "azure"
rateLimitQPSDefault = 1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read quota is much higher than write, I don't know if you want to differentiate in this change. that potentially give you higher quota. we can improve this later too.

var retryErr error
machine, exists, retryErr = az.getVirtualMachine(name)
if retryErr != nil {
glog.Errorf("backoff: failure, will retry,err=%v", retryErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a Retry-After header returned from 429 requests, probably helpful if we trace that.

@k8s-ci-robot
Copy link
Contributor

@jackfrancis: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-etcd3 acb6517 link @k8s-bot pull-kubernetes-e2e-gce-etcd3 test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 43005, 46660, 46385, 46991, 47103)

@k8s-github-robot k8s-github-robot merged commit 3adb9b4 into kubernetes:master Jun 7, 2017
@brendandburns
Copy link
Contributor

@jackfrancis can you address @yangl900's comments in a follow on PR?

Thanks!

k8s-github-robot pushed a commit that referenced this pull request Jun 12, 2017
…60-upstream-release-1.6

Automatic merge from submit-queue

Automated cherry pick of #46660

Cherry pick of #46660 on release-1.6.

#46660: Azure cloudprovider retry using flowcontrol
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.