fix azure retry issue when return 2XX with error #78298

Merged
1 commit merged on May 28, 2019

Conversation

andyzhangx
Member

What type of PR is this?
/kind bug

What this PR does / why we need it:
With this PR, when an Azure API call returns a 2XX status code together with an error (for example, <200, error>), the result is treated as an error and the retry continues.
As we found in an error case, and confirmed with the VMSS team, a 2XX response does not mean the operation succeeded (doc: https://github.com/Azure/azure-resource-manager-rpc/blob/master/v1.0/Addendum.md#creating-or-updating-resources); it can still carry an error condition, e.g.

httpStatusCode 200
resultCode NetworkingInternalOperationError
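
The decision described above can be sketched in Go as follows. This is a minimal, simplified illustration rather than the provider's actual code: `retryDone`, `isSuccess`, `respWithStatus`, and `errNetworking` are hypothetical names, and every failure is treated as retriable here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errNetworking is a made-up error standing in for the resultCode above.
var errNetworking = errors.New("NetworkingInternalOperationError")

// respWithStatus builds a bare response carrying only a status code.
func respWithStatus(code int) *http.Response {
	return &http.Response{StatusCode: code}
}

// isSuccess reports whether the response has a 2XX status code.
func isSuccess(resp *http.Response) bool {
	return resp != nil && resp.StatusCode >= 200 && resp.StatusCode < 300
}

// retryDone returns true to stop retrying and false to continue.
// A 2XX status counts as success only when no error accompanies it.
func retryDone(resp *http.Response, err error) bool {
	return err == nil && isSuccess(resp)
}

func main() {
	fmt.Println(retryDone(respWithStatus(200), nil))           // true: stop retrying
	fmt.Println(retryDone(respWithStatus(200), errNetworking)) // false: keep retrying
	fmt.Println(retryDone(respWithStatus(500), nil))           // false: keep retrying
}
```

The second case is the one this PR is about: a check that looks at the status code alone would stop retrying on it.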

Which issue(s) this PR fixes:

Fixes #78172

Special notes for your reviewer:
Note: this code change affects all error retry logic, including the following call sites:

./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(service, "CreateOrUpdateSecurityGroup", resp, err)
./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(service, "CreateOrUpdateLoadBalancer", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(service, "CreateOrUpdatePublicIPAddress", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(service, "CreateOrUpdateInterface", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(service, "DeletePublicIPAddress", resp, err)
./azure_backoff.go:             done, err := az.processHTTPRetryResponse(service, "DeleteLoadBalancer", resp, err)
./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:// processHTTPRetryResponse : return true means stop retry, false means continue retry
./azure_backoff.go:func (az *Cloud) processHTTPRetryResponse(service *v1.Service, reason string, resp *http.Response, err error) (bool, error) {
./azure_backoff.go:                     klog.Errorf("processHTTPRetryResponse: backoff failure, will retry, err=%v", err)
./azure_backoff.go:                     klog.Errorf("processHTTPRetryResponse: backoff failure, will retry, HTTP response=%d", resp.StatusCode)
./azure_backoff.go:             klog.Errorf("processHTTPRetryResponse failure with err: %v", err)
./azure_backoff.go:             klog.Errorf("processHTTPRetryResponse failure with HTTP response %q", resp.Status)
./azure_backoff_test.go:                res, err := az.processHTTPRetryResponse(nil, "", resp, test.err)
./azure_controller_common.go:                   return c.cloud.processHTTPRetryResponse(nil, "", resp, err)
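
For context, the call sites above presumably hand the (done, error) pair to a backoff loop. The sketch below uses only the standard library and is an assumption about the general shape, not the provider's code: `retryWithBackoff`, `fakeCall`, `runDemo`, and `errExhausted` are hypothetical names, and the simulated call returns 200 with an error a fixed number of times before succeeding.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errExhausted is returned when every attempt asked for another retry.
var errExhausted = errors.New("retries exhausted")

// retryWithBackoff calls attempt up to steps times and multiplies the delay
// by factor between attempts. attempt follows the (done, error) contract:
// done=true stops the loop, a non-nil error aborts it.
func retryWithBackoff(initial time.Duration, factor float64, steps int, attempt func() (bool, error)) error {
	delay := initial
	for i := 0; i < steps; i++ {
		done, err := attempt()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if i < steps-1 {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * factor)
		}
	}
	return errExhausted
}

// fakeCall simulates an API that answers 200 with an error for the first
// `failures` calls and 200 without an error afterwards.
func fakeCall(calls *int, failures int) (status int, err error) {
	*calls++
	if *calls <= failures {
		return 200, errors.New("NetworkingInternalOperationError")
	}
	return 200, nil
}

// runDemo retries fakeCall and reports how many calls were made.
func runDemo(failures, steps int) (int, error) {
	calls := 0
	err := retryWithBackoff(time.Millisecond, 2.0, steps, func() (bool, error) {
		status, callErr := fakeCall(&calls, failures)
		// Done only when the status is 2XX and no error came with it;
		// the error is suppressed so the backoff loop keeps going.
		done := callErr == nil && status >= 200 && status < 300
		return done, nil
	})
	return calls, err
}

func main() {
	calls, err := runDemo(2, 5)
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

With a status-only check, the loop would have stopped after the first call even though the operation had failed.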

Does this PR introduce a user-facing change?:

fix azure retry issue when return 2XX with error

/hold
Let's hold until the VMSS team gives final confirmation and the code review is done.

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

@k8s-ci-robot added the do-not-merge/hold, release-note, kind/bug, priority/important-soon, cncf-cla: yes, size/XS, sig/azure, area/cloudprovider, and sig/cloud-provider labels on May 24, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on May 24, 2019
@andyzhangx force-pushed the azure-retry-issue branch from 95497e0 to 8a45ba1 on May 24, 2019 13:12
@k8s-ci-robot added the size/S label and removed the size/XS label on May 24, 2019
@andyzhangx
Member Author

BTW, this is a fundamental fix for error retry on all ARM calls made by the Azure cloud provider, even though I found the issue on VMSS. We should not treat a 200 return code as a successful operation, since the call may still return an error code, per https://github.com/Azure/azure-resource-manager-rpc/blob/master/v1.0/Addendum.md#creating-or-updating-resources and as the ARM team also told us.

@khenidak @brendandburns

@andyzhangx
Member Author

/test pull-kubernetes-e2e-gce

@andyzhangx
Member Author

/hold cancel
Got confirmation from the Azure ARM team.

@k8s-ci-robot removed the do-not-merge/hold label on May 25, 2019
@andyzhangx
Member Author

/test pull-kubernetes-e2e-aks-engine-azure

@feiskyer
Member

pull-kubernetes-e2e-aks-engine-azure is failing because of Azure/aks-engine#1368.

/lgtm

@k8s-ci-robot added the lgtm label on May 27, 2019
@justaugustus
Member

/skip pull-kubernetes-e2e-aks-engine-azure

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot
Contributor

@andyzhangx: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-aks-engine-azure | 8a45ba1 | link | /test pull-kubernetes-e2e-aks-engine-azure |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot merged commit fcc9f16 into kubernetes:master on May 28, 2019
k8s-ci-robot added a commit that referenced this pull request May 31, 2019
…8298-upstream-release-1.13

Automated cherry pick of #78298: fix azure retry issue when return 2XX with error
k8s-ci-robot added a commit that referenced this pull request May 31, 2019
…8298-upstream-release-1.14

Automated cherry pick of #78298: fix azure retry issue when return 2XX with error
k8s-ci-robot added a commit that referenced this pull request May 31, 2019
…8298-upstream-release-1.12

Automated cherry pick of #78298: fix azure retry issue when return 2XX with error
Labels
approved, area/cloudprovider, cncf-cla: yes, kind/bug, lgtm, priority/important-soon, release-note, sig/cloud-provider, size/S
Successfully merging this pull request may close these issues.

Detached Azure Disks never re-attach
5 participants