Skip to content

Conversation

@andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Apr 15, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
In some error condition when detach azure disk failed, azure cloud provider will retry 6 times at most with exponential backoff, it will hold the data disk list for about 3 minutes with a node level lock, and in that time period, if customer update data disk list manually (e.g. need manual operationto attach/detach another disk since there is attach/detach error, ) , the data disk list will be obselete(dirty data), then weird VM status happens, e.g. attach a non-existing disk, we should split those retry operations, every retry should get a fresh data disk list in the beginning.

if as.CloudProviderBackoff && shouldRetryHTTPRequest(resp, err) {
klog.V(2).Infof("azureDisk - update(%s) backing off: vm(%s) detach disk(%s, %s), err: %v", nodeResourceGroup, vmName, diskName, diskURI, err)
retryErr := as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM)
if retryErr != nil {

This PR has two commits:

  1. rename function name from DetachDiskByName to DetachDisk
  2. refine detach azure disk retry operation, make every detach azure disk operation in a standalone function, originally it's by as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM) which may lead to obsolete data disk list

This PR don't change the logic of attach disk since there is no retry in azure cloud provide for attach disk, k8s attach-detach controller will do the attach volume retry.

BTW, I have tested this PR on both vmss and vmas k8s cluster by a stress disk attach/detach test, it works well for a bunch of times.

Which issue(s) this PR fixes:

Fixes #76502

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix detach azure disk back off issue which has too big lock in failure retry condition

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

cc @khenidak @brendandburns

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/azure labels Apr 15, 2019
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 15, 2019
@andyzhangx andyzhangx force-pushed the disk-backoff-refactor branch from 299ef42 to 3772cd8 Compare April 15, 2019 05:19
@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2019
fix comments

fix import keymux check error

add unit test for attach/detach disk funcs
@andyzhangx andyzhangx force-pushed the disk-backoff-refactor branch from 3772cd8 to 6c70ca6 Compare April 16, 2019 05:31
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-integration

Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 16, 2019
@feiskyer
Copy link
Member

@andrewsykim Could you help to approve the cloud-provider changes?

/assign @andrewsykim

@andrewsykim
Copy link
Member

Can we add a bit more details on the release note please?

@andyzhangx andyzhangx changed the title fix detach azure disk back off issue fix detach azure disk back off issue which has too big lock in failure retry condition Apr 17, 2019
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-e2e-gce-csi-serial

@andyzhangx
Copy link
Member Author

@andrewsykim thanks. I have changed the release note to:

fix detach azure disk back off issue which has too big lock in failure retry condition

Let me know if you have any question.

@andyzhangx
Copy link
Member Author

@andrewsykim PTAL, thanks.

@andrewsykim
Copy link
Member

/approve
/lgtm

for pkg/cloudprovider/providers/.import-restrictions changes

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2019
@k8s-ci-robot k8s-ci-robot merged commit 64a0441 into kubernetes:master Apr 19, 2019
k8s-ci-robot added a commit that referenced this pull request Apr 28, 2019
…6573-upstream-release-1.12

Automated cherry pick of #76573: refactor detach azure disk retry operation
k8s-ci-robot added a commit that referenced this pull request Apr 30, 2019
…6573-upstream-release-1.13

Automated cherry pick of #76573: refactor detach azure disk retry operation
k8s-ci-robot added a commit that referenced this pull request May 1, 2019
…6573-upstream-release-1.14

Automated cherry pick of #76573: refactor detach azure disk retry operation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

detach azure disk should not hold data disk list for a long time

4 participants