fix detach azure disk back off issue which has too big lock in failure retry condition #76573

andyzhangx · 2019-04-15T01:55:59Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
In some error condition when detach azure disk failed, azure cloud provider will retry 6 times at most with exponential backoff, it will hold the data disk list for about 3 minutes with a node level lock, and in that time period, if customer update data disk list manually (e.g. need manual operationto attach/detach another disk since there is attach/detach error, ) , the data disk list will be obselete(dirty data), then weird VM status happens, e.g. attach a non-existing disk, we should split those retry operations, every retry should get a fresh data disk list in the beginning.

kubernetes/pkg/cloudprovider/providers/azure/azure_controller_standard.go

Lines 150 to 153 in cbaaa67

    
           if as.CloudProviderBackoff && shouldRetryHTTPRequest(resp, err) { 
        
           	klog.V(2).Infof("azureDisk - update(%s) backing off: vm(%s) detach disk(%s, %s), err: %v", nodeResourceGroup, vmName, diskName, diskURI, err) 
        
           	retryErr := as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM) 
        
           	if retryErr != nil {

This PR has two commits:

refactor detach azure disk retry operation:

rename function name from DetachDiskByName to DetachDisk
refine detach azure disk retry operation, make every detach azure disk operation in a standalone function, originally it's by as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM) which may lead to obsolete data disk list

move disk lock process to azure cloud provider:
Thus there would be separate lock for every detach azure disk operation, we don't need to lock whole disk detach backoff(6 retries) process.

This PR don't change the logic of attach disk since there is no retry in azure cloud provide for attach disk, k8s attach-detach controller will do the attach volume retry.

BTW, I have tested this PR on both vmss and vmas k8s cluster by a stress disk attach/detach test, it works well for a bunch of times.

Which issue(s) this PR fixes:

Fixes #76502

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix detach azure disk back off issue which has too big lock in failure retry condition

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

cc @khenidak @brendandburns

pkg/cloudprovider/providers/azure/azure_controller_common.go

fix comments fix import keymux check error add unit test for attach/detach disk funcs

andyzhangx · 2019-04-16T06:54:03Z

/test pull-kubernetes-integration

feiskyer

/lgtm
/approve

feiskyer · 2019-04-16T06:58:06Z

@andrewsykim Could you help to approve the cloud-provider changes?

/assign @andrewsykim

andrewsykim · 2019-04-16T17:06:38Z

Can we add a bit more details on the release note please?

andyzhangx · 2019-04-17T01:32:49Z

/test pull-kubernetes-e2e-gce-csi-serial

andyzhangx · 2019-04-17T01:33:52Z

@andrewsykim thanks. I have changed the release note to:

fix detach azure disk back off issue which has too big lock in failure retry condition

Let me know if you have any question.

andyzhangx · 2019-04-19T01:29:04Z

@andrewsykim PTAL, thanks.

andrewsykim · 2019-04-19T03:25:08Z

/approve
/lgtm

for pkg/cloudprovider/providers/.import-restrictions changes

k8s-ci-robot · 2019-04-19T03:25:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/cloudprovider/OWNERS~~ [andrewsykim]
~~pkg/volume/azure_dd/OWNERS~~ [andyzhangx,feiskyer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…6573-upstream-release-1.12 Automated cherry pick of #76573: refactor detach azure disk retry operation

…6573-upstream-release-1.13 Automated cherry pick of #76573: refactor detach azure disk retry operation

…6573-upstream-release-1.14 Automated cherry pick of #76573: refactor detach azure disk retry operation

refactor detach azure disk retry operation

39c239c

k8s-ci-robot assigned feiskyer Apr 15, 2019

k8s-ci-robot requested review from karataliu and rootfs April 15, 2019 01:56

k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 15, 2019

andyzhangx force-pushed the disk-backoff-refactor branch from 299ef42 to 3772cd8 Compare April 15, 2019 05:19

k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2019

feiskyer reviewed Apr 15, 2019

View reviewed changes

pkg/cloudprovider/providers/azure/azure_controller_common.go Outdated Show resolved Hide resolved

move disk lock process to azure cloud provider

6c70ca6

fix comments fix import keymux check error add unit test for attach/detach disk funcs

andyzhangx force-pushed the disk-backoff-refactor branch from 3772cd8 to 6c70ca6 Compare April 16, 2019 05:31

feiskyer reviewed Apr 16, 2019

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 16, 2019

k8s-ci-robot assigned andrewsykim Apr 16, 2019

andyzhangx changed the title ~~fix detach azure disk back off issue~~ fix detach azure disk back off issue which has too big lock in failure retry condition Apr 17, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2019

k8s-ci-robot merged commit 64a0441 into kubernetes:master Apr 19, 2019

k8s-ci-robot added a commit that referenced this pull request Apr 28, 2019

Merge pull request #76981 from andyzhangx/automated-cherry-pick-of-#7…

7e72732

…6573-upstream-release-1.12 Automated cherry pick of #76573: refactor detach azure disk retry operation

k8s-ci-robot added a commit that referenced this pull request Apr 30, 2019

Merge pull request #76887 from andyzhangx/automated-cherry-pick-of-#7…

aac3f46

…6573-upstream-release-1.13 Automated cherry pick of #76573: refactor detach azure disk retry operation

k8s-ci-robot added a commit that referenced this pull request May 1, 2019

Merge pull request #76886 from andyzhangx/automated-cherry-pick-of-#7…

ba43257

…6573-upstream-release-1.14 Automated cherry pick of #76573: refactor detach azure disk retry operation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix detach azure disk back off issue which has too big lock in failure retry condition #76573

fix detach azure disk back off issue which has too big lock in failure retry condition #76573

Uh oh!

andyzhangx commented Apr 15, 2019 •

edited

Loading

Uh oh!

Uh oh!

andyzhangx commented Apr 16, 2019

Uh oh!

feiskyer left a comment

Uh oh!

feiskyer commented Apr 16, 2019

Uh oh!

andrewsykim commented Apr 16, 2019

Uh oh!

andyzhangx commented Apr 17, 2019

Uh oh!

andyzhangx commented Apr 17, 2019

Uh oh!

andyzhangx commented Apr 19, 2019

Uh oh!

andrewsykim commented Apr 19, 2019

Uh oh!

k8s-ci-robot commented Apr 19, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	if as.CloudProviderBackoff && shouldRetryHTTPRequest(resp, err) {
	klog.V(2).Infof("azureDisk - update(%s) backing off: vm(%s) detach disk(%s, %s), err: %v", nodeResourceGroup, vmName, diskName, diskURI, err)
	retryErr := as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM)
	if retryErr != nil {

fix detach azure disk back off issue which has too big lock in failure retry condition #76573

fix detach azure disk back off issue which has too big lock in failure retry condition #76573

Uh oh!

Conversation

andyzhangx commented Apr 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

andyzhangx commented Apr 16, 2019

Uh oh!

feiskyer left a comment

Choose a reason for hiding this comment

Uh oh!

feiskyer commented Apr 16, 2019

Uh oh!

andrewsykim commented Apr 16, 2019

Uh oh!

andyzhangx commented Apr 17, 2019

Uh oh!

andyzhangx commented Apr 17, 2019

Uh oh!

andyzhangx commented Apr 19, 2019

Uh oh!

andrewsykim commented Apr 19, 2019

Uh oh!

k8s-ci-robot commented Apr 19, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

andyzhangx commented Apr 15, 2019 •

edited

Loading