fix detach azure disk back off issue which has too big lock in failure retry condition #76573
Conversation
Force-pushed 299ef42 to 3772cd8
Commit: fix comments, fix import keymux check error, add unit test for attach/detach disk funcs
Force-pushed 3772cd8 to 6c70ca6
/test pull-kubernetes-integration
feiskyer
left a comment
/lgtm
/approve
@andrewsykim Could you help approve the cloud-provider changes? /assign @andrewsykim
Can we add a bit more detail to the release note, please?
/test pull-kubernetes-e2e-gce-csi-serial
@andrewsykim thanks. I have changed the release note to: Let me know if you have any questions.
@andrewsykim PTAL, thanks.
/approve for
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andrewsykim, andyzhangx, feiskyer
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
…6573-upstream-release-1.12 Automated cherry pick of #76573: refactor detach azure disk retry operation
…6573-upstream-release-1.13 Automated cherry pick of #76573: refactor detach azure disk retry operation
…6573-upstream-release-1.14 Automated cherry pick of #76573: refactor detach azure disk retry operation
What type of PR is this?
/kind bug
What this PR does / why we need it:
In some error conditions, when detaching an Azure disk fails, the Azure cloud provider retries up to 6 times with exponential backoff. During those retries it holds the data disk list under a node-level lock for about 3 minutes. If a customer updates the data disk list manually in that window (e.g. a manual operation to attach/detach another disk because of an attach/detach error), the data disk list becomes obsolete (dirty data), and weird VM states follow, e.g. attaching a non-existent disk. We should split those retry operations so that every retry fetches a fresh data disk list at the beginning.
kubernetes/pkg/cloudprovider/providers/azure/azure_controller_standard.go
Lines 150 to 153 in cbaaa67
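To illustrate the fix described above, here is a minimal, self-contained Go sketch (hypothetical names, not the actual cloud-provider code): the node-level lock is taken per attempt, each attempt re-reads the current disk list, and the lock is released before any backoff sleep, so a manual update between retries is never clobbered by a stale list.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// vm is a toy stand-in for an Azure VM's data-disk state.
// The mutex plays the role of the per-node lock from the PR.
type vm struct {
	mu    sync.Mutex
	disks []string
}

// detachOnce holds the node lock for a single attempt only: it reads the
// fresh disk list, removes the target disk, then releases the lock.
func (v *vm) detachOnce(disk string) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	for i, d := range v.disks {
		if d == disk {
			v.disks = append(v.disks[:i], v.disks[i+1:]...)
			return nil
		}
	}
	return errors.New("disk not found: " + disk)
}

// detachWithBackoff retries detachOnce with exponential backoff.
// Crucially, the lock is NOT held across the whole retry window.
func (v *vm) detachWithBackoff(disk string, maxRetries int) error {
	delay := 1 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = v.detachOnce(disk); err == nil {
			return nil
		}
		time.Sleep(delay) // sleeping without the lock; others may update disks here
		delay *= 2
	}
	return err
}

func main() {
	v := &vm{disks: []string{"disk-a", "disk-b"}}
	fmt.Println(v.detachWithBackoff("disk-a", 6)) // <nil>
	fmt.Println(v.disks)                          // [disk-b]
}
```

This mirrors the shape of the change, not its exact code: the real fix moves the locking so each `CreateOrUpdateVMWithRetry` attempt operates on a freshly fetched disk list.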
This PR has two commits:
- rename `DetachDiskByName` to `DetachDisk`
- retry `CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM)` per attempt, which otherwise may lead to an obsolete data disk list

Thus there is a separate lock for every Azure disk detach operation; we don't need to lock the whole disk detach backoff (6 retries) process.
This PR doesn't change the attach disk logic, since there is no retry in the Azure cloud provider for attach disk; the k8s attach-detach controller does the attach volume retry.
BTW, I have tested this PR on both VMSS and VMAS k8s clusters with a stress disk attach/detach test; it works well across many runs.
Which issue(s) this PR fixes:
Fixes #76502
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/kind bug
/assign @feiskyer
/priority important-soon
/sig azure
cc @khenidak @brendandburns