Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race condition issue when detaching azure disk #60183

Merged

Conversation

andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Feb 22, 2018

What this PR does / why we need it:
add lock before detaching azure disk, without this PR, there would be lots of Multi-Attach error when scheduling one pod from one node to another.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60101

Special notes for your reviewer:
@feiskyer @djsly @khenidak
Since we are using getLunMutex.LockKey(instanceid) for both AttachDisk and DetachDisk, there would be only one VM.update operation at a time for both AttachDisk and DetachDisk.

Release note:

fix race condition issue when detaching azure disk

/assign @feiskyer
Could you also mark as v1.10 milestone @feiskyer thanks.
/sig azure

@k8s-ci-robot k8s-ci-robot added sig/azure size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 22, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 22, 2018
@feiskyer feiskyer added this to the v1.10 milestone Feb 22, 2018
@feiskyer feiskyer added status/approved-for-milestone kind/bug Categorizes issue or PR as related to a bug. labels Feb 22, 2018
@andyzhangx
Copy link
Member Author

@khenidak I think lots of Multi-Attach error of azure disk was due to this bug, it's really bad we do not lock before detaching azure disk...

@andyzhangx andyzhangx force-pushed the addlock-detach-azuredisk branch from 630269a to f3324a6 Compare February 22, 2018 03:55
@djsly
Copy link
Contributor

djsly commented Feb 22, 2018

Looking good on my side.
I will apply the patch on my local 1.7 source and test it out tomorrow. We have another issue with PV and I want to confirm it is the same source.

@andyzhangx
Copy link
Member Author

@djsly just replace the controller manager, thanks.

Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the changes.

Let's wait a while for @djsly's validation.

@@ -268,7 +268,7 @@ func (d *azureDiskDetacher) Detach(diskURI string, nodeName types.NodeName) erro
return fmt.Errorf("invalid disk to detach: %q", diskURI)
}

_, err := d.cloud.InstanceID(context.TODO(), nodeName)
instanceid, err := d.cloud.InstanceID(context.TODO(), nodeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context.TODO() seems like a mistake here? is there a context we can attach to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brendanburns it's related to #59287, there is some future work item for context.TODO()

@djsly
Copy link
Contributor

djsly commented Feb 22, 2018

@andyzhangx here are the logs for my latest test after I applied the patch on the 1.7 branch.

https://gist.github.com/djsly/d04edbd3239373f345ca27616ddd5132

Things are looking good!

Except the Multi-Attach error. This seems to be caused by the reconciler. It last ~2m30sec until the disks are unmounted and the first detach kicks in.

Could you confirm that those warnings do not involve the ARM api ?

Multi-Attach error for volume "pvc-91d3b0a6-11b9-11e8-82b7-000d3a018ac3" (UniqueName: "kubernetes.io/azure-disk//subscriptions/<sub_id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<rg>-dynamic-pvc-91d3b0a6-11b9-11e8-82b7-000d3a018ac3") from node "kn-edge-2" Volume is already exclusively attached to one node and can't be attached to another

@djsly
Copy link
Contributor

djsly commented Feb 22, 2018

Also, could we have this PR cherry picked all the way to 1.7 ? thanks!

@brendandburns
Copy link
Contributor

/lgtm
/approve

@jdumars can you shepard this into 1.7.x and 1.8.x?

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, brendandburns

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@andyzhangx
Copy link
Member Author

@djsly I checked the code again, current Multi-Attach error from reconciler is expected, it does not come from ARM api, it's a error message generated by reconciler.
When one pod is rescheduled from node#1 to node#2, DisksAreAttached func in azure_controllerCommon.go will still return that the original disk is still in node#1, after around one minute, disk was removed from node#1, and then the Multi-Attach error from reconciler disappears.

I will cherry pick this PR to v1.7-1.9.
@djsly Thanks again for your reporting and verification, finally we resolved this Multi-Attach error issue.

@djsly
Copy link
Contributor

djsly commented Feb 23, 2018

My pleasure and thanks for the quick PR.

@andyzhangx
Copy link
Member Author

@karataliu @rootfs FYI

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 60208, 60084, 60183, 59713, 60096). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8f9e8c0 into kubernetes:master Feb 23, 2018
@jdumars
Copy link
Member

jdumars commented Feb 23, 2018

@mehdy @jpbetz @wojtek-t this is an important cherry pick

@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@jdumars jdumars modified the milestones: v1.10, v1.9 Feb 23, 2018
k8s-github-robot pushed a commit that referenced this pull request Feb 28, 2018
…0183-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #60183

Cherry pick of #60183 on release-1.8.

#60183: add lock before detaching azure disk
k8s-github-robot pushed a commit that referenced this pull request Mar 1, 2018
…0183-upstream-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #60183

Cherry pick of #60183 on release-1.7.

#60183: add lock before detaching azure disk
k8s-github-robot pushed a commit that referenced this pull request Mar 16, 2018
…0183-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60183: add lock before detaching azure disk

Cherry pick of #60183 on release-1.9.

#60183: add lock before detaching azure disk
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Azure Disk Detach are not working with multiple disk detach on the same Node.
8 participants