Fixing VolumesAreAttached and DisksAreAttached functions in vSphere #45569
Conversation
Hi @divyenpatel. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@gnufied can you review this PR? As we discussed, I made the fix in the VolumesAreAttached function.
Thanks for finding this bug! I just have a question about the DiskIsAttached function, which may have a similar issue to the one in this DisksAreAttached function.
```go
// original:
result, _ := checkDiskAttached(volPath, vmDevices, dc, vs.client)
if result {
    attached[volPath] = true
// updated:
result, err := checkDiskAttached(volPath, vmDevices, dc, vs.client)
```
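For context, a minimal, self-contained sketch of what handling the error here achieves (checkDiskAttached and the volume paths below are stand-ins; the real function also takes VM devices, datacenter, and client parameters):

```go
package main

import (
	"errors"
	"fmt"
)

// checkDiskAttached is a stand-in for the real vSphere check; it returns an
// error when the node VM cannot be queried.
func checkDiskAttached(volPath string) (bool, error) {
	if volPath == "[datastore] kubevols/unreachable.vmdk" {
		return false, errors.New("VM is not connected")
	}
	return true, nil
}

// disksAreAttached propagates the error instead of discarding it, so a failed
// check returns a nil map rather than entries that read as "detached".
func disksAreAttached(volPaths []string) (map[string]bool, error) {
	attached := make(map[string]bool)
	for _, volPath := range volPaths {
		result, err := checkDiskAttached(volPath)
		if err != nil {
			return nil, err
		}
		if result {
			attached[volPath] = true
		}
	}
	return attached, nil
}

func main() {
	m, err := disksAreAttached([]string{
		"[datastore] kubevols/pv-a.vmdk",
		"[datastore] kubevols/unreachable.vmdk",
	})
	fmt.Println(m, err) // nil map plus the error when any check fails
}
```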
Do we have a similar concern when checkDiskAttached is called inside the DiskIsAttached function?
It seems that if err is not nil, it will always return false. And according to the detach function here, even when the error is not nil, it will continue and mark this volume as already detached, which is the same problem that is resolved by this fix.
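To illustrate the concern (a hypothetical caller, not the actual Kubernetes code): if the error from the attachment check is dropped, a failed check becomes indistinguishable from a disk that is genuinely detached.

```go
package main

import (
	"errors"
	"fmt"
)

// diskIsAttached stands in for the real check; here it always fails because
// the node VM is unreachable.
func diskIsAttached(volPath, node string) (bool, error) {
	return false, errors.New("node VM is not connected")
}

func main() {
	// Error-dropping caller: (false, err) is read as "not attached".
	attached, _ := diskIsAttached("[datastore] kubevols/pv-a.vmdk", "node3")
	if !attached {
		fmt.Println("volume would be treated as already detached, even though the check failed")
	}
}
```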
For DiskIsAttached, we are returning the error, so it should be ok.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/vsphere/vsphere.go#L1000
But in AttachDisk I missed handling the error. I have added an error check.
Also verified that attacher.go -> Detach() properly handles the returned err. When DiskIsAttached returns an error along with false, we log the error and retry the detach operation. If the retry fails, we return the error.
Kubernetes treats the volume as already detached only when the error is nil and DiskIsAttached returns false.
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/vsphere_volume/attacher.go#L229
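A simplified sketch of the Detach() pattern described above (the vsphereVolumes interface and names here are illustrative, not the exact code at the linked line):

```go
package detachsketch

import "log"

// vsphereVolumes is an illustrative stand-in for the cloud provider calls
// used by the vSphere detacher.
type vsphereVolumes interface {
	DiskIsAttached(volPath, nodeName string) (bool, error)
	DetachDisk(volPath, nodeName string) error
}

// detach mirrors the behaviour described: an error from the attachment check
// is logged and detach is attempted anyway; the early "already detached"
// return requires err == nil.
func detach(vs vsphereVolumes, volPath, nodeName string) error {
	attached, err := vs.DiskIsAttached(volPath, nodeName)
	if err != nil {
		log.Printf("error checking if %q is attached to %q, will try to detach anyway: %v",
			volPath, nodeName, err)
	}
	if err == nil && !attached {
		// Only a clean check result lets us mark the volume as already detached.
		return nil
	}
	if err := vs.DetachDisk(volPath, nodeName); err != nil {
		log.Printf("error detaching %q from %q: %v", volPath, nodeName, err)
		return err
	}
	return nil
}
```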
Cool! I missed the err == nil check in Detach().
Looks good to me now :)
Force-pushed from 96f031a to 9f89b57.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: divyenpatel, kerneltime, luomiao
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
@divyenpatel: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Automatic merge from submit-queue (batch tested with PRs 45569, 45602, 45604, 45478, 45550)
…45569-kubernetes-release-1.6
Automatic merge from submit-queue
Automated cherry pick of #45181 #45569 upstream release 1.6
Cherry pick of #45181 #45569 on release-1.6.
#45181: Filter out IPV6 addresses from NodeAddresses() returned by vSphere
#45569: Fixing VolumesAreAttached and DisksAreAttached functions in vSphere
@BaluDontu @luomiao @tusharnt
…reAttached functions in vSphere)
What this PR does / why we need it:
In vSphere HA, when a node failover happens, the node VM momentarily goes into a "not connected" state. During this time, if Kubernetes calls the VolumesAreAttached function, we return an incorrect map, with the status for each volume set to false, i.e. the detached state.
Volumes attached to the previous node need to be detached before they can be attached to the new node. Kubernetes attempts to check volume attachment. When the node VM is not accessible, or for any other reason we cannot determine whether a disk is attached, we were returning a map of volume path to attachment status with the status set to false. This was misinterpreted as the disks already being detached from the node, and Kubernetes marked the volumes as detached after the orphaned pod was cleaned up. This causes the volumes to remain attached to the previous node, and pod creation stays in the "ContainerCreating" state. Since both nodes are powered on, the volumes cannot be attached to the new node.
Logs before fix
In this change, we make sure the correct volume attachment map is returned, and if any error occurs while checking a disk's status, we return a nil map.
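A minimal before/after sketch of the behaviour described (simplified signatures; the real functions also take node and datacenter arguments, and the check callback stands in for checkDiskAttached):

```go
package vspheresketch

// Before (simplified): an error from the check was dropped, the entry stayed
// false, and callers read false as "already detached".
func volumesAreAttachedOld(volPaths []string, check func(string) (bool, error)) (map[string]bool, error) {
	attached := make(map[string]bool)
	for _, volPath := range volPaths {
		attached[volPath] = false
		ok, _ := check(volPath) // error ignored
		if ok {
			attached[volPath] = true
		}
	}
	return attached, nil
}

// After (simplified): any error aborts the whole check and a nil map is
// returned, so the caller keeps the volume marked attached and retries later.
func volumesAreAttachedNew(volPaths []string, check func(string) (bool, error)) (map[string]bool, error) {
	attached := make(map[string]bool)
	for _, volPath := range volPaths {
		ok, err := check(volPath)
		if err != nil {
			return nil, err
		}
		attached[volPath] = ok
	}
	return attached, nil
}
```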
Logs after fix
Which issue this PR fixes:
fixes #45464, vmware-archive#116
Special notes for your reviewer:
Verified this change on a locally built hyperkube image - v1.7.0-alpha.3.147+3c0526cb64bdf5-dirty.
Performed many failovers with large volumes (30 GB) attached to the pod.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-3xcvc
Node: node3/172.1.87.0
Status: Running
Powered Off node3's host. pod failed over to node2. Verified all 3 disks detached from node3 and attached to node2.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-qx0b0
Node: node2/172.1.9.0
Status: Running
Powered Off node2's host. pod failed over to node3. Verified all 3 disks detached from node2 and attached to node3.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-7849s
Node: node3/172.1.87.0
Status: Running
Powered Off node3's host. pod failed over to node1. Verified all 3 disks detached from node3 and attached to node1.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-26lp1
Node: node1/172.1.98.0
Status: Running
Powered off node1's host. pod failed over to node3. Verified all 3 disks detached from node1 and attached to node3.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-4pdtl
Node: node3/172.1.87.0
Status: Running
Powered off node3's host. pod failed over to node1. Verified all 3 disks detached from node3 and attached to node1.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-t375f
Node: node1/172.1.98.0
Status: Running
Powered off node1's host. pod failed over to node3. Verified all 3 disks detached from node1 and attached to node3.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-pn6ps
Node: node3/172.1.87.0
Status: Running
Powered off node3's host. pod failed over to node1. Verified all 3 disks detached from node3 and attached to node1.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-0wqc1
Node: node1/172.1.98.0
Status: Running
Powered off node1's host. pod failed over to node3. Verified all 3 disks detached from node1 and attached to node3.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-821nc
Node: node3/172.1.87.0
Status: Running
Release note:
CC: @BaluDontu @abrarshivani @luomiao @tusharnt @pdhamdhere