Fixing VolumesAreAttached and DisksAreAttached functions in vSphere #45569
Conversation
Hi @divyenpatel. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@gnufied can you review this PR? As we discussed, I made the fix in the VolumesAreAttached function.
Thanks for finding this bug! I just have a question about the DiskIsAttached function, which may have a similar issue to the one in this DisksAreAttached function.
```go
// original:
result, _ := checkDiskAttached(volPath, vmDevices, dc, vs.client)
if result {
    attached[volPath] = true
// updated:
result, err := checkDiskAttached(volPath, vmDevices, dc, vs.client)
```
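For context, a minimal, self-contained sketch of what handling the error here achieves (checkDiskAttached and the volume paths below are stand-ins; the real function also takes VM devices, datacenter, and client parameters):

```go
package main

import (
	"errors"
	"fmt"
)

// checkDiskAttached is a stand-in for the real vSphere check; it returns an
// error when the node VM cannot be queried.
func checkDiskAttached(volPath string) (bool, error) {
	if volPath == "[datastore] kubevols/unreachable.vmdk" {
		return false, errors.New("VM is not connected")
	}
	return true, nil
}

// disksAreAttached propagates the error instead of discarding it, so a failed
// check returns a nil map rather than entries that read as "detached".
func disksAreAttached(volPaths []string) (map[string]bool, error) {
	attached := make(map[string]bool)
	for _, volPath := range volPaths {
		result, err := checkDiskAttached(volPath)
		if err != nil {
			return nil, err
		}
		if result {
			attached[volPath] = true
		}
	}
	return attached, nil
}

func main() {
	m, err := disksAreAttached([]string{
		"[datastore] kubevols/pv-a.vmdk",
		"[datastore] kubevols/unreachable.vmdk",
	})
	fmt.Println(m, err) // nil map plus the error when any check fails
}
```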
Do we have a similar concern when checkDiskAttached is called inside the DiskIsAttached function?
It seems that if err is not nil, it will always return false. And according to the detach function here, even when the error is not nil, it will continue and mark this volume as already detached, which is the same problem that is resolved by this fix.
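To illustrate the concern (a hypothetical caller, not the actual Kubernetes code): if the error from the attachment check is dropped, a failed check becomes indistinguishable from a disk that is genuinely detached.

```go
package main

import (
	"errors"
	"fmt"
)

// diskIsAttached stands in for the real check; here it always fails because
// the node VM is unreachable.
func diskIsAttached(volPath, node string) (bool, error) {
	return false, errors.New("node VM is not connected")
}

func main() {
	// Error-dropping caller: (false, err) is read as "not attached".
	attached, _ := diskIsAttached("[datastore] kubevols/pv-a.vmdk", "node3")
	if !attached {
		fmt.Println("volume would be treated as already detached, even though the check failed")
	}
}
```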
For DiskIsAttached, we are returning the error, so it should be ok.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/vsphere/vsphere.go#L1000
But in AttachDisk I missed handling the error. I have added an error check.
Also verified that attacher.go -> Detach() properly handles the returned err. When DiskIsAttached returns an error along with false, we log the error and retry the detach operation. If the retry fails, we return the error.
Kubernetes treats the volume as already detached only when the error is nil and DiskIsAttached returns false.
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/vsphere_volume/attacher.go#L229
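A simplified sketch of the Detach() pattern described above (the vsphereVolumes interface and names here are illustrative, not the exact code at the linked line):

```go
package detachsketch

import "log"

// vsphereVolumes is an illustrative stand-in for the cloud provider calls
// used by the vSphere detacher.
type vsphereVolumes interface {
	DiskIsAttached(volPath, nodeName string) (bool, error)
	DetachDisk(volPath, nodeName string) error
}

// detach mirrors the behaviour described: an error from the attachment check
// is logged and detach is attempted anyway; the early "already detached"
// return requires err == nil.
func detach(vs vsphereVolumes, volPath, nodeName string) error {
	attached, err := vs.DiskIsAttached(volPath, nodeName)
	if err != nil {
		log.Printf("error checking if %q is attached to %q, will try to detach anyway: %v",
			volPath, nodeName, err)
	}
	if err == nil && !attached {
		// Only a clean check result lets us mark the volume as already detached.
		return nil
	}
	if err := vs.DetachDisk(volPath, nodeName); err != nil {
		log.Printf("error detaching %q from %q: %v", volPath, nodeName, err)
		return err
	}
	return nil
}
```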
Cool! I missed the err == nil check in Detach().
Looks good to me now :)
Force-pushed from 96f031a to 9f89b57.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: divyenpatel, kerneltime, luomiao
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
@divyenpatel: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Automatic merge from submit-queue (batch tested with PRs 45569, 45602, 45604, 45478, 45550)
…45569-kubernetes-release-1.6
Automatic merge from submit-queue
Automated cherry pick of #45181 #45569 upstream release 1.6
Cherry pick of #45181 #45569 on release-1.6.
#45181: Filter out IPV6 addresses from NodeAddresses() returned by vSphere
#45569: Fixing VolumesAreAttached and DisksAreAttached functions in vSphere
@BaluDontu @luomiao @tusharnt
…reAttached functions in vSphere)
What this PR does / why we need it:
In vSphere HA, when a node failover happens, the node VM momentarily goes into a "not connected" state. During this time, if Kubernetes calls the VolumesAreAttached function, we return an incorrect map, with the status for each volume set to false, i.e. the detached state.
Volumes attached to the previous node need to be detached before they can be attached to the new node. Kubernetes attempts to check volume attachment. When the node VM is not accessible, or for any other reason we cannot determine whether a disk is attached, we were returning a map of volume path to attachment status with the status set to false. This was misinterpreted as the disks already being detached from the node, and Kubernetes marked the volumes as detached after the orphaned pod was cleaned up. This causes the volumes to remain attached to the previous node, and pod creation stays in the "ContainerCreating" state. Since both nodes are powered on, the volumes cannot be attached to the new node.
Logs before fix
In this change, we make sure the correct volume attachment map is returned, and if any error occurs while checking a disk's status, we return a nil map.
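A minimal before/after sketch of the behaviour described (simplified signatures; the real functions also take node and datacenter arguments, and the check callback stands in for checkDiskAttached):

```go
package vspheresketch

// Before (simplified): an error from the check was dropped, the entry stayed
// false, and callers read false as "already detached".
func volumesAreAttachedOld(volPaths []string, check func(string) (bool, error)) (map[string]bool, error) {
	attached := make(map[string]bool)
	for _, volPath := range volPaths {
		attached[volPath] = false
		ok, _ := check(volPath) // error ignored
		if ok {
			attached[volPath] = true
		}
	}
	return attached, nil
}

// After (simplified): any error aborts the whole check and a nil map is
// returned, so the caller keeps the volume marked attached and retries later.
func volumesAreAttachedNew(volPaths []string, check func(string) (bool, error)) (map[string]bool, error) {
	attached := make(map[string]bool)
	for _, volPath := range volPaths {
		ok, err := check(volPath)
		if err != nil {
			return nil, err
		}
		attached[volPath] = ok
	}
	return attached, nil
}
```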
Logs after fix
Which issue this PR fixes:
fixes #45464, vmware-archive#116
Special notes for your reviewer:
Verified this change on a locally built hyperkube image - v1.7.0-alpha.3.147+3c0526cb64bdf5-dirty.
Performed many failovers with large volumes (30 GB) attached to the pod.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-3xcvc
Node: node3/172.1.87.0
Status: Running
Powered Off node3's host. pod failed over to node2. Verified all 3 disks detached from node3 and attached to node2.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-qx0b0
Node: node2/172.1.9.0
Status: Running
Powered Off node2's host. pod failed over to node3. Verified all 3 disks detached from node2 and attached to node3.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-7849s
Node: node3/172.1.87.0
Status: Running
Powered Off node3's host. pod failed over to node1. Verified all 3 disks detached from node3 and attached to node1.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-26lp1
Node: node1/172.1.98.0
Status: Running
Powered off node1's host. pod failed over to node3. Verified all 3 disks detached from node1 and attached to node3.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-4pdtl
Node: node3/172.1.87.0
Status: Running
Powered off node3's host. pod failed over to node1. Verified all 3 disks detached from node3 and attached to node1.
$ kubectl describe pod
Name: wordpress-mysql-2789807967-t375f
Node: node1/172.1.98.0
Status: Running
Powered off node1's host. pod failed over to node3. Verified all 3 disks detached from node1 and attached to node3.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-pn6ps
Node: node3/172.1.87.0
Status: Running
Powered off node3's host. pod failed over to node1. Verified all 3 disks detached from node3 and attached to node1.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-0wqc1
Node: node1/172.1.98.0
Status: Running
Powered off node1's host. pod failed over to node3. Verified all 3 disks detached from node1 and attached to node3.
$ kubectl describe pods
Name: wordpress-mysql-2789807967-821nc
Node: node3/172.1.87.0
Status: Running
Release note:
CC: @BaluDontu @abrarshivani @luomiao @tusharnt @pdhamdhere