Skip to content

Conversation

@FengyunPan
Copy link

If node does not exist, node's volumes will be detached
automatically and become available. So mark them detached and
return false without error.
Fix #50266

Special notes for your reviewer:
/assign @jingxu97

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 8, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @FengyunPan. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 8, 2017
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Aug 8, 2017
@FengyunPan
Copy link
Author

/assign @luomiao
/assign @kerneltime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During host failover when node VM goes to disconnected state or is not accessible for any reason we are returning a Map of volumepath and its attachment status set to false. This will be misinterpreted as disks are already detached from the node and Kubernetes mark volumes as detached after orphaned pod is cleaned up.

This can causes volumes to remain attached to previous node, and new pod creation on another node will remains in the “containercreating” state. Since both the node vms are powered on and hot volumes can not be attached to other node.

We need to test this PR change for vSphere HA failover.

@BaluDontu @luomiao @tusharnt @rohitjogvmw

@luomiao
Copy link

luomiao commented Aug 8, 2017

@FengyunPan Thanks for coming up this PR regarding to the detached volumes.
However, I think this will revert a previous PR to fix another HA issue: #45569
As @divyenpatel has mentioned, directly setting the status of volumes to detached will cause a race condition when the previous dead node is restarted. More explanation can be found in that PR.

Edit:
sorry I missed that this is regarding to an issue when the instance is deleted from cloud provider.
In this case we will need a fix to handle both of the two situations: when a instance is deleted from cloud provider and when a VM is dead/restarted in vCenter. This PR will handle the first situation but will have a problem with the second situation.

@jingxu97
Copy link
Contributor

jingxu97 commented Aug 8, 2017

@luomiao Could you confirm that when a node is deleted, will the volume be detached automatically? For all other cloud providers such as GCE PD, AWS, that is the behavior. So the function DiskIsAttached() in cloud provider should return false (not attached) without error.

@luomiao
Copy link

luomiao commented Aug 8, 2017

@jingxu97 I feel there are two different issues are mixed here :) but the short answer is no, volumes are not detached automatically.

What @FengyunPan 's PR has changed is not when the node is deleted. Instead, it's when the node is powered off but still exists in vCenter. In this case, volumes are still attached to the VM. Thus we cannot simply return false (means disk is not attached). The variable and function names are misleading here. Instead of nodeExist, it should be nodeActive.

Regarding to your question, it also depends on how you define "deleting".

  1. If "deleting" means "removing the node from kubernetes cluster":
    Then the node is till alive in vCenter. So the volume is still attached to the VM.
  2. If "deleting" means "removing the node from vCenter":
    Removing node from vCenter involves two steps: power off the node and then delete the node.
    When the node is powered-off but not deleted yet, the volumes are attached to the VM.

@divyenpatel would you please confirm if the above understanding is correct or not?

@jingxu97
Copy link
Contributor

jingxu97 commented Aug 8, 2017

@luomiao By "deleted" I mean deleted from vCenter, similar to the case of node instance is deleted from cloud provider.

The code nodeExist, err := vs.NodeExists(nodeName) what does it mean? I thought it means the node is deleted from vcenter already?

@luomiao
Copy link

luomiao commented Aug 8, 2017

@jingxu97
The current NodeExists is misleading.
It could return nodeExist=false and err=nil, when the node is powered-off but not deleted.
Now I think it's not a problem with this PR but a problem with the NodeExists function.
Please allow us some time to do a few testings to confirm :)
The change to NodeExists function might be required for this PR.

@luomiao
Copy link

luomiao commented Aug 8, 2017

@jingxu97
Just did some double check. Sorry to take back my words.
NodeExists with a return value nodeExist=false means the node is not in an active status (powered off, disconnected, not found...).
Nodes can be in a non-active state (powered off, disconnected) while volumes are still attached.
In the case of vsphere cloud provider, deleting the node from cloud provider can make the nodes into any of the non-active status (powered off, disconnected, not found...).
But this PR could mark volumes as detached even the volumes might still be attached. Thus it's not correct.

@jingxu97
Copy link
Contributor

jingxu97 commented Aug 8, 2017

Is there a way to identify whether the node is deleted from vsphere cloud provider. If no, could we add one?

@FengyunPan
Copy link
Author

FengyunPan commented Aug 9, 2017

The variable and function names are misleading here. Instead of nodeExist, it should be nodeActive.
NodeExists with a return value nodeExist=false means the node is not in an active status (powered off, disconnected, not found...).

I see, @luomiao thank for your comments.

Is there a way to identify whether the node is deleted from vsphere cloud provider. If no, could we add one?

I think it is necessary for attachDetachManager, we should add one.

@FengyunPan FengyunPan force-pushed the mark-detached-vsphere branch from 2d9952a to 0f1a34e Compare August 9, 2017 02:41
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 9, 2017
@FengyunPan FengyunPan force-pushed the mark-detached-vsphere branch 3 times, most recently from e0a4030 to 4b0fca1 Compare August 9, 2017 03:14
@luomiao
Copy link

luomiao commented Aug 9, 2017

@FengyunPan @jingxu97
Sorry for the delayed reply.
The current PR looks good to me.

We will submit another PR to fix NodeExists (NodeActive) function to return InstanceNotFound error when a node is not existing anymore.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2017
@dims
Copy link
Member

dims commented Aug 9, 2017

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 9, 2017
@jingxu97
Copy link
Contributor

jingxu97 commented Aug 9, 2017

@luomiao should we wait until your fix is merged to merge this PR? Thanks!

@FengyunPan
Copy link
Author

@luomiao @divyenpatel @jingxu97 I have add codes to check if node exists and rebase the PR ( #49164), Please take a look.

@FengyunPan FengyunPan force-pushed the mark-detached-vsphere branch from 48a5c58 to 05a748b Compare August 10, 2017 04:56
@FengyunPan
Copy link
Author

/retest

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the change made in the PR, but just came across this unwanted code block.
I think this is not required.

if vmMoList[0].Summary.Config.Template == false {
 		glog.Warningf("VM is not in %s state", ActivePowerState)
 	} else {
 		glog.Warningf("VM is a template")
}

Copy link
Author

@FengyunPan FengyunPan Aug 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change of Exists() is helpful for reader, is right?
Yeah, that is not required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes IsActive is helpful for reader.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returned error should be different. When the Node VM is powered off, we can still determine whether disks are attached to powered off VM or not. Current error message does not look good in this case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, there is no need to determine wherher node is powered off in DisksAreAttached() and DisksisAttached().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as commented for DisksAreAttached. When node VM is powered off, we can still check if disks are attached to the Node VM, and return actual state of the disk.

Copy link
Member

@divyenpatel divyenpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided my review comments.

@FengyunPan FengyunPan force-pushed the mark-detached-vsphere branch from 05a748b to 8e8da99 Compare August 11, 2017 02:26
Copy link
Member

@divyenpatel divyenpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes looks good to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes IsActive is helpful for reader.

Copy link
Contributor

@BaluDontu BaluDontu Aug 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current log statement is very confusing as when the node is not found, disks will not detached automatically.
Can you rephrase this info log to
glog.Infof("Node %q does not exist, vsphere CP will assume all disks %v are not attached to it.", nodeName, volPaths).
Same at other places as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thank for your comment.

@FengyunPan FengyunPan force-pushed the mark-detached-vsphere branch from 8e8da99 to 0927121 Compare August 14, 2017 01:54
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 14, 2017
FengyunPan added 2 commits August 14, 2017 10:09
If node does not exist, node's volumes will be detached
automatically and become available. So mark them detached and
return false without error.
Fix kubernetes#50266
@FengyunPan FengyunPan force-pushed the mark-detached-vsphere branch from 0927121 to ea32f06 Compare August 14, 2017 02:16
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 14, 2017
glog.Errorf("Failed to get VM object for node: %q. err: +%v", vSphereInstance, err)
return nil, err
}
nodeExist, err := vm.Exists(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one more question. If node is powered off and the pod is placed on to other available nodes, will the volumes attached to powered off node detach automatically?

What logic in here, would cause it to happen?

Its because i have one PR which does this #40118. Throwing an error when node is powered off caused the volume to be detached automatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it matter if the node is powered off or not when the DisksAreAttached() call is made.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bacongobbler Have you see the comment from @divyenpatel:"Same as commented for DisksAreAttached. When node VM is powered off, we can still check if disks are attached to the Node VM, and return actual state of the disk."
It seem it doesn't matter if the node is powered off or not when the DisksAreAttached() call is made.
On the other hand, the "vm.Exists(ctx)" just check if vm is power on, can't check if vm is exist.

@FengyunPan
Copy link
Author

@dims @jingxu97 @kerneltime @luomiao @BaluDontu - Any thoughts on it?

@BaluDontu
Copy link
Contributor

The change looks good to me. /lgtm from my side.

@luomiao
Copy link

luomiao commented Aug 17, 2017

sorry for the late reply.
I thought this PR is already merged...
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 17, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FengyunPan, luomiao

Associated issue: 50266

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 7c13d65 into kubernetes:master Aug 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants