Skip to content

Conversation

@andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Mar 9, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixed this issue:
pod could not be terminated when pod volume path is corrupted due to smb server return error status: resource temporarily unavailable, in this condition, pod will be in Terminating forever. In this PR, it would call Unmount if PathExists returns any error.

$ kubectl get po
NAME              READY   STATUS        RESTARTS   AGE
nginx-flex-smb   0/1     Terminating   0          6h46m

Which issue(s) this PR fixes:

Fixes #75233

Special notes for your reviewer:
EAGAIN - APPLICATION_ERROR: "resource temporarily unavailable"
This error happens when SMB server is unavailable due to network issue, password change, smb path change, etc. , in this PR, it would call Unmount if PathExists returns any error.

workaround for this issue

  • on agent node, run mount | grep cifs to get all cifs mounts
  • check every cifs mount, e.g.
sudo ls -lt /var/lib/kubelet/pods/5c949781-4c6d-11e9-825d-000d3a0dd47b/volumes/microsoft.com~smb/test
ls: cannot access '/var/lib/kubelet/pods/5c949781-4c6d-11e9-825d-000d3a0dd47b/volumes/microsoft.com~smb/test': Permission denied
  • if ls -lt ... failed, run
sudo umount /var/lib/kubelet/pods/5c949781-4c6d-11e9-825d-000d3a0dd47b/volumes/microsoft.com~smb/test

And then the pod will be in terminated soon.

Does this PR introduce a user-facing change?:

fix pod stuck issue due to corrupt mnt point in flexvol plugin, call Unmount if PathExists returns any error

/priority important-soon
/assign @jingxu97 @msau42 @davidz627

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 9, 2019
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 9, 2019
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 9, 2019
@gnufied
Copy link
Member

gnufied commented Mar 9, 2019

/assign

cc @chakri-nelluri

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2019
@andyzhangx
Copy link
Member Author

thanks guys for the comments, let me hold this PR, and track the root case
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2019
@andyzhangx andyzhangx changed the title fix pod stuck issue due to corrupt mnt point fix pod stuck issue due to corrupt mnt point (resource temporarily unavailable) Mar 22, 2019
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 22, 2019
@andyzhangx
Copy link
Member Author

PTAL, thanks.

@andyzhangx
Copy link
Member Author

hi @jingxu97 @msau42 I have reverted the code change in PathExists func, make this PR only in flexvol function. Change of PathExists func may affect too many plugins, could you take a look? I would also cherry pick this PR to old release, thanks.


if pathErr != nil && !mount.IsCorruptedMnt(pathErr) {
return fmt.Errorf("Error checking path: %v", pathErr)
if pathErr != nil {
Copy link
Contributor

@jingxu97 jingxu97 May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about
if !pathExists & pathErr == nil {
return nil
}
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then this function would depend on return value of PathExists, if it return <false, error>, it won't proceed, mainly I would like to cherry pick to old release, only for flexvol.TearDownAt func

Copy link
Contributor

@jingxu97 jingxu97 May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sorry my mistake. I corrected it.
only if it return (false, nil) which means path is definitely not exist , it will return immediately without any further actions.

Copy link
Member Author

@andyzhangx andyzhangx May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jingxu97 logic of this PR is identical to this your suggested comment:

	if pathErr != nil {
		// only log warning here since plugins should anyways have to deal with errors
		klog.Warningf("Error checking path: %v", pathErr)
	} else {
		if !pathExists {
			klog.Warningf("Warning: Unmount skipped because path does not exist: %v", dir)
			return nil
		}
	}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the logic is the same. I am thinking to make it easier to read. And I think the other PR #75645 also uses the simpler way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jingxu97 shall we go with this PR first, and I would like to cherry pick this PR to old release, thanks.

@andyzhangx
Copy link
Member Author

/hold cancel
@jingxu97 PTAL, already addressed your comments, thanks.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2019
@andyzhangx
Copy link
Member Author

ping @msau42 @jingxu97

@jingxu97
Copy link
Contributor

jingxu97 commented Jun 7, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2019
@andyzhangx
Copy link
Member Author

andyzhangx commented Jun 8, 2019

/lgtm

still needs /approve, and are we going to merge this into v1.15? @jingxu97

@jingxu97
Copy link
Contributor

jingxu97 commented Jun 9, 2019 via email

@andyzhangx
Copy link
Member Author

/assign @saad-ali @chakri-nelluri @MikaelCluseau

@k8s-ci-robot
Copy link
Contributor

@andyzhangx: GitHub didn't allow me to assign the following users: MikaelCluseau.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/assign @saad-ali @chakri-nelluri @MikaelCluseau

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42
Copy link
Member

msau42 commented Jun 11, 2019

IIUC, before, Unmount would not be called if PathExists returns non-corrupted error.

Now what this fix does is it always calls Unmount if PathExists returns any error, corrupted or not. Is that the intended behavior @andyzhangx? From the PR description, it sounds like you wanted some change in behavior for corrupted errors?

@andyzhangx
Copy link
Member Author

andyzhangx commented Jun 11, 2019

IIUC, before, Unmount would not be called if PathExists returns non-corrupted error.

Now what this fix does is it always calls Unmount if PathExists returns any error, corrupted or not. Is that the intended behavior @andyzhangx? From the PR description, it sounds like you wanted some change in behavior for corrupted errors?

I have changed the PR description, current reality is there is some unexpected error in flexvolume path, and in this PR, it would call Unmount if PathExists returns any error. @msau42

@msau42
Copy link
Member

msau42 commented Jun 11, 2019

/approve

Thanks for clarifying! Can you also fix the release note?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jingxu97, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2019
@andyzhangx
Copy link
Member Author

/approve

Thanks for clarifying! Can you also fix the release note?

done, thanks.

@k8s-ci-robot k8s-ci-robot merged commit 3a598d7 into kubernetes:master Jun 14, 2019
@k8s-ci-robot
Copy link
Contributor

@andyzhangx: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-integration fb81a28 link /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added a commit that referenced this pull request Jun 20, 2019
…5234-upstream-release-1.12

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
k8s-ci-robot added a commit that referenced this pull request Jun 20, 2019
…5234-upstream-release-1.13

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
k8s-ci-robot added a commit that referenced this pull request Jun 20, 2019
…5234-upstream-release-1.14

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
k8s-ci-robot added a commit that referenced this pull request Jun 28, 2019
…5234-upstream-release-1.15

Automated cherry pick of #75234: fix flexvol stuck issue due to corrupted mnt point
@andyzhangx andyzhangx changed the title fix pod stuck issue due to corrupt mnt point in flexvol plugin fix pod stuck in Terminating issue due to corrupt mnt point in flexvol plugin Jul 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pod could not be terminated when pod volume path is corrupted

8 participants