Skip to content

Conversation

@andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Mar 7, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fix following issue:
pod with smb(cifs) mount could not be terminated on Windows when that smb account is invalid, the pod will be stuck forever

This PR is another windows fix related to #74625

Which issue(s) this PR fixes:

Fixes #75025

Special notes for your reviewer:
This PR implement IsCorruptedMnt func on Windows, unix code is actully unchanged.
The warnning logs in this PR would be like following in kubelet which is expected:

W0308 05:53:34.344557    2392 mount_helper_windows.go:52] IsCorruptedMnt failed with ERROR_LOGON_FAILURE error: CreateFile c:\var\lib\kubelet\pods\4fe70e4a-4166-11e9-bf93-000d3af95268\volumes\kubernetes.io~azure-file\pvc-4fe1243c-4166-11e9-bf93-000d3af95268: The user name or password is incorrect.
W0308 05:53:34.347757    2392 mount_helper_windows.go:52] IsCorruptedMnt failed with ERROR_LOGON_FAILURE error: CreateFile c:\var\lib\kubelet\pods\4fe70e4a-4166-11e9-bf93-000d3af95268\volumes\kubernetes.io~azure-file\pvc-4fe1243c-4166-11e9-bf93-000d3af95268: The user name or password is incorrect.
I0308 05:53:34.347757    2392 mount_helper_common.go:74] "c:\\var\\lib\\kubelet\\pods\\4fe70e4a-4166-11e9-bf93-000d3af95268\\volumes\\kubernetes.io~azure-file\\pvc-4fe1243c-4166-11e9-bf93-000d3af95268" is a mountpoint, unmounting
I0308 05:53:34.347757    2392 mount_windows.go:149] azureMount: Unmount target ("c:\\var\\lib\\kubelet\\pods\\4fe70e4a-4166-11e9-bf93-000d3af95268\\volumes\\kubernetes.io~azure-file\\pvc-4fe1243c-4166-11e9-bf93-000d3af95268")

Does this PR introduce a user-facing change?:

fix smb unmount issue on Windows

/sig windows
/priority important-soon
/assign @jingxu97 @davidz627 @msau42

cc @feiskyer could you mark milestone of this PR, thanks.

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 7, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/windows Categorizes an issue or PR as relevant to SIG Windows. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 7, 2019
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-cross

@k8s-ci-robot k8s-ci-robot requested review from jingxu97 and msau42 March 7, 2019 05:47
@feiskyer
Copy link
Member

feiskyer commented Mar 7, 2019

/milestone v1.14

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 7, 2019
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "password is incorrect" the only case for corrupt? Are there any other possibilities?

Copy link
Member Author

@andyzhangx andyzhangx Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, there must be other possiblities, while I don't know.
another fix is write like this which may have drawbacks, it depends on whether we use whiltelist or blacklist:

func IsCorruptedMnt(err error) bool {
    return err != nil
}

I think there are some errors still missing on Linux implementation since only following 4 errors caught in IsCorruptedMnt func:

func IsCorruptedMnt(err error) bool {
if err == nil {
return false
}
var underlyingError error
switch pe := err.(type) {
case nil:
return false
case *os.PathError:
underlyingError = pe.Err
case *os.LinkError:
underlyingError = pe.Err
case *os.SyscallError:
underlyingError = pe.Err
}
return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES

@jingxu97 @msau42 @davidz627 what do you think about this? why way do you prefer on Windows?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid password seems like a strange scenario to consider as a "corrupted" mount. Is that a normal error that users may encounter? Is it recoverable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these could also be returned

  • ERROR_NETWORK_ACCESS_DENIED
  • ERROR_BAD_NET_NAME
  • ERROR_NETNAME_DELETED

(reference list: https://docs.microsoft.com/en-us/windows/desktop/Debug/system-error-codes--0-499-)

Copy link
Member

@neolit123 neolit123 Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyzhangx

I think there are some errors still missing on Linux implementation since only following 4 errors caught in IsCorruptedMnt func

this command uses Stat() -> fstatat() on *unix
and i think the list of existing errors is odd.

here is what fstatat returns:
http://man7.org/linux/man-pages/man3/fstatat.3p.html


on Windows Stat() uses the stat() function here:
https://golang.org/src/os/stat_windows.go

so it will throw a PathError at least. i don't know where SMBConnectionError comes from.

in terms of Windows error codes @PatrickLang already listed some, i don't know the codes of a corrupted mklink / CreateSymbolicLink, but here is what you can do to check against them:

package main

import (
	"fmt"
	"golang.org/x/sys/windows"
	"syscall"
)

func main() {
	ptr, err := windows.UTF16PtrFromString("./some-missing-path")
	if err != nil {
		fmt.Println(err)
		return
	}
	attrib, err := windows.GetFileAttributes(ptr)
	if err != nil {
		fmt.Println(err.(syscall.Errno) == windows.ERROR_FILE_NOT_FOUND) // or any error code...
		return
	}
	fmt.Println(attrib)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then error info The user name or password is incorrect does not exists in types_windows.go

we can still verify against the integer (ERROR_LOGON_FAILURE, 1326)
https://docs.microsoft.com/en-us/windows/desktop/debug/system-error-codes--1300-1699-

e.g.

const ERROR_LOGON_FAILURE = 1326
... 
err.(syscall.Errno) == ERROR_LOGON_FAILURE

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neolit123 thanks, it's 1040, not in the list of https://docs.microsoft.com/en-us/windows/desktop/Debug/system-error-codes--1000-1299-, so shall I go with 1040 error checking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, i don't know what 1040 is, but probably best not to check for it.
golang just calls native functions and those follow only the existing error codes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neolit123 sorry, I was wrong, it's this error:

ERROR_LOGON_FAILURE
1326 (0x52E)
The user name or password is incorrect.

I think I have the correct solution now, I have this error check in the code first, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's great that it matches the real error codes from windows

@andyzhangx andyzhangx force-pushed the unmount-issue-windows branch from 251340f to 48aad50 Compare March 7, 2019 06:07
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-cross

@andyzhangx
Copy link
Member Author

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-integration

@michmike
Copy link
Contributor

michmike commented Mar 7, 2019

adding a hold until we identify all the proper mount failures on windows
/hold
@neolit123 can you also please review?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 7, 2019
@michmike
Copy link
Contributor

michmike commented Mar 7, 2019

cc: @benmoss for review as well

@PatrickLang
Copy link
Contributor

Just like #74625, this would also need backport to 1.11-1.13.

@PatrickLang
Copy link
Contributor

cc @KnicKnic - any feedback on identifying failed mounts?

@andyzhangx
Copy link
Member Author

andyzhangx commented Mar 8, 2019

thanks guys, I used errorno check as @neolit123 suggested, pushed a new commit: d0d49da

BTW, this issue is a common failure in SMB user scenario, for some reason(e.g. security consideration), user changed the password of SMB server, while the original SMB mount could not be terminated forever, on Linux, it has been fixed by adding one more check:

underlyingError == syscall.EACCES

return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES

While it is not fixed on Windows yet.

It looks like it's very close to code freeze date, can we make it into 1.14? @michmike

@michmike
Copy link
Contributor

michmike commented Mar 8, 2019

thanks guys, I used errorno check as @neolit123 suggested, pushed a new commit: d0d49da

BTW, this issue is a common failure in SMB user scenario, for some reason(e.g. security consideration), user changed the password of SMB server, while the original SMB mount could not be terminated forever, on Linux, it has been fixed by adding one more check:

underlyingError == syscall.EACCES

kubernetes/pkg/util/mount/mount_helper.go

Line 123 in 788f245

return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES
While it is not fixed on Windows yet.

It looks like it's very close to code freeze date, can we make it into 1.14?

don't worry about code freeze. we need to fix this properly , test it and then we can see about 1.14

@michmike
Copy link
Contributor

michmike commented Mar 8, 2019

i would like @adelina-t, @BCLAU to also have the fix run in the same environment where it failed before

@andyzhangx
Copy link
Member Author

i would like @adelina-t, @BCLAU to also have the fix run in the same environment where it failed before

This fix only applies for SMB unmount issue on Windows, happens when password changed, do we have any test failure on this before? @michmike

@michmike
Copy link
Contributor

michmike commented Mar 9, 2019

I have added more error code check and squashed all commits, PTAL again.
In the code, I used an array to collect all known errors.

Below are the errors we need to handle per windows team:

For SMB mounts, the typical errors for connectivity are 53 and 64.

53 ERROR_BAD_NETPATH <--> 0xc00000be STATUS_BAD_NETWORK_PATH
64 ERROR_NETNAME_DELETED <--> 0xc00000c9 STATUS_NETWORK_NAME_DELETED
64 ERROR_NETNAME_DELETED <--> 0xc000013b STATUS_LOCAL_DISCONNECT
64 ERROR_NETNAME_DELETED <--> 0xc000013c STATUS_REMOTE_DISCONNECT
64 ERROR_NETNAME_DELETED <--> 0xc000020b STATUS_ADDRESS_CLOSED
64 ERROR_NETNAME_DELETED <--> 0xc000020c STATUS_CONNECTION_DISCONNECTED
64 ERROR_NETNAME_DELETED <--> 0xc000020d STATUS_CONNECTION_RESET
64 ERROR_NETNAME_DELETED <--> 0xc0000466 STATUS_SERVER_UNAVAILABLE
64 ERROR_NETNAME_DELETED <--> 0xc0000480 STATUS_SHARE_UNAVAILABLE

For incorrect password/logon failure:
1326 ERROR_LOGON_FAILURE <--> c000006d STATUS_LOGON_FAILURE

This is an error if there is another existing mapping using a different set of credentials:
1219 ERROR_SESSION_CREDENTIAL_CONFLICT <--> c0000195 STATUS_NETWORK_CREDENTIAL_CONFLICT

This indicates the share does not exist:
67 ERROR_BAD_NET_NAME <--> c00000cc STATUS_BAD_NETWORK_NAME

do all those errors map to the same one with 64 as the error code?

@michmike
Copy link
Contributor

michmike commented Mar 9, 2019

i added some comments which you should resolve before removing the hold
/hold
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2019
@michmike
Copy link
Contributor

michmike commented Mar 9, 2019

pinging the sig-storage team to take a look at this as well

@msau42
Copy link
Member

msau42 commented Mar 9, 2019

Are these all errors that when if encountered, the mount point could still be successfully unmounted and cleaned up?

To give some context, the unmount sequence is like:

  1. Check if directory is a mount point <-- this is where the IsCorruptedMount check is done
  2. If it is a mount, then call Unmount
  3. If unmount was successful and the directory is no longer a mount point, remove the directory.

fix log warning

use IsCorruptedMnt in GetMountRefs on Windows

use errorno in IsCorruptedMnt check

fix comments: add more error code

add more error no checking

change year

fix comments
@andyzhangx andyzhangx force-pushed the unmount-issue-windows branch from 92469f2 to 720a5e2 Compare March 10, 2019 02:13
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2019
@andyzhangx
Copy link
Member Author

do all those errors map to the same one with 64 as the error code?

Partly correct, a lot of errors are mapping to 64, others are different error codes. @michmike

Are these all errors that when if encountered, the mount point could still be successfully unmounted and cleaned up?
To give some context, the unmount sequence is like:
Check if directory is a mount point <-- this is where the IsCorruptedMount check is done
If it is a mount, then call Unmount
If unmount was successful and the directory is no longer a mount point, remove the directory.

Yes, on Windows, all Unmount would be simply rmdir opertion, it will always succeed for those conditions(@msau42 ):

func (mounter *Mounter) Unmount(target string) error {
klog.V(4).Infof("azureMount: Unmount target (%q)", target)
target = normalizeWindowsPath(target)
if output, err := exec.Command("cmd", "/c", "rmdir", target).CombinedOutput(); err != nil {
klog.Errorf("rmdir failed: %v, output: %q", err, string(output))
return err

/hold cancel
/test pull-kubernetes-cross

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2019
@michmike
Copy link
Contributor

/approve

@michmike
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2019
@michmike
Copy link
Contributor

@msau42 if this looks good to you as well, please approve and we will go through the motions of getting this approved for 1.14

@michmike
Copy link
Contributor

/assign @jingxu97

@jingxu97
Copy link
Contributor

jingxu97 commented Mar 11, 2019

I have a questions of what is considered as corrupted mount point? In what conditions we should clean them up? For example, for error STATUS_SERVER_UNAVAILABLE, is that possible that server is just temporally unavailable, do we want to clean up the mount in this situation?

@jingxu97
Copy link
Contributor

Talked with @msau42 and @saad-ali, for this PR, adding more error codes should be safe. If it is considered corrected mount point, it will move to the next step trying to unmount and then check mountpoint before removing files and directories.

But the whole process of checking correpted mount point before trying to unmount and remove files could be simplified. I propose to call umount directly, if it fails, checks whether the path is a mount point or not, if not, delete the files and directories. Please see #75273

@jingxu97
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jingxu97, michmike

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2019
@k8s-ci-robot k8s-ci-robot merged commit 43b6ddf into kubernetes:master Mar 12, 2019
k8s-ci-robot added a commit that referenced this pull request Apr 18, 2019
…5087-upstream-release-1.12

Automated cherry pick of #75087: fix smb unmount issue on Windows
k8s-ci-robot added a commit that referenced this pull request May 7, 2019
…5087-upstream-release-1.13

Automated cherry pick of #75087: fix smb unmount issue on Windows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pod with smb mount could not be terminated when password becomes invalid on Windows

10 participants