-
Notifications
You must be signed in to change notification settings - Fork 42.1k
fix smb unmount issue on Windows #75087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix smb unmount issue on Windows #75087
Conversation
|
/test pull-kubernetes-cross |
|
/milestone v1.14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "password is incorrect" the only case for corrupt? Are there any other possibilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, there must be other possiblities, while I don't know.
another fix is write like this which may have drawbacks, it depends on whether we use whiltelist or blacklist:
func IsCorruptedMnt(err error) bool {
return err != nil
}
I think there are some errors still missing on Linux implementation since only following 4 errors caught in IsCorruptedMnt func:
kubernetes/pkg/util/mount/mount_helper.go
Lines 107 to 123 in a8492d7
| func IsCorruptedMnt(err error) bool { | |
| if err == nil { | |
| return false | |
| } | |
| var underlyingError error | |
| switch pe := err.(type) { | |
| case nil: | |
| return false | |
| case *os.PathError: | |
| underlyingError = pe.Err | |
| case *os.LinkError: | |
| underlyingError = pe.Err | |
| case *os.SyscallError: | |
| underlyingError = pe.Err | |
| } | |
| return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES |
@jingxu97 @msau42 @davidz627 what do you think about this? why way do you prefer on Windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invalid password seems like a strange scenario to consider as a "corrupted" mount. Is that a normal error that users may encounter? Is it recoverable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these could also be returned
- ERROR_NETWORK_ACCESS_DENIED
- ERROR_BAD_NET_NAME
- ERROR_NETNAME_DELETED
(reference list: https://docs.microsoft.com/en-us/windows/desktop/Debug/system-error-codes--0-499-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some errors still missing on Linux implementation since only following 4 errors caught in IsCorruptedMnt func
this command uses Stat() -> fstatat() on *unix
and i think the list of existing errors is odd.
here is what fstatat returns:
http://man7.org/linux/man-pages/man3/fstatat.3p.html
on Windows Stat() uses the stat() function here:
https://golang.org/src/os/stat_windows.go
so it will throw a PathError at least. i don't know where SMBConnectionError comes from.
in terms of Windows error codes @PatrickLang already listed some, i don't know the codes of a corrupted mklink / CreateSymbolicLink, but here is what you can do to check against them:
package main
import (
"fmt"
"golang.org/x/sys/windows"
"syscall"
)
func main() {
ptr, err := windows.UTF16PtrFromString("./some-missing-path")
if err != nil {
fmt.Println(err)
return
}
attrib, err := windows.GetFileAttributes(ptr)
if err != nil {
fmt.Println(err.(syscall.Errno) == windows.ERROR_FILE_NOT_FOUND) // or any error code...
return
}
fmt.Println(attrib)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then error info The user name or password is incorrect does not exists in types_windows.go
we can still verify against the integer (ERROR_LOGON_FAILURE, 1326)
https://docs.microsoft.com/en-us/windows/desktop/debug/system-error-codes--1300-1699-
e.g.
const ERROR_LOGON_FAILURE = 1326
...
err.(syscall.Errno) == ERROR_LOGON_FAILURE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neolit123 thanks, it's 1040, not in the list of https://docs.microsoft.com/en-us/windows/desktop/Debug/system-error-codes--1000-1299-, so shall I go with 1040 error checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, i don't know what 1040 is, but probably best not to check for it.
golang just calls native functions and those follow only the existing error codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neolit123 sorry, I was wrong, it's this error:
ERROR_LOGON_FAILURE
1326 (0x52E)
The user name or password is incorrect.
I think I have the correct solution now, I have this error check in the code first, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's great that it matches the real error codes from windows
251340f to
48aad50
Compare
|
/test pull-kubernetes-cross |
|
/test pull-kubernetes-e2e-gce |
|
adding a hold until we identify all the proper mount failures on windows |
|
cc: @benmoss for review as well |
|
Just like #74625, this would also need backport to 1.11-1.13. |
|
cc @KnicKnic - any feedback on identifying failed mounts? |
|
thanks guys, I used errorno check as @neolit123 suggested, pushed a new commit: d0d49da BTW, this issue is a common failure in SMB user scenario, for some reason(e.g. security consideration), user changed the password of SMB server, while the original SMB mount could not be terminated forever, on Linux, it has been fixed by adding one more check: kubernetes/pkg/util/mount/mount_helper.go Line 123 in 788f245
While it is not fixed on Windows yet. It looks like it's very close to code freeze date, can we make it into 1.14? @michmike |
don't worry about code freeze. we need to fix this properly , test it and then we can see about 1.14 |
|
i would like @adelina-t, @BCLAU to also have the fix run in the same environment where it failed before |
This fix only applies for SMB unmount issue on Windows, happens when password changed, do we have any test failure on this before? @michmike |
do all those errors map to the same one with 64 as the error code? |
|
i added some comments which you should resolve before removing the hold |
|
pinging the sig-storage team to take a look at this as well |
|
Are these all errors that when if encountered, the mount point could still be successfully unmounted and cleaned up? To give some context, the unmount sequence is like:
|
fix log warning use IsCorruptedMnt in GetMountRefs on Windows use errorno in IsCorruptedMnt check fix comments: add more error code add more error no checking change year fix comments
92469f2 to
720a5e2
Compare
Partly correct, a lot of errors are mapping to 64, others are different error codes. @michmike
Yes, on Windows, all Unmount would be simply rmdir opertion, it will always succeed for those conditions(@msau42 ): kubernetes/pkg/util/mount/mount_windows.go Lines 148 to 153 in 01c9edf
/hold cancel |
|
/approve |
|
/lgtm |
|
@msau42 if this looks good to you as well, please approve and we will go through the motions of getting this approved for 1.14 |
|
/assign @jingxu97 |
|
I have a questions of what is considered as corrupted mount point? In what conditions we should clean them up? For example, for error STATUS_SERVER_UNAVAILABLE, is that possible that server is just temporally unavailable, do we want to clean up the mount in this situation? |
|
Talked with @msau42 and @saad-ali, for this PR, adding more error codes should be safe. If it is considered corrected mount point, it will move to the next step trying to unmount and then check mountpoint before removing files and directories. But the whole process of checking correpted mount point before trying to unmount and remove files could be simplified. I propose to call umount directly, if it fails, checks whether the path is a mount point or not, if not, delete the files and directories. Please see #75273 |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyzhangx, jingxu97, michmike The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…5087-upstream-release-1.12 Automated cherry pick of #75087: fix smb unmount issue on Windows
…5087-upstream-release-1.13 Automated cherry pick of #75087: fix smb unmount issue on Windows
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fix following issue:
pod with smb(cifs) mount could not be terminated on Windows when that smb account is invalid, the pod will be stuck forever
This PR is another windows fix related to #74625
Which issue(s) this PR fixes:
Fixes #75025
Special notes for your reviewer:
This PR implement
IsCorruptedMntfunc on Windows, unix code is actully unchanged.The warnning logs in this PR would be like following in kubelet which is expected:
Does this PR introduce a user-facing change?:
/sig windows
/priority important-soon
/assign @jingxu97 @davidz627 @msau42
cc @feiskyer could you mark milestone of this PR, thanks.