
fix azure disk mount failure on coreos and some other distros #54334

Merged: 1 commit merged into kubernetes:master on Oct 25, 2017

Conversation

andyzhangx (Member) commented Oct 21, 2017

What this PR does / why we need it:
Azure disk mounts fail on coreos because /dev/disk/azure is not populated there, and the same happens on some other distros.
After a long back-and-forth discussion, I have decided to fix this issue on the k8s side. Customers are leaving azure because of this udev issue; you can read the details in #50150. I have verified that the code works well on the current coreos image on azure.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged):
fixes #50150: azure_dd: managed disks don't pass "FormatAndMount"
On coreos on azure and some older distros, one udev rule is not applied, which means /dev/disk/azure is not populated, while the current disk detection logic relies on /dev/disk/azure, so this leads to azure disk mount failures. For more details, please refer to #50150.

Special notes for your reviewer:
In v1.6 there is no such issue because it does not use /dev/disk/azure to detect the LUN of the new disk, so I referred to the v1.6 code logic:
https://github.com/kubernetes/kubernetes/blob/release-1.6/pkg/volume/azure_dd/vhd_util.go#L59-L66

From v1.7 onward it uses /dev/disk/azure, and that code logic does not work on distros that do not have the udev rules (and therefore no /dev/disk/azure), see:
https://github.com/kubernetes/kubernetes/blob/release-1.7/pkg/volume/azure_dd/azure_common.go#L231-L236
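
For illustration, here is a minimal sketch (the helper name and exact checks are assumptions, not the upstream code) of how a LUN can be mapped to a device name by scanning /sys/bus/scsi/devices directly, without relying on the /dev/disk/azure symlinks:

package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"strconv"
	"strings"
)

// findDiskByLunSketch resolves an Azure data-disk LUN to a /dev/sdX name by
// walking /sys/bus/scsi/devices, similar in spirit to the v1.6 logic linked
// above. Entry names look like "5:0:0:1" (the last field is the LUN); the
// block device name is the single directory under "<entry>/block".
func findDiskByLunSketch(lun int) (string, error) {
	const sysPath = "/sys/bus/scsi/devices"
	entries, err := ioutil.ReadDir(sysPath)
	if err != nil {
		return "", err
	}
	for _, entry := range entries {
		name := entry.Name()
		parts := strings.Split(name, ":")
		if len(parts) != 4 {
			// Skip "hostN" and "targetH:C:T" entries.
			continue
		}
		target, err := strconv.Atoi(parts[0])
		if err != nil {
			continue
		}
		// Heuristic discussed in the review below: entries whose first field
		// is 0-3 belong to the OS and resource disks on Azure VMs; skip them.
		if target <= 3 {
			continue
		}
		l, err := strconv.Atoi(parts[3])
		if err != nil || l != lun {
			continue
		}
		devs, err := ioutil.ReadDir(filepath.Join(sysPath, name, "block"))
		if err != nil || len(devs) != 1 {
			continue
		}
		return "/dev/" + devs[0].Name(), nil
	}
	return "", fmt.Errorf("no data disk found at LUN %d", lun)
}

func main() {
	dev, err := findDiskByLunSketch(1)
	fmt.Println(dev, err)
}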

Release note:

fix azure disk mount failure on coreos and some other distros due to udev rules not being applied

/sig azure
@rootfs @brendanburns @colemickens
@tlyng @karataliu @szarkos

k8s-ci-robot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes.), sig/azure, size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.), and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels on Oct 21, 2017
andyzhangx (Member Author):

/test pull-kubernetes-unit

andyzhangx changed the title from "fix#50150: azure disk mount failure on coreos and some other distros" to "fix: azure disk mount failure on coreos and some other distros" on Oct 21, 2017
andyzhangx changed the title from "fix: azure disk mount failure on coreos and some other distros" to "fix azure disk mount failure on coreos and some other distros" on Oct 21, 2017
andyzhangx (Member Author):

/test pull-kubernetes-kubemark-e2e-gce

andyzhangx (Member Author):

/test pull-kubernetes-unit

glog.Errorf("failed to parse target from %v (%v), err %v", arr[0], name, err)
continue
}
// as observed, targets 0-3 are used by OS disks. Skip them
Contributor:

I used to use this heuristic, but it is not reliable: on certain instances (last time I used an F instance and found this check behaved differently than on a D instance), this check fails.

Member Author:

Which check are you referring to? Do you mean "/sys/bus/scsi/devices" is different on different VM instance types?

Member Author:

Hi @rootfs, I have tried coreos on a Standard F1s VM and it works well. Do you have any other instance suggestion?

Member Author:

Pasting all the info and logs from the F instance. I attached two disks and tried to detect the disk at LUN 1:

azureuser@andy-coreosf /tmp $ ./detect-new-lun1
Azure sys disks paths: []there is nothing under /dev/disk/azure/
there is nothing under /dev/disk/azure/
there is nothing under /dev/disk/azure/
there is nothing under /dev/disk/azure/
/dev/sdd <nil>
azureuser@andy-coreosf /tmp $ ls /sys/bus/scsi/devices -lt
total 0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 target0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 target3:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 target5:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5/target5:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 5:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5/target5:0:0/5:0:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 5:0:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5/target5:0:0/5:0:0:1
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host5 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host5
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 3:0:1:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host3 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host4 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host4
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 0:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host1 -> ../../../devices/pci0000:00/0000:00:07.1/ata1/host1
lrwxrwxrwx. 1 root root 0 Oct 23 02:52 host2 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host2

Contributor:

As observed, targets 0-3 are used by OS disks; there are similar observations elsewhere too.

However, this observation failed when I was using an F8s instance running RHEL, and that gave rise to this fix.

Member Author:

Hi @rootfs, thanks for the check.
The latest code already contains #44295. The current coreos issue is that /dev/disk/azure/ is not populated on coreos, which means there is no filtering, see: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/azure_dd/azure_common_linux.go#L119-L124. In the end, findDiskByLun would return a wrong dev name that is actually the OS disk. So my fix adds back the skip logic when azureDisks is empty (a sketch follows below).
I have also tested on an RHEL F instance, and my PR works well. Let me know if there is any other issue, thx.
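
A minimal sketch of that fallback (hypothetical helper name, not the exact PR diff): when the /dev/disk/azure symlinks are missing, fall back to skipping the low-numbered targets used by OS disks.

// shouldSkipTarget sketches the added fallback: when /dev/disk/azure is not
// populated (azureDisks is empty), there is no symlink-based filtering, so
// fall back to the v1.6-style heuristic of skipping targets 0-3, which are
// used by the OS and resource disks.
func shouldSkipTarget(azureDisks []string, target int) bool {
	if len(azureDisks) > 0 {
		// /dev/disk/azure is populated; the existing symlink-based filtering
		// already excludes the OS disk, so no heuristic is needed here.
		return false
	}
	return target <= 3
}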

Contributor:

@andyzhangx thanks, please glog a message here.

Member Author:

@rootfs done

rootfs (Contributor) commented Oct 21, 2017

Setting consistent udev rules on coreos is a more robust fix.

rootfs (Contributor) commented Oct 21, 2017

WALinuxAgent should populate the same udev rules
https://raw.githubusercontent.com/Azure/WALinuxAgent/master/config/66-azure-storage.rules

andyzhangx (Member Author) commented Oct 21, 2017

Hi @rootfs, I agree WALinuxAgent should create the same udev rule, but the current coreos image on azure does not have this udev rule, so lots of customers are complaining about azure disk mount failures on coreos. In this situation, I suggest we provide a k8s-side fix for this issue. You can refer to #50150 for more details; there are lots of customer complaints in that thread. Adding @jdumars for your comments, thx.

tlyng commented Oct 23, 2017

The CoreOS image I am using already has this udev rules file populated, so this does not fix anything, at least in my case.

andyzhangx (Member Author):

@tlyng then it may be a different issue; could you please paste the pod failure info into #50150? It's important for diagnostics. Thx.
Anyway, I would still like to fix this issue on the k8s side.

tlyng commented Oct 24, 2017

@andyzhangx I don't know what is causing it; many times it works flawlessly. Sometimes when disks get detached, it looks like the VM gets stuck in

{
  "...": "...",
  "provisioningState" : "updating"
}

Why this happens is beyond my knowledge; perhaps the agent service? It renders the node useless at least. I currently have one node in this updating state (due to a chart upgrade which moved the PVC from one node to another; the source node is still in the "updating" state). Hopefully this times out, but I've seen it last for over 24h. In my current situation it has been waiting for 45 minutes, which is unacceptable.

andyzhangx (Member Author):

@tlyng could you paste your pod info into issue #50150?
kubectl describe pod YOUR-POD-NAME

This is a PR thread; please discuss your issue in #50150, or you may open a new issue, thx.

rootfs (Contributor) commented Oct 25, 2017

/lgtm

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Oct 25, 2017
k8s-github-robot:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, rootfs

Associated issue: 50150

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 25, 2017
k8s-github-robot:
Automatic merge from submit-queue (batch tested with PRs 54336, 54470, 54334, 54175). If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-github-robot merged commit 59ad01f into kubernetes:master on Oct 25, 2017
k8s-github-robot pushed a commit that referenced this pull request Oct 26, 2017
…4334-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #54334

Cherry pick of #54334 on release-1.8.

#54334: fix#50150: azure disk mount failure on coreos
k8s-github-robot pushed a commit that referenced this pull request Oct 27, 2017
…4334-upstream-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #54334

Cherry pick of #54334 on release-1.7.

#54334: fix#50150: azure disk mount failure on coreos
andyzhangx deleted the coreos-azure-fix branch on May 8, 2018 07:18