Kubernetes version: 1.6.2 master and nodes
Environment: GKE
What happened: The following has been happening for the last week or two.
I noticed loads of pods being evicted with the following message from kubectl describe:
Node: gke-prow-build-pool-a89df2af-4bc8/
Status: Failed
Reason: Evicted
Message: The node was low on resource: nodefs.
The node shows as Ready, but its status from kubectl get no also reports disk pressure:
status:
  conditions:
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T15:43:02Z
    message: kubelet has disk pressure
    reason: KubeletHasDiskPressure
    status: "True"
    type: DiskPressure
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T01:07:06Z
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
kubectl describe no shows lots of ImageGCFailed.
FirstSeen  LastSeen  Count  From                                        SubObjectPath  Type      Reason                Message
---------  --------  -----  ----                                        -------------  --------  ------                -------
4h         19s       564    kubelet, gke-prow-build-pool-a89df2af-4bc8                 Warning   EvictionThresholdMet  Attempting to reclaim nodefs
4h         18s       54     kubelet, gke-prow-build-pool-a89df2af-4bc8                 Warning   ImageGCFailed         (events with common reason combined)
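For anyone checking whether other nodes are in the same state, here is a rough sketch that looks for nodes reporting both Ready and DiskPressure as "True". It assumes the node list has been loaded from the JSON form of the node objects (for example, the output of kubectl get no -o json); the function names are made up for illustration and are not part of any Kubernetes client.

```python
def condition_status(node, cond_type):
    """Return the status string ("True", "False", "Unknown") of one node condition, or None."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def ready_with_disk_pressure(node_list):
    """Return names of nodes that are Ready while also reporting DiskPressure.

    node_list is assumed to be the parsed JSON of a Kubernetes NodeList,
    i.e. a dict with an "items" list of node objects.
    """
    names = []
    for node in node_list.get("items", []):
        ready = condition_status(node, "Ready") == "True"
        pressure = condition_status(node, "DiskPressure") == "True"
        if ready and pressure:
            names.append(node["metadata"]["name"])
    return names
```

With the JSON saved to a file, json.load on that file gives a dict that can be passed straight to ready_with_disk_pressure.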
Kubelet logs show that it's failing to delete the images and free up disk space. For each image it shows this every 10 seconds:
A I0509 17:59:31.183907 1453 image_gc_manager.go:335] [imageGCManager]: Removing image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" to free 1105710474 bytes
A E0509 17:59:31.186643 1453 remote_image.go:124] RemoveImage "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" from image service failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30
A E0509 17:59:31.186705 1453 kuberuntime_image.go:126] Remove image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30
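Since the same error repeats for every image, it may help to summarize which stopped containers are blocking which images. The following is a minimal sketch that assumes the kubelet log lines have been saved as text and that the Docker conflict message always has the wording quoted above; the regex and the function name are illustrative only.

```python
import re
from collections import defaultdict

# Matches the Docker conflict message as it appears in the kubelet log lines above.
CONFLICT_RE = re.compile(
    r"unable to delete (?P<image>[0-9a-f]+) \(must be forced\) - "
    r"image is being used by stopped container (?P<container>[0-9a-f]+)"
)


def blocking_containers(log_lines):
    """Map each short image ID to the set of stopped container IDs blocking its removal."""
    blocked = defaultdict(set)
    for line in log_lines:
        match = CONFLICT_RE.search(line)
        if match:
            blocked[match.group("image")].add(match.group("container"))
    return dict(blocked)
```

The resulting container IDs can then be compared with the output of docker ps -a on the node to see why those containers are still around.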
What you expected to happen:
I would be happy if the node were marked unschedulable when it's out of disk. I would also be happy if the images successfully clean up. As it is, the node just evicts any pod that attempts to run on it.
How to reproduce it:
I don't know how to reproduce from scratch, but I've cordoned this node and can give access to someone for debugging.
Please let me know if you need more information, and apologies if this is a dupe.
cc @kubernetes/sig-node-bugs