ImageGCFailed, unable to delete images and reclaim disk. #45558

@spxtr

Kubernetes version: 1.6.2 master and nodes
Environment: GKE
What happened: The following has been happening for the last week or two.

I noticed loads of pods being evicted with the following message from kubectl describe:

Node:		gke-prow-build-pool-a89df2af-4bc8/
Status:		Failed
Reason:		Evicted
Message:	The node was low on resource: nodefs.

kubectl get no shows the node as Ready, but also reports that it has disk pressure:

status:
  conditions:
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T15:43:02Z
    message: kubelet has disk pressure
    reason: KubeletHasDiskPressure
    status: "True"
    type: DiskPressure
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T01:07:06Z
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
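
For anyone checking other nodes for the same state, here is a minimal sketch that flags a node reporting both Ready and DiskPressure as "True". It assumes the node's status has already been loaded into a plain dict (for example from JSON output); the function names are hypothetical and the sample data is trimmed from the conditions above.

```python
def conditions_by_type(node_status):
    """Index the node's conditions list by each condition's "type" field."""
    return {cond["type"]: cond for cond in node_status.get("conditions", [])}


def ready_but_under_disk_pressure(node_status):
    """Return True when the node reports Ready and DiskPressure at the same time."""
    conds = conditions_by_type(node_status)
    # The API reports condition status as the strings "True", "False" or "Unknown".
    ready = conds.get("Ready", {}).get("status") == "True"
    pressure = conds.get("DiskPressure", {}).get("status") == "True"
    return ready and pressure


# Sample trimmed from the node status shown above.
status = {
    "conditions": [
        {"type": "DiskPressure", "status": "True", "reason": "KubeletHasDiskPressure"},
        {"type": "Ready", "status": "True", "reason": "KubeletReady"},
    ]
}

print(ready_but_under_disk_pressure(status))
```

A node in this state still accepts pods from the scheduler's point of view, which matches the eviction loop described in this report.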

kubectl describe no shows lots of ImageGCFailed.

  FirstSeen	LastSeen	Count	From						SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----						-------------	--------	------			-------
  4h		19s		564	kubelet, gke-prow-build-pool-a89df2af-4bc8			Warning		EvictionThresholdMet	Attempting to reclaim nodefs
  4h		18s		54	kubelet, gke-prow-build-pool-a89df2af-4bc8			Warning		ImageGCFailed		(events with common reason combined)

Kubelet logs show that it's failing to delete the images and free up disk space. For each image it shows this every 10 seconds:

A  I0509 17:59:31.183907    1453 image_gc_manager.go:335] [imageGCManager]: Removing image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" to free 1105710474 bytes 
A  E0509 17:59:31.186643    1453 remote_image.go:124] RemoveImage "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" from image service failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30 
A  E0509 17:59:31.186705    1453 kuberuntime_image.go:126] Remove image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30 
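
To see which stopped containers are blocking which images across a large log, the error lines can be grouped with a small script. This is only a sketch for triage, assuming the Docker conflict message has exactly the wording quoted above; the function name and regular expression are hypothetical helpers, not part of the kubelet.

```python
import re
from collections import defaultdict

# Matches the Docker conflict message as it appears in the kubelet log lines above.
CONFLICT_RE = re.compile(
    r"unable to delete (?P<image>[0-9a-f]+) \(must be forced\) - "
    r"image is being used by stopped container (?P<container>[0-9a-f]+)"
)


def blocking_containers(log_lines):
    """Map each short image ID to the set of stopped containers blocking its removal."""
    blocked = defaultdict(set)
    for line in log_lines:
        match = CONFLICT_RE.search(line)
        if match:
            blocked[match.group("image")].add(match.group("container"))
    return dict(blocked)


# The two error lines from the log excerpt above, shortened to the relevant part.
sample = [
    'E0509 17:59:31.186643 1453 remote_image.go:124] RemoveImage failed: '
    "conflict: unable to delete fa60023475d8 (must be forced) - "
    "image is being used by stopped container 8641d5395d30",
    'E0509 17:59:31.186705 1453 kuberuntime_image.go:126] Remove image failed: '
    "conflict: unable to delete fa60023475d8 (must be forced) - "
    "image is being used by stopped container 8641d5395d30",
]

print(blocking_containers(sample))
```

Both error lines refer to the same image and container, so the result has a single entry. Lines without the conflict message, such as the "Removing image" info line, are ignored.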

What you expected to happen:

I would be happy if the node were marked unschedulable when it's out of disk. I would also be happy if the images were cleaned up successfully. As it is, the node just evicts any pod that attempts to run on it.

How to reproduce it:

I don't know how to reproduce from scratch, but I've cordoned this node and can give access to someone for debugging.

Please let me know if you need more information, and apologies if this is a dupe.

cc @kubernetes/sig-node-bugs

Labels: kind/failing-test, sig/node, sig/testing