Description
Is this a BUG REPORT or FEATURE REQUEST?: bug
/kind bug
What happened:
- Deployment running with read-only GCE PD.
- A node containing one of the deployment pods is deleted and recreated soon after.
- New pod is scheduled to the same node, but is stuck in ContainerCreating state. The node's kubelet logs show continuous "Volume is not yet attached according to node" errors.
What you expected to happen: Pod recreated successfully.
Anything else we need to know?: Caused by PR #45923
The issue happens as follows:
- When the node is deleted, the GCE PD is detached by the cloud provider.
- AttachDetachController discovers the PD has been detached, and marks it as detached.
- The original pod still exists in the API server (i.e. before garbage collection), so AttachDetachController reattaches the PD, successfully.
- Before the node recovers, AttachDetachController tries to update the node status but finds that the node doesn't exist. Following the logic introduced in PR #45923 ("Node status updater now deletes the node entry in attach updates..."), it never attempts to update the node status again.
- The node recovers, a new pod is scheduled to it, waits for a successful attach signal in the node status but never gets it.
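
The sequence above can be illustrated with a small, self-contained Go sketch. This is not the Kubernetes code: `nodeStatusUpdater`, its fields, `sync`, and `simulate` are hypothetical names invented for illustration, and the model assumes the updater keeps a per-node set of pending attach updates. It only contrasts dropping the pending entry when the node is missing with keeping it for a later retry.

```go
package main

import "fmt"

// nodeStatusUpdater is a hypothetical, simplified stand-in for the
// controller's node status updater. It is NOT the real implementation.
type nodeStatusUpdater struct {
	existingNodes     map[string]bool     // nodes currently present in the API server
	pending           map[string][]string // node -> attached volumes not yet written to node status
	reported          map[string][]string // node -> volumes already written to node status
	dropOnMissingNode bool                // true models the behavior described in this issue
}

func newUpdater(dropOnMissingNode bool) *nodeStatusUpdater {
	return &nodeStatusUpdater{
		existingNodes:     map[string]bool{},
		pending:           map[string][]string{},
		reported:          map[string][]string{},
		dropOnMissingNode: dropOnMissingNode,
	}
}

// sync tries to write every pending attach update into the node status.
func (u *nodeStatusUpdater) sync() {
	for node, vols := range u.pending {
		if !u.existingNodes[node] {
			if u.dropOnMissingNode {
				// The entry is discarded, so no later sync will retry it.
				delete(u.pending, node)
			}
			continue
		}
		u.reported[node] = vols
		delete(u.pending, node)
	}
}

// simulate replays the steps from the issue and reports whether the
// recreated node ever sees the volume as attached in its status.
func simulate(dropOnMissingNode bool) bool {
	u := newUpdater(dropOnMissingNode)
	u.pending["node-1"] = []string{"pd-1"} // reattach succeeded while the node object is gone
	u.sync()                               // node does not exist yet
	u.existingNodes["node-1"] = true       // node is recreated
	u.sync()                               // next status update pass
	return len(u.reported["node-1"]) > 0
}

func main() {
	fmt.Println("drop entry on missing node, attach reported:", simulate(true))
	fmt.Println("keep entry on missing node, attach reported:", simulate(false))
}
```

In this model, dropping the entry leaves the recreated node without an attach signal, which matches the pod being stuck in ContainerCreating. Keeping the entry is only one possible way to let a later pass succeed and is shown here as an assumption, not as the fix adopted upstream.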