Scheduler gets stuck (tries to schedule a pod endlessly but never succeeds) #1751

Closed
ddysher opened this issue Oct 12, 2014 · 5 comments · Fixed by #1752
Labels: kind/bug, priority/important-soon

Comments

@ddysher (Contributor) commented Oct 12, 2014

When creating a replication controller (e.g. the frontend php-redis controller from the guestbook example), the scheduler gets stuck and never recovers. This is a race: I have to run the scenario several times to see the error (a script deletes the controller and its pods, then creates a new controller, leaving enough time in between for k8s to react). I've created a PR with an attempted fix.

E1012 13:56:42.305589 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.332367 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.354623 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.378341 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.402696 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.427481 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.574170 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.596701 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:42.621299 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
I1012 13:56:42.650949 05928 request.go:274] Waiting for completion of /operations/2120
E1012 13:56:44.652685 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:44.672077 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:44.696668 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:44.721061 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:44.747720 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:44.805098 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:44.826355 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
I1012 13:56:44.858229 05928 request.go:274] Waiting for completion of /operations/2127
E1012 13:56:46.859055 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:46.887165 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:46.913196 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:46.940598 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:46.960659 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:46.983646 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:47.080963 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:47.103699 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:47.127745 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:47.152460 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:47.176050 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
I1012 13:56:47.309377 05928 request.go:274] Waiting for completion of /operations/2138
E1012 13:56:49.310294 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.331656 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.360355 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.384414 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.418865 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.556415 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.580934 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.604854 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.630138 05928 factory.go:173] Error scheduling 89f7f92d-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
E1012 13:56:49.665224 05928 factory.go:173] Error scheduling 89f8519b-5217-11e4-8eaf-42010af0907f: The assignment would cause a constraint violation; retrying
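
The "retrying" lines above come from the scheduler's error handler in factory.go, which requeues a pod whenever a scheduling attempt fails. Below is a minimal Go sketch of that behaviour; the names, pod IDs, and the stale-port explanation are illustrative assumptions, not the code from this repo or the root cause established by #1752. It shows why the loop never terminates if the constraint never clears, e.g. when the scheduler still counts host ports claimed by pods that were already deleted:

```go
// Illustrative sketch only -- not the actual Kubernetes scheduler code.
// It shows how "requeue and retry on error" spins forever when the failing
// constraint is never re-evaluated against fresh cluster state.
package main

import (
	"errors"
	"fmt"
	"time"
)

type Pod struct {
	ID       string
	HostPort int
}

// schedule tries to place pod on one of the minions; usedPorts tracks the
// host ports the scheduler *believes* are taken on each minion.
func schedule(pod Pod, minions []string, usedPorts map[string]map[int]bool) (string, error) {
	for _, m := range minions {
		if pod.HostPort == 0 || !usedPorts[m][pod.HostPort] {
			return m, nil
		}
	}
	return "", errors.New("the assignment would cause a constraint violation")
}

func main() {
	minions := []string{"minion-1", "minion-2"}
	// Hypothetical stale state: port 8000 is still marked as used on every
	// minion by pods that were deleted before the new controller was created.
	usedPorts := map[string]map[int]bool{
		"minion-1": {8000: true},
		"minion-2": {8000: true},
	}
	pod := Pod{ID: "89f8519b-...", HostPort: 8000} // hypothetical pod from the new controller

	// The error handler just requeues and retries; since usedPorts never
	// changes, this would spin forever -- capped here for the sketch.
	for attempt := 0; attempt < 5; attempt++ {
		if _, err := schedule(pod, minions, usedPorts); err != nil {
			fmt.Printf("Error scheduling %s: %v; retrying\n", pod.ID, err)
			time.Sleep(10 * time.Millisecond)
			continue
		}
		fmt.Printf("scheduled %s\n", pod.ID)
		break
	}
}
```

Whatever the exact stale state turns out to be, the symptom is the same: every attempt fails, the pod is requeued, and the two pods alternate in the log indefinitely.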
@abonas (Contributor) commented Oct 13, 2014

@ddysher I saw this too while following the guestbook example, at the php frontend step. In my case I had 2 minions while the replication controller was configured for 3 replicas, so 2 pods got scheduled, one on each minion, while the third pod had an empty Host (visible when doing 'list pods'), and all 3 of them were stuck in "waiting" status.
I deleted the controller and the pods, changed the configuration in the JSON file to 2 replicas, recreated the controller, and everything worked.
Does your fix cover this use case (too many replicas for too few minions, so one of the pods is never scheduled)?

@ddysher (Contributor, Author) commented Oct 13, 2014

@abonas This issue is different from your use case. If you have more replicas than minions, then the scheduler will get stuck for sure (with the error message probably being 'failed to find a fit for pod'). AFAICT, that case is not supported yet.

@lavalamp added the kind/bug and priority/important-soon labels on Oct 13, 2014
@lavalamp (Member) commented
@abonas If you remove the HostPorts, it should work for you. @ddysher thanks for the report.
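
To expand on why dropping the HostPorts helps in the 3-replicas / 2-minions case: a pod that asks for a fixed hostPort can only be placed on a minion where no other pod already claims that port. Here is a minimal Go sketch of that kind of port-fit check; the names and structure are assumptions for illustration, not the actual scheduler predicate. With three identical replicas all requesting hostPort 8000 and only two minions, the third replica has nowhere to fit and stays unscheduled, which matches the "empty Host" pod described above. Without a hostPort, the check always passes and all three replicas can share minions.

```go
// Illustrative sketch of a host-port fit check; not the real scheduler code.
package main

import "fmt"

type pod struct {
	name     string
	hostPort int // 0 would mean "no fixed host port requested"
}

// fits reports whether p can land on a minion already running `existing`
// pods: a fixed host port can be claimed by at most one pod per minion.
func fits(p pod, existing []pod) bool {
	if p.hostPort == 0 {
		return true
	}
	for _, e := range existing {
		if e.hostPort == p.hostPort {
			return false
		}
	}
	return true
}

func main() {
	minions := []string{"minion-1", "minion-2"}
	scheduled := map[string][]pod{}
	replicas := []pod{
		{"frontend-1", 8000},
		{"frontend-2", 8000},
		{"frontend-3", 8000}, // with only 2 minions, this one can never fit
	}
	for _, r := range replicas {
		placed := false
		for _, m := range minions {
			if fits(r, scheduled[m]) {
				scheduled[m] = append(scheduled[m], r)
				fmt.Printf("%s -> %s\n", r.name, m)
				placed = true
				break
			}
		}
		if !placed {
			fmt.Printf("%s -> unscheduled: every minion already has a pod on hostPort %d\n", r.name, r.hostPort)
		}
	}
}
```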

@abonas (Contributor) commented Oct 14, 2014

@lavalamp, @ddysher thanks for your replies.
I understand why it isn't working, but I'd expect one of these two options to happen instead of the current situation:

  1. Fail the creation of the replication controller if there are fewer minions than replicas AND a hostPort property is set. In other words, validate the content at creation time.
  2. At least schedule all the replicas that can get a minion (in my case 2 out of 3) and leave only one in "waiting".

Is there an issue open for the use case I describe?

@lavalamp (Member) commented
@abonas #2 should already be true today. #1 is less feasible than it seems: minions can come and go, and become over- or under-loaded, in the time between creation and when the pod shows up at the scheduler. So even if we check up front, we still have to check again later, and sometimes the first check will pass but not the second.
