Replies: 1 comment
Maybe we could try introducing a retry mechanism?
I’m running a Django application using Celery for background tasks (file generation and backend processing).
When the app starts, tasks execute and return results normally.
After around 1 hour of uptime, delayed tasks stop returning results — their state remains “PENDING” because the result key no longer exists in Memcached.
On the Django side, the frontend polls the backend periodically to check task status, but Celery never retrieves any result.
No errors are logged on the worker side.
Restarting Memcached (thus dropping all connections) reproduces the problem immediately: Celery never reconnects and all delayed tasks lose their result backend connection.
This affects all delayed tasks, except “fire-and-forget” ones (which don’t store results).
I also reproduced the issue when switching between the default Memcached client and pylibmc.
Observed behavior
After ~1 hour (sometimes less), delayed tasks remain in PENDING state.
The result key no longer exists in Memcached.
No error or reconnect attempt appears in Celery logs.
Restarting Memcached immediately triggers the issue (workers stay connected to dead sockets).
Restarting Celery workers temporarily fixes it.
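To tell a lost result apart from a task that has not run yet, the result key can be looked up directly, for example from a Django shell. This is a minimal sketch, assuming Celery's key/value backends store results under `celery-task-meta-<task_id>` and that `client` is any object with a memcached-style `get(key)` method. `result_key` and `classify_result` are hypothetical helper names, not Celery APIs.

```python
# Assumed default key prefix used by Celery's key/value result backends.
TASK_KEY_PREFIX = "celery-task-meta-"


def result_key(task_id: str) -> str:
    """Build the cache key where the task result is expected (assumption)."""
    return f"{TASK_KEY_PREFIX}{task_id}"


def classify_result(client, task_id: str) -> str:
    """Return 'stored', 'missing', or 'unreachable: ...' for a task's result key.

    `client` is any object exposing a memcached-style get(key) method.
    """
    try:
        value = client.get(result_key(task_id))
    except Exception as exc:  # client libraries raise different error types
        return f"unreachable: {exc!r}"
    return "stored" if value is not None else "missing"
```

Celery reports PENDING for any task id it has no stored state for, so a `missing` result for a task the worker has already finished would suggest the write was lost rather than the task still being queued.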
Expected behavior
Celery should detect when the Memcached connection has been dropped (e.g., due to idle TCP timeout in Kubernetes) and reconnect automatically instead of silently failing to read/write results.
Hypothesis
Celery’s Memcached backend keeps persistent connections open to Memcached.
After some idle time, either Memcached or the Kubernetes network stack closes the connection.
Since Celery’s Memcached backend doesn’t retry or reopen the connection, all subsequent result writes/reads silently fail.
This explains why result keys disappear and tasks remain “pending”.
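If the hypothesis holds, the missing piece is a reconnect-and-retry step around each cache operation. The sketch below illustrates that idea in isolation; it is not Celery's implementation, and `ReconnectingClient` and `factory` are hypothetical names. It assumes the underlying client raises an exception when the socket is dead.

```python
import time


class ReconnectingClient:
    """Wrap a cache client and rebuild it when an operation fails.

    `factory` is a zero-argument callable returning a fresh client
    (hypothetical; for example a function that constructs a memcached client).
    """

    def __init__(self, factory, retries=2, delay=0.1):
        self._factory = factory
        self._retries = retries
        self._delay = delay
        self._client = factory()

    def _call(self, method, *args, **kwargs):
        last_error = None
        for attempt in range(self._retries + 1):
            try:
                return getattr(self._client, method)(*args, **kwargs)
            except Exception as exc:  # narrow to the client's error types in practice
                last_error = exc
                if attempt < self._retries:
                    time.sleep(self._delay * (attempt + 1))  # linear backoff
                    self._client = self._factory()  # replace the dead connection
        raise last_error

    def get(self, key):
        return self._call("get", key)

    def set(self, key, value, *args, **kwargs):
        return self._call("set", key, value, *args, **kwargs)
```

Some clients may report failures through return values instead of exceptions, in which case this wrapper would never trigger and the return value would need to be checked as well.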
What I’ve checked
Memcached was not restarted when the issue occurred.
No socket error or connection reset in Celery logs.
Memcached metrics (curr_connections, get_hits, get_misses) remain stable.
RabbitMQ and task dispatching are unaffected — only result backend lookups fail.
Issue reproduced using both the python-memcached and pylibmc backends.