Labels: question (further information is requested)
Bug Report: Random Long Execution Time in Requests
Description
We are experiencing an issue where some requests take an abnormally long time to complete, seemingly at random. After extensive debugging, we added detailed logs to our application to track the execution time of each step. The results show that the delays occur in a completely unpredictable manner.
Expected Behavior
The majority of requests should be processed within a normal time frame without sudden, unexplained pauses in execution.
Current Behavior
- Most requests are processed successfully and within the expected time.
- However, a small percentage of requests (~5 out of 100,000) experience extreme delays.
- Logs indicate that the affected requests have execution gaps averaging around 50 seconds, and in some cases extending up to 1 hour, at random points during processing.
- Once the delay ends, the request execution resumes as if nothing happened.
- The delay does not appear to be tied to a specific endpoint, payload, or request pattern.
- Some requests get stuck at different points in the execution, sometimes between queries, sometimes at the beginning of execution, etc.
- Our database queries never take that long to execute, so it does not seem to be related to database performance.
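For context, the step timing described above was measured roughly along these lines (a minimal sketch, not our actual instrumentation; the step names are placeholders):

```php
<?php
// Sketch of the per-step timing we log around each unit of work.
// Step names are placeholders; real requests log their actual stages.
$last = hrtime(true);

foreach (['load_user', 'run_query', 'render_response'] as $step) {
    // ... the actual work for $step happens here ...

    $now = hrtime(true);
    $elapsedMs = ($now - $last) / 1e6;

    // Flag any unexplained wall-clock gap longer than one second.
    if ($elapsedMs > 1000) {
        error_log("execution gap before {$step}: {$elapsedMs} ms");
    }
    $last = $now;
}
```

It is with this kind of logging that we see the 50-second-plus gaps appear between steps that normally take milliseconds.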
Steps to Reproduce
We have not been able to determine a specific pattern or trigger for the issue, but the following conditions are observed:
- The issue occurs randomly across different requests.
- The issue is not tied to a specific endpoint, user, or payload.
- The execution gap does not appear to be related to database queries or external API calls (as per our logs).
- No CPU or memory spikes are observed on the server when delays occur.
- We have tried adjusting `max_request`, coroutine limits, and database connection pool configurations, but the issue persists.
Environment
- Hyperf Version: 3.1.42
- PHP Version: 8.3.14
- Swoole Version: 6.0.1
- Operating System: Running in Docker/Kubernetes
Swoole Configuration
```
Swoole => enabled
Author => Swoole Team <[email protected]>
Version => 6.0.1
Built => Feb 17 2025 02:52:16
coroutine => enabled with boost asm context
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
openssl => OpenSSL 3.1.8 11 Feb 2025
dtls => enabled
http2 => enabled
json => enabled
curl-native => enabled
curl-version => 8.12.1
c-ares => 1.27.0
zlib => 1.3.1
brotli => E16781312/D16781312
mutex_timedlock => enabled
pthread_barrier => enabled
coroutine_pgsql => enabled
coroutine_odbc => enabled
coroutine_sqlite => enabled
```
Hyperf Server Configuration
```php
return [
    'mode' => SWOOLE_PROCESS,
    'servers' => [
        [
            'name' => 'http',
            'type' => Server::SERVER_HTTP,
            'host' => '0.0.0.0',
            'port' => 9501,
            'sock_type' => SWOOLE_SOCK_TCP,
            'callbacks' => [
                Event::ON_REQUEST => [Hyperf\HttpServer\Server::class, 'onRequest'],
            ],
            'options' => [
                // Whether to enable request lifecycle event
                'enable_request_lifecycle' => true,
            ],
        ],
    ],
    'settings' => [
        Constant::OPTION_ENABLE_COROUTINE => true,
        Constant::OPTION_WORKER_NUM => swoole_cpu_num(),
        Constant::OPTION_PID_FILE => BASE_PATH . '/runtime/hyperf.pid',
        Constant::OPTION_OPEN_TCP_NODELAY => true,
        Constant::OPTION_MAX_COROUTINE => 100000,
        Constant::OPTION_OPEN_HTTP2_PROTOCOL => true,
        Constant::OPTION_MAX_REQUEST => 1500,
        Constant::OPTION_SOCKET_BUFFER_SIZE => 2 * 1024 * 1024,
        Constant::OPTION_BUFFER_OUTPUT_SIZE => 2 * 1024 * 1024,
    ],
    'callbacks' => [
        Event::ON_WORKER_START => [Hyperf\Framework\Bootstrap\WorkerStartCallback::class, 'onWorkerStart'],
        Event::ON_PIPE_MESSAGE => [Hyperf\Framework\Bootstrap\PipeMessageCallback::class, 'onPipeMessage'],
        Event::ON_WORKER_EXIT => [Hyperf\Framework\Bootstrap\WorkerExitCallback::class, 'onWorkerExit'],
    ],
];
```
Additional Questions
- Are there any known cases of request suspension due to Hyperf's coroutine scheduler?
- Could this be related to coroutine locks, resource contention, or garbage collection?
- Are there any debugging tools we can use to trace coroutine execution and detect what is causing the delay?
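In case it helps the discussion, one approach we are considering is a periodic watchdog that dumps backtraces of long-lived coroutines, using Swoole's introspection APIs (`Coroutine::list()`, `Coroutine::getElapsed()`, `Coroutine::getBackTrace()`). This is only a rough sketch; the interval and threshold are arbitrary:

```php
<?php
use Swoole\Coroutine;
use Swoole\Timer;

// Rough sketch: every 10 seconds, dump a backtrace for any coroutine
// that has been alive for more than 30 seconds. Thresholds are arbitrary.
Timer::tick(10_000, function () {
    foreach (Coroutine::list() as $cid) {
        // getElapsed() returns the coroutine's wall-clock lifetime in ms.
        if (Coroutine::getElapsed($cid) > 30_000) {
            $trace = Coroutine::getBackTrace($cid, DEBUG_BACKTRACE_IGNORE_ARGS, 16);
            error_log("coroutine {$cid} alive > 30s:\n" . print_r($trace, true));
        }
    }
});
```

If there is a better-suited tool for tracing where a coroutine is suspended, we would be glad to hear about it.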
We appreciate any guidance on how to further investigate or mitigate this issue. Thank you!