global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags) #9117
Comments
Hi, I think it can enter an infinite loop if we ever hit this case.
It looks like the connection is restarted if
Following the restart attempt, it seems like a bug that
Line 1854 in cdd61f5
The other node (
Node 74 has canceled
Should there be a
What can lead to us hitting this case?
Lines 1809 to 1812 in cdd61f5
It is worth mentioning that we did not see this issue with two other nodes that joined the cluster in the same timeframe. As per the comment, hitting this case seems to be a race.
@neelima32 Great find! Yes, I think there is a throw missing after
I saw in the logs that the connections frequently went up and down between the nodes. This should not be due to the
Have you configured
Thanks. The connections go up and down as pods are recreated in a K8s cluster. We are still figuring out the sequence of events. All the nodes are connected in one big cluster. We don't have global_group configured. We haven't been able to reproduce the original problem (stale locks / OOM), so we cannot yet confirm that the fix addresses it.
@rickard-green We ran into this situation again at a customer site. They use OTP 25.3 without any changes in
We've been looking at how we get into this situation. A is the stable node. B is the flapping node.
It seems like quite the edge case. If this is valid, then at t5, A gets a second
Things we've observed when this occurs:
Describe the bug
Multiple nodes ran out of memory on a 45-node cluster.
We have memory_data in the logs, which indicates that the worst offender was the same process on all the affected nodes (global:locker). Process backtrace is available for one of the affected nodes.
To Reproduce
It has occurred once and we have not been able to reproduce it. Our guess is that it happens when there are multiple nodedown messages. One node (85) was failed over and added back to the cluster. While it was being added back, connections from node 85 to every other node in the cluster were repeatedly shut down and recreated (about 500 times). There are numerous net_kernel, nodedown messages in the logs. Immediately after node85 successfully establishes the dist connection to node74, global on node74 starts consuming memory. The memory consumption increases linearly.
Logs on node74 indicate 69M messages:
From <0.55.0>'s process dictionary:
It appears there have been 576460752302537370 - 576460752267720345 = 34,817,025 (about 34.8M) tags, and twice as many messages (one his_the_locker and one cancel per tag), resulting in roughly 69M messages.
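As a quick sanity check of the arithmetic above, here is a minimal sketch. The two tag values are copied from this report; the variable names are made up for illustration, and the assumption that each tag produces exactly one his_the_locker and one cancel message is our reading of the logs, not something confirmed in the global code.

```python
# Tag values copied from the report above. The names are hypothetical.
high_tag = 576460752302537370
low_tag = 576460752267720345

# Number of distinct tags between the two observed values.
tags = high_tag - low_tag

# Assumption: one his_the_locker and one cancel message per tag.
messages = 2 * tags

print(f"tags: {tags:,}")
print(f"messages: {messages:,}")
```

The doubled count is consistent with the roughly 69M messages seen in the message queue on node74.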
Affected versions
OTP 25.3
Additional context
Note that prevent_overlapping_partitions is set to false on all nodes, which is not the default (the default is true on OTP 25).
I'll extract and upload the global_(locks|pid_id|node_resources|...) ETS tables on the nodes for which logs are available.