Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags) #9117

Open
neelima32 opened this issue Nov 26, 2024 · 5 comments
Assignees
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@neelima32
Copy link

neelima32 commented Nov 26, 2024

Describe the bug
Multiple nodes ran out of memory on a 45-node cluster.
We have memory_data in the logs, which indicates that the worst offender was the same process on all the affected nodes (global:locker). Process backtrace is available for one of the affected nodes.

To Reproduce
It has occurred once and we have not been able to reproduce it. Our guess is it happens when there are multiple nodedown messages. One node (85) was failed over and added back to the cluster. While adding it back to the cluster, connections were repeatedly shut down and recreated (~500 times) (from 85 to every other node in the cluster). There are numerous net_kernel, nodedown messages in the logs. Immediately after node85 successfully establishes the dist connection to node74, global on node74 starts consuming memory. The memory consumption increases linearly.

Logs on node74 indicate 69M messages:

   {messages,
       [<<"{cancel,'ns_1@node-85',-576460752302537370,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387318,-576460752302537369}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537369,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387317,-576460752302537368}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537368,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387316,-576460752302537367}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537367,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387315,-576460752302537366}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537366,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387314,-576460752302537365}">>]},
   {message_queue_len,69628685},

From <0.55.0>'s process dictionary:

    {{sync_tag_my,
         'ns_1@node-85'},
     -576460752267720345},
    {{sync_tag_his,
         'ns_1@node-85'},
     -576460752268570427},

It appears there have been 576460752302537370 - 576460752267720345 = 34 M tags (and twice as many messages for his_the_locker/cancel resulting in 69M messages).

Affected versions
OTP 25.3

Additional context
Note that prevent_overlapping_partitions is false on all nodes (not the default: true on OTP 25).
I'll extract and upload the global_(locks|pid_id|node_resources|...) ETS tables on the nodes for which logs are available.

@neelima32 neelima32 added the bug Issue is reported as a bug label Nov 26, 2024
@neelima32 neelima32 changed the title global OOM issue (64M messages in global - his_the_locker and cancel with distinct tags) global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags) Nov 26, 2024
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Nov 27, 2024
@jhogberg jhogberg added the stalled waiting for input by the Erlang/OTP team label Dec 2, 2024
@neelima32
Copy link
Author

neelima32 commented Dec 4, 2024

logs.tar.gz

Screenshot 2024-12-03 at 4 07 35 PM

The memory consumption increases rapidly at:

  1. 7:33 AM - when node 83 joins the cluster, memory consumption rises in tandem on nodes 21 and 83.
  2. 8:21 AM - when node 87 joins the cluster, memory consumption rises in tandem on nodes 22 and 87.
  3. 10:37 AM - when node 85 joins the cluster, memory consumption rises in tandem on nodes 74 and 85.
  4. 11:39 AM - when node 51 joins the cluster, memory consumption rises on node 51.

We parsed global_locks. The locker process pids of nodes 83 and 22 are implicated on 28 nodes. This may mean the system has been in a bad state since 7:33 AM.

The logs include global ETS tables, stack info for global_name_server and locker processes, visible node-to-name mappings, net_kernel, and dist events, when available. The stack info for node 74 indicates repeated connection attempts. Unfortunately, we don't have stack info for the other nodes on which OOM occurred.

It appears that the global lock was not released and repeated connection attempts exhausted memory.

@neelima32
Copy link
Author

neelima32 commented Dec 4, 2024

Hi, I think it can enter an infinite loop if we ever hit this case.
https://github.com/erlang/otp/blob/master/lib/kernel/src/global.erl#L1824-L1832

                        true ->
                             %% Cancel this new connection attempt as well
                             %% as the ongoing connection setup then restart
                             %% the connect by making a new connection
                             %% attempt...

                             send_cancel_connect_message(Node, HisTag),
                             MyOldTag = get({sync_tag_my, Node}),
                             restart_connect(Node, MyOldTag, S0)

It looks like the connection is restarted if MyOldTag exists and a new tag is generated for the session with Node in handle_nodeup:
https://github.com/erlang/otp/blob/master/lib/kernel/src/global.erl#L3002-L3003

Following the restart attempt, it seems like a bug that HisTag that was just canceled, is inserted again:
https://github.com/erlang/otp/blob/master/lib/kernel/src/global.erl#L1836-L1837

        put({prot_vsn, Node}, Vsn),
        put({sync_tag_his, Node}, HisTag),
        case lists:keyfind(Node, 1, S1#state.resolvers) of
            {Node, MyTag, _Resolver} ->
                MyTag = get({sync_tag_my, Node}), % assertion
                {locker, _NoLongerAPid, _HisKnown0, HisTheLocker} = InitMsg,
                ?trace({init_connect,{histhelocker,HisTheLocker}}),
                HisKnown = [],
                S1#state.the_locker ! {his_the_locker, HisTheLocker,
                                       {Vsn,HisKnown}, HisTag, MyTag},

{pending, Node} is added:

S1#state{known = Known1#{{pending, Node} => HisVsn}};

The other node (Node) will cancel in response to cancel_connect, restart its connection but find itself pending - and the cycle repeats. The stack on node0074 indicates this:

   {messages,
       [<<"{cancel,'ns_1@node-85',-576460752302537370,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387318,-576460752302537369}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537369,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387317,-576460752302537368}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537368,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387316,-576460752302537367}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537367,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387315,-576460752302537366}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537366,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387314,-576460752302537365}">>]},

Node 74 has canceled -576460752302537370 (likely after just canceling the connect with -576460752303387318 from node 85).
It sends itself his_the_locker with node 85's -576460752303387318 and the new tag it minted in restart_connect / handle_nodeup -576460752302537369, then adds {pending, Node} again.
The next init_connect attempt from Node with newer tag -576460752303387317 finds itself pending, which initiates a cancel -576460752302537369, cancel_connect -576460752303387317.

Should there be a throw after the restart_connect?

What can lead to us hitting this case?

otp/lib/kernel/src/global.erl

Lines 1809 to 1812 in cdd61f5

%% This should not be possible unless global group has
%% been configured. We got an already ongoing connection
%% setup with Node and get yet another connection attempt
%% from Node.

It is worth mentioning that we did not see this issue with two other nodes that joined the cluster in the same timeframe. It seems like a race that we hit this case as per the comment.

@rickard-green rickard-green removed the stalled waiting for input by the Erlang/OTP team label Dec 10, 2024
@rickard-green
Copy link
Contributor

@neelima32 Great find! Yes, I think there is a throw missing after restart_connect() in init_connect(). I've fixed that in the rickard/global-restart-connect-fix/OTP-19381 branch (based on OTP 25.3) in my repository. The branch also fixes another issue with restart_connect() where locks could be left hanging on the node. Please test this branch if possible. Note that I've only done some basic testing so far for the branch.

I saw in the logs that the connections frequently went up and down between the nodes. This should not be due to the restart_connect() since it only restarts the global connect protocol and do not take down connections. Do you know the reason for the frequent loss/setup of connections?

Have you configured global_group or are all nodes connected in a big cluster?

@neelima32
Copy link
Author

Thanks. The connections go up and down as pods are recreated in a K8s cluster. We are still figuring out the sequence of events.

All the nodes are connected in a big cluster. We don't have global_group configured.

We haven't been able to reproduce the original problem (stale locks / OOM) - to confirm the fix addresses it.

@neelima32
Copy link
Author

neelima32 commented Dec 16, 2024

@rickard-green We ran into this situation again at a customer site. They use OTP 25.3 without any changes in global. We haven't been able to reproduce it. However, this occurs in a K8s environment when a pod is recreated multiple times. This happens during a Kubernetes upgrade when pods are drained and new ones are created. We see net_kernel wait_pending events on other nodes before the pod stabilizes and the connection is accepted.

We've been looking at how we get into this situation.
(1) It seems the sync sessions use erlang:unique_integer([monotonic]) to mint sync tags for sessions. When the pod is recreated several times, the initial sync tags may overlap with those it used previously.
(2) When we hit the wait_pending situation, a nodedown and nodeup may be delivered closer than they would have otherwise (i.e. clean disconnect, followed by connect.)
(3) If an ongoing sync session is interrupted by a node that keeps going up and down, is the following situation possible?

A is the stable node. B is the flapping node.
A had an in-flight sync session with B, HisTag:200 (life 1). B goes down and comes back up with a reused Tag:200 from its previous life.
The ongoing dist connection is killed (wait_pending) only when the new one is established.
When A processes nodedown, it cancels the old sync session and sends a cancel_connect on the new connection (before the nodeup is processed). If A sends a cancel_connect to B, B considers the (reused) tag valid and restarts the connection.

Time    Node A                              Node B
----    -------                             -------
t1      Processing: {nodedown, B}           Processing: {nodeup, A}
        Sends: -                            Sends: -
        State: {pending, B}                 State: -
        Queue: {nodeup, B}                  Queue: -

t2      Processing: {nodedown, B}           Processing: {nodeup, A}
        Sends: cancel_connect(200)          Sends: init_connect(200) # tag 200 reused from B's previous life
        State: -                            State: -
        Queue: {nodeup, B}                  Queue: -

t3      Processing: {nodeup, B}             Processing: -
        Sends: init_connect(100)            Sends: -
        State: -                            State: -
        Queue: init_connect(200)            Queue: cancel_connect(200)
        
t4      Processing: init_connect(200)       Processing: cancel_connect(200)
        Sends: init_connect_ack(100,200)    Sends: init_connect(201) # retried because tag 200 matches
        State: {pending, B}                 State: -
        Queue: -                            Queue: init_connect(100)

t5      Processing: -                       Processing: init_connect(100)
        Sends: -                            Sends: init_connect_ack(201,100)
        State: {pending, B}                 State: {pending, A}
        Queue: init_connect(201)            Queue: init_connect_ack(100,200)

It seems like quite the edge case. If this is valid, then at t5, A gets a second init_connect while one is already ongoing - even if we don't use global group.

Things we've observed when this occurs:

  1. One pod is recreated rapidly (12-13 seconds)
  2. Other nodes don't detect this immediately. It appears TCP connections are not closed cleanly. NLB and k8s networking, pod termination specifics may contribute to this.
  3. As soon as the connection is accepted, two nodes (one of which always includes the pod that was recreated) enter a state of livelock where they flood each other with connection attempts but neither makes progress. This results in an OOM situation and unfortunately, no stack dumps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

4 participants