Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix v2 transport intermittent test failure (#29002) #29006

Conversation

theStack
Copy link
Contributor

@theStack theStack commented Dec 5, 2023

This PR improves the following fragile construct for detection of a new connection to the node under test in p2p_v2_transport.py:

num_peers = len(self.nodes[0].getpeerinfo())
s.connect(("127.0.0.1", p2p_port(0)))
self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == num_peers + 1)

Only relying on the number of peers for that suffers from race conditions, as unrelated previous peers could disconnect at anytime in-between. In the test run in #29002, the following happens:

  • getpeerinfo() is called the first time -> assigned to num_peers
  • previous peer disconnects, the node's peer count is now num_peers - 1 (in most test runs, this happens before the first getpeerinfo call)
  • new peer connects, the node's peer count is now num_peers
  • the condition that the node's peer count is num_peers + 1 is never true, test fails

Use the more robust approach of watching for an increased highest peer id instead (again using the getpeerinfo RPC call), with a newly introduced context manager method TestNode.wait_for_new_peer(). Note that for the opposite case of a disconnect, no new method is introduced; this is currently used only once in the test and is also simpler.

Still happy to take suggestions for alternative solutions.

Fixes #29002.

Only relying on the number of peers for detecting a new connection
suffers from race conditions, as unrelated previous peers could
disconnect at anytime in-between. Use the more robust approach of
watching for an increased highest peer id instead (again using the
`getpeerinfo` RPC call), with a newly introduced context manager
method `TestNode.wait_for_new_peer()`.

Fixes bitcoin#29009.
@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 5, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK stratospher, maflcko
Concept ACK kevkevinpal

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@DrahtBot DrahtBot added the Tests label Dec 5, 2023
@kevkevinpal
Copy link
Contributor

Concept ACK 00e0658

Copy link
Contributor

@stratospher stratospher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 00e0658.

cross checked node0's debug log to verify that this getpeerinfo gets called before disconnection is complete.

		2023-12-05T13:26:56.774275Z [net] [net.cpp:1106] [ProcessReceivedKeyBytes] [net] V2 transport error: V1 peer with wrong MessageStart 0a03cf40
		2023-12-05T13:26:56.857216Z [net] [net.cpp:554] [CloseSocketDisconnect] [net] disconnecting peer=5
		2023-12-05T13:26:56.857322Z [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:32828
num_peers 	2023-12-05T13:26:56.857568Z [httpworker.2] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
		2023-12-05T13:26:56.858580Z [net] [logging/timer.h:58] [Log] [lock] Enter: lock contention m_nodes_mutex, net.cpp:1883 started
		2023-12-05T13:26:56.862378Z [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:32828
		2023-12-05T13:26:56.862523Z [net] [logging/timer.h:58] [Log] [lock] Enter: lock contention m_nodes_mutex, net.cpp:1883 completed (3865μs)
num_peers-1     2023-12-05T13:26:56.862680Z [net] [net_processing.cpp:1651] [FinalizeNode] [net] Cleared nodestate for peer=5
num_peers-1	2023-12-05T13:26:56.863412Z [httpworker.3] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
num_peers-1+1   2023-12-05T13:26:56.863771Z [net] [net.cpp:3704] [CNode] [net] Added connection peer=6

possible alternatives i could think of is checking whether the ip address:port combo is present/not present in getpeerinfo. something like:
self.wait_until(lambda: "127.0.0.1:" + str(port) in [p["addrbind"] for p in self.nodes[0].getpeerinfo()])
but i think the current approach is slightly faster.

num_peers = len(self.nodes[0].getpeerinfo())
s.connect(("127.0.0.1", p2p_port(0)))
self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == num_peers + 1)
with self.nodes[0].wait_for_new_peer():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

00e0658: wondering if we need to wait for previous disconnections here since the unrelated disconnections happen only with peers where we directly open the sockets to the node right? (tests above use TestNode) i like how the approach is consistent with the rest of the test though.

Copy link
Contributor Author

@theStack theStack Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, in this case there shouldn't be any previous disconnect issues and hence the current connection-count approach would also work. I still decided to also use wait_for_new_peer here for consistency reasons and it's less code and more readable.

@@ -144,22 +143,23 @@ def run_test(self):
# Check wrong network prefix detection (hits if the next 12 bytes correspond to a v1 version message)
wrong_network_magic_prefix = MAGIC_BYTES["signet"] + V1_PREFIX[4:]
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.connect(("127.0.0.1", p2p_port(0)))
with self.nodes[0].wait_for_new_peer():
s.connect(("127.0.0.1", p2p_port(0)))
with self.nodes[0].assert_debug_log(["V2 transport error: V1 peer with wrong MessageStart"]):
s.sendall(wrong_network_magic_prefix + b"somepayload")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unrelated: Would be good to wait for CNode desctruction after disconnection by checking the getpeerinfo RPC. This is what I meant in #29002 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, it would be good to pass the timeout option to assert_debug_log, so that it is well-documented that the debug log is used to sync.


initial_peer_id = get_highest_peer_id()
yield
wait_until_helper_internal(lambda: get_highest_peer_id() > initial_peer_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unrelated: Could have a self.wait_until to avoid having to pass the timeout factor every time?

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Can't hurt to have this helper, because it also makes code less verbose

@theStack
Copy link
Contributor Author

theStack commented Dec 8, 2023

Thanks for the reviews! I'm leaving the unrelated nits (#29006 (comment) and #29006 (comment)) for a follow-up.

@maflcko
Copy link
Member

maflcko commented Dec 8, 2023

Ok, lgtm ACK 00e0658

@fanquake fanquake merged commit 03042fb into bitcoin:master Dec 8, 2023
16 checks passed
@theStack theStack deleted the 202312-test-fix_v2_transport_intermittent_failure branch December 8, 2023 11:34
@fanquake
Copy link
Member

fanquake commented Dec 8, 2023

Note there was a typo in the commit message. This fixed #29002, not #29009 (which accidently got closed).

fanquake added a commit that referenced this pull request Dec 11, 2023
ea00f98 test: fix intermittent error in rpc_net.py (#29030) (Sebastian Falbesoner)

Pull request description:

  Asserting for the debug log message "Added connection peer=" is insufficient for ensuring that this new connection will show up in a following getpeerinfo() call, as the debug message is written in the CNode ctor, which means it hasn't necessarily been added to CConnman.m_nodes at this point.

  Solve this by using the recently introduced `wait_for_new_peer` helper (see #29006,  commit 00e0658), which is more robust.

  Fixes #29030.

ACKs for top commit:
  maflcko:
    lgtm ACK ea00f98

Tree-SHA512: dda307949a466fb3b24408a8c213d307e0af2155f2e8b4e52c836a22397f9d218bf9d8c54ca55bae62a96d7566f27167db9311dd8801785c327234783af5ed00
kwvg added a commit to kwvg/dash that referenced this pull request Oct 15, 2024
kwvg added a commit to kwvg/dash that referenced this pull request Oct 16, 2024
PastaPastaPasta added a commit to dashpay/dash that referenced this pull request Oct 22, 2024
, bitcoin#28645, bitcoin#28632, bitcoin#28782, bitcoin#28822, bitcoin#29006, bitcoin#29212, merge bitcoin-core/gui#754, partial bitcoin#23443, bitcoin#26448 (BIP324 backports: part 3)

aa5311d merge bitcoin#29212: Fix -netinfo backward compat with getpeerinfo pre-v26 (Kittywhiskers Van Gogh)
1a293c7 merge bitcoin#29006: fix v2 transport intermittent test failure (Kittywhiskers Van Gogh)
d0804d4 merge bitcoin#28822: Add missing wait for version to be sent in add_outbound_p2p_connection (Kittywhiskers Van Gogh)
c0b3062 merge bitcoin#28782: Add missing sync on send_version in peer_connect (Kittywhiskers Van Gogh)
35253cd merge bitcoin#28632: make python p2p not send getaddr on incoming connections (Kittywhiskers Van Gogh)
6a4ca62 merge bitcoin#28645: fix `assert_debug_log` call-site bugs, add type checks (Kittywhiskers Van Gogh)
deaee14 merge bitcoin-core/gui#754: Add BIP324-specific labels to peer details (Kittywhiskers Van Gogh)
fffe6e7 merge bitcoin#27986: remove race in the user-agent reception check (Kittywhiskers Van Gogh)
1bf135b merge bitcoin#26553: Fix intermittent failure in rpc_net.py (Kittywhiskers Van Gogh)
5bf245b partial bitcoin#26448: fix intermittent failure in p2p_sendtxrcncl.py (Kittywhiskers Van Gogh)
b7c0030 partial bitcoin#23443: Erlay support signaling (Kittywhiskers Van Gogh)
c709df7 merge bitcoin#20524: Move MIN_VERSION_SUPPORTED to p2p.py (Kittywhiskers Van Gogh)

Pull request description:

  ## Additional Information

  * Dependent on #6324

  * Dependency for #6330

  ## How Has This Been Tested?

  * Changes to Qt client were validated by running the client

    <details>

    <summary>Screenshot</summary>

    ![Transport reporting in Qt](https://github.com/user-attachments/assets/0d551e19-f3a2-4ce7-83d6-5cb3d03b1765)

    </details>

  * Changes to `dash-cli` were validated by running it with different node versions

    **Against a node built on this PR**

    <details>

    <summary>Screenshot</summary>

    ![getinfo with node running the latest build](https://github.com/user-attachments/assets/8cda68cc-727a-4cf3-a4d8-dd6a33331d78)

    </details>

    **Against a node built running the last release**

    <details>

    <summary>Screenshot</summary>

    ![getinfo with node running the latest release](https://github.com/user-attachments/assets/0c6ff476-7cc9-4297-bae5-35d423aba480)

    </details>

  ## Breaking Changes

  None expected.

  ## Checklist:

  - [x] I have performed a self-review of my own code
  - [x] I have commented my code, particularly in hard-to-understand areas (note: N/A)
  - [x] I have added or updated relevant unit/integration/functional/e2e tests
  - [x] I have made corresponding changes to the documentation (note: N/A)
  - [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
  UdjinM6:
    ACK aa5311d
  PastaPastaPasta:
    utACK aa5311d

Tree-SHA512: 8cca324ac988a73c0590a4e9b318e81ce951ac55fb173cf507fa647cab01ab4981e6a06d4792376b4bfb44ff09d4811de05fadb9ba793dd00b4c7965b4b22654
@bitcoin bitcoin locked and limited conversation to collaborators Dec 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intermittent test failure in p2p_v2_transport
6 participants