Skip to content

Fix race conditions in HostConnection that happen during shutdown#112

Merged
dkropachev merged 4 commits intoscylladb:masterfrom
k0machi:iss111-lock-races
Oct 18, 2021
Merged

Fix race conditions in HostConnection that happen during shutdown#112
dkropachev merged 4 commits intoscylladb:masterfrom
k0machi:iss111-lock-races

Conversation

@k0machi
Copy link
Copy Markdown

@k0machi k0machi commented Oct 17, 2021

Previously, locking used in HostConnection would just check if was
shutdown or to set the is_shutdown attribute. This leads to a problem
where Thread A acquires lock, checks the flag, releases lock, checks
branch condition for opening a new shard connection, gets scheduled away,
while Thread B starts shutdown - acquires lock, sets flag, releases lock, starts
closing connections and gets scheduled away - leading to a race where once
Thread A resumes, it grows the dictionary being viewed by Thread B with a new
connection - even though it should not do so anymore. This commit
addresses this by moving any operation on the _connections under a lock

Also, this commit adds an internal list to track scheduled futures for the thread executor - preventing shard connections from proceeding if connection pool was shut down

Fixes #111

Trello

@k0machi k0machi requested review from fruch and ultrabug October 17, 2021 14:33
@k0machi
Copy link
Copy Markdown
Author

k0machi commented Oct 17, 2021

This PR should fix #111, here's the full log of the problem:

< t:2021-10-13 11:46:53,944 f:common.py       l:1267 c:cassandra.pool       p:DEBUG > Shutting down connections to 10.0.0.241:9042
< t:2021-10-13 11:46:53,945 f:thread.py       l:52   c:cassandra.pool       p:DEBUG > New connection created to shard_id=7 on host 10.0.0.241:9042
< t:2021-10-13 11:46:53,946 f:thread.py       l:52   c:cassandra.pool       p:DEBUG > Connected to 5/14 shards on host 10.0.0.241:9042 (9 missing)
< t:2021-10-13 11:46:53,945 f:common.py       l:1267 c:cassandra.pool       p:DEBUG > Closing connection (140221857249408) to 10.0.0.241:9042
< t:2021-10-13 11:46:53,946 f:libevreactor.py l:287  c:cassandra.io.libevreactor p:DEBUG > Closing connection (140221857249408) to 10.0.0.241:9042
< t:2021-10-13 11:46:53,946 f:libevreactor.py l:291  c:cassandra.io.libevreactor p:DEBUG > Closed socket to 10.0.0.241:9042
< t:2021-10-13 11:46:53,948 f:libevreactor.py l:287  c:cassandra.io.libevreactor p:DEBUG > Closing connection (140221857129712) to 10.0.0.241:9042
< t:2021-10-13 11:46:53,949 f:libevreactor.py l:291  c:cassandra.io.libevreactor p:DEBUG > Closed socket to 10.0.0.241:9042
< t:2021-10-13 11:46:53,949 f:thread.py       l:52   c:cassandra.pool       p:DEBUG > New connection created to shard_id=1 on host 10.0.2.63:9042
< t:2021-10-13 11:46:53,949 f:thread.py       l:52   c:cassandra.pool       p:DEBUG > Connected to 2/14 shards on host 10.0.2.63:9042 (12 missing)
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <function Nemesis.disrupt_add_drop_column at 0x7f8848334ca0> < t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <function Nemesis.disrupt_add_drop_column at 0x7f8848334ca0>
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2864, in wrapper
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     result = method(*args, **kwargs)
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1520, in disrupt_add_drop_column
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     self._add_drop_column_run_in_cycle()
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1351, in _add_drop_column_run_in_cycle
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     self._add_drop_column()
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1317, in _add_drop_column
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     self._add_drop_column_target_table = self._add_drop_column_get_target_table(
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1242, in _add_drop_column_get_target_table
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     current_tables = self._get_all_tables_with_no_compact_storage(self._add_drop_column_tables_to_ignore)
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1238, in _get_all_tables_with_no_compact_storage
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     output[ks] = tables
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1267, in __exit__
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >     self.cluster.shutdown()
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "cassandra/cluster.py", line 1768, in cassandra.cluster.Cluster.shutdown
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "cassandra/cluster.py", line 3156, in cassandra.cluster.Session.shutdown
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR >   File "cassandra/pool.py", line 546, in cassandra.pool.HostConnection.shutdown
< t:2021-10-13 11:46:53,948 f:nemesis.py      l:2875 c:sdcm.nemesis         p:ERROR > RuntimeError: dictionary changed size during iteration

You can see the "Shutting down connections to %s" line from shutdown() and it happens just slightly before _open_connection_to_missing_shard resumes and opens a new shard connection

Forgot to mention in the commit message, this problem appears when cluster connection opens and closes very quickly - as it happened in this case.

@k0machi k0machi requested review from bentsi, haaawk, piodul, psarna and roydahan and removed request for roydahan October 17, 2021 15:13
@k0machi k0machi force-pushed the iss111-lock-races branch 2 times, most recently from d87a031 to 4beb587 Compare October 17, 2021 17:12
@k0machi
Copy link
Copy Markdown
Author

k0machi commented Oct 18, 2021

Made a quick repro, fails on master almost every time: https://gist.github.com/6d0065db359fe8a5134851265d9b559e

Log:

2021-10-18 13:13:07,377 - INFO     - __main__  : Starting...
2021-10-18 13:13:07,378 - DEBUG    - cassandra.pool: Initializing connection for host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,379 - DEBUG    - cassandra.pool: First connection created to <MagicMock id='140090488006800'> for shard_id=0
2021-10-18 13:13:07,379 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (0,), kwargs: {}
2021-10-18 13:13:07,379 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (1,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (2,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (3,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (4,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (5,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (6,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (7,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (8,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (9,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (10,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (11,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (12,), kwargs: {}
2021-10-18 13:13:07,380 - INFO     - __main__  : Scheduling <bound method HostConnection._open_connection_to_missing_shard of <cassandra.pool.HostConnection object at 0x7f695bca90f0>> with args: (13,), kwargs: {}
2021-10-18 13:13:07,380 - DEBUG    - cassandra.pool: Finished initializing connection for host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,380 - INFO     - __main__  : Initialized pool <cassandra.pool.HostConnection object at 0x7f695bca90f0>
2021-10-18 13:13:07,381 - INFO     - __main__  : Connections: {0: <MagicMock id='140090488327856'>}
2021-10-18 13:13:07,880 - INFO     - __main__  : Future start: ()
2021-10-18 13:13:07,882 - DEBUG    - cassandra.pool: New connection created to shard_id=1 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,884 - DEBUG    - cassandra.pool: Connected to 2/14 shards on host <MagicMock id='140090488006800'> (12 missing)
2021-10-18 13:13:07,885 - DEBUG    - cassandra.pool: New connection created to shard_id=2 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,885 - DEBUG    - cassandra.pool: Connected to 3/14 shards on host <MagicMock id='140090488006800'> (11 missing)
2021-10-18 13:13:07,886 - DEBUG    - cassandra.pool: New connection created to shard_id=3 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,886 - DEBUG    - cassandra.pool: Connected to 4/14 shards on host <MagicMock id='140090488006800'> (10 missing)
2021-10-18 13:13:07,887 - DEBUG    - cassandra.pool: New connection created to shard_id=4 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,888 - DEBUG    - cassandra.pool: Connected to 5/14 shards on host <MagicMock id='140090488006800'> (9 missing)
2021-10-18 13:13:07,889 - DEBUG    - cassandra.pool: New connection created to shard_id=5 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,889 - DEBUG    - cassandra.pool: Connected to 6/14 shards on host <MagicMock id='140090488006800'> (8 missing)
2021-10-18 13:13:07,890 - DEBUG    - cassandra.pool: New connection created to shard_id=6 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,890 - DEBUG    - cassandra.pool: Connected to 7/14 shards on host <MagicMock id='140090488006800'> (7 missing)
2021-10-18 13:13:07,891 - DEBUG    - cassandra.pool: New connection created to shard_id=7 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,891 - DEBUG    - cassandra.pool: Connected to 8/14 shards on host <MagicMock id='140090488006800'> (6 missing)
2021-10-18 13:13:07,892 - DEBUG    - cassandra.pool: New connection created to shard_id=8 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,892 - DEBUG    - cassandra.pool: Connected to 9/14 shards on host <MagicMock id='140090488006800'> (5 missing)
2021-10-18 13:13:07,893 - DEBUG    - cassandra.pool: New connection created to shard_id=9 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,893 - DEBUG    - cassandra.pool: Connected to 10/14 shards on host <MagicMock id='140090488006800'> (4 missing)
2021-10-18 13:13:07,894 - DEBUG    - cassandra.pool: New connection created to shard_id=10 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,894 - DEBUG    - cassandra.pool: Connected to 11/14 shards on host <MagicMock id='140090488006800'> (3 missing)
2021-10-18 13:13:07,895 - DEBUG    - cassandra.pool: New connection created to shard_id=11 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,895 - DEBUG    - cassandra.pool: Connected to 12/14 shards on host <MagicMock id='140090488006800'> (2 missing)
2021-10-18 13:13:07,896 - DEBUG    - cassandra.pool: New connection created to shard_id=12 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,896 - DEBUG    - cassandra.pool: Connected to 13/14 shards on host <MagicMock id='140090488006800'> (1 missing)
2021-10-18 13:13:07,897 - DEBUG    - cassandra.pool: New connection created to shard_id=13 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,897 - DEBUG    - cassandra.pool: Connected to 14/14 shards on host <MagicMock id='140090488006800'> (0 missing)
2021-10-18 13:13:07,898 - DEBUG    - cassandra.pool: Shutting down connections to <MagicMock id='140090488006800'>
2021-10-18 13:13:07,898 - INFO     - __main__  : Future start: ()
2021-10-18 13:13:07,898 - DEBUG    - cassandra.pool: New connection created to shard_id=14 on host <MagicMock id='140090488006800'>
2021-10-18 13:13:07,898 - DEBUG    - cassandra.pool: Closing connection (140090488327856) to <MagicMock id='140090488006800'>
Traceback (most recent call last):
  File "/home/komachi/development/python-driver/./race-repro.py", line 71, in <module>
    main()
  File "/home/komachi/development/python-driver/./race-repro.py", line 67, in main
2021-10-18 13:13:07,899 - DEBUG    - cassandra.pool: Connected to 15/14 shards on host <MagicMock id='140090488006800'> (-1 missing)
    pool.shutdown()
  File "/home/komachi/development/python-driver/cassandra/pool.py", line 546, in shutdown
    for c in self._connections.values():
RuntimeError: dictionary changed size during iteration

@k0machi k0machi added bug Something isn't working test-build If this tag, do all build stages labels Oct 18, 2021
@k0machi k0machi added the integration-tests If this tag, this PR won't run integration tests label Oct 18, 2021
@dkropachev
Copy link
Copy Markdown
Collaborator

dkropachev commented Oct 18, 2021

@k0machi, please add lock to the following routines: _set_keyspace_for_all_conns, open_count, get_state

@k0machi k0machi requested review from dkropachev and removed request for ultrabug October 18, 2021 13:04
@k0machi k0machi force-pushed the iss111-lock-races branch 2 times, most recently from a2147a8 to ed0bcab Compare October 18, 2021 16:57
Previously, locking used in HostConnection would just check if was
shutdown or to set the is_shutdown attribute. This leads to a problem
where Thread A acquires lock, checks the flag, releases lock, checks
branch condition for opening a new shard connection, gets scheduled away,
while Thread B starts shutdown - acquires lock, sets flag, releases lock, starts
closing connections and gets scheduled away - leading to a race where once
Thread A resumes, it grows the dictionary being viewed by Thread B with a new
connection - even though it should not do so anymore. This commit
addresses this by moving any operation on the _connections under a lock
@dkropachev dkropachev merged commit 64242af into scylladb:master Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working integration-tests If this tag, this PR won't run integration tests test-build If this tag, do all build stages

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition when closing connections during shutdown sequence

3 participants