
#1608 revisited: Fix subscription busy wait melodic #1684

Merged

Conversation

@cwecht
Contributor

cwecht commented Apr 2, 2019

This replaces #1608. It fixes the same issue without breaking API/ABI.

@cwecht
Contributor Author

cwecht commented Apr 2, 2019

@ros-pull-request-builder retest this please

2 similar comments
@cwecht
Contributor Author

cwecht commented Apr 2, 2019

@ros-pull-request-builder retest this please

@cwecht
Contributor Author

cwecht commented Apr 3, 2019

@ros-pull-request-builder retest this please

@cwecht
Contributor Author

cwecht commented Apr 5, 2019

It seems that BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC has to be defined in all places where boost::condition_variable is used in a project. Otherwise this results in an ODR violation: different, non-identical definitions of some member functions of condition_variable. This is what causes the failing tests when BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC is defined in callback_queue.cpp only.

If the macro is defined in the test as well, the test introduced by this PR succeeds, but other tests start to fail. It seems that the macro can cause problems with `boost::this_thread::sleep`.

Anyway, the ODR violations are a potentially serious issue and should be avoided. #1651 might help here.
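
To illustrate the failure mode, here is a minimal sketch (not the actual ros_comm code) of how such a configuration macro produces conflicting definitions:

// tu_a.cpp -- compiled WITH the macro
#define BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC
#include <boost/thread/condition_variable.hpp>
// Inline member functions of boost::condition_variable are now
// compiled against a pthread condattr configured for CLOCK_MONOTONIC.

// tu_b.cpp -- compiled WITHOUT the macro
#include <boost/thread/condition_variable.hpp>
// The same inline member functions are compiled against the default
// (system clock) configuration.

// Linking tu_a.o and tu_b.o into one binary yields two non-identical
// definitions of the same inline functions. The linker silently keeps
// one of them, so which configuration "wins" depends on link order --
// which is why the tests fail in surprising ways.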

@peci1
Contributor

peci1 commented Apr 26, 2019

Hey Christopher. Thanks very much for developing this further! Are you with Open Robotics, or did you start working on it on your own? =)

I can't believe the solution is really so simple, but I was never good at multithreaded code :)

Is there anything I can help with?

@cwecht
Contributor Author

cwecht commented Apr 29, 2019

You're welcome, Martin! I just have some time to spend after finishing my master's degree.

This PR is currently blocked by #1651, so I don't see much need to do anything here. We need to get #1651 merged first.

@peci1
Contributor

peci1 commented Apr 29, 2019

Congrats on finishing your studies ;) Well, I don't think there is a direct dependency on #1651 (or is there?), but it's true that these two changes should be tested together, since there might be some hidden problems.

@cwecht
Contributor Author

cwecht commented Apr 29, 2019

Well, this PR as is should not be merged. I had to disable the monotonic clock to make the tests pass. But the monotonic clock is rather important for avoiding issues like #1558. Therefore I strongly recommend waiting for #1651 to be merged and then updating this PR.

@peci1
Contributor

peci1 commented Apr 29, 2019

Okay. Let me know if you need any help.

#ifndef __APPLE__
#define BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC
#endif

Contributor

Your patch works fine for me even without #1651.

But why was this section removed independently of https://github.com/ros/ros_comm/pull/1651/files#diff-5f391de72dc7aa5abcad51904f99a29a? With Boost <1.61 (e.g. in Ubuntu Xenial), the preprocessor check in callback_queue.cpp:43 now triggers an error. For all Boost versions <1.67, condition variables use the non-monotonic system clock by default, with undesired behavior in case of jumps in system time, but condition_variable::wait_for and condition_variable::wait_until will still do the right thing in both cases.
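
To make the last point concrete, here is a generic sketch (plain Boost usage, not ros_comm code) of why such timed waits stay correct even when the condition variable internally uses the jumpable system clock: the steady-clock deadline is re-checked after every wakeup, so a time jump can at worst cause an early spurious wakeup, never a missed deadline.

#include <boost/thread/condition_variable.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/chrono.hpp>

// Wait until `ready` becomes true or a steady-clock deadline passes.
bool wait_until_steady(boost::condition_variable& cv,
                       boost::unique_lock<boost::mutex>& lock,
                       boost::chrono::steady_clock::time_point deadline,
                       const bool& ready)
{
  while (!ready)
  {
    // Even if cv waits on the system clock internally, a clock jump
    // only wakes us early; the steady-clock re-check keeps the
    // overall wait correct.
    if (cv.wait_until(lock, deadline) == boost::cv_status::timeout &&
        boost::chrono::steady_clock::now() >= deadline)
    {
      return ready;
    }
  }
  return true;
}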

@cwecht
Contributor Author

cwecht commented Jun 18, 2019

Removing these lines makes the tests pass. The problem is that with these lines the definition of condition_variable is different in callback_queue and in the tests. This ODR violation leads to the test failures.

@dirk-thomas
Member

With #1878 merged, please rebase this patch to check the tests with both patches together.

@cwecht
Contributor Author

cwecht commented Feb 15, 2020

Ok, it gets even more complicated now. First of all: the new tests work if ros_comm is built in release mode. In debug mode there are ODR problems with test_roscpp. If BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC is set for test_roscpp, the new tests work, but all tests using boost::posix_time run forever (or at least very long). It turns out that boost::posix_time relies on the shared/static-library part of Boost.Thread, which is shipped precompiled. In these precompiled parts, a function of boost::posix_time is implemented using boost::condition_variable, which was compiled without BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC set. This again seems to cause ODR problems.

We could ban boost::posix_time from ros_comm to fix this. Everything would still work for the usual user who uses the prebuilt packages. But people who build ros_comm in debug mode, build their stuff against this build, AND use boost::posix_time will encounter serious issues.

@meyerj
Contributor

meyerj commented Feb 21, 2020

We could ban boost::posix_time from ros_comm to fix this. Everything would still work for the usual user who uses the prebuilt packages. But people who build ros_comm in debug mode, build their stuff against this build, AND use boost::posix_time will encounter serious issues.

+1 for eliminating boost::posix_time and using boost::chrono instead.
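
For illustration, the migration is mostly mechanical; a hedged example of corresponding calls (not taken from any actual patch):

#include <boost/thread/thread.hpp>
#include <boost/chrono.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>

void sleep_a_bit()
{
  // posix_time style -- per the observation above, this ties into the
  // precompiled part of Boost.Thread:
  boost::this_thread::sleep(boost::posix_time::milliseconds(100));

  // boost::chrono style -- the suggested replacement, same effect:
  boost::this_thread::sleep_for(boost::chrono::milliseconds(100));
}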

As suggested in #1878 (comment), compiling roscpp with hidden default visibility might be an option to get rid of the ODR violations due to conflicting definitions of boost::condition_variable::condition_variable() and its member functions, but strictly speaking that would break the ABI. Something to consider for Noetic?

Unfortunately, switching to std::condition_variable would not solve the problem either, at least not before GCC 10.

Summary: the pthread C API does allow using condition variables with a CLOCK_MONOTONIC internal clock, but neither boost::condition_variable before version 1.67 nor current versions of libstdc++ can deal with it without hassle.

Would it be an option to simply not define -DBOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC anymore? Then ros::SteadyTimer is not really steady and is affected by system time jumps (e.g. after initial NTP synchronization) if the underlying boost::condition_variable default implementation is. But that is also the case for most other software that uses boost::condition_variable or std::condition_variable with timeouts. Forward jumps of the system time (which are much more likely than backward jumps) are also not an issue if spurious wakeups are handled correctly, as in TimerManager<T, D, E>::threadFunc().
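
A minimal sketch of that wakeup-loop pattern (the real TimerManager<T, D, E>::threadFunc() is considerably more involved; all names below are hypothetical stand-ins):

#include <boost/thread/condition_variable.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/chrono.hpp>
#include <functional>
#include <queue>
#include <vector>

typedef boost::chrono::steady_clock::time_point TimePoint;

boost::mutex g_mutex;
boost::condition_variable g_cond;
std::priority_queue<TimePoint, std::vector<TimePoint>,
                    std::greater<TimePoint> > g_deadlines;  // earliest first
bool g_running = true;

void timer_thread()
{
  boost::unique_lock<boost::mutex> lock(g_mutex);
  while (g_running)
  {
    if (g_deadlines.empty())
    {
      g_cond.wait(lock);  // woken when a timer is added or on shutdown
      continue;
    }
    TimePoint deadline = g_deadlines.top();
    if (boost::chrono::steady_clock::now() < deadline)
    {
      // A forward system-time jump merely wakes this wait early; the
      // re-check above puts us back to sleep, so no timer fires too
      // early and none is lost.
      g_cond.wait_until(lock, deadline);
      continue;  // may be spurious or a new, earlier timer
    }
    g_deadlines.pop();
    // ... fire the due timer's callback here ...
  }
}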

@peci1
Contributor

peci1 commented Jun 28, 2020

Any more thoughts on this? It'd be nice to have this solved in Melodic.

@dirk-thomas
Member

@ros-pull-request-builder retest this please

@dirk-thomas
Member

@cwecht @peci1 I am fine with merging this patch. But I wanted to double-check how it relates to #1932 and whether both are trying to address the same problem. If so, is merging one of the patches sufficient, and if that is the case, which one should be preferred and why?

@peci1
Contributor

peci1 commented Jul 28, 2020

This PR solves a different issue than #1932. This PR tackles the issue that if callbacks take too long, a multithreaded spinner busy-waits on all threads because of a missing lock. PR #1932 solves only a problem with steady timers.

However, these PRs are slightly related in CMakeLists.txt. This PR adds a workaround for the non-monotonic steady timers, while PR #1932 solves that issue directly, rendering the workaround from this PR unnecessary. So I'd suggest merging #1932 first, adapting this PR to use the fixes from #1932, and merging this PR once all tests are green.
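
For readers following along, a schematic sketch of the busy-wait symptom (shape only; the actual roscpp internals differ):

#include <ros/callback_queue.h>
#include <ros/ros.h>

// Each thread of a multithreaded spinner essentially runs this loop.
void spinner_thread(ros::CallbackQueue& queue)
{
  while (ros::ok())
  {
    // Before the fix: if the only queued callback was already being
    // executed by another thread, this call returned immediately
    // instead of blocking for the timeout, so idle threads spun at
    // 100% CPU. With the fix they sleep on a condition variable until
    // work arrives or the timeout expires.
    queue.callAvailable(ros::WallDuration(0.1));
  }
}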

@dirk-thomas
Member

@peci1 Thanks for the feedback. I have merged #1932. Please update this PR accordingly (@cwecht?).

I am planning to release a new version of this repo into Melodic later this week and it would be great if this patch could be merged in time. 🙏

@peci1
Contributor

peci1 commented Jul 30, 2020

I locally tested a rebase of this PR onto melodic-devel with the CMakeLists.txt changes left out, and all test_roscpp tests are green.

@cwecht do you have time today to make the change? If not, I can send another PR so that the change can be accepted before the next Melodic release... But creating another (third? fourth?) PR for this change would just add to the mess...

These are the results of intra_suite (I tested on WSL, so real performance would probably be even higher).

----------------------------------------------------------
Throughput Test 0: receiver_threads [1], sender_threads [1], streams [1], test_duration [1.0000000000], message_size [100]
        Messages Sent: 87445
        Messages Received: 87367 (99.9108010750%)
        Bytes Sent: 9444060
        Bytes Received: 9435636
        Bytes Per Second: 9435793 (8.9986734390 MB/s)
----------------------------------------------------------
Throughput Test 1: receiver_threads [1], sender_threads [1], streams [1], test_duration [1.0000000000], message_size [10485760]
        Messages Sent: 81892
        Messages Received: 81841 (99.9377228545%)
        Bytes Sent: 858700513056
        Bytes Received: 858165738888
        Bytes Per Second: 858170373008 (818415.0438385010 MB/s)
----------------------------------------------------------
Throughput Test 2: receiver_threads [1], sender_threads [1], streams [1], test_duration [1.0000000000], message_size [104857600]
        Messages Sent: 85396
        Messages Received: 85351 (99.9473043234%)
        Bytes Sent: 8954420292768
        Bytes Received: 8949701700408
        Bytes Per Second: 8949730339545 (8535127.9635858536 MB/s)
----------------------------------------------------------
Throughput Test 3: receiver_threads [1], sender_threads [1], streams [1], test_duration [10.0000000000], message_size [100]
        Messages Sent: 861589
        Messages Received: 861540 (99.9943128336%)
        Bytes Sent: 93051612
        Bytes Received: 93046320
        Bytes Per Second: 9304633 (8.8735895157 MB/s)
----------------------------------------------------------
Throughput Test 4: receiver_threads [1], sender_threads [1], streams [1], test_duration [10.0000000000], message_size [10485760]
        Messages Sent: 993794
        Messages Received: 993741 (99.9946669028%)
        Bytes Sent: 10420693323792
        Bytes Received: 10420137578088
        Bytes Per Second: 1042014299656 (993742.2749099731 MB/s)
----------------------------------------------------------
Throughput Test 5: receiver_threads [1], sender_threads [1], streams [1], test_duration [10.0000000000], message_size [104857600]
        Messages Sent: 896544
        Messages Received: 896497 (99.9947576471%)
        Bytes Sent: 94009459306752
        Bytes Received: 94004530999176
        Bytes Per Second: 9400456578086 (8964974.0010128021 MB/s)
----------------------------------------------------------
Multi-Threaded Latency Test 0: receiver_threads [1], sender_threads [1], streams [1], count_per_stream [100000], message_size [1]
        Message Count: 100000
        Latency Average: 0.0000490027
        Latency Min: 0.0000321865
        Latency Max: 0.0004410744
----------------------------------------------------------
Multi-Threaded Latency Test 1: receiver_threads [1], sender_threads [1], streams [1], count_per_stream [10000], message_size [1024]
        Message Count: 10000
        Latency Average: 0.0000496288
        Latency Min: 0.0000460148
        Latency Max: 0.0004994869
----------------------------------------------------------
Multi-Threaded Latency Test 2: receiver_threads [1], sender_threads [1], streams [1], count_per_stream [1000], message_size [1048576]
        Message Count: 1000
        Latency Average: 0.0000521352
        Latency Min: 0.0000464916
        Latency Max: 0.0003809929
----------------------------------------------------------
Multi-Threaded Latency Test 3: receiver_threads [1], sender_threads [1], streams [1], count_per_stream [100], message_size [104857600]
        Message Count: 100
        Latency Average: 0.0000617075
        Latency Min: 0.0000467300
        Latency Max: 0.0005068779
----------------------------------------------------------
Single-Threaded Latency Test 0
        Message Count: 10000
        Latency Average: 0.0000201961
        Latency Min: 0.0000169277
        Latency Max: 0.0002233982
----------------------------------------------------------
Single-Threaded Latency Test 1
        Message Count: 100000
        Latency Average: 0.0000204621
        Latency Min: 0.0000169277
        Latency Max: 0.0002276897
----------------------------------------------------------
Single-Threaded Latency Test 2
        Message Count: 1000000
        Latency Average: 0.0000202981
        Latency Min: 0.0000166893
        Latency Max: 0.0004115105

@cwecht cwecht force-pushed the fix_subscription_busy_wait-melodic branch from 749246e to 1ae22ec on July 30, 2020 17:46
@dirk-thomas
Member

@cwecht Is the pull request ready for review / merge?

@cwecht
Contributor Author

cwecht commented Jul 30, 2020

I've just rebased this branch. From my side, yes, it is ready for review. But I have not kept track of the most recent changes in ros_comm, so maybe @meyerj, for example, should have a final look at this as well.

@dirk-thomas
Member

I plan to merge this and make a release on Friday morning (PDT)...

@meyerj
Contributor

meyerj left a comment

Sorry, I don't have time right now for a full review, or even to test the patch. But I can confirm that #1932 does not replace this patch, nor does this patch replace #1932. They solve different problems with similar symptoms.

The block that conditionally includes the non-existent "boost_161_condition_variable.h" header can simply be removed.
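
For orientation, the block in question presumably has roughly this shape (reconstructed from the comment above, not copied from the patch):

#if BOOST_VERSION < 106100
// This header does not exist anywhere in the tree, hence the removal:
#include "boost_161_condition_variable.h"
#else
#include <boost/thread/condition_variable.hpp>
#endif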

Are there plans to backport those recent patches to kinetic-devel or forward port to noetic-devel?

Review comment on clients/roscpp/include/ros/callback_queue.h (outdated; resolved)
@dirk-thomas
Member

Are there plans to backport those recent patches to kinetic-devel or forward port to noetic-devel?

I don't plan to backport these changes to kinetic-devel. I thought Kinetic wasn't affected by the high CPU usage?

I also don't plan to forward port any changes to Noetic. If the change would be necessary in Noetic it should have landed there in the first place.

That being said, if anything needs to be applied to other branches, please create PRs for it. Thanks.

@meyerj
Contributor

meyerj commented Jul 31, 2020

@cwecht @dirk-thomas Please also consider my proposal to simplify callOne(timeout) in cwecht#1, based on this PR. I had already forgotten about it. Just rebased it.

I have not tested the patch recently, nor after #1878 and #1932, but at first glance I still think it would be a cleaner and shorter implementation, and we have been using it in our internal builds of ROS Melodic for more than a year now.

@dirk-thomas dirk-thomas merged commit 3bca6e3 into ros:melodic-devel Jul 31, 2020
@dirk-thomas
Member

Please also consider my proposal to simplify callOne(timeout) in cwecht#1

I moved the change to #2014 to run CI and be able to merge it.
