Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for deadlock issue 1980 #2121

Merged
merged 2 commits into from
Mar 11, 2021

Conversation

iwanders
Copy link
Contributor

We encountered the issue sketched in #1980 in a production system last week.

To understand the issue at hand I've made a minimal-non-working example here. Looking at the source code it seems that the callback queue is actually at fault, not the timer code. The removeByID deadlocks when a callback by that id is currently being executed.

The first commit introduces a unit test that demonstrates this deadlock in a unit test.

The second commit is the proposed fix for this problem. We try to obtain the lock, if we fail to obtain the lock we defer removal until the next callback queue cycle using the already existing marked_for_removal boolean.

FYI @guillaumeautran , @mikepurvis , @jasonimercer

This unit test fails until we can remove callbacks that are being executed.
This fixes a potential deadlock when a timer is being removed that's also being executed.
Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me. I also confirmed that this PR fixes the deadlock problem introduced here.

@iwanders
Copy link
Contributor Author

iwanders commented Mar 8, 2021

@fujitatomoya , is there anything else I can do to help get this issue merged and resolved?

@fujitatomoya
Copy link
Contributor

@jacobperron @mjcarroll @sloretz friendly ping.

Copy link
Contributor

@jacobperron jacobperron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me, thanks!

Also, thank you for the reproduction and unit test 🙇

@jacobperron jacobperron merged commit 70d4b5b into ros:noetic-devel Mar 11, 2021
jacobperron pushed a commit that referenced this pull request Apr 6, 2021
* Add unit test for removing a callback that's being executed.

This unit test fails until we can remove callbacks that are being executed.

* Use the marked_for_removal flag if the callback is being executed.

This fixes a potential deadlock when a timer is being removed that's also being executed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants