Fix the destruction of interruptible token registry #1229
@@ -203,21 +205,25 @@ class interruptible {
   {
     std::lock_guard<std::mutex> guard_get(mutex_);
     // the following constructs an empty shared_ptr if the key does not exist.
-    auto& weak_store = registry_[thread_id];
+    auto& weak_store = (*registry_)[thread_id];
NB: since we know the registry can only be deleted on program exit, accessing it here without checks shouldn't cause any problems.
At least on my machine, yes! Has been running for about ten minutes by now.
Can you think of any way we might be able to add a test for this, just to make sure users don't encounter this issue in the future? I've not tried running our own tests in a loop, but I've also not seen any indication that raft or cuml have suffered from this issue. It makes me wonder why, and whether there's something we can do to force (or at least observe) the behavior so we can test it.
Good point. Perhaps we could run a simple program that uses interruptible tokens in a subprocess, many times in a loop, and check the exit codes?
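The looped-subprocess idea could be sketched roughly as follows. This is only an illustration, not an existing RAFT test: `count_failed_runs` is a hypothetical helper, and it assumes a platform where `std::system` runs the command through a shell and reports a non-zero status when the child exits abnormally (including a crash in the exit handlers).

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of RAFT): runs `command` through the system
// shell `runs` times and returns how many runs reported a non-zero status,
// i.e. a non-zero exit code or termination by a signal such as SIGSEGV.
inline int count_failed_runs(const std::string& command, int runs)
{
  int failures = 0;
  for (int i = 0; i < runs; ++i) {
    // std::system returns an implementation-defined status; 0 means success here.
    if (std::system(command.c_str()) != 0) { ++failures; }
  }
  return failures;
}
```

A test could then call something like `count_failed_runs("./token_exit_demo", 1000)`, where `./token_exit_demo` is a placeholder for a small program that acquires tokens and exits, and require the result to be zero. Because the crash is intermittent, a passing loop lowers the odds of a regression but cannot rule one out.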
@achirkin, I think we can go ahead and merge this in the meantime so we can re-establish the feature, and we can create an issue to revisit it with a test in the future.
Codecov Report
Additional details and impacted files
@@           Coverage Diff            @@
##   branch-23.04   #1229   +/-   ##
===============================================
  Coverage    ?   87.99%
===============================================
  Files       ?   21
  Lines       ?   483
  Branches    ?   0
===============================================
  Hits        ?   425
  Misses      ?   58
  Partials    ?   0
@achirkin I've rerun the dask wheel tests a few times now and it looks like there's a repeatable error in the
…ptible-destruction
…nly access them safe.
@achirkin I still notice this in the
More specifically, I see this in the stack trace:
As explained in rapidsai#1246 (comment), ptxas chokes on the minkowski distance when `VecLen==4` and `IdxT==uint32_t`. This PR removes the veclen == 4 specialization for the minkowski distance.
Follow up to: rapidsai#1239
Authors:
- Allard Hendriksen (https://github.com/ahendriksen)
- Corey J. Nolet (https://github.com/cjnolet)
Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
- Sean Frye (https://github.com/sean-frye)
URL: rapidsai#1254
…ptible-destruction
Marked do-not-merge just to be on the safe side: I haven't been able to confirm that it doesn't crash on ARM yet.
Yes, thanks, that looks like the same issue. I managed to reproduce it, and this PR indeed fixes it. Yet one thing concerns me: what if this crash is caused by something else and is merely exposed here? Like issue #740, which was solved by #764 (not very probable though, because the issue in #1275 is really minimalistic in terms of dependencies).
…ptible-destruction
I've managed to run the tests, including the previously failing
I'm removing the do-not-merge label, since I couldn't find any way to provoke the segfault in the current state of the PR.
@achirkin I'm okay with giving these changes a go. Many of the other downstream projects have been notified that we are going to merge this fix. Worst case is we find another bug and fix it, but the passing tests (and passing MREs) have given me confidence in these changes.
/merge
Because there's no way to control the order of destruction between the global and thread-local static objects, the token registry may sometimes be accessed after it has already been destructed (in the program exit handlers).
This fix wraps the registry in a shared pointer and gives the deleters that cause the problem weak pointers to it, so a deleter never accesses the registry after it has been destroyed.
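The pattern described above can be illustrated with a minimal sketch. It is not the actual RAFT code: `token`, `registry_t`, and this free-function `get_token` are hypothetical stand-ins, and bundling the mutex with the map is an assumption made here so that both share one lifetime.

```cpp
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical stand-in for the interruptible token (not RAFT's actual class).
struct token {
  bool cancelled = false;
};

// Assumption of this sketch: the mutex lives next to the map, so a deleter
// that finds the registry alive also finds a live mutex.
struct registry_t {
  std::mutex mutex;
  std::unordered_map<std::thread::id, std::weak_ptr<token>> map;
};

// Returns the token registered for `id`, creating it on first use.
inline std::shared_ptr<token> get_token(const std::shared_ptr<registry_t>& registry,
                                        std::thread::id id = std::this_thread::get_id())
{
  std::lock_guard<std::mutex> guard(registry->mutex);
  auto& weak_store = registry->map[id];  // default-constructed (empty) if the key is new
  auto sp          = weak_store.lock();
  if (!sp) {
    // The deleter holds only a weak reference, so it cannot keep the registry
    // alive and cannot touch it once it has been destroyed.
    std::weak_ptr<registry_t> weak_registry = registry;
    sp = std::shared_ptr<token>(new token, [weak_registry, id](token* t) {
      if (auto reg = weak_registry.lock()) {
        std::lock_guard<std::mutex> g(reg->mutex);
        auto it = reg->map.find(id);
        // Erase only an expired entry, so a newer token stored under the same id survives.
        if (it != reg->map.end() && it->second.expired()) { reg->map.erase(it); }
      }
      delete t;
    });
    weak_store = sp;
  }
  return sp;
}
```

In this sketch, a token released while the registry is alive removes its own entry, and a token released after the registry has gone simply frees itself. The second case corresponds to the exit-time ordering the PR description refers to.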
Closes #1225
Closes #1275