-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Possible data race in accessing scalars #40069
Comments
Can you show your stack trace? |
Arrow is built from commit 3c655df |
Thank you. It looks like calling |
@pitrou Just a thought, seems like the information to fill into the scalar's scratch space is totally self-contained, should we do the filling by a (virtual) member function of the scalar class, e.g. (But this may bring ABI issues, which I'm not sure how is dealt with in arrow.) |
@bkietz Would you be able to take a look at this? |
@zanmato1984 I'm not sure about the answer to the rest of your question, but for the record Arrow C++ doesn't promise any ABI stability. We do try to maintain API compatibility (with the occasional deprecation or signature change), but people are expected to recompile when switching from one version of Arrow C++ to another. |
Thank you for the info. After some attempts to draft a fix, I found something worth discussion. First I was able to reproduce the TSAN error using a simple UT. Then I tried to do a plane fix proposed by @pitrou , that is using relaxed atomic stores. Given that I'm not aware of how to apply atomic operations on plane memory in C++, I'll have to change the definition of the scratch space into several atomic variables. However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space. This makes "defining the scratch space as a fixed number of atomic variables in the class" not feasible. Now I'm drafting the alternative way, which is using extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory), which is basically what I talked about in my previous comment. It'll come later. |
But all threads accessing the scratch space should interpret it as the same type, right?
Please let's avoid home-grown synchronization like this. Either we can use atomic accesses without spinning, or we should use a lock. |
Right. The question here is that the scratch space is defined in a generic place, i.e.,
Yeah, of course. I was thinking about using mutex too. |
It would be incorrect for different threads to interpret it differently, even with a lock. |
You are right about this too. I think here are two things:
For 1, if the answer is no, then it should fall into some sort of undefined behavior area. That will be a user code problem that is using scalar the wrong way. It is not essential whether this is a race, or how the synchronization is implemented. And hopefully, and fortunately, I think current compute and Acero components are using scalar the "right way". So we don't really need to worry about it. For 2, this is what we are really talking about. Aren't we :) |
Right, so I think we can simply use atomic stores here. |
Are you suggesting something like this? |
Yes, roughly. |
OK. I'll move on to add some more tests and get the PR ready. PS: in my previous several comments I didn't recall that there is such a way to operate plane memory atomically so there might have been some misunderstanding. My apology and really appreciate your patient :) |
…ion (#40237) ### Rationale for this change As #40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - #40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: #40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
Issue resolved by pull request 40237 |
…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…ion (#40237) ### Rationale for this change As #40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - #40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: #40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
I use an
arrow::compute::literal
in my parquet reading pipeline and get ThreadSanitizer reports about this code:arrow/cpp/src/arrow/array/data.cc
Lines 288 to 290 in c23a097
If I am reading it right, this results in issuing write instructions during what would seem to be read-only accesses, and this indeed looks racy.
Component(s)
C++, Parquet
The text was updated successfully, but these errors were encountered: