[C++] Possible data race in accessing scalars #40069

mpimenov · 2024-02-13T18:54:31Z

Describe the bug, including details regarding any error messages, version, and platform.

I use an arrow::compute::literal in my parquet reading pipeline and get ThreadSanitizer reports about this code:

Lines 288 to 290 in c23a097

    
           auto* offsets = reinterpret_cast<offset_type*>(scratch_space); 
        
           offsets[0] = 0; 
        
           offsets[1] = static_cast<offset_type>(value_size);

If I am reading it right, this results in issuing write instructions during what would seem to be read-only accesses, and this indeed looks racy.

Component(s)

C++, Parquet

The text was updated successfully, but these errors were encountered:

pitrou · 2024-02-14T10:27:59Z

Can you show your stack trace?

mpimenov · 2024-02-14T16:47:34Z

Arrow is built from commit 3c655df
tsan-40069.txt

pitrou · 2024-02-14T17:33:46Z

Thank you. It looks like calling ArraySpan::FillFromScalar from several threads is legitimate when the same scalar is used as input for different computations. Given that this is probably always writing the same value, it would probably be enough to use relaxed atomic stores.

@bkietz

zanmato1984 · 2024-02-26T08:38:23Z

Thank you. It looks like calling ArraySpan::FillFromScalar from several threads is legitimate when the same scalar is used as input for different computations. Given that this is probably always writing the same value, it would probably be enough to use relaxed atomic stores.

@bkietz

@pitrou Just a thought, seems like the information to fill into the scalar's scratch space is totally self-contained, should we do the filling by a (virtual) member function of the scalar class, e.g. virtual BufferSpan Scalar::FillScrach(...)? And we can probably use an acquire-release atomic flag to mark if the scratch has already been filled. This might be more semantically clear imho.

(But this may bring ABI issues, which I'm not sure how is dealt with in arrow.)

pitrou · 2024-02-26T11:05:50Z

@bkietz Would you be able to take a look at this?

pitrou · 2024-02-26T11:07:08Z

@zanmato1984 I'm not sure about the answer to the rest of your question, but for the record Arrow C++ doesn't promise any ABI stability. We do try to maintain API compatibility (with the occasional deprecation or signature change), but people are expected to recompile when switching from one version of Arrow C++ to another.

zanmato1984 · 2024-02-26T13:08:20Z

@zanmato1984 I'm not sure about the answer to the rest of your question, but for the record Arrow C++ doesn't promise any ABI stability. We do try to maintain API compatibility (with the occasional deprecation or signature change), but people are expected to recompile when switching from one version of Arrow C++ to another.

Thank you for the info.

After some attempts to draft a fix, I found something worth discussion.

First I was able to reproduce the TSAN error using a simple UT. Then I tried to do a plane fix proposed by @pitrou , that is using relaxed atomic stores. Given that I'm not aware of how to apply atomic operations on plane memory in C++, I'll have to change the definition of the scratch space into several atomic variables. However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space. This makes "defining the scratch space as a fixed number of atomic variables in the class" not feasible.

Now I'm drafting the alternative way, which is using extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory), which is basically what I talked about in my previous comment. It'll come later.

pitrou · 2024-02-26T13:09:41Z

However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space.

But all threads accessing the scratch space should interpret it as the same type, right?

extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory)

Please let's avoid home-grown synchronization like this. Either we can use atomic accesses without spinning, or we should use a lock.

zanmato1984 · 2024-02-26T13:15:28Z

However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space.

But all threads accessing the scratch space should interpret it as the same type, right?

Right. The question here is that the scratch space is defined in a generic place, i.e., ArraySpanFillFromScalarScratchSpace (base class of all "scratched" scalars) and we couldn't know how it is going to be interpreted at runtime, despite whether all threads interpret it as the same type or not.

extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory)

Please let's avoid home-grown synchronization like this. Either we can use atomic accesses without spinning, or we should use a lock.

Yeah, of course. I was thinking about using mutex too.

pitrou · 2024-02-26T13:19:28Z

Right. The question here is that the scratch space is defined in a generic place, i.e., ArraySpanFillFromScalarScratchSpace (base class of all "scratched" scalars) and we couldn't know how it is going to be interpreted at runtime, despite whether all threads interpret it as the same type or not.

It would be incorrect for different threads to interpret it differently, even with a lock.

zanmato1984 · 2024-02-26T13:43:04Z

Right. The question here is that the scratch space is defined in a generic place, i.e., ArraySpanFillFromScalarScratchSpace (base class of all "scratched" scalars) and we couldn't know how it is going to be interpreted at runtime, despite whether all threads interpret it as the same type or not.

It would be incorrect for different threads to interpret it differently, even with a lock.

You are right about this too.

I think here are two things:

Is the scratch being interpreted identically by different threads?
If 1 is yes, is it a race?

For 1, if the answer is no, then it should fall into some sort of undefined behavior area. That will be a user code problem that is using scalar the wrong way. It is not essential whether this is a race, or how the synchronization is implemented. And hopefully, and fortunately, I think current compute and Acero components are using scalar the "right way". So we don't really need to worry about it.

For 2, this is what we are really talking about. Aren't we :)

pitrou · 2024-02-26T15:17:57Z

Right, so I think we can simply use atomic stores here.
In debug mode, we can perhaps be a bit more sophisticated and check that we don't write two different values, but that's optional IMHO.

zanmato1984 · 2024-02-26T15:59:27Z

Right, so I think we can simply use atomic stores here. In debug mode, we can perhaps be a bit more sophisticated and check that we don't write two different values, but that's optional IMHO.

Are you suggesting something like this?
https://github.com/apache/arrow/pull/40237/files#diff-1d848527e898cd9da6a41bdbd68cab5d5826986128851220d20d1fdac9752d8bR289-R292

pitrou · 2024-02-26T16:05:17Z

Yes, roughly.

zanmato1984 · 2024-02-26T16:11:48Z

Yes, roughly.

OK. I'll move on to add some more tests and get the PR ready.

PS: in my previous several comments I didn't recall that there is such a way to operate plane memory atomically so there might have been some misunderstanding. My apology and really appreciate your patient :)

…ion (#40237) ### Rationale for this change As #40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - #40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: #40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>

bkietz · 2024-04-23T16:06:45Z

Issue resolved by pull request 40237
#40237

…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>

…ion (#40237) ### Rationale for this change As #40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - #40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: #40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>

…alization (apache#40237) ### Rationale for this change As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread. After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is. ### What changes are included in this PR? There are generally two parts in this PR: 1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`. 2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly: 2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac 2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e 2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones. ### Are these changes tested? Yes. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile. Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`. However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code. Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks. * GitHub Issue: apache#40069 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>

mpimenov added the Type: bug label Feb 13, 2024

github-actions bot added Component: Parquet Component: C++ labels Feb 13, 2024

mpimenov mentioned this issue Feb 13, 2024

[C++] Segmentation fault in hash-join/swiss-join #39951

Closed

github-actions bot mentioned this issue Feb 26, 2024

GH-40069: [C++] Make scalar scratch space immutable after initialization #40237

Merged

github-actions bot assigned zanmato1984 Feb 26, 2024

github-actions bot mentioned this issue Feb 27, 2024

GH-40069: [C++] Use DCLP to prevent race conditions in reading/writing scalar scratch space #40260

Closed

bkietz added this to the 16.1.0 milestone Apr 23, 2024

bkietz closed this as completed Apr 23, 2024

amoeba added the Breaking Change Includes a breaking change to the API label May 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Possible data race in accessing scalars #40069

[C++] Possible data race in accessing scalars #40069

mpimenov commented Feb 13, 2024

pitrou commented Feb 14, 2024

mpimenov commented Feb 14, 2024

pitrou commented Feb 14, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024 •

edited

Loading

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

bkietz commented Apr 23, 2024

[C++] Possible data race in accessing scalars #40069

[C++] Possible data race in accessing scalars #40069

Comments

mpimenov commented Feb 13, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

pitrou commented Feb 14, 2024

mpimenov commented Feb 14, 2024

pitrou commented Feb 14, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024 • edited Loading

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

pitrou commented Feb 26, 2024

zanmato1984 commented Feb 26, 2024

bkietz commented Apr 23, 2024

pitrou commented Feb 26, 2024 •

edited

Loading