Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Possible data race in accessing scalars #40069

Closed
mpimenov opened this issue Feb 13, 2024 · 16 comments
Closed

[C++] Possible data race in accessing scalars #40069

mpimenov opened this issue Feb 13, 2024 · 16 comments
Assignees
Milestone

Comments

@mpimenov
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

I use an arrow::compute::literal in my parquet reading pipeline and get ThreadSanitizer reports about this code:

auto* offsets = reinterpret_cast<offset_type*>(scratch_space);
offsets[0] = 0;
offsets[1] = static_cast<offset_type>(value_size);

If I am reading it right, this results in issuing write instructions during what would seem to be read-only accesses, and this indeed looks racy.

Component(s)

C++, Parquet

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

Can you show your stack trace?

@mpimenov
Copy link
Author

Arrow is built from commit 3c655df
tsan-40069.txt

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

Thank you. It looks like calling ArraySpan::FillFromScalar from several threads is legitimate when the same scalar is used as input for different computations. Given that this is probably always writing the same value, it would probably be enough to use relaxed atomic stores.

@bkietz

@zanmato1984
Copy link
Contributor

Thank you. It looks like calling ArraySpan::FillFromScalar from several threads is legitimate when the same scalar is used as input for different computations. Given that this is probably always writing the same value, it would probably be enough to use relaxed atomic stores.

@bkietz

@pitrou Just a thought, seems like the information to fill into the scalar's scratch space is totally self-contained, should we do the filling by a (virtual) member function of the scalar class, e.g. virtual BufferSpan Scalar::FillScrach(...)? And we can probably use an acquire-release atomic flag to mark if the scratch has already been filled. This might be more semantically clear imho.

(But this may bring ABI issues, which I'm not sure how is dealt with in arrow.)

@pitrou
Copy link
Member

pitrou commented Feb 26, 2024

@bkietz Would you be able to take a look at this?

@pitrou
Copy link
Member

pitrou commented Feb 26, 2024

@zanmato1984 I'm not sure about the answer to the rest of your question, but for the record Arrow C++ doesn't promise any ABI stability. We do try to maintain API compatibility (with the occasional deprecation or signature change), but people are expected to recompile when switching from one version of Arrow C++ to another.

@zanmato1984
Copy link
Contributor

@zanmato1984 I'm not sure about the answer to the rest of your question, but for the record Arrow C++ doesn't promise any ABI stability. We do try to maintain API compatibility (with the occasional deprecation or signature change), but people are expected to recompile when switching from one version of Arrow C++ to another.

Thank you for the info.

After some attempts to draft a fix, I found something worth discussion.

First I was able to reproduce the TSAN error using a simple UT. Then I tried to do a plane fix proposed by @pitrou , that is using relaxed atomic stores. Given that I'm not aware of how to apply atomic operations on plane memory in C++, I'll have to change the definition of the scratch space into several atomic variables. However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space. This makes "defining the scratch space as a fixed number of atomic variables in the class" not feasible.

Now I'm drafting the alternative way, which is using extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory), which is basically what I talked about in my previous comment. It'll come later.

@pitrou
Copy link
Member

pitrou commented Feb 26, 2024

However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space.

But all threads accessing the scratch space should interpret it as the same type, right?

extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory)

Please let's avoid home-grown synchronization like this. Either we can use atomic accesses without spinning, or we should use a lock.

@zanmato1984
Copy link
Contributor

However, the interpretation of the scratch space is type-dependent, i.e., some types treat it as two 32-byte space and others treat it as two 64-byte space.

But all threads accessing the scratch space should interpret it as the same type, right?

Right. The question here is that the scratch space is defined in a generic place, i.e., ArraySpanFillFromScalarScratchSpace (base class of all "scratched" scalars) and we couldn't know how it is going to be interpreted at runtime, despite whether all threads interpret it as the same type or not.

extra atomic flags + spinning (keeping scratch space itself unchanged - as plane memory)

Please let's avoid home-grown synchronization like this. Either we can use atomic accesses without spinning, or we should use a lock.

Yeah, of course. I was thinking about using mutex too.

@pitrou
Copy link
Member

pitrou commented Feb 26, 2024

Right. The question here is that the scratch space is defined in a generic place, i.e., ArraySpanFillFromScalarScratchSpace (base class of all "scratched" scalars) and we couldn't know how it is going to be interpreted at runtime, despite whether all threads interpret it as the same type or not.

It would be incorrect for different threads to interpret it differently, even with a lock.

@zanmato1984
Copy link
Contributor

Right. The question here is that the scratch space is defined in a generic place, i.e., ArraySpanFillFromScalarScratchSpace (base class of all "scratched" scalars) and we couldn't know how it is going to be interpreted at runtime, despite whether all threads interpret it as the same type or not.

It would be incorrect for different threads to interpret it differently, even with a lock.

You are right about this too.

I think here are two things:

  1. Is the scratch being interpreted identically by different threads?
  2. If 1 is yes, is it a race?

For 1, if the answer is no, then it should fall into some sort of undefined behavior area. That will be a user code problem that is using scalar the wrong way. It is not essential whether this is a race, or how the synchronization is implemented. And hopefully, and fortunately, I think current compute and Acero components are using scalar the "right way". So we don't really need to worry about it.

For 2, this is what we are really talking about. Aren't we :)

@pitrou
Copy link
Member

pitrou commented Feb 26, 2024

Right, so I think we can simply use atomic stores here.
In debug mode, we can perhaps be a bit more sophisticated and check that we don't write two different values, but that's optional IMHO.

@zanmato1984
Copy link
Contributor

Right, so I think we can simply use atomic stores here. In debug mode, we can perhaps be a bit more sophisticated and check that we don't write two different values, but that's optional IMHO.

Are you suggesting something like this?
https://github.com/apache/arrow/pull/40237/files#diff-1d848527e898cd9da6a41bdbd68cab5d5826986128851220d20d1fdac9752d8bR289-R292

@pitrou
Copy link
Member

pitrou commented Feb 26, 2024

Yes, roughly.

@zanmato1984
Copy link
Contributor

Yes, roughly.

OK. I'll move on to add some more tests and get the PR ready.

PS: in my previous several comments I didn't recall that there is such a way to operate plane memory atomically so there might have been some misunderstanding. My apology and really appreciate your patient :)

bkietz pushed a commit that referenced this issue Apr 23, 2024
…ion (#40237)

### Rationale for this change

As #40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - #40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: #40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 16.1.0 milestone Apr 23, 2024
@bkietz
Copy link
Member

bkietz commented Apr 23, 2024

Issue resolved by pull request 40237
#40237

@bkietz bkietz closed this as completed Apr 23, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this issue Apr 24, 2024
…alization (apache#40237)

### Rationale for this change

As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: apache#40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
raulcd pushed a commit that referenced this issue Apr 29, 2024
…ion (#40237)

### Rationale for this change

As #40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - #40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: #40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…alization (apache#40237)

### Rationale for this change

As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: apache#40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
…alization (apache#40237)

### Rationale for this change

As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: apache#40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…alization (apache#40237)

### Rationale for this change

As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: apache#40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…alization (apache#40237)

### Rationale for this change

As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: apache#40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@amoeba amoeba added the Breaking Change Includes a breaking change to the API label May 19, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…alization (apache#40237)

### Rationale for this change

As apache#40069 shows, TSAN reports data race that is caused by concurrent filling the scratch space of a scalar instance. The concurrent use of the same scalar could be parallel executing an acero plan containing a literal (a "constant" that is simply represented by an underlying scalar), and this is totally legit. The problem lies in the fact that the scratch space of the scalar is filled "lazily" by the time when it is being involved in the computation and transformed to an array span, for *every* thread.

After piloting several approaches (relaxed atomic - an earlier version of this PR, locking - apache#40260), @ pitrou and @ bkietz suggested an immutable-after-initialization approach, for which the latest version of this PR is.

### What changes are included in this PR?

There are generally two parts in this PR:
1. Mandate the initialization of the scratch space in constructor of the concrete subclass of `Scalar`.
2. In order to keep the content of the scratch space consistent with the underlying `value` of the scalar, make the `value` constant. This effectively makes legacy code that directly assigning to the `value` invalid, which is refactored accordingly:
  2.1 `BoxScalar` in https://github.com/apache/arrow/pull/40237/files#diff-08d11e02c001c82b1aa89565e16760a8bcca4a608c22619fb45da42fd0ebebac
  2.2 `Scalar::CastTo` in https://github.com/apache/arrow/pull/40237/files#diff-b4b83682450006616fa7e4f6f2ea3031cf1a22d734f4bee42a99af313e808f9e
  2.3 `ScalarMinMax` in https://github.com/apache/arrow/pull/40237/files#diff-368ab7e748bd4432c92d9fdc26b51e131742b968e3eb32a6fcea4b9f02fa36aa

Besides, when refactoring 2.2, I found the current `Scalar::CastTo` is not fully covered by the existing tests. So I also added some lacking ones.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
The `value` member of `BaseBinaryScalar` and subclasses/`BaseListScalar` and subclasses/`SparseUnionScalar`/`DenseUnionScalar`/`RunEndEncodedScalar` is made constant, thus code directly assigning to this member will no more compile.

Also the `Scalar::mutable_data()` member function is removed because it's against the immutable nature of `Scalar`.

However the impact of these changes seems limited. I don't think much user code is depending on these two old pieces of code.

Also after an quick search, I didn't find any document that need to be updated according to this change. There could be none. But if there is, may someone please redirect me to it so I can update. Thanks.

* GitHub Issue: apache#40069

Lead-authored-by: Ruoxi Sun <[email protected]>
Co-authored-by: Rossi Sun <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants