ENH: Add center/ljust/rjust/zfill ufuncs for unicode and bytes by lysnikolaou · Pull Request #25908 · numpy/numpy

lysnikolaou · 2024-03-01T11:39:16Z

mhvk

Mostly looks good! But some comments too - hopefully enough to fix the failures...

mhvk · 2024-03-01T16:37:53Z

+
+    if a.dtype.char == "T":
+        shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)
+        out = np.empty_like(a, shape=shape)


Not really so relevant here, but I'm actually a bit surprised we cannot just do out=None for StringDType...
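For context, the pre-allocation pattern in the quoted diff can be sketched as below. This is only an illustration under assumptions: it uses fixed-width unicode arrays rather than StringDType, and the array values are made up for the example.

```python
import numpy as np

# Hypothetical inputs: the three operands have different but compatible shapes.
a = np.array(["ab", "cde"])        # shape (2,)
width = np.array([[4], [6]])       # shape (2, 1)
fillchar = np.array("*")           # shape ()

# Compute the broadcast result shape up front, as the wrapper in the diff does.
shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)

# Allocate an output array with the dtype of `a` but the broadcast shape.
out = np.empty_like(a, shape=shape)

print(shape)      # (2, 2)
print(out.dtype)  # same dtype as `a`
```

The review comment asks whether this explicit allocation is needed at all, since a ufunc called with out=None normally allocates the broadcast output itself.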

lysnikolaou · 2024-03-04T16:05:51Z

The remaining test failures are because test_stringdtype.py::test_strip assumes that ljust/rjust will work with StringDType. This all should be fine as soon as @ngoldbaum adds support for StringDType here.

ngoldbaum · 2024-03-05T21:21:59Z

I was hoping to get to this today but I'm running out of steam this afternoon. I'll try to get to this in the next day or two.

ngoldbaum · 2024-03-06T22:33:36Z

+        Buffer<bufferenc> buf(in1, elsize1);
+        Buffer<fillenc> fill(in3, elsize3);
+        Buffer<bufferenc> outbuf(out, outsize);
+        size_t len = string_pad(buf, *(npy_int64 *)in2, *fill, pos, outbuf);


this should be npy_intp len, otherwise the error check below will never succeed.
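The signedness problem can be illustrated from Python with ctypes. This sketch assumes the error check compares the result against a negative sentinel such as -1 (the check itself is not shown in the quoted diff), and it uses c_ssize_t as a stand-in for the signed, pointer-sized npy_intp.

```python
import ctypes

ERROR_SENTINEL = -1  # assumed error return value, not taken from the diff

# Stored in an unsigned size_t, -1 wraps around to a huge positive number.
as_unsigned = ctypes.c_size_t(ERROR_SENTINEL).value

# Stored in a signed, pointer-sized integer, it stays negative.
as_signed = ctypes.c_ssize_t(ERROR_SENTINEL).value

print(as_unsigned < 0)  # False: a "len < 0" check can never fire
print(as_signed < 0)    # True
```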

also what happens if out and in1 are the same? Do you need to allocate a temporary buffer to avoid clobbering the input in that case like in my implementation?
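The aliasing concern can be shown with a small pure-Python model. This is a sketch, not the NumPy implementation: rjust_into and its use_temp flag are hypothetical names, and bytearrays stand in for the input and output buffers.

```python
def rjust_into(src, n, dst, fill, use_temp=True):
    """Right-justify the first n bytes of src into dst, padding with fill."""
    width = len(dst)
    if use_temp and src is dst:
        # Temporary copy so writing the padding cannot overwrite the input.
        src = bytes(src[:n])
    pad = max(width - n, 0)
    for i in range(pad):
        dst[i] = fill
    for i in range(min(n, width)):
        dst[pad + i] = src[i]


# In-place call with the temporary copy: the input survives.
safe = bytearray(b"ab\x00\x00")
rjust_into(safe, 2, safe, ord("*"))

# In-place call without it: the padding overwrites "ab" before it is copied.
clobbered = bytearray(b"ab\x00\x00")
rjust_into(clobbered, 2, clobbered, ord("*"), use_temp=False)

print(bytes(safe))       # b'**ab'
print(bytes(clobbered))  # b'****'
```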

ngoldbaum · 2024-03-06T22:39:32Z

I edited the PR description so merging this will close the issue about the original string being returned instead of truncating.

ngoldbaum · 2024-03-07T00:18:34Z

I'm working on this but just hit the bug that is fixed by #25944, caused by the use of np.empty_like in the np.strings wrappers to create the output array. I think that will need to be merged first before the tests pass over here. I'll push what I have so far so it can be reviewed, but it's not done yet.

ngoldbaum · 2024-03-08T20:11:54Z

The last commit finishes the UTF-8 implementation. I decided not to support bytestring operands for now, to avoid dealing with the issue you highlighted in the docstrings. I think for unicode and bytes we should either detect that case and error, or force people not to mix dtypes like that, instead of leaving the footgun behind. I didn't fix the issues I brought up a few days ago.
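One way to "detect that case and error" is to compare dtype kinds before dispatching. The helper below is a hypothetical sketch: check_same_string_kind and its error message are invented for illustration and are not the check the PR adds.

```python
import numpy as np


def check_same_string_kind(a, fillchar):
    """Raise TypeError if `a` and `fillchar` mix unicode ('U') and bytes ('S')."""
    a = np.asanyarray(a)
    fillchar = np.asanyarray(fillchar)
    if a.dtype.kind != fillchar.dtype.kind:
        raise TypeError(
            f"cannot mix string kinds {a.dtype.kind!r} and {fillchar.dtype.kind!r}"
        )
    return a, fillchar


# Matching kinds pass through unchanged.
arr, fill = check_same_string_kind(np.array(["ab", "cde"]), "*")

# Mixing unicode data with a bytes fill character is rejected.
try:
    check_same_string_kind(np.array(["ab", "cde"]), b"*")
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```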

Looking at test_strings.py, there doesn't yet seem to be much testing of direct use of out= to force an in-place operation and I think there might be some bugs in in-place operations where input strings get clobbered because we're not allocating a temporary array.

ngoldbaum · 2024-03-08T20:47:11Z

Looks like there’s some breakage on 32-bit systems. There’s some questionable use of size_t that is the likely culprit; I will push a fix on Monday.

lysnikolaou · 2024-03-11T13:54:50Z

> I think for unicode and bytes we should either detect that case and error

Done.

> Looking at test_strings.py, there doesn't yet seem to be much testing of direct use of out= to force an in-place operation and I think there might be some bugs in in-place operations where input strings get clobbered because we're not allocating a temporary array.

Well, given that most of the ufuncs are wrapped in strings.py and thus do not even support out being passed, I think we're okay. The only ufunc that's not wrapped and returns a string is add and some manual tests showed that it's okay. Should we maybe add tests for that only? If so, shall we do it in another PR?

mhvk

Smaller things only, though I must admit my review was not all that thorough...

On possibly messing up with out=in, one could perhaps just error in the ufunc if that's the case.
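Erroring on out=in could look roughly like the following at the Python level. This is a sketch under assumptions: reject_aliased_out is a hypothetical helper, and the suggestion in the thread is for a check inside the ufunc itself rather than in a wrapper.

```python
import numpy as np


def reject_aliased_out(a, out):
    """Raise if `out` overlaps the memory of the input array `a`."""
    if out is not None and np.shares_memory(a, out):
        raise ValueError("out must not overlap the input array")


a = np.array(["ab", "cd", "ef"])

reject_aliased_out(a, None)              # no output given: fine
reject_aliased_out(a, np.empty_like(a))  # separate buffer: fine

try:
    reject_aliased_out(a, a[:2])         # a view of the input: rejected
    aliased_rejected = False
except ValueError:
    aliased_rejected = True
```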

mhvk · 2024-03-11T15:15:22Z

+
+    Examples
+    --------
+>>> a = np.array(['aAaAaA', '  aA  ', 'abBABba'])


Indentation error.

mhvk · 2024-03-11T15:24:27Z

+
+    if (eq_res != 1) {
+        PyErr_SetString(PyExc_TypeError,
+                        "Can only text justification operations with equal"


"Can only do text jus..."

mhvk · 2024-03-11T15:26:17Z

+    npy_string_allocator *oallocator = allocators[3];
+
+    JUSTPOSITION pos = *(JUSTPOSITION *)(context->method->static_data);
+    const char* ufunc_name = NULL;


Why not just ufunc_name = ((PyUFuncObject *)context->caller)->name, as done elsewhere?

Oops, missed that from your refactors

mhvk · 2024-03-11T15:32:02Z

+static const char* LJUST_NAME = "ljust";
+static const char* RJUST_NAME = "rjust";
+
+template <ENCODING enc>


Why is this a template? Isn't the encoding always utf8?

good point, dropped the unnecessary template

mhvk · 2024-03-11T15:32:39Z

+                buf = (char *)os.buf;
+            }
+
+            Buffer<enc> outbuf(buf, newsize);


Isn't this always Buffer<ENCODING::UTF8>?

mhvk · 2024-03-11T15:35:57Z

+    shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)
+    if a.dtype.char == "T":
+        out = np.empty_like(a, shape=shape)
+        fillchar = fillchar.astype(a.dtype)


Add copy=False, in case fillchar is already OK.
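The effect of copy=False can be checked directly. In this small sketch the array contents are made up; the point is only that astype returns the input array itself when no conversion is needed.

```python
import numpy as np

fillchar = np.array(["*", "-"])

# With copy=False, astype returns the very same array if the dtype already matches.
same = fillchar.astype(fillchar.dtype, copy=False)

# The default (copy=True) always allocates a new array.
copied = fillchar.astype(fillchar.dtype)

print(same is fillchar)    # True
print(copied is fillchar)  # False
```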

mhvk · 2024-03-11T15:36:43Z

+
+    shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)
+    if a.dtype.char == "T":
+        out = np.empty_like(a, shape=shape)


I think I wondered about this before, but logically the ufunc machinery should do this already for StringDType.

p.s. Ah, yes, I wondered below, in https://github.com/numpy/numpy/pull/25908/files#r1509243153

mhvk · 2024-03-11T15:40:02Z

-        fillchar = np._utils.asbytes(fillchar)
-    if isinstance(a_arr.dtype, np.dtypes.StringDType):
-        res_dtype = a_arr.dtype
+    a = np.asanyarray(a)


Looking at the third copy of this, it does seem one could refactor this to just have one dispatching function. But no big deal.

mhvk · 2024-03-11T15:41:15Z


+    FILL_ERROR = "The fill character must be exactly one character long"
+
+    def test_center_raises_multiple_character_fill(self, dt):


parametrize on function tested?
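Parametrizing on the function under test might look like the sketch below. To keep it self-contained it uses Python's built-in str methods as stand-ins for the NumPy functions, which is an assumption for illustration; the test name is hypothetical too.

```python
import pytest

FILL_ERROR = "The fill character must be exactly one character long"


@pytest.mark.parametrize("method", ["center", "ljust", "rjust"])
def test_raises_multiple_character_fill(method):
    # Stand-in for the function under test; the real test would look up
    # the corresponding NumPy function instead of a str method.
    func = getattr(str, method)
    with pytest.raises(TypeError, match=FILL_ERROR):
        func("abc", 10, "**")
```

One test body then covers all three functions, replacing the separate test_center_raises_..., test_ljust_raises_... style copies.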

ngoldbaum · 2024-03-11T17:18:08Z

> Well, given that most of the ufuncs are wrapped in strings.py and thus do not even support out being passed, I think we're okay.

Ah good point. We should also come back to exposing in-place operations though since it's a good way to save memory, particularly for fixed-width strings.

I could delete the in-place code paths in the stringdtype implementations, but I'll go ahead and leave them in with an expectation that we'll support in-place operations in the future. Does that make sense to you?

> Should we maybe add tests for that only? If so, shall we do it in another PR?

Yup, let's circle back to this in another issue. The stringdtype implementation already handles the in-place case so feel free to look at that for inspiration.

ngoldbaum · 2024-03-11T17:20:30Z

The last push responds to review comments and in particular simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of out is unnecessary there. I also did it for a few of the other wrappers we've already implemented.

I also made a minor change to the setup for fillchar, which should hopefully avoid the errors on 32-bit systems. I'm not totally sure what was happening there. It didn't have to do with my usage of size_t; instead it seemed to come from dtype inference on a Python string ending up as a bytestring.

mhvk · 2024-03-11T17:29:09Z

> simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of out is unnecessary there.

Thanks. In principle, some other things like np.asanyarray(a) are not necessary for StringDType either -- those are much closer to just being able to use the ufunc machinery -- but none of that is urgent.

lysnikolaou · 2024-03-11T17:53:59Z

> I'll go ahead and leave them in with an expectation that we'll support in-place operations in the future. Does that make sense to you?

Sounds good, yes.

> The last push responds to review comments and in particular simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of out is unnecessary there. I also did it for a few of the other wrappers we've already implemented.

Thanks for doing this. Minimizing the wrappers is an improvement.

ngoldbaum · 2024-03-12T17:57:44Z

This got reviewed by @mhvk so I'm going to merge this. I'll do a followup PR to clean up the docstrings to indicate stringdtype is supported.

charris · 2024-03-12T18:08:28Z

@ngoldbaum This could also use a release note.

charris · 2024-03-15T23:39:50Z

@ngoldbaum Should this be backported? Other backported PRs are failing because the functions are missing.

charris · 2024-03-16T01:35:31Z

This is a mess to backport due to the merge; it depends on other PRs that aren't backported. The easy thing to do is leave it to NumPy 2.1.

ngoldbaum · 2024-03-16T02:17:45Z

This is a new feature that didn’t make it in time for 2.0 and shouldn’t be backported. It’s ok to keep this for 2.1.

ngoldbaum · 2024-03-16T02:18:42Z

Which backported PR does this conflict with? Happy to help out with resolving conflicts.

charris · 2024-03-16T03:47:41Z

@ngoldbaum Looks like you found it already :)

lysnikolaou requested review from mhvk and ngoldbaum March 1, 2024 11:39

github-actions Bot added the 01 - Enhancement label Mar 1, 2024

ENH: Add center/ljust/rjust/zfill ufuncs for unicode and bytes

cb0d7cd

lysnikolaou force-pushed the string-ufuncs-center-ljust-rjust branch from 559330e to cb0d7cd Compare March 1, 2024 11:41