ENH: Add center/ljust/rjust/zfill ufuncs for unicode and bytes#25908
Conversation
559330e to
cb0d7cd
Compare
mhvk
left a comment
There was a problem hiding this comment.
Mostly looks good! But some comments too - hopefully enough to fix the failures...
|
|
||
| if a.dtype.char == "T": | ||
| shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape) | ||
| out = np.empty_like(a, shape=shape) |
There was a problem hiding this comment.
Not really so relevant here, but I'm actually a bit surprised we cannot just do out=None for StringDType...
|
The remaining test failures are because |
|
I was hoping to get to this today but I'm running out of steam this afternoon. I'll try to get to this in the next day or two. |
| Buffer<bufferenc> buf(in1, elsize1); | ||
| Buffer<fillenc> fill(in3, elsize3); | ||
| Buffer<bufferenc> outbuf(out, outsize); | ||
| size_t len = string_pad(buf, *(npy_int64 *)in2, *fill, pos, outbuf); |
There was a problem hiding this comment.
this should be npy_intp len, otherwise the error check below will never succeed.
There was a problem hiding this comment.
also what happens if out and in1 are the same? Do you need to allocate a temporary buffer to avoid clobbering the input in that case like in my implementation?
|
I edited the PR description so merging this will close the issue about the original string being returned instead of truncating. |
|
I'm working on this but actually just hit the bug that is fixed by #25944 due to the use of |
|
The last commit finishes UTF-8 implementation. I decided to not support bytestring operands for now to not deal with the issue you highlighted in the docstrings. I think for unicode and bytes we should either detect that case and error or force people to not mix dtypes like that instead of leaving the footgun behind. I didn't fix the issues I brought up a few days ago. Looking at |
|
Looks like there’s some breakage on 32 bit systems - there’s some questionable use of |
Done.
Well, given that most of the ufuncs are wrapped in |
mhvk
left a comment
There was a problem hiding this comment.
Smaller things only, though I must admit my review was not all that thorough...
On possibly messing up with out=in, one could perhaps just error in the ufunc if that's the case.
|
|
||
| Examples | ||
| -------- | ||
| >>> a = np.array(['aAaAaA', ' aA ', 'abBABba']) |
|
|
||
| if (eq_res != 1) { | ||
| PyErr_SetString(PyExc_TypeError, | ||
| "Can only text justification operations with equal" |
| npy_string_allocator *oallocator = allocators[3]; | ||
|
|
||
| JUSTPOSITION pos = *(JUSTPOSITION *)(context->method->static_data); | ||
| const char* ufunc_name = NULL; |
There was a problem hiding this comment.
Why not just ufunc_name = ((PyUFuncObject *)context->caller)->name, as done elsewhere?
There was a problem hiding this comment.
Oops, missed that from your refactors
| static const char* LJUST_NAME = "ljust"; | ||
| static const char* RJUST_NAME = "rjust"; | ||
|
|
||
| template <ENCODING enc> |
There was a problem hiding this comment.
Why is this a template? Isn't the encoding always utf8?
There was a problem hiding this comment.
good point, dropped the unnecessary template
| buf = (char *)os.buf; | ||
| } | ||
|
|
||
| Buffer<enc> outbuf(buf, newsize); |
There was a problem hiding this comment.
Isn't this always Buffer<ENCODING::UTF8>?
| shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape) | ||
| if a.dtype.char == "T": | ||
| out = np.empty_like(a, shape=shape) | ||
| fillchar = fillchar.astype(a.dtype) |
There was a problem hiding this comment.
Add copy=False, in case fillchar is already OK.
|
|
||
| shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape) | ||
| if a.dtype.char == "T": | ||
| out = np.empty_like(a, shape=shape) |
There was a problem hiding this comment.
I think I wondered about this before, but logically the ufunc machinery should do this already for StringDType.
p.s. Ah, yes, I wondered below, in https://github.com/numpy/numpy/pull/25908/files#r1509243153
| fillchar = np._utils.asbytes(fillchar) | ||
| if isinstance(a_arr.dtype, np.dtypes.StringDType): | ||
| res_dtype = a_arr.dtype | ||
| a = np.asanyarray(a) |
There was a problem hiding this comment.
Looking at the third copy of this, it does seem one could refactor this to just have one dispatching function. But no big deal.
|
|
||
| FILL_ERROR = "The fill character must be exactly one character long" | ||
|
|
||
| def test_center_raises_multiple_character_fill(self, dt): |
There was a problem hiding this comment.
parametrize on function tested?
Ah good point. We should also come back to exposing in-place operations though since it's a good way to save memory, particularly for fixed-width strings. I could delete the in-place code paths in the stringdtype implementations, but I'll go ahead and leave them in with an expectation that we'll support in-place operations in the future. Does that make sense to you?
Yup, let's circle back to this in another issue. The stringdtype implementation already handles the in-place case so feel free to look at that for inspiration. |
|
The last push responds to review comments and in particular simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of I also did a minor change to the setup for |
Thanks. In principle, some other things like |
Sounds good, yes.
Thanks for doing this. Minimizing the wrappers is an improvement. |
|
This got reviewed by @mhvk so I'm going to merge this. I'll do a followup PR to clean up the docstrings to indicate stringdtype is supported. |
|
@ngoldbaum This could also use a release note. |
|
@ngoldbaum Should this be backported? Other backported PRs are failing because the functions are missing. |
|
This is mess to backport due to the merge, it depends on other PRs that aren't backported. The easy thing to do is leave it to NumPy 2.1. |
|
This is a new feature that didn’t make it in time for 2.0 and shouldn’t be backported. It’s ok to keep this for 2.1. |
|
Which backported PR does this conflict with? Happy to help out with resolving conflicts. |
|
@ngoldbaum Looks like you found it already :) |
fixes #25659