Skip to content

ENH: Add center/ljust/rjust/zfill ufuncs for unicode and bytes#25908

Merged
ngoldbaum merged 16 commits into
numpy:mainfrom
lysnikolaou:string-ufuncs-center-ljust-rjust
Mar 12, 2024
Merged

ENH: Add center/ljust/rjust/zfill ufuncs for unicode and bytes#25908
ngoldbaum merged 16 commits into
numpy:mainfrom
lysnikolaou:string-ufuncs-center-ljust-rjust

Conversation

@lysnikolaou

@lysnikolaou lysnikolaou commented Mar 1, 2024

Copy link
Copy Markdown
Member

fixes #25659

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-center-ljust-rjust branch from 559330e to cb0d7cd Compare March 1, 2024 11:41

@mhvk mhvk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good! But some comments too - hopefully enough to fix the failures...

Comment thread numpy/_core/code_generators/ufunc_docstrings.py Outdated
Comment thread numpy/_core/code_generators/ufunc_docstrings.py Outdated
Comment thread numpy/_core/code_generators/ufunc_docstrings.py Outdated
Comment thread numpy/_core/src/umath/string_buffer.h
Comment thread numpy/_core/src/umath/string_buffer.h
Comment thread numpy/_core/src/umath/string_ufuncs.cpp
Comment thread numpy/_core/strings.py Outdated
Comment thread numpy/_core/strings.py Outdated

if a.dtype.char == "T":
shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)
out = np.empty_like(a, shape=shape)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really so relevant here, but I'm actually a bit surprised we cannot just do out=None for StringDType...

Comment thread numpy/_core/strings.py Outdated
Comment thread numpy/_core/tests/test_strings.py
Comment thread numpy/_core/code_generators/ufunc_docstrings.py
Comment thread numpy/_core/code_generators/ufunc_docstrings.py Outdated
Comment thread numpy/_core/code_generators/ufunc_docstrings.py Outdated
Comment thread numpy/_core/code_generators/ufunc_docstrings.py Outdated
@lysnikolaou

Copy link
Copy Markdown
Member Author

The remaining test failures are because test_stringdtype.py::test_strip assumes that ljust/rjust will work with StringDType. This all should be fine as soon as @ngoldbaum adds support for StringDType here.

@ngoldbaum

Copy link
Copy Markdown
Member

I was hoping to get to this today but I'm running out of steam this afternoon. I'll try to get to this in the next day or two.

Comment thread numpy/_core/strings.py Outdated
Comment thread numpy/_core/src/umath/string_ufuncs.cpp Outdated
Buffer<bufferenc> buf(in1, elsize1);
Buffer<fillenc> fill(in3, elsize3);
Buffer<bufferenc> outbuf(out, outsize);
size_t len = string_pad(buf, *(npy_int64 *)in2, *fill, pos, outbuf);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be npy_intp len, otherwise the error check below will never succeed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also what happens if out and in1 are the same? Do you need to allocate a temporary buffer to avoid clobbering the input in that case like in my implementation?

@ngoldbaum

Copy link
Copy Markdown
Member

I edited the PR description so merging this will close the issue about the original string being returned instead of truncating.

@ngoldbaum

Copy link
Copy Markdown
Member

I'm working on this but actually just hit the bug that is fixed by #25944 due to the use of np.empty_like in the wrappers in np.strings to create the output array. I think that will need to be merged first before the tests pass over here. I'll push what I have so far so that can be reviewed but not done yet.

@ngoldbaum

Copy link
Copy Markdown
Member

The last commit finishes UTF-8 implementation. I decided to not support bytestring operands for now to not deal with the issue you highlighted in the docstrings. I think for unicode and bytes we should either detect that case and error or force people to not mix dtypes like that instead of leaving the footgun behind. I didn't fix the issues I brought up a few days ago.

Looking at test_strings.py, there doesn't yet seem to be much testing of direct use of out= to force an in-place operation and I think there might be some bugs in in-place operations where input strings get clobbered because we're not allocating a temporary array.

@ngoldbaum

Copy link
Copy Markdown
Member

Looks like there’s some breakage on 32 bit systems - there’s some questionable use of size_t that is the likely culprit, will push a fix on monday.

@lysnikolaou

Copy link
Copy Markdown
Member Author

I think for unicode and bytes we should either detect that case and error

Done.

Looking at test_strings.py, there doesn't yet seem to be much testing of direct use of out= to force an in-place operation and I think there might be some bugs in in-place operations where input strings get clobbered because we're not allocating a temporary array.

Well, given that most of the ufuncs are wrapped in strings.py and thus do not even support out being passed, I think we're okay. The only ufunc that's not wrapped and returns a string is add and some manual tests showed that it's okay. Should we maybe add tests for that only? If so, shall we do it in another PR?

@mhvk mhvk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smaller things only, though I must admit my review was not all that thorough...

On possibly messing up with out=in, one could perhaps just error in the ufunc if that's the case.


Examples
--------
>>> a = np.array(['aAaAaA', ' aA ', 'abBABba'])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation error.


if (eq_res != 1) {
PyErr_SetString(PyExc_TypeError,
"Can only text justification operations with equal"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Can only do text jus..."

npy_string_allocator *oallocator = allocators[3];

JUSTPOSITION pos = *(JUSTPOSITION *)(context->method->static_data);
const char* ufunc_name = NULL;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just ufunc_name = ((PyUFuncObject *)context->caller)->name, as done elsewhere?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, missed that from your refactors

static const char* LJUST_NAME = "ljust";
static const char* RJUST_NAME = "rjust";

template <ENCODING enc>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a template? Isn't the encoding always utf8?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, dropped the unnecessary template

buf = (char *)os.buf;
}

Buffer<enc> outbuf(buf, newsize);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this always Buffer<ENCODING::UTF8>?

Comment thread numpy/_core/strings.py Outdated
shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)
if a.dtype.char == "T":
out = np.empty_like(a, shape=shape)
fillchar = fillchar.astype(a.dtype)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add copy=False, in case fillchar is already OK.

Comment thread numpy/_core/strings.py Outdated

shape = np.broadcast_shapes(a.shape, width.shape, fillchar.shape)
if a.dtype.char == "T":
out = np.empty_like(a, shape=shape)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wondered about this before, but logically the ufunc machinery should do this already for StringDType.

p.s. Ah, yes, I wondered below, in https://github.com/numpy/numpy/pull/25908/files#r1509243153

Comment thread numpy/_core/strings.py
fillchar = np._utils.asbytes(fillchar)
if isinstance(a_arr.dtype, np.dtypes.StringDType):
res_dtype = a_arr.dtype
a = np.asanyarray(a)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the third copy of this, it does seem one could refactor this to just have one dispatching function. But no big deal.


FILL_ERROR = "The fill character must be exactly one character long"

def test_center_raises_multiple_character_fill(self, dt):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parametrize on function tested?

@ngoldbaum

Copy link
Copy Markdown
Member

Well, given that most of the ufuncs are wrapped in strings.py and thus do not even support out being passed, I think we're okay.

Ah good point. We should also come back to exposing in-place operations though since it's a good way to save memory, particularly for fixed-width strings.

I could delete the in-place code paths in the stringdtype implementations, but I'll go ahead and leave them in with an expectation that we'll support in-place operations in the future. Does that make sense to you?

Should we maybe add tests for that only? If so, shall we do it in another PR?

Yup, let's circle back to this in another issue. The stringdtype implementation already handles the in-place case so feel free to look at that for inspiration.

@ngoldbaum

Copy link
Copy Markdown
Member

The last push responds to review comments and in particular simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of out is unnecessary there. I also did it for a few of the other wrappers we've already implemented.

I also did a minor change to the setup for fillchar which should hopefully avoid the errors on 32 bit systems. Not totally sure what was happening there, it didn't have to do with my usage of size_t, instead it seemed to come from dtype inference from a python string ending up as a bytestring.

@mhvk

mhvk commented Mar 11, 2024

Copy link
Copy Markdown
Contributor

simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of out is unnecessary there.

Thanks. In principle, some other things like np.asanyarray(a) are not necessary for StringDType either -- those are much closer to just being able to use the ufunc machinery -- but none of that is urgent.

@lysnikolaou

Copy link
Copy Markdown
Member Author

I'll go ahead and leave them in with an expectation that we'll support in-place operations in the future. Does that make sense to you?

Sounds good, yes.

The last push responds to review comments and in particular simplifies the wrappers a bit for stringdtype since as @mhvk pointed out, the use of out is unnecessary there. I also did it for a few of the other wrappers we've already implemented.

Thanks for doing this. Minimizing the wrappers is an improvement.

@ngoldbaum

Copy link
Copy Markdown
Member

This got reviewed by @mhvk so I'm going to merge this. I'll do a followup PR to clean up the docstrings to indicate stringdtype is supported.

@ngoldbaum ngoldbaum merged commit 27aa60f into numpy:main Mar 12, 2024
@charris

charris commented Mar 12, 2024

Copy link
Copy Markdown
Member

@ngoldbaum This could also use a release note.

ngoldbaum added a commit to ngoldbaum/numpy that referenced this pull request Mar 12, 2024
ngoldbaum added a commit to ngoldbaum/numpy that referenced this pull request Mar 12, 2024
charris pushed a commit to charris/numpy that referenced this pull request Mar 15, 2024
@charris

charris commented Mar 15, 2024

Copy link
Copy Markdown
Member

@ngoldbaum Should this be backported? Other backported PRs are failing because the functions are missing.

@charris charris added 09 - Backport-Candidate PRs tagged should be backported and removed 09 - Backport-Candidate PRs tagged should be backported labels Mar 15, 2024
@charris

charris commented Mar 16, 2024

Copy link
Copy Markdown
Member

This is mess to backport due to the merge, it depends on other PRs that aren't backported. The easy thing to do is leave it to NumPy 2.1.

@ngoldbaum

Copy link
Copy Markdown
Member

This is a new feature that didn’t make it in time for 2.0 and shouldn’t be backported. It’s ok to keep this for 2.1.

@ngoldbaum

Copy link
Copy Markdown
Member

Which backported PR does this conflict with? Happy to help out with resolving conflicts.

@charris

charris commented Mar 16, 2024

Copy link
Copy Markdown
Member

@ngoldbaum Looks like you found it already :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: inconsistency between str.center and np.char.center

4 participants