Conversation
|
I marked this as a draft in the GitHub UI, hope you don't mind. |
|
Thanks, I had intended to mark it as a draft but forgot. |
|
Still got work to do figuring out the dtypes / type lengths supported on all platforms, and a way to filter which dtypes should go through the hash-based unique, but for now this gives me roughly a 10x speedup on 100_000_000 to 1_000_000_000 data points. EDIT: with compiler optimizations enabled, the speedup at these sizes is closer to 2-3x. |
|
Example times on 1,000,000,000 samples:

Now that the code is pretty small, is this something we'd like to have in numpy? cc @seberg, @rgommers since you've been involved in this. I still need to fix the compiler issues for the different compilers, and maybe figure out the macro defs to add 96- and 128-bit data, but for now I think we can discuss whether this is the right thing to do. Also, the script I use to benchmark:

```python
import time

import numpy as np
import polars as pl

from numpy._core.unique import unique_hash

arr = np.random.randint(0, 1000, 1_000_000_000)

time_start = time.perf_counter()
print("unique count (hashmap): ", len(unique_hash(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (numpy main): ", len(np.unique(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)
```

Haven't included pandas, since I guess they still need to do a release compatible with numpy 2 first. |
|
This seems like a reasonable amount of code to maintain; it's more or less a wrapper around the C++ standard library. |
|
Also totally a style thing, but to keep things consistent it probably makes more sense to add the |
Nice work. I agree with what @ngoldbaum said above. This seems like a very clean and concise implementation. |
You could install the nightly wheels from the scientific-python channel if you want (I don't think there is any compiled library that already has a 2.0-compatible release, given there is not yet a stable ABI?). I tried your benchmark case locally with pandas. I didn't fetch this branch to compare with, but also ran with released numpy as a point of comparison:

Related to |
I tried and failed. One is C++, the other C, and they also have different compilation-related macros, so I'm not sure how to merge them.
@jorisvandenbossche |
Should work, I think, but I guess you could defer that for now. You will have to export an `extern "C"` symbol for the C side to call.
In C, you do cleanup with a `goto` to a shared cleanup label; in C++, destructors (RAII) handle it.
In this case, I am not sure what is better; it might be option 1 (have a simple C entry point and just call into some C++ code that maybe doesn't even need objects at all) or option 3 if you want to work with objects in C++.
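Not code from this PR, but to make option 1 concrete, here's a minimal sketch of that shape (function names are hypothetical): a plain-C entry point with `extern "C"` linkage that calls into C++, where RAII does the cleanup and any escaping exception is translated into an error code.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// C++ core: the container cleans up after itself (RAII), no goto needed.
static std::size_t count_unique_impl(const std::int64_t *data, std::size_t n) {
    std::unordered_set<std::int64_t> seen(data, data + n);
    return seen.size();
}

// C-visible entry point: C linkage, and no C++ exception may cross it.
extern "C" int count_unique_int64(const std::int64_t *data, std::size_t n,
                                  std::size_t *out) {
    try {
        *out = count_unique_impl(data, n);
        return 0;   // success
    } catch (...) {
        return -1;  // map any C++ failure (e.g. std::bad_alloc) to an error code
    }
}
```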
I doubt it matters much; there is nothing cryptographic here. The issue would be denial of service: crafting a dataset with a huge number of unique values that all hash to the same thing. Note that e.g. Python only randomizes the hash for strings (unless dicts randomize them for non-strings?). |
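To see why colliding hashes are the worry, a small self-contained illustration (unrelated to the PR's actual code): forcing every value into one bucket of a `std::unordered_set` turns each insert into a scan of a growing chain, so construction degrades from the expected O(n) to O(n²).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// A pathological hash: every value maps to the same bucket.
struct ConstantHash {
    std::size_t operator()(std::int64_t) const noexcept { return 0; }
};

int main() {
    std::unordered_set<std::int64_t, ConstantHash> degenerate;
    // Each insert scans the single ever-growing bucket chain, so this
    // loop is quadratic; with the default hash it would be linear.
    for (std::int64_t i = 0; i < 20000; ++i) {
        degenerate.insert(i);
    }
    std::printf("%zu unique values\n", degenerate.size());
    return 0;
}
```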
|
Doesn't seem like the failure is related to this PR:

```
=================================== FAILURES ===================================
___________ TestStringDiscovery.test_nested_arrays_stringlength[1.2] ___________

self = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
obj = 1.2

    @pytest.mark.parametrize("obj",
            [object(), 1.2, 10**43, None, "string"],
            ids=["object", "1.2", "10**43", "None", "string"])
    def test_nested_arrays_stringlength(self, obj):
        length = len(str(obj))
        expected = np.dtype(f"S{length}")
        arr = np.array(obj, dtype="O")
>       assert np.array([arr, arr], dtype="S").dtype == expected
E       RuntimeWarning: invalid value encountered in cast

arr = array(1.2, dtype=object)
expected = dtype('S3')
length = 3
obj = 1.2
self = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
```
|
The fix for that was just merged. |
seberg left a comment:
I added a small commit to:
- Use `sorted=False` for `unique_values` (and a release note).
- Some smaller maintenance, partially just small preferences, though.
- I removed the 0-size special case. I think it is just nice to not need it (and I don't consider 0-size to be important to optimize).
This LGTM, would be nice if someone could have a quick look over my changes.
```c
NPY_ALLOW_C_API;
PyArray_Descr *descr = PyArray_DESCR(self);
Py_INCREF(descr);
PyObject *res_obj = PyArray_NewFromDescr(
```
If there were a C++ exception after this, we would leak the `descr` reference... but I don't think that is possible, so let's not worry about it.
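For context only (per the comment above, the PR doesn't need this): if such a leak ever did become possible, a small RAII guard is the usual C++ answer, dropping the reference on any exit path unless ownership was handed off. The `RefGuard` name here is made up for this sketch.

```cpp
#include <Python.h>

// Drops the held reference on scope exit -- including exception unwinding --
// unless release() transferred ownership first.
struct RefGuard {
    PyObject *obj;
    explicit RefGuard(PyObject *o) noexcept : obj(o) {}
    ~RefGuard() { Py_XDECREF(obj); }
    RefGuard(const RefGuard &) = delete;
    RefGuard &operator=(const RefGuard &) = delete;
    PyObject *release() noexcept {  // disarm the guard; caller now owns obj
        PyObject *tmp = obj;
        obj = nullptr;
        return tmp;
    }
};
```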
|
There's a segfault, and the docs need updating since the output is not sorted now. @seberg |
Whoops, need to not try to iterate (I think the setup is OK, though). The output being sorted is not documented for `unique_values`. EDIT: 🤦 sorry, the examples do need updating of course, oops. |
(let's pretend it may not even be stable!)
|
Just to note the observation: the s390x build segfaulted once here (i.e., not our test, but the compilation). |
|
Let's give this a shot; thanks @adrinjalali. We discussed it briefly yesterday, and I don't think it is helpful to try to iterate here more. There are a lot of smaller and larger follow-ups (larger ones being e.g. thinking about using another hash map, maybe even one with some randomization, etc.). |
|
Thank y'all for all the patience, all the reviews, and all the help. This was a steep learning curve for me, doing C++ after about 12 years and never having done CPython bindings with its bare-bones C API 😅 I loved every bit of this experience and look forward to more contributions in this little corner ❤️ |
This calculates unique values with an unordered hashset in C++ for certain dtypes.
Towards #11136
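For readers skimming the thread, the core idea, stripped of the dtype dispatch and CPython C API plumbing, is roughly the following sketch (illustrative only; the function name is made up and this is not the PR's actual code):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hash-based unique: one pass to populate the set, one pass to copy it out.
// Average O(n), versus O(n log n) for a sort-based unique; the trade-off is
// that the output order is whatever the hash table yields, i.e. unsorted.
template <typename T>
std::vector<T> hash_unique(const T *data, std::size_t n) {
    std::unordered_set<T> seen(data, data + n);
    return std::vector<T>(seen.begin(), seen.end());
}
```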