Improve n-dimensional unique performance #8328
Conversation
cc: @essoca for vis
Force-pushed e6e2938 to 69ad6d2
Sorry for the delay! Here are the benchmark results after a56161d. This new version matches the NumPy implementation in returning the unique output sorted lexicographically in 2D. I tried to add a separate sort implementation in its own kernel (to prevent adding new compiled code) in 69ad6d2; however, this produced a 6x high mark improvement, compared to the 33x when

Benchmark script:

```python
from itertools import product

import cupy as cp
import numpy as np
from cupy import testing
from cupyx.profiler import benchmark
from tqdm import tqdm

input_sizes = [5, 10, 100, 1000, 5000, 10000]
modules = [(cp, cp.unique, 'CuPy'), (np, np.unique, 'NumPy')]
kwargs = [('dtype', [np.float32])]
kwargs = list(product(*[list(product([x[0]], x[1])) for x in kwargs]))

measurement = {}


def gather_time(prof):
    # Report the dominant side (CPU or GPU) of the mean runtime, in ms.
    cpu_time = prof.cpu_times.mean() * 1000
    gpu_time = prof.gpu_times.mean() * 1000
    return max(cpu_time, gpu_time)


def input_creation(xp, fn, size, **kwargs):
    size = size[0]
    x = testing.shaped_random((size, size), xp, dtype=kwargs['dtype'], scale=0.1)
    return fn, (x,)


def closure(fn, sig):
    return fn(*sig, axis=0)


for kwarg_comb in kwargs:
    kwarg_measurement = measurement.get(kwarg_comb, {})
    measurement[kwarg_comb] = kwarg_measurement
    kwarg_comb = dict(kwarg_comb)
    for size in tqdm(input_sizes):
        size_measurement = kwarg_measurement.get(size, {})
        kwarg_measurement[size] = size_measurement
        for xp, cls, _id in modules:
            b, x = input_creation(xp, cls, (size,), **kwarg_comb)
            prof = benchmark(closure, (b, x), n_repeat=100, n_warmup=1)
            size_measurement[_id] = gather_time(prof)

# Render the measurements as Markdown tables, one per kwarg combination.
lines = []
for kwarg_comb in kwargs:
    kwarg_line = ', '.join([f'{x}={y.__name__}' for x, y in kwarg_comb])
    lines.append(f'### `{kwarg_line}`\n')
    lines.append('| Size | CuPy (ms) | NumPy (ms) | Speedup |')
    lines.append('|:----:|:---------:|:----------:|:-------:|')
    kwarg_measurement = measurement[kwarg_comb]
    for size in input_sizes:
        comp = f'{size}'
        size_measurement = kwarg_measurement[size]
        times = []
        for _, _, _id in modules:
            time = size_measurement[_id]
            times.append(time)
            time_info = f'{time:3f}'
            comp = f'{comp} | {time_info}'
        speedup = times[1] / times[0]
        comp = f'{comp} | {speedup:3f}'
        lines.append(f'| {comp} |')
    lines.append('\n')

print('\n'.join(lines))
```
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.814147 | 0.064541 | 0.079275 |
| 10 | 0.895822 | 0.092653 | 0.103428 |
| 100 | 1.088569 | 0.422640 | 0.388253 |
| 1000 | 1.446971 | 11.319062 | 7.822593 |
| 5000 | 10.950881 | 283.175376 | 25.858685 |
| 10000 | 31.875722 | 1064.856683 | 33.406512 |
If sorted order were not required, the speedup would be ~422x. However, attaining that theoretical high mark would require computing some sort of distance-aware embedding that maps high-dimensional vectors to single scalar values, such that sorting those values also sorts the vectors lexicographically in the high-dimensional space.
dtype=float32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.809457 | 0.076864 | 0.094958 |
| 10 | 0.901338 | 0.116960 | 0.129763 |
| 100 | 0.876548 | 0.462223 | 0.527322 |
| 1000 | 1.104055 | 11.561804 | 10.472124 |
| 5000 | 1.778269 | 289.703060 | 162.912915 |
| 10000 | 2.746454 | 1160.532947 | 422.556838 |
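To illustrate the difficulty mentioned above, here is a minimal NumPy sketch (the polynomial hash is an illustrative stand-in, not anything from this PR) showing that sorting rows by a scalar surrogate key generally does not reproduce the lexicographic order that `np.lexsort` gives:

```python
import numpy as np

rows = np.array([[1, 0], [0, 5], [2, 2]])

# np.lexsort treats its last key as the primary one, so pass the columns
# in reverse to sort rows lexicographically (by column 0, then column 1).
lex_order = np.lexsort(rows.T[::-1])

# Map each row to a single scalar with an illustrative hash; sorting the
# scalars orders the rows by hash value, not lexicographically.
hashes = (rows * np.array([31, 1])).sum(axis=1) % 7
hash_order = np.argsort(hashes)

print(rows[lex_order])   # lexicographic: [[0 5] [1 0] [2 2]]
print(rows[hash_order])  # hash order differs: [[2 2] [1 0] [0 5]]
```

Only a scalar embedding whose order agrees with the lexicographic row order for arbitrary inputs would make the ~422x figure attainable while still returning sorted output.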
Force-pushed a56161d to f2fb7c5
Here are the full benchmark results across floating and integer dtypes:
dtype=float32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.666010 | 0.064248 | 0.096468 |
| 10 | 0.714919 | 0.080585 | 0.112719 |
| 100 | 0.809175 | 0.408440 | 0.504762 |
| 1000 | 1.119888 | 11.280470 | 10.072853 |
| 5000 | 10.330930 | 328.859222 | 31.832489 |
| 8000 | 20.500174 | 965.843477 | 47.113915 |
dtype=float64
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.911278 | 0.077275 | 0.084799 |
| 10 | 0.933582 | 0.097874 | 0.104837 |
| 100 | 1.139062 | 0.505815 | 0.444063 |
| 1000 | 1.657555 | 17.529524 | 10.575529 |
| 5000 | 15.796205 | 413.587950 | 26.182741 |
| 8000 | 31.086316 | 1181.832890 | 38.017786 |
dtype=int32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.900484 | 0.076021 | 0.084423 |
| 10 | 0.895114 | 0.097498 | 0.108923 |
| 100 | 0.891321 | 0.493570 | 0.553751 |
| 1000 | 2.753612 | 13.386276 | 4.861351 |
| 5000 | 9.084340 | 332.088435 | 36.556143 |
| 8000 | 14.054907 | 966.191658 | 68.744079 |
dtype=int64
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.908947 | 0.077237 | 0.084974 |
| 10 | 0.919481 | 0.097343 | 0.105868 |
| 100 | 0.923542 | 0.506951 | 0.548920 |
| 1000 | 3.442746 | 19.153301 | 5.563379 |
| 5000 | 10.647459 | 407.381513 | 38.260914 |
| 8000 | 16.348631 | 1169.777595 | 71.552020 |
Force-pushed a0a4864 to 58d57a9
@andfoy Many thanks for this! This unexpectedly bit me as well yesterday.
These are the new benchmark values after c0fe249:
dtype=float32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.810840 | 0.069476 | 0.085683 |
| 10 | 1.052906 | 0.168616 | 0.160144 |
| 100 | 0.949754 | 0.414205 | 0.436119 |
| 1000 | 1.218176 | 11.760539 | 9.654218 |
| 5000 | 10.944192 | 292.644518 | 26.739711 |
| 8000 | 18.373534 | 817.939150 | 44.517246 |
dtype=float64
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.860330 | 0.066699 | 0.077527 |
| 10 | 0.855123 | 0.083544 | 0.097698 |
| 100 | 1.001703 | 0.421561 | 0.420845 |
| 1000 | 1.847500 | 15.182318 | 8.217764 |
| 5000 | 15.697573 | 361.005896 | 22.997562 |
| 8000 | 18.669041 | 1002.279578 | 53.686721 |
dtype=int32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.840632 | 0.065981 | 0.078490 |
| 10 | 0.755288 | 0.082317 | 0.108988 |
| 100 | 0.975596 | 0.415043 | 0.425425 |
| 1000 | 3.039855 | 10.882150 | 3.579826 |
| 5000 | 9.463769 | 286.553839 | 30.279041 |
| 8000 | 14.670544 | 804.946319 | 54.868199 |
dtype=int64
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 0.849860 | 0.065714 | 0.077323 |
| 10 | 0.913852 | 0.081733 | 0.089438 |
| 100 | 0.982707 | 0.422759 | 0.430199 |
| 1000 | 3.856987 | 15.393857 | 3.991161 |
| 5000 | 12.622875 | 353.585588 | 28.011494 |
| 8000 | 16.760874 | 989.089278 | 59.011796 |
This pull request is now in conflicts. Could you fix it @andfoy? 🙏
Force-pushed c0fe249 to e768923
Force-pushed 2dad650 to 7430ee6
Sorry for the delay in posting updates. This PR now passes tests as of 46d1b76 and is ready for review. Here are the latest benchmark results:
dtype=float32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 1.666737 | 0.089102 | 0.053459 |
| 10 | 1.731492 | 0.131966 | 0.076215 |
| 100 | 1.845882 | 0.560804 | 0.303814 |
| 1000 | 2.380428 | 12.505404 | 5.253426 |
| 5000 | 13.762609 | 306.781061 | 22.290908 |
| 7000 | 20.795560 | 661.770639 | 31.822689 |
dtype=float64
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 1.594058 | 0.076425 | 0.047944 |
| 10 | 1.542880 | 0.081076 | 0.052549 |
| 100 | 1.780715 | 0.414985 | 0.233044 |
| 1000 | 2.550307 | 26.087804 | 10.229279 |
| 5000 | 20.387019 | 423.489205 | 20.772493 |
| 7000 | 16.934111 | 730.523397 | 43.139164 |
dtype=int32
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 1.545758 | 0.063973 | 0.041386 |
| 10 | 1.403104 | 0.091719 | 0.065369 |
| 100 | 1.627545 | 0.424023 | 0.260529 |
| 1000 | 5.765384 | 21.437695 | 3.718347 |
| 5000 | 9.858319 | 299.476496 | 30.378050 |
| 7000 | 16.769665 | 571.202105 | 34.061628 |
dtype=int64
| Size | CuPy (ms) | NumPy (ms) | Speedup |
|---|---|---|---|
| 5 | 1.622536 | 0.072514 | 0.044692 |
| 10 | 1.542374 | 0.079068 | 0.051264 |
| 100 | 1.642492 | 0.497773 | 0.303060 |
| 1000 | 5.902750 | 26.794247 | 4.539282 |
| 5000 | 19.478476 | 388.214742 | 19.930448 |
| 7000 | 25.101465 | 718.955829 | 28.641987 |
@asi1024 Could you take over the review of this PR?
Force-pushed 46d1b76 to d74765d
@asi1024 Sorry for the delay in applying the suggested review comments! It should now be ready.
@andfoy could you fix the linter error?
Force-pushed 203db25 to 270f3bd
@leofang it should be ready now!
This pull request is now in conflicts. Could you fix it @andfoy? 🙏
Seems like there's another conflict lol
Force-pushed 270f3bd to 106efa4
Fixes #8326

This PR aims to improve the performance of `cupy.unique` when `axis != None`. The main idea goes as follows: the rows along the given axis are reduced to scalar hash values, which are then passed to `unique_1d`, which should yield the unique rows according to their hash value.
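A rough CPU-side sketch of that idea, using NumPy in place of the CuPy kernels (the `row_hash` helper below is a hypothetical stand-in for the hashing kernel, not the one added by this PR, and the sketch ignores both hash collisions and the lexicographic sorting of the output):

```python
import numpy as np


def row_hash(a):
    # Hypothetical stand-in for a row-hashing kernel: view each row's bytes
    # as uint8 and fold them into one uint64 per row (FNV-1a style).
    b = np.ascontiguousarray(a).view(np.uint8).reshape(a.shape[0], -1)
    h = np.full(a.shape[0], 0xcbf29ce484222325, dtype=np.uint64)
    for col in b.T:
        h = (h ^ col.astype(np.uint64)) * np.uint64(0x100000001b3)
    return h


def unique_rows_by_hash(a):
    # Hash each row to a scalar, then reuse 1-D unique machinery on the
    # hashes; the surviving first-occurrence indices select the unique rows.
    # A real implementation must also verify candidate duplicates, since
    # distinct rows can collide to the same hash.
    hashes = row_hash(a)
    _, idx = np.unique(hashes, return_index=True)
    return a[np.sort(idx)]


a = np.array([[1, 2], [3, 4], [1, 2], [5, 6]])
print(unique_rows_by_hash(a))  # [[1 2] [3 4] [5 6]]
```

This reduces the n-dimensional deduplication to a single 1-D unique over scalars, which is the step that maps well onto the existing `unique_1d` machinery.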