Conversation

@andfoy (Contributor) commented May 17, 2024

Fixes #8326

This PR aims to improve the performance of `cupy.unique` when `axis != None`. The main idea is as follows (a minimal sketch follows the list):

  1. Compute a hash value for each row; this can be done completely in parallel.
  2. Call `unique_1d`, which yields the unique rows according to their hash values.
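A minimal sketch of the two steps, assuming a hypothetical helper (this is not the PR's actual kernels, and it ignores hash collisions, which a real implementation must handle):

```python
import cupy as cp

def unique_rows_by_hash(a):
    # Step 1: hash every row in parallel. Reinterpret each row as raw
    # bytes so the hash covers the full bit pattern of every element.
    rows = cp.ascontiguousarray(a).view(cp.uint8)
    # FNV-1a-style fold over the byte columns; all rows are hashed at
    # once, only the (small) column count is a Python-level loop.
    h = cp.full(a.shape[0], cp.uint64(0xcbf29ce484222325), dtype=cp.uint64)
    for j in range(rows.shape[1]):
        h = (h ^ rows[:, j]) * cp.uint64(0x100000001b3)
    # Step 2: a plain 1-D unique over the hashes; the returned indices
    # pick one representative row per distinct hash value.
    _, idx = cp.unique(h, return_index=True)
    return a[idx]
```

Note that the rows come back ordered by hash value rather than lexicographically; matching NumPy's sorted output is what the later revisions below address.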

@takagi takagi self-assigned this May 20, 2024
@takagi takagi added the cat:enhancement and prio:medium labels May 20, 2024
@leofang (Member) commented May 20, 2024

cc: @essoca for vis

@andfoy andfoy force-pushed the improve_unique_axis branch 3 times, most recently from e6e2938 to 69ad6d2 on July 20, 2024 01:27
@andfoy (Contributor, Author) commented Aug 23, 2024

Sorry for the delay! Here are the benchmark results after a56161d; this new version matches the NumPy implementation by returning the unique rows sorted lexicographically in the 2-D case.

I tried adding a separate sort implementation in its own kernel (to avoid adding new compiled code) in 69ad6d2; however, it topped out at a 6x improvement, compared to the 33x reached when Thrust calls are used.
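For context, the lexicographic row order the sorted output must satisfy can be sketched with `cp.lexsort`, which CuPy backs with Thrust; this is only an illustration of the required ordering, not the PR's kernel:

```python
import cupy as cp

def lexsort_rows(a):
    # cp.lexsort treats the LAST key as the primary one, so reverse the
    # columns to make column 0 the primary sort key.
    order = cp.lexsort(a.T[::-1])
    return a[order]

a = cp.asarray([[1.0, 5.0], [0.0, 9.0], [1.0, 2.0]])
print(lexsort_rows(a))
# [[0. 9.]
#  [1. 2.]
#  [1. 5.]]
```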

Benchmark script:

```python
from itertools import product

import cupy as cp
import numpy as np
from cupy import testing
from cupyx.profiler import benchmark
from tqdm import tqdm

input_sizes = [5, 10, 100, 1000, 5000, 10000]
modules = [(cp, cp.unique, 'CuPy'), (np, np.unique, 'NumPy')]

# Swap in the full list to benchmark every dtype:
# kwargs = [('dtype', [np.float32, np.float64, np.int32, np.int64])]
kwargs = [('dtype', [np.float32])]
# Expand into the cartesian product of (name, value) pairs, so that each
# entry is one keyword-argument combination to benchmark.
kwargs = list(product(*[list(product([x[0]], x[1])) for x in kwargs]))
measurement = {}


def gather_time(prof):
    # Report whichever side dominates: CPU overhead for small inputs,
    # GPU kernel time for large ones.
    cpu_time = prof.cpu_times.mean() * 1000
    gpu_time = prof.gpu_times.mean() * 1000
    return max(cpu_time, gpu_time)


def input_creation(xp, fn, size, **kwargs):
    # Square (size x size) input; scale=0.1 keeps values in [0, 0.1).
    size = size[0]
    x = testing.shaped_random((size, size), xp, dtype=kwargs['dtype'],
                              scale=0.1)
    return fn, (x,)


def closure(fn, sig):
    return fn(*sig, axis=0)


for kwarg_comb in kwargs:
    kwarg_measurement = measurement.get(kwarg_comb, {})
    measurement[kwarg_comb] = kwarg_measurement
    kwarg_comb = dict(kwarg_comb)
    for size in tqdm(input_sizes):
        size_measurement = kwarg_measurement.get(size, {})
        kwarg_measurement[size] = size_measurement
        for xp, cls, _id in modules:
            b, x = input_creation(xp, cls, (size,), **kwarg_comb)
            prof = benchmark(closure, (b, x), n_repeat=100, n_warmup=1)
            size_measurement[_id] = gather_time(prof)


# Render the measurements as the markdown tables shown below.
lines = []
for kwarg_comb in kwargs:
    kwarg_line = ', '.join([f'{x}={y.__name__}' for x, y in kwarg_comb])
    lines.append(f'### `{kwarg_line}`\n')
    lines.append('| Size | CuPy (ms) | NumPy (ms) | Speedup |')
    lines.append('|:----:|:---------:|:----------:|:-------:|')
    kwarg_measurement = measurement[kwarg_comb]
    for size in input_sizes:
        comp = f'{size}'
        size_measurement = kwarg_measurement[size]
        times = []
        for _, _, _id in modules:
            time = size_measurement[_id]
            times.append(time)
            comp = f'{comp} | {time:.6f}'
        speedup = times[1] / times[0]  # NumPy time / CuPy time
        comp = f'{comp} | {speedup:.6f}'
        lines.append(f'| {comp} |')
    lines.append('\n')

print('\n'.join(lines))
```

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.814147 | 0.064541 | 0.079275 |
| 10 | 0.895822 | 0.092653 | 0.103428 |
| 100 | 1.088569 | 0.422640 | 0.388253 |
| 1000 | 1.446971 | 11.319062 | 7.822593 |
| 5000 | 10.950881 | 283.175376 | 25.858685 |
| 10000 | 31.875722 | 1064.856683 | 33.406512 |

If sorted order were not required, the speedup would be ~422x. Attaining that theoretical high mark while still producing sorted output, however, would require some kind of order-preserving embedding that maps each high-dimensional row to a single scalar value, such that sorting those scalars also sorts the rows lexicographically in the high-dimensional space (a toy illustration follows the table below).

### `dtype=float32` (without lexicographic sorting)

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.809457 | 0.076864 | 0.094958 |
| 10 | 0.901338 | 0.116960 | 0.129763 |
| 100 | 0.876548 | 0.462223 | 0.527322 |
| 1000 | 1.104055 | 11.561804 | 10.472124 |
| 5000 | 1.778269 | 289.703060 | 162.912915 |
| 10000 | 2.746454 | 1160.532947 | 422.556838 |
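To make the embedding idea concrete, here is a toy case where such an order-preserving scalar map does exist (a hypothetical illustration: packing narrow unsigned columns into one wider integer). It does not generalize: full-width rows have more distinct values than any single machine word can represent, which is why the sorted path pays for a real lexicographic sort.

```python
import cupy as cp

# Pack two uint8 columns into one uint16, most-significant column first;
# scalar order then coincides with lexicographic row order.
a = cp.array([[1, 200], [1, 3], [0, 255]], dtype=cp.uint8)
packed = (a[:, 0].astype(cp.uint16) << 8) | a[:, 1].astype(cp.uint16)
order = cp.argsort(packed)  # sorting the scalars ...
print(a[order])             # ... sorts the rows lexicographically
# [[  0 255]
#  [  1   3]
#  [  1 200]]
```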

@andfoy andfoy force-pushed the improve_unique_axis branch from a56161d to f2fb7c5 on August 29, 2024 21:18
@andfoy (Contributor, Author) commented Aug 30, 2024

Here are the full benchmark results across floating-point and integer dtypes:

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.666010 | 0.064248 | 0.096468 |
| 10 | 0.714919 | 0.080585 | 0.112719 |
| 100 | 0.809175 | 0.408440 | 0.504762 |
| 1000 | 1.119888 | 11.280470 | 10.072853 |
| 5000 | 10.330930 | 328.859222 | 31.832489 |
| 8000 | 20.500174 | 965.843477 | 47.113915 |

### `dtype=float64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.911278 | 0.077275 | 0.084799 |
| 10 | 0.933582 | 0.097874 | 0.104837 |
| 100 | 1.139062 | 0.505815 | 0.444063 |
| 1000 | 1.657555 | 17.529524 | 10.575529 |
| 5000 | 15.796205 | 413.587950 | 26.182741 |
| 8000 | 31.086316 | 1181.832890 | 38.017786 |

### `dtype=int32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.900484 | 0.076021 | 0.084423 |
| 10 | 0.895114 | 0.097498 | 0.108923 |
| 100 | 0.891321 | 0.493570 | 0.553751 |
| 1000 | 2.753612 | 13.386276 | 4.861351 |
| 5000 | 9.084340 | 332.088435 | 36.556143 |
| 8000 | 14.054907 | 966.191658 | 68.744079 |

### `dtype=int64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.908947 | 0.077237 | 0.084974 |
| 10 | 0.919481 | 0.097343 | 0.105868 |
| 100 | 0.923542 | 0.506951 | 0.548920 |
| 1000 | 3.442746 | 19.153301 | 5.563379 |
| 5000 | 10.647459 | 407.381513 | 38.260914 |
| 8000 | 16.348631 | 1169.777595 | 71.552020 |

@andfoy andfoy force-pushed the improve_unique_axis branch from a0a4864 to 58d57a9 on October 3, 2024 16:58
@izaid commented Oct 13, 2024

@andfoy Many thanks for this! This unexpectedly bit me yesterday as well (cp.unique with axis ran forever), so I'm very keen to see this make it into CuPy 14!

@andfoy (Contributor, Author) commented Oct 17, 2024

These are the new benchmark values after c0fe249:

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.810840 | 0.069476 | 0.085683 |
| 10 | 1.052906 | 0.168616 | 0.160144 |
| 100 | 0.949754 | 0.414205 | 0.436119 |
| 1000 | 1.218176 | 11.760539 | 9.654218 |
| 5000 | 10.944192 | 292.644518 | 26.739711 |
| 8000 | 18.373534 | 817.939150 | 44.517246 |

### `dtype=float64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.860330 | 0.066699 | 0.077527 |
| 10 | 0.855123 | 0.083544 | 0.097698 |
| 100 | 1.001703 | 0.421561 | 0.420845 |
| 1000 | 1.847500 | 15.182318 | 8.217764 |
| 5000 | 15.697573 | 361.005896 | 22.997562 |
| 8000 | 18.669041 | 1002.279578 | 53.686721 |

### `dtype=int32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.840632 | 0.065981 | 0.078490 |
| 10 | 0.755288 | 0.082317 | 0.108988 |
| 100 | 0.975596 | 0.415043 | 0.425425 |
| 1000 | 3.039855 | 10.882150 | 3.579826 |
| 5000 | 9.463769 | 286.553839 | 30.279041 |
| 8000 | 14.670544 | 804.946319 | 54.868199 |

### `dtype=int64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.849860 | 0.065714 | 0.077323 |
| 10 | 0.913852 | 0.081733 | 0.089438 |
| 100 | 0.982707 | 0.422759 | 0.430199 |
| 1000 | 3.856987 | 15.393857 | 3.991161 |
| 5000 | 12.622875 | 353.585588 | 28.011494 |
| 8000 | 16.760874 | 989.089278 | 59.011796 |

mergify bot commented Oct 17, 2024

This pull request is now in conflicts. Could you fix it @andfoy? 🙏

@andfoy andfoy force-pushed the improve_unique_axis branch from c0fe249 to e768923 on October 17, 2024 18:02
@andfoy andfoy force-pushed the improve_unique_axis branch 2 times, most recently from 2dad650 to 7430ee6 on February 13, 2025 18:28
@andfoy andfoy marked this pull request as ready for review February 13, 2025 18:30
@andfoy (Contributor, Author) commented Feb 14, 2025

Sorry for the delay in posting updates. This PR now passes the tests as of 46d1b76 and is ready for review. Here are the latest benchmark results:

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.666737 | 0.089102 | 0.053459 |
| 10 | 1.731492 | 0.131966 | 0.076215 |
| 100 | 1.845882 | 0.560804 | 0.303814 |
| 1000 | 2.380428 | 12.505404 | 5.253426 |
| 5000 | 13.762609 | 306.781061 | 22.290908 |
| 7000 | 20.795560 | 661.770639 | 31.822689 |

### `dtype=float64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.594058 | 0.076425 | 0.047944 |
| 10 | 1.542880 | 0.081076 | 0.052549 |
| 100 | 1.780715 | 0.414985 | 0.233044 |
| 1000 | 2.550307 | 26.087804 | 10.229279 |
| 5000 | 20.387019 | 423.489205 | 20.772493 |
| 7000 | 16.934111 | 730.523397 | 43.139164 |

### `dtype=int32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.545758 | 0.063973 | 0.041386 |
| 10 | 1.403104 | 0.091719 | 0.065369 |
| 100 | 1.627545 | 0.424023 | 0.260529 |
| 1000 | 5.765384 | 21.437695 | 3.718347 |
| 5000 | 9.858319 | 299.476496 | 30.378050 |
| 7000 | 16.769665 | 571.202105 | 34.061628 |

### `dtype=int64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.622536 | 0.072514 | 0.044692 |
| 10 | 1.542374 | 0.079068 | 0.051264 |
| 100 | 1.642492 | 0.497773 | 0.303060 |
| 1000 | 5.902750 | 26.794247 | 4.539282 |
| 5000 | 19.478476 | 388.214742 | 19.930448 |
| 7000 | 25.101465 | 718.955829 | 28.641987 |

@kmaehashi (Member) commented:

@asi1024 Could you take over the review of this PR?

@asi1024 asi1024 assigned asi1024 and unassigned takagi Feb 25, 2025
@andfoy andfoy force-pushed the improve_unique_axis branch from 46d1b76 to d74765d on June 7, 2025 00:56
@andfoy (Contributor, Author) commented Jun 9, 2025

@asi1024 Sorry for the delay in applying the suggested review comments! It should now be ready.

@leofang (Member) commented Jun 11, 2025

@andfoy could you fix the linter error?

@andfoy andfoy force-pushed the improve_unique_axis branch from 203db25 to 270f3bd on June 11, 2025 17:14
@andfoy (Contributor, Author) commented Jun 11, 2025

@leofang it should be ready now!

mergify bot commented Jun 14, 2025

This pull request is now in conflicts. Could you fix it @andfoy? 🙏

@leofang (Member) commented Jun 16, 2025

Seems like there's another conflict lol

@andfoy andfoy force-pushed the improve_unique_axis branch from 270f3bd to 106efa4 on June 17, 2025 18:33