Conversation

@andfoy (Contributor) commented May 17, 2024

Fixes #8326

This PR aims to improve the performance of `cupy.unique` when `axis != None`. The main idea is as follows (a minimal sketch follows the list):

  1. Compute a hash value for each row; this can be done completely in parallel.
  2. Call `unique_1d`, which yields the unique rows according to their hash values.
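A minimal sketch of the two steps, assuming a hypothetical helper (this is not the PR's actual kernels, and it ignores hash collisions, which a real implementation must handle):

```python
import cupy as cp

def unique_rows_by_hash(a):
    # Step 1: hash every row in parallel. Reinterpret each row as raw
    # bytes so the hash covers the full bit pattern of every element.
    rows = cp.ascontiguousarray(a).view(cp.uint8)
    # FNV-1a-style fold over the byte columns; all rows are hashed at
    # once, only the (small) column count is a Python-level loop.
    h = cp.full(a.shape[0], cp.uint64(0xcbf29ce484222325), dtype=cp.uint64)
    for j in range(rows.shape[1]):
        h = (h ^ rows[:, j]) * cp.uint64(0x100000001b3)
    # Step 2: a plain 1-D unique over the hashes; the returned indices
    # pick one representative row per distinct hash value.
    _, idx = cp.unique(h, return_index=True)
    return a[idx]
```

Note that the rows come back ordered by hash value rather than lexicographically; matching NumPy's sorted output is what the later revisions below address.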

@takagi takagi self-assigned this May 20, 2024
@takagi takagi added the cat:enhancement and prio:medium labels May 20, 2024
@leofang (Member) commented May 20, 2024

cc: @essoca for vis

@andfoy andfoy force-pushed the improve_unique_axis branch 3 times, most recently from e6e2938 to 69ad6d2 on July 20, 2024 01:27
@andfoy (Contributor, Author) commented Aug 23, 2024

Sorry for the delay! Here are the benchmark results after a56161d; this new version matches the NumPy implementation by returning the unique rows sorted lexicographically in the 2-D case.

I tried adding a separate sort implementation in its own kernel (to avoid adding new compiled code) in 69ad6d2; however, it topped out at a 6x improvement, compared to the 33x reached when Thrust calls are used.
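For context, the lexicographic row order the sorted output must satisfy can be sketched with `cp.lexsort`, which CuPy backs with Thrust; this is only an illustration of the required ordering, not the PR's kernel:

```python
import cupy as cp

def lexsort_rows(a):
    # cp.lexsort treats the LAST key as the primary one, so reverse the
    # columns to make column 0 the primary sort key.
    order = cp.lexsort(a.T[::-1])
    return a[order]

a = cp.asarray([[1.0, 5.0], [0.0, 9.0], [1.0, 2.0]])
print(lexsort_rows(a))
# [[0. 9.]
#  [1. 2.]
#  [1. 5.]]
```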

Benchmark script:

```python
from itertools import product

import cupy as cp
import numpy as np
from cupy import testing
from cupyx.profiler import benchmark
from tqdm import tqdm

input_sizes = [5, 10, 100, 1000, 5000, 10000]
modules = [(cp, cp.unique, 'CuPy'), (np, np.unique, 'NumPy')]

# Swap in the full list to benchmark every dtype:
# kwargs = [('dtype', [np.float32, np.float64, np.int32, np.int64])]
kwargs = [('dtype', [np.float32])]
# Expand into the cartesian product of (name, value) pairs, so that each
# entry is one keyword-argument combination to benchmark.
kwargs = list(product(*[list(product([x[0]], x[1])) for x in kwargs]))
measurement = {}


def gather_time(prof):
    # Report whichever side dominates: CPU overhead for small inputs,
    # GPU kernel time for large ones.
    cpu_time = prof.cpu_times.mean() * 1000
    gpu_time = prof.gpu_times.mean() * 1000
    return max(cpu_time, gpu_time)


def input_creation(xp, fn, size, **kwargs):
    # Square (size x size) input; scale=0.1 keeps values in [0, 0.1).
    size = size[0]
    x = testing.shaped_random((size, size), xp, dtype=kwargs['dtype'],
                              scale=0.1)
    return fn, (x,)


def closure(fn, sig):
    return fn(*sig, axis=0)


for kwarg_comb in kwargs:
    kwarg_measurement = measurement.get(kwarg_comb, {})
    measurement[kwarg_comb] = kwarg_measurement
    kwarg_comb = dict(kwarg_comb)
    for size in tqdm(input_sizes):
        size_measurement = kwarg_measurement.get(size, {})
        kwarg_measurement[size] = size_measurement
        for xp, cls, _id in modules:
            b, x = input_creation(xp, cls, (size,), **kwarg_comb)
            prof = benchmark(closure, (b, x), n_repeat=100, n_warmup=1)
            size_measurement[_id] = gather_time(prof)


# Render the measurements as the markdown tables shown below.
lines = []
for kwarg_comb in kwargs:
    kwarg_line = ', '.join([f'{x}={y.__name__}' for x, y in kwarg_comb])
    lines.append(f'### `{kwarg_line}`\n')
    lines.append('| Size | CuPy (ms) | NumPy (ms) | Speedup |')
    lines.append('|:----:|:---------:|:----------:|:-------:|')
    kwarg_measurement = measurement[kwarg_comb]
    for size in input_sizes:
        comp = f'{size}'
        size_measurement = kwarg_measurement[size]
        times = []
        for _, _, _id in modules:
            time = size_measurement[_id]
            times.append(time)
            comp = f'{comp} | {time:.6f}'
        speedup = times[1] / times[0]  # NumPy time / CuPy time
        comp = f'{comp} | {speedup:.6f}'
        lines.append(f'| {comp} |')
    lines.append('\n')

print('\n'.join(lines))
```

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.814147 | 0.064541 | 0.079275 |
| 10 | 0.895822 | 0.092653 | 0.103428 |
| 100 | 1.088569 | 0.422640 | 0.388253 |
| 1000 | 1.446971 | 11.319062 | 7.822593 |
| 5000 | 10.950881 | 283.175376 | 25.858685 |
| 10000 | 31.875722 | 1064.856683 | 33.406512 |

If sorted order were not required, the speedup would be ~422x. Attaining that theoretical high mark while still producing sorted output, however, would require some kind of order-preserving embedding that maps each high-dimensional row to a single scalar value, such that sorting those scalars also sorts the rows lexicographically in the high-dimensional space (a toy illustration follows the table below).

### `dtype=float32` (without lexicographic sorting)

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.809457 | 0.076864 | 0.094958 |
| 10 | 0.901338 | 0.116960 | 0.129763 |
| 100 | 0.876548 | 0.462223 | 0.527322 |
| 1000 | 1.104055 | 11.561804 | 10.472124 |
| 5000 | 1.778269 | 289.703060 | 162.912915 |
| 10000 | 2.746454 | 1160.532947 | 422.556838 |
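To make the embedding idea concrete, here is a toy case where such an order-preserving scalar map does exist (a hypothetical illustration: packing narrow unsigned columns into one wider integer). It does not generalize: full-width rows have more distinct values than any single machine word can represent, which is why the sorted path pays for a real lexicographic sort.

```python
import cupy as cp

# Pack two uint8 columns into one uint16, most-significant column first;
# scalar order then coincides with lexicographic row order.
a = cp.array([[1, 200], [1, 3], [0, 255]], dtype=cp.uint8)
packed = (a[:, 0].astype(cp.uint16) << 8) | a[:, 1].astype(cp.uint16)
order = cp.argsort(packed)  # sorting the scalars ...
print(a[order])             # ... sorts the rows lexicographically
# [[  0 255]
#  [  1   3]
#  [  1 200]]
```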

@andfoy andfoy force-pushed the improve_unique_axis branch from a56161d to f2fb7c5 on August 29, 2024 21:18
@andfoy (Contributor, Author) commented Aug 30, 2024

Here are the full benchmark results across floating-point and integer dtypes:

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.666010 | 0.064248 | 0.096468 |
| 10 | 0.714919 | 0.080585 | 0.112719 |
| 100 | 0.809175 | 0.408440 | 0.504762 |
| 1000 | 1.119888 | 11.280470 | 10.072853 |
| 5000 | 10.330930 | 328.859222 | 31.832489 |
| 8000 | 20.500174 | 965.843477 | 47.113915 |

### `dtype=float64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.911278 | 0.077275 | 0.084799 |
| 10 | 0.933582 | 0.097874 | 0.104837 |
| 100 | 1.139062 | 0.505815 | 0.444063 |
| 1000 | 1.657555 | 17.529524 | 10.575529 |
| 5000 | 15.796205 | 413.587950 | 26.182741 |
| 8000 | 31.086316 | 1181.832890 | 38.017786 |

### `dtype=int32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.900484 | 0.076021 | 0.084423 |
| 10 | 0.895114 | 0.097498 | 0.108923 |
| 100 | 0.891321 | 0.493570 | 0.553751 |
| 1000 | 2.753612 | 13.386276 | 4.861351 |
| 5000 | 9.084340 | 332.088435 | 36.556143 |
| 8000 | 14.054907 | 966.191658 | 68.744079 |

### `dtype=int64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.908947 | 0.077237 | 0.084974 |
| 10 | 0.919481 | 0.097343 | 0.105868 |
| 100 | 0.923542 | 0.506951 | 0.548920 |
| 1000 | 3.442746 | 19.153301 | 5.563379 |
| 5000 | 10.647459 | 407.381513 | 38.260914 |
| 8000 | 16.348631 | 1169.777595 | 71.552020 |

@andfoy andfoy force-pushed the improve_unique_axis branch from a0a4864 to 58d57a9 on October 3, 2024 16:58
@izaid commented Oct 13, 2024

@andfoy Many thanks for this! This unexpectedly bit me yesterday as well (cp.unique with axis ran forever), so I'm very keen to see this make it into CuPy 14!

@andfoy (Contributor, Author) commented Oct 17, 2024

These are the new benchmark values after c0fe249:

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.810840 | 0.069476 | 0.085683 |
| 10 | 1.052906 | 0.168616 | 0.160144 |
| 100 | 0.949754 | 0.414205 | 0.436119 |
| 1000 | 1.218176 | 11.760539 | 9.654218 |
| 5000 | 10.944192 | 292.644518 | 26.739711 |
| 8000 | 18.373534 | 817.939150 | 44.517246 |

### `dtype=float64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.860330 | 0.066699 | 0.077527 |
| 10 | 0.855123 | 0.083544 | 0.097698 |
| 100 | 1.001703 | 0.421561 | 0.420845 |
| 1000 | 1.847500 | 15.182318 | 8.217764 |
| 5000 | 15.697573 | 361.005896 | 22.997562 |
| 8000 | 18.669041 | 1002.279578 | 53.686721 |

### `dtype=int32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.840632 | 0.065981 | 0.078490 |
| 10 | 0.755288 | 0.082317 | 0.108988 |
| 100 | 0.975596 | 0.415043 | 0.425425 |
| 1000 | 3.039855 | 10.882150 | 3.579826 |
| 5000 | 9.463769 | 286.553839 | 30.279041 |
| 8000 | 14.670544 | 804.946319 | 54.868199 |

### `dtype=int64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 0.849860 | 0.065714 | 0.077323 |
| 10 | 0.913852 | 0.081733 | 0.089438 |
| 100 | 0.982707 | 0.422759 | 0.430199 |
| 1000 | 3.856987 | 15.393857 | 3.991161 |
| 5000 | 12.622875 | 353.585588 | 28.011494 |
| 8000 | 16.760874 | 989.089278 | 59.011796 |

mergify bot commented Oct 17, 2024

This pull request is now in conflicts. Could you fix it @andfoy? 🙏

@andfoy andfoy force-pushed the improve_unique_axis branch from c0fe249 to e768923 on October 17, 2024 18:02
@andfoy andfoy force-pushed the improve_unique_axis branch 2 times, most recently from 2dad650 to 7430ee6 on February 13, 2025 18:28
@andfoy andfoy marked this pull request as ready for review February 13, 2025 18:30
@andfoy (Contributor, Author) commented Feb 14, 2025

Sorry for the delay in posting updates. This PR now passes the tests as of 46d1b76 and is ready for review. Here are the latest benchmark results:

### `dtype=float32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.666737 | 0.089102 | 0.053459 |
| 10 | 1.731492 | 0.131966 | 0.076215 |
| 100 | 1.845882 | 0.560804 | 0.303814 |
| 1000 | 2.380428 | 12.505404 | 5.253426 |
| 5000 | 13.762609 | 306.781061 | 22.290908 |
| 7000 | 20.795560 | 661.770639 | 31.822689 |

### `dtype=float64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.594058 | 0.076425 | 0.047944 |
| 10 | 1.542880 | 0.081076 | 0.052549 |
| 100 | 1.780715 | 0.414985 | 0.233044 |
| 1000 | 2.550307 | 26.087804 | 10.229279 |
| 5000 | 20.387019 | 423.489205 | 20.772493 |
| 7000 | 16.934111 | 730.523397 | 43.139164 |

### `dtype=int32`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.545758 | 0.063973 | 0.041386 |
| 10 | 1.403104 | 0.091719 | 0.065369 |
| 100 | 1.627545 | 0.424023 | 0.260529 |
| 1000 | 5.765384 | 21.437695 | 3.718347 |
| 5000 | 9.858319 | 299.476496 | 30.378050 |
| 7000 | 16.769665 | 571.202105 | 34.061628 |

### `dtype=int64`

| Size | CuPy (ms) | NumPy (ms) | Speedup |
|:----:|:---------:|:----------:|:-------:|
| 5 | 1.622536 | 0.072514 | 0.044692 |
| 10 | 1.542374 | 0.079068 | 0.051264 |
| 100 | 1.642492 | 0.497773 | 0.303060 |
| 1000 | 5.902750 | 26.794247 | 4.539282 |
| 5000 | 19.478476 | 388.214742 | 19.930448 |
| 7000 | 25.101465 | 718.955829 | 28.641987 |

@kmaehashi (Member) commented:

@asi1024 Could you take over the review of this PR?

@asi1024 asi1024 assigned asi1024 and unassigned takagi Feb 25, 2025
@andfoy andfoy force-pushed the improve_unique_axis branch from 46d1b76 to d74765d on June 7, 2025 00:56
@andfoy (Contributor, Author) commented Jun 9, 2025

@asi1024 Sorry for the delay in applying the suggested review comments! It should now be ready.

@leofang (Member) commented Jun 11, 2025

@andfoy could you fix the linter error?

@andfoy andfoy force-pushed the improve_unique_axis branch from 203db25 to 270f3bd on June 11, 2025 17:14
@andfoy (Contributor, Author) commented Jun 11, 2025

@leofang it should be ready now!

mergify bot commented Jun 14, 2025

This pull request is now in conflicts. Could you fix it @andfoy? 🙏

@leofang (Member) commented Jun 16, 2025

Seems like there's another conflict lol

@andfoy andfoy force-pushed the improve_unique_axis branch from 270f3bd to 106efa4 on June 17, 2025 18:33