Conversation

@k-dominik (Contributor) commented Aug 28, 2024

vigra.analysis.unique still seems to be about 2x faster than numpy.unique. Also replaced the last remaining numpy.bincount calls.
Some additional black formatting...

Closes #1200

tiny benchmark:

import timeit

for meth in ["vigra.analysis.unique(data)", "numpy.unique(data)"]:
    t = timeit.timeit(
        stmt=f"_ = {meth}",
        setup='import numpy; import vigra; data = numpy.random.randint(0, 256, (1000, 500, 250), dtype="uint32")',
        number=5,
    )
    print(f"{meth}: {t:.2f}")

Checklist

  • Format code and imports.
  • Add docstrings and comments.
  • Add tests.
  • Update user documentation.
  • Reference relevant issues and other pull requests.
  • Rebase commits into a logical sequence.

@stuarteberg (Member) commented Aug 29, 2024

For better or worse, pandas is already a dependency of ilastik, so you might consider pd.unique(). On my machine, I don't observe a 2x speedup for vigra, but I do observe a 9x speedup for pandas:

import numpy as np
import pandas as pd
import vigra

def bincount_unique(a):
    return np.bincount(a.reshape(-1)).nonzero()[0]

def pandas_unique(a):
    a = np.ravel(a, order='K')
    u = pd.unique(a)
    u.sort()
    return u

data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")

# Sanity check
u1 = np.unique(data)
u2 = vigra.analysis.unique(data)
u3 = bincount_unique(data)
u4 = pandas_unique(data)
assert u1.tolist() == u2.tolist() == u3.tolist() == u4.tolist()
In [52]: %timeit np.unique(data)
    ...: %timeit vigra.analysis.unique(data)
    ...: %timeit bincount_unique(data)
    ...: %timeit pandas_unique(data)
4.97 s ± 272 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.89 s ± 81.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
847 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
531 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There's a similar story for bincount(). The equivalent function in pandas is value_counts(), but one has to add some additional manipulation to mimic the bincount() output.

Edit: And numba is even faster...

import vigra
import numpy as np
import pandas as pd
from numba import njit

def vigra_bincount(labels):
    """
    A RAM-efficient implementation of numpy.bincount() when you're dealing with uint32 labels.
    If your data isn't int64, numpy.bincount() will copy it internally -- a huge RAM overhead.
    (This implementation may also need to make a copy, but it prefers uint32, not int64.)
    """
    labels = labels.astype(np.uint32, copy=False)
    labels = np.ravel(labels, order="K").reshape((-1, 1), order="A")
    # We don't care what the 'image' parameter is, but we have to give something
    image = labels.view(np.float32)
    counts = vigra.analysis.extractRegionFeatures(image, labels, ["Count"])["Count"]
    return counts.astype(np.int64)

def pandas_bincount(labels):
    labels = np.ravel(labels, order="K")
    labels = pd.Series(labels, copy=False)
    vc = labels.value_counts()
    vc = vc.reindex(range(labels.max()+1), fill_value=0)
    return vc.values

@njit
def numba_bincount(a):
    c = np.zeros(a.max()+1, dtype=np.int64)
    for x in a.flat:
        c[x] += 1
    return c

def numpy_ufunc_bincount(a):
    a = np.ravel(a, order='K')
    counts = np.zeros(a.max()+1, np.int64)
    np.add.at(counts, a, 1)
    return counts

data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")

# Sanity check
b1 = np.bincount(data.reshape(-1))
b2 = vigra_bincount(data)
b3 = pandas_bincount(data)
b4 = numpy_ufunc_bincount(data)
b5 = numba_bincount(data)
assert b1.tolist() == b2.tolist() == b3.tolist() == b4.tolist() == b5.tolist()
In [37]: %timeit vigra_bincount(data)
    ...: %timeit np.bincount(data.reshape(-1))
    ...: %timeit pandas_bincount(data)
    ...: %timeit numpy_ufunc_bincount(data)
    ...: %timeit numba_bincount(data)
1.8 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
870 ms ± 31.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
643 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
417 ms ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
117 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(Also, it seems that np.bincount() is not too slow, so maybe it has been improved in recent years. It still loses to pandas and numba, though.)

@k-dominik (Contributor, Author) commented

Thanks Stuart :) This is lovely. The original motivation for the alternative bincount was memory consumption - I'll check whether that concern still holds.
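
A minimal sketch of how that check could look (purely illustrative, not part of this PR): run each variant in a fresh interpreter and compare the peak RSS reported by resource.getrusage. vigra_bincount is copied from your snippet above; note that resource is POSIX-only and ru_maxrss units differ between Linux (kilobytes) and macOS (bytes).

import subprocess
import sys
import textwrap

# Each variant runs in its own subprocess so the peak RSS of one run
# is not inflated by allocations left over from the others.
TEMPLATE = textwrap.dedent("""
    import resource
    import numpy as np
    import vigra

    def vigra_bincount(labels):
        labels = labels.astype(np.uint32, copy=False)
        labels = np.ravel(labels, order="K").reshape((-1, 1), order="A")
        image = labels.view(np.float32)
        counts = vigra.analysis.extractRegionFeatures(image, labels, ["Count"])["Count"]
        return counts.astype(np.int64)

    data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")
    _ = {stmt}
    # ru_maxrss is in kilobytes on Linux, bytes on macOS
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
""")

for name, stmt in [
    ("np.bincount", "np.bincount(data.reshape(-1))"),
    ("vigra_bincount", "vigra_bincount(data)"),
]:
    result = subprocess.run(
        [sys.executable, "-c", TEMPLATE.format(stmt=stmt)],
        capture_output=True, text=True, check=True,
    )
    print(f"{name}: peak RSS {int(result.stdout)} (ru_maxrss units)")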

@k-dominik (Contributor, Author) commented

Hey @stuarteberg,

I'm trying to get a more complete picture here and, without over-engineering, put together this benchmark, which includes your different methods. Maybe you can also run it on your machine and add your results: https://github.com/ilastik/unique-bincount-benchmark

Cheers
Dominik

@stuarteberg (Member) commented Aug 30, 2024

Since I am procrastinating at my real job, I took a look at the places where bincount (and variants of it) is being used. I suspect all of those use-cases could be reimplemented to use pd.Series(a).value_counts(), which would probably be more efficient. (The pandas_bincount() function above takes pains to behave exactly like np.bincount, but that's probably not necessary for the actual use-cases in ilastik.)
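
To illustrate the difference with a toy example (nothing here is taken from the ilastik code): np.bincount returns a dense array with a slot for every possible label, while value_counts() only reports the labels that actually occur, which is usually all the call sites need.

import numpy as np
import pandas as pd

a = np.array([0, 2, 2, 5], dtype=np.uint32)

print(np.bincount(a))
# -> [1 0 2 0 0 1]: one slot per label value up to a.max(), zeros included

print(pd.Series(a).value_counts())
# -> label 2 occurs twice, labels 0 and 5 occur once; only labels that are
#    actually present show up, indexed by label value rather than by position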

Looking in particular at computeSupervoxelLabels(), it looks like there's a lot of room for improvement using pandas.

For instance, maybe something like the following? (Note: it requires the contingency_table() function as a dependency, so you'd have to copy that into the code, too.)

def computeSupervoxelLabels(self, slice_=None):
    """
    For each supervoxel ID in self.SupervoxelSegmentation,
    return the label ID with the largest overlap from self.Labels.

    Returns:
        dict {sv: label}
    """
    supervoxel_mask = self.SupervoxelSegmentation.value[..., 0]
    labels = self.Labels.value[:]

    # Overlapping pairs of (sv, label), ordered by size (largest first).
    ct = (
        contingency_table(supervoxel_mask, labels)
        .rename_axis(['sv', 'label'])
        .reset_index()
    )

    # Select the largest non-zero label for each SV, unless the
    # SV has no non-zero label at all, in which case select 0.
    ct = pd.concat((ct.query('label != 0'), ct.query('label == 0')))
    ct = ct.drop_duplicates('sv')
    return dict(ct[['sv', 'label']].values)
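
For reference, a minimal pandas-based stand-in for that contingency_table() helper (the linked implementation may differ in details; this sketch is only meant to make the snippet above self-contained for experimentation):

import pandas as pd

def contingency_table(left, right):
    """
    Count overlapping voxels for each (left, right) label pair and return a
    Series indexed by (left, right), sorted largest-first, which is what the
    drop_duplicates('sv') trick above relies on.
    """
    df = pd.DataFrame({"left": left.reshape(-1), "right": right.reshape(-1)})
    counts = df.groupby(["left", "right"]).size().rename("voxel_count")
    return counts.sort_values(ascending=False)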
