Conversation

@k-dominik (Contributor) commented Aug 28, 2024

vigra.analysis.unique still seems to be about 2x faster than numpy.unique. Also replaced the last remaining numpy.bincount calls.
Some additional black formatting...

Closes #1200

tiny benchmark:

import timeit

for meth in ["vigra.analysis.unique(data)", "numpy.unique(data)"]:
    t = timeit.timeit(
        stmt=f"_ = {meth}",
        setup='import numpy; import vigra; data = numpy.random.randint(0, 256, (1000, 500, 250), dtype="uint32")',
        number=5,
    )
    print(f"{meth}: {t:.2f}")

Checklist

  • Format code and imports.
  • Add docstrings and comments.
  • Add tests.
  • Update user documentation.
  • Reference relevant issues and other pull requests.
  • Rebase commits into a logical sequence.

@stuarteberg (Member) commented Aug 29, 2024

For better or worse, pandas is already a dependency of ilastik, so you might consider pd.unique(). On my machine, I don't observe a 2x speedup for vigra, but I do observe a 9x speedup for pandas:

import numpy as np
import pandas as pd
import vigra

def bincount_unique(a):
    return np.bincount(a.reshape(-1)).nonzero()[0]

def pandas_unique(a):
    a = np.ravel(a, order='K')
    u = pd.unique(a)
    u.sort()
    return u

data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")

# Sanity check
u1 = np.unique(data)
u2 = vigra.analysis.unique(data)
u3 = bincount_unique(data)
u4 = pandas_unique(data)
assert u1.tolist() == u2.tolist() == u3.tolist() == u4.tolist()
In [52]: %timeit np.unique(data)
    ...: %timeit vigra.analysis.unique(data)
    ...: %timeit bincount_unique(data)
    ...: %timeit pandas_unique(data)
4.97 s ± 272 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.89 s ± 81.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
847 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
531 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There's a similar story for bincount(). The equivalent function in pandas is value_counts(), but one has to add some additional manipulation to mimic the bincount() output.

Edit: And numba is even faster...

import vigra
import numpy as np
import pandas as pd
from numba import njit

def vigra_bincount(labels):
    """
    A RAM-efficient implementation of numpy.bincount() when you're dealing with uint32 labels.
    If your data isn't int64, numpy.bincount() will copy it internally -- a huge RAM overhead.
    (This implementation may also need to make a copy, but it prefers uint32, not int64.)
    """
    labels = labels.astype(np.uint32, copy=False)
    labels = np.ravel(labels, order="K").reshape((-1, 1), order="A")
    # We don't care what the 'image' parameter is, but we have to give something
    image = labels.view(np.float32)
    counts = vigra.analysis.extractRegionFeatures(image, labels, ["Count"])["Count"]
    return counts.astype(np.int64)

def pandas_bincount(labels):
    labels = np.ravel(labels, order="K")
    labels = pd.Series(labels, copy=False)
    vc = labels.value_counts()
    vc = vc.reindex(range(labels.max()+1), fill_value=0)
    return vc.values

@njit
def numba_bincount(a):
    c = np.zeros(a.max()+1, dtype=np.int64)
    for x in a.flat:
        c[x] += 1
    return c

def numpy_ufunc_bincount(a):
    a = np.ravel(a, order='K')
    counts = np.zeros(a.max()+1, np.int64)
    np.add.at(counts, a, 1)
    return counts

data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")

# Sanity check
b1 = np.bincount(data.reshape(-1))
b2 = vigra_bincount(data)
b3 = pandas_bincount(data)
b4 = numpy_ufunc_bincount(data)
b5 = numba_bincount(data)
assert b1.tolist() == b2.tolist() == b3.tolist() == b4.tolist() == b5.tolist()
In [37]: %timeit vigra_bincount(data)
    ...: %timeit np.bincount(data.reshape(-1))
    ...: %timeit pandas_bincount(data)
    ...: %timeit numpy_ufunc_bincount(data)
    ...: %timeit numba_bincount(data)
1.8 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
870 ms ± 31.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
643 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
417 ms ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
117 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(Also, it seems that np.bincount() is not too slow, so maybe it has been improved in recent years. It still loses to pandas and numba, though.)

@k-dominik (Contributor, Author) commented

Thanks Stuart :) This is lovely. The original motivation for the alternative bincount was memory consumption - I'll check whether that concern still holds.
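
A minimal sketch of how that check could look (purely illustrative, not part of this PR): run each variant in a fresh interpreter and compare the peak RSS reported by resource.getrusage. vigra_bincount is copied from your snippet above; note that resource is POSIX-only and ru_maxrss units differ between Linux (kilobytes) and macOS (bytes).

import subprocess
import sys
import textwrap

# Each variant runs in its own subprocess so the peak RSS of one run
# is not inflated by allocations left over from the others.
TEMPLATE = textwrap.dedent("""
    import resource
    import numpy as np
    import vigra

    def vigra_bincount(labels):
        labels = labels.astype(np.uint32, copy=False)
        labels = np.ravel(labels, order="K").reshape((-1, 1), order="A")
        image = labels.view(np.float32)
        counts = vigra.analysis.extractRegionFeatures(image, labels, ["Count"])["Count"]
        return counts.astype(np.int64)

    data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")
    _ = {stmt}
    # ru_maxrss is in kilobytes on Linux, bytes on macOS
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
""")

for name, stmt in [
    ("np.bincount", "np.bincount(data.reshape(-1))"),
    ("vigra_bincount", "vigra_bincount(data)"),
]:
    result = subprocess.run(
        [sys.executable, "-c", TEMPLATE.format(stmt=stmt)],
        capture_output=True, text=True, check=True,
    )
    print(f"{name}: peak RSS {int(result.stdout)} (ru_maxrss units)")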

@k-dominik (Contributor, Author) commented

Hey @stuarteberg,

I'm trying to get a more complete picture here and, without over-engineering, put together this benchmark, which includes your different methods. Maybe you can also run it on your machine and add your results: https://github.com/ilastik/unique-bincount-benchmark

Cheers
Dominik

@stuarteberg (Member) commented Aug 30, 2024

Since I am procrastinating at my real job, I took a look at the places where bincount (and variants of it) is being used. I suspect all of those use-cases could be reimplemented to use pd.Series(a).value_counts(), which would probably be more efficient. (The pandas_bincount() function above takes pains to behave exactly like np.bincount, but that's probably not necessary for the actual use-cases in ilastik.)
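
To illustrate the difference with a toy example (nothing here is taken from the ilastik code): np.bincount returns a dense array with a slot for every possible label, while value_counts() only reports the labels that actually occur, which is usually all the call sites need.

import numpy as np
import pandas as pd

a = np.array([0, 2, 2, 5], dtype=np.uint32)

print(np.bincount(a))
# -> [1 0 2 0 0 1]: one slot per label value up to a.max(), zeros included

print(pd.Series(a).value_counts())
# -> label 2 occurs twice, labels 0 and 5 occur once; only labels that are
#    actually present show up, indexed by label value rather than by position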

Looking in particular at computeSupervoxelLabels(), it looks like there's a lot of room for improvement using pandas.

For instance, maybe something like the following? (Note: it requires the contingency_table() function as a dependency, so you'd have to copy that into the code, too.)

def computeSupervoxelLabels(self, slice_=None):
    """
    For each supervoxel ID in self.SupervoxelSegmentation,
    return the label ID with the largest overlap from self.Labels.

    Returns:
        dict {sv: label}
    """
    supervoxel_mask = self.SupervoxelSegmentation.value[..., 0]
    labels = self.Labels.value[:]

    # Overlapping pairs of (sv, label), ordered by size (largest first).
    ct = (
        contingency_table(supervoxel_mask, labels)
        .rename_axis(['sv', 'label'])
        .reset_index()
    )

    # Select the largest non-zero label for each SV, unless the
    # SV has no non-zero label at all, in which case select 0.
    ct = pd.concat((ct.query('label != 0'), ct.query('label == 0')))
    ct = ct.drop_duplicates('sv')
    return dict(ct[['sv', 'label']].values)
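
For reference, a minimal pandas-based stand-in for that contingency_table() helper (the linked implementation may differ in details; this sketch is only meant to make the snippet above self-contained for experimentation):

import pandas as pd

def contingency_table(left, right):
    """
    Count overlapping voxels for each (left, right) label pair and return a
    Series indexed by (left, right), sorted largest-first, which is what the
    drop_duplicates('sv') trick above relies on.
    """
    df = pd.DataFrame({"left": left.reshape(-1), "right": right.reshape(-1)})
    counts = df.groupby(["left", "right"]).size().rename("voxel_count")
    return counts.sort_values(ascending=False)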
