Use vigra.analysis.unique instead of numpy.unique #2903
Conversation
vigra.analysis.unique seems to still be 2x faster than numpy.unique. Also replaced the last remaining numpy.bincount calls. Some additional black formatting. Closes ilastik#1200
For better or worse,

```python
import numpy as np
import pandas as pd
import vigra

def bincount_unique(a):
    return np.bincount(a.reshape(-1)).nonzero()[0]

def pandas_unique(a):
    a = np.ravel(a, order='K')
    u = pd.unique(a)
    u.sort()
    return u

data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")

# Sanity check
u1 = np.unique(data)
u2 = vigra.analysis.unique(data)
u3 = bincount_unique(data)
u4 = pandas_unique(data)
assert u1.tolist() == u2.tolist() == u3.tolist() == u4.tolist()
```

```
In [52]: %timeit np.unique(data)
    ...: %timeit vigra.analysis.unique(data)
    ...: %timeit bincount_unique(data)
    ...: %timeit pandas_unique(data)
4.97 s ± 272 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.89 s ± 81.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
847 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
531 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

There's a similar story for `np.bincount`. Edit: And

```python
import vigra
import numpy as np
import pandas as pd
from numba import njit

def vigra_bincount(labels):
    """
    A RAM-efficient implementation of numpy.bincount() when you're dealing with uint32 labels.
    If your data isn't int64, numpy.bincount() will copy it internally -- a huge RAM overhead.
    (This implementation may also need to make a copy, but it prefers uint32, not int64.)
    """
    labels = labels.astype(np.uint32, copy=False)
    labels = np.ravel(labels, order="K").reshape((-1, 1), order="A")
    # We don't care what the 'image' parameter is, but we have to give something
    image = labels.view(np.float32)
    counts = vigra.analysis.extractRegionFeatures(image, labels, ["Count"])["Count"]
    return counts.astype(np.int64)

def pandas_bincount(labels):
    labels = np.ravel(labels, order="K")
    labels = pd.Series(labels, copy=False)
    vc = labels.value_counts()
    vc = vc.reindex(range(labels.max()+1), fill_value=0)
    return vc.values

@njit
def numba_bincount(a):
    c = np.zeros(a.max()+1, dtype=np.int64)
    for x in a.flat:
        c[x] += 1
    return c

def numpy_ufunc_bincount(a):
    a = np.ravel(a, order='K')
    counts = np.zeros(a.max()+1, np.int64)
    np.add.at(counts, a, 1)
    return counts

data = np.random.randint(0, 256, (1000, 500, 250), dtype="uint32")

# Sanity check
b1 = np.bincount(data.reshape(-1))
b2 = vigra_bincount(data)
b3 = pandas_bincount(data)
b4 = numpy_ufunc_bincount(data)
b5 = numba_bincount(data)
assert b1.tolist() == b2.tolist() == b3.tolist() == b4.tolist() == b5.tolist()
```

```
In [37]: %timeit vigra_bincount(data)
    ...: %timeit np.bincount(data.reshape(-1))
    ...: %timeit pandas_bincount(data)
    ...: %timeit numpy_ufunc_bincount(data)
    ...: %timeit numba_bincount(data)
1.8 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
870 ms ± 31.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
643 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
417 ms ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
117 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

(Also, it seems that
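The `bincount`-based unique trick benchmarked above avoids sorting entirely: counting occurrences of every non-negative label and keeping the indices with nonzero counts yields the sorted unique values directly. A tiny standalone demo (numpy only):

```python
import numpy as np

def bincount_unique(a):
    # Count occurrences of each value 0..a.max(), then keep the
    # indices whose count is nonzero -- these are the sorted uniques.
    return np.bincount(a.reshape(-1)).nonzero()[0]

a = np.array([[3, 1, 3], [0, 1, 3]])
print(bincount_unique(a))  # -> [0 1 3]
```

Note this only works for non-negative integer labels, and it allocates a counts array of length `a.max() + 1`, so it is a poor fit when label values are large and sparse.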
Thanks Stuart :) This is lovely. The original motivation for the alternative bincount was memory considerations - I'll check whether these still hold.
Hey @stuarteberg, I'm trying to get a more complete picture here and, without over-engineering, came up with this benchmark, in which I included your different methods. Maybe you can also run it on your machine and add your results: https://github.com/ilastik/unique-bincount-benchmark Cheers
Since I am procrastinating at my real job, I took a look at the places where ... Looking in particular at ... For instance, maybe something like the following? (Note: Requires this function as a dependency, so you'd have to copy that into the code, too.)

```python
def computeSupervoxelLabels(self, slice_=None):
    """
    For each supervoxel ID in self.SupervoxelSegmentation,
    return the label ID with the largest overlap from self.Labels.

    Returns:
        dict {sv: label}
    """
    supervoxel_mask = self.SupervoxelSegmentation.value[..., 0]
    labels = self.Labels.value[:]

    # Overlapping pairs of (sv, label), ordered by size (largest first).
    ct = (
        contingency_table(supervoxel_mask, labels)
        .rename_axis(['sv', 'label'])
        .reset_index()
    )

    # Select the largest non-zero label for each SV, unless the
    # SV has no non-zero label at all, in which case select 0.
    ct = pd.concat((ct.query('label != 0'), ct.query('label == 0')))
    ct = ct.drop_duplicates('sv')
    return dict(ct[['sv', 'label']].values)
```
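The `contingency_table` dependency referenced above is not shown in this thread. A minimal, hypothetical pandas sketch of the behavior the snippet relies on - counts of co-occurring (sv, label) pairs, largest first - might look like this (the real helper may differ in details):

```python
import numpy as np
import pandas as pd

def contingency_table(left_vol, right_vol):
    # Hypothetical sketch: for each (left, right) label pair that
    # co-occurs voxel-wise, count the overlapping voxels, and return
    # a Series indexed by (left, right), sorted largest count first,
    # as computeSupervoxelLabels() above expects.
    df = pd.DataFrame({
        'left': np.asarray(left_vol).reshape(-1),
        'right': np.asarray(right_vol).reshape(-1),
    })
    return df.groupby(['left', 'right']).size().sort_values(ascending=False)

sv = np.array([1, 1, 1, 2, 2])
labels = np.array([7, 7, 0, 7, 8])
print(contingency_table(sv, labels))
```

The descending sort matters: `drop_duplicates('sv')` in the snippet keeps the first row per supervoxel, so each SV gets its largest-overlap label.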