A fast alternative to the modulo reduction

Suppose you want to pick an integer at random in a set of N elements. Your computer has functions to generate random 32-bit integers; how do you transform such numbers into indexes smaller than N? Suppose you have a hash table with a capacity N. Again, you need to transform your hash values (typically 32-bit or 64-bit integers) down to an index smaller than N. Programmers often get around this problem by making sure that N is a power of two, but that is not always ideal.

We want a map that is as fair as possible for an arbitrary integer N. That is, ideally, we would want exactly 2³²/N values mapped to each value in the range {0, 1, …, N – 1} when starting from all 2³² 32-bit integers.

Sadly, we cannot have a perfectly fair map if 2³² is not divisible by N. But we can have the next best thing: we can require that there be either floor(2³²/N) or ceil(2³²/N) values mapped to each value in the range.

If N is small compared to 2³², then this map could be considered as good as perfect.

The common solution is to do a modulo reduction: x mod N. (Since we are computer scientists, we define the modulo reduction to be the remainder of the division, unless otherwise stated.)

uint32_t reduce(uint32_t x, uint32_t N) {
  return x % N;
}

How can I tell that it is fair? Well. Let us just run through the values of x starting with 0. You should be able to see that the modulo reduction takes on the values 0, 1, …, N – 1, 0, 1, … as you increment x. Eventually, x arrives at its last value (2³² – 1), at which point the cycle stops, leaving the values 0, 1, …, (2³² – 1) mod N with ceil(2³²/N) occurrences, and the remaining values with floor(2³²/N) occurrences. It is a fair map with a bias for smaller values.
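The counting argument can be checked by brute force on a scaled-down word size. The sketch below assumes 16-bit inputs instead of 32-bit ones so that the loop stays cheap; the helper name count_mod_hits is made up for this illustration.

```c
#include <stdint.h>

// Scaled-down check: how many of the 2^16 16-bit values x satisfy x % N == k?
// (count_mod_hits is a hypothetical helper, not part of the original post.)
uint32_t count_mod_hits(uint32_t N, uint32_t k) {
  uint32_t count = 0;
  for (uint32_t x = 0; x < (UINT32_C(1) << 16); x++) {
    if (x % N == k) count++;
  }
  return count;
}
```

With N = 7, we have 2^16 = 7 * 9362 + 2, so the outputs 0 and 1 should each be hit ceil(2^16/7) = 9363 times and the outputs 2 to 6 should each be hit floor(2^16/7) = 9362 times.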

It works, but a modulo reduction involves a division, and divisions are expensive. Much more expensive than multiplications. A single 32-bit division on a recent x64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles.

There are fancy tricks to “precompute” a modulo reduction so that it can be transformed into a couple of multiplications as well as a few other operations, as long as N is known ahead of time. Your compiler will make use of them if N is known at compile time. Otherwise, you can use a software library or work out your own formula.
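As a rough illustration of such a precomputation, here is one known formulation for 32-bit operands. It is only a sketch: the names compute_magic and fast_mod are made up for this example, and it assumes a compiler that supports unsigned __int128 (GCC and Clang do; it is not standard C).

```c
#include <stdint.h>

// Precompute once per divisor d (d >= 1): M = ceil(2^64 / d), modulo 2^64.
uint64_t compute_magic(uint32_t d) {
  return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
}

// Computes a % d from the precomputed M with two multiplications, no division.
// Relies on unsigned __int128, a compiler extension (an assumption here).
uint32_t fast_mod(uint32_t a, uint64_t M, uint32_t d) {
  uint64_t lowbits = M * a;  // wraps modulo 2^64 on purpose
  return (uint32_t)(((unsigned __int128)lowbits * d) >> 64);
}
```

The division in compute_magic is paid once; every later call to fast_mod with the same d reuses M.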

But it turns out that you can do even better! That is, there is an approach that is easy to implement, and provides just as good a map, without the same performance concerns.

Assume that x and N are 32-bit integers, and consider the 64-bit product x * N. Then (x * N) div 2³² is in the range {0, 1, …, N – 1}, and it is a fair map.

uint32_t reduce(uint32_t x, uint32_t N) {
  return ((uint64_t) x * (uint64_t) N) >> 32 ;
}

Computing (x * N) div 2³² is very fast on a 64-bit processor. It is a multiplication followed by a shift. On a recent Intel processor, I expect that it has a latency of about 4 cycles and a throughput of at least one call every 2 cycles.
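The same idea should carry over to 64-bit words when a 128-bit product is available. The following is a minimal sketch, assuming the unsigned __int128 extension offered by GCC and Clang; the name reduce64 is chosen here for illustration.

```c
#include <stdint.h>

// 64-bit analogue of reduce(): maps a 64-bit word x to [0, N).
// unsigned __int128 is a GCC/Clang extension, not standard C (assumption).
uint64_t reduce64(uint64_t x, uint64_t N) {
  return (uint64_t)(((unsigned __int128)x * N) >> 64);
}
```

For instance, reduce64(UINT64_MAX, 10) yields 9, the largest value in the range, and reduce64(0, 10) yields 0.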

So how fast is our map compared to a 32-bit modulo reduction?

To test it out, I have implemented a benchmark where you repeatedly access random indexes in an array of size N. The indexes are obtained either with a modulo reduction or our approach. On a recent Intel processor (Skylake), I get the following number of CPU cycles per access:

modulo reduction	fast range
8.1	2.2

So it is nearly four times faster! Not bad.

As usual, my code is freely available.

What can this be good for? Well… if you have been forcing your arrays and hash tables to have power-of-two capacities to avoid expensive divisions, you may be able to use the fast range map to support arbitrary capacities without too much of a performance penalty. You can also generate random numbers in a range faster, which matters if you have a very fast random number generator.
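For a hash table, the map only behaves well when the hash values spread over the whole 32-bit range, so a typical pattern is to hash first and reduce second. Here is a minimal sketch under that assumption; mix32 (the 32-bit finalizer from MurmurHash3) and bucket_of are stand-ins chosen for this example, not something prescribed by the post.

```c
#include <stdint.h>

// Hypothetical mixer for illustration: the 32-bit finalizer from MurmurHash3.
// Any hash whose outputs cover the full 32-bit range could be used instead.
uint32_t mix32(uint32_t h) {
  h ^= h >> 16;
  h *= UINT32_C(0x85ebca6b);
  h ^= h >> 13;
  h *= UINT32_C(0xc2b2ae35);
  h ^= h >> 16;
  return h;
}

// Bucket index in [0, capacity) for a key: hash first, then reduce.
uint32_t bucket_of(uint32_t key, uint32_t capacity) {
  return (uint32_t)(((uint64_t)mix32(key) * capacity) >> 32);
}
```

Feeding small raw keys (say 12 or 20) directly into the multiply-and-shift map with capacity 7 would always give 0, because the product never reaches 2³²; the mixing step is what spreads such keys over the buckets.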

So how can I tell that the map is fair?

By multiplying by N, we take integer values in the range [0, 2³²) and map them to multiples of N in [0, N * 2³²). By dividing by 2³², we map all multiples of N in [0, 2³²) to 0, all multiples of N in [2³², 2 * 2³²) to one, and so forth. To check that this is fair, we just need to count the number of multiples of N in intervals of length 2³². This count must be either ceil(2³²/N) or floor(2³²/N).

Suppose that the first value in the interval is a multiple of N: that is clearly the scenario that maximizes the number of multiples in the interval. How many will we find? Exactly ceil(2³²/N). Indeed, if you draw sub-intervals of length N, then every complete interval begins with a multiple of N and if there is any remainder, then there will be one extra multiple of N. In the worst case scenario, the first multiple of N appears at position N – 1 in the interval. In that case, we get floor(2³²/N) multiples. To see why, again, draw sub-intervals of length N. Every complete sub-interval ends with a multiple of N.

This completes the proof that the map is fair.
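The proof can be sanity-checked by exhaustive counting on a smaller word size. The sketch below assumes 16-bit inputs and N no larger than 2^16 (so that the 32-bit product cannot overflow); reduce16 and count_fast_hits are hypothetical names used only here.

```c
#include <stdint.h>

// Scaled-down version of the map: 16-bit inputs, so (x * N) div 2^16.
// Assumes N <= 2^16 so that the product fits in 32 bits.
uint32_t reduce16(uint16_t x, uint32_t N) {
  return ((uint32_t)x * N) >> 16;
}

// Counts how many of the 2^16 possible inputs are mapped to output k.
uint32_t count_fast_hits(uint32_t N, uint32_t k) {
  uint32_t count = 0;
  for (uint32_t x = 0; x < (UINT32_C(1) << 16); x++) {
    if (reduce16((uint16_t)x, N) == k) count++;
  }
  return count;
}
```

With N = 7, every output is hit either floor(2^16/7) = 9362 or ceil(2^16/7) = 9363 times. Unlike with the modulo reduction, the outputs that receive the extra value are not the smallest ones: here they are 0 and 3.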

For fun, we can be slightly more precise. We have argued that the number of multiples was maximized when a multiple of N appears at the very beginning of the interval of length 2³². At the end, we get an incomplete interval of length 2³² mod N. If instead of having the first multiple of N appear at the very beginning of the interval, it appeared at index 2³² mod N, then there would not be room for the incomplete subinterval at the end. This means that whenever a multiple of N occurs before 2³² mod N, then we shall have ceil(2³²/N) multiples, and otherwise we shall have floor(2³²/N) multiples.

Can we tell which outcomes occur with frequency floor(2³²/N) and which occur with frequency ceil(2³²/N)? Yes. Suppose we have an output value k. We need to find the location of the first multiple of N no smaller than k 2³². This location is ceil(k 2³² / N) N – k 2³² which we just need to compare with 2³² mod N. If it is smaller, then we have a count of ceil(2³²/N), otherwise we have a count of floor(2³²/N).
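This rule translates directly into code. The sketch below uses a scaled-down word size of 16 bits (an assumption made so the numbers stay small) and a hypothetical helper named predicted_count.

```c
#include <stdint.h>

// Predicted number of 16-bit inputs mapped to output k, following the rule:
// locate the first multiple of N at or after k * 2^16 and compare its offset
// within the interval with 2^16 mod N.
uint32_t predicted_count(uint32_t N, uint32_t k) {
  const uint64_t W = UINT64_C(1) << 16;      // number of possible inputs
  uint64_t start = (uint64_t)k * W;          // k * 2^16
  uint64_t first = (start + N - 1) / N * N;  // ceil(k * 2^16 / N) * N
  uint64_t offset = first - start;
  uint32_t floor_count = (uint32_t)(W / N);
  return offset < W % N ? floor_count + 1 : floor_count;
}
```

For N = 7, we have 2^16 mod 7 = 2. The offsets for k = 0, 1, …, 6 are 0, 5, 3, 1, 6, 4, 2, so only k = 0 and k = 3 get the larger count of 9363.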

You can correct the bias with a rejection, see my post on fast shuffle functions.
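One way to sketch such a rejection step, along the lines of the interval-generation paper listed under further reading, is to look at the low 32 bits of the product and retry when they fall below 2³² mod N. The generator next32 below is a simple xorshift stand-in chosen for this illustration (it never outputs 0, so a real uniform generator should be substituted in practice), and the name bounded is also made up here.

```c
#include <stdint.h>

// Hypothetical stand-in generator (xorshift32). The state must be nonzero.
static uint32_t rng_state = 2463534242u;
uint32_t next32(void) {
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 17;
  rng_state ^= rng_state << 5;
  return rng_state;
}

// Integer in [0, N), N >= 1: multiply-shift plus rejection to remove the bias.
uint32_t bounded(uint32_t N) {
  uint64_t m = (uint64_t)next32() * N;
  uint32_t low = (uint32_t)m;            // low 32 bits of the product
  if (low < N) {                         // only then can a rejection occur
    uint32_t threshold = (0u - N) % N;   // equals 2^32 mod N
    while (low < threshold) {
      m = (uint64_t)next32() * N;
      low = (uint32_t)m;
    }
  }
  return (uint32_t)(m >> 32);
}
```

The division needed to compute the threshold is only executed when the low bits are smaller than N, which is rare when N is small compared to 2³².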

Useful code: I published a C/C++ header on GitHub that you can use in your projects.

Further reading:

Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation (to appear)
Google Tensorflow adopted this approach through a contribution by David Andersen (see the commit Switching the presized cuckoo map from using strict mod and Google+ post).
What is arguably the best Open Source Chess engine, Stockfish, also adopted this approach.
The technique described in this blog post is in use within Microsoft Arriba.
math/rand: speed up Int31n with multiply/shift instead of modulo (golang issue 16213), runtime: speed up fastrand() % n (golang commit)
Agner Fog, Pseudo-Random Number Generators for Vector Processors and Multicore Processors, Journal of Modern Applied Statistical Methods, 2015.
Kenneth A. Ross, Efficient Hash Probes on Modern Processors, IBM Research Report RC24100 (W0611-039) November 8, 2006

(Update: I have made the proof more intuitive following a comment by Kendall Willets.)

Daniel Lemire, "A fast alternative to the modulo reduction," in Daniel Lemire's blog, June 27, 2016, https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

72 thoughts on “A fast alternative to the modulo reduction”

Marcel Popescu says:

June 27, 2016 at 3:37 pm

I’m missing something… are you saying that x % N and (x * N) div 2^32 are equivalent? This is trivially false for, e.g., 1 % 7 vs (1 * 7) div 2^32.

Reply
1. Daniel Lemire says:
  
  June 27, 2016 at 4:41 pm
  
  I am not saying that x % N is equal to (x * N) div 2^32 . This is obviously false, as you indicate. I am saying that they are both fair maps from the set of all 32-bit integers down to integers in [0,N).
  
  Reply
  1. Leonid Boytsov says:
    
    June 27, 2016 at 6:14 pm
    
    It is not, unless you have numbers uniformly distributed from 0 to numeric_limits::max()
    
    Reply
    1. Daniel Lemire says:
      
      June 27, 2016 at 7:23 pm
      
      If you have, say, 31-bit integers, instead of 32-bit integers, as might happen with a call to rand, you can adapt the map by shifting by 31 instead of 32.
      
      Reply
  2. Leonid Boytsov says:
    
    June 27, 2016 at 6:16 pm
    
    PS: In particular, if you use it for hashing and your numbers tend to be small, then all your hash values will be horribly biased downward. Perhaps reducing div 2^32 to div 2^(some smaller power of two) may help.
    
    Reply
    1. Daniel Lemire says:
      
      June 27, 2016 at 7:28 pm
      
      A good hash function should be regular. That is, it should be so that all integers are “equally likely” (given a random input). If your hash values do not cover the whole 32-bit range, you need to adapt the map, probably as you describe.
      
      Reply
      1. Leonid Boytsov says:
        
        June 27, 2016 at 8:24 pm
        
        What is the cost of computing a good hash value? 🙂 Perhaps, it should be included in the overall computation time. This may (drastically?) reduce the difference between two approaches.
        
        I also suspect that in many cases, given a list of IDs you can get reasonable results by doing just ID % . In this case, you can get away without applying a hashing transformation. With your trick, you do need this.
        
        For example, if your numbers are smaller than 2^20 (which is quite likely) and N < 2^16, then it looks like all your hash values will be zero.
        
        Reply
        
        Leonid Boytsov says:
        
        June 27, 2016 at 8:25 pm
        
        by doing just ID * some perhaps prime number. Angle brackets were removed by HTML 🙂
        
        Reply
        
        Daniel Lemire says:
        
        June 27, 2016 at 9:59 pm
        
        What is the cost of computing a good hash value?
        
        Of course, you are correct that there are many cases where the modulo reduction will not be a bottleneck. In such cases, this fast map is useless.
        
        However, there are real-world examples where people set their capacity to a power of two to improve performance. A latency of a couple of dozens of cycles (for a division) is not great.
        
        I also suspect that in many cases, given a list of IDs you can get reasonable results by doing just ID % N.
        
        Yes. Moreover, as alluded to in my post, if N is known ahead of time, you can avoid the modulo reduction entirely and replace it with a faster function. See Hacker’s Delight.
        
        Reply
        
        Leonid Boytsov says:
        
        June 28, 2016 at 12:12 am
        
        Agree. Good point about N known in advance.
        
        Reply
        
        hlide fremen says:
        
        February 17, 2020 at 1:56 pm
        
        However, there are real-world examples where people set their capacity to a power of two to improve performance.
        
        You mean: ID % (2^N)? If so, chances are the compiler turns this expression into ID & ((2^N) – 1), which is even faster than the fast alternative with a multiplication followed by a shift, isn’t it?
        
        Reply
        
        Daniel Lemire says:
        
        February 17, 2020 at 2:17 pm
        
        @hlide
        
        Yes, though being limited by powers of two can be wasteful.
        
        Reply
      2. Maxim Egorushkin says:
        
        June 21, 2017 at 9:55 am
        
        Would byte-swapping x help to preserve lower-order bits, e.g.:
        
        uint32_t reduce2(uint32_t x, uint32_t N) {
          return static_cast<uint64_t>(bswap_32(x)) * N >> 32;
        }
        
        Reply
Leonid Boytsov says:

June 27, 2016 at 6:29 pm

On second thought, if modulo reduction slows you down, you may want to try SIMD and bulk conversion. You can do something like:
__m128 recip = _mm_set1_ps(1/float(N));
__m128 mult = _mm_set1_ps(N);
__m128 to_convert = _mm_loadu_ps(…);
__m128 tmp = _mm_mul_ps(mult, _mm_floor_ps(_mm_mul_ps(to_convert, recip)));
__m128i res = _mm_cvtps_epi32(_mm_floor_ps(_mm_sub_ps(to_convert, tmp)));
Perhaps an additional SIMD min is required to ensure the value is < N. Anyway, it seems worth trying such a solution.

Reply
1. Daniel Lemire says:
  
  June 27, 2016 at 7:45 pm
  
  Yes, Leonid, I suspect that you are quite right. There are cases where using floating-point numbers, especially in conjunction with SIMD instructions, could lead to great results.
  
  Reply
2. moonchild says:
  
  January 21, 2023 at 9:17 am
  
  As long as your reciprocal is rounded down (not hard to ensure when you create it; can decrement if desirable), you don’t need a min, because float math is monotonic. However, you now only get 24 bits of (reduced) hash, which is not that much.
  
  Reply
Preston L. Bannister says:

June 27, 2016 at 7:22 pm

To rephrase Daniel:

We want to generate an index (I) chosen uniformly from the range 0..N-1.

We have a 32-bit random value (R), and assume a uniform distribution.

When N is a power-of-two, then the good/easy/fair answer: mask off the higher bits of R. (I have used this more than a few times, as a programmer.)

When N is not a power-of-two, the easy answer is:
I = R mod N
Two problems: the modulus operation is somewhat expensive, and the distribution is not *quite* even (that last wrap-around) – though likely good enough.

Daniel’s solution is (in two steps):
R = R * N
We now have a value in the range 0 to N * 2^32.
I = R >> 32
We now have a value in the range 0 to N. The two operations (multiply and shift) are almost certainly cheaper than the modulus operation.

I suspect the values are not *quite* uniform, but likely good enough. (How uniformly distributed are the multiplication products in each of the N buckets?)

Might write a test program to measure. 🙂

Reply
1. Daniel Lemire says:
  
  June 27, 2016 at 7:31 pm
  
  If N is small compared to 2^32, then I submit to you that you won’t be able to measure any bias.
  
  Reply
KWillets says:

June 28, 2016 at 1:07 am

It’s a fairly simple proof. You stretch the [0, 2^32) random variable range out to multiples of N in [0, N*2^32] and then map [0,2^32) to 0, [2^32, 2*2^32) to 1, etc. by integer division (>>32). The number of multiples of N in each range is floor or ceil(2^32/N).

Reply
1. Daniel Lemire says:
  
  June 28, 2016 at 2:40 am
  
  Thanks. I had a relatively short proof that used elementary number theory (a few lines), but it slightly got out of hand when I tried to make it accessible to people who may not master these concepts. A more direct approach like you suggest would have been better! Kudos!
  
  Reply
  1. Daniel Lemire says:
    
    June 28, 2016 at 3:32 pm
    
    I have updated my blog post with a more direct proof.
    
    Reply
    1. KWillets says:
      
      June 28, 2016 at 6:07 pm
      
      Thanks — I actually thought it through hoping to find a different method, and ended up with a proof of the same one. Oh well.
      
      Reply
Maxim Egorushkin says:

June 29, 2016 at 12:54 pm

Note that expression `((uint64_t) x * (uint64_t) N) >> 32` on x86-64 compiles into a 64-bit multiplication that yields a 128-bit result, followed by the shift.

With a bit of inline assembly it can use a shorter 32-bit mul instruction that yields a 64-bit result in EDX:EAX. The required result is in EDX, no shift instruction is required.

Reply
1. Daniel Lemire says:
  
  June 29, 2016 at 3:19 pm
  
  That’s a very good point. I am somewhat disappointed that the compiler fails to exploit this optimization.
  
  Reply
  1. Daniel Lemire says:
    
    June 29, 2016 at 3:39 pm
    
    For the record, something like this could do the job…
    
    uint32_t asm_highmult32to32(uint32_t u, uint32_t v) {
      uint32_t answer;
      // Unsigned multiply: EDX:EAX = EAX * v; the high 32 bits end up in EDX.
      __asm__("mull %[v]"
              : "=d"(answer), "+a"(u)
              : [v] "r"(v)
              : "cc");
      return answer;
    }
    
    But it is unlikely to help performance as is. You’d probably need to rewrite a larger chunk of code using assembly.
    
    Reply
    1. Francois Saint-Jacques says:
      
      August 12, 2016 at 2:28 am
      
      For some reason, gcc does the correct thing with `uint128_t` operands.
      
      Reply
      1. Francois Saint-Jacques says:
        
        August 12, 2016 at 2:29 am
        
        See https://godbolt.org/g/kZ91Yu .
        
        Reply
    2. Maxim Egorushkin says:
      
      June 21, 2017 at 10:43 am
      
      mulx instruction would be ideal here: takes 2 assembly instructions to produce the value in eax, no flags affected:
      
      uint32_t reduce3(uint32_t x, uint32_t N) {
        uint32_t r;
        asm("movl %[N], %%edx\n\t"
            "mulxl %[x], %[r], %[r]"
            : [r] "=r"(r)
            : [x] "r"(x), [N] "r"(N)
            : "edx");
        return r;
      }
      
      Reply
KWillets says:

October 19, 2016 at 6:40 pm

Doing this for floats (random within [0,x)) is also fairly easy since it just requires subtracting 32 from the exponent, which ldexp can do, eg ldexp(rand(), -32) gives a float in [0,1) (assuming rand() is 32-bit here).

There’s some loss of precision as floats discard lower-order bits automatically, unless we cast to a longer mantissa first.

Reply
Pablo says:

February 28, 2017 at 6:48 pm

Since I don’t read math proofs daily, this took me a bit of thinking to parse. At the end I realized all this was a variation of:

rand(1) * N

In order:

( rand(2^32) * N ) / 2^32

is equivalent to

( rand(2^32) / 2^32 ) * N

which is equivalent to

floor( (float)[0,1) * N )

Reply
1. Daniel Lemire says:
  
  February 28, 2017 at 7:23 pm
  
  Yes, for some fuzzy notion of “equivalent”. For example, rand(2^32) / 2^32 is not equivalent to picking a number in [0,1) fairly.
  
  Reply
Chad Harrington says:

March 21, 2017 at 2:37 am

This is a godsend for FPGAs. Modulus by an arbitrary number is expensive in both time and space in hardware. Shifts are free and nearly all modern FPGAs have hard multiplier blocks. This is a perfect solution. Thanks !

Reply
Felix Chern says:

May 26, 2017 at 12:03 am

I created a similar approach to fast range named “fast mod and scale”. I was not aware of fast range until I started to survey alternatives for writing my blog post. http://www.idryman.org/blog/2017/05/03/writing-a-damn-fast-hash-table-with-tiny-memory-footprints/
I’m happy to see that other people are also interested in this problem. Even though the solution is just one or two lines of code, it is still an important problem to solve!

It first masks the hash value to the next power of two, then does the fast range (I named it scaling) described in your post. It costs a few more cycles, but the overhead is small thanks to modern CPU pipelining. The major use of fast mod and scale (my method) is to make hash table probing as easy to implement as using straight mod. For hash tables that don’t use probing, like cuckoo hashing or standard chaining, fast range is sufficient.

More analysis and implementation details are in the blog I posted above.

Reply
Joost VandeVondele says:

December 21, 2017 at 6:57 am

This also made it in the world’s strongest open source chess engine (stockfish) :

https://github.com/official-stockfish/Stockfish/commit/2198cd0524574f0d9df8c0ec9aaf14ad8c94402b

Reply
1. Daniel Lemire says:
  
  December 22, 2017 at 10:55 pm
  
  That’s amazing.
  
  Reply
TQTrung says:

July 12, 2018 at 4:23 pm

I’m using an x64 laptop running Windows 10 and tried the fast modulo function without success. The result is always zero. Why? Did I miss something?

Reply
1. Daniel Lemire says:
  
  July 12, 2018 at 4:55 pm
  
  Can you share the code that gives you always zero?
  
  Reply
  1. TQTrung says:
    
    July 12, 2018 at 4:58 pm
    
    #include <iostream>
    
    using namespace std;
    
    uint32_t reduce(uint32_t x, uint32_t N)
    {
    return (((uint64_t)x * (uint64_t)N) >> 32);
    }
    
    int main()
    {
    cout<< reduce(12, 7);
    return 0;
    }
    I tried it on my laptop (x64) and on “www.onlinegdb.com”, but I got the same result: zero. Did I miss something?
    
    Reply
    1. Daniel Lemire says:
      
      July 12, 2018 at 5:02 pm
      
      What do you expect reduce(12,7) to do?
      
      Reply
      1. Chuck #1 says:
        
        August 4, 2021 at 7:32 pm
        
        Yet wasn’t 7 chosen by a fair single cubic die roll?
        
        Reply
TQTrung says:

July 12, 2018 at 5:04 pm

I tried many different values such as 20, 33, and 5 (modulo 7) and I always got zero.

Reply
1. Daniel Lemire says:
  
  July 12, 2018 at 8:12 pm
  
  Did you try random 32-bit numbers? Please do. Here is a sample program…
  
  https://gist.github.com/lemire/b9596313593dcb6aa311f5e5aa60f517
  
  Reply
TQTrung says:

July 13, 2018 at 1:52 am

Wow! Thank you very much!

Reply
Thorham says:

August 2, 2019 at 12:19 pm

Isn’t this basically fixed point arithmetic where the bottom 32 bits are the fractional part?

Reply
1. Daniel Lemire says:
  
  August 2, 2019 at 2:56 pm
  
  Isn’t this basically fixed point arithmetic where the bottom 32 bits
  are the fractional part?
  
  Basically, yes. It is really not hard conceptually.
  
  Reply
Cyril says:

October 7, 2019 at 9:48 am

Unless x * N > (1<<shiftAmount) the result is 0. I don’t know how you are computing fairness but your post is misleading.

In fact, you’re returning the quotient of the product by 2 raised to power of shiftAmount => reduce = floor[(x*N) / 2^shiftAmount]

Since, in a hash table, you are unlikely to use a power of 2 for the bucket’s count, this is not going to work well, and you’ll get a lot of collision if you’re using a power of 2 for the divisor. Since N can not be in [0 ; powerOfTwoRange] this is useless.

If you know that N is in [0; 2^32] range, nice. But that’s condition that rarely happens in real life.

Let’s say you have a hash table to have a more compact storage of ID => Value, then this function will produce a huge amount of collision on the low bucket’s indexes and almost no collision on the last buckets.

Obviously, in benchmark where N are as probable to happen in [0 2^32] range, this method is as fair as a modulo. Yet, these rarely happens in reality and the cost to handle collisions in a hash table is many times more important than the cost of the modulo.

You’d probably get a better result by SWARing the product before dividing (that is, p = x * N p = p ^ (p>>32) and so on) then masking by 0xFFFFFFFF. But at some point, I doubt you’ll beat a modulo operation.

Reply
1. Daniel Lemire says:
  
  October 8, 2019 at 6:12 pm
  
  Note that the trick described in this blog post is used in many real-life systems, some of which you are maybe relying upon. It definitively works.
  
  If you know that N is in [0; 2^32] range, nice. But that’s condition that rarely happens in real life.
  
  You are expected to use this trick, like the modulo reduction, after hashing. You can hash to the [0, 2^32) range.
  
  But at some point, I doubt you’ll beat a modulo operation
  
  You need to hash your objects in all cases. You should never “just” use the modulo reduction. What you typically do is hash and then reduce. You can either reduce using the modulo reduction or using this trick.
  
  Most hash functions, just like most random number generators, have output sizes that are powers of two.
  
  Reply
L. C. says:

December 27, 2019 at 5:40 pm

This works great! Thank you!

Reply
Piotr Grochowski says:

January 30, 2020 at 8:16 pm

OK, so what if the RNG output is 64-bit? For example, when compiling xoshiro256** to a 32-bit executable.

Reply
1. Piotr Grochowski says:
  
  January 31, 2020 at 11:16 am
  
  So, you can’t do this method on 64-bit output on 32-bit platforms because there are no 128-bit integers on 32-bit platforms!
  
  Reply
  1. Daniel Lemire says:
    
    January 31, 2020 at 2:20 pm
    
    On 32-bit systems, there are many optimizations you cannot do, so this technique should probably be the least of your worries.
    
    Reply
Anonymous says:

April 3, 2020 at 9:56 pm

Why not do real modulo though?

X % Y = (BITAND(CEILING(X*256/Y),255)*Y)>>8

The reciprocal 256/Y can be approximate

Reply
1. Daniel Lemire says:
  
  April 3, 2020 at 10:07 pm
  
  Yes, you can do this as well, though it is slightly more complicated. Please see
  
  Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries
  Software: Practice and Experience 49 (6), 2019
  https://arxiv.org/abs/1902.01961
  
  Reply
Bo says:

June 26, 2020 at 2:10 pm

Thanks. This is amazingly fast in my case of a bloomfilter probe function. Even faster than the SIMD implementation.

Reply
1. Bo says:
  
  June 27, 2020 at 2:29 am
  
  Correction. There was a bug in my implementation. The fast modulo is close to the performance of “power-of-two”. The SIMD (AVX512) method is still the fastest.
  
  Reply
Pieter Wuille says:

June 20, 2021 at 9:18 pm

Hello Daniel,

Thank you for this technique; I have used it in a number of projects.

Today I was wondering about a generalization: say you have a hash output x (in range [0,2^32) like here), but want to extract two independent ranged values from it, with ranges N1 and N2 (where N1*N2 < 2^32). And it seems there is a very clean solution:

out1 = (x * N1) >> 32;
x2 = (uint32_t)(x * N1);
out2 = (x2 * N2) >> 32;

Effectively the multiplication x*N1 leaves us with a 64-bit number whose high bits are the first output, and the low bits are the remaining entropy – conveniently scaled to range [0,2**32) again, ready to be used for a second reduction.

It can be applied iteratively, though the quality of the extracted numbers will degrade as more entropy is extracted.
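The two-step extraction described in this comment can be written out with explicit 64-bit widening. This is only a sketch of the idea as stated above; the function names extract_first and extract_second are hypothetical.

```c
#include <stdint.h>

// First output in [0, N1): the high 32 bits of the 64-bit product x * N1.
uint32_t extract_first(uint32_t x, uint32_t N1) {
  return (uint32_t)(((uint64_t)x * N1) >> 32);
}

// Second output in [0, N2): the low 32 bits of x * N1 are treated as the
// leftover 32-bit word and reduced again with N2.
uint32_t extract_second(uint32_t x, uint32_t N1, uint32_t N2) {
  uint32_t x2 = (uint32_t)((uint64_t)x * N1);  // low 32 bits of the product
  return (uint32_t)(((uint64_t)x2 * N2) >> 32);
}
```

For example, with x = 0xC0000000, N1 = 3 and N2 = 8, the product x * N1 is 2^33 + 2^30, so the first output is 2 and the leftover word 2^30 gives a second output of 2.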

Reply
1. Daniel Lemire says:
  
  June 20, 2021 at 9:48 pm
  
  @Pieter It does not seem unreasonable.
  
  Reply
  1. Pieter Wuille says:
    
    June 21, 2021 at 1:14 am
    
    An even better construction:
    
    start with state = x
    iterate over ranges N[i] of numbers to be extracted:
    
    mask = ~N[i] & (N[i]-1)
    tmp = state * range
    output[i] = tmp >> 32
    state = (state + (out & mask)) & 0xFFFFFFFF
    
    This preserves all entropy; every iteration merely permutes the state. This means that all produced numbers are individually as uniformly distributed as extracting directly from x. Furthermore, it moves the “unused” portion of the entropy to the top of the state, so that it gets preferentially used in the next step. Some testing with small numbers seems to indicate that this actually produces optimal joint distributions of subsequently produced numbers (i.e. the distribution of output[k…k+j] is as well distributed as extracting a number with range N[k] * N[k+1] * … * N[k+j] directly).
    
    Reply
    1. Daniel Lemire says:
      
      June 21, 2021 at 1:25 am
      
      What is “out” in your pseudocode?
      
      Reply
      1. Pieter Wuille says:
        
        June 21, 2021 at 1:28 am
        
        Oops, output[i].
        
        Reply
      2. Pieter Wuille says:
        
        June 21, 2021 at 1:35 am
        
        I made another typo.
        
        It is:
        
        mask = ~N[i] & (N[i]-1)
        tmp = state * range
        output[i] = tmp >> 32
        state = (tmp + (output[i] & mask)) & 0xFFFFFFFF
        
        Reply
2. Pieter Wuille says:
  
  July 16, 2021 at 11:00 pm
  
  I wrote a bit more extensively about this idea: https://github.com/sipa/writeups/tree/main/uniform-range-extraction
  
  Reply
  1. Daniel Lemire says:
    
    July 17, 2021 at 3:40 pm
    
    Thanks for the link.
    
    Reply
Kip Ingram says:

March 18, 2022 at 12:10 am

For N=255 you can also get rid of that multiplication by 255 – it can be replaced by an eight bit shift left and a subtraction of the original value.
A nice trick for N=255 is to note that 1/255 = 2^-8 + 2^-16 + 2^-24 + 2^-32 + … If you’re interested in handling 32-bit values and you’re on a 64-bit processor, you can get a lot of mileage out of shifts and adds built around this. I did one today that uses about 12 lines of x86_64 assembly, just shifts and add/subtract. The “wart” is that you do have to truncate that series somewhere, and consequently exact multiples of 255 will return 255, rather than 0. But one below goes to 254, and one above goes to 1, so it’s just a matter of catching the 255 results and replacing them with 0.

Sure beats the heck out of idiv.

Reply
vbextreme says:

October 7, 2022 at 7:26 am

I have tried it, but it is not working: it always returns 0.
https://godbolt.org/z/P51j1P3K3

Reply
1. Daniel Lemire says:
  
  October 7, 2022 at 12:35 pm
  
  Try to multiply numbers that cover the full range of values.
  
  Reply
Taras Tsugrii says:

August 9, 2023 at 7:33 pm

Thanks for the insightful article, as always! This technique is mentioned in “Efficient Hash Probes on Modern Processors” but only in passing, without a proof, analysis, or implementation. I’d only suggest emphasizing a little more the fact that it’s not an alternative to “x % N”, as some may assume, only to find out that for sequential identifiers with relatively small Ns they are getting 0s in production 🙂

Reply
Raymond Toy says:

April 6, 2024 at 1:37 am

Just thought you might like to know that CMU Common Lisp (cmucl.org) implemented this idea in Dec 1997, by Douglas Crosher. The commit log just says it’s doing a multiply instead of a remainder operation because it’s faster.

It’s nice to see from this blog that this method is fair so there’s no real reason not to do this.

Reply
neurlang says:

May 5, 2024 at 11:17 am

Just a heads up. This trick is used in Hashtron’s modular hash.

https://github.com/neurlang/classifier/blob/master/hash/hash.go

Reply
Sad Clouds says:

May 19, 2024 at 7:16 am

As always, it all depends on various factors. It is not always fast, I would say it is a somewhat faster alternative to modulo reduction on some hardware platforms. For example, take various AArch64 processors where multiplication, division and modulo operations are multi-cycle instructions with higher latencies and lower throughput. Modulo by power of 2 is going to be much faster, since it can be implemented as bitwise AND operation.

For example, ARM Cortex-A72 as found on Raspberry Pi 4, has dual integer pipeline units and can execute up to two 64-bit integer addition, subtraction, or bitwise operations per clock cycle, where 64-bit integer multiplication operations have throughput of 1/3 instructions per clock cycle. So when running at max frequency of 1500 MHz, I get the following throughput metrics:

Int64 bitwise AND : 2665 x 10^6 instructions per second
Int64 multiplication: 499 x 10^6 instructions per second
Int64 modulo : 363 x 10^6 instructions per second

As you can see, multiplication is only marginally faster than modulo, and both are far slower than bitwise AND (modulo by a power of 2). Also, a nice property of modulo by a power of 2 is that with a sequential series (file descriptors and other sequential IDs) there is no need for a hashing function and the reduction will always result in zero collisions. If you don’t mind using power-of-2 hash tables, then this is going to be even faster than the “fast alternative to the modulo reduction” as described here.

Adios amigos!

Reply
1. Daniel Lemire says:
  
  May 19, 2024 at 2:22 pm
  
  A modulo by a power of two is difficult to beat in raw speed, but that’s not always what people need or want.
  
  Reply
Dirk Reinbach says:

August 6, 2024 at 1:47 pm

I had a use case for this optimization today and I was very grateful for your post 🙂

Reply
