You sometimes feel the need to make all of your integers positive, without losing any information. That is, you want to map all of your integers from ‘signed’ integers (e.g., -1, 1, 3, -3) to ‘unsigned’ integers (e.g., 1, 2, 6, 5). This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
Many programming languages (Go, C, etc.) allow you to just ‘cast’ the integer. For example, the following Go code will print out -1, 18446744073709551615, -1 on most systems:
var x = -1
var y = uint(x)
var z = int(y)
fmt.Println(x, y, z)
That is, you can take a small negative value, interpret it as a large integer, and then ‘recover’ your small value.
What if you want small values to remain small? Then a standard approach is to use zigzag encoding. The recipe is as follows:
- Compute twice the absolute value of your integer.
- Subtract 1 from the result when the original integer was negative.
Effectively, all non-negative integers become even integers (exactly twice as big), and all negative integers become odd integers: we interleave negative and positive integers (odd and even).
original value | zigzag value |
---|---|
-20 | 39 |
-19 | 37 |
-18 | 35 |
-17 | 33 |
-16 | 31 |
-15 | 29 |
-14 | 27 |
-13 | 25 |
-12 | 23 |
-11 | 21 |
-10 | 19 |
-9 | 17 |
-8 | 15 |
-7 | 13 |
-6 | 11 |
-5 | 9 |
-4 | 7 |
-3 | 5 |
-2 | 3 |
-1 | 1 |
0 | 0 |
1 | 2 |
2 | 4 |
3 | 6 |
4 | 8 |
5 | 10 |
6 | 12 |
7 | 14 |
8 | 16 |
9 | 18 |
10 | 20 |
11 | 22 |
12 | 24 |
13 | 26 |
14 | 28 |
15 | 30 |
16 | 32 |
17 | 34 |
18 | 36 |
19 | 38 |
20 | 40 |
In Python, you might implement the encoding and the decoding as follows:
def zigzag_encode(val):
    if val < 0:
        return -2 * val - 1
    return 2 * val

def zigzag_decode(val):
    if val & 1 == 1:
        # Python's floor division rounds toward negative infinity,
        # so -val // 2 maps 1 to -1, 3 to -2, 5 to -3, and so on.
        return -val // 2
    return val // 2
The same code might work in C/C++, but it could be more efficient to use optimized code that assumes the underlying hardware represents signed integers with two’s complement encoding (a safe assumption in 2022, and a requirement in C++20) and that bytes span 8 bits (another safe assumption)…
int fast_decode(unsigned int x) {
    return (x >> 1) ^ (-(x & 1));
}

unsigned int fast_encode(int x) {
    return (2 * x) ^ (x >> (sizeof(int) * 8 - 1));
}
Much the same code will work in Go, Rust, Swift, etc.
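For illustration, here is a minimal Go sketch of the same branchless approach (the function names are mine). Go defines right shifts of negative signed integers as arithmetic shifts and lets signed multiplication wrap, so the conversions are safe:

package main

import "fmt"

// zigzagEncode is a branchless encoder: x>>63 is 0 for non-negative x
// and -1 (all one bits) for negative x, so the xor either leaves 2*x
// alone or flips every bit.
func zigzagEncode(x int64) uint64 {
	return uint64(2*x) ^ uint64(x>>63)
}

// zigzagDecode inverts zigzagEncode with the symmetric trick.
func zigzagDecode(x uint64) int64 {
	return int64(x>>1) ^ -int64(x&1)
}

func main() {
	for _, v := range []int64{-3, -1, 0, 1, 20} {
		e := zigzagEncode(v)
		fmt.Println(v, e, zigzagDecode(e)) // e.g., -3 5 -3
	}
}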
Daniel Lemire, "Making all your integers positive with zigzag encoding," in Daniel Lemire's blog, November 25, 2022, https://lemire.me/blog/2022/11/25/making-all-your-integers-positive-with-zigzag-encoding/.
Beware that, in C, right shift with signed integers is undefined behavior.
Up until C23 it was implementation defined. I expect that C23 will fix this particular issue.
In Go or Java, there is no issue.
Right shift of signed integers was not UB, the spec text was “If E1 has a signed type and a negative value, the resulting value is implementation-defined”
Maybe point out that this works well in combination with varint encoding in protobuf.
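For illustration (an example of mine, not from the thread): Go’s standard library takes this approach, as binary.PutVarint applies a zigzag transformation to a signed value before writing the varint bytes, so small negative numbers stay short:

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	x := int64(-3)
	buf := make([]byte, binary.MaxVarintLen64)

	// PutVarint zigzag-encodes before emitting varint bytes:
	// zigzag(-3) = 5 fits in a single byte.
	n := binary.PutVarint(buf, x)
	fmt.Println(n, buf[:n]) // 1 [5]

	// A plain cast instead of zigzag turns -3 into a huge unsigned
	// value, so the varint takes the maximum ten bytes.
	m := binary.PutUvarint(buf, uint64(x))
	fmt.Println(m) // 10
}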
To add to other comments, I believe signed int overflows are formally undefined, and you can get overflows here. But it probably works in all “sane” compilers.
Is “2*x” with int x safe to do in C? Seems like you’d hit UB.
It’s not, you have to cast to unsigned to make it safe.
Can you safely cast all signed integers to unsigned integers (in current C standards)?
Yes, C99 and C11 specify in Section 6.3.1.3 (Signed and unsigned integers), 2nd paragraph: “Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.”
So with that requirement one gets bit-identical values if integers are in two’s complement – which is already assumed.
Rust has the i*::rotate_left(_) and ::rotate_right(_) functions on all integer types. So the correct encode function would be n.rotate_left(1); and decode would be n.rotate_right(1).
At least on x86_64, this maps to the respective rol/ror assembly instructions, and the compiler doesn’t need to optimize more complex code.
And speaking of vectorization in AVX-512, there is even vprold to convert many integers in a single instruction.
An alternative, if you know the range ahead of time, is to represent numbers with offsets from the low end of the range. This way, if (e.g.) you don’t have a lot of negative numbers, you don’t unfairly punish the positive ones by stealing a bit from them all.
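A minimal Go sketch of that offset (frame-of-reference) idea, with hypothetical helper names, assuming the block minimum is known up front:

package main

import "fmt"

// offsetEncode maps each value to its distance from the block minimum,
// producing non-negative values without spending a bit on the sign.
// The minimum must be stored alongside the block for decoding.
func offsetEncode(values []int64, min int64) []uint64 {
	out := make([]uint64, len(values))
	for i, v := range values {
		out[i] = uint64(v - min)
	}
	return out
}

// offsetDecode reverses offsetEncode given the same minimum.
func offsetDecode(encoded []uint64, min int64) []int64 {
	out := make([]int64, len(encoded))
	for i, e := range encoded {
		out[i] = int64(e) + min
	}
	return out
}

func main() {
	data := []int64{-3, -1, 0, 7, 8}
	enc := offsetEncode(data, -3)
	fmt.Println(enc)                   // [0 2 3 10 11]
	fmt.Println(offsetDecode(enc, -3)) // [-3 -1 0 7 8]
}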
There is another way to present this idea.
You are making the lowest bit the ‘sign’ bit.
This is a sign bit like the ‘sign’ bit in floating point, as opposed to the two’s-complement representation.
Once you see that, it is much more intuitive.
I don’t know why it isn’t presented like this more often.
That’s not quite what it is though, is it?
I cannot fathom a practical requirement or use of this algorithm. What’s the motivation?
It is commonly used as part of compression routines.
You should start with that as a motivator, instead of burying the lede.
The first paragraph is as follows…
You sometimes feel the need to make all of your integers positive, without losing any information. That is, you want to map all of your integers from ‘signed’ integers (e.g., -1, 1, 3, -3) to ‘unsigned’ integers (e.g., 1, 2, 6, 5). This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
I am sure it could be improved, but it is meant to provide motivation.
What about the case of having unequal amounts of negative and positive integers and wanting to avoid gaps?
I.e., let’s say our initial set is: -1, -2, -3, 1, 2, 3, 4, 5, 6, 7, 8
With zig-zag encoding applied we are left with
1, 3, 5, 2, 4, 6, 8, 10, 12, 14, 16.
This leaves us with the “gaps” below. These gaps make the positive integers in our initial set take up more space in their binary representation.
7, 9, 11, 13, 15.
What do compression ratios end up looking like in the varying scenarios of
1. Equal amounts of negative and positive integers
2. Unequal amounts of negative and positive integers
Can you define what you mean by “compression ratios”? The blog post does not describe a compression routine.
This being said, zigzag encoding tends to favour values that are close (in absolute value) to zero… in the sense that such values get mapped to ‘small positive integers’.
Indeed, zigzag doesn’t necessarily compress per se. At the same time, like @me (who’s that? :)) mentioned, it enables e.g. varint encoding (which, in my book, is also not compression, but hey 😛)
@moonchild also mentioned adjusting the base, aka FOR (frame of reference) encoding. If there’s a difference in positive/negative ranges, FOR indeed will create a better (smaller) range of integers. But you need to know that base upfront, which is a weakness.
In general, if someone is interested in more efficient integer compression, Daniel’s PFOR library is not the worst place to start: https://github.com/lemire/FastPFor 🙂
> This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
This is what motivated that question (from the top of the post). I’d also need a definition of what it means for a function that compresses integers to work or not work well for negative integers :).
I guess a clearer question is — if we’re talking about zigzag encoding in terms of being a solution to “dealing with” negative integers so that they can be compressed “well” by some function that compresses integers — Is zigzag encoding the best encoding method for “getting rid of” negative integers for whichever function that compresses integers well, but doesn’t compress negative integers well.
The second part of your response I think partially answers my question. And while it does map those negative values close to 0 to small positive integers, it does also map existing positive integers to larger positive integers.
My blog post absolutely does not answer this question:
Is zigzag encoding the best encoding method for “getting rid of” negative integers for whichever function that compresses integers well, but doesn’t compress negative integers well.
Of course it doesn’t. That’s why I’m here in the comments! However the blog post does directly state that there exists, or at least probably exists, some function that compresses (positive) integers well.
This is why I was quoting this piece:
> This could be useful if you have a fast function to compress integers that fails to work well for negative integers.
I’m wondering what the usefulness you mention there is like in practice. If it’s not that important of a detail it seems like it wouldn’t be included in your post. Maybe it’s not. However, I don’t think it’s a trivial detail, which is why I’m asking questions about it.
I’ll think about this some more on my own.
From what I saw, varint is probably the most obvious example of such a function, and that’s where I’ve seen zigzag often.
It would probably enhance zstd compression in this example:
https://lemire.me/blog/2022/11/28/generic-number-compression-zstd/
It is also obviously applicable to other formats: https://github.com/lemire/streamvbyte