Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compressed Integer Conversion #16

Open
bmarwell opened this issue May 16, 2019 · 4 comments
Open

Compressed Integer Conversion #16

bmarwell opened this issue May 16, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@bmarwell
Copy link

About your comment

zchunk/src/lib/compint.c

Lines 61 to 63 in c831e22

/* There *must* be a more elegant way of doing c * 128**count */
for(int f=0; f<count; f++)
c *= 128;

Congratulations, you just reinvented LEB128 (https://en.wikipedia.org/wiki/LEB128), except you reversed the continue-flag!
Thus, you just made us implement our own reversed LEB128-logic 😆!


Actual answer to our comment: Yes, there is.

On that wiki page there is a pseudo-code, which will work just fine, except that you have to reverse the bitmask logic.


By the way, I think it is easier to use bitmasks (FLAG_LAST_BYTE = 0b10000000), because it really shows us the highest bit is a flag. If you write "128" and don't even assign a constant, everyone reading the code will wonder why there is this magic number.


Java code:

  public static int toUnsignedInt(final byte[] compressedUnsignedInt) {
    int result = 0b0;
    int shift = 0;

    for (int index = 0; index < compressedUnsignedInt.length; index++) {
      final byte b = compressedUnsignedInt[index];
      final byte leadingZero = (byte) (b & ~COMPRESSED_INT_LAST_BYTE_FLAG);

      result |= leadingZero << shift;
      shift += 7;
    }

    return BigInteger.valueOf(Integer.toUnsignedLong(result)).intValueExact();
  }

As you can see, I extract the bytes from the file beforehand (it cannot be more than 5 bytes anyway for an integer). I can have the length of the byte array seperately, which is feasible in java.

@jdieter
Copy link
Member

jdieter commented May 18, 2019

Thanks, I'd never even heard of LEB128. It's unfortunate that I'm only finding out about it now, after the format is fixed, but it's probably not a huge deal.

You're right about the advantage of using a bitmask. I'll take a look at refactoring that part of my code.

Just to note, integers can be more than five bytes on 64-bit architectures.

@bmarwell
Copy link
Author

bmarwell commented May 18, 2019

Yes, I already changed it to long in my code and added some checks (10 bytes max). I think this is what you also use in your code (implicitly on 64bit systems).
Maybe the byte length of compressed integer should be fixed to max = 10 bytes. That's 2^64 - 1. This way, we would not need to fiddle with 32bit vs 64bit. I think in your code it is dependent on sizeof(size_t), which will read only up to 5 bytes on 32bit systems.

Btw: why not reserve LEB128 for a v2.0 header? ;-)

@jdieter
Copy link
Member

jdieter commented May 19, 2019

Yes, I already changed it to long in my code and added some checks (10 bytes max). I think this is what you also use in your code (implicitly on 64bit systems).
Maybe the byte length of compressed integer should be fixed to max = 10 bytes. That's 2^64 - 1. This way, we would not need to fiddle with 32bit vs 64bit. I think in your code it is dependent on sizeof(size_t), which will read only up to 5 bytes on 32bit systems.

On a 32-bit system that isn't using 64-bit size_t's, zchunk should fail to read any sizes that require more than 32 bits. But, there is a define that you can use on 32 bit to set size_t to 64 bits (I've temporarily forgotten what it is), which would allow zchunk to work with a significantly larger maximum chunk size.

In the real world, I wouldn't expect that anybody would ever need >4GB chunk sizes, but one never knows. After all, nobody needs more than 640K, right? ;)

Btw: why not reserve LEB128 for a v2.0 header? ;-)

We could, but I'm not sure what the point would be. We will always need to support reading the v1 header, so it's not like we could throw away the current code.

@bmarwell bmarwell added the enhancement New feature or request label May 20, 2019
@bmarwell
Copy link
Author

Agreed. So on a 64bit system, the maximum number of bytes will be 10. On a 32bit system, the maximum number of bytes will be 5 (sizeof(size_t) == 4 bytes) and then 4 * 8 / 7 + 1 == 5).
With five you can represent 0xFF_FF_FF_FF_8F, which is 2^64 - 1. Yes, that will definitely suffice.

For compatiblity, I still chose long in java (width = 64 bits).

You can see the code I am using here:
https://github.com/zchunk/zchunk-java/blob/master/compressedint/src/main/java/io/github/zchunk/compressedint/CompressedIntUtil.java

There are only two methods, compress and decompress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants