Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ms VisualStudio - Assembler speedups on x64 #376

Closed
wants to merge 64 commits into from

Conversation

CristiFati
Copy link

@CristiFati CristiFati commented Sep 3, 2018

More details on [SO]: CMake zlib build on Windows - (@CristiFati's answer).

Note:

  • Besides the relevant change, there are some TABs replacement by SPACEs (such mixes scratch my brain). Thy should have been in a separate commit, but I rushed a bit

Closing this one as:

GerHobbelt pushed a commit to GerHobbelt/zlib that referenced this pull request Nov 20, 2021
This issue I mentioned in madler#370. Optimization code such as crc_folding.c, deflate_quick_sse.c, fill_window_sse.c, and insert_string_sse.c were not being compiled when the compiler was MSVC because the checks for the instrincs were not being done and the HAVE_[TARGET]_INTRIN variables weren't being set. I could have simply set HAVE_[TARGET]_INTRIN variables to ON manually in the case of MSVC, but it is better this way to have one path for all the compilers (that it runs and checks some code for determination). I have just added MSVC code where necessary in the checks.

* Rename HAVE_SSE42_INTRIN to HAVE_SSE42CRC_INLINE_ASM.
* Added msvc inline asm support to insert_string_sse.c
madler and others added 28 commits September 3, 2022 01:52
This commit allows a parameter change even if the input data has
not all been compressed and copied to the application output
buffer, so long as all of the input data has been compressed to
the internal pending output buffer. This also allows an immediate
deflateParams change so long as there have been no deflate calls
since initialization or reset.
This limits hash table inserts to the available data in the window
and to the sliding window size in deflate_stored(). The hash table
inserts are deferred until deflateParams() switches to a non-zero
compression level.
This is a problem in the odd case that the second argument of
LSEEK is a larger type than off_t. Apparently MinGW defines off_t
to be 32 bits, but _lseeki64 has a 64-bit second argument.

Also undo a previous commit to permit MinGW to use _lseeki64.
Seeing a few percent speedup by using a pointer instead of an
assigned structure. This seems to help the compiler to optimize
better.
If zlib and/or gzip header processing was requested, but a header
was never provided and inflateSync was used successfully, then the
inflate state would be inconsistent, trying to compute a check
value but with no flags set. This commit sets the inflate mode to
raw in this case, since there is no other assumption that can be
made if a header was requested but never seen.
This isn't the right type anyway to assure that it contains a
pointer. That type would be intptr_t or uintptr_t. However the C99
standard says that those types are optional, so their use would not
be portable. This commit simply uses size_t or whatever configure
decided to use for size_t. That would be the same length as
ptrdiff_t, and so will work just as well. The code checks to see if
the length of the type used is the same as the length of a void
pointer, so there is already protection against the use of the
wrong type. The use of size_t (or ptrdiff_t) will almost always
work, as all modern architectures have an array size that is the
same as the pointer size. Only old segmented architectures would
have to fall back to the slower CRC-32 calculation, where the
amount of memory that can be accessed is larger than the maximum
array size.
Allegedly the behavior of memcpy() is undefined if the source
pointer is NULL, even if the number of bytes to copy is zero.
In addition, there is not sufficient gain from the inflate
assembler code to warrant its inclusion.
This bug was reported by Danilo Ramos of Eideticom, Inc. It has
lain in wait 13 years before being found! The bug was introduced
in zlib 1.2.2.2, with the addition of the Z_FIXED option. That
option forces the use of fixed Huffman codes. For rare inputs with
a large number of distant matches, the pending buffer into which
the compressed data is written can overwrite the distance symbol
table which it overlays. That results in corrupted output due to
invalid distances, and can result in out-of-bound accesses,
crashing the application.

The fix here combines the distance buffer and literal/length
buffers into a single symbol buffer. Now three bytes of pending
buffer space are opened up for each literal or length/distance
pair consumed, instead of the previous two bytes. This assures
that the pending buffer cannot overwrite the symbol table, since
the maximum fixed code compressed length/distance is 31 bits, and
since there are four bytes of pending space for every three bytes
of symbol space.
inflateSync() is used to skip invalid deflate data, which means
that the check value that was being computed is no longer useful.
This commit turns off the check value computation, and furthermore
allows a successful return if the compressed data terminated in a
graceful manner. This commit also fixes a bug in the case that
inflateSync() is used before a header is ever processed. In that
case, there is no knowledge of a trailer, so the remainder is
treated as raw.
Define the macro Z_ARM_CRC32 at compile time to use the ARMv8
(aarch64) crc32x and crc32b instructions. This code does not check
for the presence of the crc32 instructions. Those instructions are
optional for ARMv8.0, though mandatory for ARMv8.1 and later. The
use of the crc32 instructions is about ten times as fast as the
software braided calculation of the CRC-32. This can noticeably
speed up the decompression of gzip streams.
The ARM crc32 instructions will be used only if an architecture is
explicitly specified at compile time that has those instructions.
For example, -march=armv8.1-a or -march=armv8-a+crc, or if the
machine being compiled on has the instructions, -march=native.
The issue is that unztell64() does not return the correct value if
the position in the current file (in the ZIP archive) is beyond 4
GB. The cause is that unzReadCurrentFile() does not account for
pfile_in_zip_read_info->stream.total_out at line 1854 of unzip.c
wrapping around (it is a 32-bit variable). So, on line 1860
uTotalOutAfter can be *less* than uTotalOutBefore, propagating the
wraparound to uOutThis, which in turn is added to
pfile_in_zip_read_info->total_out_64. That has the effect of
subtracting 4 GB.
Also fix Makefile test target and permit added compile options.
Note intentional switch case fall throughs to avoid gcc warnings.
0xffffffffffffffffULL was causing warnings for C99 usage. The
suffix for the constant is not necessary anyway, so this commit
removes them.
@CristiFati CristiFati closed this Sep 3, 2022
@CristiFati CristiFati deleted the master branch December 11, 2024 00:44
@Neustradamus
Copy link

@CristiFati: Why have you closed your PR?

Can you check latest comments of this ticket?

@Neustradamus Neustradamus mentioned this pull request Jan 1, 2025
@CristiFati
Copy link
Author

@Neustradamus : I explained why Id id that:

  1. It's a complete mess (lots of duplicate commits, wrong source branch,...)
  2. Proper way of handling it is in the other referenced PR (which was also closed based on maintainers' comments)

I briefly went through the issue you pointed out (noticed this was mentioned) but that's it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants