Port deflate_quick to ARM #205

Closed
sebpop opened this issue Sep 17, 2018 · 5 comments

@sebpop
Contributor

sebpop commented Sep 17, 2018

In #199 @Dead2 wrote:

If only we could port deflate_quick to ARM too.

@sebpop
Contributor Author

sebpop commented Jan 15, 2019

deflate_quick is only called at compression level 1 on x86_64.
There is too much code to implement for aarch64/aarch32 before the release.
We may revisit this after the release.

@Dead2 Dead2 added optimization help wanted Anyone can contribute labels Jan 17, 2019
@Myriachan

ARM doesn't have anything similar to the pcmpestri SSE4.2 instruction used by the compare258 function of deflate_quick. This instruction is a huge part of the performance: it's used to do a miniature strstr-like operation to find matching patterns in the dictionary.

@nmoinvaz
Member

nmoinvaz commented Feb 9, 2020

It seems like we could replace match_len = compare258(s->window + s->strstart, s->window + s->strstart - dist) with match_len = longest_match(s, hash_head) if SSE4.2 is not supported. If that is the case, perhaps we could extend compare258 to deflate_slow, deflate_medium, etc. if SSE4.2 is supported. Sort of like a functable.longest_match. What do you think?
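
To make the functable idea concrete, here is a minimal sketch of runtime dispatch through a function pointer. Every name in it (functable_sketch, functable_init_sketch, compare258_generic_sketch, compare258_accel_sketch) is hypothetical and chosen for illustration; it is not the project's actual functable, and the "accelerated" variant is only a placeholder that forwards to the generic loop where an SSE4.2 or NEON implementation would go.

```c
#include <stdint.h>

/* Signature shared by every compare258 variant in this sketch. */
typedef uint32_t (*compare258_fn)(const unsigned char *src0,
                                  const unsigned char *src1);

/* Portable fallback: count matching leading bytes, capped at 258. */
static uint32_t compare258_generic_sketch(const unsigned char *src0,
                                          const unsigned char *src1) {
    uint32_t len = 0;
    while (len < 258 && src0[len] == src1[len])
        len++;
    return len;
}

/* Placeholder for an accelerated variant (assumption: a real build would
 * put an SSE4.2 or NEON implementation here). It forwards to the generic
 * loop so the sketch stays portable. */
static uint32_t compare258_accel_sketch(const unsigned char *src0,
                                        const unsigned char *src1) {
    return compare258_generic_sketch(src0, src1);
}

/* Hypothetical function table: callers go through the pointer and never
 * name a specific variant. */
struct functable_sketch {
    compare258_fn compare258;
};

/* Pick a variant once, based on a feature flag detected elsewhere. */
static struct functable_sketch functable_init_sketch(int has_fast_compare) {
    struct functable_sketch ft;
    ft.compare258 = has_fast_compare ? compare258_accel_sketch
                                     : compare258_generic_sketch;
    return ft;
}
```

With a table like this, the call site would read match_len = ft.compare258(s->window + s->strstart, s->window + s->strstart - dist) on every architecture, and only the initialization would depend on the CPU.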

@mtl1979
Collaborator

mtl1979 commented Feb 10, 2020

compare258 is essentially a compare of 256 bytes followed by a compare of the trailing 2 bytes. The only difference is that on each iteration of the loop it needs to find the first non-matching index in the vector and calculate the length from it. If we assume that all lanes are equal most of the time, there should be no performance penalty when the trailing bytes are not equal. Trailing bytes can be handled using uint8_t, uint16_t, uint32_t and uint64_t.
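
The structure described above can be sketched in portable C. This is only an illustration under stated assumptions: the name compare258_sketch is hypothetical, the 64-bit word stands in for a vector register, and the first mismatching byte inside a word is found with a plain byte scan rather than a count-trailing-zeros instruction so that the code does not depend on endianness.

```c
#include <stdint.h>
#include <string.h>

/* Returns the number of leading bytes (0..258) that are equal in src0 and
 * src1. Both buffers must have at least 258 readable bytes. */
static uint32_t compare258_sketch(const unsigned char *src0,
                                  const unsigned char *src1) {
    uint32_t len = 0;

    /* Main part: 256 bytes compared 8 at a time. */
    while (len < 256) {
        uint64_t a, b;
        memcpy(&a, src0 + len, sizeof(a)); /* safe unaligned load */
        memcpy(&b, src1 + len, sizeof(b));
        if (a != b) {
            /* A mismatch exists within these 8 bytes, so this scan
             * terminates before leaving the current word. */
            while (src0[len] == src1[len])
                len++;
            return len;
        }
        len += 8;
    }

    /* Trailing 2 bytes, handled with byte-sized compares. */
    if (src0[256] != src1[256])
        return 256;
    if (src0[257] != src1[257])
        return 257;
    return 258;
}
```

The common case, where every word matches, costs one load pair and one compare per 8 bytes; the byte scan only runs once, in the word that contains the first difference.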

@nmoinvaz
Member

nmoinvaz commented Jun 8, 2020

This is now complete.

@nmoinvaz nmoinvaz closed this as completed Jun 8, 2020