HIGGS Quantization Support #34997
Conversation
cc @MekkCyber
Failed tests look like a problem on the runner's end.
Thanks for integrating this new quantization method so fast! I left some comments, and don't forget to also update the documentation so that users know how to use it!
Co-authored-by: Marc Sun <[email protected]>
Hey @BlackSamorez, thanks for adding this quantization method so quickly! I added some very small nits.
```docker
RUN python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git

# Add flute-kernel and fast_hadamard_transform for quantization testing
RUN python3 -m pip install --no-cache-dir flute-kernel==0.2.6
```
The docker image will be deployed on an instance with CUDA 11.8, but on the FLUTE GitHub I noticed you need to specify https://flute-ai.github.io/whl/cu118 in that case.
Thanks, updated.
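For context, a sketch of what the CUDA-11.8-specific install could look like, pointing pip at the wheel index mentioned above; the exact flag placement follows the pattern in the FLUTE README and is an assumption here:

```docker
# Hypothetical variant of the RUN line above for a CUDA 11.8 host,
# installing from the FLUTE wheel index instead of PyPI:
RUN python3 -m pip install --no-cache-dir flute-kernel==0.2.6 -i https://flute-ai.github.io/whl/cu118
```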
```python
nb_fbgemm_linear = 0
for module in model.modules():
    if isinstance(module, HiggsLinear):
        nb_fbgemm_linear += 1
```
I think you meant `nb_higgs_linear` 😉
Sure. Fixed
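For reference, the renamed counter can also be written compactly; a minimal sketch, assuming `HiggsLinear` is the quantized layer class added in this PR:

```python
# Count HIGGS-quantized linear layers after module replacement.
nb_higgs_linear = sum(isinstance(m, HiggsLinear) for m in model.modules())
```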
```python
for m in module_tree:
    parent = parent._modules[m]
return parent
```
Sorry if I'm mistaken, but I don't believe we use this function anywhere.
Removed the unused function. Thanks!
Co-authored-by: Mohamed Mekkouri <[email protected]>
@SunMarc Hi! Is there a chance you could take a look at it this week? We wanted to release it before NeurIPS.
I'll do that now. When is the deadline? After my review, I still need this to be reviewed by a core maintainer.
There isn't a hard deadline, but it would be very nice to have this merged this week or early next week.
SunMarc left a comment:
Thanks for your work! Just a few nits.
docs/source/en/quantization/higgs.md (Outdated)
## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

tokenizer.decode(model.generate(
    **tokenizer("Hi,", return_tensors="pt").to(model.device),
    temperature=0.5,
    top_p=0.80,
)[0])
```
Small suggestion: what would be even better is to link a Colab notebook as a demo.
Sadly, T4 is not among the supported GPUs yet, so no Colab demo for now.
Makes sense, thanks! It would be nice to add a section specifying which GPUs work with it.
# HIGGS

HIGGS is a 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).

Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960) and its [library](https://github.com/HanGuo97/flute).
Small suggestion: if you can share pre-quantized models in an HF org and link them here, that would be nice as well!
Co-authored-by: Marc Sun <[email protected]>
Pre-quantized a bunch of models (including …
Gentle ping @ArthurZucker

Can you fix the CI with …
SunMarc left a comment:
Thanks for adding new features!
```python
model._modules[name].weight.data = module(
    torch.eye(in_features, device=module.scales.device, dtype=module.scales.dtype)
).T.contiguous()
```
Smart way to perform dequantization. This could be added to all quant methods, no? cc @MekkCyber
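For intuition, the trick works because a bias-free linear layer applied to the identity matrix returns its effective weight transposed: `linear(I) == I @ W.T == W.T`. A minimal sketch with a plain `nn.Linear` standing in for the quantized module (illustrative only, not the PR's code):

```python
import torch
import torch.nn as nn

# A plain nn.Linear stands in for a quantized layer such as HiggsLinear.
in_features, out_features = 8, 4
linear = nn.Linear(in_features, out_features, bias=False)

with torch.no_grad():
    # Feeding the identity through the layer materializes its effective
    # (dequantized) weight, transposed; .T recovers the weight itself.
    recovered = linear(torch.eye(in_features)).T.contiguous()

assert torch.allclose(recovered, linear.weight)
```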
ArthurZucker left a comment:
Thanks a lot for your hard work and for introducing this new method!
Super small nit and @SunMarc will merge 🤗
Things to recheck later: …
SunMarc left a comment:
Thanks for iterating! I will merge this when the CI is green.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Quantization tests are nightly, right?
That's right!
HIGGS 0-Shot Quantization
HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).
Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960) and its [library](https://github.com/HanGuo97/flute).
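For intuition only, here is a toy sketch of the two ingredients named above. This is not the HIGGS/FLUTE implementation (which uses vector grids and fused GPU kernels); the scalar grid, group size, and function name are illustrative assumptions:

```python
import torch
from scipy.linalg import hadamard

def higgs_like_quantize(w: torch.Tensor, grid: torch.Tensor, group: int = 64) -> torch.Tensor:
    """Toy version: rotate weight groups with an orthonormal Hadamard transform
    (entries become approximately Gaussian), snap each entry to the nearest
    grid point, then rotate back to obtain the dequantized weights."""
    H = torch.tensor(hadamard(group), dtype=w.dtype) / group**0.5  # orthonormal: H @ H.T == I
    x = w.reshape(-1, group) @ H                                   # Hadamard preprocessing
    idx = (x.unsqueeze(-1) - grid).abs().argmin(dim=-1)            # nearest grid point per entry
    return (grid[idx] @ H.T).reshape(w.shape)                      # inverse rotation

# Example: snap a random weight matrix to a crude 8-point (3-bit) scalar grid.
w = torch.randn(256, 256)
w_hat = higgs_like_quantize(w, torch.linspace(-2.5, 2.5, 8))
print((w - w_hat).pow(2).mean())  # quantization MSE
```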
This PR adds support for HIGGS+FLUTE into `transformers`, allowing for low-error 0-shot quantization and fast LLM inference.

Fixes # (issue)
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.