Conversation

@BlackSamorez
Contributor

@BlackSamorez BlackSamorez commented Nov 28, 2024

HIGGS 0-Shot Quantization

HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper: [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).

Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960) and its [kernel library](https://github.com/HanGuo97/flute).

This PR adds support for HIGGS+FLUTE to transformers, allowing for low-error 0-shot quantization and fast LLM inference.
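
For intuition, here is a minimal, self-contained sketch of the two ingredients named above. This is not the HIGGS implementation: the grid below is an illustrative placeholder rather than an MSE-optimal one, and HIGGS proper uses vector quantization over small tuples of entries, while this sketch rounds scalars.

```python
# Conceptual sketch only, not the actual HIGGS algorithm.
import torch
from scipy.linalg import hadamard


def toy_higgs_quantize(weight: torch.Tensor, grid: torch.Tensor, group_size: int = 256):
    # Split the weight matrix into contiguous groups of `group_size` entries
    # (assumes weight.numel() is divisible by group_size).
    groups = weight.reshape(-1, group_size)

    # Randomized Hadamard transform: random sign flips followed by an
    # orthonormal Hadamard rotation (H / sqrt(n)), which makes the entries
    # of each group approximately Gaussian.
    h = torch.tensor(hadamard(group_size), dtype=weight.dtype) / group_size**0.5
    signs = torch.where(torch.rand(group_size) < 0.5, -1.0, 1.0).to(weight.dtype)
    rotated = (groups * signs) @ h

    # Normalize each group, then round every entry to the nearest grid point.
    scales = rotated.abs().mean(dim=1, keepdim=True)
    codes = (rotated / scales).unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
    return codes, scales, signs  # enough to reconstruct an approximation


# Example: quantize a random matrix to a crude 8-point (3-bit) scalar grid.
w = torch.randn(1024, 1024)
grid = torch.linspace(-2.5, 2.5, steps=8)
codes, scales, signs = toy_higgs_quantize(w, grid)
```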

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

cc @SunMarc @MekkCyber

@SunMarc SunMarc requested a review from MekkCyber November 28, 2024 14:43
@SunMarc
Member

SunMarc commented Nov 28, 2024

cc @MekkCyber

@BlackSamorez
Contributor Author

Failed tests look like a problem on the runner's end

@SunMarc left a comment (Member)

Thanks for integrating this new quantization method so fast! I left some comments; don't forget to also update the documentation so that users know how to use it!

@MekkCyber
Contributor

Hey @BlackSamorez, thanks for adding this quantization method so quickly! I added some very small nits.

```dockerfile
RUN python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git

# Add flute-kernel and fast_hadamard_transform for quantization testing
RUN python3 -m pip install --no-cache-dir flute-kernel==0.2.6
```

@MekkCyber (Contributor)

The docker image will be deployed on an instance with CUDA 11.8, but on the FLUTE GitHub I noticed you need to specify https://flute-ai.github.io/whl/cu118 in that case.
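
For reference, a hedged sketch of what the adjusted install line might look like; this is an assumption based on the FLUTE README's index-URL instructions, not necessarily the exact change that was committed:

```dockerfile
# Hypothetical sketch: point pip at the cu118 wheel index when installing
# flute-kernel on a CUDA 11.8 image (the FLUTE README is authoritative here;
# --extra-index-url would keep PyPI available for other dependencies).
RUN python3 -m pip install --no-cache-dir flute-kernel==0.2.6 -i https://flute-ai.github.io/whl/cu118
```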

@BlackSamorez (Contributor Author)

Thanks, updated.

```python
nb_fbgemm_linear = 0
for module in model.modules():
    if isinstance(module, HiggsLinear):
        nb_fbgemm_linear += 1
```

@MekkCyber (Contributor)

I think you meant `nb_higgs_linear` 😉

@BlackSamorez (Contributor Author)

Sure. Fixed

```python
for m in module_tree:
    parent = parent._modules[m]
return parent
```

@MekkCyber (Contributor)

Sorry if I'm mistaken, but I don't believe we use this function anywhere.

@BlackSamorez (Contributor Author)

Removed the unused function. Thanks!

@MekkCyber MekkCyber requested a review from SunMarc December 3, 2024 11:13
@BlackSamorez
Contributor Author

@SunMarc Hi! Is there a chance you could take a look at it this week? We wanted to release it before NeurIPS.

@SunMarc
Member

SunMarc commented Dec 5, 2024

> @SunMarc Hi! Is there a chance you could take a look at it this week? We wanted to release it before NeurIPS.

I'll do that now. When is the deadline? After my review, I still need this to be reviewed by a core maintainer.

@BlackSamorez
Contributor Author

> When is the deadline?

There isn't a hard deadline, but it would be very nice to have this merged this week or early next week.

@SunMarc left a comment (Member)

Thanks for your work! Just a few nits.

Comment on lines 23 to 41
## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

tokenizer.decode(model.generate(
    **tokenizer("Hi,", return_tensors="pt").to(model.device),
    temperature=0.5,
    top_p=0.80,
)[0])
```

@SunMarc (Member)

Small suggestion: What would be even better is to link a colab notebook as a demo.

@BlackSamorez (Contributor Author)

Sadly, T4 is not among the supported GPUs yet, so no Colab demo for now.

@SunMarc (Member)

Makes sense, thanks! It would be nice to add a section specifying which GPUs work with it.

Comment on lines +17 to +22
```md
# HIGGS

HIGGS is a 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).

Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960), and its [library](https://github.com/HanGuo97/flute).
```

@SunMarc (Member)

Small suggestion: if you can share pre-quantized models in an HF org and link them here, that would be nice as well!

@SunMarc SunMarc requested a review from ArthurZucker December 5, 2024 15:16
@BlackSamorez
Contributor Author

Pre-quantized a bunch of models (including Llama-3.3-70B-Instruct) and added a link to the collection to the docs.

@SunMarc
Member

SunMarc commented Dec 9, 2024

Gentle ping @ArthurZucker

@SunMarc
Member

SunMarc commented Dec 13, 2024

Can you fix the CI with `make style` for the quality checks and potentially rebase the PR, @BlackSamorez?

@SunMarc left a comment (Member)

Thanks for adding new features!

Comment on lines +646 to +648
```python
model._modules[name].weight.data = module(
    torch.eye(in_features, device=module.scales.device, dtype=module.scales.dtype)
).T.contiguous()
```

@SunMarc (Member)

Smart way to perform dequantization. This could be added to all quant methods, no? cc @MekkCyber
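
For readers following along: the trick works because a linear layer computes `x @ W.T`, so feeding it an identity matrix returns `W.T` without ever touching the kernel's packed weight format. A minimal generic sketch, assuming (as with `HiggsLinear` above) that the quantized module exposes `scales` to pick the device and dtype:

```python
import torch
import torch.nn as nn


def dequantize_linear(module: nn.Module, in_features: int) -> torch.Tensor:
    """Recover the dense weight of a quantized linear layer: module(I)
    computes I @ W.T = W.T, so transposing the output yields W."""
    eye = torch.eye(in_features, device=module.scales.device, dtype=module.scales.dtype)
    with torch.no_grad():
        return module(eye).T.contiguous()
```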

@ArthurZucker left a comment (Collaborator)

Thanks a lot for your hard work and for introducing this new method!
Super small nit and @SunMarc will merge 🤗

@BlackSamorez
Contributor Author

Things to recheck later:

  • `make style` doesn't actually fix `python utils/custom_init_isort.py --check_only` errors. I had to run `python utils/custom_init_isort.py` manually.
  • pillow (PIL) is needed for `make quality` but isn't included in `pip install -e .[quality]`.

@SunMarc left a comment (Member)

Thanks for iterating! I will merge this when the CI is green.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BlackSamorez
Contributor Author

Quantization tests are nightly, right?

@SunMarc
Member

SunMarc commented Dec 23, 2024

That's right!

@SunMarc SunMarc merged commit 64c05ee into huggingface:main Dec 23, 2024
25 checks passed