Releases: mobiusml/hqq
v0.2.2
HQQ v0.2.1
- `HQQLinear.state_dict()` support for non-initialized layers; mainly used in huggingface/transformers#33141.
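For context, here is a minimal sketch of what `HQQLinear.state_dict()` looks like, assuming the `BaseQuantizeConfig`/`HQQLinear` constructor described in the hqq README; layer sizes and settings are illustrative, and the non-initialized case mentioned above is the one the transformers integration relies on.

```python
# A minimal sketch (illustrative, not from the release notes), assuming the
# BaseQuantizeConfig / HQQLinear API described in the hqq README.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096, bias=False)          # toy layer
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantize the layer, then inspect the state_dict used for serialization;
# the v0.2.1 change makes this also work on layers that have not been
# initialized yet, which the transformers integration needs.
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float16, device="cuda")
print(list(hqq_layer.state_dict().keys()))
```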
HQQ v0.2.0
- Bug fixes
- Safetensors support for transformers via huggingface/transformers#33141
- `quant_scale`, `quant_zero`, and `offload_meta` are now deprecated. You can still use them with the hqq lib, but you can't use them with the transformers lib (see the sketch below).
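A minimal sketch of how this looks in practice, assuming the transformers `HqqConfig`/`from_pretrained` API and the hqq `BaseQuantizeConfig` signature from the README; the model id and settings are illustrative only.

```python
# A minimal sketch, assuming the transformers HqqConfig / from_pretrained API;
# model id and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, HqqConfig
from hqq.core.quantize import BaseQuantizeConfig

# transformers side: only the basic options, the deprecated flags are not passed
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
model.save_pretrained("llama2-7b-hqq-4bit")  # serialized as safetensors

# hqq lib side: the deprecated flags are still accepted here
hqq_only_config = BaseQuantizeConfig(
    nbits=4, group_size=64, quant_zero=True, quant_scale=False, offload_meta=False
)
```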
v.0.1.8
HQQ v0.1.7.post3
- Enable CPU quantization and runtime (see the sketch after this list)
- `_load_state_dict` fix
- fix `extra_repr` in `HQQLinear`
- fix `from_quantized` bugs
- fix `|` typing
- fix 3-bit `axis=1` slicing bug
- add 5/6-bit for testing
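A minimal sketch of the CPU path, assuming the same `HQQLinear` constructor as in the earlier sketch; sizes and dtypes are illustrative.

```python
# A minimal sketch of CPU quantization and runtime, assuming the HQQLinear API
# from the hqq README; sizes/dtypes are illustrative.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(1024, 1024, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantization happens on CPU, and the forward pass (dequant + matmul) also runs on CPU
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float32, device="cpu")
y = hqq_layer(torch.randn(2, 1024))
print(y.shape)  # torch.Size([2, 1024])
```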
HQQ v0.1.7.post2
- Various bug fixes, especially with `AutoHQQHFModel` and the patching logic, to make it work with any transformers model (sketched below).
- Readme refactoring.
- Whisper example.
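A minimal sketch of the generic patching path with a non-LLM model (Whisper here), assuming the `AutoHQQHFModel.quantize_model` signature shown in the hqq README; the model id is illustrative.

```python
# A minimal sketch of quantizing an arbitrary transformers model (Whisper here),
# assuming the AutoHQQHFModel API from the hqq README; model id is illustrative.
import torch
from transformers import WhisperForConditionalGeneration
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16
)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# The patching logic walks the model and swaps its nn.Linear layers for HQQLinear
AutoHQQHFModel.quantize_model(
    model, quant_config=quant_config, compute_dtype=torch.float16, device="cuda"
)
```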
v0.1.7
v0.1.6.post2
v0.1.6.post1
HQQ v0.1.6
Use v0.1.6.post1 instead, unless you clone the repo first and then install.
Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- `pip install` now automatically compiles the CUDA kernels.
- CUDA backend automatically detected and used when available (see the backend sketch after this list).
- You can quantize any HF model automatically via `AutoHQQHFModel`.
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).
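A minimal sketch of overriding the backend, assuming the `HQQBackend` enum from the hqq README; per the notes above, the CUDA/ATEN backend is picked automatically when available, so setting it explicitly is only an override.

```python
# A minimal sketch of backend selection, assuming the HQQBackend enum from the
# hqq README; auto-detection normally makes this unnecessary.
from hqq.core.quantize import HQQLinear, HQQBackend

# Pure PyTorch backend: works everywhere, including CPU
HQQLinear.set_backend(HQQBackend.PYTORCH)

# ATEN/CUDA backend: uses the kernels compiled during `pip install`
HQQLinear.set_backend(HQQBackend.ATEN)
```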
Bugs
- Fix Peft bias dtype.
- Removed auto backend setting in LoRA.
- All `HQQLinear` dtype/device-related overloads now return `self`, which should solve a couple of issues.
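A minimal sketch of the chaining this enables, reusing the assumed `HQQLinear` constructor from the earlier sketches; since the overloads return `self`, the usual `nn.Module`-style call chaining works.

```python
# A minimal sketch, assuming the HQQLinear API used in the earlier sketches;
# because the dtype/device overloads return self, calls can be chained.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

layer = HQQLinear(
    nn.Linear(512, 512, bias=False),
    BaseQuantizeConfig(nbits=4, group_size=64),
    compute_dtype=torch.float16,
    device="cuda",
)
layer = layer.half().to("cuda")  # each call returns the layer itself
```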
Other
- Refactor backends (using backprop backends by default now).
- Added typing.
- Ruff fix and reformat all Python files.
- Refactor ATEN for reference tensors.
Issues
- Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit/gs=16). In fact, it's sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
- Shared-memory CUDA kernels are a bit slower than the non-shared-memory version, for some reason.
- The block size setting doesn't have much influence on the speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.