Releases: mobiusml/hqq

HQQ v0.2.2 (12 Sep)

  • Support static cache compilation without using HFGenerator (see the sketch after this list)
  • Fix various issues related to torch.compile
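
A minimal sketch of the pattern this enables, using the plain transformers API instead of HFGenerator; the model id is a placeholder and the HQQ-specific setup (quantization config, backend patching) is omitted here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; use your HQQ-quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# Compile the forward pass once; with a static KV cache the tensor shapes stay
# fixed, so torch.compile does not re-trace on every decoding step.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```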

HQQ v0.2.1 (29 Aug)

HQQ v0.2.0 (28 Aug)

  • Bug fixes
  • Safetensors support for transformers via huggingface/transformers#33141
  • quant_scale, quant_zero and offload_meta are now deprecated. They still work with the hqq lib, but they are not supported in the transformers integration (see the sketch after this list)
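
A hedged sketch of the corresponding transformers-side usage: the quantization config carries only the supported options (no quant_scale, quant_zero or offload_meta), and the quantized model can then be saved as safetensors. The model id and output directory are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# Only the supported options; quant_scale, quant_zero and offload_meta are
# deprecated and not accepted on the transformers side.
quant_config = HqqConfig(nbits=4, group_size=64)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# With the linked PR, the quantized weights serialize to safetensors.
model.save_pretrained("llama2-7b-hqq-4bit")
```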

HQQ v0.1.8 (11 Jul)

  • Add BitBlas backend support
  • Simpler HQQLinear construction directly from weights via HQQLinear.from_weights(W, bias, etc.) (see the sketch after this list)
  • Fix memory leak when swapping layers for the TorchAO backend
  • Add HQQLinear.unpack() call
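
A sketch of the two new HQQLinear entry points; the keyword arguments beyond W and bias are assumptions and should be checked against the HQQLinear.from_weights signature in your version:

```python
import torch
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

# Random weights standing in for a real layer.
W    = torch.randn(4096, 4096, dtype=torch.float16)
bias = None

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Build an HQQLinear directly from raw weights instead of an nn.Linear module.
hqq_layer = HQQLinear.from_weights(W, bias, quant_config,
                                   compute_dtype=torch.float16, device="cuda")

# unpack() exposes the packed low-bit weights as regular integer tensors.
W_unpacked = hqq_layer.unpack()
print(W_unpacked.shape, W_unpacked.dtype)
```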

HQQ v0.1.7.post3 (28 May)

  • Enable CPU quantization and runtime (see the sketch after this list)
  • Fix _load_state_dict
  • Fix extra_repr in HQQLinear
  • Fix from_quantized bugs
  • Fix | typing
  • Fix 3-bit axis=1 slicing bug
  • Add 5/6-bit support for testing
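
A minimal sketch of CPU quantization and inference with a single layer; the argument names follow the usual HQQLinear constructor and may differ slightly between versions:

```python
import torch
import torch.nn as nn
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

linear = nn.Linear(4096, 4096, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantize and run entirely on CPU.
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float32, device="cpu")

x = torch.randn(1, 4096, dtype=torch.float32)
y = hqq_layer(x)
print(y.shape)
```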

HQQ v0.1.7.post2 (06 May)

  • Various bug fixes, especially in AutoHQQHFModel and the patching logic, so that it works with any transformers model (see the sketch after this list).
  • Readme refactoring.
  • Whisper example.
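
A hedged sketch of quantizing a non-LLM transformers model (Whisper, as in the example mentioned above) with AutoHQQHFModel; the model id is a placeholder and the quantization settings are illustrative:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model_id = "openai/whisper-large-v3"  # placeholder
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16)

# Quantize the whole model in place; patching handles the model-specific layout.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
```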

HQQ v0.1.7 (24 Apr)

  • Faster inference with torchao / marlin 4-bit kernels (see the sketch after this list)
  • Multi-GPU support for model.quantize()
  • Custom HF generator
  • Various bug fixes/improvements
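
A hedged sketch that ties these bullets together: quantize with hqq, patch the 4-bit layers to the faster torchao kernels, then decode with the custom HF generator. The model id, prompt and generation settings are illustrative, and the exact prepare_for_inference/HFGenerator parameters should be checked against the hqq README for this version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig
from hqq.utils.patching import prepare_for_inference
from hqq.utils.generation_hf import HFGenerator

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# axis=1 quantization is what the torchao/marlin 4-bit kernels expect.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Swap the quantized layers to the torchao int4 kernels ("marlin" is the other option).
prepare_for_inference(model, backend="torchao_int4")

# Custom HF generator with a compiled decoding loop.
gen = HFGenerator(model, tokenizer, max_new_tokens=64, do_sample=False).warmup()
print(gen.generate("Explain 4-bit quantization in one sentence.", print_tokens=False))
```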

HQQ v0.1.6.post2 (19 Mar)

Same as v0.1.6 with setup.py fixes:

  • find_packages fix: #25
  • Auto-build CUDA kernels via pypi package: #26

HQQ v0.1.6.post1 (19 Mar)

Same as v0.1.6 with a find_packages fix #25

HQQ v0.1.6 (19 Mar)

Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.

Features

  • Quantize on target device.
  • Meta-offloading uses pinned memory for faster/async transfers.
  • Loading saved LoRA weights automatically adds LoRA modules if not already present.
  • pip install automatically compiles the CUDA kernels now.
  • CUDA backend automatically detected and used when available.
  • You can quantize any HF model automatically via AutoHQQHFModel.
  • Faster meta-offloading with CUDA streams (experimental).
  • Int8 matmul (experimental).
  • Shared memory CUDA kernels (experimental).

Bugs

  • Fix Peft bias dtype.
  • Remove the automatic backend setting in LoRA.
  • All HQQLinear dtype/device-related overloads now return self, which should resolve several related issues.

Other

  • Refactor backends (backprop-capable backends are used by default now; see the sketch after this list).
  • Add typing.
  • Run Ruff fixes and reformat all Python files.
  • Refactor ATEN for reference tensors.
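
A short sketch of explicit backend selection (normally the CUDA/ATen backend is picked automatically when available); the enum member names follow the hqq README of that era and may have changed since:

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# Pure-PyTorch backend (works everywhere):
HQQLinear.set_backend(HQQBackend.PYTORCH)

# Compiled CUDA/ATen kernels, auto-detected and used when available:
HQQLinear.set_backend(HQQBackend.ATEN)
```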

Issues

  • Using CUDA streams for offloading is faster but uses more memory (about +700MB with Llama2-7B, 2-bit, group size 16). It is sometimes almost as fast as keeping the data on the GPU, so this is worth investigating further.
  • Shared-memory CUDA kernels are slightly slower than the non-shared-memory versions, for reasons not yet understood.
  • The block size setting doesn't have much influence on speed.
  • Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be implemented on the ATen/CUDA side.