Releases: mobiusml/hqq
v0.2.2
HQQ v0.2.1
- `HQQLinear.state_dict()` support for non-initialized layers; mainly used in huggingface/transformers#33141.
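For context, here is a minimal sketch of what `HQQLinear.state_dict()` looks like, assuming the `BaseQuantizeConfig`/`HQQLinear` constructor described in the hqq README; layer sizes and settings are illustrative, and the non-initialized case mentioned above is the one the transformers integration relies on.

```python
# A minimal sketch (illustrative, not from the release notes), assuming the
# BaseQuantizeConfig / HQQLinear API described in the hqq README.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096, bias=False)          # toy layer
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantize the layer, then inspect the state_dict used for serialization;
# the v0.2.1 change makes this also work on layers that have not been
# initialized yet, which the transformers integration needs.
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float16, device="cuda")
print(list(hqq_layer.state_dict().keys()))
```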
HQQ v0.2.0
- Bug fixes
- Safetensors support for transformers via huggingface/transformers#33141
- `quant_scale`, `quant_zero`, and `offload_meta` are now deprecated. You can still use them with the hqq lib, but you can't use them with the transformers lib (see the sketch below).
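A minimal sketch of how this looks in practice, assuming the transformers `HqqConfig`/`from_pretrained` API and the hqq `BaseQuantizeConfig` signature from the README; the model id and settings are illustrative only.

```python
# A minimal sketch, assuming the transformers HqqConfig / from_pretrained API;
# model id and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, HqqConfig
from hqq.core.quantize import BaseQuantizeConfig

# transformers side: only the basic options, the deprecated flags are not passed
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
model.save_pretrained("llama2-7b-hqq-4bit")  # serialized as safetensors

# hqq lib side: the deprecated flags are still accepted here
hqq_only_config = BaseQuantizeConfig(
    nbits=4, group_size=64, quant_zero=True, quant_scale=False, offload_meta=False
)
```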
v.0.1.8
HQQ v0.1.7.post3
- Enable CPU quantization and runtime (see the sketch after this list)
- `_load_state_dict` fix
- fix `extra_repr` in `HQQLinear`
- fix `from_quantized` bugs
- fix `|` typing
- fix 3-bit `axis=1` slicing bug
- add 5/6-bit for testing
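A minimal sketch of the CPU path, assuming the same `HQQLinear` constructor as in the earlier sketch; sizes and dtypes are illustrative.

```python
# A minimal sketch of CPU quantization and runtime, assuming the HQQLinear API
# from the hqq README; sizes/dtypes are illustrative.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(1024, 1024, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantization happens on CPU, and the forward pass (dequant + matmul) also runs on CPU
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float32, device="cpu")
y = hqq_layer(torch.randn(2, 1024))
print(y.shape)  # torch.Size([2, 1024])
```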
HQQ v0.1.7.post2
- Various bug fixes, especially with `AutoHQQHFModel` and the patching logic, to make it work with any transformers model (sketched below).
- Readme refactoring.
- Whisper example.
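A minimal sketch of the generic patching path with a non-LLM model (Whisper here), assuming the `AutoHQQHFModel.quantize_model` signature shown in the hqq README; the model id is illustrative.

```python
# A minimal sketch of quantizing an arbitrary transformers model (Whisper here),
# assuming the AutoHQQHFModel API from the hqq README; model id is illustrative.
import torch
from transformers import WhisperForConditionalGeneration
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16
)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# The patching logic walks the model and swaps its nn.Linear layers for HQQLinear
AutoHQQHFModel.quantize_model(
    model, quant_config=quant_config, compute_dtype=torch.float16, device="cuda"
)
```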
v0.1.7
v0.1.6.post2
v0.1.6.post1
HQQ v0.1.6
Use v0.1.6.post1 instead, unless you clone the repo first and then install.
Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- `pip install` now automatically compiles the CUDA kernels.
- CUDA backend automatically detected and used when available (see the backend sketch after this list).
- You can quantize any HF model automatically via `AutoHQQHFModel`.
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).
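A minimal sketch of overriding the backend, assuming the `HQQBackend` enum from the hqq README; per the notes above, the CUDA/ATEN backend is picked automatically when available, so setting it explicitly is only an override.

```python
# A minimal sketch of backend selection, assuming the HQQBackend enum from the
# hqq README; auto-detection normally makes this unnecessary.
from hqq.core.quantize import HQQLinear, HQQBackend

# Pure PyTorch backend: works everywhere, including CPU
HQQLinear.set_backend(HQQBackend.PYTORCH)

# ATEN/CUDA backend: uses the kernels compiled during `pip install`
HQQLinear.set_backend(HQQBackend.ATEN)
```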
Bugs
- Fix Peft bias dtype.
- Removed auto backend setting in LoRA.
- All `HQQLinear` dtype/device-related overloads now return `self`, which should solve a couple of issues.
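A minimal sketch of the chaining this enables, reusing the assumed `HQQLinear` constructor from the earlier sketches; since the overloads return `self`, the usual `nn.Module`-style call chaining works.

```python
# A minimal sketch, assuming the HQQLinear API used in the earlier sketches;
# because the dtype/device overloads return self, calls can be chained.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

layer = HQQLinear(
    nn.Linear(512, 512, bias=False),
    BaseQuantizeConfig(nbits=4, group_size=64),
    compute_dtype=torch.float16,
    device="cuda",
)
layer = layer.half().to("cuda")  # each call returns the layer itself
```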
Other
- Refactor backends (using backprop backends by default now).
- Added typing.
- Ruff fix and reformat all Python files.
- Refactor ATEN for reference tensors.
Issues
- Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit/gs=16). In fact, it's sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
- Shared-memory CUDA kernels are a bit slower than the non-shared-memory version, for some reason.
- The block size setting doesn't have much influence on the speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.