Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs".
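A minimal PyTorch sketch of the signed-gradient rounding idea the paper describes: learn a per-weight shift V in [-0.5, 0.5] that decides whether each weight rounds up or down, updating V with only the sign of its gradient. Shapes, hyperparameters, and the straight-through estimator here are illustrative assumptions, not the repository's actual code.

```python
import torch

# Toy demo of weight rounding learned via signed gradient descent.
# All sizes, the learning rate, and the step count are assumptions.
torch.manual_seed(0)
w = torch.randn(64, 64)
x = torch.randn(128, 64)
ref = x @ w.t()                                  # full-precision reference output

scale = w.abs().max() / 7                        # 4-bit symmetric scale (levels -8..7)
v = torch.zeros_like(w, requires_grad=True)      # learnable rounding shift
lr = 5e-3

for _ in range(200):
    u = w / scale + v
    u_ste = u + (u.round() - u).detach()         # straight-through rounding
    w_q = scale * torch.clamp(u_ste, -8, 7)
    loss = ((x @ w_q.t()) - ref).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        v -= lr * v.grad.sign()                  # signed gradient step
        v.clamp_(-0.5, 0.5)                      # keep shift within half a bin
        v.grad = None
```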
Model Compression Toolkit (MCT) is an open source project for neural network model optimization under efficient, constrained hardware. It provides researchers, developers, and engineers with advanced quantization and compression tools for deploying state-of-the-art neural networks.
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
PyTorch native quantization and sparsity for training and inference
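As a quick illustration of PyTorch-native quantization, here is core PyTorch's built-in dynamic quantization entry point (weights stored as INT8, activations quantized at runtime). The library above offers far more schemes; this snippet only shows the baseline workflow.

```python
import torch
import torch.nn as nn

# Dynamic quantization: Linear weights are converted to INT8, and
# activations are quantized on the fly at inference time.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(qmodel(torch.randn(1, 128)).shape)  # torch.Size([1, 10])
```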
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
A beginner-friendly tutorial on model compression.
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Simple dithering library in Rust, based on image-rs
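The crate itself is Rust; as a language-neutral illustration, here is a short Python sketch of Floyd–Steinberg error diffusion, the classic algorithm such dithering libraries implement. The function name and the 1-bit target are assumptions for the demo.

```python
import numpy as np

def floyd_steinberg(gray):
    """Dither a grayscale image (uint8, HxW) to black/white by diffusing
    each pixel's rounding error to its unvisited neighbors."""
    img = gray.astype(np.float64).copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128 else 0.0
            img[y, x] = new
            err = old - new
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return img.astype(np.uint8)
```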
Neural Network Compression Framework for enhanced OpenVINO™ inference
Faster Whisper transcription with CTranslate2
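Typical usage, following the faster-whisper README: load a CTranslate2-backed model with INT8 compute and iterate over transcription segments. The model size and audio path are placeholders.

```python
from faster_whisper import WhisperModel

# INT8 compute keeps memory low; "cuda" with float16 is the usual GPU choice.
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```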
Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
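To make the idea concrete, here is a toy NumPy sketch of quantized attention: Q and K are quantized to INT8 per tensor, the score matrix is accumulated in INT32, and the result is dequantized before the softmax. This is conceptual only; the real kernel relies on per-block scales, smoothing, and fused GPU kernels that this sketch omits.

```python
import numpy as np

def int8_quant(x):
    """Per-tensor symmetric INT8 quantization; returns codes and scale."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))

q_i8, sq = int8_quant(q)
k_i8, sk = int8_quant(k)
# INT32 accumulation of the score matrix, then dequantize with both scales.
scores = (q_i8.astype(np.int32) @ k_i8.astype(np.int32).T) * (sq * sk)
scores /= np.sqrt(k.shape[1])
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
out = probs @ v  # V kept in full precision in this sketch
```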
A Python package that extends official PyTorch to easily obtain performance gains on Intel platforms.
A project built with Electron + React.js to explore the potential of cross-platform AI completion.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
Official implementation of Half-Quadratic Quantization (HQQ)
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
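A conceptual NumPy sketch of the SVDQuant decomposition: absorb outliers with a truncated-SVD low-rank branch, then 4-bit quantize the residual, so W ≈ L + Q(W − L). The rank budget, bit width, and shapes here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
w[:, :4] *= 20.0                       # inject outlier columns

# Low-rank branch via truncated SVD absorbs the outliers.
u, s, vt = np.linalg.svd(w, full_matrices=False)
r = 16                                 # assumed rank budget
low_rank = (u[:, :r] * s[:r]) @ vt[:r]

# 4-bit symmetric quantization of the now well-behaved residual.
residual = w - low_rank
scale = np.abs(residual).max() / 7.0
q = np.clip(np.round(residual / scale), -8, 7)
w_hat = low_rank + q * scale

print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```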