SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
-
Updated
Nov 14, 2024 - Python
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
Reimplement RetinaFace use C++ and TensorRT
👀 Apply YOLOv8 exported with ONNX or TensorRT(FP16, INT8) to the Real-time camera
a simple pipline of int8 quantization based on tensorrt.
quantization example for pqt & qat
LLM-Lora-PEFT_accumulate explores optimizations for Large Language Models (LLMs) using PEFT, LORA, and QLORA. Contribute experiments and implementations to enhance LLM efficiency. Join discussions and push the boundaries of LLM optimization. Let's make LLMs more efficient together!
TensorRT Int8 Python version sample. TensorRT Int8 Python 实现例子。TensorRT Int8 Pythonの例です
A LLaMA2-7b chatbot with memory running on CPU, and optimized using smooth quantization, 4-bit quantization or Intel® Extension For PyTorch with bfloat16.
Add a description, image, and links to the int8 topic page so that developers can more easily learn about it.
To associate your repository with the int8 topic, visit your repo's landing page and select "manage topics."