Starred repositories
🎉 Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.
A bibliography and survey of the papers surrounding o1
Accessible large language models via k-bit quantization for PyTorch.
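A minimal sketch of how k-bit quantization is typically applied when loading a model, using the Hugging Face transformers integration of bitsandbytes; the checkpoint name is a placeholder, and the exact options should be checked against the bitsandbytes docs.

```python
# Sketch: load a causal LM with 4-bit (NF4) weights via the transformers
# integration of bitsandbytes. The model name below is hypothetical.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear-layer weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "example-org/example-7b",               # hypothetical checkpoint
    quantization_config=bnb_config,
)
```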
Tutorials and useful scripts for using RapidStream.
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
Single-thread, end-to-end C++ implementation of the Bitnet (1.58-bit weight) model
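For reference, a small PyTorch sketch of 1.58-bit (ternary) weight quantization in the style of BitNet b1.58: weights are scaled by their mean absolute value and rounded to {-1, 0, +1}. This illustrates the arithmetic only, not the C++ implementation above, and uses a simple per-tensor scale.

```python
# Sketch of BitNet b1.58-style ternary quantization:
# scale = mean(|W|), W_q = clip(round(W / scale), -1, 1).
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)      # per-tensor scale (illustrative choice)
    w_q = (w / scale).round().clamp(-1, 1)     # values in {-1, 0, +1}
    return w_q, scale

w = torch.randn(4, 8)
w_q, scale = quantize_ternary(w)
w_hat = w_q * scale                            # dequantized approximation of w
```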
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
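A minimal torch.compile example for context; depyf is used to dump and inspect the code that torch.compile generates for a function like this. The prepare_debug context manager and dump directory follow depyf's documented usage and should be verified against the current docs.

```python
# Sketch: compile a small function with torch.compile, then use depyf to dump
# the generated/decompiled source for inspection.
import torch
import depyf

def toy_fn(x):
    return torch.sin(x) + torch.cos(x)

compiled_fn = torch.compile(toy_fn)

with depyf.prepare_debug("./depyf_dump"):  # writes generated source files here
    compiled_fn(torch.randn(8))
```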
Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
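To make "asymmetric 2-bit" concrete, here is a generic asymmetric quantize/dequantize sketch in PyTorch. It uses a single per-tensor scale and zero-point for illustration only; KIVI itself quantizes the key cache per-channel and the value cache per-token, and its kernels differ from this sketch.

```python
# Generic asymmetric b-bit quantization: map [min, max] onto [0, 2^b - 1]
# using a scale and a zero-point, then round.
import torch

def asym_quantize(x: torch.Tensor, bits: int = 2):
    qmax = 2 ** bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = (-x_min / scale).round()
    q = ((x / scale) + zero_point).round().clamp(0, qmax)
    return q, scale, zero_point

def asym_dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

k = torch.randn(16, 64)
q, s, zp = asym_quantize(k, bits=2)
k_hat = asym_dequantize(q, s, zp)   # low-precision approximation of k
```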
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
DHLS (Dynamic High-Level Synthesis) compiler based on MLIR
KV cache compression for high-throughput LLM inference
Development repository for the Triton language and compiler
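For a flavor of the Triton language, a standard element-wise vector-add kernel (essentially the introductory tutorial example from the Triton repository), shown here for illustration; inputs are expected to be CUDA tensors.

```python
# Element-wise vector addition in Triton: each program instance handles one
# BLOCK_SIZE-wide chunk of the output.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```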
RapidStream TAPA compiles task-parallel HLS programs into high-frequency FPGA accelerators.
A minimal Jekyll Theme to host your resume (CV) on GitHub with a few clicks.
FlagGems is an operator library for large language models implemented in Triton Language.
ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch
Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)
Efficient Triton Kernels for LLM Training
Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, making it ~2x faster on consumer devices.
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs