Starred repositories


🎉 Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.

Cuda 1,441 157 Updated Nov 15, 2024

GPU Performance Advisor

Python 63 8 Updated Jul 25, 2022

A bibliography and survey of the papers surrounding o1

TeX 725 31 Updated Nov 8, 2024
Python 1 Updated Nov 7, 2024

Accessible large language models via k-bit quantization for PyTorch.

Python 6,283 630 Updated Nov 14, 2024
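
For context, here is a minimal sketch of what k-bit loading looks like through Hugging Face Transformers' bitsandbytes integration; the checkpoint name is a placeholder and the NF4 settings are just one common configuration, not the only one the library supports.

```python
# Minimal sketch: load a causal LM with 4-bit (NF4) weights via bitsandbytes.
# The checkpoint name is an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```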

Tutorials and useful scripts for using RapidStream.

Verilog 1 Updated Nov 1, 2024

Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton

Python 1,334 69 Updated Nov 14, 2024
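
For intuition, below is a plain-PyTorch sketch of the causal linear attention recurrence that such kernels accelerate; it is not the library's API, and the elu(x)+1 feature map is just one common choice assumed here.

```python
# Plain-PyTorch sketch of the O(seq) causal linear attention recurrence.
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """q, k, v: (batch, heads, seq, dim). Runs the sequential recurrence."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1      # assumed feature map
    b, h, n, d = q.shape
    dv = v.shape[-1]
    state = q.new_zeros(b, h, d, dv)               # running sum of k ⊗ v
    norm = q.new_zeros(b, h, d)                    # running sum of k
    out = torch.empty(b, h, n, dv, device=q.device, dtype=q.dtype)
    for t in range(n):
        state = state + phi_k[:, :, t, :, None] * v[:, :, t, None, :]
        norm = norm + phi_k[:, :, t]
        num = torch.einsum("bhd,bhdv->bhv", phi_q[:, :, t], state)
        den = torch.einsum("bhd,bhd->bh", phi_q[:, :, t], norm).clamp(min=1e-6)
        out[:, :, t] = num / den[..., None]
    return out
```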

Single-thread, end-to-end C++ implementation of the Bitnet (1.58-bit weight) model

C++ 1 Updated Nov 11, 2024

depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.

Python 498 11 Updated Nov 4, 2024
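
A minimal sketch of typical depyf usage: wrap a compiled call so the transformed source is dumped for inspection. The dump directory is arbitrary, and the prepare_debug entry point follows the project's documented usage; verify against the repo.

```python
# Sketch: dump readable source for code generated by torch.compile.
import torch
import depyf

@torch.compile
def toy_fn(x):
    return torch.sin(x) + torch.cos(x)

with depyf.prepare_debug("./depyf_dump"):   # dump directory name is arbitrary
    toy_fn(torch.randn(8))
# Inspect ./depyf_dump for decompiled Python source of the compiled graph.
```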

Quantized attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Python 389 16 Updated Nov 15, 2024

[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Python 241 23 Updated Oct 10, 2024
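
The KIVI paper describes asymmetric 2-bit quantization of the KV cache (per-channel grouping for keys, per-token for values). Below is a plain-PyTorch sketch of the generic asymmetric 2-bit primitive only, not the repo's implementation; the grouping dimension is an illustrative assumption.

```python
# Conceptual sketch: asymmetric 2-bit uniform quantization of a KV tensor.
import torch

def asym_quant_2bit(x, dim):
    """Quantize x to 2-bit codes with per-slice (along `dim`) min/scale."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0    # 2 bits -> 4 levels (0..3)
    q = torch.clamp(torch.round((x - xmin) / scale), 0, 3)
    return q.to(torch.uint8), scale, xmin

def asym_dequant(q, scale, xmin):
    return q.float() * scale + xmin

keys = torch.randn(1, 8, 128, 64)                  # (batch, heads, seq, head_dim)
q, s, z = asym_quant_2bit(keys, dim=-2)            # stats over tokens -> per-channel
recon = asym_dequant(q, s, z)
print((recon - keys).abs().mean())                 # reconstruction error
```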

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Python 729 56 Updated Oct 8, 2024

Official inference framework for 1-bit LLMs

C++ 11,078 752 Updated Nov 11, 2024
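
For intuition, a plain-PyTorch sketch of the absmean ternary (1.58-bit) weight quantization rule described in the BitNet b1.58 paper; this is a conceptual illustration, not the framework's code.

```python
# Conceptual sketch: absmean ternary quantization of a weight matrix.
import torch

def absmean_ternary(w, eps=1e-8):
    """Map weights to {-1, 0, +1} with a per-tensor absmean scale."""
    gamma = w.abs().mean()                         # per-tensor scale
    w_q = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)
    return w_q, gamma                              # use w_q * gamma at matmul time

w = torch.randn(256, 256)
w_q, gamma = absmean_ternary(w)
print(torch.unique(w_q))                           # tensor([-1., 0., 1.])
```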

Repo to submit jobs to the AMD cluster

Python 8 Updated Oct 30, 2024

DHLS (Dynamic High-Level Synthesis) compiler based on MLIR

C++ 62 19 Updated Nov 14, 2024

Extensible collectives library in Triton

Python 65 2 Updated Sep 23, 2024

KV cache compression for high-throughput LLM inference

Python 83 4 Updated Nov 4, 2024

Development repository for the Triton language and compiler

C++ 13,396 1,638 Updated Nov 15, 2024
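
For orientation, a minimal vector-add kernel showing Triton's programming model; the block size and names are illustrative, not taken from the repository.

```python
# Minimal Triton "hello world": elementwise vector addition.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements                 # guard the tail of the vector
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```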

RapidStream TAPA compiles task-parallel HLS programs into high-frequency FPGA accelerators.

C++ 155 32 Updated Nov 15, 2024

A minimal Jekyll Theme to host your resume (CV) on GitHub with a few clicks.

JavaScript 3,187 6,006 Updated Aug 19, 2024

[LLM] Train a 26M-parameter GPT completely from scratch in 3 hours; both training and inference run on a personal GPU!

Python 2,674 326 Updated Nov 10, 2024

Machine learning on FPGAs using HLS

C++ 1,280 415 Updated Nov 14, 2024
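
A sketch of converting a small Keras model with hls4ml, assuming the project's documented config_from_keras_model / convert_from_keras_model helpers; the model, FPGA part, and output directory are placeholders.

```python
# Sketch: convert a tiny Keras MLP to an HLS project with hls4ml.
import hls4ml
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    keras.layers.Dense(5, activation="softmax"),
])

config = hls4ml.utils.config_from_keras_model(model, granularity="model")
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",               # placeholder project directory
    part="xcu250-figd2104-2L-e",           # placeholder FPGA part
)
hls_model.compile()                        # builds the C-simulation library
# hls_model.build()                        # would run the full HLS synthesis flow
```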

A simple and effective LLM pruning approach.

Python 667 90 Updated Aug 9, 2024
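
Wanda scores each weight by its magnitude times the L2 norm of the corresponding input activation and removes the lowest-scoring weights within each output row. Below is a plain-PyTorch sketch of that criterion with an invented helper name; it is not the repo's code.

```python
# Conceptual sketch of the Wanda pruning criterion: |W_ij| * ||X_j||_2.
import torch

def wanda_prune(weight, act_norm, sparsity=0.5):
    """weight: (out, in); act_norm: (in,) L2 norms of calibration activations."""
    score = weight.abs() * act_norm.unsqueeze(0)          # (out, in)
    k = int(weight.shape[1] * sparsity)
    # indices of the k lowest-scoring weights per output row
    idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    return weight * mask

w = torch.randn(8, 16)
act_norm = torch.rand(16) + 0.1
w_pruned = wanda_prune(w, act_norm, sparsity=0.5)
print((w_pruned == 0).float().mean())                     # ~0.5 sparsity
```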

FlagGems is an operator library for large language models implemented in Triton Language.

Python 340 43 Updated Nov 14, 2024

ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch

Python 29 Updated Aug 8, 2024
Python 45 16 Updated Nov 7, 2024

Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)

Python 135 17 Updated Sep 20, 2024

Efficient Triton Kernels for LLM Training

Python 3,424 201 Updated Nov 15, 2024
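
A sketch of how Liger's Triton kernels are typically applied to a Hugging Face Llama model; the patch helper name follows my reading of the project's README and should be treated as an assumption, and the checkpoint name is a placeholder.

```python
# Sketch: monkey-patch a HF Llama model with Liger kernels before training.
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama  # per README

apply_liger_kernel_to_llama()          # swaps in fused RMSNorm, RoPE, SwiGLU, etc.

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",      # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Training then proceeds as usual; the patched ops cut memory and raise throughput.
```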

Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, making it ~2x faster on consumer devices.

Python 207 22 Updated Oct 12, 2024

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Python 81 3 Updated Aug 13, 2024