| Date | Paper | Key Words | Github |
|---|---|---|---|
| 2024.4.22 | A Survey on Efficient Inference for Large Language Models | Efficient Inference | |
| 2024.12.27 | A Survey on Large Language Model Acceleration based on KV Cache Management | KV Cache Management | Awesome-KV-Cache-Management & Awesome-LLM-KV-Cache |
| 2025.8.19 | Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention | Efficient Attention | Efficient_Attention_Survey |
| Date | Paper | Key Words |
|---|---|---|
| 2025.5.7 | Towards Large-scale Generative Ranking | GenRank & xiaohongshu |
| Date | Paper | Key Words |
|---|---|---|
| 2018.11.19 | Modeling Deep Learning Accelerator Enabled GPUs | Tensor Core Design & GPGPU-Sim |
| 2019 | Understanding the Overheads of Launching CUDA Kernels | Kernel Launch Overhead |
| 2021.10.25 | Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance | Persistent kernel fusion |
| 2022.4.5 | PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications | PERsistent KernelS (PERKS) |
| 2023.12.19 | A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library | FlashAttention2 using cutlass |
| 2025.4.8 | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | Prefetches required KV Cache into GPU L2 cache |
| Date | Paper | Key Words |
|---|---|---|
| 2024.5.7 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Boosts efficiency with W4A8KV4 quantization & Reduces dequantization overheads |
| 2025.2.20 | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | Accelerates long-context LLM inference through unified sparse attention & Hierarchical KV cache management |
| Date | Paper | Key Words |
|---|---|---|
| 2021.7.14 | Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines | Bidirectional Pipelines |
| 2023.11.30 | Zero Bubble Pipeline Parallelism | Zero Bubble PP |
| Date | Paper | Key Words |
|---|---|---|
| 2020.6.29 | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Linear Attention |
| 2022.5.27 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Flash Attention |
| 2023.7.18 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Flash Attention 2 |
| 2024.7.12 | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | Flash Attention 3 |
| 2024.10.3 | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | Sage Attention |
| 2024.11.17 | SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization | Sage Attention 2 |
| 2025.3.7 | Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA | Slim Attention |
| 2025.4.1 | Multi-Token Attention | Multi-Token Attention |
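As a concrete reference point for the linear-attention entry above ("Transformers are RNNs"), here is a minimal NumPy sketch of kernelized attention with the paper's feature map φ(x) = elu(x) + 1. The shapes and the function name `linear_attention` are illustrative, not from any listed codebase; the point is only that the (keys × values) summary makes the cost O(n·d²) instead of the O(n²·d) of softmax attention.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized attention with phi(x) = elu(x) + 1, following the idea in
    # "Transformers are RNNs: Fast Autoregressive Transformers with Linear
    # Attention". Non-causal version for brevity.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)          # (n, d) positive features
    KV = Kp.T @ V                    # (d, d_v) summary of all keys/values
    Z = Qp @ Kp.sum(axis=0) + eps    # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]    # (n, d_v)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because φ is strictly positive, each output row is (up to `eps`) a convex combination of the rows of `V`, mirroring what the softmax weights do in standard attention.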
| Date | Paper | Key Words |
|---|---|---|
| 2021.1.11 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Mixture of Experts (MoE) |
| 2024.1.11 | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | DeepSeekMoE |
| Date | Paper | Key Words |
|---|---|---|
| 2020.10.22 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | Vision Transformer (ViT) |
| Date | Paper | Key Words |
|---|---|---|
| 2025.10.6 | Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper) | Politeness Affects LLM Accuracy |
| Date | Paper | Key Words |
|---|---|---|
| 2023.2.9 | Toolformer: Language Models Can Teach Themselves to Use Tools | Agent or RAG concepts |
| 2024.2.2 | TravelPlanner: A Benchmark for Real-World Planning with Language Agents | Real-World Planning |
| 2024.4.18 | Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools | Real-World Planning |
| Date | Paper | Key Words |
|---|---|---|
| 1972 | Karp's 21 NP-complete problems | Karp's 21 NP-complete problems |
| 1973 | An n^{5/2} algorithm for maximum matchings in bipartite graphs | Hopcroft-Karp Algorithm |
| 2002 | A 27/26-Approximation Algorithm for the Chromatic Sum Coloring of Bipartite Graphs | Chromatic Sum Coloring of Bipartite Graphs |
| 2015.6.16 | An Efficient Data Structure for Processing Palindromes in Strings | Palindromic Tree |
| 2017.8.11 | An Introduction to Quantum Computing, Without the Physics | Quantum Computing, Without the Physics |
| 2018.7.30 | A Simple Near-Linear Pseudopolynomial Time Randomized Algorithm for Subset Sum | A Simple Near-Linear Pseudopolynomial Time Randomized Algorithm for Subset Sum |
| 2021.2.11 | Hybrid Neural Fusion for Full-frame Video Stabilization | Video Stabilization Algorithm |
| 2022.11.21 | The Berlekamp-Massey Algorithm revisited | Berlekamp-Massey Algorithm |
| 2025.4.23 | Breaking the Sorting Barrier for Directed Single-Source Shortest Paths | O(m log^{2/3} n)-time algorithm for single-source shortest paths |
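For the Subset Sum entry above, here is the textbook O(n·target) pseudopolynomial DP as a baseline, written with Python's arbitrary-precision integers as a bitset. Note this is *not* the near-linear randomized algorithm of the listed paper; it is only a sketch to make the problem concrete.

```python
def subset_sum(nums, target):
    # Bit i of `reach` is set iff some subset of the processed numbers
    # sums to exactly i. Shifting by x adds x to every reachable sum.
    reach = 1  # only the empty subset (sum 0) is reachable initially
    for x in nums:
        reach |= reach << x
        reach &= (1 << (target + 1)) - 1  # drop sums above target
    return bool((reach >> target) & 1)

print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # True (4 + 5)
print(subset_sum([3, 34, 4, 12, 5, 2], 30))  # False
```

The bitset trick keeps the inner loop to a couple of word-level operations per element, which is why this classic DP is fast in practice even though the paper's randomized algorithm is asymptotically better.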