# Papers-Reading

## LLM

### Survey

| Date | Paper | Key Words | GitHub |
| --- | --- | --- | --- |
| 2024.4.22 | A Survey on Efficient Inference for Large Language Models | Efficient Inference | |
| 2024.12.27 | A Survey on Large Language Model Acceleration based on KV Cache Management | KV Cache Management | Awesome-KV-Cache-Management & Awesome-LLM-KV-Cache |
| 2025.8.19 | Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention | Efficient Attention | Efficient_Attention_Survey |

### Generative Recommendation

| Date | Paper | Key Words |
| --- | --- | --- |
| 2025.5.7 | Towards Large-scale Generative Ranking | GenRank & Xiaohongshu |

### Models

| Date | Paper | Key Words |
| --- | --- | --- |
| 2019.2.24 | Language Models are Unsupervised Multitask Learners | GPT-2 |
| 2020.5.28 | Language Models are Few-Shot Learners | GPT-3 |
| 2022.3.4 | Training language models to follow instructions with human feedback | InstructGPT & human-feedback training |
| 2022.4.5 | PaLM: Scaling Language Modeling with Pathways | PaLM |
| 2023.2.27 | LLaMA: Open and Efficient Foundation Language Models | LLaMA |
| 2023.3.15 | GPT-4 Technical Report | GPT-4 |
| 2023.4.17 | Visual Instruction Tuning | LLaVA |
| 2023.6.20 | Textbooks Are All You Need | Phi-1 |
| 2024.3.8 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | DeepSeek-VL: dense VLM |
| 2024.7.10 | PaliGemma: A versatile 3B VLM for transfer | Google's small VLM: PaliGemma |
| 2024.10.8 | Aria: An Open Multimodal Native Mixture-of-Experts Model | First MoE VLM: Aria |
| 2024.12.6 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | VLM: InternVL 2.5 |
| 2024.12.13 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | DeepSeek-VL2: MoE VLM |
| 2024.12.27 | DeepSeek-V3 Technical Report | DeepSeek-V3 |
| 2025.1.22 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-R1 |
| 2025.9.29 | DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention | DeepSeek-V3.2-Exp |
| 2025.10.20 | DeepSeek-OCR: Contexts Optical Compression | DeepSeek-OCR |

### Kernel Optimization

| Date | Paper | Key Words |
| --- | --- | --- |
| 2018.11.19 | Modeling Deep Learning Accelerator Enabled GPUs | Tensor Core design & GPGPU-Sim |
| 2019 | Understanding the Overheads of Launching CUDA Kernels | Kernel launch overhead |
| 2021.10.25 | Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance | Persistent kernel fusion |
| 2022.4.5 | PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications | PERsistent KernelS (PERKS) |
| 2023.12.19 | A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library | FlashAttention-2 with CUTLASS |
| 2025.4.8 | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | Prefetching the required KV cache into GPU L2 cache |

### Serving

| Date | Paper | Key Words |
| --- | --- | --- |
| 2024.5.7 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | W4A8KV4 quantization & reduced dequantization overhead |
| 2025.2.20 | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | Long-context inference via unified sparse attention & hierarchical KV cache management |

### Training

| Date | Paper | Key Words |
| --- | --- | --- |
| 2021.7.14 | Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines | Bidirectional pipelines |
| 2023.11.30 | Zero Bubble Pipeline Parallelism | Zero-bubble pipeline parallelism |

### Attention

| Date | Paper | Key Words |
| --- | --- | --- |
| 2020.6.29 | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Linear Attention (see the sketch below) |
| 2022.5.27 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | FlashAttention |
| 2023.7.18 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | FlashAttention-2 |
| 2024.7.12 | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | FlashAttention-3, optimized for Hopper GPUs (e.g. H100) |
| 2024.10.3 | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | SageAttention |
| 2024.11.17 | SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization | SageAttention2 |
| 2025.3.7 | Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA | Slim Attention |
| 2025.4.1 | Multi-Token Attention | Multi-Token Attention |
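
For context on the linear-attention entry, here is a minimal NumPy sketch of the kernelized attention identity from "Transformers are RNNs" (non-causal form; the feature map phi(x) = elu(x) + 1 follows the paper, all shapes and names are illustrative):

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: the positive feature map proposed in the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (N, d), V: (N, d_v). Computing K^T V first makes the cost
    # O(N * d * d_v) instead of the O(N^2 * d) of softmax attention.
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    KV = Kf.T @ V                      # (d, d_v): sum_j phi(k_j) v_j^T
    Z = Qf @ Kf.sum(axis=0)            # (N,): per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (16, 8)
```

The autoregressive variant in the paper replaces the two sums with running cumulative sums, which is what lets the transformer run like an RNN at decode time.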

### Quantization

| Date | Paper | Key Words |
| --- | --- | --- |
| 2022.6.4 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | INT8 weights and INT8 activations |
| 2022.8.15 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | LLM.int8() |
| 2022.11.18 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | 8-bit weight, 8-bit activation (W8A8) |
| 2023.3.13 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | KV Cache 4-bit |
| 2023.5.23 | Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | Parameter-Efficient and Quantization-aware Adaptation (PEQA) [LLM-QAT] |
| 2023.5.23 | QLoRA: Efficient Finetuning of Quantized LLMs | QLoRA & NF4 (4-bit NormalFloat) [LLM-QAT] |
| 2023.5.29 | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | LLM Quantization-Aware Training [LLM-QAT] |
| 2023.6.1 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | Activation-aware Weight Quantization (AWQ) |
| 2023.6.13 | SqueezeLLM: Dense-and-Sparse Quantization | 3-bit weight quantization |
| 2024.1.31 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | KV Cache 2/3/4-bit |
| 2024.2.5 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | KV Cache 2-bit (see the sketch below) |
| 2024.2.26 | A Comprehensive Evaluation of Quantization Strategies for Large Language Models | PTQ evaluation |
| 2024.3.8 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | KV Cache compression |
| 2024.6.5 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | KV Cache 3-bit |
| 2024.11.26 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | KV Cache recomputation |
| 2025.1.25 | RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations | KV Cache 2-bit |
| 2025.2.4 | ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization | Low-bit LLM quantization |
| 2025.2.15 | CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs | KV Cache 1-bit |
| 2025.3.25 | LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation | KV Cache 2-bit |
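
Several of the KV-cache entries above (KIVI, KVQuant, RotateKV, LogQuant) build on asymmetric round-to-nearest quantization. A minimal sketch, assuming per-channel min/max scaling; this is the generic recipe, not any single paper's exact method:

```python
import numpy as np

def quantize_asym(x, bits=2, axis=0):
    # Map each slice along `axis` to integers in [0, 2^bits - 1].
    qmax = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard constant channels
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

k_cache = np.random.default_rng(1).normal(size=(128, 64)).astype(np.float32)
q, s, z = quantize_asym(k_cache, bits=2, axis=0)      # per-channel, as KIVI uses for keys
print(np.abs(dequantize(q, s, z) - k_cache).mean())   # mean reconstruction error
```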

### MoE

| Date | Paper | Key Words |
| --- | --- | --- |
| 2021.1.11 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Mixture of Experts (MoE); see the sketch below |
| 2024.1.11 | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | DeepSeekMoE |
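
As a companion to the MoE entries above, a minimal sketch of top-1 ("switch") routing; the gate, expert shapes, and probability scaling are illustrative, not the papers' implementations:

```python
import numpy as np

def switch_route(x, W_gate, experts):
    # x: (tokens, d); W_gate: (d, n_experts); experts: list of callables.
    logits = x @ W_gate
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax gate
    top1 = probs.argmax(axis=-1)                      # one expert per token
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        sel = top1 == e
        if sel.any():
            # scale each token's output by its gate probability
            out[sel] = expert(x[sel]) * probs[sel, e][:, None]
    return out

rng = np.random.default_rng(2)
d, n_exp = 8, 4
experts = [lambda h, W=rng.normal(size=(d, d)): h @ W for _ in range(n_exp)]
x = rng.normal(size=(10, d))
print(switch_route(x, rng.normal(size=(d, n_exp)), experts).shape)  # (10, 8)
```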

### Inference

| Date | Paper | Key Words |
| --- | --- | --- |
| 2017.6.12 | Attention Is All You Need | Transformer & attention |
| 2018.6.11 | Improving Language Understanding by Generative Pre-Training | GPT (generative pre-training) |
| 2018.10.11 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | BERT (Bidirectional Encoder Representations from Transformers) |
| 2019.1.9 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Transformer-XL (extra long) |
| 2019.5.17 | ERNIE: Enhanced Language Representation with Informative Entities | Knowledge graphs with BERT |
| 2024.1.19 | Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | Speculative decoding: Medusa (see the sketch below) |
| 2024.1.26 | EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | Speculative decoding: EAGLE |
| 2024.2.27 | Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations | LLMs for large-scale recommendation systems |
| 2024.3.19 | When Do We Not Need Larger Vision Models? | Scaling on Scales |
| 2024.6.24 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | Speculative decoding: EAGLE-2 |
| 2024.7.19 | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | Dynamic token pruning |
| 2024.7.28 | Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights | Multimodal advertising representations |
| 2024.8.22 | NanoFlow: Towards Optimal Large Language Model Serving Throughput | Serving framework: NanoFlow |
| 2025.3.3 | EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test | Speculative decoding: EAGLE-3 |
| 2025.5.8 | Scaling Laws for Speculative Decoding | Speculative decoding scaling laws |
| 2025.5.12 | PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications | Prefill-only inference |
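
The Medusa/EAGLE entries above all follow the same draft-then-verify loop. A minimal greedy-verification sketch; `draft_fn` and `target_fn` are toy stand-ins, not real model APIs:

```python
def speculative_step(prefix, draft_fn, target_fn, k=4):
    # Draft k cheap tokens, then score the whole sequence with one target pass.
    proposal = draft_fn(prefix, k)
    seq = prefix + proposal
    nxt = target_fn(seq)      # nxt[j] = target's greedy next token after seq[:j+1]
    L, a = len(prefix), 0
    while a < k and nxt[L + a - 1] == proposal[a]:
        a += 1                # accept while the target agrees with the draft
    # keep the accepted run plus one "free" token from the target itself
    return prefix + proposal[:a] + [nxt[L + a - 1]]

def target_fn(seq):           # toy deterministic "model": next = prev + 1 (mod 100)
    return [(t + 1) % 100 for t in seq]

def draft_fn(prefix, k):      # a perfect draft for this toy target
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

print(speculative_step([5, 6, 7], draft_fn, target_fn))  # [5, 6, 7, 8, 9, 10, 11, 12]
```

One target pass here yields five new tokens; the real methods differ mainly in how drafts are produced (extra decoding heads in Medusa, a feature-level autoregressive head in EAGLE) and in verifying trees of drafts rather than a single chain.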

### Transformer

| Date | Paper | Key Words |
| --- | --- | --- |
| 2020.10.22 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | Vision Transformer (ViT); see the sketch below |
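
The ViT entry's "16x16 words" reduce to a simple patch-flattening step before the linear projection. A minimal sketch with illustrative shapes:

```python
import numpy as np

def patchify(img, p=16):
    # (H, W, C) -> (num_patches, p*p*C): each patch becomes one "word"
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3), dtype=np.float32)
print(patchify(img).shape)  # (196, 768): 14*14 patches of 16*16*3 values
```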

### Prompt Engineering

| Date | Paper | Key Words |
| --- | --- | --- |
| 2025.10.6 | Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper) | Prompt politeness vs. LLM accuracy |

### AI Agent

| Date | Paper | Key Words |
| --- | --- | --- |
| 2023.2.9 | Toolformer: Language Models Can Teach Themselves to Use Tools | Agent / RAG concepts |
| 2024.2.2 | TravelPlanner: A Benchmark for Real-World Planning with Language Agents | Real-world planning |
| 2024.4.18 | Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools | Real-world planning |

### Others

| Date | Paper | Key Words |
| --- | --- | --- |
| 2019.10.2 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | Distilled BERT & knowledge distillation |
| 2019.10.23 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | T5 (encoder-decoder) & unified text-to-text transformer |
| 2020.5.22 | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Retrieval-Augmented Generation (RAG) |
| 2020.10.29 | AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts | Automatically generated prompts |
| 2021.4.20 | RoFormer: Enhanced Transformer with Rotary Position Embedding | RoPE (see the sketch below) |
| 2021.6.17 | LoRA: Low-Rank Adaptation of Large Language Models | LoRA |
| 2021.7.7 | Evaluating Large Language Models Trained on Code | Codex & fine-tuning |
| 2021.9.3 | Finetuned Language Models Are Zero-Shot Learners | FLAN & instruction tuning |
| 2021.12.13 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | GLaM & MoE |
| 2021.12.17 | WebGPT: Browser-assisted question-answering with human feedback | WebGPT |
| 2022.1.28 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Chain-of-Thought |
| 2022.10.6 | ReAct: Synergizing Reasoning and Acting in Language Models | ReAct: synergizing reasoning + acting |
| 2023.5.31 | Let's Verify Step by Step | Process-supervised Reward Models (PRM) |
| 2025.5.14 | Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | DeepSeek's AI architectures |
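
Among the entries above, RoPE (RoFormer) is compact enough to sketch outright: each (even, odd) feature pair of a query or key is rotated by a position-dependent angle. The base 10000 follows the paper; shapes are illustrative:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, head_dim) with even head_dim; rotate each (even, odd) pair
    n, d = x.shape
    ang = np.arange(n)[:, None] * base ** (-np.arange(0, d, 2) / d)  # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(3).normal(size=(6, 8))
# rotations preserve norms, so only relative positions affect q-k dot products
print(np.allclose(np.linalg.norm(rope(q), axis=1), np.linalg.norm(q, axis=1)))  # True
```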

## Algorithm

| Date | Paper | Key Words |
| --- | --- | --- |
| 1972 | Reducibility Among Combinatorial Problems | Karp's 21 NP-complete problems |
| 1973 | An n^{5/2} algorithm for maximum matchings in bipartite graphs | Hopcroft-Karp algorithm |
| 2002 | A 27/26-Approximation Algorithm for the Chromatic Sum Coloring of Bipartite Graphs | Chromatic sum coloring of bipartite graphs |
| 2015.6.16 | An Efficient Data Structure for Processing Palindromes in Strings | Palindromic tree |
| 2017.8.11 | An Introduction to Quantum Computing, Without the Physics | Quantum computing without the physics |
| 2018.7.30 | A Simple Near-Linear Pseudopolynomial Time Randomized Algorithm for Subset Sum | Randomized subset sum (see the sketch below) |
| 2021.2.11 | Hybrid Neural Fusion for Full-frame Video Stabilization | Video stabilization |
| 2022.11.21 | The Berlekamp-Massey Algorithm revisited | Berlekamp-Massey algorithm |
| 2025.4.23 | Breaking the Sorting Barrier for Directed Single-Source Shortest Paths | O(m log^{2/3} n)-time single-source shortest paths |
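
For context on the 2018 subset-sum entry, the classic baseline it improves on fits in a few lines; this is the standard bitset DP, not the paper's near-linear randomized algorithm:

```python
def subset_sum(nums, t):
    # bit i of `reachable` is set iff some subset of nums sums to i
    reachable = 1
    for x in nums:
        reachable |= reachable << x
    return bool((reachable >> t) & 1)

print(subset_sum([3, 5, 7], 12), subset_sum([3, 5, 7], 11))  # True False
```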
