FairyFali/SLMs-Survey


SLM Survey


A Comprehensive Survey of Small Language Models: Technology, On-Device Applications, Efficiency, Enhancements for LLMs, and Trustworthiness

This repo includes the papers discussed in our survey of small language models.
📖 Read the full paper here: Paper Link

News

  • 2024/11/04: The first version of our survey is on arXiv!

Reference

If our survey is useful for your research, please cite our paper:

@article{wang2024comprehensive,
  title={A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness},
  author={Wang, Fali and Zhang, Zhiwei and Zhang, Xianren and Wu, Zongyu and Mo, Tzuhao and Lu, Qiuhao and Wang, Wanjing and Li, Rui and Xu, Junjie and Tang, Xianfeng and others},
  journal={arXiv preprint arXiv:2411.03350},
  year={2024}
}

Overview of SLMs

Overview of Small Language Models

Timeline of SLMs

Timeline of Small Language Models

SLMs Paper List

Existing SLMs

| Model | #Params | Date | Paradigm | Domain | Code | HF Model | Paper/Blog |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.2 | 1B; 3B | 2024.9 | Pre-train | Generic | Github | HF | Blog |
| Qwen 1 | 1.8B; 7B; 14B; 72B | 2023.12 | Pre-train | Generic | Github | HF | Paper |
| Qwen 1.5 | 0.5B; 1.8B; 4B; 7B; 14B; 32B; 72B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
| Qwen 2 | 0.5B; 1.5B; 7B; 57B; 72B | 2024.6 | Pre-train | Generic | Github | HF | Paper |
| Qwen 2.5 | 0.5B; 1.5B; 3B; 7B; 14B; 32B; 72B | 2024.9 | Pre-train | Generic | Github | HF | Paper |
| Gemma | 2B; 7B | 2024.2 | Pre-train | Generic | - | HF | Paper |
| Gemma 2 | 2B; 9B; 27B | 2024.7 | Pre-train | Generic | - | HF | Paper |
| H2O-Danube3 | 500M; 4B | 2024.7 | Pre-train | Generic | - | HF | Paper |
| Fox-1 | 1.6B | 2024.6 | Pre-train | Generic | - | HF | Blog |
| Rene | 1.3B | 2024.5 | Pre-train | Generic | - | HF | Paper |
| MiniCPM | 1.2B; 2.4B | 2024.4 | Pre-train | Generic | Github | HF | Paper |
| OLMo | 1B; 7B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
| TinyLlama | 1B | 2024.1 | Pre-train | Generic | Github | HF | Paper |
| Phi-1 | 1.3B | 2023.6 | Pre-train | Coding | - | HF | Paper |
| Phi-1.5 | 1.3B | 2023.9 | Pre-train | Generic | - | HF | Paper |
| Phi-2 | 2.7B | 2023.12 | Pre-train | Generic | - | HF | Paper |
| Phi-3 | 3.8B; 7B; 14B | 2024.4 | Pre-train | Generic | - | HF | Paper |
| Phi-3.5 | 3.8B; 4.2B; 6.6B | 2024.4 | Pre-train | Generic | - | HF | Paper |
| OpenELM | 270M; 450M; 1.1B; 3B | 2024.4 | Pre-train | Generic | Github | HF | Paper |
| MobiLlama | 0.5B; 0.8B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
| MobileLLM | 125M; 350M | 2024.2 | Pre-train | Generic | Github | HF | Paper |
| StableLM | 3B; 7B | 2023.4 | Pre-train | Generic | Github | HF | Paper |
| StableLM 2 | 1.6B | 2024.2 | Pre-train | Generic | Github | HF | Paper |
| Cerebras-GPT | 111M-13B | 2023.4 | Pre-train | Generic | - | HF | Paper |
| BLOOM, BLOOMZ | 560M; 1.1B; 1.7B; 3B; 7.1B; 176B | 2022.11 | Pre-train | Generic | - | HF | Paper |
| OPT | 125M; 350M; 1.3B; 2.7B; 5.7B | 2022.5 | Pre-train | Generic | - | HF | Paper |
| XGLM | 1.7B; 2.9B; 7.5B | 2021.12 | Pre-train | Generic | Github | HF | Paper |
| GPT-Neo | 125M; 350M; 1.3B; 2.7B | 2021.5 | Pre-train | Generic | Github | - | Paper |
| Megatron-gpt2 | 355M; 2.5B; 8.3B | 2019.9 | Pre-train | Generic | Github | - | Paper, Blog |
| MINITRON | 4B; 8B; 15B | 2024.7 | Pruning and Distillation | Generic | Github | HF | Paper |
| Orca 2 | 7B | 2023.11 | Distillation | Generic | - | HF | Paper |
| Dolly-v2 | 3B; 7B; 12B | 2023.4 | Instruction tuning | Generic | Github | HF | Blog |
| LaMini-LM | 61M-7B | 2023.4 | Distillation | Generic | Github | HF | Blog |
| Specialized FlanT5 | 250M; 760M; 3B | 2023.1 | Instruction Tuning | Generic (math) | Github | - | Paper |
| FlanT5 | 80M; 250M; 780M; 3B | 2022.10 | Instruction Tuning | Generic | Github | HF | Paper |
| T5 | 60M; 220M; 770M; 3B; 11B | 2019.9 | Pre-train | Generic | Github | HF | Paper |

SLM Architecture

  1. Transformer: Attention is all you need. Ashish Vaswani et al. NeurIPS 2017.
  2. Mamba 1: Mamba: Linear-time sequence modeling with selective state spaces. Albert Gu and Tri Dao. COLM 2024. [Paper].
  3. Mamba 2: Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. Tri Dao and Albert Gu. ICML 2024. [Paper] [Code]
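The attention primitive that the Transformer entry builds on (and that Mamba replaces with a selective state-space recurrence) is compact enough to sketch in pure Python. A toy, single-head version with no batching, masking, or multi-head logic; all names here are ours:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017).

    Q, K, V are lists of vectors (lists of floats); returns one
    output vector per query: softmax(q.K^T / sqrt(d)) . V
    """
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# One query attending to two keys; weights sum to 1, so the output
# interpolates between the two value vectors.
y = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```

Because the query aligns with the first key, the output puts most of its weight (roughly 0.67 here) on the first value vector.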

Enhancement for SLM

Training from scratch

  1. MobiLlama: "MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT". Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan. arXiv 2024. [Paper] [Github] [HuggingFace]
  2. MobileLLM: "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases". Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra ICML 2024. [Paper] [Github] [HuggingFace]
  3. Rethinking optimization and architecture for tiny language models. Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, and Yunhe Wang. ICML 2024. [Paper] [Code]
  4. MindLLM: "MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications". Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao. arXiv 2023. [Paper] [HuggingFace]

Supervised fine-tuning

  1. Direct preference optimization: Your language model is secretly a reward model. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. NeurIPS, 2024. [Paper] [Code]
  2. Enhancing chat language models by scaling high-quality instructional conversations. Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. EMNLP 2023. [Paper] [Code]
  3. SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification. Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Huggingface, 2023. [Data]
  4. Stanford Alpaca: An Instruction-following LLaMA model. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub, 2023. [Blog] [Github] [HuggingFace]
  5. OpenChat: Advancing Open-source Language Models with Mixed-Quality Data. Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. ICLR, 2024. [Paper] [Code] [HuggingFace]
  6. RLHF: "Training language models to follow instructions with human feedback". Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe. NeurIPS 2022. [Paper]
  7. MobileBERT: "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices". Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou. ACL 2020. [Paper] [Github] [HuggingFace]
  8. Language models are unsupervised multitask learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. OpenAI Blog, 2019. [Paper]
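Among the entries above, DPO (item 1) replaces an explicit reward model with a closed-form loss over preference pairs. A minimal single-pair sketch, assuming summed token log-probabilities are already available; the function and argument names are ours:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l are the policy's total log-probs of the chosen and
    rejected responses; ref_logp_* are the frozen reference model's.
    Loss = -log sigmoid(beta * margin), where the margin compares how
    much the policy prefers the winner relative to the reference.
    """
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does:
# positive margin, so the loss drops below log(2) (the zero-margin value).
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
```

Minimizing this loss pushes the policy's margin on preferred responses up without ever training a separate reward model.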

Data quality in KD

  1. TinyStory: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?". Ronen Eldan, Yuanzhi Li. 2023. [Paper] [HuggingFace]
  2. AS-ES: "AS-ES Learning: Towards Efficient CoT Learning in Small Models". Nuwa Xi, Yuhan Chen, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu. 2024. [Paper]
  3. Self-Amplify: "Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations". Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Marie-Jeanne Lesot. 2024. [Paper]
  4. Large Language Models Can Self-Improve. Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. EMNLP 2023. [Paper]
  5. Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing. Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. NeurIPS 2024. [Paper] [Code]

Distillation for SLM

  1. GKD: "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes". Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem. ICLR 2024. [Paper][Website] [PowerPoint]
  2. DistiLLM: "DistiLLM: Towards Streamlined Distillation for Large Language Models". Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun. ICML 2024. [Paper] [Github]
  3. Adapt-and-Distill: "Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains". Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, Furu Wei. ACL 2021. [Paper] [Github]

Quantization

  1. SmoothQuant: "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models". Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han. ICML 2023. [Paper] [Github][Slides][Video]
  2. BiLLM: "BiLLM: Pushing the Limit of Post-Training Quantization for LLMs". Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi. 2024. [Paper] [Github]
  3. LLM-QAT: "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models". Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra. 2023. [Paper]
  4. PB-LLM: "PB-LLM: Partially Binarized Large Language Models". Zhihang Yuan, Yuzhang Shang, Zhen Dong. ICLR 2024. [Paper] [Github]
  5. OneBit: "OneBit: Towards Extremely Low-bit Large Language Models". Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che. NeurIPS 2024. [Paper]
  6. BitNet: "BitNet: Scaling 1-bit Transformers for Large Language Models". Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei. 2023. [Paper]
  7. BitNet b1.58: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits". Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei. 2024. [Paper]
  8. SqueezeLLM: "SqueezeLLM: Dense-and-Sparse Quantization". Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer. ICML 2024. [Paper] [Github]
  9. JSQ: "Compressing Large Language Models by Joint Sparsification and Quantization". Jinyang Guo, Jianyu Wu, Zining Wang, Jiaheng Liu, Ge Yang, Yifu Ding, Ruihao Gong, Haotong Qin, Xianglong Liu. ICML 2024. [Paper] [Github]
  10. FrameQuant: "FrameQuant: Flexible Low-Bit Quantization for Transformers". Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh. 2024. [Paper] [Github]
  11. LQER: "LQER: Low-Rank Quantization Error Reconstruction for LLMs". Cheng Zhang, Jianyi Cheng, George A. Constantinides, Yiren Zhao. ICML 2024. [Paper] [Github]
  12. I-LLM: "I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models". Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou. 2024. [Paper] [Github]
  13. PV-Tuning: "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression". Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik. 2024. [Paper]
  14. PEQA: "Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization". Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee. NeurIPS 2023. [Paper]
  15. QLoRA: "QLoRA: Efficient Finetuning of Quantized LLMs". Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. NeurIPS 2023. [Paper] [Github]
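Most of the methods above start from, and then improve on, plain round-to-nearest quantization. A minimal symmetric per-tensor int8 sketch (illustrative only; SmoothQuant, AWQ, and SqueezeLLM differ in how scales are chosen and which weights stay in higher precision):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: scale by max |w| so the
    largest weight maps to 127, then round to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the integer codes.
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.003, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Round-to-nearest bounds the per-weight error by half a quantization step (scale/2); the papers above attack the cases where outlier channels make that step too coarse.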

LLM techniques for SLMs

  1. Ma et al.: "Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!". Yubo Ma, Yixin Cao, YongChing Hong, Aixin Sun. EMNLP 2023. [Paper] [Github]
  2. MoQE: "Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness". Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. 2023. [Paper]
  3. SLM-RAG: "Can Small Language Models With Retrieval-Augmented Generation Replace Large Language Models When Learning Computer Science?". Suqing Liu, Zezhu Yu, Feiran Huang, Yousef Bulbulia, Andreas Bergen, Michael Liut. ITiCSE 2024. [Paper]

Task-specific SLM Applications

SLM in QA

  1. Alpaca: "Alpaca: A Strong, Replicable Instruction-Following Model". Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. 2023. [Paper] [Github] [HuggingFace] [Website]
  2. Stable Beluga 7B: "Stable Beluga 2". Mahan, Dakota and Carlow, Ryan and Castricato, Louis and Cooper, Nathan and Laforte, Christian. 2023. [HuggingFace]
  3. Fine-tuned BioGPT Guo et al.: "Improving Small Language Models on PubMedQA via Generative Data Augmentation". Zhen Guo, Peiqi Wang, Yanwei Wang, Shangdi Yu. 2023. [Paper]
  4. Financial SLMs: "Fine-tuning Smaller Language Models for Question Answering over Financial Documents". Karmvir Singh Phogat, Sai Akhil Puranam, Sridhar Dasaratha, Chetan Harsha, Shashishekar Ramakrishna. 2024. [Paper]
  5. ColBERT: "ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering". Alex Gichamba, Tewodros Kederalah Idris, Brian Ebiyau, Eric Nyberg, Teruko Mitamura. IEEE 2024. [Paper]
  6. T-SAS: "Test-Time Self-Adaptive Small Language Models for Question Answering". Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Hwang, Jong Park. ACL 2023. [Paper] [Github]
  7. Rationale Ranking: "Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval". Tim Hartill, Diana Benavides-Prado, Michael Witbrock, Patricia J. Riddle. 2023. [Paper]

SLM in Coding

  1. Phi-3.5-mini: "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone". Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, ..., Chunyu Wang, Guanhua Wang, Lijuan Wang et al. 2024. [Paper] [HuggingFace] [Website]
  2. TinyLlama: "TinyLlama: An Open-Source Small Language Model". Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. 2024. [Paper] [HuggingFace] [Chat Demo] [Discord]
  3. CodeLlama: "Code Llama: Open Foundation Models for Code". Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, ..., Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. 2024. [Paper] [HuggingFace]
  4. CodeGemma: "CodeGemma: Open Code Models Based on Gemma". CodeGemma Team: Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, ..., Kathy Korevec, Kelly Schaefer, Scott Huffman. 2024. [Paper] [HuggingFace]

SLM in Recommendation

  1. PromptRec: "Could Small Language Models Serve as Recommenders? Towards Data-centric Cold-start Recommendations". Xuansheng Wu, Huachi Zhou, Yucheng Shi, Wenlin Yao, Xiao Huang, Ninghao Liu. 2024. [Paper] [Github]
  2. SLIM: "Can Small Language Models be Good Reasoners for Sequential Recommendation?". Yuling Wang, Changxin Tian, Binbin Hu, Yanhua Yu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Liang Pang, Xiao Wang. 2024. [Paper]
  3. BiLLP: "Large Language Models are Learnable Planners for Long-Term Recommendation". Wentao Shi, Xiangnan He, Yang Zhang, Chongming Gao, Xinyue Li, Jizhi Zhang, Qifan Wang, Fuli Feng. 2024. [Paper]
  4. ONCE: "ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models". Qijiong Liu, Nuo Chen, Tetsuya Sakai, Xiao-Ming Wu. WSDM 2024. [Paper] [Github]
  5. RecLoRA: "Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation". Jiachen Zhu, Jianghao Lin, Xinyi Dai, Bo Chen, Rong Shan, Jieming Zhu, Ruiming Tang, Yong Yu, Weinan Zhang. 2024. [Paper]

SLM in Web Search

  1. Content encoder: "Pre-training Tasks for Embedding-based Large-scale Retrieval". Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar. ICLR 2020. [Paper]
  2. Poly-encoders: "Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring". Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston. ICLR 2020. [Paper]
  3. Twin-BERT: "TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval". Wenhao Lu, Jian Jiao, Ruofei Zhang. 2020. [Paper]
  4. H-ERNIE: "H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search". Xiaokai Chu, Jiashu Zhao, Lixin Zou, Dawei Yin. SIGIR 2022. [Paper]
  5. Ranker: "Passage Re-ranking with BERT". Rodrigo Nogueira, Kyunghyun Cho. 2019. [Paper] [Github]
  6. Rewriter: "Query Rewriting for Retrieval-Augmented Large Language Models". Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, Nan Duan. EMNLP 2023. [Paper] [Github]

SLM in Mobile-device

  1. Octopus: "Octopus: On-device language model for function calling of software APIs". Wei Chen, Zhiyuan Li, Mingyuan Ma. 2024. [Paper] [HuggingFace]
  2. MobileAgent: "Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration". Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang. 2024. [Paper] [Github] [HuggingFace]
  3. Revolutionizing Mobile Interaction: "Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile". Samuel Carreira, Tomás Marques, José Ribeiro, Carlos Grilo. 2023. [Paper]
  4. AutoDroid: "AutoDroid: LLM-powered Task Automation in Android". Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu. 2023. [Paper]
  5. On-device Agent for Text Rewriting: "Towards an On-device Agent for Text Rewriting". Yun Zhu, Yinxiao Liu, Felix Stahlberg, Shankar Kumar, Yu-hui Chen, Liangchen Luo, Lei Shu, Renjie Liu, Jindong Chen, Lei Meng. 2023. [Paper]

On-device Deployment Optimization Techniques

Memory Efficiency Optimization

  1. EDGE-LLM: "EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting". Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin. 2024. [Paper] [Github]
  2. LLM-PQ: "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization". Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu. 2024. [Paper] [Github]
  3. AWQ: "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration". Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han. MLSys 2024. [Paper] [Github]
  4. MobileAIBench: "MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases". Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese. 2024. [Paper] [Github]
  5. MobileLLM: "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases". Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra. ICML 2024. [Paper] [Github] [HuggingFace]
  6. EdgeMoE: "EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models". Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu. 2023. [Paper] [Github]
  7. GEAR: "GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM". Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao. 2024. [Paper] [Github]
  8. DMC: "Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference". Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti. 2024. [Paper]
  9. Transformer-Lite: "Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs". Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie. 2024. [Paper]
  10. LLMaaS: "LLM as a System Service on Mobile Devices". Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu. 2024. [Paper]
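Several of the entries above (GEAR, DMC, LLMaaS) trim KV-cache memory, which dominates on-device footprint at long contexts. As a crude illustration of cache budgeting (deliberately not any of these methods), here is a fixed-budget cache that keeps a few initial tokens plus a recent window and evicts the middle; the class name and eviction policy are ours:

```python
class SlidingKVCache:
    """Fixed-budget KV cache: retain the first `sink` tokens plus the
    most recent `window` tokens, dropping everything in between.
    Real systems (GEAR, DMC) instead compress entries with quantization
    or learned merging rather than discarding them outright."""

    def __init__(self, sink=4, window=8):
        self.sink_size = sink
        self.window_size = window
        self.sink = []    # earliest tokens, never evicted
        self.recent = []  # rolling window of latest tokens

    def append(self, kv):
        if len(self.sink) < self.sink_size:
            self.sink.append(kv)
        else:
            self.recent.append(kv)
            if len(self.recent) > self.window_size:
                self.recent.pop(0)  # evict oldest non-sink entry

    def contents(self):
        return self.sink + self.recent
```

Whatever the sequence length, memory stays bounded at sink + window entries, at the cost of losing exact attention to the evicted middle.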

Runtime Efficiency Optimization

  1. EdgeMoE: "EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models". Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu. 2023. [Paper] [Github]
  2. LLMCad: "LLMCad: Fast and Scalable On-device Large Language Model Inference". Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, Xuanzhe Liu. 2023. [Paper]
  3. LinguaLinked: "LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices". Junchen Zhao, Yurun Song, Simeng Liu, Ian G. Harris, Sangeetha Abdu Jyothi. 2023. [Paper]
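LLMCad's draft-then-verify loop rests on the standard speculative-sampling acceptance rule: accept the small model's token with probability min(1, p_large/p_small), otherwise resample from the normalized residual, which makes the combined procedure sample exactly from the large model's distribution. A toy single-token sketch over an explicit vocabulary; all names are ours:

```python
import random

def speculative_step(p_small, p_large, rng=random.Random(0)):
    """One token of speculative decoding over an explicit vocabulary.

    A small drafter samples from p_small; the large model accepts with
    probability min(1, p_large/p_small), else we resample from the
    residual max(0, p_large - p_small), renormalized. (The shared
    default rng keeps repeated calls reproducible.)
    """
    vocab = list(range(len(p_small)))
    draft = rng.choices(vocab, weights=p_small)[0]
    if rng.random() < min(1.0, p_large[draft] / p_small[draft]):
        return draft, True  # draft accepted by the verifier
    residual = [max(0.0, pl - ps) for pl, ps in zip(p_large, p_small)]
    z = sum(residual) or 1.0
    return rng.choices(vocab, weights=[r / z for r in residual])[0], False
```

Over many steps the emitted tokens follow p_large exactly, while accepted drafts cost only a (cheap) small-model forward pass plus a batched verification.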

SLMs enhance LLMs

SLMs for LLMs Calibration

  1. Calibrating Large Language Models Using Their Generations Only. Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, Seong Joon Oh. ACL 2024 Long. [pdf] [code]
  2. Pareto Optimal Learning for Estimating Large Language Model Errors. Theodore Zhao, Mu Wei, J. Samuel Preston, Hoifung Poon. ACL 2024 Long. [pdf]
  3. The Internal State of an LLM Knows When It’s Lying. Amos Azaria, Tom Mitchell. EMNLP 2023 Findings. [pdf]
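For contrast with the learned calibrators above, the simplest post-hoc calibrator is a single temperature fitted on held-out logits; the papers in this section instead predict confidence from the model's generations or internal states. A grid-search sketch on toy binary logits; all names and data are ours:

```python
import math

def nll(logits_list, labels, T):
    """Average negative log-likelihood of the labels at temperature T."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        m = max(logits)
        z = sum(math.exp((x - m) / T) for x in logits)
        total += -((logits[y] - m) / T - math.log(z))
    return total / len(labels)

def fit_temperature(logits_list, labels, grid=None):
    """Grid-search the single temperature minimizing held-out NLL,
    the classic temperature-scaling calibrator."""
    grid = grid or [0.25 * i for i in range(1, 41)]  # 0.25 .. 10.0
    return min(grid, key=lambda T: nll(logits_list, labels, T))

# Overconfident toy model: always ~100% sure, but wrong 1 time in 4.
logits = [[10.0, 0.0]] * 4
labels = [0, 0, 0, 1]
T_hat = fit_temperature(logits, labels)
```

An overconfident model gets a fitted temperature well above 1, flattening its probabilities toward its actual accuracy.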

SLMs for LLMs RAG

  1. Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs. Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, Ji-Rong Wen. ACL 2024 Long. [pdf] [code] [huggingface]
  2. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi. ICLR 2024 Oral. [pdf] [huggingface] [code] [website] [model] [data]
  3. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. ICLR 2024 Workshop ME-FoMo Poster. [pdf]
  4. Corrective Retrieval Augmented Generation. Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, Zhen-Hua Ling. arXiv 2024.1. [pdf] [code]
  5. Self-Knowledge Guided Retrieval Augmentation for Large Language Models. Yile Wang, Peng Li, Maosong Sun, Yang Liu. EMNLP 2023 Findings. [pdf] [code]
  6. In-Context Retrieval-Augmented Language Models. Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham. TACL 2023. [pdf] [code]
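The retrieval half these papers build on can be sketched with a toy bag-of-words retriever (production systems use dense encoders; the similarity metric and all names here are ours):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real RAG pipelines use a small
    # dense retriever (often itself an SLM) instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Rank passages by similarity to the query and return the top k.
    The papers above add when-to-retrieve decisions, self-critique,
    and prompt compression on top of this basic step."""
    qv = embed(query)
    scored = sorted(corpus, key=lambda d: cosine(qv, embed(d)), reverse=True)
    return scored[:k]

docs = ["small language models run on phones",
        "tomatoes grow best in full sun"]
top = retrieve("which models run on a phone?", docs)
```

The retrieved passage(s) are then prepended to the LLM's prompt, which is exactly the in-context RAG setup of Ram et al. above.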

Star History

Star History Chart