This repository contains the code for our paper, "Surgical Feature-Space Decomposition of LLMs: Why, When and How?". The paper, by Arnav Chavan, Nahush Lele, and Deepak Gupta, was published at the Association for Computational Linguistics (ACL), 2024.
Follow the steps outlined below to reproduce our results. The initial decomposition can be executed on a CPU-only machine, while the surgical rank search experiments require a single NVIDIA L4 GPU.
To run the evaluation functions in this repository, it is necessary to pull the master branch of lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness/tree/master) and run the command 'pip install -e .' from inside the pulled repository.
- Efficient: This is a zero-shot compression algorithm that requires no training steps, so it does not need large GPU resources.
- Targeted Compression: Allows compression in a task-specific as well as a task-agnostic (perplexity-based) way.
- Compression Control: The Surgical Rank Search process gives finer-grained control over the compressed models' budgets.
- Bias Reduction: As added benefits of this method, compressed models show reduced stereotype biases and undergo significant unlearning-learning.
Almost all LLMs composed of repeated Attention Block + MLP Block modules are readily supported, with minimal to no adjustments required to the --layers argument.
The table below shows the results of our experiments comparing Feature Space Decomposition, Weight Space Decomposition, and LLM-Pruner. The decomposition experiments apply uniform sparsity to a subset of the LLM layers to achieve the desired budget.
Decomposition | #Params (B) | #MACS | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | Average |
---|---|---|---|---|---|---|---|---|---|
Baseline | 6.7 | 423.93 | 75.04 | 78.67 | 76.22 | 70.00 | 72.85 | 44.88 | 69.61 |
Feature Space (Ours) | 5.4 | 339.99 | 74.34 | 74.86 | 66.72 | 67.40 | 66.33 | 39.42 | 64.68 |
Weight Space | 5.4 | 339.99 | 62.20 | 62.57 | 43.91 | 58.80 | 44.95 | 30.03 | 50.41 |
LLM-Pruner | 5.4 | 339.60 | 57.06 | 75.68 | 66.80 | 59.83 | 60.94 | 36.52 | 59.47 |
Feature Space (Ours) | 3.4 | 215.61 | 62.02 | 61.37 | 34.64 | 56.43 | 40.32 | 28.75 | 47.25 |
Weight Space | 3.4 | 215.61 | 62.08 | 53.59 | 27.88 | 48.46 | 27.15 | 27.05 | 41.10 |
LLM-Pruner | 3.4 | 206.59 | 52.32 | 59.63 | 35.64 | 53.20 | 33.50 | 27.22 | 43.58 |
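To make the difference between the two decomposition styles compared above more concrete, here is a minimal NumPy sketch. It is not taken from this repository's code. It assumes that weight-space decomposition means a truncated SVD of the weight matrix, and that feature-space decomposition means projecting the layer's outputs onto the top eigenvectors of the output covariance computed on calibration inputs. The function names and the toy data are hypothetical.

```python
import numpy as np

def weight_space_factors(W, rank):
    """Truncated SVD of the weight matrix itself, so that W ~= A @ B."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (d_out, rank)
    B = Vt[:rank, :]                    # (rank, d_in)
    return A, B

def feature_space_factors(W, X, rank):
    """Low-rank factors built from the layer's output features on X.

    Y = X @ W.T holds the output features. The eigenvectors of Y.T @ Y
    with the largest eigenvalues span the directions carrying most of
    the output energy (assumed meaning of feature-space decomposition).
    """
    Y = X @ W.T                         # (n_samples, d_out)
    cov = Y.T @ Y                       # (d_out, d_out), symmetric
    _, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :rank]      # top-`rank` eigenvectors
    return V, V.T @ W                   # A: (d_out, rank), B: (rank, d_in)

def output_error(W, A, B, X):
    """Relative Frobenius error of the layer output on X."""
    Y = X @ W.T
    Y_hat = X @ (A @ B).T
    return np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)

# Toy calibration data with unevenly scaled input features (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32)) * np.linspace(0.1, 3.0, 32)
W = rng.normal(size=(16, 32))

err_weight = output_error(W, *weight_space_factors(W, 4), X)
err_feature = output_error(W, *feature_space_factors(W, X, 4), X)
print(f"weight-space error:  {err_weight:.4f}")
print(f"feature-space error: {err_feature:.4f}")
```

On the calibration inputs themselves, the feature-space factors in this sketch give the best possible rank-limited approximation of the outputs, so their error is never larger than the weight-space error at the same rank. How well that carries over to unseen data depends on the calibration set.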
Below are the results of a task-specific rank search. The search aims to maintain performance on a 20% split of the evaluation set, and the numbers are reported on the disjoint 80% split. There is no specific budget constraint for the rank search; instead, it is conducted to achieve maximum compression while preserving performance. The rank search is performed individually for each dataset.
Model | Dataset | Accuracy (0 layers pruned) | Budget (0 layers pruned) | Accuracy (35 layers pruned) | Budget (35 layers pruned) | Accuracy (70 layers pruned) | Budget (70 layers pruned) | Accuracy (140 layers pruned) | Budget (140 layers pruned) |
---|---|---|---|---|---|---|---|---|---|
LLaMA-7B | PIQA | 78.23 | 100% | 77.21 | 96% | 76.60 | 93% | 75.78 | 89% |
LLaMA-7B | BoolQ | 75.56 | 100% | 75.03 | 90% | 74.80 | 84% | 73.50 | 76% |
LLaMA-7B | ARC-C | 44.82 | 100% | 43.86 | 94% | 41.73 | 90% | 42.16 | 86% |
LLaMA-7B | ARC-E | 72.21 | 100% | 70.32 | 93% | 69.26 | 87% | 67.68 | 84% |
LLaMA-7B | Winogrande | 70.09 | 100% | 69.50 | 90% | 69.79 | 80% | 62.69 | 71% |
LLaMA-7B | Hellaswag | 75.89 | 100% | 75.60 | 97% | 75.23 | 95% | 74.83 | 93% |
Mistral-7B | PIQA | 80.27 | 100% | 80.14 | 97% | 78.84 | 95% | 78.57 | 90% |
Mistral-7B | BoolQ | 83.79 | 100% | 84.17 | 99% | 83.94 | 97% | 83.98 | 94% |
Mistral-7B | ARC-C | 54.64 | 100% | 47.07 | 88% | 45.35 | 85% | 43.54 | 83% |
Mistral-7B | ARC-E | 79.11 | 100% | 78.68 | 92% | 77.32 | 90% | 77.21 | 88% |
Mistral-7B | Winogrande | 73.35 | 100% | 74.43 | 96% | 73.15 | 94% | 72.26 | 91% |
Mistral-7B | Hellaswag | 80.85 | 100% | 79.36 | 96% | 79.24 | 96% | 79.00 | 95% |
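The 20%/80% protocol used for the task-specific search can be sketched as follows. This is only an illustration of a disjoint search/test split: the shuffling, the seed, and the function name are assumptions, and the repository may build its splits differently.

```python
import random

def split_search_test(num_examples, search_fraction=0.2, seed=0):
    """Return two disjoint, sorted index lists: (search_ids, test_ids)."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)    # seeded for repeatability
    cut = int(round(num_examples * search_fraction))
    return sorted(indices[:cut]), sorted(indices[cut:])

# The rank search would only ever see `search_ids`;
# the reported accuracy would come from `test_ids`.
search_ids, test_ids = split_search_test(1000)
print(len(search_ids), len(test_ids))
```

Keeping the two index sets disjoint is what lets the reported accuracy serve as a held-out estimate rather than a number the search was tuned on.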
In the perplexity-based Surgical Rank Search, the goal is to achieve maximum compression while limiting the increase in perplexity to a fixed value, which is updated after compressing each layer. The WikiText-2 test set is divided into two disjoint splits containing 20% and 80% of the samples. The rank search uses the 20% split, and the perplexity numbers reported below are computed on the 80% split. For the commonsense reasoning tasks, the reported scores are based on evaluation on the full test set.
Model | Dataset | Budget 100% | Budget 94% | Budget 87% | Budget 83% | Budget 79% | Budget 75% | Budget 70% |
---|---|---|---|---|---|---|---|---|
LLaMA-7B | PIQA | 78.67 | 76.82 | 76.39 | 75.13 | 73.55 | 71.71 | 71.27 |
LLaMA-7B | BoolQ | 75.04 | 73.61 | 72.26 | 73.21 | 71.07 | 66.02 | 64.92 |
LLaMA-7B | ARC-C | 44.88 | 42.92 | 42.24 | 41.38 | 40.01 | 36.86 | 35.07 |
LLaMA-7B | ARC-E | 72.85 | 71.46 | 68.56 | 66.50 | 64.18 | 60.48 | 55.26 |
LLaMA-7B | Winogrande | 70.00 | 69.29 | 69.46 | 69.37 | 67.56 | 62.67 | 56.35 |
LLaMA-7B | Hellaswag | 76.22 | 74.15 | 71.65 | 69.09 | 65.67 | 60.29 | 52.62 |
LLaMA-7B | Average | 69.61 | 68.04 | 66.79 | 65.78 | 63.68 | 59.67 | 55.92 |
LLaMA-7B | Wikitext-2 (Perplexity) | 12.33 | 15.07 | 18.29 | 22.23 | 27.20 | 33.57 | 40.82 |
Model | Dataset | Budget 100% | Budget 97% | Budget 93% | Budget 90% | Budget 87% | Budget 83% | Budget 80% |
---|---|---|---|---|---|---|---|---|
Mistral-7B | PIQA | 80.52 | 80.01 | 78.73 | 78.84 | 76.99 | 75.68 | 75.84 |
Mistral-7B | BoolQ | 83.58 | 81.38 | 81.56 | 81.34 | 78.96 | 76.54 | 73.33 |
Mistral-7B | ARC-C | 54.01 | 52.90 | 49.74 | 47.61 | 43.51 | 38.22 | 37.20 |
Mistral-7B | ARC-E | 79.54 | 79.38 | 78.37 | 77.65 | 74.83 | 71.93 | 70.41 |
Mistral-7B | Winogrande | 74.03 | 74.51 | 73.56 | 72.06 | 70.80 | 65.43 | 64.48 |
Mistral-7B | Hellaswag | 81.05 | 79.91 | 77.65 | 75.77 | 71.80 | 66.13 | 60.37 |
Mistral-7B | Average | 75.46 | 74.69 | 73.27 | 72.21 | 69.48 | 65.56 | 63.60 |
Mistral-7B | Wikitext-2 (Perplexity) | 11.60 | 13.90 | 15.56 | 17.91 | 20.68 | 23.96 | 27.49 |
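The perplexity constraint described for this search can be illustrated with a small sketch. It assumes one possible reading of the description: each layer may raise perplexity by at most a fixed amount relative to a reference value, and the reference is reset to the compressed model's perplexity after each layer. The function names, the candidate ranks, and all numbers below are hypothetical and are not measured results.

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def compress_with_ppl_allowance(layer_candidates, base_ppl, max_increase):
    """Pick, layer by layer, the smallest rank within the allowance.

    layer_candidates: one list per layer of (rank, perplexity) pairs,
    standing in for measurements on the search split (toy data here).
    Returns the chosen ranks (None keeps a layer uncompressed) and the
    final reference perplexity.
    """
    reference = base_ppl
    chosen = []
    for candidates in layer_candidates:
        allowed = [(r, p) for r, p in candidates if p <= reference + max_increase]
        if allowed:
            rank, ppl = min(allowed)    # smallest rank that fits
            chosen.append(rank)
            reference = ppl             # update after compressing this layer
        else:
            chosen.append(None)
    return chosen, reference

toy_candidates = [
    [(64, 12.9), (128, 12.4), (256, 12.1)],
    [(64, 13.5), (128, 12.8), (256, 12.5)],
]
print(compress_with_ppl_allowance(toy_candidates, base_ppl=12.0, max_increase=0.5))
# ([128, 128], 12.8)
```

Because the reference moves with every compressed layer, the total perplexity increase can grow with the number of layers processed, which matches the steadily rising perplexity at lower budgets in the tables above.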
For detailed plots of model performance versus the parameters sparsified using surgical rank search, for all commonsense reasoning tasks, please refer to our paper.
Installing requirements
pip install -r requirements.txt
Step 1:
Run the decomposer.py script to create an instance of the chosen model, decompose all of its layers into low-rank matrices of maximum rank, and save a checkpoint. No GPU is required.
python3 decomposer.py --layers o_proj,q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj \
--dataset combination --batch_size 512 \
--seq_len 128 \
--log_path surgical_logs.txt \
--algo eigen \
--weights_name decomposed_mistral_combination.pt \
--model mistralai/Mistral-7B-v0.1
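As a rough guide to why Step 1 by itself does not shrink the model, the parameter arithmetic of a two-factor decomposition can be sketched as below. It assumes each decomposed layer replaces a d_out x d_in weight with a (d_out x r) and an (r x d_in) matrix; the helper names are hypothetical and the 4096 x 4096 size is only illustrative.

```python
def dense_params(d_in, d_out):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Parameters in the two factors (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)

def break_even_rank(d_in, d_out):
    """Largest rank at which the factorised layer is strictly smaller."""
    return (d_in * d_out - 1) // (d_in + d_out)

d_in = d_out = 4096  # illustrative size, not read from any checkpoint
print(dense_params(d_in, d_out))            # 16777216
print(low_rank_params(d_in, d_out, 4096))   # 33554432, twice the dense count
print(break_even_rank(d_in, d_out))         # 2047
```

Under these assumptions, a maximum-rank checkpoint is larger than the original model, and savings only appear once a later rank search pushes a layer's rank below its break-even point.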
Step 2:
To perform surgical rank search on commonsense reasoning datasets, provide the checkpoint path from the previous step as an argument to surgical.py and execute it. This script will conduct continuous evaluation for both disjoint splits (Search split and Test split). A log file will be generated to monitor the progress of the rank search and evaluation metrics. At this stage, you have the flexibility to switch the dataset to any commonsense reasoning dataset, and the performance on it will serve as a metric for the surgical rank search.
python3 surgical.py --layers o_proj,q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj \
--dataset piqa \
--log_path surgical_logs.txt \
--delta 0.0 \
--start_layer 28 \
--base_model decomposed_mistral_combination.pt \
--model mistralai/Mistral-7B-v0.1
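The idea behind a rank search with a tolerance can be sketched as a greedy, layer-by-layer loop. This is not the logic of surgical.py: the layer visiting order, the binary search, and the `evaluate` callback are all assumptions made for illustration, and the sketch further assumes that accuracy does not decrease as a layer's rank increases.

```python
def surgical_rank_search(layer_names, max_rank, evaluate, delta=0.0):
    """Greedily shrink each layer's rank while accuracy stays in tolerance.

    evaluate(ranks) is a hypothetical callback returning search-split
    accuracy for a dict mapping layer name -> rank.
    """
    ranks = {name: max_rank for name in layer_names}
    baseline = evaluate(ranks)
    for name in layer_names:
        lo, hi = 1, max_rank            # `hi` always meets the constraint
        while lo < hi:
            mid = (lo + hi) // 2
            trial = {**ranks, name: mid}
            if evaluate(trial) >= baseline - delta:
                hi = mid                # still acceptable, try a lower rank
            else:
                lo = mid + 1            # too aggressive, back off
        ranks[name] = lo
    return ranks

# Toy evaluator: accuracy drops once a layer falls below a needed rank.
needed = {"l0": 3, "l1": 6}

def toy_evaluate(ranks):
    acc = 0.8
    for name, rank in ranks.items():
        if rank < needed[name]:
            acc -= 0.05
    return acc

print(surgical_rank_search(["l0", "l1"], max_rank=8, evaluate=toy_evaluate))
# {'l0': 3, 'l1': 6}
```

With delta set to 0.0, as in the command above, a loop like this would only accept ranks that leave the search-split metric no worse than the uncompressed baseline.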
For the perplexity-based rank search, run the perplexity_test.py script, providing the path of the checkpoint from Step 1 as an argument. Logs will be created, and evaluation on the commonsense reasoning tasks will be done on the entire test dataset.