Surgical Feature-Space Decomposition of LLMs: Why, When and How?

This repository contains the code for our paper, "Surgical Feature-Space Decomposition of LLMs: Why, When and How?". The paper, by Arnav Chavan, Nahush Lele, and Deepak Gupta, was published at the Association for Computational Linguistics (ACL) 2024.

Overview

This repository contains the code to reproduce our results by following the steps outlined below. The initial decomposition can be executed on a CPU-only machine, while the surgical rank search experiments require a single NVIDIA L4 GPU.

To run the evaluation functions in this repository, you must clone the master branch of lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness/tree/master) and run 'pip install -e .' from inside the cloned repository.

Key Features:

Efficient: A zero-shot compression algorithm that requires no training steps, so large GPU resources are not needed.
Targeted Compression: Supports both task-specific and task-agnostic (perplexity-based) compression.
Compression Control: The Surgical Rank Search process gives finer-grained control over the compressed model's budget.
Bias Reduction: As added benefits, compressed models show reduced stereotype bias and undergo significant unlearning-learning.

Supported Models

LLaMa - HuggingFace

Mistral - HuggingFace

Almost all LLMs composed of repeated Attention Block + MLP Block modules should be supported with minimal to no adjustments to the --layers argument.

Results

Results for Uniform Sparsity

The table below shows the results of our experiments comparing Feature Space Decomposition, Weight Space Decomposition, and LLM-Pruner. The decomposition experiments apply uniform sparsity to a subset of the LLM layers to achieve the desired budget.

Decomposition	#Params (B)	#MACS	BoolQ	PIQA	HellaSwag	WinoGrande	ARC-e	ARC-c	Average
Baseline	6.7	423.93	75.04	78.67	76.22	70.00	72.85	44.88	69.61
Feature Space (Ours)	5.4	339.99	74.34	74.86	66.72	67.40	66.33	39.42	64.68
Weight Space	5.4	339.99	62.20	62.57	43.91	58.80	44.95	30.03	50.41
LLM-Pruner	5.4	339.60	57.06	75.68	66.80	59.83	60.94	36.52	59.47
Feature Space (Ours)	3.4	215.61	62.02	61.37	34.64	56.43	40.32	28.75	47.25
Weight Space	3.4	215.61	62.08	53.59	27.88	48.46	27.15	27.05	41.10
LLM-Pruner	3.4	206.59	52.32	59.63	35.64	53.20	33.50	27.22	43.58
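
The budgets in the table above come from replacing dense weight matrices with pairs of low-rank factors. The following is a minimal sketch of that arithmetic, not code from this repository: the function names are hypothetical, biases are ignored, and "budget" is assumed to mean the fraction of a layer's dense parameters that is kept.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense (d_out x d_in) weight matrix, bias ignored."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in two factors of shapes (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


def break_even_rank(d_in: int, d_out: int) -> int:
    """Largest rank at which the factor pair is no bigger than the dense matrix."""
    return dense_params(d_in, d_out) // (d_in + d_out)


def rank_for_budget(d_in: int, d_out: int, budget: float) -> int:
    """Largest rank whose factor pair uses at most `budget` (0..1) of the dense parameters."""
    return int(budget * dense_params(d_in, d_out)) // (d_in + d_out)


if __name__ == "__main__":
    # Illustrative shape only: a square 4096 x 4096 projection.
    d = 4096
    print("break-even rank:", break_even_rank(d, d))
    print("rank at 50% budget:", rank_for_budget(d, d, 0.5))
```

A decomposed layer only saves parameters below the break-even rank, which is why a checkpoint decomposed at maximum rank is larger than the dense model until ranks are reduced.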

Results for Task Specific Rank Search

Below are the results of a task-specific rank search aimed at maintaining the performance on a 20% evaluation set while reporting the numbers on a disjoint 80% of the evaluation set. There is no specific budget constraint for the rank search; instead, it is conducted to achieve maximum compression while preserving performance. The rank search is performed individually for each dataset.

Model	Dataset	Layers Pruned
		0		35		70		140
		Accuracy	Budget	Accuracy	Budget	Accuracy	Budget	Accuracy	Budget
LLaMA-7B	PIQA	78.23	100%	77.21	96%	76.60	93%	75.78	89%
	BoolQ	75.56	100%	75.03	90%	74.80	84%	73.50	76%
	ARC-C	44.82	100%	43.86	94%	41.73	90%	42.16	86%
	ARC-E	72.21	100%	70.32	93%	69.26	87%	67.68	84%
	Winogrande	70.09	100%	69.50	90%	69.79	80%	62.69	71%
	Hellaswag	75.89	100%	75.60	97%	75.23	95%	74.83	93%
Mistral-7B	PIQA	80.27	100%	80.14	97%	78.84	95%	78.57	90%
	BoolQ	83.79	100%	84.17	99%	83.94	97%	83.98	94%
	ARC-C	54.64	100%	47.07	88%	45.35	85%	43.54	83%
	ARC-E	79.11	100%	78.68	92%	77.32	90%	77.21	88%
	Winogrande	73.35	100%	74.43	96%	73.15	94%	72.26	91%
	Hellaswag	80.85	100%	79.36	96%	79.24	96%	79.00	95%
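
The 20%/80% protocol described above needs two disjoint index sets over the same evaluation data. The sketch below shows one way to build them. It is an assumption-laden illustration: the function name, the seeded shuffle, and the default seed are hypothetical and may differ from how this repository actually splits the data.

```python
import random
from typing import List, Tuple


def disjoint_split(
    num_samples: int, search_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (search_indices, test_indices) that do not overlap.

    The search split drives the rank search; the test split is only used
    for reporting, so no sample may appear in both.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = int(round(search_fraction * num_samples))
    return sorted(indices[:cut]), sorted(indices[cut:])


if __name__ == "__main__":
    search_idx, test_idx = disjoint_split(1000)
    print(len(search_idx), len(test_idx))
```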

Results for Perplexity Based Surgical Rank Search

In the perplexity-based Surgical Rank Search, the goal is to achieve maximum compression while limiting the increase in perplexity to a fixed value, which is updated after compressing each layer. The WikiText-2 test set is divided into two disjoint splits containing 20% and 80% of the samples. The rank search uses the 20% split, and the perplexity numbers reported below are computed on the 80% split. For the commonsense reasoning tasks, the reported scores are based on the full test set.

Model	Datasets	Budget
		100%	94%	87%	83%	79%	75%	70%
LLaMa-7b	PIQA	78.67	76.82	76.39	75.13	73.55	71.71	71.27
	BoolQ	75.04	73.61	72.26	73.21	71.07	66.02	64.92
	ARC-C	44.88	42.92	42.24	41.38	40.01	36.86	35.07
	ARC-E	72.85	71.46	68.56	66.50	64.18	60.48	55.26
	Winogrande	70.00	69.29	69.46	69.37	67.56	62.67	56.35
	Hellaswag	76.22	74.15	71.65	69.09	65.67	60.29	52.62
	Average	69.61	68.04	66.79	65.78	63.68	59.67	55.92
	Wikitext-2 (Perplexity)	12.33	15.07	18.29	22.23	27.20	33.57	40.82

Model	Datasets	Budget
		100%	97%	93%	90%	87%	83%	80%
Mistral-7b	PIQA	80.52	80.01	78.73	78.84	76.99	75.68	75.84
	BoolQ	83.58	81.38	81.56	81.34	78.96	76.54	73.33
	ARC-C	54.01	52.90	49.74	47.61	43.51	38.22	37.20
	ARC-E	79.54	79.38	78.37	77.65	74.83	71.93	70.41
	Winogrande	74.03	74.51	73.56	72.06	70.80	65.43	64.48
	Hellaswag	81.05	79.91	77.65	75.77	71.80	66.13	60.37
	Average	75.46	74.69	73.27	72.21	69.48	65.56	63.60
	Wikitext-2 (Perplexity)	11.60	13.90	15.56	17.91	20.68	23.96	27.49

For detailed plots of model performance versus the number of parameters sparsified by surgical rank search, for all commonsense reasoning tasks, please refer to our paper.

Steps to reproduce results

Installing requirements

pip install -r requirements.txt

Step 1:

Run the decomposer.py script to create a model instance of choice and decompose all its layers into low rank matrices of maximum rank and create a checkpoint. (No GPU required)

Example

python3 decomposer.py --layers o_proj,q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj \
       --dataset combination --batch_size 512 \
       --seq_len 128 \
       --log_path surgical_logs.txt \
       --algo eigen \
       --weights_name decomposed_mistral_combination.pt \
       --model mistralai/Mistral-7B-v0.1
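
To make the idea behind Step 1 concrete, here is a minimal NumPy sketch of decomposing a single linear layer in feature space with an eigendecomposition, next to a plain weight-space SVD for comparison. It is not the code in decomposer.py: the function names are hypothetical, biases are ignored, and the assumption is that an eigen-based feature-space decomposition projects the layer's outputs onto the top eigenvectors of their (uncentered) covariance over calibration inputs.

```python
import numpy as np


def feature_space_decompose(weight: np.ndarray, inputs: np.ndarray, rank: int):
    """Factor `weight` (d_out x d_in) using the layer's output features.

    inputs: calibration activations of shape (n, d_in).
    Returns (B, A) with shapes (d_out x rank) and (rank x d_in), B @ A ~ weight.
    """
    outputs = inputs @ weight.T                 # (n, d_out) features of this layer
    cov = outputs.T @ outputs                   # (d_out, d_out), uncentered covariance
    _, eigvecs = np.linalg.eigh(cov)            # eigh returns eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :rank]            # keep the `rank` leading eigenvectors
    return top, top.T @ weight                  # B = V_k, A = V_k^T W


def weight_space_decompose(weight: np.ndarray, rank: int):
    """Truncated SVD of the weight matrix itself, ignoring the data."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def feature_error(weight, factors, inputs) -> float:
    """Frobenius error between original and approximated output features."""
    b, a = factors
    return float(np.linalg.norm(inputs @ weight.T - inputs @ (b @ a).T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 12))
    # Anisotropic inputs: some input directions matter far more than others.
    x = rng.standard_normal((512, 12)) * np.linspace(5.0, 0.05, 12)
    print("feature space:", feature_error(w, feature_space_decompose(w, x, 4), x))
    print("weight space: ", feature_error(w, weight_space_decompose(w, 4), x))
```

At maximum rank the factor pair reproduces the original weights exactly, which matches the role of the Step 1 checkpoint as a lossless starting point for the later rank search.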

Step 2:

To perform surgical rank search on commonsense reasoning datasets, provide the checkpoint path from the previous step as an argument to surgical.py and execute it. This script will conduct continuous evaluation for both disjoint splits (Search split and Test split). A log file will be generated to monitor the progress of the rank search and evaluation metrics. At this stage, you have the flexibility to switch the dataset to any commonsense reasoning dataset, and the performance on it will serve as a metric for the surgical rank search.

Example

python3 surgical.py --layers o_proj,q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj \
       --dataset piqa \
       --log_path surgical_logs.txt \
       --delta 0.0 \
       --start_layer 28 \
       --base_model decomposed_mistral_combination.pt \
       --model mistralai/Mistral-7B-v0.1
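
The search described in Step 2 can be pictured as a greedy, layer-by-layer loop. The sketch below is a simplified illustration and not the logic of surgical.py: `evaluate`, `max_ranks`, and `step` are hypothetical names, `delta` is assumed to be the tolerated drop in the search-split metric relative to the uncompressed baseline, and the metric is assumed to be "higher is better".

```python
from typing import Callable, Dict, List


def surgical_rank_search(
    layer_names: List[str],
    max_ranks: Dict[str, int],
    evaluate: Callable[[Dict[str, int]], float],
    delta: float = 0.0,
    step: int = 1,
) -> Dict[str, int]:
    """Greedily lower each layer's rank while the metric stays within `delta`.

    evaluate(ranks) must return the search-split score for a model whose
    layers are truncated to the given ranks (hypothetical callback).
    """
    ranks = dict(max_ranks)
    baseline = evaluate(ranks)  # score of the full-rank model
    for name in layer_names:
        best = ranks[name]
        # Try progressively smaller ranks; stop at the first one that hurts too much.
        for candidate in range(max_ranks[name] - step, 0, -step):
            trial = {**ranks, name: candidate}
            if evaluate(trial) >= baseline - delta:
                best = candidate
            else:
                break
        ranks[name] = best  # freeze this layer before moving to the next one
    return ranks


if __name__ == "__main__":
    # Toy metric: accuracy collapses once either layer drops below a threshold.
    def toy_eval(r: Dict[str, int]) -> float:
        return 1.0 if r["a"] >= 3 and r["b"] >= 5 else 0.5

    print(surgical_rank_search(["a", "b"], {"a": 8, "b": 8}, toy_eval))
```

In practice each call to `evaluate` is a full pass over the search split, so a coarser `step` or a binary search over ranks trades precision for fewer evaluations.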

To run rank search based on perplexity:

Run the perplexity_test.py script, passing the path of the checkpoint from Step 1 as an argument. The script writes logs and evaluates the commonsense reasoning tasks on their full test sets.
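
For reference, the acceptance criterion in the perplexity-based search can be sketched as follows. This illustrates one reading of the description in the results section, not the code in perplexity_test.py: the function names and the `max_increase` parameter are hypothetical, and the reference perplexity is assumed to be refreshed after each layer is compressed.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def within_budget(new_ppl: float, reference_ppl: float, max_increase: float) -> bool:
    """Accept a candidate rank only if perplexity rises by at most `max_increase`."""
    return new_ppl <= reference_ppl + max_increase


if __name__ == "__main__":
    reference = 12.33      # full-model WikiText-2 perplexity from the LLaMa-7b table
    max_increase = 1.0     # hypothetical per-layer tolerance
    candidate = 12.90      # hypothetical perplexity after truncating one layer
    if within_budget(candidate, reference, max_increase):
        reference = candidate  # assumed: the reference moves after each layer
    print(reference)
```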
