nyunAI/SFSD-LLM

Surgical Feature-Space Decomposition of LLMs: Why, When and How?

This repository contains the code for our paper, "Surgical Feature-Space Decomposition of LLMs: Why, When and How?", by Arnav Chavan, Nahush Lele, and Deepak Gupta. The paper was published at the Association for Computational Linguistics (ACL), 2024.

Overview

This repository contains the code to reproduce our results by following the steps outlined below. The initial decomposition can be executed on a CPU-only machine, while the surgical rank search experiments require a single NVIDIA L4 GPU.

To run the evaluation functions in this repository, you need to pull the master branch of lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness/tree/master) and run 'pip install -e .' from inside the pulled repository.

Key Features:

  • Efficient: This is a zero-shot compression algorithm that requires no training steps, so there is no need for large GPU resources.
  • Targeted Compression: Allows compression in a task-specific as well as a task-agnostic (perplexity-based) way.
  • Compression Control: The Surgical Rank Search process gives more fine-grained control over the compressed models' budgets.
  • Bias Reduction: Compressed models show reduced stereotype bias and undergo significant unlearning-learning, which are added benefits of this method.

Supported Models

LLaMa - HuggingFace

Mistral - HuggingFace

Almost all LLMs composed of repeated Attention Block + MLP Block modules should be supported with minimal to no adjustment of the --layers argument.

Results

Results for Uniform Sparsity

The table below shows the results of our experiments comparing Feature Space Decomposition, Weight Space Decomposition, and LLM-Pruner. The decomposition experiments apply uniform sparsity to a subset of the LLM layers to achieve the desired budget.

| Decomposition | #Params (B) | #MACs | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | Average |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 6.7 | 423.93 | 75.04 | 78.67 | 76.22 | 70.00 | 72.85 | 44.88 | 69.61 |
| Feature Space (Ours) | 5.4 | 339.99 | 74.34 | 74.86 | 66.72 | 67.40 | 66.33 | 39.42 | 64.68 |
| Weight Space | 5.4 | 339.99 | 62.20 | 62.57 | 43.91 | 58.80 | 44.95 | 30.03 | 50.41 |
| LLM-Pruner | 5.4 | 339.60 | 57.06 | 75.68 | 66.80 | 59.83 | 60.94 | 36.52 | 59.47 |
| Feature Space (Ours) | 3.4 | 215.61 | 62.02 | 61.37 | 34.64 | 56.43 | 40.32 | 28.75 | 47.25 |
| Weight Space | 3.4 | 215.61 | 62.08 | 53.59 | 27.88 | 48.46 | 27.15 | 27.05 | 41.10 |
| LLM-Pruner | 3.4 | 206.59 | 52.32 | 59.63 | 35.64 | 53.20 | 33.50 | 27.22 | 43.58 |
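
To make the distinction between the two decomposition styles concrete, here is a minimal NumPy sketch. It assumes that weight-space decomposition means a truncated SVD of the weight matrix alone, and that feature-space decomposition means projecting the layer outputs onto their top-k eigenvectors (which is what `--algo eigen` suggests). The function names, shapes, and synthetic data are illustrative and are not taken from this repository's code.

```python
import numpy as np


def weight_space_factors(W, k):
    """Rank-k factors of W from a truncated SVD of the weights alone."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k]             # A: (d_out, k), B: (k, d_in)


def feature_space_factors(W, X, k):
    """Rank-k factors of W from the top-k eigenvectors of the output features."""
    Y = X @ W.T                                  # layer outputs, shape (n, d_out)
    second_moment = Y.T @ Y / Y.shape[0]         # (d_out, d_out), uncentered
    _, eigvecs = np.linalg.eigh(second_moment)   # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :k]                  # top-k eigenvectors, (d_out, k)
    return V, V.T @ W                            # A: (d_out, k), B: (k, d_in)


def output_error(W, X, A, B):
    """Frobenius error between the original and the factorized layer outputs."""
    return float(np.linalg.norm(X @ W.T - X @ (A @ B).T))


rng = np.random.default_rng(0)
n, d_in, d_out, k = 512, 32, 24, 8
# Synthetic activations concentrated in a few directions, plus noise (assumption).
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, d_in)) + 0.1 * rng.normal(size=(n, d_in))
W = rng.normal(size=(d_out, d_in))

err_weight = output_error(W, X, *weight_space_factors(W, k))
err_feature = output_error(W, X, *feature_space_factors(W, X, k))
print(f"weight-space output error:  {err_weight:.3f}")
print(f"feature-space output error: {err_feature:.3f}")
```

On the calibration data, the feature-space factors give the best rank-k approximation of the outputs, so their output error cannot exceed that of the weight-space factors at the same rank. Both variants store the same number of parameters, k * (d_in + d_out).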

Results for Task Specific Rank Search

Below are the results of a task-specific rank search. The search aims to maintain performance on a 20% split of the evaluation set, and the numbers are reported on the disjoint remaining 80%. There is no specific budget constraint for the rank search; instead, it is conducted to achieve maximum compression while preserving performance. The rank search is performed individually for each dataset.

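Column groups give accuracy and budget after 0, 35, 70, and 140 layers have been pruned.

| Model | Dataset | Accuracy (0) | Budget (0) | Accuracy (35) | Budget (35) | Accuracy (70) | Budget (70) | Accuracy (140) | Budget (140) |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-7B | PIQA | 78.23 | 100% | 77.21 | 96% | 76.60 | 93% | 75.78 | 89% |
| | BoolQ | 75.56 | 100% | 75.03 | 90% | 74.80 | 84% | 73.50 | 76% |
| | ARC-C | 44.82 | 100% | 43.86 | 94% | 41.73 | 90% | 42.16 | 86% |
| | ARC-E | 72.21 | 100% | 70.32 | 93% | 69.26 | 87% | 67.68 | 84% |
| | Winogrande | 70.09 | 100% | 69.50 | 90% | 69.79 | 80% | 62.69 | 71% |
| | Hellaswag | 75.89 | 100% | 75.60 | 97% | 75.23 | 95% | 74.83 | 93% |
| Mistral-7B | PIQA | 80.27 | 100% | 80.14 | 97% | 78.84 | 95% | 78.57 | 90% |
| | BoolQ | 83.79 | 100% | 84.17 | 99% | 83.94 | 97% | 83.98 | 94% |
| | ARC-C | 54.64 | 100% | 47.07 | 88% | 45.35 | 85% | 43.54 | 83% |
| | ARC-E | 79.11 | 100% | 78.68 | 92% | 77.32 | 90% | 77.21 | 88% |
| | Winogrande | 73.35 | 100% | 74.43 | 96% | 73.15 | 94% | 72.26 | 91% |
| | Hellaswag | 80.85 | 100% | 79.36 | 96% | 79.24 | 96% | 79.00 | 95% |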
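
The search procedure itself is not spelled out in this README, so the following is only one plausible sketch of a per-layer rank search: a binary search for the smallest rank whose score on the search split stays within a tolerance of the baseline. The function name, the `evaluate` callback, and the monotonicity assumption are hypothetical; the tolerance parameter is loosely modeled on the `--delta` flag of surgical.py.

```python
def smallest_rank_within_tolerance(evaluate, max_rank, baseline, delta=0.0):
    """Return the smallest rank whose score is at least baseline - delta.

    Assumes evaluate(rank) is non-decreasing in rank, which is a simplifying
    assumption made for this sketch and not a property claimed by the paper.
    """
    target = baseline - delta
    if evaluate(max_rank) < target:
        return max_rank  # even the maximum rank misses the target: do not compress
    lo, hi = 1, max_rank
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(mid) >= target:
            hi = mid      # mid is good enough, try smaller ranks
        else:
            lo = mid + 1  # mid loses too much accuracy, go larger
    return lo


def toy_accuracy(rank):
    """Toy stand-in for accuracy on the 20% search split (hypothetical)."""
    return 70.0 if rank >= 37 else 60.0


chosen_rank = smallest_rank_within_tolerance(toy_accuracy, max_rank=128, baseline=70.0)
print(chosen_rank)
```

In a real run, `evaluate` would score the model on the search split with the current layer truncated to the candidate rank, and the chosen rank would then be validated on the disjoint test split.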

Results for Perplexity Based Surgical Rank Search

In the Perplexity-based Surgical Rank Search, the goal is to achieve maximum compression while limiting the increase in perplexity to a fixed value, which is updated after compressing each layer. The WikiText-2 test set is divided into two disjoint splits containing 20% and 80% of the samples. The rank search uses the 20% split, and the perplexity numbers reported below are computed on the 80% split. For the commonsense reasoning tasks, the reported scores are based on evaluation over the full test set.

Column headers are budgets.

| Model | Dataset | 100% | 94% | 87% | 83% | 79% | 75% | 70% |
|---|---|---|---|---|---|---|---|---|
| LLaMa-7b | PIQA | 78.67 | 76.82 | 76.39 | 75.13 | 73.55 | 71.71 | 71.27 |
| | BoolQ | 75.04 | 73.61 | 72.26 | 73.21 | 71.07 | 66.02 | 64.92 |
| | ARC-C | 44.88 | 42.92 | 42.24 | 41.38 | 40.01 | 36.86 | 35.07 |
| | ARC-E | 72.85 | 71.46 | 68.56 | 66.50 | 64.18 | 60.48 | 55.26 |
| | Winogrande | 70.00 | 69.29 | 69.46 | 69.37 | 67.56 | 62.67 | 56.35 |
| | Hellaswag | 76.22 | 74.15 | 71.65 | 69.09 | 65.67 | 60.29 | 52.62 |
| | Average | 69.61 | 68.04 | 66.79 | 65.78 | 63.68 | 59.67 | 55.92 |
| | Wikitext-2 (Perplexity) | 12.33 | 15.07 | 18.29 | 22.23 | 27.20 | 33.57 | 40.82 |

| Model | Dataset | 100% | 97% | 93% | 90% | 87% | 83% | 80% |
|---|---|---|---|---|---|---|---|---|
| Mistral-7b | PIQA | 80.52 | 80.01 | 78.73 | 78.84 | 76.99 | 75.68 | 75.84 |
| | BoolQ | 83.58 | 81.38 | 81.56 | 81.34 | 78.96 | 76.54 | 73.33 |
| | ARC-C | 54.01 | 52.90 | 49.74 | 47.61 | 43.51 | 38.22 | 37.20 |
| | ARC-E | 79.54 | 79.38 | 78.37 | 77.65 | 74.83 | 71.93 | 70.41 |
| | Winogrande | 74.03 | 74.51 | 73.56 | 72.06 | 70.80 | 65.43 | 64.48 |
| | Hellaswag | 81.05 | 79.91 | 77.65 | 75.77 | 71.80 | 66.13 | 60.37 |
| | Average | 75.46 | 74.69 | 73.27 | 72.21 | 69.48 | 65.56 | 63.60 |
| | Wikitext-2 (Perplexity) | 11.60 | 13.90 | 15.56 | 17.91 | 20.68 | 23.96 | 27.49 |
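
The two ingredients of this evaluation protocol, a disjoint 20%/80% split and a perplexity score, can be sketched as follows. This is a minimal illustration that assumes perplexity is the exponential of the mean per-token negative log-likelihood (the standard definition). The helper names, the shuffling, and the seed are hypothetical and may differ from what perplexity_test.py does.

```python
import math
import random


def split_indices(num_samples, search_fraction=0.2, seed=0):
    """Shuffle sample indices and cut them into disjoint search and test splits."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    cut = int(num_samples * search_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


search_idx, test_idx = split_indices(100)
print(len(search_idx), len(test_idx))

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_nlls = [math.log(4.0)] * 10
print(round(perplexity(uniform_nlls), 6))
```

Keeping the search split separate from the reporting split means the perplexity numbers in the tables are not measured on the same samples that guided the rank choices.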

For detailed plots of model performance versus the number of parameters sparsified by surgical rank search, for all commonsense reasoning tasks, please refer to our paper.

Steps to reproduce results

Installing requirements

pip install -r requirements.txt 

Step 1:

Run the decomposer.py script to instantiate the model of your choice, decompose all of its layers into low-rank matrices of maximum rank, and save a checkpoint. (No GPU required.)

Example

python3 decomposer.py --layers o_proj,q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj \
       --dataset combination --batch_size 512 \
       --seq_len 128 \
       --log_path surgical_logs.txt \
       --algo eigen \
       --weights_name decomposed_mistral_combination.pt \
       --model mistralai/Mistral-7B-v0.1
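
As a back-of-the-envelope aid for reading the budget columns in the results, the sketch below counts parameters for a dense linear layer versus a two-matrix low-rank factorization. It assumes each decomposed layer is stored as a (d_out x rank) matrix times a (rank x d_in) matrix with no bias. The helper names are hypothetical, and the 4096 x 4096 shape is used only as an illustrative attention projection size.

```python
def dense_params(d_out, d_in):
    """Parameter count of a dense weight matrix."""
    return d_out * d_in


def low_rank_params(d_out, d_in, rank):
    """Parameter count of the factorization (d_out x rank) @ (rank x d_in)."""
    return rank * (d_out + d_in)


def break_even_rank(d_out, d_in):
    """Largest rank at which the factorization is no bigger than the dense layer."""
    return (d_out * d_in) // (d_out + d_in)


d_out, d_in = 4096, 4096  # illustrative projection shape (assumption)
print(break_even_rank(d_out, d_in))
for rank in (4096, 2048, 1024):
    ratio = low_rank_params(d_out, d_in, rank) / dense_params(d_out, d_in)
    print(rank, ratio)
```

At maximum rank the factorized layer holds more parameters than the dense one, so savings only appear once the later rank search truncates a layer below its break-even rank.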

Step 2:

To perform surgical rank search on commonsense reasoning datasets, provide the checkpoint path from the previous step as an argument to surgical.py and execute it. This script will conduct continuous evaluation for both disjoint splits (Search split and Test split). A log file will be generated to monitor the progress of the rank search and evaluation metrics. At this stage, you have the flexibility to switch the dataset to any commonsense reasoning dataset, and the performance on it will serve as a metric for the surgical rank search.

Example

python3 surgical.py --layers o_proj,q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj \
       --dataset piqa \
       --log_path surgical_logs.txt \
       --delta 0.0 \
       --start_layer 28 \
       --base_model decomposed_mistral_combination.pt \
       --model mistralai/Mistral-7B-v0.1

To run rank search based on perplexity:

Run the perplexity_test.py script, providing the path of the checkpoint from Step 1 as an argument. Logs will be created, and evaluation on the commonsense reasoning tasks will be done on the entire test dataset.
