UAI2022: Addressing Token Uniformity in Transformers via Singular Value Transformation
Hanqi Yan, Lin Gui, Wenjie Li, Yulan He.
In this work, we characterise the token uniformity problem (see the figure below), commonly observed in the outputs of transformer-based architectures, by the degree of skewness in the singular value distributions of their representations, and propose a singular value transformation function (SoftDecay) to alleviate it.
Figure: Singular value distributions of the outputs from BERT layers 0, 7 and 12 (top to bottom) on the GLUE-MRPC dataset. The skewness, token uniformity and [CLS] uniformity values increase as the BERT layers go deeper, while the median of the singular values decreases drastically, close to vanishing.
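As an illustration of how such statistics can be obtained, the sketch below computes the singular value spectrum, its sample skewness, and an average pairwise-cosine token uniformity score from one layer's hidden states. This is a minimal sketch only; the exact metric definitions used in the paper may differ, and `hidden_states` is a hypothetical input tensor.

```python
import torch

def spectrum_stats(hidden_states):
    """Hedged sketch: singular-value statistics for one layer's output.

    hidden_states: (seq_len, dim) token representations for one example.
    Returns the singular values, their sample skewness, and an average
    pairwise-cosine token-uniformity score (an assumed definition).
    """
    s = torch.linalg.svdvals(hidden_states)           # singular values, descending
    mu, sigma = s.mean(), s.std()
    skewness = (((s - mu) / sigma) ** 3).mean()       # sample skewness of the spectrum
    x = torch.nn.functional.normalize(hidden_states, dim=-1)
    cos = x @ x.T                                     # pairwise cosine similarities
    n = cos.shape[0]
    token_uniformity = (cos.sum() - n) / (n * (n - 1))  # mean of off-diagonal entries
    return s, skewness.item(), token_uniformity.item()
```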
Our project is based on Huggingface Transformers, so you can refer to the requirements listed in their repository.
- Main Functions
Our method is evaluated on text classification tasks (the GLUE datasets), so the main functions are defined in run_glue_no_trainer.py. Note that if you have already installed the official Transformers package, you may need to point Python to our updated version:
```python
import sys

# Use our updated SoftDecay Transformers copy in /tokenUni/src/
sys.path.insert(0, "/YourDir/tokenUni/src/")
```
- Pretrained Language Models + SoftDecay
Our SoftDecay Transformers code is included in this repo, covering ALBERT, BERT, DistilBERT and RoBERTa. Taking BERT as an example, we modify the configuration file configuration_bert.py and modeling_bert.py to insert the SoftDecay function; the SoftDecay function itself is defined in soft_decay.py.
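For intuition, here is a minimal sketch of the overall recipe: take the SVD of a layer's output, apply a concave rescaling to the singular values, and reconstruct. The power transform below is a generic stand-in for illustration only, not the SoftDecay formula itself (see soft_decay.py for the actual definition), and the `alpha` value is an assumption.

```python
import torch

def spectrum_transform(hidden_states, alpha=0.5):
    """Hedged sketch of the SVD-based pipeline, not the exact SoftDecay formula.

    hidden_states: (batch, seq_len, dim) layer output.
    alpha in (0, 1): a generic concave power transform that lifts small
    singular values relative to the largest one, reducing skewness.
    """
    u, s, vh = torch.linalg.svd(hidden_states, full_matrices=False)
    s_max = s.amax(dim=-1, keepdim=True)
    s_new = s_max * (s / s_max) ** alpha   # flatten the spectrum, preserve the scale
    return u @ torch.diag_embed(s_new) @ vh
```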
- Visualization
The script visualization.py is used to visualise the singular value distributions, i.e., the CDFs and histograms, and to compute the representation metrics.
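A minimal plotting sketch along these lines is shown below (matplotlib-based; the output file name and layout are assumptions, and visualization.py remains the reference implementation):

```python
import matplotlib.pyplot as plt
import torch

def plot_spectrum(hidden_states, out_path="spectrum.png"):
    """Hedged sketch: histogram and empirical CDF of the singular values."""
    s = torch.linalg.svdvals(hidden_states).flatten().sort().values
    fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(s.numpy(), bins=50)
    ax_hist.set(title="Histogram", xlabel="singular value")
    cdf = torch.arange(1, len(s) + 1) / len(s)    # empirical CDF
    ax_cdf.plot(s.numpy(), cdf.numpy())
    ax_cdf.set(title="CDF", xlabel="singular value")
    fig.tight_layout()
    fig.savefig(out_path)
```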
Since SoftDecay is applied directly to the output representations, we can fix the function parameter and evaluate without any fine-tuning:
```bash
# specify the pretrained language model / pooling method / layer index / post-process (soft_decay or whitening)
python evaluation_stsbenchmark.py --pooling aver --encoder_name bert-base-cased --last2avg --post_process soft_decay
```
By switching between different pretrained language models, you are expected to reproduce the results below:
If you find our work useful, please cite as:
```
@inproceedings{yan2022addressing,
  title={Addressing Token Uniformity in Transformers via Singular Value Transformation},
  author={Hanqi Yan and Lin Gui and Wenjie Li and Yulan He},
  booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
  year={2022},
  url={https://openreview.net/forum?id=BtUxE_8i5l5}
}
```