UAI2022: Addressing Token Uniformity in Transformers via Singular Value Transformation
Hanqi Yan, Lin Gui, Wenjie Li, Yulan He.
In this work, we characterise the token uniformity problem (see the figure below), commonly observed in the outputs of transformer-based architectures, by the degree of skewness in the singular value distributions of their representations, and propose a singular value transformation function (SoftDecay) to alleviate it.
Figure: Singular value distributions of the outputs from BERT layers 0, 7 and 12 (top to bottom) on the GLUE-MRPC dataset. The skewness, token uniformity and [CLS] uniformity values increase as the BERT layers go deeper, while the median of the singular values decreases drastically, close to vanishing.
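As an illustration of how such statistics can be obtained, the sketch below computes the singular value spectrum, its sample skewness, and an average pairwise-cosine token uniformity score from one layer's hidden states. This is a minimal sketch only; the exact metric definitions used in the paper may differ, and `hidden_states` is a hypothetical input tensor.

```python
import torch

def spectrum_stats(hidden_states):
    """Hedged sketch: singular-value statistics for one layer's output.

    hidden_states: (seq_len, dim) token representations for one example.
    Returns the singular values, their sample skewness, and an average
    pairwise-cosine token-uniformity score (an assumed definition).
    """
    s = torch.linalg.svdvals(hidden_states)           # singular values, descending
    mu, sigma = s.mean(), s.std()
    skewness = (((s - mu) / sigma) ** 3).mean()       # sample skewness of the spectrum
    x = torch.nn.functional.normalize(hidden_states, dim=-1)
    cos = x @ x.T                                     # pairwise cosine similarities
    n = cos.shape[0]
    token_uniformity = (cos.sum() - n) / (n * (n - 1))  # mean of off-diagonal entries
    return s, skewness.item(), token_uniformity.item()
```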
Our project is based on Huggingface Transformers, so you can refer to the requirements listed in their repository.
- Main Functions
Our method is evaluated on text classification tasks (the GLUE datasets), so the main functions are defined in run_glue_no_trainer.py. Note that if you have already installed the official Transformers package, you may need to point Python to our updated version:
```python
import sys

# Use our updated SoftDecay Transformers copy in /tokenUni/src/
sys.path.insert(0, "/YourDir/tokenUni/src/")
```
- Pretrained Language Models + SoftDecay
Our SoftDecay Transformers code is included in this repo, covering ALBERT, BERT, DistilBERT and RoBERTa. Taking BERT as an example, we modify the configuration file configuration_bert.py and modeling_bert.py to insert the SoftDecay function; the SoftDecay function itself is defined in soft_decay.py.
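For intuition, here is a minimal sketch of the overall recipe: take the SVD of a layer's output, apply a concave rescaling to the singular values, and reconstruct. The power transform below is a generic stand-in for illustration only, not the SoftDecay formula itself (see soft_decay.py for the actual definition), and the `alpha` value is an assumption.

```python
import torch

def spectrum_transform(hidden_states, alpha=0.5):
    """Hedged sketch of the SVD-based pipeline, not the exact SoftDecay formula.

    hidden_states: (batch, seq_len, dim) layer output.
    alpha in (0, 1): a generic concave power transform that lifts small
    singular values relative to the largest one, reducing skewness.
    """
    u, s, vh = torch.linalg.svd(hidden_states, full_matrices=False)
    s_max = s.amax(dim=-1, keepdim=True)
    s_new = s_max * (s / s_max) ** alpha   # flatten the spectrum, preserve the scale
    return u @ torch.diag_embed(s_new) @ vh
```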
- Visualization
The script visualization.py is used to visualise the singular value distributions, i.e., the CDFs and histograms, and to compute the representation metrics.
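A minimal plotting sketch along these lines is shown below (matplotlib-based; the output file name and layout are assumptions, and visualization.py remains the reference implementation):

```python
import matplotlib.pyplot as plt
import torch

def plot_spectrum(hidden_states, out_path="spectrum.png"):
    """Hedged sketch: histogram and empirical CDF of the singular values."""
    s = torch.linalg.svdvals(hidden_states).flatten().sort().values
    fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(s.numpy(), bins=50)
    ax_hist.set(title="Histogram", xlabel="singular value")
    cdf = torch.arange(1, len(s) + 1) / len(s)    # empirical CDF
    ax_cdf.plot(s.numpy(), cdf.numpy())
    ax_cdf.set(title="CDF", xlabel="singular value")
    fig.tight_layout()
    fig.savefig(out_path)
```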
Since SoftDecay is applied directly to the output representations, we can fix the function parameter and evaluate without any fine-tuning:
```bash
# specify the pretrained language model / pooling method / layer index / post-process (soft_decay or whitening)
python evaluation_stsbenchmark.py --pooling aver --encoder_name bert-base-cased --last2avg --post_process soft_decay
```
By switching between different pretrained language models, you are expected to reproduce the results below:
If you find our work useful, please cite as:
```
@inproceedings{yan2022addressing,
  title={Addressing Token Uniformity in Transformers via Singular Value Transformation},
  author={Hanqi Yan and Lin Gui and Wenjie Li and Yulan He},
  booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
  year={2022},
  url={https://openreview.net/forum?id=BtUxE_8i5l5}
}
```