AuxFormer: Robust Approach to Audiovisual Emotion Recognition
A challenging task in audiovisual emotion recognition is to design neural network architectures that can leverage and fuse multimodal information while temporally aligning modalities, handling missing modalities, and preserving the information from each modality during training. These requirements are important to achieve model robustness and to increase accuracy on the emotion recognition task. A recent approach to multimodal fusion is to use the transformer architecture to fuse and align the modalities. This study proposes the AuxFormer framework, which addresses the aforementioned challenges in a principled way. AuxFormer combines the transformer framework with auxiliary networks, using shared losses to infuse information from the separately embedded single-modality networks into the main network. This extra layer of audiovisual information retains details that would otherwise be lost during training.
Lucas Goncalves and Carlos Busso, "AuxFormer: Robust approach to audiovisual emotion recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), Singapore, May 2022.
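As a quick illustration of the shared-loss idea, the minimal sketch below combines the main audiovisual loss with the two auxiliary single-modality losses. It is not the repository's code; the weights and names are placeholders chosen for illustration.

```python
# Minimal sketch of the shared-loss idea (illustrative only; weights and names
# are placeholders, not the values used in the paper or the repository).
import torch.nn.functional as F

def auxformer_loss(logits_av, logits_a, logits_v, labels,
                   w_av=1.0, w_a=0.5, w_v=0.5):
    """Weighted sum of the main audiovisual loss and the two auxiliary losses.

    Because all three terms share the same labels and are backpropagated
    jointly, the single-modality branches keep modality-specific information
    alive in the network during training.
    """
    loss_av = F.cross_entropy(logits_av, labels)  # main fusion network fav(.)
    loss_a = F.cross_entropy(logits_a, labels)    # auxiliary acoustic network fa(.)
    loss_v = F.cross_entropy(logits_v, labels)    # auxiliary visual network fv(.)
    return w_av * loss_av + w_a * loss_a + w_v * loss_v
```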
Updated code for easier implementation of the model and feature extraction. We are releasing the corpus partitions used in the experimental evaluations, along with pre-extracted features. This update includes the wav2vec2 and EMOCA feature versions; the openSMILE and VGG-Face features will be added in an upcoming release.
- Python 3.9.7
- Pytorch 1.12.0
- To create the conda environment from the requirements file, use:
conda create --name AuxFormer_env --file requirements.txt
- Note: after creating the environment, you also need:
pip install transformers
- Activate the environment with:
conda activate AuxFormer_env
- Access to features/partitions here
Note: Five pre-set partitions are provided, using three different labelling methods: P (plurality rule), M (majority rule), and D (distributional).
The video features provided in this drive are EMOCA features. The audio files can be obtained by downloading the CREMA-D dataset directly from its source and processing the files with
AuxFormer/pre-processing/crema_auds.py
. The VGG-Face features will be uploaded later.
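For reference, a minimal stand-in for that preprocessing step might look like the sketch below (the source path and the 16 kHz mono target are assumptions); the actual logic is in AuxFormer/pre-processing/crema_auds.py.

```python
# Illustrative stand-in for the audio preparation step; the actual logic lives
# in AuxFormer/pre-processing/crema_auds.py. Source path and the 16 kHz mono
# target are assumptions.
import os
import torchaudio

SRC_DIR = "CREMA-D/AudioWAV"       # raw CREMA-D audio (placeholder path)
DST_DIR = "AuxFormer/data/Audios"  # where AuxFormer expects the audio files
TARGET_SR = 16000

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.endswith(".wav"):
        continue
    wav, sr = torchaudio.load(os.path.join(SRC_DIR, name))
    wav = wav.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    torchaudio.save(os.path.join(DST_DIR, name), wav, TARGET_SR)
```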
For more details, see the following files for script arguments and descriptions:
AuxFormer/run_model.sh
- main script for setting inputs and running/inferring the model
AuxFormer/utils
- utils folder containing loss manager, data_manager, feature extractor, normalizer, etc.
AuxFormer/config
- config files with information about datasets
AuxFormer/net
- model wrapper and AuxFormer framework
AuxFormer/train.py
- training script
AuxFormer/test.py
- inference script
AuxFormer/modules/
- folder containing the transformer framework and position_embedding configurations
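As a reference for the position_embedding module, the standard fixed sinusoidal encoding is sketched below; the exact variant configured in AuxFormer/modules/ may differ.

```python
# Standard fixed sinusoidal position embedding, shown for reference; the exact
# variant configured in AuxFormer/modules/position_embedding may differ.
import math
import torch

def sinusoidal_position_embedding(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) matrix of sine/cosine position encodings."""
    assert dim % 2 == 0, "use an even embedding dimension for this sketch"
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                   # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even indices: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd indices: cosine
    return pe
```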
- Download the dataset and the data partition specifications. Place the data and partitions inside a folder called data, e.g.:
AuxFormer/data
The folder should contain the following (CREMA-D example; a quick layout check is sketched below):
AuxFormer/data/Audios
AuxFormer/data/Videos
AuxFormer/data/labels_consensus_6class_X
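A quick, illustrative way to confirm the layout before training:

```python
# Illustrative sanity check that the expected data layout is in place before
# training (folder names follow the CREMA-D example above).
import os

root = "AuxFormer/data"
for sub in ("Audios", "Videos"):
    path = os.path.join(root, sub)
    print(f"{path}: {'found' if os.path.isdir(path) else 'MISSING'}")

# The partition/label files follow the labels_consensus_6class_* naming.
labels = [f for f in os.listdir(root) if f.startswith("labels_consensus")]
print("label files:", labels if labels else "MISSING")
```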
- If using wav2vec2 features, download the wav2vec2 model for audio feature extraction (an extraction sketch is shown below) and place its folder in:
AuxFormer/wav2vec2-large-robust-finetunned
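A minimal, illustrative extraction sketch using the Hugging Face transformers package (the reason pip install transformers is required) is shown below; the example file name and the use of the last hidden state as the acoustic feature are assumptions, and the feature extractor in AuxFormer/utils may pool or post-process differently.

```python
# Illustrative wav2vec2 feature extraction with Hugging Face transformers.
# The example file name is a placeholder, and using the last hidden state as
# the feature is an assumption; the extractor in AuxFormer/utils may differ.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_DIR = "AuxFormer/wav2vec2-large-robust-finetunned"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_DIR)
model = Wav2Vec2Model.from_pretrained(MODEL_DIR).eval()

wav, sr = torchaudio.load("AuxFormer/data/Audios/1001_DFA_ANG_XX.wav")  # placeholder file
wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)        # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state  # (1, frames, hidden_size)
print(features.shape)
```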
- Execute run_model.sh:
`conda activate AuxFormer_env`
`bash run_model.sh`
Figure: The AuxFormer framework consists of the main audiovisual fusion network (middle), labelled fav(•), the auxiliary acoustic network (top), labelled fa(•), and the auxiliary visual network (bottom), labelled fv(•).
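As a schematic of how these three networks fit together, the sketch below wires two toy single-modality encoders and a fusion head that produce the three sets of logits consumed by the shared loss sketched earlier. It uses plain concatenation fusion as a simplified stand-in for the cross-modal transformer fusion implemented in AuxFormer/net.

```python
# Schematic wiring only (simplified stand-in, not the implementation in
# AuxFormer/net): the three networks in the figure consume the acoustic and
# visual feature sequences and produce three sets of logits for the shared loss.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Placeholder single-modality encoder (stands in for fa(.) or fv(.))."""
    def __init__(self, dim_in, dim_model=128, n_classes=6):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_model)
        layer = nn.TransformerEncoderLayer(dim_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim_model, n_classes)

    def forward(self, x):                       # x: (batch, time, dim_in)
        h = self.encoder(self.proj(x))          # (batch, time, dim_model)
        return self.head(h.mean(dim=1)), h      # logits and frame-level states

class TinyAuxFormer(nn.Module):
    """Placeholder fusion network fav(.): concatenates the pooled states of the
    two auxiliary encoders (the real model uses cross-modal transformer fusion)."""
    def __init__(self, dim_a, dim_v, dim_model=128, n_classes=6):
        super().__init__()
        self.f_a = TinyEncoder(dim_a, dim_model, n_classes)
        self.f_v = TinyEncoder(dim_v, dim_model, n_classes)
        self.fusion_head = nn.Linear(2 * dim_model, n_classes)

    def forward(self, x_a, x_v):
        logits_a, h_a = self.f_a(x_a)
        logits_v, h_v = self.f_v(x_v)
        fused = torch.cat([h_a.mean(dim=1), h_v.mean(dim=1)], dim=-1)
        return self.fusion_head(fused), logits_a, logits_v
```

The three outputs of TinyAuxFormer can be passed, together with the labels, to the auxformer_loss sketch above.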
Lucas Goncalves and Carlos Busso, "Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features," IEEE Transactions on Affective Computing, vol. early access, 2022.
Lucas Goncalves and Carlos Busso, "AuxFormer: Robust approach to audiovisual emotion recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), Singapore, May 2022.