In Multispeech, we consider speech as a multimodal signal with different facets: acoustic, facial, articulatory, gestural, etc. Historically, speech was mainly considered under its acoustic facet, which is still the most important one. However, the acoustic signal is a consequence of the temporal evolution of the shape of the vocal tract (pharynx, tongue, jaws, lips, etc.), which is the articulatory facet of speech. The shape of the vocal tract is partly visible on the face, which is the main visual facet of speech. The face can provide additional information on the speaker’s state through facial expressions. Speech can be accompanied by gestures (head nodding, arm and hand movements, etc.), which help to clarify the linguistic message. In some cases, such as in sign language, these gestures can bear the main linguistic content and be the only means of communication.
The general objective of Multispeech is to study the analysis and synthesis of the different facets of this multimodal signal and their multimodal coordination in the context of human-human or human-computer interaction. While this multimodal signal carries all of the information used in spoken communication, the collection, processing, and extraction of meaningful information by a machine system remains a challenge. In particular, to operate in real-world conditions, such a system must be robust to noisy or missing facets. We are especially interested in designing models and learning techniques that rely on limited amounts of labeled data and that preserve privacy.
Therefore, Multispeech addresses data-efficient, privacy-preserving learning methods, and the robust extraction of various streams of information from speech signals. These two axes will allow us to address multimodality, i.e., the analysis and the generation of multimodal speech and its consideration in an interactional context.
The outcomes will crystallize into a unified software platform for the development of embodied voice assistants. Our main objective is that the results of our research feed this platform, and that the platform itself facilitates our research and that of other researchers in the general domain of human-computer interaction, as well as the development of concrete applications that help humans to interact with one another or with machines. We will focus on two main application areas: language learning and health assistance.
A central aspect of our research is to design machine learning models and methods for multimodal speech data, whether acoustic, visual or gestural. By contrast with big tech companies, we focus on scenarios where the amount of speech data is limited and/or access to the raw data is infeasible due to privacy requirements, and little or no human labels are available.
State-of-the-art methods for speech and audio processing are based on discriminative neural networks trained for the targeted task. This paradigm faces major limitations: lack of interpretability, large data requirements and inability to generalize to unseen classes or tasks. Our approach is to combine the representation power of deep learning with our acoustic expertise to obtain smaller generative models describing the probability distribution of speech and audio signals. Particular attention will be paid to designing physically-motivated input layers, output layers, and unsupervised representations that capture complex-valued, multi-scale spectro-temporal dependencies. Given these models, we derive computationally efficient inference algorithms that address the above limitations. We also explore the integration of deep learning with symbolic reasoning and common-sense knowledge to increase the generalization ability of deep models.
While supervised learning from fully labeled data is economically costly, unlabeled data are inexpensive but provide intrinsically less information. Our goal is to learn representations that disentangle the attributes of speech by equipping the unsupervised representation learning methods above with supervised branches exploiting the available labels and supervisory signals, and with multiple adversarial branches overcoming the usual limitations of adversarial learning.
To preserve privacy, speech must be transformed to hide the users’ identity and other privacy-sensitive attributes (e.g., accent, health status) while leaving intact those attributes which are required for the task (e.g., phonetic content for automatic speech recognition) and preserving the data variability for training purposes. We develop strong attacks to evaluate privacy. We also seek to hide personal identifiers and privacy-sensitive attributes in the linguistic content, focusing on their robust extraction and replacement from speech signals.
In this axis, we focus on extracting meaningful information from speech signals in real conditions. This information can be related (1) to the linguistic content, (2) to the speaker, and (3) to the speech environment.
Speech recognition is the main means to extract linguistic information from speech. Although it is a mature research area, performance drops in real-world environments motivate the continued development of speech enhancement and source separation methods that effectively improve robustness in such scenarios. Semantic content analysis is required to interpret the spoken message. The challenges include learning from little real data, quickly adapting to new topics, and robustness to speech recognition errors. The detection and classification of hate speech in social media videos will also be considered as a benchmark, thereby extending the work on text-only detection. Finally, we also consider extracting phonetic and prosodic information to study the categorization of speech sounds and certain aspects of prosody by learners of a foreign language.
Speaker identity is required for the personalization of human-computer interaction. Speaker recognition and diarization are still challenging in real-world conditions. The speaker states that we aim to recognize include emotion and stress, which can be used to adapt the interaction in real time.
We develop audio event detection methods that exploit both strongly/weakly labeled and unlabeled data, operate in real-world conditions, can discover new events, and provide a semantic interpretation. Modeling the temporal, spatial and logical structure of ambient sound scenes over a long duration is also considered.
In our project, we consider speech as a multimodal object, where we study (1) multimodality modeling and analysis, focusing on multimodal fusion and coordination, (2) the generation of multimodal speech by taking into account its different facets (acoustic, articulatory, visual, gestural), separately or combined, and (3) interaction, in the context of human-human or human-computer interaction.
The study of multimodality concerns the interaction between modalities, their fusion, coordination and synchronization for a single speaker, as well as their synchronization across the speakers in a conversation. We focus on audiovisual speech enhancement to improve the intelligibility and quality of noisy speech by considering the speaker’s lip movements. We also consider semi-, weakly- and self-supervised learning methods for multimodal data to obtain interpretable representations that disentangle, in each modality, the attributes related to linguistic and semantic content, emotion, reaction, etc. We also study the contribution of each modality to the intelligibility of spoken communication.
Multimodal speech generation refers to articulatory, acoustic, and audiovisual speech synthesis techniques which output one or more facets. Articulatory speech synthesis relies on 2D and 3D modeling of the dynamics of the vocal tract from real-time MRI (rtMRI) data. We consider the generation of the full vocal tract, from the vocal folds to the lips, first in 2D then in 3D. This comprises the generation of the face and the prediction of the glottis opening. We also consider audiovisual speech synthesis. Both the animation of the lower part of the face related to speech and of the upper part related to facial expressions are considered, and development continues towards a multilingual talking head. We further investigate the modeling of expressivity for both audio-only and audiovisual speech synthesis, aiming for better control of expressivity by considering several disentangled attributes at the same time.
Interaction is a new field of research for our project-team that we will approach gradually. We start by studying the multimodal components (prosody, facial expressions, gestures) used during interaction, both by the speaker and by the listener, where the goal is to simultaneously generate speech and gestures for the speaker, and to generate regulatory gestures for the listener. We will progressively introduce different dialog bricks: spoken language understanding, dialog management, and natural language generation. Dialog will be considered in a multimodal context (gestures, emotional states of the interlocutor, etc.) and we will break the classical dialog management scheme to dynamically account for the interlocutor’s evolution during the speaker’s response.
This research program aims to develop a unified software platform for embodied voice assistants, fueled by our research outcomes. The platform will not only aid our research but also facilitate other researchers in the field of human-computer interaction. It will also help in creating practical applications for human interactions, with a primary focus on language learning and health assistance.
The approaches and models developed in Multispeech will have several applications to help humans interact with one another or with machines. Each application will typically rely on an embodied voice assistant developed via our generic software platform or on individual components, as presented above. We will put special effort into two application domains: language learning and health assistance. We chose these domains mainly because of their economic and social impact. Moreover, many outcomes of our research will be naturally applicable in these two domains, which will help us showcase their relevance.
Learning a second language, or acquiring the native language for people suffering from language disorders, is a challenge for the learner and represents a significant cognitive load. Many scientific activities have therefore been devoted to these issues, both from the point of view of production and of perception. We aim to show the learner (of the native or a second language) how to articulate the sounds of the target language by illustrating articulation with a talking head augmented with the vocal tract, which makes it possible to animate the speech articulators. Moreover, based on the analysis of the learner’s production, an automatic diagnosis can be envisaged. However, reliable diagnosis remains a challenge, which depends on the accuracy of speech recognition and prosodic analysis techniques. This is still an open question.
Speech technology can facilitate access to healthcare for all patients and provides an unprecedented opportunity to transform the healthcare industry. This includes speech disorders and hearing impairments. For instance, it is possible to use automatic techniques to diagnose disfluencies from an acoustic or an audiovisual signal, as in the case of stuttering. Speech enhancement and separation can enhance speech intelligibility for hearing aid wearers in complex acoustic environments, while articulatory feedback tools can be beneficial for articulatory rehabilitation of cochlear implant wearers. More generally, voice assistants are a valuable tool for senior or disabled people, especially for those who are unable to use other interfaces due to lack of hand dexterity, mobility, and/or good vision. Speech technologies can also facilitate communication between hospital staff and patients, and help emergency call operators triage the callers by quantifying their stress level and getting the maximum amount of information automatically thanks to a robust speech recognition system adapted to these extreme conditions.
The Défi Inria COLaF co-led by S. Ouni aims to increase the inclusiveness of speech technologies by releasing open data, models and software for accented French and for regional, overseas and non-territorial languages of France.
A. Deleforge co-chaired the Commission pour l'Action et la Responsabilité Ecologique (CARE), formerly called the Commission Locale de Développement Durable, a joint entity between Loria and Inria Nancy. Its goals are to raise awareness, guide policies and take action at the lab level and to coordinate with other national and local initiatives and entities on the subject of the environmental impact of science, particularly in information technologies.
M.-A. Lacroix worked on the compression of large Wav2vec 2.0 audio models for embedded devices. Her work was applied to bird monitoring.
T. Biasutto-Lervat also paid special attention to the memory and computational footprint of speech recognition and synthesis models in the context of the development of the team's software platform for embodied voice assistants.
R. Serizel et al. 43 performed an extensive study of the energy consumed to train a sound event detection model for different GPU types and batch sizes. The goal was to identify which aspects can have an impact on the estimation of energy consumption and should be normalized for a fair comparison across systems. Additionally, they proposed an analysis of the relationship between energy consumption and sound event detection performance that calls into question the current way systems are evaluated. Following this study, C. Douwes et al. 46 proposed a tutorial on how to effectively measure the energy consumption of machine listening systems at the DCASE workshop.
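To illustrate how such per-configuration measurements can be instrumented in practice, here is a minimal sketch that wraps a training run with the codecarbon tracker; the model, data loader and train_one_epoch callable are hypothetical placeholders, and the actual measurement protocol of the study may differ.

```python
# Minimal sketch: measuring the energy/carbon cost of one training configuration.
# train_one_epoch, model and loader are hypothetical placeholders; only the
# codecarbon usage reflects a real, publicly available API.
from codecarbon import EmissionsTracker

def measure_training_energy(train_one_epoch, model, loader, epochs, batch_size, device="cuda:0"):
    tracker = EmissionsTracker(project_name=f"sed_bs{batch_size}_{device}")
    tracker.start()
    try:
        for _ in range(epochs):
            train_one_epoch(model, loader, device)  # placeholder training step
    finally:
        emissions_kg = tracker.stop()  # estimated kg CO2-eq; energy is also logged
    return emissions_kg
```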
E. Vincent received the 2023 IEEE SPS Sustained Impact Paper Award for the paper 56.
The startup Nijta co-founded by E. Vincent was awarded the i-Lab Prize of the national innovation challenge organized by the French Ministry of Higher Education, Research and Innovation in partnership with Bpifrance.
L. Abel is the recipient of the national Pépite award (for the startup Dynalips co-founded by S. Ouni, L. Abel and T. Biasutto-Lervat).
The implemented method is inspired by the speaker anonymization method proposed in [Fan+19], which performs voice conversion based on x-vectors [Sny+18], fixed-length representations of speech signals that form the basis of state-of-the-art speaker verification systems. We have brought several improvements to this method, such as pitch transformation and new design choices for x-vector selection.
[Fan+19] F. Fang, X. Wang, J. Yamagishi, I. Echizen, M. Todisco, N. Evans, and J.-F. Bonastre. “Speaker Anonymization Using x-vector and Neural Waveform Models”. In: Proceedings of the 10th ISCA Speech Synthesis Workshop. 2019, pp. 155–160.
[Sny+18] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur. “X-vectors: Robust DNN Embeddings for Speaker Recognition”. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 5329–5333.
Voice assistants and voice interfaces have become a key technology, simplifying the user experience and increasing the accessibility of many applications, and their use will intensify in the coming years. However, this technology poses two major problems today: on the one hand, the quasi-hegemony of large technology companies (mainly American) raises questions about European digital sovereignty, and on the other hand, the commonly used client-server architecture raises privacy risks. To simultaneously address these two problems, we are currently developing an open-source platform for the creation of embedded virtual assistants.
This platform will provide the main speech processing and natural language processing bricks that are necessary to build a voice interface, such as denoising, recognition or speech synthesis. The generated assistant can be fully embedded in the user's terminal. Since the data are processed locally, the users' privacy is protected. We envisage a multiplatform solution on PC (Windows, Linux, MacOS) as well as on mobile (Android, iOS).
During the second year of development, we mainly focused on the final design of the middleware API, relying on Protobuf and ZeroMQ, and on the implementation of the cross-platform Python framework.
Moreover, we also added several key components to the Python library, such as speech processing components (microphone and speaker access, voice activity detection with webrtcvad, speech enhancement with ConvTasNet, speech recognition with ZipFormer and ConformerTransducer, speech synthesis with BalacoonTTS, FastPitch and HifiGAN), video processing components (face bounding-box and landmark detection with dlib, body bounding-box and pose estimation with opencv2) and text processing components (text completion and chat with llama-cpp).
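As a rough illustration of how a component can plug into such a message-passing middleware, the sketch below connects a webrtcvad-based voice activity detector to ZeroMQ publish/subscribe sockets; the endpoints, topic names and raw-PCM framing are hypothetical stand-ins for the platform's actual Protobuf messages.

```python
# Minimal sketch of a VAD component on a ZeroMQ pub/sub bus.
# Topic names and message framing are hypothetical; the real platform
# exchanges Protobuf messages defined by its middleware API.
import zmq
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def run_vad_component(audio_endpoint="tcp://127.0.0.1:5555",
                      vad_endpoint="tcp://127.0.0.1:5556"):
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)              # receives raw audio frames as [topic, payload]
    sub.connect(audio_endpoint)
    sub.setsockopt(zmq.SUBSCRIBE, b"audio")
    pub = ctx.socket(zmq.PUB)              # publishes speech/non-speech decisions
    pub.bind(vad_endpoint)

    vad = webrtcvad.Vad(2)                 # aggressiveness from 0 (lenient) to 3 (strict)
    while True:
        _, frame = sub.recv_multipart()
        if len(frame) != FRAME_BYTES:
            continue                       # webrtcvad only accepts 10/20/30 ms frames
        is_speech = vad.is_speech(frame, SAMPLE_RATE)
        pub.send_multipart([b"vad", b"1" if is_speech else b"0"])
```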
The study of articulatory gestures has a wide range of applications, particularly in the study of speech production and automatic speech recognition. Real-time MRI data is particularly interesting because it offers good temporal resolution and complete coverage of the medio-sagittal section of the vocal tract. Existing MRI databases focus mainly on English, although the articulation of sounds depends on the language concerned. We therefore acquired MRI data for 10 native French speakers with no speech production or perception problems. A corpus consisting of sentences was used to ensure good phonetic coverage of French. Real-time MRI technology with a temporal resolution of 20 ms was used to acquire images of the vocal tract of the participants during speech production. The sound was recorded simultaneously, denoised and temporally aligned with the images. The speech was transcribed to obtain the phonetic segmentation. We also acquired static 3D MRI images of the French phonemes. In addition, we included annotations on spontaneous swallowing. The data are available, together with a presentation of the database, at https://www.nature.com/articles/s41597-021-01041-3.
Transformer-based language models embed linguistic and commonsense knowledge and reasoning capabilities which are not always correct. We believe that integrating symbolic knowledge and reasoning into these models is a necessary step towards making them more trustworthy. Georgios Zervakis successfully defended his PhD on this topic 52. In parallel, we tackled the issue of linguistic ambiguities arising from changes in entities in videos. Focusing on instructional cooking videos as a challenging use case, we released Find2Find, a joint anaphora resolution and object localization dataset consisting of 500 anaphora-annotated recipes with corresponding videos, and we presented experimental results of a novel end-to-end joint multitask learning framework for these two tasks 39.
Optimization approaches for signal processing are interesting since they require little or no training data, and generally exhibit more robustness to acoustic conditions and other variability factors than deep learning systems. Besides, in such a framework, light neural networks can be used as a proxy to obtain appropriate prior information and/or initialization. We derived optimization-based algorithms for data-efficient speech signal restoration 20. We also combined such algorithms with factorization models for music signal restoration 11. Finally, we derived novel optimization algorithms for audio source separation and speech enhancement 34. These have the potential to be further unfolded into neural networks in order to perform these tasks in an end-to-end fashion.
A widely used approach for speech enhancement consists in directly learning a deep neural network (DNN) to estimate clean speech from input noisy speech. Despite its promising performance, it comes with two main challenges. First, it requires very large DNNs learned over a huge dataset covering many noise types, noise levels, etc. Second, its generalization is usually limited to seen environments. Unsupervised speech enhancement addresses these challenges by learning only the clean speech distribution and modeling the noise at inference time. Along this line of research, we proposed a novel diffusion-based speech enhancement method 36 that leverages the power of diffusion-based generative models, which currently show great performance in computer vision. Furthermore, we developed a new training loss for diffusion-based supervised speech enhancement 19, which bridges the gap between the performance of supervised and unsupervised speech enhancement approaches.
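For readers unfamiliar with diffusion models, the sketch below shows the generic DDPM-style denoising objective on which such methods build: corrupt clean speech features at a random diffusion step and train a network to predict the injected noise. This is a textbook formulation, not the specific loss introduced in 19 or 36, and the model argument is a hypothetical noise-prediction network.

```python
# Sketch: the generic denoising objective underlying diffusion-based models -
# predict the Gaussian noise injected at a random diffusion step.
import torch
import torch.nn.functional as F

def diffusion_denoising_loss(model, clean_spec, alphas_cumprod):
    """clean_spec: (B, F, T) clean speech features; alphas_cumprod: (N,) noise schedule."""
    B = clean_spec.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=clean_spec.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(clean_spec)
    noisy = a_bar.sqrt() * clean_spec + (1 - a_bar).sqrt() * noise  # forward process
    pred_noise = model(noisy, t)            # network predicts the injected noise
    return F.mse_loss(pred_noise, noise)
```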
Training of multi-speaker text-to-speech (TTS) systems relies on high-quality curated datasets, which lack speaker diversity and are expensive to collect. As an alternative, we proposed to automatically select high-quality training samples from large, readily available crowdsourced automatic speech recognition (ASR) datasets using a non-intrusive perceptual mean opinion score estimator. Our method enhances the quality of training on a curated dataset and paves the way for automated TTS dataset curation across a broader spectrum of languages 37.
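A minimal sketch of the underlying selection idea follows: score every candidate utterance with a non-intrusive MOS predictor and keep only those above a threshold. The estimate_mos function is a placeholder for whichever perceptual quality estimator is used, not the team's actual model.

```python
# Sketch: selecting TTS training samples from a crowdsourced ASR corpus by
# thresholding a non-intrusive MOS estimate. estimate_mos() is a placeholder
# for an actual perceptual quality predictor.
import soundfile as sf

def select_tts_samples(utterances, estimate_mos, mos_threshold=3.8):
    """utterances: list of (wav_path, transcript) pairs."""
    selected = []
    for wav_path, transcript in utterances:
        audio, sr = sf.read(wav_path)
        score = estimate_mos(audio, sr)      # predicted mean opinion score in [1, 5]
        if score >= mos_threshold:
            selected.append((wav_path, transcript, score))
    # Keep the best-scoring samples first, e.g. to cap the dataset size.
    selected.sort(key=lambda x: x[2], reverse=True)
    return selected
```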
Domain-specific ASR systems are usually trained or adapted on a suitable amount of transcribed speech data. By contrast, we studied the training and the adaptation of recurrent neural network (RNN) ASR language models from a small amount of untranscribed speech data using multiple ASR hypotheses embedded in ASR confusion networks. Our sampling-based method achieved up to 12% relative reduction in perplexity on a meeting dataset as compared to training on ASR 1-best hypotheses 17.
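To make the sampling idea concrete, the sketch below draws training sentences from a confusion network represented as a sequence of word bins with posterior probabilities; this simplified data structure stands in for the actual ASR confusion networks, and the method in 17 involves further details not shown here.

```python
# Sketch: sampling LM training sentences from an ASR confusion network.
# The confusion network is simplified to a list of bins, each bin being a
# list of (word, posterior) pairs; "<eps>" marks an empty (skip) arc.
import random

def sample_hypotheses(confusion_network, n_samples=10, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        words = []
        for bin_ in confusion_network:
            candidates, posteriors = zip(*bin_)
            word = rng.choices(candidates, weights=posteriors, k=1)[0]
            if word != "<eps>":
                words.append(word)
        samples.append(" ".join(words))
    return samples

cn = [[("we", 0.7), ("he", 0.3)],
      [("meet", 0.5), ("met", 0.4), ("<eps>", 0.1)],
      [("tomorrow", 0.9), ("today", 0.1)]]
print(sample_hypotheses(cn, n_samples=3))
```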
We proposed a subband diffusion-based method for sound signal reconstruction from discrete compressed representations 40. While diffusion-based reconstruction approaches had mainly been targeting speech signals, our high-fidelity multi-band diffusion-based framework generates any type of audio (e.g., speech, music, environmental sounds) from low-bitrate discrete representations.
Speech signals convey a lot of private information. To protect speakers, we pursued our investigation of x-vector based voice anonymization, which relies on splitting the speech signal into the speaker (x-vector), phonetic and pitch features and resynthesizing the signal with a different target x-vector. To reduce the amount of residual speaker information in the phonetic and pitch features, we explored the use of Laplacian noise 14 inspired by differential privacy. Pierre Champion defended his PhD 48, and we released the report 8 of the interdisciplinary Dagstuhl Seminar organized in 2022 on this topic.
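The differential-privacy-inspired step can be illustrated as follows: Laplacian noise is added to the pitch and phonetic (bottleneck) features before resynthesis. The noise scales and feature shapes below are arbitrary illustrative values, not those used in 14.

```python
# Sketch: adding Laplacian noise to pitch and phonetic (bottleneck) features
# before resynthesis, to reduce residual speaker information.
# All scales and shapes are illustrative only.
import numpy as np

def laplace_perturb(features, scale, rng=None):
    """features: (T, D) array of frame-level features; scale: Laplace scale b."""
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=scale, size=features.shape)
    return features + noise

rng = np.random.default_rng(0)
pitch = rng.uniform(80, 300, size=(200, 1))        # fake F0 track (Hz)
bottleneck = rng.normal(size=(200, 256))           # fake phonetic features
noisy_pitch = laplace_perturb(pitch, scale=5.0, rng=rng)
noisy_bn = laplace_perturb(bottleneck, scale=0.1, rng=rng)
```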
End-to-end ASR has enabled the transcription of overlapping speech utterances using speaker-attributed ASR (SA-ASR) systems. We presented an end-to-end multichannel SA-ASR system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder. To the best of our knowledge, this is the first model that efficiently integrates ASR and speaker identification modules in a multichannel setting 21, 55.
We proposed two approaches to joint rich and normalized ASR, which produce transcriptions both with and without punctuation and capitalization. The first approach, which uses a language model to generate pseudo-rich transcriptions of normalized training data, performs better on out-of-domain data, with up to 9% relative error reduction. The second approach, which uses a single decoder conditioned on the type of output, demonstrates the feasibility of joint rich and normalized ASR using as little as 5% rich training data with a moderate (2.4% absolute) error increase 53.
During the master's internship of Raphaël Bagat, we investigated the representation of emotion in a latent space. We combined several acoustic representations (mel-spectrum, hand-crafted features, or Wav2Vec2 encodings) with a linguistic representation of the utterance based on SBERT. We evaluated the latent representation at several places in an emotion recognition system (concatenation after the encoders or at intermediate decoding steps). We showed that the type of combination strongly influences the result, and that one of the two modalities can end up under-exploited. The system using Wav2Vec2 together with a secondary acoustic-only emotion detection system achieves 71% recognition accuracy. A contrastive loss was also used, but without a significant gain. The latent representations were further used to assess a set of emotional acoustic features (eGeMAPS) in order to evaluate the information encoded in the latent space. This work could be continued to drive an expressive TTS system with this type of latent representation in order to express an emotion.
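A minimal sketch of the late-fusion variant (concatenation after the encoders) is given below, assuming pre-extracted, pooled Wav2Vec2 and SBERT embeddings; the dimensions and classifier head are illustrative and do not reproduce the systems evaluated in the internship.

```python
# Sketch: late fusion of an acoustic embedding (e.g. pooled Wav2Vec2 states)
# and a linguistic embedding (e.g. SBERT) for emotion recognition.
# Embedding dimensions and head sizes are illustrative.
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, acoustic_dim=768, linguistic_dim=384,
                 hidden_dim=256, n_emotions=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(acoustic_dim + linguistic_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, n_emotions),
        )

    def forward(self, acoustic_emb, linguistic_emb):
        fused = torch.cat([acoustic_emb, linguistic_emb], dim=-1)
        return self.head(fused)  # emotion logits

model = LateFusionEmotionClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 384))
```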
In the framework of our study concerning the perception of German fricatives by French dyslexic subjects, we analyzed the homogeneity of the answers inside the groups of dyslexic people and average readers. The targeted sounds were /s/ and /sh/, present in the French and German systems, and the voiceless palatal /ç/ (the final sound in “ich”), absent in French. Previous results have shown that people with dyslexia exhibited a slight deficit in the categorization of all the sounds and a relatively poor discrimination of the new sound /ç/, which invalidated Serniclaes’ hypothesis about a better sensitivity to universal contrasts by dyslexic people. At first sight, this result seems to corroborate the relatively common view that dyslexic people have impoverished phonological representations. A new analysis of the results for each individual showed, in agreement with those of Hazan for L1, that a substantial number of dyslexic individuals (approximately half of people with dyslexia, in our study) behave like average readers (a very homogeneous group). Thus for a large number of individuals with dyslexia, it appears that phonological representations are in fact intact. With respect to L2, this last group is not disadvantaged by dyslexia (as would suggest the hypothesis of impoverished phonological representations), at least at the phonological level and for the sounds present in this study.
The wide usage of social media has given rise to the problem of online hate speech. Deep neural network-based classifiers have become the state of the art for automatic hate speech classification. The performance of these classifiers depends on the amount of available labelled training data. However, most hate speech corpora have a small number of hate speech samples. We considered transferring knowledge from a resource-rich source to a low-resource target with fewer labeled instances, across different online platforms. A novel training strategy is proposed, which allows flexible modeling of the relative proximity of neighbors retrieved from the resource-rich corpus to learn the amount of transfer. We incorporate neighborhood information with Optimal Transport, which permits exploiting the embedding space geometry. By aligning the joint embedding and label distributions of neighbors, substantial improvements are obtained on low-resource hate speech corpora 47. Moreover, in 47, we proposed two DA approaches using feature attributions, which are post-hoc model explanations. In particular, we studied the problem of spurious corpus-specific correlations that restrict the generalizability of classifiers for detecting hate speech, a sub-category of abusive language. While prior approaches rely on a manually curated list of terms, we automatically extracted and penalized the terms causing spurious correlations. Our dynamic approaches improved the cross-corpus performance over previous works, both independently and in combination with pre-defined dictionaries.
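A simplified sketch of the alignment step, using the POT library's Sinkhorn solver to transport source-neighbor embeddings towards target embeddings, is shown below; the actual training strategy additionally weights neighbors by proximity and aligns label distributions, which is not reproduced here.

```python
# Sketch: aligning source-neighbor embeddings with target embeddings via
# entropic optimal transport (POT library). This only illustrates the
# transport step, not the full joint embedding/label alignment of the paper.
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_align(source_emb, target_emb, reg=0.05):
    """source_emb: (n, d), target_emb: (m, d). Returns transported sources."""
    n, m = len(source_emb), len(target_emb)
    a = np.full(n, 1.0 / n)                 # uniform source weights
    b = np.full(m, 1.0 / m)                 # uniform target weights
    cost = ot.dist(source_emb, target_emb)  # squared Euclidean cost matrix
    plan = ot.sinkhorn(a, b, cost, reg)     # (n, m) entropic transport plan
    # Barycentric mapping of each source point onto the target cloud.
    return (plan / plan.sum(axis=1, keepdims=True)) @ target_emb

src = np.random.randn(100, 768)
tgt = np.random.randn(80, 768) + 1.0
src_aligned = ot_align(src, tgt)
```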
Multiword expression (MWE) identification in tweets is a complex task due to the complex linguistic nature of MWEs combined with the non-standard language use in social networks. MWE features were shown to be helpful for hate speech detection (HSD). In 45, we studied the impact of the self-attention mechanism and of multi-task learning for hate speech detection. The two tasks are MWE identification and hate speech detection. We carried out our experiments on four corpora and using two contextual embeddings. We observed that multi-task systems significantly outperform the baseline single-task system. The best performance is obtained using the multi-task system with two attention heads.
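The sketch below shows the general shape of such a multi-task model: a shared contextual encoder with a token-level MWE tagging head and an utterance-level hate speech classification head. The encoder, pooling and dimensions are illustrative, and the attention configurations studied in 45 are not reproduced.

```python
# Sketch: multi-task model with a shared encoder, a token-level MWE tagging
# head and an utterance-level hate speech detection head.
# The encoder is any module returning hidden states of shape (B, T, H).
import torch
import torch.nn as nn

class MweHsdMultiTask(nn.Module):
    def __init__(self, encoder, hidden_dim=768, n_mwe_tags=3, n_hsd_classes=2):
        super().__init__()
        self.encoder = encoder                                # e.g. a pre-trained Transformer
        self.mwe_head = nn.Linear(hidden_dim, n_mwe_tags)     # per-token tags
        self.hsd_head = nn.Linear(hidden_dim, n_hsd_classes)  # per-utterance label

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]  # (B, T, H)
        mwe_logits = self.mwe_head(hidden)            # (B, T, n_mwe_tags)
        pooled = hidden.mean(dim=1)                   # simple mean pooling
        hsd_logits = self.hsd_head(pooled)            # (B, n_hsd_classes)
        return mwe_logits, hsd_logits
```

Training then sums a token-level cross-entropy for MWE tags and an utterance-level cross-entropy for hate speech, possibly with task weights.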
We investigated diffusion probabilistic models for multichannel speech enhancement as a front-end to a state-of-the-art ECAPA-TDNN speaker verification system. Results show that joint training of the two modules leads to better performance than separate training of the enhancement and speaker verification models 25. This approach was further extended to replace the fully supervised joint training stage by a self-supervised joint training stage 24.
Stuttering is a speech disorder in which the flow of speech is interrupted by involuntary pauses and repetitions of sounds. Stuttering identification is an interesting interdisciplinary research problem involving pathology, psychology, acoustics, and signal processing, which makes it hard and complicated to detect. Within the ANR project BENEPHIDIRE, the goal is to automatically identify typical kinds of stuttering disfluencies using acoustic and visual cues. Stuttered speech is usually available in limited amounts and is highly imbalanced. This year, we addressed the class imbalance problem via a multi-branching scheme and by weighting the contribution of classes in the overall loss function, resulting in a large improvement on the stuttering classes of the SEP-28k dataset over the baseline (StutterNet) 15. We also applied speech embeddings from pre-trained deep learning models, specifically ECAPA-TDNN and Wav2Vec2.0, for various tasks. When benchmarked with traditional classifiers on speaker diarization tasks, our method outperforms standard systems trained on the limited SEP-28k dataset, with further improvements observed when combining embeddings and concatenating multiple layers of Wav2Vec2.0 16. Shakeel Sheikh defended his PhD thesis on February 24th, 2023 50.
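A minimal sketch of the class-weighting part of this scheme (inverse-frequency weights in the cross-entropy loss) is shown below; the class counts are illustrative rather than the actual SEP-28k statistics, and the multi-branching architecture of 15 is not reproduced.

```python
# Sketch: weighting the contribution of each disfluency class in the loss
# by inverse class frequency, to counter the imbalance of stuttering data.
# Class counts are illustrative, not the actual SEP-28k statistics.
import torch
import torch.nn as nn

class_counts = torch.tensor([4000.0, 600.0, 450.0, 300.0, 250.0])  # fluent + 4 disfluency types
weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(16, 5)                 # model outputs for a batch
targets = torch.randint(0, 5, (16,))        # ground-truth class indices
loss = criterion(logits, targets)
```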
Pursuing our involvement in the community on ambient sound recognition, we co-organized a task on sound event detection and separation as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge. For this new edition, we have been investigating new evaluation metrics that are potentially more independent of post-processing tuning 26. This evaluation method can provide a more complete picture of the systems' behavior under different working conditions. In 2022, we introduced an energy consumption metric in order to raise awareness about the footprint of algorithms. In relation to this aspect, we measured the energy consumption of the baseline on several devices and for different hyperparameter values in order to define good practices for comparing the energy consumption of challenge submissions 43.
We also continued working on automatic audio captioning. We participated in the organization of the audio captioning task within the DCASE challenge. We also worked on proposing new metrics to evaluate captioning systems 30.
Following the work done by Nicolas Furnon during his PhD, we investigated to what extent signals obtained in simulated acoustic environments are relevant for evaluating speech enhancement approaches compared to real recorded signals 23. This study focused in particular on distributed algorithms. It was shown that simulated acoustic environments that do not take into account the head and torso of the person wearing hearing devices can provide unreliable performance estimates. A parallel corpus with simulated signals and recorded signals under similar acoustic conditions was designed for these experiments and will be released. Targeting speech enhancement for hearing aids, we also started investigating the performance of speech enhancement at a fine-grained phonetic level. The goal here is to link the results obtained with objective metrics to the outcome of listening tests conducted at our partner site (Institut de l'audition).
To estimate acoustic quantities of interest from speech signals, state-of-the-art methods rely on supervised learning on simulated data. However, few studies carefully examine the impact of acoustic simulation realism. We contributed such a study on speech direction-of-arrival estimation 44 and revealed that improving the realism of source, microphone and wall responses at training time consistently and significantly improves generalization to real data. Prerak Srivastava defended his PhD on this topic 51.
A new direction in speech enhancement involves an unsupervised framework. Unlike the common supervised method, which trains models using both clean and noisy speech data, the unsupervised method trains solely with clean speech. This training often employs variational autoencoders (VAEs) to create a data-driven speech model. This unsupervised approach could significantly enhance generalisation performance while keeping the model less complex than supervised alternatives. However, unsupervised methods typically require more computational resources during the inference (enhancement) phase. To address this issue, we have introduced several fast and efficient inference techniques tailored for speech enhancement, using posterior sampling strategies 41, 42. Initially, we applied these techniques to a basic, non-sequential VAE model 41. Later, we adapted them for more advanced dynamical VAE models and introduced additional sampling-based methods 42. Our experiments demonstrated the effectiveness of the proposed methods, narrowing the performance gap between supervised and unsupervised approaches in speech enhancement.
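To fix ideas, the sketch below shows a minimal frame-wise Gaussian VAE trained on clean speech only, i.e., the kind of data-driven speech prior used in this unsupervised setting; the architectures and variance modeling of the actual works are simplified.

```python
# Sketch: a minimal frame-wise VAE speech prior trained on clean speech only
# (the kind of model used as a data-driven speech prior in unsupervised
# enhancement). Dimensions and distributions are simplified.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFrameVAE(nn.Module):
    def __init__(self, n_freq=513, latent_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_freq))

    def forward(self, log_power):                  # (B, n_freq) log-power frames
        h = self.enc(log_power)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.dec(z)
        return recon, mu, logvar

def vae_loss(recon, target, mu, logvar):
    rec = F.mse_loss(recon, target, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```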
Speech processing tasks that utilize visual information from a speaker's lips, such as enhancement and separation, typically require a front-facing view of the speaker to extract as much useful information as possible from the speaker's lip movements. Previous methods have not taken this into account and instead rely on data augmentation to improve robustness to different face poses, which can lead to increased complexity in the models. Recently, we developed a robust statistical frontalization technique 10 that alternates between estimating a rigid transformation (scale, rotation, and translation) and a non-rigid deformation between an arbitrarily viewed face and a face model. The method has been extensively evaluated and compared with other state-of-the-art frontalization techniques, including those that use modern deep learning architectures, for lip-reading and audio-visual speech enhancement tasks. The results confirmed the benefits of the proposed framework over the previous works.
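The rigid step of this alternation can be illustrated with the classical closed-form similarity alignment (scale, rotation, translation) between two landmark sets, sketched below; the non-rigid deformation step and the robust statistics of 10 are not shown.

```python
# Sketch: closed-form similarity (scale, rotation, translation) alignment of
# observed 3D landmarks onto a face model (Umeyama-style). This illustrates
# only the rigid step of the alternating frontalization scheme.
import numpy as np

def similarity_align(src, dst):
    """src, dst: (N, 3). Returns (s, R, t) with dst ~ s * src @ R.T + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    var_s = (src_c ** 2).sum() / len(src)
    cov = dst_c.T @ src_c / len(src)            # (3, 3) cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0                        # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```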
In a related work 13, we addressed the problem of analyzing the performance of 3D face alignment (3DFA) methods, which is a necessary preprocessing step for audio-visual speech processing. While typically reliant on supervised learning from annotated datasets, 3DFA faces annotation errors, which could strongly bias the results. We explored unsupervised performance analysis (UPA), centering on estimating the rigid transformation between predicted and model landmarks. This approach is resilient to non-rigid facial changes and landmark errors. UPA involves extracting 3D landmarks from a 2D face, mapping them onto a canonical pose, and computing a robust confidence score for each landmark to determine their accuracy. The methodology is tested using public datasets and various 3DFA software, demonstrating consistency with supervised metrics and effectiveness in error detection and correction in 3DFA datasets.
Variational autoencoder (VAE) based generative models have shown potential in unsupervised audiovisual speech enhancement (AVSE), but current models do not fully leverage the sequential nature of speech and visual data. In a recent work 29, we introduced an audio-visual deep Kalman filter (AV-DKF) generative model, which combines audio-visual data more effectively using a first-order Markov chain for latent variables and an efficient inference method for speech signal estimation. Experiments comparing various generative models highlighted the AV-DKF's superiority over audio-only and non-sequential VAE-based models in speech enhancement.
This year, in collaboration with the IADI laboratory (P.-A. Vuissoz and K. Isaieva), we recorded rt-MRI data for 3 stuttering subjects plus a normally fluent control subject within the framework of the BENEPHIDIRE ANR project. We also recorded beatboxing data for an expert in the field. Those data enabled the investigation of the preparation of beatboxing patterns and the corresponding places of articulation 22. Finally, the manual correction of the phonetic segmentation of the large database (2100 sentences) recorded for one female speaker was completed this year. This database 32 was exploited by Vinicius Ribeiro to evaluate his prediction of the vocal tract shape from a sequence of phonemes to be articulated.
The exploitation of real-time MRI data requires the capability to process a very large number of images automatically. We have continued our work on segmenting MRI images so that we can move easily from one speaker to another and identify all the articulators. A Mask R-CNN network was trained to detect and segment the vocal tract articulator contours in two real-time MRI (rt-MRI) datasets with speech recordings of multiple speakers 12. Two post-processing algorithms were then proposed to convert the network's outputs into geometrical curves. Nine articulators were considered: the two lips, tongue, soft palate, pharynx, arytenoid cartilage, epiglottis, thyroid cartilage, and vocal folds. Rt-MRI of the vocal tract is often performed in 2D because, despite its interest, 3D rt-MRI does not offer sufficient quality. A second study therefore tested the applicability of super-resolution algorithms for dynamic vocal tract MRI 9. In total, 25 sagittal 2D slices of 8 mm with an in-plane resolution of 1.6 × 1.6 mm2 were acquired consecutively. The slices were aligned using the simultaneously recorded speech signal. The super-resolution strategy was used to reconstruct 1.6 × 1.6 × 1.6 mm3 isotropic volumes. The resulting images were less sharp than the native 2D images but demonstrated a higher signal-to-noise ratio. Super-resolution also eliminates inconsistencies, leading to smooth transitions between the slices. The proposed method allows for the reconstruction of high-quality dynamic 3D volumes of the vocal tract during natural speech.
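Returning to the contour post-processing mentioned above, the sketch below illustrates the kind of operation involved: converting a binary articulator mask predicted by the network into an ordered 2D contour with OpenCV. The actual post-processing algorithms in 12 apply additional articulator-specific geometric constraints.

```python
# Sketch: converting a predicted binary articulator mask into an ordered
# contour curve with OpenCV. The actual post-processing applies further
# articulator-specific geometric constraints.
import cv2
import numpy as np

def mask_to_contour(mask):
    """mask: (H, W) uint8 binary mask for one articulator.
    Returns an (N, 2) array of (x, y) contour points, or None if empty."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)   # keep the main component
    return largest.reshape(-1, 2)                  # drop the singleton dimension

mask = np.zeros((136, 136), dtype=np.uint8)
cv2.circle(mask, (68, 68), 20, 1, thickness=-1)    # toy "articulator" blob
curve = mask_to_contour(mask)
```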
Our recent work on the generation of the temporal vocal tract shape from a sequence of phonemes to be articulated exploited a large real time MRI database. However, the assessment of the shape quality still needs to be included in the process. Ranking generative models is tricky since the acoustic simulation alone is not good enough to guarantee that it does not introduce strong perceptive biases. A purely geometric assessment is therefore generally used, which is itself insufficient to deal with articulatory speaker variability.
Sign languages are rich visual languages with complex grammatical structures. As with any other natural language, they have their own unique linguistic and grammatical structures, which often do not have a one-to-one mapping to their spoken language counterparts. Computational sign language research lacks the large-scale datasets that would enable immediate applicability. To date, most datasets have suffered from small domains of discourse (e.g., weather forecasts), a lack of the necessary inter- and intra-signer variance on shared content, limited vocabulary and phrase variance, and poor visual quality due to low resolution, motion blur and interlacing artifacts. We collected a large dataset that includes over 300 hours of signed news video footage from a German broadcaster. We processed the video to extract spatial human skeletal features for the face, hands and body, and a textual transcription of the signing content. We analyzed the data (signer-based sample labeling, statistical outlier distribution, measurement of undersigning quality, and calculation of landmark error rate). We proposed a multimodal Transformer-based cross-attention framework to annotate our corpus with the existing glossary annotations extracted from the DGS (mDGS) dataset.
Flow-based generative models are widely used in text-to-speech (TTS) systems to learn the distribution of audio features given the input tokens. Yet the generated utterances lack diversity and naturalness. We proposed to improve the diversity of utterances by explicitly learning the distribution of pitch contours of each speaker during training using a stochastic flow-based pitch predictor, then conditioning the model on generated pitch contours during inference. The experimental results demonstrate that the proposed method yields a significant improvement in the naturalness and diversity of generated speech 38.
Our goal is to study the multimodal components (prosody, facial expressions, gestures) used during interaction. We consider the concurrent generation of speech and gestures by the speaker, taking into account both non-verbal and verbal gestures. In the context of Louis Abel’s PhD, we focus on non-verbal gesture generation (upper body and arms) derived from the acoustic signal and the text. The first step is to autoencode the motion representation from positions to obtain a latent representation. The model is based on graph neural networks (GNNs) to leverage the skeleton’s constraints. The motion is then predicted using a flow-based architecture from the latent motion representation, the audio speech feature context, and the text of the utterance. The model relies on a short context window for its recurrent architecture. At present, we are conducting perceptual experiments to assess the benefits of using GNNs in this type of generation. In Mickaëlla Grondin-Verdon's PhD, our focus shifts more towards dyadic gestures by analyzing specific gesture components, the strokes (duration, intensity, alignment, etc.). The aim is to feed models with this data to predict and generate gestures in a dialogue context. This year, we used the BEAT corpus, a comprehensive and varied multimodal corpus, to examine gesture categories, temporal references, durations, and comparisons. This analysis helps us gain a deeper understanding of the corpus structure and exploit the data in subsequent project stages. Since joining Multispeech, Domitille Caillat has been working on the role of multimodal annotations in the generation and automatic recognition of gestures. She plans to manually annotate a sample of the BEAT corpus, which is used by several Multispeech members, with the goal of objectively evaluating the results of gesture generation through a measured comparison of authentic data and generated data (frequency of gestures, favored places of occurrence, relationship between gestures and lexical affiliates, etc.).