🚀Quick Start

  1. Introduction
  2. Overall
  3. Representations of Spoken Dialogue Models
  4. Training Paradigm of Spoken Dialogue Model
  5. Streaming, Duplex, and Interaction
  6. Training Resources and Evaluation
  7. Cite

🔥What's new

  • 2024.11.22: We release WavChat (a survey of spoken dialogue models, about 60 pages) on arXiv! 🎉
  • 2024.08.31: We release WavTokenizer on arXiv.

Introduction

This is the official repository for the paper WavChat: A Survey of Spoken Dialogue Models (arXiv:2411.13577).

Figure 1: The timeline of existing spoken dialogue models in recent years.

Abstract

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. In the broader context of multimodal models, the speech modality offers a direct interface for human-computer interaction, enabling direct communication between AI and users. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we first compile existing spoken dialogue systems in chronological order and categorize them into cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering speech representation, training paradigms, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems.
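
To make the three-tier cascade mentioned in the abstract concrete, here is a minimal Python sketch. The class `CascadedPipeline` and the `toy_*` functions are hypothetical stand-ins, not components from the survey or from any listed model; a real system would plug in actual ASR, LLM, and TTS models. The point it illustrates is that only text passes between the stages.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedPipeline:
    """Three-tier cascade: speech -> text -> text -> speech."""

    asr: Callable[[bytes], str]  # speech recognition
    llm: Callable[[str], str]    # text-only language model
    tts: Callable[[str], bytes]  # speech synthesis

    def respond(self, audio_in: bytes) -> bytes:
        # Only the transcript crosses this boundary, so anything not
        # written down as text (style, timbre) is unavailable to the LLM.
        text_in = self.asr(audio_in)
        text_out = self.llm(text_in)
        return self.tts(text_out)


# Hypothetical stand-ins: "audio" is just UTF-8 bytes in this toy setup.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(asr=toy_asr, llm=toy_llm, tts=toy_tts)
reply = pipeline.respond(b"hello")
```

Because each stage is an ordinary callable, any one of them can be swapped out independently, which is the main practical appeal of the cascaded design.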

Overall

1. The organization of this survey

Figure 2: Organization of this survey.

2. General classification of spoken dialogue systems

Figure 3: A general overview of current spoken dialogue systems.

3. Key capabilities of speech dialogue systems

Figure 4: An overview of the spoken dialogue systems' nine ideal capabilities.

4. Publicly Available Speech Dialogue Models

| Model | URL |
|---|---|
| AudioGPT | https://github.com/AIGC-Audio/AudioGPT |
| SpeechGPT | https://github.com/0nutation/SpeechGPT |
| Freeze-Omni | https://github.com/VITA-MLLM/Freeze-Omni |
| Baichuan-Omni | https://github.com/westlake-baichuan-mllm/bc-omni |
| GLM-4-Voice | https://github.com/THUDM/GLM-4-Voice |
| Mini-Omni | https://github.com/gpt-omni/mini-omni |
| Mini-Omni2 | https://github.com/gpt-omni/mini-omni2 |
| FunAudioLLM | https://github.com/FunAudioLLM |
| Qwen-Audio | https://github.com/QwenLM/Qwen-Audio |
| Qwen2-Audio | https://github.com/QwenLM/Qwen2-Audio |
| LLaMA3.1 | https://www.llama.com |
| Audio Flamingo | https://github.com/NVIDIA/audio-flamingo |
| Ultravox | https://github.com/fixie-ai/ultravox |
| Spirit LM | https://github.com/facebookresearch/spiritlm |
| dGSLM | https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/dgslm |
| Spoken-LLM | https://arxiv.org/abs/2305.11000 |
| LLaMA-Omni | https://github.com/ictnlp/LLaMA-Omni |
| Moshi | https://github.com/kyutai-labs/moshi |
| SALMONN | https://github.com/bytedance/SALMONN |
| LTU-AS | https://github.com/YuanGongND/ltu |
| VITA | https://github.com/VITA-MLLM/VITA |
| SpeechGPT-Gen | https://github.com/0nutation/SpeechGPT |
| WavLLM | https://github.com/microsoft/SpeechT5/tree/main/WavLLM |
| Westlake-Omni | https://github.com/xinchen-ai/Westlake-Omni |
| MooER-Omni | https://github.com/MooreThreads/MooER |
| Hertz-dev | https://github.com/Standard-Intelligence/hertz-dev |
| Fish-Agent | https://github.com/fishaudio/fish-speech |
| SpeechGPT2 | https://0nutation.github.io/SpeechGPT2.github.io/ |

Table 1: A list of publicly available spoken dialogue models and their URLs.

Representations of Spoken Dialogue Models

The Representations of Spoken Dialogue Models section examines how speech data is represented inside a spoken dialogue model so that speech can be both understood and generated well. The choice of representation directly affects how effectively the model processes speech signals, its overall performance, and its range of applications. The section covers two main types of representations: semantic representations and acoustic representations.

| Representation | Comprehension-side advantage | Unified music and audio performance | Speech compression rate | Conversion to historical context | Emotional and acoustic information | Post-processing pipeline |
|---|---|---|---|---|---|---|
| Semantic | Strong | Weak | High | Easy | Less | Cascade |
| Acoustic | Weak | Strong | Low | Difficult | More | End-to-end |

Table 2: Comparison of semantic and acoustic representations.
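
The compression-rate column can be made tangible with a little arithmetic. The sketch below, in Python, computes the token rate and bitrate of a discrete speech representation from its frame rate, number of codebooks, and codebook size. The function names and the two example configurations are illustrative assumptions, not settings taken from the survey or from any specific model.

```python
import math


def tokens_per_second(frame_rate_hz: float, num_codebooks: int) -> float:
    """Number of discrete tokens emitted per second of audio."""
    return frame_rate_hz * num_codebooks


def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second: each token carries log2(codebook_size) bits."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second(frame_rate_hz, num_codebooks) * bits_per_token


# Hypothetical configurations, chosen only to show the contrast.
multi_codebook = bitrate_bps(frame_rate_hz=75, num_codebooks=8, codebook_size=1024)
single_codebook = bitrate_bps(frame_rate_hz=50, num_codebooks=1, codebook_size=1024)

print(f"8 codebooks at 75 Hz: {multi_codebook:.0f} bps")
print(f"1 codebook at 50 Hz:  {single_codebook:.0f} bps")
```

Fewer tokens per second means shorter sequences for the language model to handle, which is one reason a highly compressed representation is easier to carry along as dialogue history.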

We also provide a comprehensive list of publicly available codec models and their URLs.

| Model | URL |
|---|---|
| Encodec | https://github.com/facebookresearch/encodec |
| SoundStream | https://github.com/wesbz/SoundStream |
| DAC | https://github.com/descriptinc/descript-audio-codec |
| WavTokenizer | https://github.com/jishengpeng/WavTokenizer |
| SpeechTokenizer | https://github.com/ZhangXInFD/SpeechTokenizer |
| SNAC | https://github.com/hubertsiuzdak/snac |
| SemantiCodec | https://github.com/haoheliu/SemantiCodec-inference |
| Mimi | https://github.com/kyutai-labs/moshi |
| HiFi-Codec | https://github.com/yangdongchao/AcademiCodec |
| FunCodec | https://github.com/modelscope/FunCodec |
| APCodec | https://github.com/YangAi520/APCodec/tree/main |
| AudioDec | https://github.com/facebookresearch/AudioDec |
| FACodec | https://github.com/lifeiteng/naturalspeech3_facodec |
| Language-Codec | https://github.com/jishengpeng/Languagecodec |
| XCodec | https://github.com/zhenye234/xcodec |
| TiCodec | https://github.com/y-ren16/TiCodec |
| SoCodec | https://github.com/hhguo/SoCodec |
| FUVC | https://github.com/z21110008/FUVC |
| HILCodec | https://github.com/aask1357/hilcodec |
| LaDiffCodec | https://github.com/haiciyang/LaDiffCodec |
| LLM-Codec | https://github.com/yangdongchao/LLM-Codec |
| SpatialCodec | https://github.com/XZWY/SpatialCodec |
| BigCodec | https://github.com/Aria-K-Alethia/BigCodec |
| SuperCodec | https://github.com/exercise-book-yq/Supercodec |
| RepCodec | https://github.com/mct10/RepCodec |
| EnCodecMAE | https://github.com/habla-liaa/encodecmae |
| MuCodec | https://github.com/xuyaoxun/MuCodec |
| SPARC | https://github.com/Berkeley-Speech-Group/Speech-Articulatory-Coding |
| BANC | https://github.com/anton-jeran/MULTI-AUDIODEC |
| SpeechRVQ | https://huggingface.co/ibm/DAC.speech.v1.0 |
| QINCo | https://github.com/facebookresearch/Qinco |
| SimVQ | https://github.com/youngsheen/SimVQ |

Table 3: A comprehensive list of publicly available codec models and their URLs.

Training Paradigm of Spoken Dialogue Model

The Training Paradigm of Spoken Dialogue Model section focuses on how to adapt text-based large language models (LLMs) into dialogue systems with speech processing capabilities. The selection and design of the training paradigm directly affect the model's overall performance, real-time responsiveness, and multimodal alignment.

Figure 5: Categorization Diagram of Spoken Dialogue Model Architectural Paradigms (left) and Diagram of Multi-stage Training Steps (right)
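
One way a text LLM can be given speech input and output, shown here only as an illustration, is to append discrete speech-unit tokens to its vocabulary so that text and speech share a single ID space. The Python sketch below is a minimal version of that idea; the helper names and the `<unit_k>` token format are hypothetical, and this is not claimed to be the approach depicted in Figure 5 or the only option the survey covers. It assumes the text vocabulary uses contiguous IDs starting at 0.

```python
def extend_vocab(text_vocab: dict[str, int], num_speech_units: int) -> dict[str, int]:
    """Return a copy of the text vocabulary with speech-unit tokens appended.

    Assumes text token IDs are contiguous in [0, len(text_vocab)).
    """
    vocab = dict(text_vocab)
    offset = len(vocab)
    for unit in range(num_speech_units):
        vocab[f"<unit_{unit}>"] = offset + unit
    return vocab


def speech_units_to_ids(units: list[int], text_vocab_size: int) -> list[int]:
    """Shift raw unit indices past the text IDs so the two ranges never collide."""
    return [text_vocab_size + unit for unit in units]


# Toy text vocabulary (hypothetical).
text_vocab = {"<pad>": 0, "hello": 1, "world": 2}
vocab = extend_vocab(text_vocab, num_speech_units=4)
ids = speech_units_to_ids([0, 3], text_vocab_size=len(text_vocab))
```

In a real model the embedding table and output layer would also need to grow by `num_speech_units` rows, and those new rows would be learned during the speech-adaptation stages.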

We also summarize the alignment post-training methods.
Figure 6: Alignment Post-training Methods

Streaming, Duplex, and Interaction

The Streaming, Duplex, and Interaction section discusses how streaming processing, duplex communication, and interaction capabilities are implemented in spoken dialogue models. These features are crucial for improving the model's response speed, naturalness, and interactivity in real-time conversations.

Figure 7: The Example Diagram of Duplex Interaction
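
The difference between half-duplex and full-duplex turn handling can be sketched as a tiny state machine. The Python example below is a simplified illustration under stated assumptions: the event names (`user_turn_end`, `user_speech_start`, `response_done`) and the two-state model are hypothetical and are not taken from the survey or from any listed system. The state tracks only whether the system is currently producing speech.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def step(state: State, event: str, full_duplex: bool) -> State:
    """Advance the dialogue state by one event."""
    if state is State.LISTENING and event == "user_turn_end":
        return State.SPEAKING
    if state is State.SPEAKING:
        if event == "response_done":
            return State.LISTENING
        # Barge-in: only a full-duplex system reacts to the user
        # starting to talk while it is still speaking.
        if event == "user_speech_start" and full_duplex:
            return State.LISTENING
    return state


def run(events: list[str], full_duplex: bool) -> list[State]:
    """Return the state reached after each event, starting from LISTENING."""
    state = State.LISTENING
    trace = []
    for event in events:
        state = step(state, event, full_duplex)
        trace.append(state)
    return trace


events = ["user_turn_end", "user_speech_start", "response_done"]
full = run(events, full_duplex=True)
half = run(events, full_duplex=False)
```

With the same event sequence, the full-duplex run stops speaking as soon as the user interrupts, while the half-duplex run keeps speaking until its response is finished.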

Training Resources and Evaluation

1. Training resources

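Datasets used in the various training stages

| Stage | Task | Dataset | Size | URL | Modality |
|---|---|---|---|---|---|
| Modal Alignment | Multilingual TTS | Emilia | 101k hrs | Link | Text, Speech |
| Modal Alignment | Mandarin ASR | AISHELL-1 | 170 hrs | Link | Text, Speech |
| Modal Alignment | Mandarin ASR | AISHELL-2 | 1k hrs | Link | Text, Speech |
| Modal Alignment | Mandarin TTS | AISHELL-3 | 85 hrs, 88,035 utt., 218 spk. | Link | Text, Speech |
| Modal Alignment | TTS | LibriTTS | 585 hrs | Link | Text, Speech |
| Modal Alignment | ASR | TED-LIUM | 452 hrs | Link | Text, Speech |
| Modal Alignment | ASR | VoxPopuli | 1.8k hrs | Link | Text, Speech |
| Modal Alignment | ASR | Librispeech | 1,000 hrs | Link | Text, Speech |
| Modal Alignment | ASR | MLS | 44.5k hrs | Link | Text, Speech |
| Modal Alignment | TTS | Wenetspeech | 22.4k hrs | Link | Text, Speech |
| Modal Alignment | ASR | Gigaspeech | 40k hrs | Link | Text, Speech |
| Modal Alignment | ASR | VCTK | 300 hrs | Link | Text, Speech |
| Modal Alignment | TTS | LJSpeech | 24 hrs | Link | Text, Speech |
| Modal Alignment | ASR | Common Voice | 2,500 hrs | Link | Text, Speech |
| Dual-Stream Processing | Instruction | Alpaca | 52,000 items | Link | Text + TTS |
| Dual-Stream Processing | Instruction | Moss | - | Link | Text + TTS |
| Dual-Stream Processing | Instruction | BelleCN | - | Link | Text + TTS |
| Dual-Stream Processing | Dialogue | UltraChat | 1.5 million | Link | Text + TTS |
| Dual-Stream Processing | Instruction | Open-Orca | - | Link | Text + TTS |
| Dual-Stream Processing | Noise | DNS | 2425 hrs | Link | Noise data |
| Dual-Stream Processing | Noise | MUSAN | - | Link | Noise data |
| Conversation Fine-Tune | Dialogue | Fisher | 964 hrs | Link | Text, Speech |
| Conversation Fine-Tune | Dialogue | GPT-Talker | - | Link | Text, Speech |
| Conversation Fine-Tune | Instruction | INSTRUCTS2S-200K | 200k items | Link | Text + TTS |
| Conversation Fine-Tune | Instruction | Open Hermes | 900k items | Link | Text + TTS |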

Table 4: Datasets used in the various training stages

Music and Non-Speech Sound Datasets

| Dataset | Size | URL | Modality |
|---|---|---|---|
| ESC-50 | 2,000 clips (5s each) | Link | Sound |
| UrbanSound8K | 8,732 clips (<=4s each) | Link | Sound |
| AudioSet | 2000k+ clips (10s each) | Link | Sound |
| TUT Acoustic Scenes 2017 | 52,630 segments | Link | Sound |
| Warblr | 10,000 clips | Link | Sound |
| FSD50K | 51,197 clips (total 108.3 hours) | Link | Sound |
| DCASE Challenge | varies annually | Link | Sound |
| IRMAS | 6,705 audio files (3s each) | Link | Music |
| FMA | 106,574 tracks | Link | Music |
| NSynth | 305,979 notes | Link | Music |
| EMOMusic | 744 songs | Link | Music |
| MedleyDB | 122 multitrack recordings | Link | Music |
| MagnaTagATune | 25,863 clips (30s each) | Link | Music |
| MUSDB | 150 songs | Link | Music |
| M4Singer | 700 songs | Link | Music |
| Jamendo | 600k songs | Link | Music |

Table 5: Music and Non-Speech Sound Datasets

2. Evaluation

Evaluation is a crucial aspect of training and testing spoken dialogue models. In this section, we provide a comprehensive overview of evaluation across 11 aspects. The evaluation metrics are categorized into two main types: Basic Evaluation and Advanced Evaluation.

Table 6: Evaluated abilities together with their common tasks, representative benchmarks, and corresponding metrics.
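
As one concrete example of a metric, the sketch below computes word error rate (WER), which is widely used to score the textual content of recognized or generated speech against a reference transcript. It is offered only as an illustration in Python: this README text does not confirm which metrics Table 6 lists, and the function name `word_error_rate` is our own. It uses a standard word-level edit distance with whitespace tokenization and no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

In practice, both strings are usually normalized first (casing, punctuation, number formatting), since those choices can change the score noticeably.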

Cite

@article{ji2024wavchat,
  title={WavChat: A Survey of Spoken Dialogue Models},
  author={Ji, Shengpeng and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Lu, Jingyu and Wang, Hanting and Jiang, Ziyue and Zhou, Long and Liu, Shujie and Cheng, Xize and others},
  journal={arXiv preprint arXiv:2411.13577},
  year={2024}
}