AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head


Abstract

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite this recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) an input/output interface (ASR, TTS) to support spoken dialogue. With the increasing demand to evaluate multi-modal LLMs on human intention understanding and cooperation with foundation models, we outline the design principles and evaluation process, and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, empowering humans to create rich and diverse audio content with unprecedented ease.


Introduction

Large language models (LLMs) (Devlin et al., 2018; Raffel et al., 2020; Brown et al., 2020; Ouyang et al., 2022; Zhang et al., 2022a) are having a significant impact on the AI community, and the advent of ChatGPT and GPT-4 has further advanced natural language processing. Based on massive corpora of web-text data and powerful architectures, LLMs are able to read, write, and communicate like humans. Despite these successful applications in text processing and generation, replicating this success for the audio modality (speech (Ren et al., 2020; Huang et al., 2022a), music (Huang et al., 2021; Liu et al., 2022a), sound (Yang et al., 2022; Huang et al., 2023a), and talking head (Wu et al., 2021; Ye et al., 2023)) remains limited, even though it would be highly beneficial: 1) in real-world scenarios, humans communicate using spoken language in daily conversations and use spoken assistants (e.g., Siri or Alexa) for convenience; and 2) as an inherent part of intelligence, processing audio-modality information is a necessity for achieving artificial general intelligence. Understanding and generating speech, music, sound, and talking head could be a critical step for LLMs toward more advanced AI systems.

Despite the benefits of the audio modality, training LLMs that support audio processing is still challenging due to the following issues: 1) Data: obtaining human-labeled speech data is expensive and time-consuming, and only a few resources provide real-world spoken dialogues. Furthermore, the amount of data is limited compared with the vast corpora of web-text data, and multi-lingual conversational speech data is even scarcer. 2) Computational resources: training multi-modal LLMs from scratch is computationally intensive and time-consuming; given that audio foundation models that can understand and generate speech, music, sound, and talking head already exist, it would be wasteful to start training from scratch.

In this work, we introduce AudioGPT, a system designed to excel at understanding and generating the audio modality in spoken dialogue. Specifically, 1) instead of training multi-modal LLMs from scratch, we leverage a variety of audio foundation models to process complex audio information, with LLMs (i.e., ChatGPT) regarded as the general-purpose interface (Wu et al., 2023; Shen et al., 2023) that empowers AudioGPT to solve numerous audio understanding and generation tasks; and 2) instead of training a spoken language model, we connect LLMs with an input/output interface (ASR, TTS) for speech conversations. As illustrated in Figure 1, the whole process of AudioGPT can be divided into four stages:

• Modality Transformation: using the input/output interface for modality transformation between speech and text, bridging the gap between spoken language LLMs and ChatGPT.
• Task Analysis: utilizing the dialogue engine and prompt manager to help ChatGPT understand the intention of a user to process audio information.
• Model Assignment: receiving the structured arguments for prosody, timbre, and language control, ChatGPT assigns the audio foundation models for understanding and generation.
• Response Generation: generating and returning a final response to users after the execution of the audio foundation models.

Figure 1: A high-level overview of AudioGPT. AudioGPT can be divided into four stages: modality transformation, task analysis, model assignment, and response generation.
AudioGPT equips ChatGPT with audio foundation models to handle complex audio tasks and is connected with a modality transformation interface to enable spoken dialogue. We design principles to evaluate multi-modal LLMs in terms of consistency, capability, and robustness.


Task Analysis

As introduced in Sec. 3.1, the task analysis step focuses on extracting the structured argument $a_n$ from $(q_n, \mathcal{C})$. Specifically, the context $\mathcal{C}$ is fed into the dialogue engine $L$ ahead of the argument extraction. Based on the types of the query resources $\{q_n^{(s_1)}, \ldots, q_n^{(s_k)}\}$ in $q_n$, the task handler $H$ first classifies the query into different task families according to their I/O modalities. Then, given the selected task family $H(q_n)$, the query description $q_n^{(d)}$ is passed into the prompt manager $M$ to generate the argument $a_n$, which includes the selected audio foundation model $P_p$ and its corresponding task-related arguments $h_{P_p}$, where $p$ is the index of the selected audio model in the audio model set $\{P_i\}_{i=1}^P$. Note that, for an audio/image-input task family, $h_{P_p}$ may also contain the necessary resources (e.g., audio or images) from the previous context $\mathcal{C}$.

As mentioned above, the task family is determined by the task handler $H$ according to the I/O modalities; for example, the Audio-to-Text family includes Speech Recognition, which transcribes human speech. The selected model $P_p$ is then executed on the query resources and its task-related arguments:

$$o_{P_p} = P_p\big(\{q_n^{(s_1)}, q_n^{(s_2)}, \ldots, q_n^{(s_k)}\}, h_{P_p}\big). \quad (5)$$

To keep AudioGPT efficient, we perform audio model initialization during either environment setup or server initialization.
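To make the notation above more concrete, the following is a minimal Python sketch of one way the task-analysis step could be organized. The task families, model names (e.g., "whisper", "fastspeech2"), and classification heuristic are illustrative assumptions, not AudioGPT's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Illustrative task families keyed by (input, output) modality; names are assumptions.
TASK_FAMILIES: Dict[Tuple[str, str], List[str]] = {
    ("audio", "text"): ["speech-recognition", "audio-captioning"],
    ("text", "audio"): ["text-to-speech", "text-to-audio"],
    ("audio", "audio"): ["speech-enhancement", "source-separation"],
}

@dataclass
class Argument:
    """Structured argument a_n: selected audio model P_p plus its task-related arguments h_Pp."""
    model_name: str
    task_args: Dict[str, Any] = field(default_factory=dict)

def task_handler(query_resources: List[str], query_description: str) -> str:
    """H: pick a task family from the query's I/O modalities (toy heuristic for illustration).

    A real handler would let the dialogue engine L disambiguate within a family."""
    input_mod = "audio" if query_resources else "text"
    wants_audio = any(w in query_description.lower() for w in ("speech", "sound", "music"))
    output_mod = "audio" if wants_audio and input_mod == "text" else "text"
    return TASK_FAMILIES.get((input_mod, output_mod), ["unsupported"])[0]

def prompt_manager(task_family: str, query_description: str, context_resources: List[str]) -> Argument:
    """M: build a_n for the selected family; resources from the previous context C may be reused."""
    if task_family == "speech-recognition":
        return Argument("whisper", {"audio_paths": context_resources})   # placeholder model name
    if task_family == "text-to-speech":
        return Argument("fastspeech2", {"text": query_description})      # placeholder model name
    return Argument("none", {"reason": f"unsupported task family: {task_family}"})

# Example: build a_n for "Transcribe this recording" given an uploaded audio file from C.
a_n = prompt_manager(task_handler(["meeting.wav"], "Transcribe this recording"),
                     "Transcribe this recording", ["meeting.wav"])
```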


Our main contributions are summarized as follows:

• We propose AudioGPT, which equips ChatGPT with audio foundation models to handle complex audio tasks. As a general-purpose interface, ChatGPT is connected with a modality transformation interface to enable spoken dialogue.
• We outline the design principles and process of evaluating multi-modal LLMs, and test AudioGPT in terms of consistency, capability, and robustness.
• Demonstrations show the effectiveness of AudioGPT in audio understanding and generation across multiple rounds of dialogue, empowering humans to create rich and diverse audio content with unprecedented ease.

Related Works


Large Language Models

The research areas of AI are being revolutionized by the rapid progress of Large Language Models (LLMs) (Brown et al., 2020; Ouyang et al., 2022; Zhang et al., 2022a), which can serve as general-purpose language task solvers, and the research paradigm has been shifting towards their use. Language modeling has long been considered a core problem in natural language processing, and LLMs have demonstrated remarkable abilities for tasks such as machine translation (Gulcehre et al., 2017; Baziotis et al., 2020), open-ended dialogue modeling (Hosseini-Asl et al., 2020; Thoppilan et al., 2022), and even code completion (Svyatkovskiy et al., 2019; Liu et al., 2020). Among them, Kaplan et al. (2020) studied the impact of scaling on the performance of deep learning models, showing the existence of power laws between model and dataset sizes and system performance. Language models (LMs) at scale, such as GPT-3 (Brown et al., 2020), have demonstrated remarkable performance in few-shot learning. FLAN (Wei et al., 2021) was proposed to improve the zero-shot performance of large language models, expanding their reach to a broader audience. LLaMA (Touvron et al., 2023) shows that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data, without resorting to proprietary datasets. The advent of ChatGPT and GPT-4 has prompted a rethinking of the possibilities of artificial general intelligence (AGI).


Spoken Generative Language Models

Self-supervised learning (SSL) has emerged as a popular solution to many speech processing problems, leveraging massive amounts of unlabeled speech data. HuBERT (Hsu et al., 2021) is trained with a masked prediction objective on masked continuous audio signals. Inspired by vector quantization (VQ) techniques, SoundStream (Zeghidour et al., 2021) and Encodec (Défossez et al., 2022) present hierarchical architectures for high-level representations that carry semantic information. Most of these models build discrete units in a compact and discrete space, which can be modeled with an autoregressive Transformer whose predictions are then mapped back to the original signal space. Hayashi & Watanabe (2020) leverage discrete VQ-VAE representations to build speech synthesis models via autoregressive machine translation. "Textless NLP" (Kharitonov et al., 2022; Huang et al., 2022c) has been proposed to model language directly, without any transcription, by training autoregressive generative models of low-bitrate audio tokens. AudioLM (Borsos et al., 2022) and MusicLM (Agostinelli et al., 2023) follow a similar approach to address the trade-off between coherence and high-quality synthesis, casting audio synthesis as a language modeling task and leveraging a hierarchy of coarse-to-fine audio discrete units in a discrete representation space. Recently, Nguyen et al. (2023) leveraged the success of discrete representations and introduced the first end-to-end generative spoken dialogue language model. However, due to the data and computational resource scarcity mentioned above, it would be challenging to train spoken generative language models from scratch that enable processing of complex audio information. In contrast, we regard LLMs (i.e., ChatGPT) as the general-purpose interface and leverage various audio foundation models to solve audio understanding and generation tasks, where AudioGPT is further connected with modality transformation to support speech conversations.

AudioGPT


System Formulation

As briefly discussed in Sec. 1, AudioGPT is a prompt-based system, defined by its components: a modality transformer $T$, a dialogue engine $L$ (i.e., a large language model, LLM), a prompt manager $M$, a task handler $H$, and a set of $P$ audio foundation models $\{P_i\}_{i=1}^P$. Let a context with $(n-1)$ rounds of interaction be defined as $\mathcal{C} = \{(q_1, r_1), (q_2, r_2), \ldots, (q_{n-1}, r_{n-1})\}$, where $q_i$ is the query and $r_i$ the response of the $i$-th round. Given a new query $q_n$, the execution of AudioGPT generates the response $r_n$. During inference, AudioGPT can be decomposed into four major steps:

1) Modality transformation: transform the various input modalities within $q_n$ into a query $q_n'$ with a consistent modality;
2) Task analysis: utilize the dialogue engine $L$ and the prompt manager $M$ to parse $(q_n', \mathcal{C})$ into structured arguments $a_n$ for the task handler $H$;
3) Model assignment: the task handler $H$ consumes the structured arguments $a_n$ and sends them to the corresponding audio task processor $P_s$, where $s$ is the selected task index;
4) Response generation: after the execution of $P_s(a_n)$, the final response $r_n$ is generated by $L$ by combining information from $(q_n', \mathcal{C}, P_s(a_n))$.
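To make the four-step decomposition concrete, here is a minimal Python sketch of one inference round. The component interfaces (callables standing in for $T$, $L$, $M$, $H$, and $\{P_i\}$) are assumptions for illustration, not the actual AudioGPT implementation.

```python
from typing import Callable, Dict, List, Tuple

# Minimal type stand-ins mirroring the notation above (assumptions for illustration):
#   T: modality transformer, L: dialogue engine (LLM), M: prompt manager,
#   H: task handler, models: the audio foundation model set {P_i}.
Query = str
Response = str
Context = List[Tuple[Query, Response]]
Args = Dict[str, object]

def audiogpt_step(q_n: Query,
                  context: Context,
                  T: Callable[[Query], Query],
                  L: Callable[..., str],
                  M: Callable[[str, Context], Args],
                  H: Callable[[Args], Tuple[str, Args]],
                  models: Dict[str, Callable[[Args], str]]) -> Response:
    """One round of inference following the four-step decomposition."""
    q_prime = T(q_n)                       # 1) modality transformation: q_n -> q_n'
    a_n = M(L(q_prime, context), context)  # 2) task analysis: (q_n', C) -> structured arguments a_n
    name, h = H(a_n)                       # 3) model assignment: route a_n to the selected P_s
    output = models[name](h)               #    ... and execute P_s(a_n)
    r_n = L(q_prime, context, output)      # 4) response generation from (q_n', C, P_s(a_n))
    context.append((q_n, r_n))             # extend the context C for the next round
    return r_n
```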


Modality Transformation

As discussed in Sec. 3.1, the first stage aims to transform the query $q_n$ into a new query $q_n'$ in a consistent format. The user input query $q_n$ includes two parts: a query description $q_n^{(d)}$ and a set of $k$ query-related resources $\{q_n^{(s_1)}, \ldots, q_n^{(s_k)}\}$. In AudioGPT, the query description $q_n^{(d)}$ can be either in textual or audio (i.e., speech) format. The modality transformer $T$ first checks the modality of the query description $q_n^{(d)}$; if it is audio, $T$ is responsible for converting it to the textual modality:

$$q_n^{(d)\prime} = \begin{cases} q_n^{(d)}, & \text{if } q_n^{(d)} \text{ is text}, \\ T\big(q_n^{(d)}\big), & \text{if } q_n^{(d)} \text{ is audio}. \end{cases} \quad (3)$$
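The piecewise rule in Eq. (3) amounts to a simple dispatch on the modality of the query description. A small sketch under the assumption that audio descriptions arrive as file paths; the ASR callable is a placeholder, not a specific library API.

```python
AUDIO_SUFFIXES = (".wav", ".flac", ".mp3")   # assumption: audio descriptions arrive as file paths

def transform_query_description(q_desc: str, asr) -> str:
    """Eq. (3): pass a textual description through unchanged; transcribe an audio one with T (ASR).

    `asr` stands in for any speech-recognition callable; the real interface is not specified here."""
    if q_desc.lower().endswith(AUDIO_SUFFIXES):
        return asr(q_desc)
    return q_desc
```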


Response Generation

The response generation is highly related to the selected task $P_p$ and its output $o_{P_p}$. Specifically, for audio generation tasks, AudioGPT shows both the waveform as an image and the corresponding audio file for downloading/playing; for tasks that generate text, the model directly returns the transcribed or generated text; for the video generation task, the output video and some representative image frames are shown; and for classification tasks, a posteriorgram of categories over the time span is shown.
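A hedged sketch of how this task-dependent rendering could be organized; the helper functions and task-type labels are hypothetical placeholders, not part of the released system.

```python
from typing import Dict, List

def plot_waveform(audio_path: str) -> str:
    """Hypothetical helper: render the waveform to an image and return the image path."""
    return audio_path.rsplit(".", 1)[0] + "_waveform.png"

def sample_frames(video_path: str, n: int = 4) -> List[str]:
    """Hypothetical helper: pick n representative frames from the generated video."""
    return [f"{video_path}#frame{i}" for i in range(n)]

def render_response(task_type: str, output) -> Dict[str, object]:
    """Format the output o_Pp of the selected model according to the task type."""
    if task_type == "audio_generation":
        # Waveform shown as an image plus the audio file for downloading/playing.
        return {"waveform_image": plot_waveform(output), "audio_file": output}
    if task_type == "text_generation":
        # Transcribed or generated text is returned directly.
        return {"text": output}
    if task_type == "video_generation":
        # The output video together with some representative image frames.
        return {"video_file": output, "frames": sample_frames(output)}
    if task_type == "classification":
        # A posteriorgram of categories over the time span.
        return {"posteriorgram": output}
    raise ValueError(f"unknown task type: {task_type}")
```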

Evaluating Multi-Modal LLMs

Overview

The rapid development of multi-modal LLMs (Wu et al., 2023; Shen et al., 2023; Huang et al., 2023b) has significantly increased the research demand for evaluating their performance and behavior in understanding human intention, performing complex reasoning, and organizing the cooperation of multiple audio foundation models. In this section, we outline the design principles and process of evaluating multi-modal LLMs (i.e., AudioGPT). Specifically, we evaluate the LLMs in the following three aspects: 1) Consistency, which measures whether the LLMs properly understand the intention of a user and assign audio foundation models in a way closely aligned with human cognition and problem-solving; 2) Capability, which measures the performance of audio foundation models in handling complex audio tasks, understanding and generating speech, music, sound, and talking head in a zero-shot fashion; and 3) Robustness, which measures the ability of LLMs to deal with special cases.


Consistency

Figure 2: A high-level overview of consistency evaluation. An example annotated prompt is {"Please generate a voice from text", Text-to-Speech}.


For the Text-to-Speech example in Figure 2, the LLM-generated paraphrases include:
• Can you synthesize a voice audio with the text content?
• Please convert written text into natural-sounding speech.
• Convert text into natural-sounding speech.
• Transform your written words into spoken audio.
• Bring your written content to life with speech synthesis.

In the consistency evaluation for the zero-shot setting, models are directly evaluated on the questions without being provided any prior examples of the specific tasks, which evaluates whether multi-modal LLMs can reason and solve problems without explicit training. More specifically, as shown in Figure 2, the consistency evaluation is carried out in three steps for each task in the benchmark. In the first step, we ask human annotators to provide prompts for each task in the format {prompts, task_name}. This allows us to evaluate the model's ability to comprehend complex tasks and identify the essential prompts needed for successful task assignment. In the second step, we leverage the strong language generation capacity of LLMs to produce descriptions with the same semantic meaning but different expressions, enabling a comprehensive evaluation of whether LLMs understand the intentions of a broader range of users. Finally, we use crowd-sourced human evaluation via Amazon Mechanical Turk, where AudioGPT is prompted with these natural language descriptions corresponding to a variety of tasks and intentions. Human raters are shown a prompt input and the response of the multi-modal LLM, and asked "Does the response closely align with human cognition and intention faithfully?". They must respond with "completely", "mostly", or "somewhat" on a 20-100 Likert scale, and scores are documented with 95% confidence intervals (CI).
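As a concrete illustration of the scoring step, a small sketch that aggregates 20-100 Likert ratings into a mean with a 95% confidence interval. The normal-approximation interval (1.96 times the standard error) is an assumption; the text does not state how the interval is computed.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple

def likert_summary(ratings: List[float]) -> Tuple[float, float]:
    """Aggregate 20-100 Likert ratings into (mean, half-width of a 95% CI).

    Uses a normal approximation (1.96 * standard error); the exact interval
    construction is not specified in the paper, so this is an assumption."""
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return m, half_width

# Example ratings already mapped onto the 20-100 scale by the raters.
scores = [80, 100, 80, 60, 100, 80, 100, 80]
avg, ci = likert_summary(scores)
print(f"Consistency score: {avg:.1f} +/- {ci:.1f} (95% CI)")
```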


Capability

As the task executors for processing complex audio information, audio foundation models have a significant impact on handling complex downstream tasks. Taking AudioGPT as an example, we report evaluation metrics and downstream datasets for understanding and generating speech, music, sound, and talking head in Table 3.


Robustness

We evaluate the robustness of multi-modal LLMs by assessing their ability to handle special cases. These cases can be classified into the following categories:

• Long chains of evaluation: multi-modal LLMs are expected to handle long chains of evaluation while considering short and long context dependencies in multi-modal generation and reuse. A chain of tasks can be presented either as a single query that requires sequential application of candidate audio models, as consecutive queries that ask for different tasks, or as a mixture of the two.
• Unsupported tasks: multi-modal LLMs should be able to provide reasonable feedback to queries that require unsupported tasks not covered by the foundation models.
• Error handling of multi-modal models: multi-modal foundation models can fail for different reasons, such as unsupported arguments or unsupported input formats. In such scenarios, multi-modal LLMs need to provide reasonable feedback that explains the encountered issue and suggests potential solutions.
• Breaks in context: multi-modal LLMs are expected to process queries that do not arrive in a logical sequence. For instance, a user may submit random queries in the middle of a query sequence but then continue previous queries with further tasks.

To evaluate robustness, we conduct a three-step subjective user rating process similar to the steps discussed in Sec. 4.2. In the first step, human annotators provide prompts based on the above four categories. In the second step, the prompts are fed into the LLM to formulate complete interaction sessions. Finally, a separate set of human raters rates each interaction on the same 20-100 scale described in Sec. 4.2.
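A hedged sketch of how the four special-case categories might be organized as a bank of test prompts for the first step of the rating process. The example prompts are purely illustrative and are not taken from the paper's annotation set.

```python
# Illustrative robustness test bank, grouped by the four special-case categories above.
ROBUSTNESS_CASES = {
    "long_chain": [
        "Transcribe this recording, translate it to English, then synthesize it as speech.",
    ],
    "unsupported_task": [
        "Compose a full symphony score in MusicXML and have it performed live.",
    ],
    "error_handling": [
        "Generate speech from this 3-hour audio file in an unsupported format.",
    ],
    "break_in_context": [
        "Ignore my previous request. What's the weather? Actually, continue the earlier separation task.",
    ],
}

def iter_test_sessions():
    """Yield (category, prompt) pairs to feed into the LLM to form complete interaction sessions."""
    for category, prompts in ROBUSTNESS_CASES.items():
        for prompt in prompts:
            yield category, prompt
```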



Experimental Setup

In our experiments, we employ gpt-3.5-turbo as the large language model and guide the LLM with LangChain (Chase, 2022). The deployment of the audio foundation models requires only an NVIDIA T4 GPU on a Hugging Face Space. We use a temperature of zero to generate output with greedy search and set the maximum number of tokens for generation to 2048. The current manuscript mainly covers the system description, and the experiments are designed more for demonstration.
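A minimal sketch of what this configuration might look like with a 2023-era LangChain plus OpenAI setup. The class names, agent string, and tool wiring are assumptions about that API generation, not the released AudioGPT code; the TTS tool is a stand-in.

```python
# Assumed 2023-era LangChain API (langchain 0.0.x); pip install langchain openai
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.memory import ConversationBufferMemory

# Greedy decoding (temperature 0) with a 2048-token generation budget, matching the setup above.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=2048)

# Each audio foundation model would be exposed to the dialogue engine as a tool (placeholder example).
tools = [
    Tool(name="TextToSpeech",
         func=lambda text: f"[generated audio for: {text}]",  # stand-in for a real TTS model
         description="Synthesize speech audio from input text."),
]

agent = initialize_agent(
    tools,
    llm,
    agent="conversational-react-description",  # assumed agent type for a tool-using chat system
    memory=ConversationBufferMemory(memory_key="chat_history"),
    verbose=True,
)
```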


Case Study on Multi-Round Dialogue

Figure 3 shows a 12-round dialogue case of AudioGPT, which demonstrates the capabilities of AudioGPT for processing the audio modality, covering a series of AI tasks in generating and understanding speech, music, sound, and talking head. The dialogue involves multiple requests to process audio information and shows that AudioGPT maintains the context of the current conversation, handles follow-up questions, and interacts with users actively.


Case Study on Simple Tasks

AudioGPT equips ChatGPT with audio foundation models, where ChatGPT is regarded as the general-purpose interface for solving numerous audio understanding and generation tasks. We test AudioGPT on a wide range of audio tasks in generating and understanding speech, music, sound, and talking head, with some cases illustrated in Figures 4 and 5.


Limitation

Although AudioGPT excels at solving complex audio-related AI tasks, several limitations can be observed in this system: 1) Prompt engineering: AudioGPT uses ChatGPT to connect a large number of foundation models, and thus requires prompt engineering to describe the audio foundation models in natural language, which can be time-consuming and requires expertise; 2) Length limitation: the maximum token length in ChatGPT may limit multi-turn dialogue, which also constrains the user's context instructions; and 3) Capability limitation: AudioGPT relies heavily on audio foundation models to process audio information, so its performance is heavily influenced by the accuracy and effectiveness of these models.


Conclusion

In this work, we presented AudioGPT, which connects ChatGPT with 1) audio foundation models to handle challenging audio tasks, and 2) a modality transformation interface to enable spoken dialogue. By combining the advantages of ChatGPT and audio-modality solvers, AudioGPT shows strong capabilities in processing audio information across four stages: modality transformation, task analysis, model assignment, and response generation. To assess the ability of multi-modal LLMs in human intention understanding and cooperation with foundation models, we outlined the design principles and processes, and evaluated AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrated the strong abilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, empowering humans to create rich and diverse audio content with unprecedented ease. The current manuscript mainly covers the system description, and the experiments are designed more for demonstration.
