
Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

[SLT2024] [arXiv] [Demo]


The official GitHub repository for the paper "Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation".

  • Authors: Chun-Yi Kuan🐤, Chih-Kai Yang🐤, Wei-Ping Huang🐦, Ke-Han Lu🐦, Hung-yi Lee
  • 🐤 🐦: Equal Contribution
  • Affiliation: National Taiwan University
  • Accepted to SLT2024
  • arXiv Link: https://arxiv.org/abs/2407.09886
  • Refer to our demo website for more examples. 👍

Overview ⭐

[Figure: overview of the Speech-Copilot framework]

Abstract 🔥

In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.

Evaluation Results 🏆

[Figure: overall performance comparison on Dynamic-SUPERB]

[Figure: detailed evaluation results]

Toolset Construction 🔧

Step 1: User Instruction

  • (1) Collect a pool of user instructions, as in the illustrative example below.
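
For concreteness, a pool is simply a collection of natural-language task instructions. The entries below are illustrative examples, not the repository's actual pre-collected set (the second is adapted from the Dynamic-SUPERB task used in Step 4):

```python
# An illustrative instruction pool; entries are assumed examples.
instruction_pool = [
    "Transcribe the spoken words in the audio.",
    "Recognize and organize the emotional expressions in the spoken words.",
    "Determine whether the two utterances are spoken by the same speaker.",
]
```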

Step 2: Task Decomposition

  • (1) Break down all the user instructions in the given pool and identify the speech modules required for each instruction.
  • (2) Refer to task_decomposition.py for details; a hedged sketch of this step follows the list.
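
A minimal sketch of the decomposition step, assuming an OpenAI-style chat API. The prompt text, the helper name `decompose`, and the model choice are illustrative assumptions; the repository's actual prompt lives in task_decomposition.py.

```python
# Hedged sketch of task decomposition (assumes the `openai` client library
# and an OPENAI_API_KEY in the environment; prompt text is illustrative).
from openai import OpenAI

client = OpenAI()

DECOMPOSITION_PROMPT = """\
For each speech-processing instruction below, list the atomic speech
modules (e.g., speech recognition, emotion recognition, speaker
verification) required to solve it.

Instructions:
{instructions}
"""

def decompose(instruction_pool: list[str]) -> str:
    """Ask the LLM which speech modules each instruction requires."""
    prompt = DECOMPOSITION_PROMPT.format(
        instructions="\n".join(f"- {inst}" for inst in instruction_pool)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption, not the paper's
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```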

Step 3: Task Modularization

  • (1) After identifying the necessary speech modules, users can choose their preferred speech modules to serve as the backbone for these tasks.
  • (2) Examples are provided in prompt_templates/speech_module_generated_by_LLM.py; a simplified sketch of one such module is shown below.
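
As a concrete (assumed) example, an ASR module could be a thin wrapper over an off-the-shelf backbone such as Whisper. The function name and backbone below are illustrative choices; the LLM-generated module definitions used in the repository are in prompt_templates/speech_module_generated_by_LLM.py.

```python
# Hedged sketch of one speech module (assumes the `openai-whisper`
# package; the backbone and function name are illustrative choices).
import whisper

_asr_model = whisper.load_model("base")  # any Whisper size works here

def automatic_speech_recognition(audio_path: str) -> str:
    """Transcribe the speech in an audio file into text."""
    result = _asr_model.transcribe(audio_path)
    return result["text"].strip()

# Other modules (emotion recognition, speaker verification, ...) follow
# the same pattern: a plain function with a docstring the LLM can read.
```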

Step 4: Program Generation

  • (1) Use an LLM to analyze the given user instruction and generate a solution program based on the speech modules created in the previous steps.
  • (2) Refer to program_generation.py for guidance; a simplified sketch appears after this list.
  • (3) In program_generation.py, we provide an example task from Dynamic-SUPERB called EmotionRecognition_MultimodalEmotionlinesDataset with the instruction: Recognize and organize the emotional expressions in the spoken words. The answer could be anger, disgust, sadness, joy, neutral, surprise, or fear.
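
The sketch below shows the shape of this step, again assuming an OpenAI-style chat API: the module signatures from Step 3 are exposed to the LLM as text alongside the user instruction (such as the one from item (3) above), and a Python program is extracted from the reply. `MODULE_DOCS`, the prompt wording, and the code-extraction regex are illustrative; program_generation.py contains the actual prompts.

```python
# Hedged sketch of program generation (assumes the `openai` client;
# see program_generation.py for the repository's actual prompts).
import re
from openai import OpenAI

client = OpenAI()

# Signatures of the Step-3 modules, exposed to the LLM as plain text.
MODULE_DOCS = '''\
def automatic_speech_recognition(audio_path: str) -> str:
    """Transcribe the speech in an audio file into text."""

def emotion_recognition(audio_path: str) -> str:
    """Classify the emotion expressed in an audio file."""
'''

def generate_program(instruction: str) -> str:
    """Ask the LLM for a Python program that solves `instruction`
    by calling only the available speech modules."""
    prompt = (
        "Available speech modules:\n"
        f"{MODULE_DOCS}\n"
        f"User instruction: {instruction}\n"
        "Write a Python function solve(audio_path) that follows the "
        "instruction using these modules. Reply with code only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.choices[0].message.content
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply
```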

Step 5: Program Execution

  • (1) Execute the program generated in Step 4. Ensure the required speech modules from Step 3 are properly set up beforehand, so they can be utilized during execution; a minimal sketch is given below.
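
A minimal, self-contained sketch of the execution pattern: the generated code runs in a namespace that already holds the speech modules. The hard-coded program string and the stub module below are stand-ins for real outputs of Steps 3 and 4.

```python
# Hedged sketch of program execution; the program string and the stub
# module are stand-ins for real outputs of Steps 3 and 4.
def emotion_recognition(audio_path: str) -> str:
    """Stub speech module used only for this illustration."""
    return "neutral"

generated_program = """
def solve(audio_path):
    return emotion_recognition(audio_path)
"""

# Expose the speech modules to the generated code, then run it.
namespace = {"emotion_recognition": emotion_recognition}
exec(generated_program, namespace)        # defines solve() in `namespace`
print(namespace["solve"]("example.wav"))  # -> "neutral"
```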

Baseline Models 🌲

  • Qwen-Audio-Chat

    • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [arXiv]
  • SALMONN

    • SALMONN: Towards Generic Hearing Abilities for Large Language Models [arXiv]
  • LTU-AS

    • Joint Audio and Speech Understanding [arXiv]
  • WavLLM

    • WavLLM: Towards Robust and Adaptive Speech Large Language Model [arXiv]
  • Cascade (ASR + Audio Captioning Model + LLM)

    • ASR: Robust Speech Recognition via Large-Scale Weak Supervision [arXiv]
    • Audio Captioning Model: Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [arXiv]
    • LLM: gpt-3.5-turbo
