- Awesome Codec, TTS & Speech LM
- Music Generation
- Some Interesting Models
- Speech DataSet
- Some Interesting knowledge
- Reference
- Acoustic Tokens: Acoustic tokens focus on speech compression and reconstruction and rely on encoder-decoder architectures with residual vector quantization (RVQ). An encoder downsamples the raw waveform into speech features, the quantizer maps those features to a series of discrete tokens, and a decoder upsamples the tokens back into speech; training minimizes a reconstruction loss against the original signal. This yields discrete acoustic tokens with high compression rates and high-fidelity acoustic information, which makes them well suited to tasks such as speech synthesis and emotion analysis. (Requirements: maintain reconstruction ability at a low bitrate.)
- Semantic Tokens: Semantic tokens are obtained by applying a clustering algorithm such as K-means to features extracted from self-supervised learning (SSL) models and using the cluster indices as discrete representations. The underlying SSL models are prediction-based: they learn representations either by predicting future frames autoregressively or by predicting masked frames from the surrounding frames. This approach tends to prioritize the linguistic information in speech, which makes it particularly useful for recognition and understanding tasks.
- Speech Large Language Models: These models are trained on top of speech and acoustic tokens with a language-modeling approach. They perform well on both speech understanding and speech generation tasks. (From speech-trident)
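
The residual vector quantization step described under Acoustic Tokens can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any codec listed here: the two toy codebooks, the function names (`nearest`, `rvq_encode`, `rvq_decode`, `bitrate_bps`) and the example frame are all hypothetical, and real codecs learn their codebooks jointly with the encoder and decoder.

```python
import math


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage codes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Whatever this stage failed to capture is passed to the next stage.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(codebooks, tokens):
    """Sum the selected entry of every stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second = frames/s * tokens/frame * bits/token."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


# Hypothetical hand-written codebooks: a coarse stage and a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [0.9, 1.2]                      # one encoder output frame (made up)
tokens = rvq_encode(CODEBOOKS, frame)   # [1, 2]
recon = rvq_decode(CODEBOOKS, tokens)   # [1.0, 1.25]
print(tokens, recon)
print(bitrate_bps(75, 8, 1024))         # 6000.0 bits per second
```

The same nearest-entry lookup, applied to SSL features with K-means centroids in place of a learned codebook, gives the cluster indices used as semantic tokens. The bitrate helper shows why fewer codebooks or a lower frame rate translate directly into a lower bitrate.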
- [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] ✔️
- [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
- [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] ✔️
- [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] ✔️
- [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ ✔️
- [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] ✔️
- [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC ✔️
- [2023/09] Soundstorm: Efficient parallel audio generation [paper][demo]
- [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec ✔️
- [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] ✔️
- [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec ✔️
- [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] ✔️
- [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] ✔️
- [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] ✔️
- [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco ✔️
- [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] ✔️
- [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] ✔️
- [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] ✔️
- [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
- [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder ✔️
- [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
- [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
- [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
- [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] ✔️
- [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse ✔️
- [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec ✔️
- [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] ✔️
- [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] ✔️
- [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] ✔️
- [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
- [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] ✔️
- [2024/07] dMel: Speech Tokenization made Simple [paper] Code Coming Soon
- [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] ✔️
- [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] ✔️
- [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] ✔️
- [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code Coming Soon
- [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
- [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space ✔️
- [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] ✔️
- [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec ✔️
- [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] ✔️
- [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
- [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code Coming Soon
- [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking ✔️
- [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec ✔️
- [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform ✔️
- [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
- [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer ✔️
- [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
- [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec ✔️
- [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] finetuned-version of DAC ✔️
- [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] ✔️
- [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation ✔️
- [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
- [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation ✔️
- [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
- [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] address codebook collapse based on intra- and inter-codebook optimization
- [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues ✔️
- [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decoupling
- [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
- [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
- [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
- [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
- [2024/10] Pushing the frontiers of audio generation [blog] Google DeepMind
- [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] modified discrete cosine transform (MDCT) as input
- [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse ✔️
- [2024/11] hertz-dev [code] WaveCodec ✔️
- [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
- [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code Coming Soon | proposing several codec-LM co-design strategies
- [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper] integrates the Voice Changer model directly into the speech Codec
- [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free ✔️
- [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code Coming Soon | Music Tokenizer, Similar to MsCodec
- [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code Coming Soon | transformer-based, scaled to the 1B-parameter range
- [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
- [2023/09] Voiceflow: Efficient text-to-speech with rectified flow matching [paper][code][demo] ✔️
- [2023/09] Voicebox: Text-guided multilingual universal speech generation at scale [paper][demo]
- [2023/09] Matcha-tts: A fast tts architecture with conditional flow matching [paper][code][demo] ✔️
- [2023/09] PromptTTS 2: Describing and Generating Voices with Text Prompt [paper][code][demo] ✔️
- [2023/01] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [paper][code][demo] VALL-E ✔️
- [2024/03] VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild [paper][code][demo] ✔️
- [2024/01] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [paper][demo]
- [2024/03] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [paper][demo]
- [2024/01] Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis [paper][demo]
- [2024/03] HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling [paper][demo]
- [2024/04] TextrolSpeech: A Text Style Control Speech Corpus with Codec Language Text-to-Speech Models [paper][code][demo] Code Coming Soon
- [2024/04] StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations [paper][code][demo] Lian Liru(连丽如) dataset ✔️
- [2024/04] SpeechAlign: Aligning Speech Generation to Human Preferences [paper][code][demo] Human Feedback ✔️
- [2024/06] Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback [paper] Human Feedback
- [2024/06] Seed-TTS: A Family of High-Quality Versatile Speech Generation Models [paper][demo]
- [2024/06] WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark [paper][demo]
- [2024/02] Natural language guidance of high-fidelity text-to-speech with synthetic annotations [paper][code][demo] Prompt Control | Parler-TTS ✔️
- [2023/02] Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision [paper][code][demo] SpearTTS | WhisperSpeech ✔️
- [2024/06] High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model [paper][demo] Transducer/End-to-End
- [2024/01] VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech [paper][code][demo] Code Coming Soon | Transducer
- [2024/01] Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [paper][demo] Transducer/End-to-End
- [2024/06] Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [paper][demo] Monotonic Alignment
- [2024/01] EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine [code] ✔️
- [2024/07] Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [paper][demo] Spontaneous
- [2024/08] EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech [paper] LoRA
- [2024/08] StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [paper][demo] LoRA
- [2024/08] VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech [paper][demo] LoRA
- [2024/08] SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [paper][demo] SSL
- [2024/06] ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [paper][code][demo] ✔️
- [2024/06] XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model [paper][code][demo] ✔️
- [2024/06] VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers [paper][demo]
- [2024/06] Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [paper][demo]
- [2024/06] VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [paper][demo]
- [2024/06] DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [paper][demo]
- [2024/01] CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech [paper][demo]
- [2024/06] TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers [paper][code][demo] ✔️
- [2023/11] HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [paper][code][demo] ✔️
- [2024/06] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS [paper][demo] similar to Seed-TTS
- [2024/07] Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization [paper][demo] Human Feedback
- [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper] [code][demo] ✔️
- [2024/04] FlashSpeech: Efficient Zero-Shot Speech Synthesis [paper][code][demo] Code Coming Soon
- [2024/08] Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation [paper][demo]
- [2024/08] VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling [paper][code][demo] ✔️
- [2024/09] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer [paper][code][demo] Masked Generative Model | Similar to Seed-TTS ✔️
- [2024/09] FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications [paper][code][demo] voice cloning for dubbing and human-like speech generation for chatbots ✔️
- [2024/09] Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models [paper][demo]
- [2024/10] F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching [paper][code][demo] ✔️
- [2023/05] Better speech synthesis through scaling [paper][code][blog] Tortoise TTS ✔️
- [2024/10] SPIRIT LM: Interleaved Spoken and Written Language Model [paper][code][demo] ✔️
- [2024/10] STTATTS: Unified Speech-To-Text And Text-To-Speech Model [paper][code]
- [2024/11] Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis [paper][code] ✔️
- [2024/11] OuteTTS-0.1-350M [blog][code] ✔️
- [2024/11] Debatts: Zero-Shot Debating Text-to-Speech Synthesis [paper][demo] Debating TTS & Dataset
- [2024/11] Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [paper] Code Coming Soon | Text & Video to Speech
- WavChat classifies spoken dialogue models by whether the core language model can directly understand and generate speech representations, dividing them into cascaded and end-to-end categories.
- [2024/04] CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations [paper][code][demo] multi-round dialogue speech generation ✔️
- [2024/08] Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [paper][code][demo] Code Coming Soon
- [2024/08] DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance [paper][demo]
- [2024/03] WavLLM: Towards Robust and Adaptive Speech Large Language Model [paper][code] ✔️
- [2024/02] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [paper][code][demo] ✔️
- [2024/02] Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities [paper][code][demo] ✔️
- [2024/07] Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation [paper][code][demo] Code Coming Soon | speech interaction model
- [2024/06] GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities [paper][code][demo] ✔️
- [2024/07] Generative Expressive Conversational Speech Synthesis [paper][code] GPT-Talker | GPT for response and Style, VITS for audio ✔️
- [2023/05] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [paper][code][demo] ✔️
- [2024/01] SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [paper][demo] Code Coming Soon
- [????/??] SpeechGPT2: End-to-End Human-Like Spoken Chatbot [paper][code][demo] Paper & Code Coming Soon | speech interaction model
- [2024/08] Language Model Can Listen While Speaking [paper][demo] Full Duplex Modeling | speech interaction model
- [2024/08] Speech To Speech: an effort for an open-sourced and modular GPT4-o [code] End-to-End | speech interaction model ✔️
- [2024/08] Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming [paper][code] End-to-End | speech interaction model ✔️
- [2024/09] EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [paper][demo]
- [2024/09] LLaMA-Omni: Seamless Speech Interaction with Large Language Models [paper][code][demo] English only ✔️
- [2024/09] Moshi: a speech-text foundation model for real time dialogue [paper][code][demo] low delay | English only ✔️
- [2024/09] Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control [paper][demo]
- [2024/09] Westlake-Omni: Open-Source Chinese Emotional Speech Interaction Large Language Model with Unified Discrete Sequence Modeling [code] ✔️
- [2024/10] OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [paper][demo] Code Coming Soon
- [2024/10] Recent Advances in Speech Language Models: A Survey [paper] speech interaction model: survey
- [2024/10] IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [paper][demo] reducing the length difference between speech and text
- [2024/10] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [paper][code] ✔️
- [2024/10] GLM-4-Voice [code] speech interaction model & emotion, intonation, speech rate, and dialect & low latency ✔️
- [2024/11] Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM [paper][demo][code] LLM kept frozen during training ✔️
- [2024/11] hertz-dev [code] ✔️
- [2024/11] Building a Taiwanese Mandarin Spoken Language Model: A First Attempt [paper][code] Code Coming Soon
- [2024/11] SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation [paper] Code Coming Soon | codec-free
- [2024/11] MooER: Moore-threads Open Omni model for speech-to-speech intERaction [code] Paper Coming Soon
- [2023/10] SALMONN: Towards Generic Hearing Abilities for Large Language Models [paper][code] ✔️
- [2023/11] Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [paper][code] ✔️
- [2024/07] Qwen2-Audio Technical Report [paper][code] ✔️
- [2024/08] VITA: Towards Open-Source Interactive Omni Multimodal LLM [paper][code][demo] ✔️
- [2024/10] Ocean-omni: To Understand the World with Omni-modality [paper][code] Baichuan-Omni ✔️
- [2024/10] Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant [paper][code] ✔️
- [2024/11] A fast multimodal LLM for real-time voice [blog][code][demo] Ultravox ✔️
- [2022/10] Flow Matching for Generative Modeling [paper] Conditional Flow Matching
- [2023/02] Improving and generalizing flow-based generative models with minibatch optimal transport [paper][code] TorchCFM | Tutorials ✔️
- [2024/05] EmoLLM (mental-health large language model) [code][demo] ✔️
- [2024/07] Stable Audio Open [paper] [code] ✔️
- [2024/11] LLaMA-O1: Open Large Reasoning Model Frameworks For Training, Inference and Evaluation With PyTorch and HuggingFace [code] ✔️
- [2024/11] O1 Replication Journey: A Strategic Progress Report -- Part 1 [paper][code] ✔️
- [2024/11] FLowHigh: Towards efficient and high-quality audio super-resolution with single-step flow matching [code][demo] ✔️
- [2024/09] SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis [paper][code][demo] ✔️
- [2023/06] Simple and Controllable Music Generation [paper][code] Prompt Control | AudioCraft ✔️
- [2024/05] Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [paper][code][demo] Instruction Tuning ✔️
- [2024/05] QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation [paper][code][demo] ✔️
- [2024/09] Seed-Music: A Unified Framework for High Quality and Controlled Music Generation [paper][demo] tech-report
- [2024/09] FLUX that Plays Music [paper][code][melodio] KunLun ✔️
- [2024/10] MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [paper] Code Coming Soon | Similar to MaskGCT
- [2024/07] Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation [paper][code][demo][dataset] ✔️
- [2024/06] WenetSpeech4TTS: A 12,800-hour Mandarin TTS corpus for large speech generation model benchmark [paper][demo][dataset] ✔️
- [2020/10] Didispeech: A large scale Mandarin speech corpus [paper][code][demo][dataset]
- Anthropic courses [github]
- LLM101n: Let's build a Storyteller [github]
- Build a Large Language Model (From Scratch) [github]
- build nanoGPT from Karpathy [github]
GitHub
- ChatTTS: https://github.com/2noise/ChatTTS/tree/main
- OpenVoice: https://github.com/myshell-ai/OpenVoice
- GPT-SoVITS: https://github.com/RVC-Boss/GPT-SoVITS
- Bert-vits2-NoBug: https://github.com/ywh-my/Bert-vits2-NoBug
- VoiceCraft: https://github.com/jasonppy/VoiceCraft
- YourTTS: https://github.com/Edresson/YourTTS
- Coqui: https://github.com/coqui-ai/TTS
- ebook2audiobookXTTS: https://github.com/DrewThomasson/ebook2audiobookXTTS
- MARS5-TTS: https://github.com/Camb-ai/MARS5-TTS
- edge-tts: https://github.com/rany2/edge-tts
- metavoice-src: https://github.com/metavoiceio/metavoice-src
- StyleTTS2: https://github.com/yl4579/StyleTTS2
- open-tts-tracker: https://github.com/Vaibhavs10/open-tts-tracker
- Amphion: https://github.com/open-mmlab/Amphion
- CTranslate2: https://github.com/OpenNMT/CTranslate2
- CFM: https://github.com/atong01/conditional-flow-matching
- speech-trident: https://github.com/ga642381/speech-trident
- bark: https://github.com/suno-ai/bark
- LangGPT: https://github.com/langgptai/LangGPT (prompt engineering)
- composio: https://github.com/ComposioHQ/composio
- torchdiffeq: https://github.com/rtqichen/torchdiffeq
- podlm: https://github.com/lihuithe/podlm-public (a NotebookLM alternative)
- NotebookLlama: https://github.com/meta-llama/llama-recipes/recipes/quickstart/NotebookLlama (similar to NotebookLM)
- playnote: https://play.ai/playnote (similar to NotebookLM)
- dify: https://github.com/langgenius/dify (open-source LLM application development platform)
- Awesome-Dify-Workflow: https://github.com/svcvit/Awesome-Dify-Workflow
- LiblibAI: https://www.liblib.art (AI creation platform)
Nice Tool
- pytorch-OpCounter: https://github.com/Lyken17/pytorch-OpCounter
- rich: https://github.com/Textualize/rich
- argbind: https://github.com/pseeth/argbind/
- audiotools: https://github.com/descriptinc/audiotools
- hydra: https://github.com/facebookresearch/hydra
- joblib: https://github.com/joblib/joblib
- einops: https://github.com/arogozhnikov/einops
- safetensors: https://github.com/huggingface/safetensors
- OpenDiloco: https://github.com/PrimeIntellect-ai/OpenDiloco
- WeTextProcessing: https://github.com/wenet-e2e/WeTextProcessing
- zed: https://github.com/zed-industries/zed
- weekly: https://github.com/ljinkai/weekly
- tinygrad: https://github.com/tinygrad/tinygrad
- ffmpeg-normalize: https://github.com/slhck/ffmpeg-normalize
- kohya_ss: https://github.com/bmaltais/kohya_ss
- Lora-Training-in-Comfy: https://github.com/LarryJane491/Lora-Training-in-Comfy
- ComfyUI-Manager: https://github.com/ltdrdata/ComfyUI-Manager
- ComfyUI: https://github.com/comfyanonymous/ComfyUI
- comfyui-workspace-manager: https://github.com/11cafe/comfyui-workspace-manager
- CosyVoice+ComfyUI: https://github.com/AIFSH/CosyVoice-ComfyUI
- ComfyUI-wiki: https://github.com/602387193c/ComfyUI-wiki
- ZHO: https://github.com/ZHO-ZHO-ZHO
- tmux: https://github.com/tmux/tmux
- LoRAlib: https://github.com/microsoft/LoRA
- codespaces: https://github.com/codespaces
- Foliate(PDF): https://johnfactotum.github.io/foliate/
- Okular(PDF): https://okular.kde.org/zh-cn/
- audioFlux: https://github.com/libAudioFlux/audioFlux
- PyWavelets: https://github.com/PyWavelets/pywt
- Agent and workflow platforms: https://ai-bot.cn/ai-agent-development-platform/
- open-webui: https://github.com/open-webui/open-webui
- qwen-2.5-code-interpreter: https://github.com/cfahlgren1/qwen-2.5-code-interpreter
- ollama: https://github.com/ollama/ollama; https://ollama.com/
- vllm: https://github.com/vllm-project/vllm
- anythingLLM: https://github.com/Mintplex-Labs/anything-llm
- Windsurf: https://codeium.com/windsurf
- cursor: https://www.cursor.com/
- docling: https://github.com/DS4SD/docling
- Don't panic! A one-article guide to the speech technology behind GPT-4o
- Audio codecs in full bloom: a powerful tool for speech synthesis
- InterSpeech2024 Speech Processing Using Discrete Speech Units : https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge/ : https://huggingface.co/discrete-speech : arxiv 2024 : [paper]
- Codec-SUPERB: An In-Depth Analysis of Sound Codec Models : https://github.com/voidful/Codec-SUPERB
- EMO-Codec: A Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations
- speech-trident : Awesome speech/audio LLMs, representation learning, and codec models
- Awesome-Speech-Language-Model : Paper, Code and Resources for Speech Language Model and End2End Speech Dialogue System
- Towards audio language modeling -- an overview : arXiv 2402.13236
- A Survey on Speech Large Language Models : arXiv 2410.18908
- Recent Advances in Speech Language Models: A Survey : arXiv 2410.03751
- WavChat: A Survey of Spoken Dialogue Models : arXiv 2411.13577