AudioGPT by AIGC-Audio

Audio processing and generation research project

Created 3 years ago

10,168 stars

Top 5.2% on SourcePulse

4 Experts Love This Project

chiphuyen

Author of "AI Engineering", "Designing Machine Learning Systems"

hammer

Jeff Hammerbacher

Cofounder of Cloudera

JustinLin610

Core Maintainer at Alibaba Qwen

ogabrielluiz

Gabriel Almeida

Cofounder of Langflow

Project Summary

AudioGPT is an open-source project that aims to provide a unified framework for understanding and generating various audio modalities, including speech, music, and sound effects, along with talking head synthesis. It targets researchers and developers working with multimodal AI, offering a comprehensive suite of tools for audio-centric AI applications.

How It Works

AudioGPT leverages a modular architecture, integrating multiple state-of-the-art foundation models for diverse audio tasks. It supports a wide range of capabilities, from text-to-speech and speech recognition to audio generation from text or images, and even sound event detection. The project's strength lies in its ability to combine these specialized models into a cohesive system, facilitating complex audio manipulation and generation workflows.
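The modular design described above can be pictured as a registry that routes each task to a specialized model. The following is a minimal sketch of that idea, not AudioGPT's actual code: `AudioTool`, `ToolRegistry`, and the placeholder handlers `fake_tts` and `fake_asr` are hypothetical names invented for illustration, and the handlers return strings instead of running real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class AudioTool:
    """One specialized model behind a uniform interface (hypothetical)."""
    name: str
    description: str
    run: Callable[[str], str]  # input: prompt or file path; output: text or file path


class ToolRegistry:
    """Maps task names to tools so a controller can route requests (hypothetical)."""

    def __init__(self) -> None:
        self._tools: Dict[str, AudioTool] = {}

    def register(self, tool: AudioTool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def dispatch(self, task: str, payload: str) -> str:
        """Run the single tool registered for `task` on `payload`."""
        if task not in self._tools:
            raise KeyError(f"no tool registered for task: {task}")
        return self._tools[task].run(payload)

    def chain(self, tasks: Iterable[str], payload: str) -> str:
        """Feed the output of each task into the next one."""
        for task in tasks:
            payload = self.dispatch(task, payload)
        return payload


# Placeholder handlers standing in for real models (assumptions, not real APIs).
def fake_tts(text: str) -> str:
    return f"speech({text}).wav"


def fake_asr(path: str) -> str:
    return f"transcript of {path}"


registry = ToolRegistry()
registry.register(AudioTool("text-to-speech", "Synthesize speech from text", fake_tts))
registry.register(AudioTool("speech-recognition", "Transcribe speech to text", fake_asr))

print(registry.dispatch("text-to-speech", "hello"))
print(registry.chain(["text-to-speech", "speech-recognition"], "hi"))
```

Keeping every tool behind the same `run` signature is what lets separate models be combined into multi-step workflows. A real system would also need a component that decides which task a user request maps to.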

Quick Start & Requirements

Installation and usage details are available in run.md.
Requires Python and deep learning libraries; exact versions are not listed here.
Refer to the repository for the full prerequisites and setup instructions.

Highlighted Details

Supports Text-to-Speech, Speech Recognition, Style Transfer, and Speech Enhancement.
Includes capabilities for Text-to-Audio, Audio Inpainting, and Image-to-Audio generation.
Features Sound Detection, Target Sound Detection, and Sound Extraction.
Offers Talking Head Synthesis capabilities.

Maintenance & Community

The project acknowledges contributions from ESPNet, NATSpeech, Visual ChatGPT, Hugging Face, LangChain, and Stable Diffusion.
More supported models and tasks are planned for future releases.

Licensing & Compatibility

The repository is provided as open source. This summary does not state a specific license, so check the repository's license terms before reusing code or models.

Limitations & Caveats

Several features are marked as "WIP" (work in progress), including Text-to-Speech, Speech Enhancement, Speech Separation, Speech Translation, and Text-to-Sing. Text-to-Speech and Speech Enhancement are listed as supported but still carry the WIP label.
Not all models have corresponding repositories linked.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

0

Star History

8 stars in the last 30 days

Explore Similar Projects

ControlSpeech by jishengpeng

Speech synthesis with simultaneous zero-shot speaker cloning and language style control

Created 2 years ago

Updated 1 year ago

SpeechGPT-2.0-preview by OpenMOSS

Real-time spoken dialogue system with GPT-4o-level capabilities

Created 1 year ago

Updated 1 year ago

ComfyUI-F5-TTS by niknah

Text-to-speech voice cloning and generation for ComfyUI

Created 1 year ago

Updated 2 months ago

awesome-ai-voice by wildminder

AI audio models for synthesis, generation, and understanding

Created 4 months ago

Updated 2 days ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral).

FastDiff by Rongjiehuang

PyTorch implementation for fast, high-fidelity speech synthesis via conditional diffusion

Created 4 years ago

Updated 2 years ago

ultimate-rvc by JackismyShephard

AI-powered audio generation and voice manipulation

Created 2 years ago

Updated 3 months ago

VibeVoice-finetuning by voicepowered-ai

Efficient LoRA finetuning for VibeVoice speech synthesis

Created 9 months ago

Updated 9 months ago

Lip2Wav by Rudrabha

Lip-to-speech synthesis for generating speech from lip movements

Created 6 years ago

Updated 3 years ago

Starred by

Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Audio generation research paper using latent diffusion

Created 3 years ago

Updated 1 year ago

Starred by

Yaowei Zheng (Author of LLaMA-Factory) and Omar Sanseviero (DevRel at Google DeepMind).

mini-omni by gpt-omni

Open-source multimodal LLM for real-time speech interaction

Created 1 year ago

Updated 1 year ago

Starred by

Luis Capelo (Cofounder of Lightning AI) and Didier Lopes (Founder of OpenBB).

Zonos by Zyphra

Open-weight text-to-speech model for expressive, high-quality speech generation

Created 1 year ago

Updated 1 year ago

Starred by

Alex Chen (Cofounder of Nexa AI), Amin Ahmad (Cofounder of Vectara), and 4 more.

csm by SesameAILabs

Speech generation model for conversational AI research

Created 1 year ago

Updated 1 year ago
