Fun-Audio-Chat by FunAudioLLM

Advanced Audio LLM for natural, low-latency voice interactions

Created 2 months ago

859 stars

Top 41.5% on SourcePulse

Project Summary

Fun-Audio-Chat is a Large Audio Language Model designed for natural, low-latency voice interactions. It addresses the computational cost of audio LLMs with an efficient dual-resolution speech representation, which substantially reduces compute while preserving speech quality. It is aimed at researchers and developers who need strong performance on spoken-language tasks, including spoken QA, audio understanding, function calling, and instruction following.

How It Works

The core idea is Dual-Resolution Speech Representations: an efficient 5Hz shared backbone paired with a 25Hz refined head. According to the project, this reduces GPU hours by nearly 50% compared to standard 12.5Hz or 25Hz models without sacrificing speech quality. Core-Cocktail training is used to preserve the capabilities of the underlying text LLM, and the project reports top-tier results on demanding audio benchmarks.
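The frame-rate relationship described above can be made concrete with a little arithmetic. The following is a toy sketch only, assuming that one 5Hz backbone position lines up with five consecutive 25Hz tokens; the helper names `frames` and `group_tokens` are hypothetical, and the model's actual grouping and refinement logic is not described in this summary.

```python
"""Toy illustration of dual-resolution frame rates (not the model's actual code)."""

BACKBONE_HZ = 5   # frame rate of the shared backbone, from the text above
HEAD_HZ = 25      # frame rate of the refined head, from the text above


def frames(duration_s: float, rate_hz: float) -> int:
    """Number of frames needed to cover `duration_s` seconds at `rate_hz`."""
    return int(round(duration_s * rate_hz))


def group_tokens(tokens: list[int], group_size: int) -> list[list[int]]:
    """Split a high-rate token sequence into consecutive fixed-size groups.

    Assumption: one backbone position corresponds to HEAD_HZ // BACKBONE_HZ
    consecutive high-rate tokens. How a real model merges or expands each
    group is outside the scope of this sketch.
    """
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]


duration_s = 10.0
head_frames = frames(duration_s, HEAD_HZ)          # 250 positions at 25Hz
backbone_frames = frames(duration_s, BACKBONE_HZ)  # 50 positions at 5Hz
group_size = HEAD_HZ // BACKBONE_HZ                # 5 tokens per backbone step

tokens = list(range(head_frames))  # placeholder token ids
groups = group_tokens(tokens, group_size)

print(head_frames, backbone_frames, len(groups))  # 250 50 50
```

For ten seconds of audio the backbone sees 50 positions instead of 250. A shorter backbone sequence does not map directly onto the reported GPU-hour saving, which also depends on the head and the rest of the training setup.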

Quick Start & Requirements

Primary Install:
- Clone the repository with submodules (git clone --recurse-submodules).
- Activate a Python 3.12 environment.
- Install PyTorch 2.8.0 (with CUDA 12.8 support).
- Run pip install -r requirements.txt.
- Install ffmpeg, which is also a prerequisite.
Prerequisites: Python 3.12, PyTorch 2.8.0, ffmpeg. GPU Memory: ~24GB for inference, 4x80GB for training.
Setup: Download pretrained models using huggingface-hub or modelscope.
Links:
- arXiv: https://arxiv.org/pdf/2512.20156
- HuggingFace: https://huggingface.co/FunAudioLLM/Fun-Audio-Chat-8B
- ModelScope: https://modelscope.cn/FunAudioLLM/Fun-Audio-Chat-8B
- Demo Page: https://funaudiollm.github.io/funaudiochat
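As a minimal sketch of the setup described above, the script below checks the listed prerequisites and shows one way to fetch the weights with the `huggingface_hub` package. The function names `check_prerequisites` and `download_model` and the target directory are hypothetical and not taken from the project; the repository's own instructions may list additional models or steps.

```python
import shutil
import sys
from importlib import metadata

REPO_ID = "FunAudioLLM/Fun-Audio-Chat-8B"  # from the HuggingFace link above


def check_prerequisites() -> dict[str, bool]:
    """Report which of the listed prerequisites look satisfied locally."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = ""
    return {
        "python_3_12": sys.version_info[:2] == (3, 12),
        # Installed builds may carry a suffix such as "+cu128".
        "torch_2_8_0": torch_version.startswith("2.8.0"),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


def download_model(local_dir: str, repo_id: str = REPO_ID) -> str:
    """Download a model snapshot with huggingface_hub and return its path."""
    # Imported lazily so the prerequisite check runs without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing or different version'}")

# Uncomment to fetch the weights (a large download; the directory name
# below is a hypothetical choice, not a path required by the project):
# download_model("pretrained_models/Fun-Audio-Chat-8B")
```

The check only reports what it finds; it does not verify the CUDA build or available GPU memory, both of which still need to match the requirements listed above.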

Highlighted Details

Efficiency: Dual-Resolution Speech Representations, with a 5Hz shared backbone and a 25Hz refined head, reduce GPU hours by nearly 50%.
Performance: Ranks at the top among models of roughly 8B parameters on benchmarks including OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle.
Capabilities: Supports spoken question answering, audio understanding, speech function calling, speech instruction-following, and voice empathy.

Maintenance & Community

The project is developed by the Tongyi Fun Team. Community interaction takes place through GitHub Issues, Pull Requests, and email. An official DingTalk chat group is also available for support.

Licensing & Compatibility

Fun-Audio-Chat is licensed under the Apache License, Version 2.0. The project notes that it contains third-party components under other open-source licenses, with details available in the NOTICE file. Apache 2.0 permits commercial use, provided its license and notice requirements are met.

Limitations & Caveats

The provided README does not explicitly detail limitations such as alpha status, known bugs, or unsupported platforms. The release appears to be based on a technical report, suggesting a research-oriented focus.
