Fun-Audio-Chat by FunAudioLLM

Advanced Audio LLM for natural, low-latency voice interactions

Created 2 weeks ago

641 stars

Top 51.9% on SourcePulse

Project Summary

Fun-Audio-Chat is a Large Audio Language Model designed for natural, low-latency voice interactions. It addresses the computational demands of audio LLMs by introducing an efficient dual-resolution speech representation, enabling significant compute reduction while preserving high speech quality. This project benefits researchers and developers by offering state-of-the-art performance across various spoken language tasks, including QA, understanding, function calling, and instruction following.

How It Works

The core innovation is a Dual-Resolution Speech Representation: an efficient 5Hz shared backbone combined with a 25Hz refined head. According to the project, this reduces GPU hours by nearly 50% compared to standard 12.5Hz or 25Hz models without sacrificing speech quality. Core-Cocktail training is also used to preserve the capabilities of the underlying text LLM, and the project reports top-tier results on demanding audio benchmarks.
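
To make the frame-rate figures concrete, the following is a minimal arithmetic sketch and not the project's code. It assumes that the 5Hz backbone sees one frame per group of five consecutive 25Hz tokens; that grouping scheme, the function names `frames` and `group_tokens`, and the 10-second clip are illustrative assumptions. Note that a shorter backbone sequence does not translate directly into the reported GPU-hour savings, which is the project's own measurement.

```python
# Illustrative arithmetic only. The grouping scheme below is an assumption
# made for this sketch, not the project's actual implementation.

def frames(duration_s: float, rate_hz: float) -> int:
    """Number of frames produced for `duration_s` seconds of audio at `rate_hz`."""
    return round(duration_s * rate_hz)

def group_tokens(tokens: list[int], factor: int) -> list[list[int]]:
    """Split a fine-grained token sequence into consecutive groups of `factor`."""
    return [tokens[i:i + factor] for i in range(0, len(tokens), factor)]

HEAD_RATE_HZ = 25.0      # refined head rate, from the summary
BACKBONE_RATE_HZ = 5.0   # shared backbone rate, from the summary
GROUP_FACTOR = int(HEAD_RATE_HZ / BACKBONE_RATE_HZ)  # assumed grouping: 5 tokens per frame

duration_s = 10.0  # hypothetical clip length
head_frames = frames(duration_s, HEAD_RATE_HZ)
backbone_frames = frames(duration_s, BACKBONE_RATE_HZ)
baseline_frames = frames(duration_s, 12.5)  # a standard 12.5Hz model, for comparison

# Grouping the 25Hz tokens yields exactly one group per 5Hz backbone frame.
fine_tokens = list(range(head_frames))  # placeholder token ids
groups = group_tokens(fine_tokens, GROUP_FACTOR)
assert len(groups) == backbone_frames

print(f"25Hz head frames:     {head_frames}")
print(f"12.5Hz baseline:      {baseline_frames}")
print(f"5Hz backbone frames:  {backbone_frames}")
```

Under these assumptions, the backbone processes one fifth as many positions as a 25Hz model and well under half as many as a 12.5Hz model for the same clip, while the head still operates at the finer 25Hz resolution.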

Quick Start & Requirements

Highlighted Details

  • Efficiency: Dual-Resolution Speech Representations with a 5Hz frame rate reduce compute by approximately 50%.
  • Performance: Ranks top among ~8B parameter models on benchmarks like OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle.
  • Capabilities: Supports spoken question answering, audio understanding, speech function calling, speech instruction-following, and voice empathy.

Maintenance & Community

The project is developed by the "Tongyi Fun Team". Community interaction takes place through GitHub Issues, Pull Requests, and email. An official DingTalk (Dingding) chat group is also available for support.

Licensing & Compatibility

Fun-Audio-Chat is licensed under the Apache License (Version 2.0). The project notes that it contains third-party components under other open-source licenses, with details available in the NOTICE file. The Apache 2.0 license is generally permissive for commercial use.

Limitations & Caveats

The provided README does not explicitly detail limitations such as alpha status, known bugs, or unsupported platforms. The release appears to be based on a technical report, suggesting a research-oriented focus.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 33
  • Star History: 648 stars in the last 19 days
