WavChat  by jishengpeng

Survey paper on spoken dialogue models

created 8 months ago
306 stars

Top 88.7% on sourcepulse

Project Summary

WavChat is a survey of spoken dialogue models aimed at researchers and developers working in AI and speech technology. It organizes and analyzes the fast-moving landscape of spoken dialogue systems, covering their architectures, capabilities, and training methodologies.

How It Works

The survey categorizes spoken dialogue systems into cascaded and end-to-end paradigms, detailing core technologies such as speech representation (semantic vs. acoustic), training paradigms, and streaming/duplex capabilities. It provides a chronological timeline of models and discusses limitations and future research directions for each aspect.
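
As a rough illustration of the two paradigms, the sketch below contrasts a cascaded system with an end-to-end one. It assumes the common reading of "cascaded" as separate speech recognition, text generation, and speech synthesis stages chained through text; that decomposition, and every class and function name here, is hypothetical and not taken from the survey or any model it lists. The toy stages only stand in for real components.

```python
from dataclasses import dataclass
from typing import Callable

Audio = bytes  # placeholder type for a waveform; real systems use arrays or tensors


@dataclass
class CascadedPipeline:
    """Chains three independent stages, with text as the intermediate form."""
    asr: Callable[[Audio], str]   # speech -> text
    llm: Callable[[str], str]     # text -> text
    tts: Callable[[str], Audio]   # text -> speech

    def respond(self, speech: Audio) -> Audio:
        text_in = self.asr(speech)
        text_out = self.llm(text_in)
        return self.tts(text_out)


@dataclass
class EndToEndModel:
    """Maps input speech to output speech with a single model."""
    model: Callable[[Audio], Audio]

    def respond(self, speech: Audio) -> Audio:
        return self.model(speech)


# Toy stand-ins so the sketch runs; they do no real speech processing.
def toy_asr(speech: Audio) -> str:
    return speech.decode("utf-8")


def toy_llm(text: str) -> str:
    return text.upper()


def toy_tts(text: str) -> Audio:
    return text.encode("utf-8")


cascaded = CascadedPipeline(asr=toy_asr, llm=toy_llm, tts=toy_tts)
end_to_end = EndToEndModel(model=lambda speech: speech[::-1])

cascaded_reply = cascaded.respond(b"hello")
end_to_end_reply = end_to_end.respond(b"hello")
```

In the cascaded version, anything the intermediate text cannot carry is lost between stages, while the end-to-end version never exposes a text intermediate at all.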

Quick Start & Requirements

This repository is a companion to the WavChat survey paper, so the starting point is the survey itself, which is available on arXiv. For the implementation details and requirements of any model or dataset the survey covers, follow its links to the corresponding external GitHub repositories and research papers.

Highlighted Details

  • Provides a curated list of over 30 publicly available speech dialogue models with direct GitHub links.
  • Compiles an extensive list of over 40 audio codec models relevant to speech processing.
  • Details datasets used for various training stages, including ASR, TTS, and dialogue fine-tuning, along with music and non-speech sound datasets.
  • Offers a structured overview of model capabilities, training paradigms, and evaluation metrics.

Maintenance & Community

The project accompanies the research paper "WavChat: A Survey of Spoken Dialogue Models", published on arXiv. The primary contributor is Shengpeng Ji. The README does not explicitly describe community channels or plans for project updates.

Licensing & Compatibility

The repository itself does not specify a license. The associated survey paper is a pre-print on arXiv. Individual models and datasets mentioned within the survey will have their own respective licenses, which may vary and could include restrictions on commercial use.

Limitations & Caveats

This repository accompanies a survey and does not contain executable code for the models discussed. For implementation, setup, and usage, refer to the individual project repositories linked within the survey. Because the field is evolving rapidly, the survey is a snapshot in time.

Health Check

Last commit: 8 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 16 stars in the last 90 days
