diart by juanmc2005

Real-time audio applications framework

Created 4 years ago

1,910 stars

Top 22.6% on SourcePulse

Project Summary

Diart is a Python framework for building real-time AI-powered audio applications, specializing in speaker diarization. It enables developers to recognize different speakers in live or recorded audio streams with state-of-the-art performance, offering a flexible pipeline that can be customized, benchmarked, and served via WebSockets.

How It Works

Diart combines a speaker segmentation model and a speaker embedding model within an incremental clustering algorithm. Because the clustering is updated as new audio arrives, accuracy improves as a conversation progresses. Both models come from pyannote.audio. The framework also supports custom AI pipelines, hyper-parameter tuning, and web serving.
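The incremental clustering idea described above can be illustrated with a minimal sketch. This is not diart's implementation: the class name IncrementalSpeakerClustering, the new_speaker_threshold parameter, and the running-mean centroid update are all assumptions chosen for illustration, and the two-dimensional vectors stand in for real speaker embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; treat a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class IncrementalSpeakerClustering:
    """Hypothetical online clustering: one centroid per speaker seen so far."""

    def __init__(self, new_speaker_threshold=0.5):
        # Assumed parameter: distances at or above this value start a new speaker.
        self.new_speaker_threshold = new_speaker_threshold
        self.centroids = []  # one centroid vector per speaker
        self.counts = []     # number of embeddings merged into each centroid

    def assign(self, embedding):
        """Return a speaker index for `embedding`, updating centroids in place."""
        if self.centroids:
            distances = [cosine_distance(embedding, c) for c in self.centroids]
            best = min(range(len(distances)), key=distances.__getitem__)
            if distances[best] < self.new_speaker_threshold:
                # Running mean: the centroid is refined with every new embedding.
                n = self.counts[best]
                self.centroids[best] = [
                    (c * n + e) / (n + 1)
                    for c, e in zip(self.centroids[best], embedding)
                ]
                self.counts[best] = n + 1
                return best
        # No centroid is close enough: register a new speaker.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


clusterer = IncrementalSpeakerClustering(new_speaker_threshold=0.5)
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [clusterer.assign(e) for e in stream]
print(labels)
```

Each centroid is an average over every embedding assigned to it so far, which is one simple way a speaker's representation can become more stable as more of the conversation is heard.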

Quick Start & Requirements

Installation: pip install diart
Prerequisites: ffmpeg < 4.4, portaudio == 19.6.X, libsndfile >= 1.2.2. A Conda environment file (environment.yml) is provided for easier setup.
Pyannote Models: The default models (pyannote/segmentation, pyannote/embedding) require accepting their user conditions on Hugging Face and logging in with the Hugging Face CLI.
Resources: Runs on CPU or GPU (tested on an RTX 4060 Max-Q). Latency benchmarks are provided for various models.
Documentation: Links to installation, streaming, models, tuning, pipelines, WebSockets, and research papers are available within the README.

Highlighted Details

Supports real-time streaming from microphones or audio files.
Offers hyper-parameter optimization using Optuna for custom tuning.
Enables building custom pipelines by combining modular blocks (e.g., SpeakerSegmentation, OverlapAwareSpeakerEmbedding).
Provides WebSocket compatibility for serving pipelines over the web.
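Real-time processing of a stream generally means handing the pipeline short, overlapping chunks of audio as they arrive. The summary does not say how diart buffers audio, so the following is only a generic sketch of that idea: the function name sliding_windows and the window_size and step parameters are hypothetical, and integers stand in for audio samples.

```python
from collections import deque


def sliding_windows(samples, window_size, step):
    """Yield overlapping windows of `window_size` samples, advancing by `step`.

    The first window is emitted as soon as `window_size` samples have been
    seen; later windows follow every `step` samples after that.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    buffer = deque(maxlen=window_size)  # keeps only the most recent samples
    seen = 0
    for sample in samples:
        buffer.append(sample)
        seen += 1
        if seen >= window_size and (seen - window_size) % step == 0:
            yield list(buffer)


# Integers stand in for audio samples from a microphone or file.
for window in sliding_windows(range(8), window_size=4, step=2):
    print(window)
```

Because the function consumes any iterable lazily, the same loop works for a finite recording and for an open-ended live source.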

Maintenance & Community

The project is associated with research from Université Paris-Saclay and CNRS. The README includes a citation for the core research paper and notes on reproducibility, recommending pyannote.audio<3.1.

Licensing & Compatibility

License: MIT License.
Compatibility: The MIT License permits commercial use, modification, distribution, and sale, including use in closed-source software, provided the copyright and license notice is retained.

Limitations & Caveats

Transcription and speaker-aware transcription features are listed as "coming soon." Reproducing the exact benchmark results may require the pyannote.audio version range the README recommends (pyannote.audio<3.1).
