shuo by NickTikhonov

Real-time phone agent orchestration with sub-500ms latency

Created 5 months ago

664 stars

Top 49.9% on SourcePulse

Project Summary

This project provides a Python framework for orchestrating voice agents with sub-500ms latency. It targets developers and researchers building responsive, real-time conversational AI, and offers a streamlined way to connect STT, LLM, and TTS pipelines.

How It Works

The framework employs two core abstractions: Deepgram Flux for continuous, low-latency Speech-to-Text (STT) and turn detection over a single WebSocket, and an Agent pipeline handling the LLM, Text-to-Speech (TTS), and audio playback. The entire conversational state machine is encapsulated in a pure function (process_event(state, event) -> (state, actions)) within approximately 30 lines of code. A key design principle is end-to-end streaming: LLM tokens immediately feed the TTS engine, and the resulting audio streams directly to the user via Twilio. This architecture enables instant barge-in, where user interruptions are detected and processed immediately, cancelling ongoing audio playback and clearing buffers.
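To make the pure-function design concrete, here is a minimal sketch of what a process_event(state, event) -> (state, actions) reducer with barge-in could look like. Only the function signature comes from the project description; the State, event, and action names (StartOfTurn, EndOfTurn, AgentDone, StartAgent, CancelAgent, ClearAudio) are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple


class Phase(Enum):
    LISTENING = auto()   # waiting for the caller to finish a turn
    RESPONDING = auto()  # agent pipeline (LLM -> TTS -> playback) is running


@dataclass(frozen=True)
class State:
    phase: Phase = Phase.LISTENING


# Hypothetical events, e.g. produced by the STT / turn-detection stream.
@dataclass(frozen=True)
class StartOfTurn:
    pass


@dataclass(frozen=True)
class EndOfTurn:
    transcript: str


@dataclass(frozen=True)
class AgentDone:
    pass


# Hypothetical actions, to be executed by the caller of process_event.
@dataclass(frozen=True)
class StartAgent:
    transcript: str


@dataclass(frozen=True)
class CancelAgent:
    pass


@dataclass(frozen=True)
class ClearAudio:
    pass


def process_event(state: State, event: object) -> Tuple[State, List[object]]:
    """Pure transition function: no I/O, just (state, event) -> (state, actions)."""
    if isinstance(event, StartOfTurn):
        if state.phase is Phase.RESPONDING:
            # Barge-in: stop the agent and flush any queued audio.
            return State(Phase.LISTENING), [CancelAgent(), ClearAudio()]
        return state, []
    if isinstance(event, EndOfTurn):
        if event.transcript.strip():
            return State(Phase.RESPONDING), [StartAgent(event.transcript)]
        return state, []  # ignore empty turns
    if isinstance(event, AgentDone):
        return State(Phase.LISTENING), []
    return state, []  # unknown events leave the state unchanged
```

Because the function performs no I/O, it can be unit-tested by feeding it events and comparing the returned actions, while a thin outer loop is left to carry out the side effects.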

Quick Start & Requirements

Installation: pip install -r requirements.txt
Prerequisites: Python 3.9+, ngrok, and API keys for Twilio, Deepgram, OpenAI, and ElevenLabs.
Setup: Copy .env.example to .env and populate with API keys. Run ngrok http 3040 in a separate terminal. Execute the main script with python main.py +1234567890.
Links: The sample output shown during execution includes an ngrok URL (https://mature-spaniel-physically.ngrok-free.app). This appears to be the tunnel address from an example run rather than a hosted demo.
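Since setup depends on four sets of credentials, a quick check that .env is fully populated can save a failed first call. The sketch below parses .env-style text and reports missing keys. The variable names in REQUIRED_KEYS are hypothetical placeholders; the actual names are whatever the project's .env.example defines.

```python
from typing import Dict, List

# Hypothetical names: replace with the keys listed in the project's .env.example.
REQUIRED_KEYS = [
    "TWILIO_AUTH_TOKEN",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in required if not env.get(key)]
```

A typical use would be to read the file with open(".env").read(), pass the text to parse_env, and exit early with a message if missing_keys returns anything.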

Highlighted Details

Achieves sub-500ms latency for voice agent orchestration.
Features end-to-end streaming from LLM tokens to TTS audio output.
Supports instant barge-in through real-time turn detection.
Core state machine logic is concise (~30 lines).
Integrates Deepgram Flux, OpenAI GPT-4o-mini, and ElevenLabs streaming services.
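The end-to-end streaming idea in the list above can be illustrated with chained async generators, where each stage forwards output as soon as it has input instead of waiting for the full response. This is a sketch only: fake_llm_tokens and fake_tts are stand-ins invented for illustration and do not call OpenAI, ElevenLabs, or Twilio.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm_tokens(text: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response: yields one token at a time."""
    for token in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token + " "


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for a streaming TTS engine: emits 'audio' per incoming token."""
    async for token in tokens:
        yield token.encode("utf-8")  # pretend these bytes are audio frames


async def speak(text: str, sink: List[bytes]) -> None:
    """Pipe LLM tokens through TTS into a sink (a stand-in for the phone line)."""
    async for chunk in fake_tts(fake_llm_tokens(text)):
        sink.append(chunk)


async def demo() -> List[bytes]:
    sink: List[bytes] = []
    await speak("hello there caller", sink)
    return sink
```

Running asyncio.run(demo()) collects one chunk per token. If speak is started with asyncio.create_task, calling cancel() on the task is one way to stop output mid-response when a barge-in is detected.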

Maintenance & Community

No specific details regarding contributors, sponsorships, community channels (like Discord/Slack), or roadmaps are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, which is highly permissive and generally compatible with commercial use and closed-source applications.

Limitations & Caveats

Initial setup requires obtaining and configuring multiple third-party API keys (Twilio, Deepgram, OpenAI, ElevenLabs) and running a tunneling service (ngrok), which can present a barrier to entry for quick experimentation. The project is presented as a framework, suggesting it may require further development for specific application needs.

Health Check

Last Commit

4 months ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days