shuo by NickTikhonov

Real-time phone agent orchestration with sub-500ms latency

Created 1 month ago
629 stars

Top 52.7% on SourcePulse

Project Summary

This project provides a Python framework for orchestrating phone-based voice agents with sub-500ms latency. It targets developers and researchers building responsive, real-time conversational AI, and offers a streamlined way to chain STT, LLM, and TTS stages into one pipeline.

How It Works

The framework employs two core abstractions: Deepgram Flux for continuous, low-latency Speech-to-Text (STT) and turn detection over a single WebSocket, and an Agent pipeline handling the LLM, Text-to-Speech (TTS), and audio playback. The entire conversational state machine is encapsulated in a pure function (process_event(state, event) -> (state, actions)) within approximately 30 lines of code. A key design principle is end-to-end streaming: LLM tokens immediately feed the TTS engine, and the resulting audio streams directly to the user via Twilio. This architecture enables instant barge-in, where user interruptions are detected and processed immediately, cancelling ongoing audio playback and clearing buffers.
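
As an illustration of the pure-function design described above, here is a minimal sketch of what a process_event(state, event) -> (state, actions) transition could look like. It is not the project's code: the State, Event, and Phase types, the event names ("start_of_turn", "end_of_turn", "agent_done"), and the action strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import List, Tuple


class Phase(Enum):
    LISTENING = auto()   # waiting for the caller to finish a turn
    RESPONDING = auto()  # agent is generating or playing audio


@dataclass(frozen=True)
class State:
    phase: Phase = Phase.LISTENING
    history: Tuple[str, ...] = ()  # transcripts of completed caller turns


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical: "start_of_turn", "end_of_turn", "agent_done"
    text: str = ""  # transcript, only meaningful for "end_of_turn"


def process_event(state: State, event: Event) -> Tuple[State, List[str]]:
    """Pure transition: performs no I/O, returns the next state and actions."""
    if event.kind == "end_of_turn" and state.phase is Phase.LISTENING:
        # Caller finished speaking: record the transcript and start the agent.
        next_state = replace(
            state,
            phase=Phase.RESPONDING,
            history=state.history + (event.text,),
        )
        return next_state, ["start_agent"]
    if event.kind == "start_of_turn" and state.phase is Phase.RESPONDING:
        # Barge-in: caller spoke over the agent, so stop it and drop queued audio.
        return replace(state, phase=Phase.LISTENING), ["cancel_agent", "clear_audio"]
    if event.kind == "agent_done" and state.phase is Phase.RESPONDING:
        # Agent finished its reply without being interrupted.
        return replace(state, phase=Phase.LISTENING), []
    # Any other combination is ignored.
    return state, []
```

Because the function only maps inputs to outputs, the whole conversation flow can be unit-tested without a phone call, and a separate driver loop is left to carry out the returned actions against the real STT, LLM, and TTS services.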

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Python 3.9+, ngrok, and API keys for Twilio, Deepgram, OpenAI, and ElevenLabs.
  • Setup: Copy .env.example to .env and fill in the API keys. Run ngrok http 3040 in a separate terminal, then start the agent with python main.py +1234567890, substituting a real phone number for the placeholder.
  • Links: An ngrok URL (https://mature-spaniel-physically.ngrok-free.app) appears in the example output. It is a tunnel address tied to one ngrok setup rather than a hosted demo, so expect to supply your own.
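
Since setup depends on several API keys being present, a small pre-flight check can catch a missing one before a call is placed. The sketch below is an assumption-laden example: the variable names in REQUIRED_KEYS are hypothetical placeholders, and the real names should be taken from the project's .env.example.

```python
from typing import Iterable, List, Mapping

# Hypothetical variable names for illustration only;
# the project's .env.example defines the real ones.
REQUIRED_KEYS = (
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
)


def missing_keys(
    env: Mapping[str, str],
    required: Iterable[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return the required names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]
```

Calling missing_keys(os.environ) after the .env file has been loaded returns an empty list when everything is configured, and otherwise names the settings that still need a value.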

Highlighted Details

  • Achieves sub-500ms latency for voice agent orchestration.
  • Features end-to-end streaming from LLM tokens to TTS audio output.
  • Supports instant barge-in through real-time turn detection.
  • Core state machine logic is concise (~30 lines).
  • Integrates Deepgram Flux, OpenAI GPT-4o-mini, and ElevenLabs streaming services.
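
The streaming and barge-in points above can be sketched with plain asyncio. This is a toy model under stated assumptions, not the project's implementation: fake_llm_tokens and fake_tts are hypothetical stand-ins for the real LLM and TTS services, and the caller's interruption is simulated by counting played chunks. It shows each token flowing to "audio" as soon as it exists, and cancellation stopping the pipeline mid-reply.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm_tokens(reply: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields one token at a time."""
    for token in reply.split():
        await asyncio.sleep(0.01)  # simulated generation delay
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: emits an audio chunk per token on arrival."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_with_barge_in(reply: str, interrupt_after: int) -> List[bytes]:
    """Play a reply, cancelling it once interrupt_after chunks have been played."""
    played: List[bytes] = []
    wake = asyncio.Event()

    def playback(chunk: bytes) -> None:
        played.append(chunk)
        if len(played) == interrupt_after:
            wake.set()  # simulate the caller starting to talk over the agent

    async def speak() -> None:
        try:
            async for chunk in fake_tts(fake_llm_tokens(reply)):
                playback(chunk)
        finally:
            wake.set()  # also wake the supervisor if the reply ends uninterrupted

    task = asyncio.create_task(speak())
    await wake.wait()
    task.cancel()  # no-op if the reply already finished
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected when the reply was interrupted
    return played
```

Running asyncio.run(run_with_barge_in("one two three four five six", interrupt_after=3)) stops after the third chunk; in a real deployment the cancellation step would also need to tell the telephony side to discard audio it has already buffered.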

Maintenance & Community

No specific details regarding contributors, sponsorships, community channels (like Discord/Slack), or roadmaps are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, which is highly permissive and generally compatible with commercial use and closed-source applications.

Limitations & Caveats

Initial setup requires obtaining and configuring multiple third-party API keys (Twilio, Deepgram, OpenAI, ElevenLabs) and running a tunneling service (ngrok), which can present a barrier to entry for quick experimentation. The project is presented as a framework, suggesting it may require further development for specific application needs.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 91 stars in the last 30 days
