aoai-realtime-audio-sdk  by Azure-Samples

Azure OpenAI SDK for real-time audio processing with GPT-4o

created 10 months ago
823 stars

Top 44.0% on sourcepulse

Project Summary

This repository provides resources for using Azure OpenAI's GPT-4o real-time audio capabilities through the /realtime WebSocket API. It targets developers building low-latency, speech-to-speech conversational applications such as support agents, assistants, and translators, and offers a more responsive interaction model than traditional request-response APIs.

How It Works

The /realtime API uses WebSockets for asynchronous, bidirectional streaming between a client application and the Azure OpenAI service. It supports text, function calling, and audio input and output. Turn detection is configurable: server-side Voice Activity Detection (VAD) triggers responses automatically, while manual response.create calls give explicit control, which suits push-to-talk scenarios. The intended architecture places an intermediate service between end users and the model endpoint; that service manages user connections and handles communication with the model.
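The two turn-detection modes can be sketched as JSON events sent over the WebSocket. This is a minimal sketch, not the repository's code: the event type `session.update` and the `turn_detection` field with value `server_vad` are assumptions based on the public realtime event names, and the exact shape may differ between preview API versions. Only `response.create` is named in the summary above.

```python
import json


def session_update(use_server_vad: bool) -> str:
    """Build a session.update event that selects the turn-detection mode.

    Assumption: {"type": "server_vad"} enables server-side VAD and a null
    value disables it, leaving the client to trigger responses itself.
    """
    turn_detection = {"type": "server_vad"} if use_server_vad else None
    event = {
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(event)


def response_create() -> str:
    """Build the explicit trigger used when turn detection is manual."""
    return json.dumps({"type": "response.create"})
```

With server VAD enabled the client only streams audio; with it disabled, the client sends `response_create()` whenever it wants the model to answer.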

Quick Start & Requirements

  • Install: No specific installation command is provided; usage relies on sample code and potentially standalone libraries.
  • Prerequisites:
    • Azure OpenAI resource in eastus2 or swedencentral region.
    • Deployed gpt-4o-realtime-preview model (version 2024-10-01).
    • Supported API version (2024-10-01-preview).
    • Authentication via Microsoft Entra token or API key.
  • Resources: Requires establishing a WebSocket connection.
  • Links: Realtime API Documentation, Realtime OpenAPI Spec

Highlighted Details

  • Supports low-latency "speech in, speech out" interactions.
  • Enables function tool calling within the real-time stream.
  • Offers configurable Voice Activity Detection (VAD) for automatic turn management.
  • Allows for asynchronous streaming of audio, text, and function call data.
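Function tool calling within the stream could look roughly like the following. This is a hedged sketch: the `tools` and `tool_choice` session fields, the `conversation.item.create` event, and the `function_call_output` item type are assumptions drawn from the public realtime event schema, and `get_weather` is a hypothetical tool invented for illustration.

```python
import json

# Hypothetical tool definition; the name and parameters are illustrative only.
WEATHER_TOOL = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def session_with_tools(tools: list) -> str:
    """Build a session.update event that registers tools (assumed schema)."""
    return json.dumps(
        {"type": "session.update", "session": {"tools": tools, "tool_choice": "auto"}}
    )


def function_call_output(call_id: str, result: dict) -> str:
    """Build the event that returns a tool result for a given call_id."""
    return json.dumps(
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),  # result is sent as a JSON string
            },
        }
    )
```

After sending the tool output, the client would typically request a new response so the model can use the result.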

Maintenance & Community

  • The project is in Public Preview, indicating potential API changes and updates.
  • Official library support for Python and JavaScript is planned but not yet available. .NET preview support exists.

Licensing & Compatibility

  • License details are not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The /realtime API is in public preview, meaning API contracts and behavior may change. It is not designed for direct use from untrusted end-user devices, so an intermediate service is required. With server VAD, lengthy audio inputs can trigger rapid, potentially unreliable responses; manual turn control is recommended for such cases.
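Manual turn control for a long recording can be sketched as an ordered list of events: stream the audio in chunks, commit the buffer, then request a response. The event names `input_audio_buffer.append` and `input_audio_buffer.commit`, the base64 `audio` field, and the chunk size are assumptions based on the public realtime event schema; `push_to_talk_events` is a hypothetical helper, and this sketch assumes server VAD has already been disabled for the session.

```python
import base64
import json


def push_to_talk_events(pcm_audio: bytes, chunk_size: int = 4800) -> list[str]:
    """Return the JSON events for one manually controlled user turn."""
    events = []
    # Stream the recording in fixed-size chunks, base64-encoded.
    for start in range(0, len(pcm_audio), chunk_size):
        chunk = pcm_audio[start:start + chunk_size]
        events.append(
            json.dumps(
                {
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode("ascii"),
                }
            )
        )
    # Mark the end of the user's turn, then explicitly ask for a response.
    events.append(json.dumps({"type": "input_audio_buffer.commit"}))
    events.append(json.dumps({"type": "response.create"}))
    return events
```

Because the response is requested only once the whole input is committed, the model does not start answering partway through a long utterance.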

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 24 stars in the last 90 days
