aoai-realtime-audio-sdk by Azure-Samples

Azure OpenAI SDK for real-time audio processing with GPT-4o

Created 1 year ago

837 stars

Top 42.6% on SourcePulse

Project Summary

This repository provides resources for using Azure OpenAI's GPT-4o real-time audio capabilities through the new /realtime WebSocket API. It targets developers building low-latency, speech-to-speech conversational applications such as support agents, assistants, and translators, and it offers a more responsive interaction model than traditional request-response APIs.

How It Works

The /realtime API uses WebSockets for asynchronous, bi-directional streaming between a client application and the Azure OpenAI service. It supports text, function calling, and audio input/output. A key feature is flexible turn detection: server-side Voice Activity Detection (VAD) triggers responses automatically, while manual response.create calls give the client explicit control, which suits push-to-talk scenarios. In the intended architecture, an intermediate service manages end-user connections and communicates with the model endpoint.
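The two turn-detection modes described above can be sketched as the JSON events a client would send over the WebSocket. This is a minimal sketch, not the repository's code: the event names (session.update, input_audio_buffer.append, input_audio_buffer.commit, response.create) follow the Realtime API as publicly documented, but the exact way to disable server VAD has varied between preview versions (null versus a "none" type), and the silence_duration_ms value is an illustrative assumption. Check the Realtime OpenAPI Spec for the API version you target.

```python
import base64
import json


def session_update(use_server_vad: bool) -> str:
    """Build a session.update event that selects the turn-detection mode."""
    if use_server_vad:
        # Server VAD: the service decides when the user stopped speaking.
        # The 500 ms silence window is an illustrative value, not a recommendation.
        turn_detection = {"type": "server_vad", "silence_duration_ms": 500}
    else:
        # Manual mode (assumed here to be expressed as null): the client
        # must commit the audio buffer and request a response itself.
        turn_detection = None
    return json.dumps(
        {"type": "session.update", "session": {"turn_detection": turn_detection}}
    )


def push_to_talk_events(pcm_chunks):
    """Build the events for one manually delimited user turn."""
    events = [
        {
            "type": "input_audio_buffer.append",
            # Audio is sent as base64-encoded bytes inside the JSON event.
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
        for chunk in pcm_chunks
    ]
    events.append({"type": "input_audio_buffer.commit"})  # end of user audio
    events.append({"type": "response.create"})  # explicitly ask for a reply
    return [json.dumps(event) for event in events]
```

In a push-to-talk client, push_to_talk_events would be called when the user releases the talk button, and each returned string would be sent as one WebSocket text message.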

Quick Start & Requirements

Install: No specific installation command is provided; usage relies on sample code and potentially standalone libraries.
Prerequisites:
- An Azure OpenAI resource in the eastus2 or swedencentral region.
- A deployed gpt-4o-realtime-preview model (version 2024-10-01).
- A supported API version (2024-10-01-preview).
- Authentication via a Microsoft Entra token or an API key.
Resources: Requires establishing a WebSocket connection.
Links: Realtime API Documentation, Realtime OpenAPI Spec
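The prerequisites above map onto the connection parameters roughly as follows. This is a hedged sketch: the URL shape (wss://<resource>.openai.azure.com/openai/realtime with api-version and deployment query parameters) and the api-key and Authorization header names reflect the Azure OpenAI conventions as I understand them and should be confirmed against the Realtime API Documentation. The resource name used in the usage note is a placeholder.

```python
from urllib.parse import urlencode


def realtime_url(resource, deployment, api_version="2024-10-01-preview"):
    """Build the WebSocket URL for a deployed realtime model (assumed URL shape)."""
    query = urlencode({"api-version": api_version, "deployment": deployment})
    return f"wss://{resource}.openai.azure.com/openai/realtime?{query}"


def auth_headers(api_key=None, entra_token=None):
    """Return request headers for one of the two supported auth methods."""
    if entra_token:
        # Microsoft Entra tokens are sent as a standard bearer token.
        return {"Authorization": f"Bearer {entra_token}"}
    if api_key:
        # API keys go in the api-key header used by Azure OpenAI.
        return {"api-key": api_key}
    raise ValueError("Provide either an API key or a Microsoft Entra token")
```

For example, realtime_url("my-resource", "gpt-4o-realtime-preview") yields the URL to pass to a WebSocket client, together with the headers from auth_headers.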

Highlighted Details

- Supports low-latency "speech in, speech out" interactions.
- Enables function tool calling within the real-time stream.
- Offers configurable Voice Activity Detection (VAD) for automatic turn management.
- Allows for asynchronous streaming of audio, text, and function call data.
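Function tool calling within the stream can be illustrated with a small dispatcher. This is a sketch under assumptions: it presumes the service reports a completed call as a response.output_item.done event whose item has type function_call with name, call_id, and arguments fields, and that the result goes back as a conversation.item.create event carrying a function_call_output item, followed by response.create. The get_weather tool and its output are hypothetical.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would call a weather service."""
    return {"city": city, "forecast": "sunny"}


# Maps tool names declared to the model onto local Python callables.
HANDLERS = {"get_weather": get_weather}


def handle_server_event(event):
    """Return the client events to send in reply to a server event, if any."""
    if event.get("type") != "response.output_item.done":
        return []
    item = event.get("item", {})
    if item.get("type") != "function_call":
        return []
    # Arguments arrive as a JSON-encoded string produced by the model.
    args = json.loads(item["arguments"])
    result = HANDLERS[item["name"]](**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],  # ties the result to the call
                "output": json.dumps(result),
            },
        },
        # Ask the model to continue now that the tool result is available.
        {"type": "response.create"},
    ]
```

Because the stream is asynchronous, a client would run this handler inside its receive loop and send the returned events as they are produced.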

Maintenance & Community

The project is in public preview, so the API may change between updates.
Official library support for Python and JavaScript is planned but not yet available; preview support for .NET exists.

Licensing & Compatibility

License details are not explicitly stated in the README.
Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The /realtime API is in public preview, so API contracts and behavior may change. It is not designed to be called directly from untrusted end-user devices, so an intermediate service is required. With server VAD, lengthy audio inputs can trigger rapid, potentially unreliable responses; manual turn control is recommended in those cases.
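One thing the intermediate service can do is keep credentials on the server and forward only a small set of client event types to the model endpoint. The sketch below shows that filtering step alone. The allowlist is an illustrative assumption, not a policy from the README, and a real middle tier would also handle authentication, connection management, and rate limiting.

```python
import json

# Illustrative allowlist: end users may stream audio and request responses,
# but may not reconfigure the session (for example via session.update).
ALLOWED_CLIENT_EVENTS = {
    "input_audio_buffer.append",
    "input_audio_buffer.commit",
    "response.create",
}


def filter_client_message(raw):
    """Return the message to forward upstream, or None to drop it."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON: never forward
    if not isinstance(event, dict):
        return None
    if event.get("type") not in ALLOWED_CLIENT_EVENTS:
        return None  # event type not permitted from untrusted clients
    return json.dumps(event)
```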
