xiaozhi-esp32-server-golang by hackers365

High-performance AI backend for voice-driven IoT and edge devices

Created 10 months ago
262 stars

Top 97.0% on SourcePulse

Project Summary

This project provides a high-performance, full-streaming AI backend service written in Go, designed for IoT and smart voice applications. It integrates Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS) capabilities, enabling low-latency, real-time AI voice interaction for smart terminals and edge devices. The service supports massive concurrency and multiple protocols, offering a flexible and scalable solution for developers.

How It Works

The core architecture features an end-to-end, full-streaming AI voice pipeline (ASR → LLM → TTS) for minimal latency. It employs a modular, pluggable design, abstracting transport layers (WebSocket, MQTT, UDP) and utilizing message queues for asynchronous LLM and TTS processing. The system leverages resource pooling and connection reuse for high throughput. It integrates diverse AI engines like FunASR, OpenAI-compatible models, Ollama, EdgeTTS, and CosyVoice through the Eino framework, allowing for flexible AI capability injection.
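The full-streaming pipeline described above can be sketched with Go channels, where each stage forwards partial results downstream as soon as they arrive rather than waiting for a complete utterance. This is an illustrative sketch only: the stage names, signatures, and stub logic are assumptions, not the project's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// asrStage emits partial transcripts as they are recognized.
func asrStage(frames []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, f := range frames {
			out <- f // forward each partial transcript downstream immediately
		}
	}()
	return out
}

// llmStage streams response tokens for each transcript segment.
func llmStage(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for text := range in {
			// A real backend would stream tokens from an OpenAI-compatible
			// or Ollama endpoint; here we fake one token per word.
			for _, tok := range strings.Fields("echo: " + text) {
				out <- tok
			}
		}
	}()
	return out
}

// ttsStage converts tokens to (stub) audio chunks as they arrive.
func ttsStage(in <-chan string) <-chan []byte {
	out := make(chan []byte)
	go func() {
		defer close(out)
		for tok := range in {
			out <- []byte(tok) // stand-in for an Opus audio frame
		}
	}()
	return out
}

func main() {
	audio := ttsStage(llmStage(asrStage([]string{"turn on the light"})))
	n := 0
	for chunk := range audio {
		n++
		fmt.Printf("chunk %d: %s\n", n, chunk)
	}
	fmt.Println("chunks:", n)
}
```

Because every stage runs in its own goroutine and hands results over a channel, audio playback can begin while the LLM is still generating, which is the property that keeps end-to-end latency low.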

Quick Start & Requirements

The recommended installation is via a one-click startup package, available from the releases page, which includes the main program, console, and voiceprint service. Alternatively, Docker Compose or plain Docker deployments are supported. Local compilation requires Go 1.20+, Opus codec libraries (libopus0, libopusfile-dev), and ONNX Runtime (v1.21.0). After startup, a web console is accessible at http://<server_ip_or_domain>:8080.

  • Releases: https://github.com/hackers365/xiaozhi-esp32-server-golang/releases
  • Quickstart Tutorial: doc/quickstart_bundle_tutorial.md

Highlighted Details

  • End-to-end full-streaming AI voice link (ASR → LLM → TTS) for low-latency real-time interaction.
  • Voiceprint recognition and dynamic TTS switching for personalized voice experiences.
  • Modular and extensible architecture supporting VAD, ASR, LLM, TTS, MCP, Vision, and more.
  • Integration with multiple AI engines (FunASR, OpenAI, Ollama, EdgeTTS, CosyVoice) via the Eino framework.
  • Full-featured Web management console for configuration, testing, device management, and monitoring.
  • Advanced features include MCP Market aggregation, voice cloning, knowledge base integration (Dify/RAGFlow/WeKnora), and OpenClaw intelligent agent access.
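The pluggable design behind these features (swappable VAD/ASR/LLM/TTS engines, dynamic TTS switching) typically rests on a provider interface plus a factory registry keyed by configuration. The sketch below shows that pattern for TTS; the interface, registry, and `edgeTTS` stub are hypothetical names, not the project's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// TTSProvider is one pluggable AI capability; ASR, LLM, and VAD providers
// would follow the same shape.
type TTSProvider interface {
	Synthesize(text string) ([]byte, error)
}

// edgeTTS is a stub engine standing in for a real EdgeTTS client.
type edgeTTS struct{}

func (edgeTTS) Synthesize(text string) ([]byte, error) {
	return []byte("opus:" + text), nil // stand-in for encoded audio
}

var ttsRegistry = map[string]func() TTSProvider{}

// RegisterTTS lets each engine register itself under a config key.
func RegisterTTS(name string, factory func() TTSProvider) {
	ttsRegistry[name] = factory
}

// NewTTS constructs the engine named in configuration, so switching
// voices or vendors needs no changes at call sites.
func NewTTS(name string) (TTSProvider, error) {
	f, ok := ttsRegistry[name]
	if !ok {
		return nil, errors.New("unknown tts engine: " + name)
	}
	return f(), nil
}

func main() {
	RegisterTTS("edge", func() TTSProvider { return edgeTTS{} })
	tts, err := NewTTS("edge")
	if err != nil {
		panic(err)
	}
	audio, _ := tts.Synthesize("hello")
	fmt.Println(string(audio))
}
```

Switching engines then reduces to changing the registry key in configuration, which is how a system can offer runtime voice switching without redeploying.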

Maintenance & Community

The project is primarily maintained by "hackers365". Community interaction is facilitated via a WeChat group (QR code expired, direct contact recommended) and the author's personal WeChat. The roadmap indicates plans for maintaining persistent (long-lived) connections with devices and implementing proactive AI features.

Licensing & Compatibility

The project is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

A security and permission system is currently in the planning phase. Access to community support may require direct contact with the author due to expired links. Local compilation has specific dependency requirements for Go and ONNX Runtime.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 42
  • Issues (30d): 15
  • Star History: 73 stars in the last 30 days
