xiaozhi-esp32-server-golang by hackers365

High-performance AI backend for voice-driven IoT and edge devices

Created 10 months ago
262 stars

Top 97.0% on SourcePulse

Project Summary

This project provides a high-performance, full-streaming AI backend service written in Go, designed for IoT and smart voice applications. It integrates Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS) capabilities, enabling low-latency, real-time AI voice interaction for smart terminals and edge devices. The service supports massive concurrency and multiple protocols, offering a flexible and scalable solution for developers.

How It Works

The core architecture features an end-to-end, full-streaming AI voice pipeline (ASR → LLM → TTS) for minimal latency. It employs a modular, pluggable design, abstracting transport layers (WebSocket, MQTT, UDP) and utilizing message queues for asynchronous LLM and TTS processing. The system leverages resource pooling and connection reuse for high throughput. It integrates diverse AI engines like FunASR, OpenAI-compatible models, Ollama, EdgeTTS, and CosyVoice through the Eino framework, allowing for flexible AI capability injection.
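The full-streaming pipeline described above can be sketched with Go channels, where each stage forwards partial results downstream as soon as they arrive rather than waiting for a complete utterance. This is an illustrative sketch only: the stage names, signatures, and stub logic are assumptions, not the project's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// asrStage emits partial transcripts as they are recognized.
func asrStage(frames []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, f := range frames {
			out <- f // forward each partial transcript downstream immediately
		}
	}()
	return out
}

// llmStage streams response tokens for each transcript segment.
func llmStage(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for text := range in {
			// A real backend would stream tokens from an OpenAI-compatible
			// or Ollama endpoint; here we fake one token per word.
			for _, tok := range strings.Fields("echo: " + text) {
				out <- tok
			}
		}
	}()
	return out
}

// ttsStage converts tokens to (stub) audio chunks as they arrive.
func ttsStage(in <-chan string) <-chan []byte {
	out := make(chan []byte)
	go func() {
		defer close(out)
		for tok := range in {
			out <- []byte(tok) // stand-in for an Opus audio frame
		}
	}()
	return out
}

func main() {
	audio := ttsStage(llmStage(asrStage([]string{"turn on the light"})))
	n := 0
	for chunk := range audio {
		n++
		fmt.Printf("chunk %d: %s\n", n, chunk)
	}
	fmt.Println("chunks:", n)
}
```

Because every stage runs in its own goroutine and hands results over a channel, audio playback can begin while the LLM is still generating, which is the property that keeps end-to-end latency low.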

Quick Start & Requirements

The recommended installation is via a one-click startup package, available from the releases page, which includes the main program, console, and voiceprint service. Alternatively, Docker Compose or plain Docker deployments are supported. Local compilation requires Go 1.20+, Opus codec libraries (libopus0, libopusfile-dev), and ONNX Runtime (v1.21.0). After startup, a web console is accessible at http://<server_ip_or_domain>:8080.

  • Releases: https://github.com/hackers365/xiaozhi-esp32-server-golang/releases
  • Quickstart Tutorial: doc/quickstart_bundle_tutorial.md

Highlighted Details

  • End-to-end full-streaming AI voice link (ASR → LLM → TTS) for low-latency real-time interaction.
  • Voiceprint recognition and dynamic TTS switching for personalized voice experiences.
  • Modular and extensible architecture supporting VAD, ASR, LLM, TTS, MCP, Vision, and more.
  • Integration with multiple AI engines (FunASR, OpenAI, Ollama, EdgeTTS, CosyVoice) via the Eino framework.
  • Full-featured Web management console for configuration, testing, device management, and monitoring.
  • Advanced features include MCP Market aggregation, voice cloning, knowledge base integration (Dify/RAGFlow/WeKnora), and OpenClaw intelligent agent access.
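The pluggable design behind these features (swappable VAD/ASR/LLM/TTS engines, dynamic TTS switching) typically rests on a provider interface plus a factory registry keyed by configuration. The sketch below shows that pattern for TTS; the interface, registry, and `edgeTTS` stub are hypothetical names, not the project's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// TTSProvider is one pluggable AI capability; ASR, LLM, and VAD providers
// would follow the same shape.
type TTSProvider interface {
	Synthesize(text string) ([]byte, error)
}

// edgeTTS is a stub engine standing in for a real EdgeTTS client.
type edgeTTS struct{}

func (edgeTTS) Synthesize(text string) ([]byte, error) {
	return []byte("opus:" + text), nil // stand-in for encoded audio
}

var ttsRegistry = map[string]func() TTSProvider{}

// RegisterTTS lets each engine register itself under a config key.
func RegisterTTS(name string, factory func() TTSProvider) {
	ttsRegistry[name] = factory
}

// NewTTS constructs the engine named in configuration, so switching
// voices or vendors needs no changes at call sites.
func NewTTS(name string) (TTSProvider, error) {
	f, ok := ttsRegistry[name]
	if !ok {
		return nil, errors.New("unknown tts engine: " + name)
	}
	return f(), nil
}

func main() {
	RegisterTTS("edge", func() TTSProvider { return edgeTTS{} })
	tts, err := NewTTS("edge")
	if err != nil {
		panic(err)
	}
	audio, _ := tts.Synthesize("hello")
	fmt.Println(string(audio))
}
```

Switching engines then reduces to changing the registry key in configuration, which is how a system can offer runtime voice switching without redeploying.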

Maintenance & Community

The project is primarily maintained by "hackers365". Community interaction is facilitated via a WeChat group (QR code expired, direct contact recommended) and the author's personal WeChat. The roadmap indicates plans for maintaining persistent (long-lived) connections with devices and implementing proactive AI features.

Licensing & Compatibility

The project is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

A security and permission system is currently in the planning phase. Access to community support may require direct contact with the author due to expired links. Local compilation has specific dependency requirements for Go and ONNX Runtime.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 42
  • Issues (30d): 15
  • Star History: 73 stars in the last 30 days
