ichigo by menloresearch

Speech package for local, real-time voice AI development

created 1 year ago
2,345 stars

Top 19.9% on sourcepulse

Project Summary

Ichigo is a Python package that gives developers local, real-time speech AI capabilities. It covers Automatic Speech Recognition (Ichigo-ASR) and experimental speech language modeling (Ichigo-LLM). It aims to simplify speech tasks by offering intuitive Python interfaces and a scalable FastAPI service that abstract away complex audio processing.

How It Works

Ichigo-ASR is a compact (22M parameters) speech tokenizer based on Whisper-medium, designed for efficient multilingual performance. It converts speech into discrete tokens, which makes the audio easier to feed to LLMs for direct speech understanding. This approach, inspired by early fusion techniques, allows for modularity and potential cross-task training, so that ASR data can inform TTS models and vice versa.
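To make "converts speech into discrete tokens" concrete, here is a minimal sketch of nearest-neighbour vector quantization, the general idea behind many speech tokenizers. It is a toy illustration under stated assumptions, not Ichigo-ASR's actual code: the `quantize` helper, the 2-dimensional frames, and the 3-entry codebook are all made up for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    Hypothetical helper for illustration; a real tokenizer would operate on
    learned encoder features and a learned codebook.
    """
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    tokens = []
    for frame in frames:
        # The token is simply the position of the closest codebook entry.
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(frame, codebook[i]))
        tokens.append(nearest)
    return tokens


# Toy data (assumed): four "encoder frames" and a three-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.6, 0.7)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The resulting integer sequence is what a language model can consume alongside text tokens, which is why a discrete representation is convenient for early-fusion setups.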

Quick Start & Requirements

  • Install via pip: pip install ichigo
  • Requires Python.
  • API server can be started with uvicorn asr:app --host 0.0.0.0 --port 8000 or via Docker.
  • API documentation available at http://localhost:8000/docs.
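Once the server is running, the exact route names can be read from the API rather than guessed. The sketch below assumes the service keeps FastAPI's default behaviour of publishing its OpenAPI schema at `/openapi.json` (the same schema that backs the `/docs` page). The helper names `list_endpoints` and `fetch_schema` and the sample schema are hypothetical and only serve the illustration.

```python
import json
from typing import List
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(schema: dict) -> List[str]:
    """Return 'METHOD /path' strings found in an OpenAPI schema dictionary."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return endpoints


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the schema; only works while the API server is running.

    Assumes the default FastAPI schema location has not been changed.
    """
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


# Placeholder schema so the sketch runs without a server; these paths are
# invented and are NOT the routes of the real service.
sample_schema = {
    "paths": {
        "/example/upload": {"post": {}},
        "/example/health": {"get": {}},
    }
}
print(list_endpoints(sample_schema))
# ['GET /example/health', 'POST /example/upload']
```

With the server started as described above, `list_endpoints(fetch_schema())` should print the routes the service actually exposes.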

Highlighted Details

  • Ichigo-ASR reports competitive benchmark results, outperforming Whisper-medium.en on several metrics.
  • Supports batch processing for single files or folders via simple Python calls.
  • Provides a FastAPI service for integration, with REST API endpoints for Speech-to-Text (S2T), Speech-to-Representation (S2R), and Representation-to-Text (R2T).
  • Ichigo-LLM is an experimental project aiming for on-device, open-weight voice assistants, inspired by Meta's Chameleon.
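The "single file or folder" batch behaviour can be pictured with a small wrapper like the one below. This is a sketch under assumptions, not the package's implementation: `collect_audio_files`, `transcribe_all`, and the suffix list are hypothetical, and the transcription function is passed in as a parameter because the package's exact call signature is not given here.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumed set of audio extensions; adjust to whatever the backend accepts.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".m4a"}


def collect_audio_files(target: str) -> List[Path]:
    """Return [target] for a file, or all audio files under a folder (sorted)."""
    path = Path(target)
    if path.is_file():
        return [path]
    return sorted(
        p for p in path.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def transcribe_all(target: str, transcribe_fn: Callable[[str], str]) -> Dict[str, str]:
    """Apply transcribe_fn to every collected file and map path -> transcript."""
    return {str(p): transcribe_fn(str(p)) for p in collect_audio_files(target)}
```

Any callable that takes a file path and returns text can be supplied as `transcribe_fn`, for example a thin wrapper around the package's own transcription call or around an HTTP request to the API server.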

Maintenance & Community

  • Open research project seeking collaborators.
  • Discord community is available for discussion and feedback.
  • Mentions torchtune, WhisperSpeech, and Llama3 as foundational components.

Licensing & Compatibility

  • The README does not explicitly state a license. The project builds on WhisperSpeech (MIT License) and Llama3 (custom license), which suggests potential compatibility, but the licensing terms need explicit clarification before any commercial use.

Limitations & Caveats

  • Ichigo-TTS (Text-to-Speech) is listed as "Coming Soon."
  • Ichigo-LLM is described as an "experimental" research project.
  • Streaming is not currently supported for the API.
  • The project is a "work in progress," welcoming feedback and collaborations.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 61 stars in the last 90 days
