ichigo by janhq

Speech package for local, real-time voice AI development

Created 1 year ago

2,424 stars

Top 18.7% on SourcePulse


4 Experts Love This Project

Dan Guido

Cofounder of Trail of Bits

Luis Capelo

Cofounder of Lightning AI

Michael Han

Cofounder of Unsloth

Thomas Wolf

Cofounder of Hugging Face

Project Summary

Ichigo is a Python package providing local, real-time speech AI capabilities for developers, focusing on Automatic Speech Recognition (Ichigo-ASR) and experimental speech language modeling (Ichigo-LLM). It aims to simplify speech tasks by offering intuitive Python interfaces and a scalable FastAPI service, abstracting away complex audio processing.

How It Works

Ichigo-ASR is a compact (22M parameters) speech tokenizer based on Whisper-medium, designed for efficient multilingual performance. It converts speech into discrete tokens, which an LLM can consume directly for speech understanding. This approach, inspired by early fusion techniques, keeps the components modular and opens the door to cross-task training, in which ASR data informs TTS models and vice versa.
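To make "converts speech into discrete tokens" concrete, here is a minimal sketch of vector quantization, the general technique behind speech tokenizers of this kind. It is not Ichigo's implementation: the function names, the two-dimensional toy frames, and the three-entry codebook are all made up for illustration. A real tokenizer would quantize high-dimensional encoder features against a learned codebook.

```python
import math


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`.

    Distance is squared Euclidean; ties go to the lowest index.
    """
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames, codebook):
    """Map a sequence of feature frames to a sequence of discrete token ids."""
    return [nearest_code(frame, codebook) for frame in frames]


# Hypothetical toy data: three codebook entries, four audio feature frames.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
toy_frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.7]]

tokens = quantize(toy_frames, toy_codebook)
print(tokens)  # [0, 1, 2, 1]
```

Each frame is replaced by the id of its nearest codebook entry, so the output is a plain integer sequence that can share a vocabulary with text tokens.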

Quick Start & Requirements

Install via pip: pip install ichigo
Requires Python.
API server can be started with uvicorn asr:app --host 0.0.0.0 --port 8000 or via Docker.
API documentation available at http://localhost:8000/docs.
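Since the service is built on FastAPI, the interactive docs at /docs are normally backed by a machine-readable schema at /openapi.json. The sketch below lists the routes found in such a schema. It assumes the server keeps FastAPI's default schema URL and is reachable on localhost:8000; the helper names `list_endpoints` and `fetch_schema` are illustrative and are not part of the ichigo package.

```python
import json
from urllib.request import urlopen

# Operation keys that the OpenAPI spec allows under a path item.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return endpoints


def fetch_schema(base_url="http://localhost:8000"):
    """Download the schema; assumes FastAPI's default /openapi.json route."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


# With the server running:
#     for method, path in list_endpoints(fetch_schema()):
#         print(method, path)
```

Reading the route list from the running server avoids guessing endpoint paths, which this summary does not state.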

Highlighted Details

Ichigo-ASR offers competitive benchmarks, outperforming Whisper-medium.en on several metrics.
Transcribes a single file or a whole folder of audio files via simple Python calls.
Provides a FastAPI service for integration, with REST API endpoints for Speech-to-Text (S2T), Speech-to-Representation (S2R), and Representation-to-Text (R2T).
Ichigo-LLM is an experimental project aiming for on-device, open-weight voice assistants, inspired by Meta's Chameleon.
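A speech-to-text endpoint of this kind would typically accept an audio file as a multipart form upload. The following sketch builds such a request with the standard library only and does not send it. The endpoint path, the form field name `file`, and the helper `build_upload_request` are assumptions for illustration; the actual routes and field names should be taken from the server's /docs page.

```python
import uuid
from urllib.request import Request


def build_upload_request(url, filename, audio_bytes, field_name="file"):
    """Build (but do not send) a multipart/form-data POST carrying one file.

    `field_name` is an assumed form field; check the API docs for the real one.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return Request(
        url,
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Hypothetical path; replace it with the S2T route listed under /docs.
request = build_upload_request(
    "http://localhost:8000/hypothetical-s2t-path", "sample.wav", b"RIFF"
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; the response format depends on the service and is not described here.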

Maintenance & Community

Open research project seeking collaborators.
Discord community is available for discussion and feedback.
Mentions torchtune, WhisperSpeech, and Llama3 as foundational components.

Licensing & Compatibility

The README does not explicitly state a license. The project's foundation on WhisperSpeech (MIT License) and Llama3 (custom license) suggests potential compatibility, but explicit clarification is needed for commercial use.

Limitations & Caveats