Joint speech-language model for direct audio response
Gazelle is a joint speech-language model designed to process audio input directly, enabling conversational AI that responds to spoken language. It targets researchers and developers interested in multimodal AI and speech-enabled applications. The primary benefit is a unified model that handles both speech recognition and language understanding, simplifying the pipeline for audio-based interactions.
How It Works
Gazelle integrates speech and language processing into a single model, eliminating the need for separate Automatic Speech Recognition (ASR) and Large Language Model (LLM) components. This joint approach processes audio directly and potentially more efficiently, letting the model understand and respond to spoken commands or queries without an intermediate text-conversion step. The inference code is based on Hugging Face's Llava implementation (see the sketch under Quick Start & Requirements below).
Quick Start & Requirements
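Because the inference code follows Hugging Face's Llava interface, loading and querying the model should look like any other transformers multimodal model. The sketch below is illustrative rather than an official quick start: the gazelle package import, the GazelleForConditionalGeneration class, the tincans-ai/gazelle-v0.2 checkpoint id, the <|audio|> prompt placeholder, and the wav2vec2 audio processor are assumptions about the repository's API; consult the repo README for the exact names.

```python
# Illustrative inference sketch for a Llava-style speech-language model.
# All names below (package, class, checkpoint, prompt token) are assumptions;
# verify against the Gazelle repository before use.
import torch
import transformers
from gazelle import GazelleForConditionalGeneration  # assumed package layout

model_id = "tincans-ai/gazelle-v0.2"  # assumed checkpoint id
model = GazelleForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Audio is assumed to be encoded by a wav2vec2-style front end at 16 kHz mono.
audio_processor = transformers.Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-base-960h"
)

def respond(audio_array, prompt="Listen to the audio and answer: \n<|audio|>"):
    """Generate a text response to raw audio plus a text prompt."""
    # The <|audio|> placeholder marks where audio embeddings are spliced into
    # the token sequence, mirroring how Llava splices in image embeddings.
    audio_values = audio_processor(
        audio=audio_array, sampling_rate=16000, return_tensors="pt"
    ).input_values.to(model.device, torch.bfloat16)
    msgs = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids=input_ids, audio_values=audio_values, max_new_tokens=64
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

In this layout, requirements reduce to torch, transformers, and the repository's own package: there is no separate ASR stage, since the raw audio tensor is fed to the model directly alongside the tokenized prompt.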
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The v0.2 model is noted as not robust to adversarial attacks or jailbreaks and is not recommended for production use. The initial checkpoints are described as "backproppin' on a budget" and may not hold up under many real-world conditions.