LinaCodec by ysharma3501

Highly compressive audio tokenizer for speech models

Created 1 month ago
256 stars

Top 98.5% on SourcePulse

Project Summary

LinaCodec is an audio tokenizer for speech models. It encodes audio at 12.5 tokens per second (171 bps) and decodes to high-fidelity 48kHz audio. It targets Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models, where it improves speed and quality, and it enables additional capabilities such as voice conversion and audio super-resolution.
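
The token rate and bitrate quoted above imply a per-token bit budget, which is easy to check by hand. The sketch below uses only the two figures stated in this summary. The helper names are illustrative, and the implied number of token values is an inference from rounded figures, not a number published by the project.

```python
# Stated in the summary above.
TOKENS_PER_SECOND = 12.5
BITS_PER_SECOND = 171.0


def bits_per_token(bps: float, tokens_per_sec: float) -> float:
    """Average number of bits carried by each token."""
    return bps / tokens_per_sec


def implied_token_values(bits: float) -> float:
    """Distinct token values needed to carry `bits` bits per token."""
    return 2.0 ** bits


bits = bits_per_token(BITS_PER_SECOND, TOKENS_PER_SECOND)
print(f"{bits:.2f} bits per token")
print(f"about {implied_token_values(bits):,.0f} distinct token values")
print(f"{TOKENS_PER_SECOND * 60:.0f} tokens per minute of speech")
```

This works out to roughly 13.7 bits per token, or on the order of 13,000 distinct token values, and 750 tokens for a minute of speech. Because 171 bps is presumably rounded, the implied vocabulary size is only approximate.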

How It Works

LinaCodec uses a Dual-Path Vocos Decoder, which enables high-quality 48kHz audio reconstruction from 24kHz Vocos with significantly less training data. It uses a distilled WavLM Base+ to speed up the encoder while maintaining quality, and it adds a custom Snake-based upsampling block, inspired by BigVGAN, for feature enhancement. Together these allow extreme compression with high-fidelity output.
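
For readers unfamiliar with the Snake activation that BigVGAN popularized for audio, here is a minimal sketch of the standard formula, x + sin²(αx)/α. It is not LinaCodec's upsampling block, whose design this summary does not describe. In BigVGAN the α parameter is learned per channel; here it is a plain argument, and the function names are illustrative.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Standard Snake activation: x + sin^2(alpha * x) / alpha.

    This sketch assumes alpha > 0; alpha sets the frequency of the
    periodic term added on top of the identity.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive in this sketch")
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_all(values, alpha: float = 1.0):
    """Apply snake() element-wise to an iterable of floats."""
    return [snake(v, alpha) for v in values]


samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
for v, y in zip(samples, snake_all(samples, alpha=2.0)):
    print(f"snake({v:+.1f}) = {y:+.4f}")
```

The identity term keeps the function monotonic overall, while the sin² term gives it a periodic bias, which is the property that makes it attractive for modelling waveforms.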

Quick Start & Requirements

  • Primary install / run command: pip install git+https://github.com/ysharma3501/LinaCodec.git
  • Non-default prerequisites and dependencies: The model is automatically downloaded from Hugging Face (YatharthS/LinaCodec). The primary output is 48kHz audio.
  • Links: Hugging Face model: https://huggingface.co/YatharthS/LinaCodec

Highlighted Details

  • Compresses speech to 12.5 tokens/sec at 171 bps (60x more compression than DAC).
  • Decodes to high-quality 48kHz audio, offering superior clarity over standard 16kHz/24kHz outputs.
  • Runs at 200x realtime for encoding and 400x realtime for decoding (faster with batching).
  • Enables TTS models to run up to 800x realtime, significantly faster than existing solutions like MiraTTS.
  • Indirectly supports further applications, including voice conversion, audio super-resolution, and audio denoising.
  • Facilitates fast training of high-quality TTS models in under 1 day.
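
To make the realtime factors in the list above concrete, the sketch below converts them into wall-clock time per hour of audio. It uses only the factors quoted in this summary. The hardware and batch size behind those figures are not stated here, so the results are illustrative, and the function and dictionary names are hypothetical.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to process `audio_seconds` of audio
    at a given realtime factor (e.g. 200 means 200x faster than playback)."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ONE_HOUR = 3600.0

# Realtime factors as quoted in the summary; the TTS figure is an "up to" value.
STATED_FACTORS = {
    "encode": 200.0,
    "decode": 400.0,
    "tts (upper bound)": 800.0,
}

for stage, factor in STATED_FACTORS.items():
    seconds = processing_seconds(ONE_HOUR, factor)
    print(f"{stage}: {seconds:.1f} s per hour of audio")
```

At the quoted factors, one hour of audio corresponds to about 18 seconds of encoding, 9 seconds of decoding, and at best 4.5 seconds of TTS generation.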

Maintenance & Community

The project is maintained by ysharma3501, who can be reached at yatharthsharma3501@gmail.com. Planned work includes a detailed article and possibly a paper on the underlying techniques.

Licensing & Compatibility

The README does not specify a software license. Compatibility for commercial use or linking with closed-source projects is undetermined without a clear license.

Limitations & Caveats

The project is heavily based on the kanade-tokenizer. Documentation and research papers explaining the core techniques have not been released yet and are planned for the future. How well these techniques apply to other codecs is still under investigation.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 23 stars in the last 30 days
