LinaCodec by ysharma3501

Highly compressive audio tokenizer for speech models

Created 6 months ago

269 stars

Top 95.3% on SourcePulse

Project Summary

LinaCodec is an audio tokenizer for speech models. It encodes audio at 12.5 tokens per second (171 bps) and decodes to high-fidelity 48kHz audio. For Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models it improves speed and quality, and it also enables capabilities such as voice conversion and audio super-resolution.
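To make the headline numbers concrete, the sketch below works through the arithmetic they imply. Only the three constants come from the summary; the helper names are hypothetical, and the "implied vocabulary" figure assumes every token is an independent symbol from a single codebook, which the summary does not confirm.

```python
import math

# Figures quoted in the summary above.
TOKENS_PER_SECOND = 12.5
BITRATE_BPS = 171.0
OUTPUT_SAMPLE_RATE = 48_000  # Hz

def bits_per_token(bitrate_bps=BITRATE_BPS, token_rate=TOKENS_PER_SECOND):
    """Average number of bits carried by one token."""
    return bitrate_bps / token_rate

def implied_vocab_size(bits):
    """Vocabulary size if each token were one symbol from a single codebook (assumption)."""
    return 2 ** bits

def tokens_for_duration(seconds, token_rate=TOKENS_PER_SECOND):
    """Number of tokens needed to cover a clip of the given length."""
    return math.ceil(seconds * token_rate)

def samples_per_token(sample_rate=OUTPUT_SAMPLE_RATE, token_rate=TOKENS_PER_SECOND):
    """Decoded output samples that each token has to account for."""
    return sample_rate / token_rate

if __name__ == "__main__":
    bits = bits_per_token()
    print(f"bits per token:     {bits:.2f}")
    print(f"implied vocabulary: {implied_vocab_size(bits):,.0f}")
    print(f"tokens for 10 s:    {tokens_for_duration(10)}")
    print(f"samples per token:  {samples_per_token():.0f}")
```

Under these assumptions each token carries about 13.7 bits and stands in for 80 ms of audio, so a language model built on the codec sees a very short sequence even for long utterances.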

How It Works

LinaCodec employs a novel Dual-Path Vocos Decoder that reconstructs high-quality 48kHz audio from 24kHz Vocos with significantly less training data. A distilled WavLM Base+ speeds up the encoder while maintaining quality, and a custom Snake-based upsampling block, inspired by BigVGAN, enhances features. Together these allow extreme compression with high-fidelity output.
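The summary does not describe the layout of the custom upsampling block, so the sketch below illustrates only the Snake activation that BigVGAN-style blocks are built around: snake(x) = x + sin²(αx) / α. It is a plain-Python illustration, not LinaCodec's implementation; the function names and the default α are assumptions, and in a real network α would typically be a learned per-channel parameter.

```python
import math

def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha.

    Adds a periodic term to the identity, which suits periodic signals
    such as audio waveforms.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha

def snake_many(values, alpha=1.0):
    """Apply the activation elementwise to a sequence of floats."""
    return [snake(v, alpha) for v in values]

if __name__ == "__main__":
    for value, activated in zip([-1.0, 0.0, 1.0], snake_many([-1.0, 0.0, 1.0])):
        print(f"snake({value:+.1f}) = {activated:+.4f}")
```

Because the sin² term is never negative, the output is always at least the input, and larger α values produce faster, smaller oscillations around the identity line.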

Quick Start & Requirements

Primary install / run command: pip install git+https://github.com/ysharma3501/LinaCodec.git
Non-default prerequisites and dependencies: The model is automatically downloaded from Hugging Face (YatharthS/LinaCodec). The primary output is 48kHz audio.
Links: Hugging Face model: https://huggingface.co/YatharthS/LinaCodec

Highlighted Details

Compresses speech to 12.5 tokens per second at 171 bps, about 60x more compression than DAC.
Decodes to high-quality 48kHz audio, offering superior clarity over standard 16kHz/24kHz outputs.
Encodes at 200x realtime and decodes at 400x realtime, with higher throughput when batching.
Enables TTS models to run at up to 800x realtime, significantly faster than existing solutions like MiraTTS.
Supports indirect applications including voice conversion, audio super-resolution, and audio denoising.
Facilitates fast training of high-quality TTS models in under one day.
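As a rough guide to what the realtime factors above mean in practice, the sketch below converts them into wall-clock time for one hour of audio. The 200x and 400x figures are the ones quoted in the list; the summary does not state the hardware behind them, so the results are illustrative only, and the function name is hypothetical.

```python
def processing_seconds(audio_seconds, realtime_factor):
    """Wall-clock seconds needed to process audio at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor

ONE_HOUR = 3600.0          # seconds of audio
ENCODE_REALTIME = 200      # quoted encoding speed
DECODE_REALTIME = 400      # quoted decoding speed

encode_s = processing_seconds(ONE_HOUR, ENCODE_REALTIME)
decode_s = processing_seconds(ONE_HOUR, DECODE_REALTIME)

if __name__ == "__main__":
    print(f"encode 1 h of audio: {encode_s:.1f} s")
    print(f"decode 1 h of audio: {decode_s:.1f} s")
    print(f"round trip:          {encode_s + decode_s:.1f} s")
```

At the quoted rates, tokenizing an hour of speech takes on the order of tens of seconds, which is what makes preprocessing a large TTS training corpus in well under a day plausible.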

Maintenance & Community

The project is maintained by ysharma3501, with contact available via yatharthsharma3501@gmail.com. Future development includes releasing a detailed article and potentially a paper on the underlying techniques.

Licensing & Compatibility

The README does not specify a software license. Compatibility for commercial use or linking with closed-source projects is undetermined without a clear license.

Limitations & Caveats

The project is heavily based on the kanade-tokenizer. Documentation and research papers explaining the core techniques have not yet been released and are planned for the future. Whether these techniques apply to other codecs is still under investigation.

Explore Similar Projects

HiFTNet by yl4579

LongCat-Audio-Codec by meituan-longcat

stable-codec by Stability-AI

unified-audio by alibaba

DiffGAN-TTS by keonlee9420

LavaSR by ysharma3501

VITA-Audio by VITA-MLLM

GPA by AutoArk

HierSpeechpp by sh-lee-prml

WhisperSpeech by WhisperSpeech

audiolm-pytorch by lucidrains

encodec by facebookresearch