LinaCodec by ysharma3501
Highly compressive audio tokenizer for speech models
Top 98.5% on SourcePulse
LinaCodec is an audio tokenizer for speech models. It encodes audio at 12.5 tokens per second (171 bps) and decodes to high-fidelity 48kHz audio. It benefits Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models by improving speed and quality, and it enables new capabilities such as voice conversion and audio super-resolution.
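To put the quoted rates in perspective, the short sketch below works through the arithmetic. Only the 12.5 tokens per second and 171 bps figures come from the summary. The vocabulary-size estimate assumes a single token stream with no entropy coding, and the compression ratio assumes a mono 16-bit PCM reference at 48kHz. Both are illustrative assumptions, not details confirmed by the project.

```python
# Figures quoted in the summary above.
TOKENS_PER_SECOND = 12.5
BITS_PER_SECOND = 171

# Average number of bits carried by each token.
bits_per_token = BITS_PER_SECOND / TOKENS_PER_SECOND

# Assumption: one token stream, fixed-size vocabulary, no entropy coding.
# Under that assumption the vocabulary would hold about 2**bits_per_token entries.
implied_vocab_size = 2 ** bits_per_token

# Assumption: the uncompressed reference is mono 16-bit PCM at 48 kHz.
pcm_bits_per_second = 48_000 * 16
compression_ratio = pcm_bits_per_second / BITS_PER_SECOND


def tokens_for(seconds: float) -> float:
    """Number of tokens needed to represent `seconds` of audio."""
    return seconds * TOKENS_PER_SECOND


print(f"bits per token:        {bits_per_token:.2f}")
print(f"implied vocabulary:    ~{implied_vocab_size:,.0f} entries")
print(f"ratio vs. 16-bit PCM:  ~{compression_ratio:,.0f}x")
print(f"tokens for 10 seconds: {tokens_for(10):.0f}")
```

At this rate a 10-second utterance is only 125 tokens, which is why a language-model-style TTS or ASR system built on such a tokenizer has far fewer steps to process.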
How It Works
LinaCodec employs a novel Dual-Path Vocos Decoder that reconstructs high-quality 48kHz audio from 24kHz Vocos while requiring significantly less training data. It uses a distilled WavLM Base+ to speed up the encoder while maintaining quality, and adds a custom snake-based upsampling block, inspired by BigVGAN, for feature enhancement. Together, these choices allow extreme compression with high-fidelity output.
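The page does not describe the internals of the snake-based upsampling block, so the following is only a minimal sketch of the general idea. The snake activation itself, x + (1/alpha) * sin^2(alpha * x), is the periodic activation used in BigVGAN. The nearest-neighbour `upsample_nearest` helper and the `snake_upsample` composition are hypothetical stand-ins for illustration and are not LinaCodec's actual layers.

```python
import math
from typing import List


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The periodic term gives the network a bias toward periodic signals
    such as voiced speech; alpha controls the frequency of that term.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


def upsample_nearest(values: List[float], factor: int = 2) -> List[float]:
    """Hypothetical upsampler: repeat each sample `factor` times."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [v for v in values for _ in range(factor)]


def snake_upsample(values: List[float], factor: int = 2, alpha: float = 1.0) -> List[float]:
    """Illustrative composition: upsample, then apply snake element-wise.

    A real block would use learned layers (and typically a learned alpha
    per channel); this only shows the shape of the computation.
    """
    return [snake(v, alpha) for v in upsample_nearest(values, factor)]


if __name__ == "__main__":
    features = [0.0, 0.5, 1.0, -1.0]
    print(snake_upsample(features, factor=2, alpha=1.0))
```

A factor of 2 is used here only because it matches the 24kHz to 48kHz ratio mentioned above. Because sin^2 is non-negative, the activation never outputs less than its input, and it reduces to the identity wherever alpha * x is a multiple of pi.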
Quick Start & Requirements
pip install git+https://github.com/ysharma3501/LinaCodec.git
(… YatharthS/LinaCodec). The primary output is 48kHz audio.
Highlighted Details
Maintenance & Community
The project is maintained by ysharma3501, with contact available via yatharthsharma3501@gmail.com. Future development includes releasing a detailed article and potentially a paper on the underlying techniques.
Licensing & Compatibility
The README does not specify a software license. Compatibility for commercial use or linking with closed-source projects is undetermined without a clear license.
Limitations & Caveats
The project is heavily based on the kanade-tokenizer, and further documentation and research papers explaining the core techniques are planned for future release. The current scope of applicability for these techniques across various codecs is still under investigation.