smol-audio by Deep-unlearning

Audio AI model fine-tuning and inference notebooks

Created 4 months ago

418 stars

Top 69.7% on SourcePulse

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

Smol Audio is a curated collection of practical, Colab-friendly Jupyter notebooks for fine-tuning, optimizing, and customizing state-of-the-art audio AI models. It targets researchers, engineers, and power users, and builds on the Hugging Face ecosystem so that models can be adapted to diverse applications without extensive infrastructure.

How It Works

The notebooks build on the Hugging Face ecosystem, using libraries such as transformers and datasets to streamline the fine-tuning pipeline. Smol Audio demonstrates both full model fine-tuning and more resource-efficient, parameter-efficient techniques such as LoRA (Low-Rank Adaptation). Full fine-tuning updates every weight, while LoRA freezes the base model and trains only small low-rank matrices, which reduces the compute and time usually needed to adapt large audio models. Either route lets users customize a model for a specific language, domain, or task. The notebooks abstract away much of the boilerplate code and focus on the practical application of these techniques.
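To make the LoRA saving concrete, here is a minimal pure-Python sketch of the idea. It is an illustration under stated assumptions, not code from the notebooks: the layer size (1024 x 1024), the rank (8), and the helper names are all hypothetical, and real training would go through a library rather than nested lists.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable parameter counts for one linear layer."""
    full = d_in * d_out            # full fine-tuning updates every weight
    lora = rank * (d_in + d_out)   # LoRA trains A (rank x d_in) and B (d_out x rank)
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_in=1024, d_out=1024, rank=8)
print(f"full fine-tuning: {full} trainable weights")
print(f"LoRA (rank 8):    {lora} trainable weights")
```

For this hypothetical layer, LoRA trains 16,384 values instead of 1,048,576. Because B is conventionally initialized to zeros, the adapted layer starts out identical to the frozen base layer and only drifts away from it as training proceeds.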

Quick Start & Requirements

The notebooks are meant to be run in Google Colab and are linked directly from the repository, which removes the need to configure a local environment. Prerequisites are a working knowledge of Python and the Hugging Face libraries. Each notebook specifies the pre-trained models and datasets it uses, so these must be accessible for download when the notebook runs.
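If a direct Colab link is not at hand, a notebook hosted on GitHub can usually be opened in Colab by rewriting its URL. The sketch below builds such a link; the owner and repository names come from this page, while the branch name `main` and the notebook filename are assumptions used only for illustration.

```python
from urllib.parse import quote


def colab_url(owner, repo, notebook_path, branch="main"):
    """Build a Colab link for a notebook stored in a public GitHub repository."""
    # Percent-encode the path so spaces or other special characters survive.
    safe_path = quote(notebook_path)
    return (
        "https://colab.research.google.com/github/"
        f"{owner}/{repo}/blob/{branch}/{safe_path}"
    )


# "example.ipynb" is a placeholder, not a real file in the repository.
print(colab_url("Deep-unlearning", "smol-audio", "example.ipynb"))
```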

Highlighted Details

Advanced ASR Fine-tuning: Offers detailed notebooks for fine-tuning prominent Automatic Speech Recognition (ASR) models, including OpenAI's Whisper, IBM's Granite Speech (with a specific example for Italian ASR using the YODAS-Granary dataset), NVIDIA's Parakeet CTC, and Voxtral ASR with prompt masking capabilities.
Audio Captioning Customization: Provides guidance on fine-tuning NVIDIA's Audio Flamingo 3 model for generating descriptive audio captions, supporting both full fine-tuning and efficient LoRA adaptations.
Multimodal Capabilities: Features a notebook for zero-shot inference using Meta's Perception Encoder for Audio-Video (PE-AV) model, enabling sophisticated tasks such as video classification and cross-modal retrieval between audio and text.
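The prompt masking mentioned for Voxtral ASR above is commonly implemented by excluding prompt tokens from the training loss. The following sketch shows that general idea, not the notebook's actual code: the function name and token IDs are hypothetical, and it relies on the convention that a label of -100 is ignored by PyTorch's cross-entropy loss and by Hugging Face training utilities.

```python
IGNORE_INDEX = -100  # label value skipped by the loss under the usual convention


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, hiding the first prompt_length tokens from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must be between 0 and len(input_ids)")
    # Prompt positions get IGNORE_INDEX; transcript positions keep their token IDs.
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Hypothetical token IDs: two prompt tokens followed by three transcript tokens.
labels = mask_prompt_labels([5, 6, 7, 8, 9], prompt_length=2)
print(labels)
```

With labels built this way, the model is only penalized for errors on the transcript tokens, so it learns to produce the transcription rather than to reproduce the instruction prompt.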

Maintenance & Community

The README excerpt gives no information about the project's maintenance schedule, active contributors, community support channels (such as Discord or Slack), or public roadmap.

Licensing & Compatibility

The README excerpt does not state the project's open-source license, so its terms for commercial use, derivative works, and integration into closed-source software are unknown.

Limitations & Caveats

The README warns that GitHub's native rendering of Jupyter notebooks can be inconsistent. It recommends using Google Colab to view and run the notebooks, which avoids these display and execution issues.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

2 stars in the last 30 days