WavCaps  by XinhaoMei

Large-scale audio-language dataset and multimodal research tools

Created 2 years ago
253 stars


Project Summary

WavCaps addresses data scarcity in audio-language (AL) multimodal learning with a large-scale, weakly-labelled dataset of approximately 400,000 audio clips paired with captions. It is aimed at researchers in AL multimodal learning who need more training data than the smaller existing datasets provide, with the goal of enabling state-of-the-art models.

How It Works

The dataset was created by sourcing audio clips from FreeSound, BBC Sound Effects, SoundBible, and the AudioSet Strongly-labelled Subset. Raw descriptions from web sources were processed through a three-stage pipeline that leverages ChatGPT to filter noisy data and generate high-quality captions. This approach enhances the usability of web-scraped audio data for tasks like automated audio captioning and audio-language retrieval.
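The filter-and-rewrite idea can be sketched in a few lines. This is only an illustration: the summary does not describe what each of the three stages does, so the split below (rule-based pre-filter, LLM rewrite, length-based post-filter), the word limits, and every function name are assumptions. The ChatGPT call is replaced by a stand-in callable so the sketch runs offline.

```python
import re
from typing import Callable, Iterable, List, Optional

# Hypothetical limits chosen for illustration; not taken from the WavCaps README.
MIN_WORDS = 2
MAX_WORDS = 30


def pre_filter(raw: str) -> Optional[str]:
    """Stage 1 (assumed): strip URLs and extra whitespace, drop empty descriptions."""
    text = re.sub(r"https?://\S+", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


def post_filter(caption: str) -> Optional[str]:
    """Stage 3 (assumed): keep only captions within a plausible word-count range."""
    n_words = len(caption.split())
    return caption if MIN_WORDS <= n_words <= MAX_WORDS else None


def clean_descriptions(
    raw_descriptions: Iterable[str],
    rewrite: Callable[[str], str],
) -> List[str]:
    """Run the three assumed stages; `rewrite` stands in for the LLM call (stage 2)."""
    captions: List[str] = []
    for raw in raw_descriptions:
        text = pre_filter(raw)
        if text is None:
            continue
        caption = post_filter(rewrite(text).strip())
        if caption is not None:
            captions.append(caption)
    return captions


def toy_rewrite(text: str) -> str:
    """Offline stand-in for ChatGPT: lowercases and drops an audio file extension."""
    return re.sub(r"\.(wav|mp3|flac)\b", "", text.lower())


raw = ["Dog BARKING in park.wav http://example.com", "   ", "x"]
print(clean_descriptions(raw, toy_rewrite))  # ['dog barking in park']
```

Passing the rewrite step in as a callable keeps the rule-based stages testable without network access; a real LLM client could be swapped in behind the same signature.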

Quick Start & Requirements

The WavCaps dataset can be downloaded from HuggingFace. The repository includes source code and pre-trained models for downstream tasks such as audio-language retrieval, automated audio captioning, and zero-shot audio classification. The README does not provide installation commands, a detailed dependency list (e.g., Python version, GPU requirements), or setup time estimates.

Highlighted Details

  • Large-scale dataset: ~400k audio clips with paired captions, sourced from multiple web repositories and AudioSet.
  • ChatGPT-assisted processing: A three-stage pipeline uses ChatGPT to filter noisy raw descriptions and generate high-quality captions.
  • Performance claims: Systems trained on WavCaps are reported to outperform previous state-of-the-art (SOTA) models by a significant margin; no figures are given in this summary.
  • Task support: Provides code and models for audio-language retrieval, automated audio captioning, and zero-shot audio classification.
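To make the retrieval and zero-shot classification tasks concrete, here is a minimal sketch of how both reduce to nearest-neighbour search, assuming an audio encoder and a text encoder that map into a shared embedding space. No WavCaps model is loaded: the vectors are made-up toy values, and the function names are hypothetical rather than part of the repository's code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_texts(audio_emb: Sequence[float], text_embs: Dict[str, Sequence[float]]) -> List[str]:
    """Audio-to-text retrieval: order candidate texts by similarity to the audio clip."""
    return sorted(text_embs, key=lambda t: cosine(audio_emb, text_embs[t]), reverse=True)


def zero_shot_classify(audio_emb: Sequence[float], label_embs: Dict[str, Sequence[float]]) -> str:
    """Zero-shot classification: pick the label prompt closest to the audio clip."""
    return rank_texts(audio_emb, label_embs)[0]


# Toy embeddings standing in for encoder outputs (assumed, not real model outputs).
audio = [0.9, 0.1, 0.0]
labels = {
    "a dog barking": [1.0, 0.0, 0.0],
    "rain falling": [0.0, 1.0, 0.0],
    "a car engine": [0.0, 0.0, 1.0],
}
print(zero_shot_classify(audio, labels))  # a dog barking
```

With real encoders, the label strings would be embedded by the text encoder and the clip by the audio encoder; the ranking step stays the same.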

Maintenance & Community

The README gives no information about maintenance, community channels (such as Discord or Slack), or notable contributors.

Licensing & Compatibility

The WavCaps dataset is for academic and non-commercial research use only, and the same restriction applies to the audio clips. The models are created under a UK data copyright exemption for non-commercial research. Commercial use, including integration into closed-source commercial products, is prohibited.

Limitations & Caveats

The main limitation is the academic-only license, which rules out commercial use. In addition, the raw descriptions were noted as "highly noisy," so some noise or bias may remain in the captions despite the ChatGPT-based filtering and caption generation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
