XinhaoMei: Large-scale audio-language dataset and multimodal research tools
Top 99.4% on SourcePulse
WavCaps addresses the challenge of data scarcity in audio-language (AL) multimodal learning by introducing a large-scale, weakly-labelled dataset of approximately 400,000 audio clips with paired captions. It targets researchers in AL multimodal learning, providing a resource to overcome the limitations of smaller existing datasets and enabling the development of state-of-the-art models.
How It Works
The dataset was created by sourcing audio clips from FreeSound, BBC Sound Effects, SoundBible, and the AudioSet Strongly-labelled Subset. Raw descriptions from web sources were processed through a three-stage pipeline that leverages ChatGPT to filter noisy data and generate high-quality captions. This approach enhances the usability of web-scraped audio data for tasks like automated audio captioning and audio-language retrieval.
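As a rough illustration of that kind of pipeline, the sketch below chains a pre-filter, a rewriting step, and a post-filter. The stage boundaries, the word-count thresholds, and the names `clean_captions`, `rewrite`, and `toy_rewrite` are assumptions made for this example, not details taken from the WavCaps README. `rewrite` stands in for the ChatGPT call and is passed in as a plain function so the code runs offline.

```python
from typing import Callable, Iterable, List, Optional


def clean_captions(
    raw_descriptions: Iterable[str],
    rewrite: Callable[[str], Optional[str]],
    min_words: int = 2,
    max_words: int = 30,
) -> List[str]:
    """Hypothetical three-stage cleanup: pre-filter, rewrite, post-filter."""
    captions: List[str] = []
    for raw in raw_descriptions:
        text = " ".join(raw.split())  # collapse runs of whitespace

        # Stage 1 (assumed): drop descriptions too short to describe a sound.
        if len(text.split()) < min_words:
            continue

        # Stage 2: rewrite the description into a caption. The callable is a
        # placeholder for an LLM request and may return None to reject input.
        caption = rewrite(text)
        if not caption:
            continue

        # Stage 3 (assumed): keep only captions within the length bounds.
        word_count = len(caption.split())
        if min_words <= word_count <= max_words:
            captions.append(caption.strip())
    return captions


def toy_rewrite(text: str) -> Optional[str]:
    """Offline stand-in for the LLM call: rejects file-name-like text."""
    if text.lower().endswith((".wav", ".mp3")):
        return None
    return text[0].upper() + text[1:].rstrip(".") + "."


if __name__ == "__main__":
    raw = ["dog barking  in a park", "rec_0012.wav", "rain", "take 3 final.mp3"]
    print(clean_captions(raw, toy_rewrite))
```

Passing the rewriting step in as a callable keeps the filtering logic testable without network access; a real run would swap `toy_rewrite` for a function that sends a prompt to a language model.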
Quick Start & Requirements
The WavCaps dataset is downloadable via HuggingFace. The repository includes source code and pre-trained models for downstream tasks such as audio-language retrieval, automated audio captioning, and zero-shot audio classification. Specific installation commands, detailed dependency lists (e.g., Python versions, GPU requirements), or setup time estimates are not provided in the README.
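Zero-shot audio classification with a contrastive audio-language model generally comes down to comparing one audio embedding against the text embeddings of candidate label prompts. The sketch below shows only that comparison step. The function names and the toy three-dimensional vectors are hypothetical; in practice the embeddings would come from the audio and text encoders of a pre-trained model, whose loading code is not described in this summary.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def zero_shot_classify(
    audio_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the label whose text embedding is closest to the audio embedding."""
    if not label_embeddings:
        raise ValueError("at least one candidate label is required")
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(audio_embedding, label_embeddings[label]),
    )


if __name__ == "__main__":
    # Toy embeddings, typed by hand purely for illustration.
    labels = {
        "dog barking": [0.9, 0.1, 0.0],
        "rain falling": [0.0, 0.2, 0.9],
    }
    clip = [0.8, 0.3, 0.1]
    print(zero_shot_classify(clip, labels))
```

The same similarity score can rank captions for a given clip, or clips for a given caption, which is the basic operation behind audio-language retrieval.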
Maintenance & Community
The provided README does not contain information regarding maintenance, community channels (like Discord or Slack), or notable contributors.
Licensing & Compatibility
The WavCaps dataset is for academic, non-commercial research only, and the audio clips may be used solely for research. The models are created under a UK data copyright exemption for non-commercial research. Commercial use, including integration into closed-source commercial products, is prohibited.
Limitations & Caveats
The primary limitation is the restrictive academic-only license, which prohibits commercial use. Although ChatGPT is used to filter the data and generate captions, the raw descriptions were noted as "highly noisy," so some residual noise or bias may remain in the dataset after filtering.
Last updated 1 year ago; status: Inactive