bark by suno-ai

Generative audio model for realistic speech and sound effects

Created 2 years ago

38,992 stars

Top 0.8% on SourcePulse

24 Experts Love This Project

Tobi Lutke

Cofounder of Shopify

Jiaming Song

Chief Scientist at Luma AI

Pawel Garbacki

Cofounder of Fireworks AI

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

and 20 more!

Project Summary

Bark is a text-prompted generative audio model that produces realistic, multilingual speech, music, and sound effects. It's designed for researchers and power users seeking a flexible audio generation tool beyond traditional text-to-speech. Bark offers creative control and can generate non-speech sounds like laughter and sighs, with a focus on realistic voice and prosody.

How It Works

Bark is a transformer-based, fully generative text-to-audio model, similar to AudioLM and Vall-E, that operates on a quantized audio representation from EnCodec. Unlike conventional TTS, Bark converts text directly to audio without intermediate phonemes, which lets it generalize to music, sound effects, and non-speech sounds. It tries to match the tone, pitch, and emotion of a chosen voice preset, but it does not support custom voice cloning.

Quick Start & Requirements

Install with pip directly from the repository (pip install git+https://github.com/suno-ai/bark.git), or clone the repository and install it locally.
Tested with PyTorch 2.0+; GPU acceleration has been tested with CUDA 11.7 and 12.0.
The full model needs about 12GB of VRAM; for GPUs with less than 4GB of VRAM, set the environment variable SUNO_USE_SMALL_MODELS=True to load smaller models.
Official Docs: https://suno-ai.notion.site/Bark-models-and-how-to-use-them-f5101f7760014427877312006307230a
Hugging Face Integration: https://huggingface.co/docs/transformers/main/en/model_doc/bark
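The steps above can be sketched in Python as follows. This is a minimal illustration, assuming Bark is already installed and that the package exposes generate_audio, preload_models, and SAMPLE_RATE as described in the upstream README. The helper names enable_small_models and synthesize_to_wav and the default output path are hypothetical, and SciPy is assumed to be available for writing the WAV file. The environment variable is set before bark is imported on the assumption that it is read at import time.

```python
import os


def enable_small_models(env=os.environ):
    """Request Bark's smaller checkpoints; call this before importing bark."""
    env["SUNO_USE_SMALL_MODELS"] = "True"
    return env


def synthesize_to_wav(text, path="bark_generation.wav"):
    """Generate audio for `text` and write it to `path` as a WAV file."""
    # Imported lazily so that any environment variable set above is
    # already in place when the bark package is first loaded.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()              # downloads and caches weights on first use
    audio = generate_audio(text)  # NumPy array sampled at SAMPLE_RATE Hz
    write_wav(path, SAMPLE_RATE, audio)
    return path
```

Calling enable_small_models() and then synthesize_to_wav("Hello, my name is Suno.") would write a short clip to the working directory. The first call is slow because the model weights have to be downloaded.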

Highlighted Details

Supports 100+ speaker presets across multiple languages.
Can generate music, background noise, and sound effects alongside speech.
Detects the language automatically from the input text and, for code-switched text, attempts to use the native accent of each language.
Long-form generation capabilities are documented in provided notebooks.
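As a rough illustration of how a speaker preset might be selected, the sketch below builds a preset name and passes it to generate_audio through its history_prompt argument. The "v2/<language>_speaker_<index>" naming pattern, the index range 0-9, and the language list are assumptions based on Bark's published speaker library and may not match every release. The helpers voice_preset and speak are hypothetical names introduced here.

```python
# Assumption: language codes covered by Bark's published speaker library.
SUPPORTED_LANGUAGES = (
    "de", "en", "es", "fr", "hi", "it", "ja",
    "ko", "pl", "pt", "ru", "tr", "zh",
)


def voice_preset(language, index):
    """Build a preset name such as "v2/en_speaker_6" (assumed naming scheme)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if not 0 <= index <= 9:
        raise ValueError("speaker index is assumed to lie in 0..9")
    return f"v2/{language}_speaker_{index}"


def speak(text, language="en", index=6):
    """Generate audio for `text` using the chosen speaker preset."""
    # Imported lazily so the helper above can be used without bark installed.
    from bark import generate_audio

    return generate_audio(text, history_prompt=voice_preset(language, index))
```

Under these assumptions, 13 languages with 10 speakers each gives 130 presets, which is consistent with the "100+" figure above.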

Maintenance & Community

Active community on Discord: https://discord.gg/J2B2vsjKuE
Model playground early access sign-up available.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use.

Limitations & Caveats

As a GPT-style research model, Bark may deviate unexpectedly from prompts and produces higher-variance output than traditional TTS. A single generation typically covers only about 13-14 seconds of audio, so longer passages have to be generated in pieces. Audio quality varies and can sometimes resemble an older phone call.
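One common way to work around the per-generation length limit is to split the text at sentence boundaries, generate each piece with the same speaker preset, and join the results with a short pause. The sketch below illustrates that idea. It is not the method from the project's notebooks: the 200-character budget is a guessed heuristic for staying under roughly 13 seconds of speech, and split_into_chunks, generate_long_form, and the 0.25-second pause are hypothetical choices.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_long_form(text, history_prompt="v2/en_speaker_6", pause_seconds=0.25):
    """Generate each chunk with the same preset and concatenate the audio."""
    # Imported lazily so the splitter can be used without bark installed.
    import numpy as np
    from bark import SAMPLE_RATE, generate_audio

    silence = np.zeros(int(pause_seconds * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for chunk in split_into_chunks(text):
        pieces.append(generate_audio(chunk, history_prompt=history_prompt))
        pieces.append(silence)       # brief pause between chunks
    return np.concatenate(pieces) if pieces else silence[:0]
```

Reusing one history_prompt for every chunk is meant to keep the voice consistent across the clip, although some drift between chunks should still be expected given the model's variance.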

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

125 stars in the last 30 days