Dolphin  by DataoceanAI

Multilingual ASR model for diverse Eastern languages

Created 9 months ago
690 stars

Top 49.4% on SourcePulse

GitHubView on GitHub
Project Summary

A multilingual, multitask Automatic Speech Recognition (ASR) model developed by Dataocean AI and Tsinghua University, Dolphin addresses the need for accurate speech processing across diverse Eastern languages and Chinese dialects. It offers benefits in applications requiring robust speech-to-text capabilities, voice activity detection, segmentation, and language identification, targeting researchers and developers working with non-English audio data.

How It Works

Dolphin employs a joint CTC-Attention architecture, featuring an E-Branchformer-based encoder and a standard Transformer decoder, building upon established models like Whisper and OWSM. A key innovation is its two-level language token system, which distinguishes between language (e.g., <zh>) and region (e.g., <CN>), enabling finer-grained linguistic and regional diversity handling, particularly beneficial for extensive datasets.

Quick Start & Requirements

  • Installation: pip install -U dataoceanai-dolphin or from source: pip install git+https://github.com/SpeechOceanTech/Dolphin.git.
  • Prerequisites: FFmpeg is required for audio conversion to WAV format.
  • Dependencies: Supports CUDA, Apple MPS, Huawei Ascend NPU (requires specific torch_npu and CANN versions), and CPU.
  • Resources: Model sizes range from 140M (base) to 1679M (large) parameters.
  • Docs: Links to Paper, Huggingface, Modelscope, Openi, Wisemodel are mentioned.

Highlighted Details

  • Supports 40 Eastern languages and 22 Chinese dialects.
  • Performs Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), segmentation, and Language Identification (LID).
  • Available models include 'base' (140M, 33.3% WER) and 'small' (372M, 25.2% WER).
  • Trained on over 210,000 hours of proprietary and open-source data.

Maintenance & Community

No specific details on contributors, sponsorships, or community channels are present in the provided README text.

Licensing & Compatibility

  • License: Apache 2.0 License.
  • Compatibility: The Apache 2.0 license generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

Dolphin does not support translation tasks and omits the use of previous text and its related tokens. The medium and large models are not yet publicly available.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
22 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero Omar Sanseviero(DevRel at Google DeepMind), Patrick von Platen Patrick von Platen(Author of Hugging Face Diffusers; Research Engineer at Mistral), and
2 more.

pyctcdecode by kensho-technologies

0%
467
CTC beam search decoder for speech recognition
Created 4 years ago
Updated 2 years ago
Feedback? Help us improve.