tarsier by ByteDance

Video-language model for high-quality video descriptions and video understanding

created 1 year ago
448 stars

Top 66.9% on SourcePulse

Project Summary

Tarsier is a family of large-scale video-language models designed for high-quality video description and general video understanding. It targets researchers and developers working on advanced video AI applications. The models achieve state-of-the-art results on various benchmarks, offering strong video captioning and understanding capabilities.

How It Works

Tarsier employs a simple yet effective architecture: a CLIP-ViT visual encoder and a Large Language Model (LLM) decoder, connected by a projection layer. Frames are encoded independently and then concatenated before being fed into the LLM. This approach is enhanced by a two-stage training strategy: multi-task pre-training on a large dataset (40M video-text pairs for Tarsier2) and multi-grained instruction tuning. This meticulous training process, including fine-grained temporal alignment and Direct Preference Optimization (DPO), allows Tarsier to achieve superior performance.
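Below is a minimal PyTorch sketch of this encode-then-concatenate design. The module names, dimensions, and the `inputs_embeds` calling convention are placeholders for illustration, not the actual Tarsier implementation:

```python
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    """Hypothetical CLIP-ViT encoder + projection layer + LLM decoder pipeline."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP-ViT backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # maps visual tokens into LLM space
        self.llm = llm                                    # decoder-only language model

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, C, H, W); each frame is encoded independently
        b, t = frames.shape[:2]
        feats = self.vision_encoder(frames.flatten(0, 1))  # (b*t, patches, vision_dim)
        feats = self.projector(feats)                      # (b*t, patches, llm_dim)
        # Concatenate the per-frame token sequences along the sequence axis
        visual_tokens = feats.reshape(b, -1, feats.size(-1))
        # Prepend visual tokens to the text embeddings and decode
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        # Assumes an HF-style decoder that accepts precomputed embeddings
        return self.llm(inputs_embeds=inputs)
```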

Quick Start & Requirements

  • Install: Clone the repository, checkout the tarsier2 branch, and run bash setup.sh.
  • Prerequisites: Python 3.9 is recommended. Environment variables for Azure OpenAI Service may be needed for specific evaluations.
  • Models: Download checkpoints from Hugging Face (e.g., omni-research/Tarsier2-Recap-7b); see the download sketch after this list.
  • Data: Benchmarks like DREAM-1K and TVBench need to be downloaded separately.
  • Demo: Online demos and CLI/Gradio demos are available.
  • Docs: Tarsier Technical Report
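As a hedged illustration of the checkpoint download step (this uses the standard huggingface_hub API; the repo's own demo scripts handle actual model loading, and the usage shown here is an assumption):

```python
from huggingface_hub import snapshot_download

# Fetch the Tarsier2-Recap-7b weights from Hugging Face; returns the local
# directory where the checkpoint was cached.
ckpt_dir = snapshot_download("omni-research/Tarsier2-Recap-7b")

# Hypothetical usage: point the repo's CLI/Gradio demo scripts at ckpt_dir.
print(ckpt_dir)
```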

Highlighted Details

  • Tarsier2-7B(-0115) achieves state-of-the-art results across 16 public benchmarks, including video captioning, VQA, and grounding.
  • Outperforms GPT-4o in human side-by-side comparisons for video description (Tarsier2-7B-1105).
  • Introduces DREAM-1K, a challenging benchmark for fine-grained video description, and AutoDQ for evaluation.
  • Tarsier-34B achieved SOTA on 6 video understanding benchmarks and performed comparably to Gemini 1.5 Pro.

Maintenance & Community

The project is developed by ByteDance Research. Recent releases include Tarsier2-7B(-0115), Tarsier2-Recap-7B, and the Tarsier2 Technical Report. Links to Hugging Face model repositories and datasets are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for model checkpoints and datasets, especially for commercial use.

Limitations & Caveats

The Tarsier2-7B(-0115) model, while strong in video captioning, may exhibit limited instruction-following capabilities due to its post-training focus. A version with improved instruction-following is planned.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 21 stars in the last 30 days
