tarsier by bytedance

Video-language model for high-quality video descriptions and video understanding

Created 1 year ago
494 stars

Top 62.6% on SourcePulse

Project Summary

Tarsier is a family of large-scale video-language models designed for high-quality video description and general video understanding. It targets researchers and developers building advanced video AI applications, and its models achieve state-of-the-art results on a wide range of video captioning and understanding benchmarks.

How It Works

Tarsier employs a simple yet effective architecture: a CLIP-ViT visual encoder and a Large Language Model (LLM) decoder, connected by a projection layer. Frames are encoded independently, and the resulting visual tokens are concatenated before being fed into the LLM (sketched below). On top of this architecture sits a two-stage training strategy: multi-task pre-training on a large dataset (40M video-text pairs for Tarsier2) followed by multi-grained instruction tuning. The training process also incorporates fine-grained temporal alignment and Direct Preference Optimization (DPO), which together drive the models' strong benchmark results.
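
To make the wiring concrete, here is a minimal sketch of the encode-project-concatenate flow in PyTorch. Everything in it is a toy stand-in: the linear "visual encoder", the tiny transformer standing in for the LLM, and all dimensions are placeholders chosen for illustration, not Tarsier's actual modules or sizes; only the data flow mirrors the description above.

```python
import torch
import torch.nn as nn

class TarsierSketch(nn.Module):
    """Toy stand-in for the Tarsier wiring: encode frames independently,
    project into the LLM embedding space, concatenate with text tokens."""

    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Placeholder for CLIP-ViT: flattens each frame into a feature vector.
        self.visual_encoder = nn.Linear(3 * 32 * 32, vision_dim)
        # The projection layer bridging vision features and the LLM.
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Placeholder for the LLM decoder.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, frames, text_ids):
        # frames: (batch, num_frames, 3, 32, 32); each frame is encoded independently.
        feats = self.visual_encoder(frames.flatten(start_dim=2))  # (b, t, vision_dim)
        visual_tokens = self.projector(feats)                     # (b, t, llm_dim)
        text_tokens = self.text_embed(text_ids)                   # (b, s, llm_dim)
        # Concatenate per-frame visual tokens with the text before the LLM.
        return self.llm(torch.cat([visual_tokens, text_tokens], dim=1))

model = TarsierSketch()
out = model(torch.randn(1, 8, 3, 32, 32), torch.randint(0, 1000, (1, 16)))
print(out.shape)  # torch.Size([1, 24, 128]): 8 visual tokens + 16 text tokens
```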

Quick Start & Requirements

  • Install: Clone the repository, checkout the tarsier2 branch, and run bash setup.sh.
  • Prerequisites: Python 3.9 is recommended. Environment variables for Azure OpenAI Service may be needed for specific evaluations.
  • Models: Download checkpoints from Hugging Face (e.g., omni-research/Tarsier2-Recap-7b); see the download snippet after this list.
  • Data: Benchmarks like DREAM-1K and TVBench need to be downloaded separately.
  • Demo: Online demos and CLI/Gradio demos are available.
  • Docs: Tarsier Technical Report
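
As a concrete example of the Models item above, the released checkpoint can be fetched with the standard Hugging Face Hub client. The repo id comes from the list; loading the weights and running inference are handled by the repository's own CLI/Gradio demo scripts, not shown here.

```python
# Fetch the Tarsier2-Recap-7b checkpoint named above into the local HF cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="omni-research/Tarsier2-Recap-7b")
print("Checkpoint downloaded to:", local_dir)
```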

Highlighted Details

  • Tarsier2-7B(-0115) achieves state-of-the-art results across 16 public benchmarks, including video captioning, VQA, and grounding.
  • Outperforms GPT-4o in human side-by-side comparisons for video description (Tarsier2-7B-1105).
  • Introduces DREAM-1K, a challenging benchmark for fine-grained video description, and AutoDQ for evaluation.
  • Tarsier-34B achieved SOTA on 6 video understanding benchmarks, with performance comparable to Gemini 1.5 Pro.

Maintenance & Community

The project is developed by ByteDance Research. Recent releases include Tarsier2-7B(-0115), Tarsier2-Recap-7B, and the Tarsier2 Technical Report. Links to Hugging Face model repositories and datasets are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for model checkpoints and datasets, especially for commercial use.

Limitations & Caveats

The Tarsier2-7B(-0115) model, while strong in video captioning, may exhibit limited instruction-following capabilities because its post-training focuses on video description. A version with improved instruction following is planned.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

Multimodal autoregressive model for long-context video/text. 7k stars (top 0.1%). Created 1 year ago; updated 1 year ago.