Video-language model for high-quality video descriptions and video understanding
Tarsier is a family of large-scale video-language models designed for high-quality video description and general video understanding. It targets researchers and developers building advanced video AI applications, and the models achieve state-of-the-art results across a range of video captioning and video understanding benchmarks.
How It Works
Tarsier uses a simple yet effective architecture: a CLIP-ViT visual encoder and a Large Language Model (LLM) decoder, connected by a projection layer. Frames are encoded independently and their visual tokens are concatenated before being fed to the LLM. This is paired with a two-stage training strategy: multi-task pre-training on a large dataset (40M video-text pairs for Tarsier2) followed by multi-grained instruction tuning. The training recipe, which also incorporates fine-grained temporal alignment and Direct Preference Optimization (DPO), is what drives Tarsier's strong performance.
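As a rough illustration of this data flow (a minimal sketch only; the module names, dimensions, and nn.Identity stand-ins are illustrative and not the repo's actual classes):

```python
# Toy sketch of the Tarsier-style layout described above: a ViT encodes each
# frame independently, a projection maps visual features into the LLM embedding
# space, and per-frame token sequences are concatenated before the LLM decoder.
import torch
import torch.nn as nn

class ToyVideoLLM(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=1000):
        super().__init__()
        self.visual_encoder = nn.Identity()              # stand-in for CLIP-ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # vision -> LLM embedding space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.Identity()                         # stand-in for the LLM decoder

    def forward(self, frames, text_ids):
        # frames: (batch, num_frames, num_patches, vision_dim)
        b, t, p, _ = frames.shape
        frame_feats = self.visual_encoder(frames)        # each frame encoded independently
        visual_tokens = self.projector(frame_feats)      # (b, t, p, llm_dim)
        visual_tokens = visual_tokens.reshape(b, t * p, -1)  # concatenate frames along time
        text_tokens = self.text_embed(text_ids)          # (b, seq_len, llm_dim)
        llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.llm(llm_input)                       # decoder generates the description

model = ToyVideoLLM()
frames = torch.randn(1, 8, 256, 1024)        # 8 sampled frames, 256 patches each
text_ids = torch.randint(0, 1000, (1, 16))   # prompt tokens
out = model(frames, text_ids)
```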
Quick Start & Requirements
Clone the repository, switch to the tarsier2 branch, and run bash setup.sh to install dependencies. Released model checkpoints are hosted on Hugging Face (e.g., omni-research/Tarsier2-Recap-7b).
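One way to fetch a released checkpoint before running the repo's own scripts is via huggingface_hub; a minimal sketch (the repo id comes from the release notes above, the local directory is an arbitrary choice):

```python
# Sketch: download the Tarsier2-Recap-7B checkpoint locally.
# Assumes huggingface_hub is installed (it is pulled in by the transformers stack).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="omni-research/Tarsier2-Recap-7b",
    local_dir="checkpoints/Tarsier2-Recap-7b",  # illustrative path
)
print(f"Checkpoint downloaded to: {local_dir}")
```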
Highlighted Details
Maintenance & Community
The project is developed by ByteDance Research. Recent releases include Tarsier2-7B(-0115), Tarsier2-Recap-7B, and the Tarsier2 Technical Report. Links to Hugging Face model repositories and datasets are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for model checkpoints and datasets, especially for commercial use.
Limitations & Caveats
The Tarsier2-7B(-0115) model, while strong at video captioning, may exhibit limited instruction-following capabilities due to its post-training focus. A version with improved instruction-following is planned.