Video-language model for high-quality video descriptions and video understanding
Tarsier is a family of large-scale video-language models designed for high-quality video description and general video understanding. It targets researchers and developers building advanced video AI applications, and the models achieve state-of-the-art results across a range of video captioning and video understanding benchmarks.
How It Works
Tarsier uses a simple yet effective architecture: a CLIP-ViT visual encoder and a Large Language Model (LLM) decoder, connected by a projection layer. Frames are encoded independently and their visual tokens are concatenated before being fed to the LLM. This is paired with a two-stage training strategy: multi-task pre-training on a large dataset (40M video-text pairs for Tarsier2) followed by multi-grained instruction tuning. The training recipe, which also incorporates fine-grained temporal alignment and Direct Preference Optimization (DPO), is what drives Tarsier's strong performance.
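As a rough illustration of this data flow (a minimal sketch only; the module names, dimensions, and nn.Identity stand-ins are illustrative and not the repo's actual classes):

```python
# Toy sketch of the Tarsier-style layout described above: a ViT encodes each
# frame independently, a projection maps visual features into the LLM embedding
# space, and per-frame token sequences are concatenated before the LLM decoder.
import torch
import torch.nn as nn

class ToyVideoLLM(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=1000):
        super().__init__()
        self.visual_encoder = nn.Identity()              # stand-in for CLIP-ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # vision -> LLM embedding space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.Identity()                         # stand-in for the LLM decoder

    def forward(self, frames, text_ids):
        # frames: (batch, num_frames, num_patches, vision_dim)
        b, t, p, _ = frames.shape
        frame_feats = self.visual_encoder(frames)        # each frame encoded independently
        visual_tokens = self.projector(frame_feats)      # (b, t, p, llm_dim)
        visual_tokens = visual_tokens.reshape(b, t * p, -1)  # concatenate frames along time
        text_tokens = self.text_embed(text_ids)          # (b, seq_len, llm_dim)
        llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.llm(llm_input)                       # decoder generates the description

model = ToyVideoLLM()
frames = torch.randn(1, 8, 256, 1024)        # 8 sampled frames, 256 patches each
text_ids = torch.randint(0, 1000, (1, 16))   # prompt tokens
out = model(frames, text_ids)
```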
Quick Start & Requirements
Clone the repository, switch to the tarsier2 branch, and run bash setup.sh to install dependencies. Released model checkpoints are hosted on Hugging Face (e.g., omni-research/Tarsier2-Recap-7b).
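One way to fetch a released checkpoint before running the repo's own scripts is via huggingface_hub; a minimal sketch (the repo id comes from the release notes above, the local directory is an arbitrary choice):

```python
# Sketch: download the Tarsier2-Recap-7B checkpoint locally.
# Assumes huggingface_hub is installed (it is pulled in by the transformers stack).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="omni-research/Tarsier2-Recap-7b",
    local_dir="checkpoints/Tarsier2-Recap-7b",  # illustrative path
)
print(f"Checkpoint downloaded to: {local_dir}")
```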
Highlighted Details
Maintenance & Community
The project is developed by ByteDance Research. Recent releases include Tarsier2-7B(-0115), Tarsier2-Recap-7B, and the Tarsier2 Technical Report. Links to Hugging Face model repositories and datasets are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for model checkpoints and datasets, especially for commercial use.
Limitations & Caveats
The Tarsier2-7B(-0115) model, while strong at video captioning, may exhibit limited instruction-following capabilities due to its post-training focus. A version with improved instruction-following is planned.