Valley by bytedance

Advanced multimodal LLM for text, image, and video

Created 8 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Valley is a multimodal large language model developed by ByteDance that processes text, images, and video. It is aimed at researchers and developers working on multimodal AI. The project reports strong results on in-house e-commerce and short-video benchmarks, and top rankings on the OpenCompass leaderboard among models under 10 billion parameters.

How It Works

The foundational Valley model is built on SigLIP as the vision encoder and Qwen2.5 as the language model, and uses LargeMLP and ConvAdapter to construct its projector. The "Valley-Eagle" variant adds a second vision encoder (taken from Qwen2-VL) that runs in parallel with the first and can flexibly adjust the number of visual tokens it produces. This addition is intended to improve the model's performance in challenging or "extreme" scenarios.
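To make the idea of a fixed token stream running alongside an adjustable one more concrete, here is a minimal sketch of how the visual token budget of such a design could be estimated. It is not Valley's code: the function names, the `TokenBudget` class, and every number (image size, patch size, pixels per token, token cap) are illustrative assumptions rather than values taken from the Valley configuration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Hypothetical container for the two parallel visual token streams."""
    fixed: int       # tokens from the fixed-resolution encoder
    adjustable: int  # tokens from the adjustable-resolution encoder

    @property
    def total(self) -> int:
        # Both streams are fed to the language model, so their lengths add.
        return self.fixed + self.adjustable


def fixed_grid_tokens(image_size: int, patch_size: int) -> int:
    """Token count for an encoder that resizes every image to one square size."""
    side = image_size // patch_size
    return side * side


def adjustable_tokens(height: int, width: int,
                      pixels_per_token: int, max_tokens: int) -> int:
    """Token count for an encoder that keeps aspect ratio under a token cap.

    Assumes max_tokens >= 1.
    """
    area_tokens = (height * width) / (pixels_per_token ** 2)
    if area_tokens > max_tokens:
        # Downscale both sides by the same factor so the grid fits the cap.
        scale = math.sqrt(max_tokens / area_tokens)
        height, width = height * scale, width * scale
    rows = max(1, int(height // pixels_per_token))
    cols = max(1, int(width // pixels_per_token))
    # Guard against rounding pushing the grid over the cap.
    while rows * cols > max_tokens:
        if rows >= cols:
            rows -= 1
        else:
            cols -= 1
    return rows * cols


# Illustrative values only (assumptions, not Valley's settings).
budget = TokenBudget(
    fixed=fixed_grid_tokens(image_size=384, patch_size=14),
    adjustable=adjustable_tokens(448, 448, pixels_per_token=28, max_tokens=1024),
)
print(budget.total)  # 729 + 256 = 985
```

Raising or lowering `max_tokens` changes only the adjustable stream, which is the kind of flexibility the paragraph above describes. The fixed stream stays the same size regardless of the input image.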

Quick Start & Requirements

Installation requires PyTorch 2.4.0 built for CUDA 12.1, plus the dependencies listed in requirements.txt. The repository includes Python examples for inference on a single image, multiple images, and video, using Hugging Face Transformers and a custom ValleyEagleChat class. The README links to the model pages on Hugging Face and ModelScope and to the research paper.
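Because the setup pins a specific PyTorch version, it can help to confirm what is installed before running the inference examples. The following is a small sketch that uses only the Python standard library. The helper names are hypothetical, and the `cu121` build tag is an assumption about how CUDA 12.1 wheels are labelled (some builds carry no tag at all, which the check tolerates).

```python
from importlib import metadata
from typing import List, Optional, Tuple

EXPECTED_TORCH = "2.4.0"      # version named in the summary above
EXPECTED_CUDA_TAG = "cu121"   # assumed local-version tag for CUDA 12.1 wheels


def parse_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split '2.4.0+cu121' into ('2.4.0', 'cu121'); the tag may be absent."""
    base, _, local = version.partition("+")
    return base, (local or None)


def check_torch(version: Optional[str]) -> List[str]:
    """Return human-readable problems; an empty list means the version matches."""
    if version is None:
        return ["torch is not installed"]
    base, tag = parse_torch_version(version)
    problems = []
    if base != EXPECTED_TORCH:
        problems.append(f"expected torch {EXPECTED_TORCH}, found {base}")
    if tag is not None and tag != EXPECTED_CUDA_TAG:
        problems.append(f"expected build tag {EXPECTED_CUDA_TAG}, found {tag}")
    return problems


def installed_torch_version() -> Optional[str]:
    """Read the installed torch version from package metadata, if any."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    issues = check_torch(installed_torch_version())
    for line in issues or ["torch version looks as expected"]:
        print(line)
```

The check reads package metadata instead of importing torch, so it also runs in an environment where the install is missing or broken.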

Highlighted Details

  • Valley2-DPO achieved a score of 38.62 on the OpenCompass Multi-modal Leaderboard, ranking top-3 among models under 10 billion parameters.
  • The model demonstrates strong performance on in-house e-commerce and short-video benchmarks, outperforming other open-source models of similar scale.
  • On OpenCompass, the model reports an average score of at least 67.40, placing it in the top two among models under 10 billion parameters.
  • The research paper "Valley2: Exploring Multimodal Models with Scalable Vision-Language Design" details the model's architecture and findings.

Maintenance & Community

The project is developed by ByteDance's TikTok-Ecommerce Team, and the README notes that the team is hiring in Beijing, Shanghai, Hangzhou, and Singapore. No community channels (e.g., Discord, Slack) or public roadmaps are mentioned in the README.

Licensing & Compatibility

The open-source models are licensed under the Apache-2.0 license, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly detail any limitations, known bugs, or alpha status. Performance claims are primarily based on internal benchmarks and specific leaderboard evaluations.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
