Valley by bytedance

Advanced multimodal LLM for text, image, and video

Created 1 year ago

270 stars

Top 95.4% on SourcePulse

Project Summary

Valley is a multimodal large language model from ByteDance that processes text, image, and video inputs. It is aimed at researchers and developers working on multimodal AI. The project reports strong results on e-commerce and short-video benchmarks, along with high rankings on the OpenCompass leaderboard among models with fewer than 10 billion parameters.

How It Works

The base Valley model is aligned with SigLIP and Qwen2.5 and uses LargeMLP and ConvAdapter modules as its projector. The "Valley-Eagle" variant adds a second vision encoder (Qwen2-VL) that runs in parallel with the first and allows the number of visual tokens to be adjusted. This addition is intended to improve performance in difficult, "extreme" scenarios.

Quick Start & Requirements

Installation involves setting up PyTorch 2.4.0 with CUDA 12.1 support and installing dependencies via requirements.txt. The repository provides Python code examples for performing inference with single images, multiple images, and video data, utilizing Hugging Face Transformers and a custom ValleyEagleChat class. Official links are available for Hugging Face, ModelScope, and the research paper.
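As a rough sanity check for the environment described above, the sketch below classifies a PyTorch version string against the 2.4.0 / CUDA 12.1 build. The helper names (`describe_build`, `installed_torch_version`) are hypothetical and not part of the Valley repository. The sketch assumes the usual `<version>+cu<digits>` tag that PyTorch CUDA wheels report in `torch.__version__`; package metadata for some wheels omits that tag, so a missing tag is reported as unknown rather than as a mismatch.

```python
from importlib import metadata

EXPECTED_TORCH = "2.4.0"     # PyTorch version named in the summary
EXPECTED_CUDA_TAG = "cu121"  # assumed wheel tag for CUDA 12.1


def describe_build(version: str) -> str:
    """Classify a torch version string as 'match', 'unknown-cuda', or 'mismatch'."""
    base, _, local = version.partition("+")
    if base != EXPECTED_TORCH:
        return "mismatch"
    if not local:
        # No local tag: the CUDA build cannot be inferred from the string alone.
        return "unknown-cuda"
    return "match" if local == EXPECTED_CUDA_TAG else "mismatch"


def installed_torch_version():
    """Return the installed torch version from package metadata, or None if absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed")
    else:
        print(f"torch {found}: {describe_build(found)}")
```

Passing `torch.__version__` to `describe_build` instead of the metadata string gives a more reliable answer when PyTorch is already importable, since that attribute normally carries the CUDA tag.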

Highlighted Details

Valley2-DPO scored 38.62 on the OpenCompass Multi-modal Leaderboard, placing in the top three among models under 10 billion parameters.
The model performs strongly on in-house e-commerce and short-video benchmarks, outperforming other open-source models of similar scale.
OpenCompass tests show an average score of at least 67.40, placing it in the top two among models under 10 billion parameters.
The research paper "Valley2: Exploring Multimodal Models with Scalable Vision-Language Design" details the model's architecture and findings.

Maintenance & Community

The project is developed by ByteDance's TikTok-Ecommerce Team, which notes that it is hiring in Beijing, Shanghai, Hangzhou, and Singapore. The README mentions no community channels (e.g., Discord, Slack) and no public roadmap.

Licensing & Compatibility

The open-source models are licensed under the Apache-2.0 license, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not list limitations, known bugs, or an alpha status. Its performance claims rest mainly on in-house benchmarks and specific leaderboard evaluations.

Explore Similar Projects

dots.vlm1 by rednote-hilab

Keye by Kwai-Keye

NextStep-1 by stepfun-ai

LLaVA-UHD by thunlp

Bunny by BAAI-DCAI

Ovis by AIDC-AI

Emu3 by baaivision

MiniGPT-4-ZH by RiseInRose

perception_models by facebookresearch

LLaVA-NeXT by LLaVA-VL

Bagel by ByteDance-Seed

sdnext by vladmandic