Vision-language model research paper exploring encoder-free architectures
The EVE series provides encoder-free Vision-Language Models (VLMs) that drop the separate vision encoder entirely, enabling efficient and stable transfer of Large Language Models (LLMs) to multimodal tasks. Aimed at researchers and practitioners in multimodal AI, EVE works to close the performance gap between encoder-free and encoder-based VLM architectures.
How It Works
EVE takes a decoder-only route: a single pure decoder architecture handles both vision and language, which allows images of arbitrary aspect ratios and keeps the training recipe efficient, transparent, and practical. By filtering and recaptioning fewer than 100 million publicly available samples from sources such as OpenImages, SAM, and LAION, EVE demonstrates data efficiency while remaining competitive with modular, encoder-based VLMs. A minimal sketch of the idea follows.
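The sketch below is illustrative, not the official EVE implementation: it assumes a lightweight linear patch projection in place of a vision encoder, concatenates the resulting patch embeddings with text token embeddings, and runs a single causal (decoder-only) transformer over the mixed sequence. All class, module, and parameter names here are hypothetical.

```python
# Minimal sketch (assumed architecture, not the official EVE code): an
# encoder-free VLM where raw image patches are projected directly into the
# LLM's embedding space and decoded together with text tokens.
import torch
import torch.nn as nn


class EncoderFreeVLMSketch(nn.Module):
    """Decoder-only transformer over a mixed sequence of image-patch and text tokens."""

    def __init__(self, vocab_size=32000, d_model=512, patch_size=14, n_layers=2, n_heads=8):
        super().__init__()
        self.patch_size = patch_size
        # A lightweight linear projection of raw patches stands in for a full vision encoder.
        self.patch_embed = nn.Linear(3 * patch_size * patch_size, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def patchify(self, image):
        # Works for any H and W divisible by patch_size, so arbitrary aspect
        # ratios are supported without resizing to a fixed square.
        c, h, w = image.shape
        p = self.patch_size
        patches = image.unfold(1, p, p).unfold(2, p, p)            # (C, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        return patches                                             # (num_patches, C*p*p)

    def forward(self, image, text_ids):
        vis = self.patch_embed(self.patchify(image)).unsqueeze(0)  # (1, N_img, d)
        txt = self.token_embed(text_ids).unsqueeze(0)              # (1, N_txt, d)
        seq = torch.cat([vis, txt], dim=1)                         # one multimodal sequence
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=mask)                      # causal self-attention only
        return self.lm_head(hidden)                                # next-token logits


# Usage: a 224x336 image (non-square aspect ratio) plus a short token prompt.
model = EncoderFreeVLMSketch()
logits = model(torch.randn(3, 224, 336), torch.randint(0, 32000, (16,)))
```

The key design point this sketch mirrors is that there is no pretrained vision tower: the only image-specific parameters are a patch projection, so the LLM backbone carries the visual representation learning itself.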
Highlighted Details
Maintenance & Community
The project is associated with BAAI (Beijing Academy of Artificial Intelligence). Further community engagement details are not provided in the README.
Licensing & Compatibility
The project is distributed under the terms of its bundled LICENSE file; the README does not state whether this corresponds to a common open-source license such as MIT or Apache. Users should verify compatibility before commercial use or closed-source linking.
Limitations & Caveats
The README does not detail specific limitations, unsupported platforms, or known bugs. The project appears to be actively developed, with EVEv2 recently released.