Namo-R1 by lucasjinreal

Compact, CPU-first multimodal AI for diverse applications

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

Project Summary

Namo R1 is an open-source, compact (500M parameter) Visual Language Model (VLM) designed for efficient CPU execution, addressing the accessibility gap for users without high-end GPUs. It offers researchers and developers a powerful yet lightweight MLLM solution with a focus on training transparency and future extensibility, aiming to democratize VLM research and deployment.

How It Works

This project introduces Namo R1, a 500M parameter MLLM engineered for exceptional CPU performance. Its core innovations include an architecture optimized for CPU-friendly inference, native support for omni-modal scalability (encompassing future audio capabilities), and complete training transparency. By fully disclosing data curation processes and dynamic curriculum scheduling, Namo R1 facilitates reproducible AI research and development, differentiating itself from many closed-source or less transparent MLLM projects.

Quick Start & Requirements

  • Installation: pip install -U namo
  • Prerequisites: Primarily designed for CPU execution; a GPU is used when detected via torch.cuda.is_available(). No specific OS or hardware constraints beyond a standard Python environment are noted for basic operation.
  • Links:
    • Community Discord: https://discord.gg/5ftPBVspXj
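
The CPU-first design means no GPU is required to get started. A minimal device-selection sketch is shown below; the `pick_device` helper is an illustrative assumption, not part of the namo package's API.

```python
def pick_device() -> str:
    """Illustrative helper (not part of the namo package): return "cuda"
    when a CUDA GPU is detected, otherwise fall back to "cpu", matching
    the project's CPU-first design."""
    try:
        import torch  # optional: only needed to detect a GPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

On a machine without PyTorch or without a CUDA device, this resolves to "cpu", which is the intended default execution mode for Namo R1.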

Highlighted Details

  • Surpasses SmolVLM and Moondream2 on selected benchmarks among models of comparable size.
  • Features multilingual OCR capabilities (English, Chinese, Japanese, etc.) within its 500M parameter footprint.
  • Supports native dynamic resolution, enhancing robustness with images of varying aspect ratios.
  • Provides full open-source access to all model code, training scripts, and data curation methodologies.
  • Offers SigLIP2 as an optional vision encoder for improved training.
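
Native dynamic resolution generally means scaling an input image so its patch grid fits a token budget while preserving the original aspect ratio, rather than forcing a fixed square crop. The sketch below illustrates that general idea only; the patch size, token budget, and `dynamic_grid` helper are assumptions, not Namo R1's actual preprocessing.

```python
import math

def dynamic_grid(width: int, height: int, patch: int = 16, max_patches: int = 1024):
    """Illustrative sketch (hypothetical helper, not the namo package's
    preprocessing): pick a patch grid for an image of arbitrary aspect
    ratio, downscaling only when the grid would exceed the token budget."""
    cols = max(1, math.ceil(width / patch))
    rows = max(1, math.ceil(height / patch))
    if cols * rows > max_patches:
        # Shrink both axes by the same factor so aspect ratio is preserved.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return cols, rows
```

A 224x224 image maps directly to a 14x14 grid, while a very wide or very tall image is scaled down uniformly instead of being distorted into a square, which is the robustness property the bullet above refers to.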

Maintenance & Community

The project is actively under development, with recent updates including SigLIP2 integration. A community Discord server is available for support and discussion.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license permits broad use, including commercial applications and integration into closed-source projects.

Limitations & Caveats

Current benchmark results are based on a limited set of metrics, with more comprehensive evaluations planned. Some larger model variants (e.g., 700M) are still undergoing training. Users encountering issues with deepspeed should ensure their transformers library is updated to version 4.48 or later.
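
To verify the transformers requirement mentioned above before launching a deepspeed run, a quick version check can be used; the `transformers_meets_min` helper below is an illustrative sketch, not a utility shipped with the project.

```python
from importlib.metadata import PackageNotFoundError, version

def transformers_meets_min(minimum=(4, 48)) -> bool:
    """Illustrative helper: return True when the installed transformers
    version is at least `minimum` (the >=4.48 requirement noted for
    deepspeed compatibility), False when it is older or not installed."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False
    parts = installed.split(".") + ["0"]  # pad in case of a bare "4"-style version
    def to_int(part: str) -> int:
        digits = "".join(ch for ch in part if ch.isdigit())
        return int(digits) if digits else 0
    return (to_int(parts[0]), to_int(parts[1])) >= minimum
```

If this returns False, `pip install -U transformers` brings the library up to a compatible version.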

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days
