Open multimodal LLM framework for vision-language tasks
Top 85.5% on SourcePulse
HPT (Hyper-Pretrained Transformers) is a multimodal LLM framework from HyperGAI, designed for vision-language understanding. It offers several open-source models, including HPT 1.5 Edge (<5B parameters) for edge devices and HPT 1.5 Air (8B parameters) built with Llama 3, both achieving competitive results on benchmarks like MMMU.
How It Works
HPT models are built by hyper-pretraining existing large language models with visual encoders. This approach leverages established LLM architectures (like Llama 3, Phi-3, Yi) and visual encoders (like SigLIP, CLIP) to create efficient and capable vision-language models. The framework focuses on achieving state-of-the-art performance on multimodal benchmarks with relatively smaller model sizes.
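To make the "visual encoder + existing LLM" composition concrete, below is a minimal sketch of the generic encoder-projector-decoder pattern such frameworks follow. The class name, projector design, dimensions, and stand-in modules are illustrative assumptions, not HPT's actual implementation.

# Illustrative sketch of a generic vision-language composition
# (visual encoder -> projector -> LLM). Names and dimensions are
# assumptions for illustration, not HPT's actual code.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a SigLIP/CLIP-style backbone
        # A small MLP projector maps visual features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model      # e.g. a Llama 3 / Phi-3 / Yi decoder

    def forward(self, image_patches, text_embeds):
        image_feats = self.vision_encoder(image_patches)     # (B, patches, vision_dim)
        image_tokens = self.projector(image_feats)           # (B, patches, llm_dim)
        # Visual tokens are prepended to the text embeddings and fed to the decoder.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs)

# Tiny stand-ins so the sketch runs end to end.
vision_dim, llm_dim, patch_dim = 64, 128, 48
encoder = nn.Linear(patch_dim, vision_dim)   # stand-in for a pretrained visual encoder
decoder = nn.Linear(llm_dim, llm_dim)        # stand-in for a pretrained LLM
model = VisionLanguageModel(encoder, decoder, vision_dim, llm_dim)

image_patches = torch.randn(1, 196, patch_dim)   # pretend image patch features
text_embeds = torch.randn(1, 10, llm_dim)        # pretend text token embeddings
print(model(image_patches, text_embeds).shape)   # (1, 206, 128): 196 visual + 10 text tokens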
Quick Start & Requirements
Install dependencies:
pip install -r requirements.txt
pip install -e .
Download the model weights (Git LFS required):
git lfs install
git clone https://huggingface.co/HyperGAI/HPT1_5-Edge [Local Path]
Run the demo:
python demo/demo.py --image_path [Image] --text [Text] --model [Config]
Evaluate on a benchmark (8 GPUs):
torchrun --nproc-per-node=8 run.py --data [Dataset] --model [Config]
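The demo command can also be scripted. The snippet below is a small, hypothetical wrapper that batches demo.py over a folder of images using only the CLI flags shown above; the paths, prompt, and config argument are placeholders.

# Hypothetical helper that batches the demo CLI shown above over a folder of
# images. Paths, the prompt, and the config argument are placeholders.
import subprocess
from pathlib import Path

def run_demo_on_folder(image_dir, prompt, config):
    for image_path in sorted(Path(image_dir).glob("*.jpg")):
        # Mirrors: python demo/demo.py --image_path [Image] --text [Text] --model [Config]
        subprocess.run(
            ["python", "demo/demo.py",
             "--image_path", str(image_path),
             "--text", prompt,
             "--model", config],
            check=True,
        )

# Example call (all arguments are placeholders):
# run_demo_on_folder("samples/", "Describe this image.", "[Config]")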
Highlighted Details
Maintenance & Community
Last updated about 1 year ago; the repository is currently inactive.
Licensing & Compatibility
Limitations & Caveats
The models have no built-in moderation mechanisms and come with no guarantees on results; real-world applications need their own guardrails, which the project leaves to the community to implement.
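As one generic, assumed approach to such guardrails (not anything provided by the repository), a deployment could at minimum wrap model output in a simple blocklist filter; production systems would use a proper moderation model or service.

# Generic, assumed guardrail sketch: a simple blocklist check around model
# output. Not part of the HPT repository; terms below are placeholders.
BLOCKED_TERMS = {"example_banned_phrase"}

def moderate(generated_text):
    lowered = generated_text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "[response withheld by moderation filter]"
    return generated_text

# Usage: wrap whatever text the model returns before showing it to a user.
# safe_text = moderate(model_output)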