HPT by HyperGAI

Open multimodal LLM framework for vision-language tasks

created 1 year ago
315 stars

Top 85.5% on SourcePulse

View on GitHub
Project Summary

HPT (Hyper-Pretrained Transformers) is a multimodal LLM framework from HyperGAI, designed for vision-language understanding. It offers several open-source models, including HPT 1.5 Edge (<5B parameters) for edge devices and HPT 1.5 Air (8B parameters) built with Llama 3, both achieving competitive results on benchmarks like MMMU.

How It Works

HPT models are built by hyper-pretraining existing large language models with visual encoders. This approach leverages established LLM architectures (like Llama 3, Phi-3, Yi) and visual encoders (like SigLIP, CLIP) to create efficient and capable vision-language models. The framework focuses on achieving state-of-the-art performance on multimodal benchmarks with relatively smaller model sizes.
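The hyper-pretraining idea above can be sketched in a few lines: a frozen visual encoder's patch features are mapped through a learned projector into the LLM's token-embedding space, then concatenated with text embeddings. This is a minimal conceptual sketch, not HPT's actual code; the dimensions, the linear projector, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not HPT's real ones.
VIS_DIM = 1152   # assumed visual-encoder feature width (SigLIP-like)
LLM_DIM = 4096   # assumed LLM hidden size (Llama-3-8B-like)
N_PATCHES = 576  # assumed number of visual tokens per image

def project_visual_tokens(features: np.ndarray,
                          W: np.ndarray,
                          b: np.ndarray) -> np.ndarray:
    """Linear projector mapping visual features into the LLM's embedding space."""
    return features @ W + b

# Stand-in for one image's visual-encoder output.
visual_features = rng.standard_normal((N_PATCHES, VIS_DIM))

# Projector weights (these are what get trained during hyper-pretraining
# in this sketch; real setups may also unfreeze parts of the LLM).
W = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

visual_tokens = project_visual_tokens(visual_features, W, b)

# Stand-in for embedded text tokens; the LLM then consumes the
# concatenated visual + text sequence as one multimodal prompt.
text_tokens = rng.standard_normal((16, LLM_DIM))
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (592, 4096)
```

The key design point this illustrates is that the LLM itself need not change shape: the projector makes image features look like ordinary token embeddings, which is what lets HPT reuse established backbones such as Llama 3 or Phi-3.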

Quick Start & Requirements

  • Install via pip: pip install -r requirements.txt and pip install -e .
  • Download model weights from HuggingFace: git lfs install followed by git clone https://huggingface.co/HyperGAI/HPT1_5-Edge [Local Path]
  • Requires Python and PyTorch; a CUDA-capable GPU is generally expected for acceleration, though specific CUDA versions are not stated.
  • Demo: python demo/demo.py --image_path [Image] --text [Text] --model [Config]
  • Evaluation: torchrun --nproc-per-node=8 run.py --data [Dataset] --model [Config]
  • More details: technical blog post

Highlighted Details

  • HPT 1.5 Edge (<5B parameters) is optimized for edge devices.
  • HPT 1.5 Air (8B parameters) uses Llama 3 and claims state-of-the-art results among <10B models on benchmarks like MMMU, POPE, SEED-I.
  • HPT 1.0 Air is noted as a cost-effective solution for vision-and-language tasks.
  • Evaluation code is extended from the VLMEvalKit project.

Maintenance & Community

Licensing & Compatibility

  • Released under Apache 2.0 license.
  • Parts of the project may use code/models from other sources with their own licenses, potentially impacting commercial use.

Limitations & Caveats

The models ship without moderation mechanisms and come with no guarantees on their outputs; deploying them in real-world applications requires implementing your own guardrails, an effort the authors leave to the community.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jiayi Pan (Author of SWE-Gym; MTS at xAI).

DeepSeek-Coder-V2 by deepseek-ai

Open-source code language model comparable to GPT4-Turbo

Top 0.4% on SourcePulse
6k stars
created 1 year ago
updated 10 months ago
Starred by Shizhe Diao (Research Scientist at NVIDIA; Author of LMFlow), Zhiyuan Li (Cofounder of Nexa AI), and 15 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

Top 0.3% on SourcePulse
23k stars
created 2 years ago
updated 1 year ago