MiniCPM-V-CookBook by OpenSQZ

Building multimodal AI applications

Created 7 months ago
410 stars

Top 71.3% on SourcePulse

Project Summary

This repository offers a comprehensive set of "recipes" and documentation for building multimodal AI applications with the MiniCPM-o model, integrating vision, speech, and live-streaming capabilities. It targets individuals, enterprises, and researchers with tailored deployment and fine-tuning solutions, streamlining development and deployment across diverse hardware and software environments.

How It Works

The cookbook provides ready-to-run examples for leveraging MiniCPM-o's multimodal understanding. It supports a wide array of inference frameworks: user-friendly options like Ollama and Llama.cpp for individuals; high-performance solutions like vLLM and SGLang for enterprises; and advanced toolkits such as Transformers, LLaMA-Factory, SWIFT, and Align-anything for researchers.

Quick Start & Requirements

Setup varies by the chosen framework; framework-specific instructions are in the ./deployment/ directory. Key frameworks include Ollama, Llama.cpp, vLLM, SGLang, Transformers, LLaMA-Factory, SWIFT, and Align-anything.

Highlighted Details

  • Versatile Deployment: Supports edge devices (iPhone/iPad), local machines, and cloud infrastructure.
  • Comprehensive Multimodal Features: Includes recipes for image/video QA, document parsing, OCR, visual grounding, speech-to-text, text-to-speech, and voice cloning.
  • Flexible Fine-tuning & Serving: Integrates with popular frameworks like Transformers, LLaMA-Factory, vLLM, and SGLang for customization and high-throughput inference.
  • Quantization Support: Recipes for GGUF, BNB, and AWQ formats to enhance efficiency.

Maintenance & Community

Developed by OpenBMB and OpenSQZ. Community support and contributions are encouraged via their Discord channel. Active development is indicated by ongoing framework integrations.

Licensing & Compatibility

Released under the permissive Apache-2.0 License, allowing free use, modification, and distribution, including commercial applications.

Limitations & Caveats

Support for Ollama on edge devices is listed as "Waiting for official release," indicating ongoing development. Other listed framework integrations appear current or recently completed.

Health Check
Last Commit

5 days ago

Responsiveness

Inactive

Pull Requests (30d)
14
Issues (30d)
21
Star History
175 stars in the last 30 days
