Falcon-Perception by tiiuae

Dense autoregressive Transformer for multimodal vision-language understanding

Created 1 week ago


406 stars

Top 71.6% on SourcePulse

View on GitHub
Project Summary

Falcon-Perception is a minimal, performant PyTorch inference engine for natively multimodal, dense, autoregressive Transformer models. It supports object detection, instance segmentation, and OCR driven by natural-language queries, and targets researchers and engineers who need efficient multimodal AI deployment. The engine applies modern inference techniques for significant performance gains.

How It Works

The core architecture is a dense, autoregressive Transformer with native multimodality. Inference builds on PyTorch's FlexAttention, which compiles composable attention masks into fused Triton kernels, enabling seamless continuous batching with paged attention. A paged KV cache backed by virtual page tables eliminates padding waste, improving both memory use and throughput.
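The virtual-page-table idea can be pictured as a mapping from each sequence's logical token positions to physical cache pages allocated on demand. The sketch below is purely illustrative (all names, the page size, and the eviction behavior are assumptions, not Falcon-Perception's actual implementation):

```python
# Minimal sketch of a paged KV cache with a virtual page table.
# All names here are hypothetical, not Falcon-Perception's real API.
PAGE_SIZE = 16  # tokens per physical page (illustrative value)

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))   # pool of physical pages
        self.page_tables = {}                      # seq_id -> [physical page ids]
        self.lengths = {}                          # seq_id -> tokens written so far

    def append(self, seq_id: int, n_tokens: int) -> None:
        """Reserve cache slots for n_tokens new tokens of a sequence."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        for pos in range(length, length + n_tokens):
            if pos % PAGE_SIZE == 0:               # crossing a page boundary
                table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + n_tokens

    def slot(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (physical page, offset)."""
        return self.page_tables[seq_id][pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished or preempted sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because pages are allocated only as tokens arrive, short and long sequences share one pool with no per-sequence padding, which is what makes continuous batching memory-efficient.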

Quick Start & Requirements

Install with pip install -e .; extras select a backend: .[torch] for PyTorch/CUDA and .[mlx] for Apple Silicon. The PyTorch backend requires a CUDA GPU and compatible drivers. The MLX backend runs natively on Apple Silicon without PyTorch or transformers dependencies. Initial PyTorch setup incurs roughly 10-30 seconds of compilation and CUDA graph capture; subsequent inference is much faster.

Highlighted Details

  • Paged Inference Engine: Implements CUDA Graphs, continuous batching, a paged KV cache with virtual page tables, background tokenization, preemption, and a high-resolution image-feature cache that reduces prefill time on repeated queries.
  • MLX Batch Inference Engine: Provides equivalent performance on Apple Silicon Macs using MLX, featuring a dense KV cache and tiled windowed cross-attention for memory efficiency.
  • Inference Server: A FastAPI REST API supports the Paged Inference Engine across multiple GPUs via data parallelism for scalable deployment.
  • Falcon-OCR Throughput: Achieves high serving throughput (~6,000 tok/s on A100-80GB) for the full layout-detection-to-OCR pipeline, outperforming larger models.
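The image-feature cache mentioned above amortizes vision-encoder work: repeated queries against the same image reuse its encoded features instead of re-running the expensive prefill. A rough sketch of the idea, with hypothetical names and a deliberately crude eviction policy (the project's real cache will differ):

```python
import hashlib

# Illustrative sketch of a high-resolution image-feature cache.
# Names and eviction policy are assumptions, not Falcon-Perception's API.
class ImageFeatureCache:
    def __init__(self, encoder, max_entries: int = 64):
        self.encoder = encoder          # the expensive vision encoder
        self.max_entries = max_entries
        self._cache = {}                # image digest -> encoded features
        self.hits = 0
        self.misses = 0

    def features(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1              # repeated query: skip the encoder
            return self._cache[key]
        self.misses += 1
        if len(self._cache) >= self.max_entries:
            self._cache.pop(next(iter(self._cache)))  # crude FIFO eviction
        feats = self.encoder(image_bytes)
        self._cache[key] = feats
        return feats
```

Keying on a content digest means any query against a previously seen image skips the encoder entirely, which is where the reduced prefill time on repeated queries comes from.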

Maintenance & Community

No specific details regarding maintenance, community channels, or notable contributors were found in the provided README excerpt.

Licensing & Compatibility

The license type and compatibility notes for commercial use or closed-source linking are not explicitly stated in the provided README excerpt.

Limitations & Caveats

Initial PyTorch backend setup incurs a compilation delay. The MLX backend is limited to Apple Silicon hardware. Layout-aware OCR requires an additional installation and a third-party layout-detection model. The vLLM Docker server is exclusively for Falcon-OCR.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
8
Star History
407 stars in the last 12 days

