Falcon-Perception by tiiuae

Dense autoregressive Transformer for multimodal vision-language understanding

Created 1 week ago


406 stars

Top 71.6% on SourcePulse

View on GitHub
Project Summary

Falcon-Perception is a minimal, performant PyTorch inference engine for natively multimodal, dense, autoregressive Transformer models. It supports object detection, instance segmentation, and OCR driven by natural-language queries, and targets researchers and engineers who need efficient multimodal AI deployment. The engine applies modern inference techniques for significant performance gains.

How It Works

The core architecture is a dense, autoregressive Transformer with native multimodality. Inference builds on PyTorch's FlexAttention, which compiles composable attention masks into fused Triton kernels, enabling seamless continuous batching with paged attention. A paged KV cache backed by virtual page tables eliminates padding waste, improving both memory use and throughput.
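The virtual-page-table idea can be pictured as a mapping from each sequence's logical token positions to physical cache pages allocated on demand. The sketch below is purely illustrative (all names, the page size, and the eviction behavior are assumptions, not Falcon-Perception's actual implementation):

```python
# Minimal sketch of a paged KV cache with a virtual page table.
# All names here are hypothetical, not Falcon-Perception's real API.
PAGE_SIZE = 16  # tokens per physical page (illustrative value)

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))   # pool of physical pages
        self.page_tables = {}                      # seq_id -> [physical page ids]
        self.lengths = {}                          # seq_id -> tokens written so far

    def append(self, seq_id: int, n_tokens: int) -> None:
        """Reserve cache slots for n_tokens new tokens of a sequence."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        for pos in range(length, length + n_tokens):
            if pos % PAGE_SIZE == 0:               # crossing a page boundary
                table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + n_tokens

    def slot(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (physical page, offset)."""
        return self.page_tables[seq_id][pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished or preempted sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because pages are allocated only as tokens arrive, short and long sequences share one pool with no per-sequence padding, which is what makes continuous batching memory-efficient.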

Quick Start & Requirements

Install with pip install -e .; extras select a backend: .[torch] for PyTorch/CUDA and .[mlx] for Apple Silicon. The PyTorch backend requires a CUDA GPU and compatible drivers. The MLX backend runs natively on Apple Silicon without PyTorch or transformers dependencies. Initial PyTorch setup incurs roughly 10-30 seconds of compilation and CUDA graph capture; subsequent inference is much faster.

Highlighted Details

  • Paged Inference Engine: Implements CUDA Graphs, continuous batching, a paged KV cache with virtual page tables, background tokenization, preemption, and a high-resolution image-feature cache that reduces prefill time on repeated queries.
  • MLX Batch Inference Engine: Provides equivalent performance on Apple Silicon Macs using MLX, featuring a dense KV cache and tiled windowed cross-attention for memory efficiency.
  • Inference Server: A FastAPI REST API supports the Paged Inference Engine across multiple GPUs via data parallelism for scalable deployment.
  • Falcon-OCR Throughput: Achieves high serving throughput (~6,000 tok/s on A100-80GB) for the full layout-detection-to-OCR pipeline, outperforming larger models.
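The image-feature cache mentioned above amortizes vision-encoder work: repeated queries against the same image reuse its encoded features instead of re-running the expensive prefill. A rough sketch of the idea, with hypothetical names and a deliberately crude eviction policy (the project's real cache will differ):

```python
import hashlib

# Illustrative sketch of a high-resolution image-feature cache.
# Names and eviction policy are assumptions, not Falcon-Perception's API.
class ImageFeatureCache:
    def __init__(self, encoder, max_entries: int = 64):
        self.encoder = encoder          # the expensive vision encoder
        self.max_entries = max_entries
        self._cache = {}                # image digest -> encoded features
        self.hits = 0
        self.misses = 0

    def features(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1              # repeated query: skip the encoder
            return self._cache[key]
        self.misses += 1
        if len(self._cache) >= self.max_entries:
            self._cache.pop(next(iter(self._cache)))  # crude FIFO eviction
        feats = self.encoder(image_bytes)
        self._cache[key] = feats
        return feats
```

Keying on a content digest means any query against a previously seen image skips the encoder entirely, which is where the reduced prefill time on repeated queries comes from.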

Maintenance & Community

No specific details regarding maintenance, community channels, or notable contributors were found in the provided README excerpt.

Licensing & Compatibility

The license type and compatibility notes for commercial use or closed-source linking are not explicitly stated in the provided README excerpt.

Limitations & Caveats

Initial PyTorch backend setup incurs a compilation delay. The MLX backend is limited to Apple Silicon hardware. Layout-aware OCR requires an additional installation and a third-party layout-detection model. The vLLM Docker server is exclusively for Falcon-OCR.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
8
Star History
407 stars in the last 12 days

