thu-coai/Glyph: Scaling context windows via visual-text compression
Glyph addresses the challenge of scaling context windows in Large Language Models (LLMs) by transforming long textual sequences into images, which are then processed by Vision-Language Models (VLMs). This approach targets researchers and practitioners dealing with extensive documents or conversations, offering significant reductions in computational and memory costs while preserving semantic information, thereby enabling more efficient long-context processing.
How It Works
Glyph reframes long-context modeling as a multimodal problem. Instead of directly processing lengthy text inputs, it renders text into compact images. These images are then fed into VLMs, leveraging their inherent ability to process visual information. This paradigm shift allows for substantial input-token compression, leading to considerable savings in computational resources and inference time compared to conventional text-only LLMs operating on extended contexts.
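To make the pipeline concrete, the following is a minimal sketch of the render-then-read idea, assuming reportlab (the rendering backend mentioned under Limitations & Caveats) and pdf2image, which wraps the poppler-utils dependency from the quick start. The font, size, and DPI values here are illustrative assumptions rather than Glyph's actual rendering configuration, and long lines are not wrapped.

```python
# Sketch: render long text onto PDF pages, then rasterize each page to an image
# that a VLM can read. Rendering parameters are illustrative, not Glyph's own.
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas
from pdf2image import convert_from_path  # requires poppler-utils


def render_text_to_images(text: str, pdf_path: str = "context.pdf", dpi: int = 120):
    """Render text onto PDF pages and return one PIL image per page."""
    _, page_height = A4
    c = canvas.Canvas(pdf_path, pagesize=A4)
    t = c.beginText(40, page_height - 40)
    t.setFont("Helvetica", 8)          # a small font packs more text per page
    for line in text.splitlines():
        if t.getY() < 40:              # page is full: flush it and start a new one
            c.drawText(t)
            c.showPage()
            t = c.beginText(40, page_height - 40)
            t.setFont("Helvetica", 8)
        t.textLine(line)
    c.drawText(t)
    c.save()
    return convert_from_path(pdf_path, dpi=dpi)


# "long_document.txt" is a placeholder path for any long textual context.
pages = render_text_to_images(open("long_document.txt").read())
print(f"{len(pages)} page image(s) ready for the VLM")
```

Because the VLM then consumes a handful of page images rather than the full token sequence, the input-token count drops sharply, which is where the reported compute and memory savings come from.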
Quick Start & Requirements
Install poppler-utils (apt-get install poppler-utils) and transformers==4.57.1 (pip install transformers==4.57.1). Optional dependencies vllm==0.10.2 and sglang==0.5.2 provide accelerated inference. The core model is based on GLM-4.1V-9B-Base, and GPU acceleration is recommended for VLM inference. A demo script (demo/run_demo_compared.sh) allows side-by-side comparison of Glyph with a baseline text model.
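As a hedged illustration of the quick start, the sketch below feeds one rendered page to the model. It assumes the released checkpoint loads through the generic transformers image-text-to-text interface (as GLM-4.1V-based models do); the Hugging Face repo id and the image filename are placeholders, not values taken from the README.

```python
# Sketch: load the checkpoint and ask a question about one rendered page image.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "<glyph-checkpoint-id>"  # placeholder: use the Hugging Face id from the README
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder filename: a page image saved from the rendering sketch above.
page = Image.open("rendered_page_0.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": page},
        {"type": "text", "text": "Summarize the document shown on this page."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Running on a GPU, as recommended above, keeps generation latency reasonable; the optional vllm or sglang backends listed earlier can be used for accelerated serving.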
Maintenance & Community
The project is the official repository for the Glyph paper, with the fine-tuned model publicly available on Hugging Face. No specific community channels (e.g., Discord, Slack) or details on ongoing maintenance, sponsorships, or partnerships are provided in the README.
Licensing & Compatibility
The README does not explicitly state the software license. This absence is a significant factor for potential adopters, as it leaves commercial use, distribution, and derivative works undefined.
Limitations & Caveats
Glyph's performance is sensitive to rendering parameters (resolution, font, spacing), potentially limiting generalization to unseen rendering styles. OCR-related challenges persist, particularly with fine-grained or rare alphanumeric strings in ultra-long inputs. The model's generalization capabilities beyond long-context understanding require further study, and the current reportlab-based text rendering implementation leaves room for performance optimization.