AdvancedLiterateMachinery by AlibabaResearch

Collection of algorithms for Advanced Literate Machinery research

Created 3 years ago

1,832 stars

Top 22.8% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

Advanced Literate Machinery (ALM) is a repository from Alibaba's OCR Team that aims to develop AI systems capable of reading, thinking, and creating, with an initial focus on advanced OCR and document understanding. It targets researchers and developers in multimodal AI, offering algorithms and benchmarks intended to push machine literacy beyond current state-of-the-art models such as GPT-4V.

How It Works

The project explores various novel approaches for text recognition and document understanding. Key innovations include unified architectures for multi-task visual text parsing (OmniParser), Gestalt principles for web understanding (GEM), and specialized decoders for length-insensitive scene text recognition (LISTER). Many models leverage transformer architectures and pre-training techniques, often incorporating explicit geometric or logical reasoning for improved performance on complex document layouts and diverse text forms.

Quick Start & Requirements

Installation and usage details are not explicitly provided in the README.
Many sub-projects reference specific papers (published at venues such as CVPR, ECCV, and ICCV), which may contain detailed setup instructions.
Some components are available via ModelScope.

Highlighted Details

CC-OCR Benchmark: A comprehensive benchmark for evaluating OCR capabilities of Large Multimodal Models, featuring diverse scenarios and real-world data.
Platypus: A generalized specialist model for recognizing text in various forms using a single unified architecture.
OmniParser: A unified framework handling text spotting, key information extraction, and table recognition with a shared encoder-decoder and point-conditioned text generation.
DocXChain: An open-source toolchain for document parsing, including text detection, recognition, table structure recognition, and layout analysis.
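
To make the idea of benchmarking OCR output concrete, the sketch below scores a predicted transcription against a reference using character-level edit distance, a metric commonly used in OCR evaluation. This is an illustration only: the summary above does not state which metrics CC-OCR uses, and the function names `edit_distance` and `normalized_similarity` are hypothetical, not part of the repository.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(prediction: str, reference: str) -> float:
    """Score in [0, 1]; 1.0 means the prediction matches the reference exactly."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - edit_distance(prediction, reference) / longest


if __name__ == "__main__":
    # Hypothetical model output with one misread character.
    print(normalized_similarity("invo1ce", "invoice"))
```

Normalizing by the longer string keeps the score comparable across short and long text lines, which matters when a benchmark mixes scene text with dense documents.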

Maintenance & Community

This project is maintained by the 读光 (Duguang) OCR Team within Alibaba's Tongyi Lab. The README links to demos (DocMaster) and the portal. It does not mention community channels such as Discord or Slack.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README focuses on research advancements and does not detail specific limitations, unsupported platforms, or the project's maturity level (e.g., alpha/beta status). Setup and integration details are sparse, so users need to consult the individual research papers.

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days