deep-text-recognition-benchmark by roatienza

PyTorch code for scene text recognition research paper

Created 5 years ago

313 stars

Top 86.3% on SourcePulse

Project Summary

This repository provides PyTorch code for ViTSTR, a Vision Transformer-based model for fast and efficient scene text recognition. It offers accuracy comparable to state-of-the-art models with significantly fewer parameters and FLOPS, making it suitable for researchers and developers working on OCR and scene text recognition tasks.

How It Works

ViTSTR leverages a pre-trained Vision Transformer (ViT) architecture for scene text recognition. This approach capitalizes on the parallel processing capabilities inherent in ViTs, leading to faster inference times compared to traditional recurrent or convolutional models. The model is designed as a single-stage architecture, simplifying its implementation and training.
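To make the "parallel processing" point concrete, here is a minimal sketch of non-recurrent decoding. It is not code from the repository. It assumes the model emits one score vector per character position and that an end-of-text symbol marks where the word stops; the character set, the `[s]` symbol, and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# Hypothetical character set; index 0 is an assumed end-of-text symbol.
END = "[s]"
CHARSET: List[str] = [END] + list("abcdefghijklmnopqrstuvwxyz0123456789")


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_parallel(position_scores: Sequence[Sequence[float]]) -> str:
    """Decode one word from per-position score vectors.

    Each position is decoded independently of the others, so a model
    producing these scores can compute every position in one forward
    pass instead of stepping through the word character by character.
    """
    indices = [argmax(scores) for scores in position_scores]
    chars = []
    for idx in indices:
        if CHARSET[idx] == END:
            break  # everything after the end symbol is ignored
        chars.append(CHARSET[idx])
    return "".join(chars)


def one_hot(ch: str) -> List[float]:
    """Build a toy score vector that puts all mass on one symbol."""
    return [1.0 if c == ch else 0.0 for c in CHARSET]


# Toy scores standing in for model output: "stop", end symbol, then padding.
toy_scores = [one_hot(c) for c in ("s", "t", "o", "p", END, "x")]
print(decode_parallel(toy_scores))  # prints "stop"
```

A recurrent decoder would instead need the previous character before predicting the next one, which is the sequential dependency this style of decoding avoids.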

Quick Start & Requirements

Install: pip3 install -r requirements.txt
Prerequisites: Python 3, PyTorch. GPU recommended for training and faster inference.
Inference: python3 infer.py --image <path_to_image> --model <model_url_or_path>
Demo: https://github.com/roatienza/deep-text-recognition-benchmark/blob/main/demo_image/demo_1.png
Model Weights: Available for Tiny, Small, and Base configurations, with and without augmentation.
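The inference command above can also be driven from a script. The following is a rough sketch that uses only the `--image` and `--model` flags listed above. The weights filename `vitstr_small.pt` is a placeholder, not a real release name, and the assumption that `infer.py` writes its prediction to standard output has not been verified against the repository.

```python
import subprocess
from typing import List


def build_infer_command(image: str, model: str) -> List[str]:
    """Assemble the documented inference command as an argument list."""
    return ["python3", "infer.py", "--image", image, "--model", model]


def run_inference(image: str, model: str) -> str:
    """Run infer.py and return whatever it prints.

    Assumes the working directory is the repository root and that the
    script reports its result on standard output (an assumption).
    """
    result = subprocess.run(
        build_infer_command(image, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


# "vitstr_small.pt" is a hypothetical local weights file.
cmd = build_infer_command("demo_image/demo_1.png", "vitstr_small.pt")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when image paths contain spaces.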

Highlighted Details

Achieves competitive accuracy on various benchmarks (IIIT, SVT, IC03, IC15, etc.) with fewer parameters and FLOPS.
Demonstrates fast inference times: ~2.57ms on Quadro RTX 6000, ~28ms on CPU.
Offers quantized models optimized for x86 and Raspberry Pi 4.
Supports training from scratch with detailed configurations for data augmentation and multi-GPU setups.
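Per-image latency figures like the ones above are typically obtained by averaging many timed calls after a warm-up. The sketch below shows one way to do that with the standard library. It is not the repository's benchmarking code, and `fake_recognizer` is a stand-in workload rather than a real model.

```python
import time
from statistics import mean
from typing import Any, Callable


def mean_latency_ms(fn: Callable[[], Any], warmup: int = 5, runs: int = 50) -> float:
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def fake_recognizer() -> int:
    """Stand-in workload; replace with a call that runs the real model."""
    return sum(i * i for i in range(1000))


latency = mean_latency_ms(fake_recognizer)
print(f"{latency:.3f} ms per call")
```

When timing a PyTorch model on a GPU, the device should be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, since CUDA kernels run asynchronously and unsynchronized timings can look misleadingly small.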

Maintenance & Community

The project is a fork of the CLOVA AI Deep Text Recognition Benchmark.
The primary contributor is Rowel Atienza.
The paper associated with this work is "Vision Transformer for Fast and Efficient Scene Text Recognition" (ICDAR 2021).

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, being a fork of another project, its licensing status may depend on the original repository. Users should verify licensing for commercial use.

Limitations & Caveats

The unstated license (see Licensing & Compatibility) may pose a risk for commercial adoption.
Training requires significant computational resources, especially for larger models and extensive data augmentation.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days
