SmartResume by alibaba

AI-powered resume parsing system

Created 2 months ago
319 stars

Top 85.0% on SourcePulse

Project Summary

SmartResume is a layout-aware resume parsing system. It ingests resumes in PDF, image, and Office formats, extracts clean text, and reconstructs the reading order. It then uses LLMs to convert that content into structured fields such as basic info, education, and work experience, giving engineers and researchers structured data they can analyze directly.

How It Works

SmartResume processes a resume in three stages. First, it extracts clean text using OCR and PDF metadata. Second, it uses layout detection to reconstruct the correct reading order. Finally, Large Language Models (LLMs) convert the ordered content into structured data fields. This layout-aware approach helps when a resume's visual formatting is critical to its meaning.
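The reading-order and LLM stages can be illustrated with a minimal sketch. It is not SmartResume's implementation: the `Block` type, the two-column midline heuristic, and the prompt wording are all assumptions made for illustration, and the project's actual layout detector is a trained model rather than a sort rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A text block with its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order blocks column by column, top to bottom within each column.

    Assumption: a block belongs to the left column when its horizontal
    center lies left of the page midline. This is a toy heuristic.
    """
    mid = page_width / 2

    def key(b: Block):
        column = 0 if (b.x0 + b.x1) / 2 < mid else 1
        return (column, b.y0, b.x0)

    return sorted(blocks, key=key)


def build_prompt(blocks: List[Block]) -> str:
    """Join the ordered text into a prompt asking for structured fields."""
    body = "\n".join(b.text for b in blocks)
    return (
        "Extract basic info, education, and work experience "
        "from this resume as JSON:\n" + body
    )


# Blocks arrive in arbitrary order, as they might from OCR.
blocks = [
    Block("Skills: Python", 320, 40, 580, 60),
    Block("Jane Doe", 20, 20, 280, 40),
    Block("Education: BSc", 20, 60, 280, 80),
    Block("Experience: Acme", 320, 80, 580, 100),
]
ordered = reading_order(blocks, page_width=600)
print([b.text for b in ordered])
# ['Jane Doe', 'Education: BSc', 'Skills: Python', 'Experience: Acme']
```

A naive top-to-bottom sort would interleave the two columns here, which is the kind of error that layout-aware ordering is meant to avoid before the text reaches the LLM.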

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda create -n resume_parsing python=3.9, then conda activate resume_parsing), and install dependencies from the repository root (pip install -e .).
  • Prerequisites: Python >= 3.9, CUDA >= 11.0 (optional, for GPU acceleration), Memory >= 8 GB, Storage >= 10 GB.
  • Configuration: Edit configs/config.yaml to add API keys.
  • Links: Code repository (implied), Model, Demo, Technical Report (English/Chinese).
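As a hedged illustration of the prerequisites above, the following sketch checks the Python version and free disk space using only the standard library. The function name `check_prerequisites` and the result keys are hypothetical and not part of SmartResume; memory and CUDA checks are omitted because the standard library has no portable way to query them.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)   # from the listed prerequisite "Python >= 3.9"
MIN_FREE_GB = 10      # from the listed prerequisite "Storage >= 10GB"


def check_prerequisites(path: str = ".") -> dict:
    """Report whether the interpreter and free disk space meet the minimums."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "storage_ok": free_gb >= MIN_FREE_GB,
        "free_gb": round(free_gb, 1),
    }


print(check_prerequisites())
```

Running this inside the activated conda environment before pip install -e . gives a quick sanity check of the environment.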

Highlighted Details

  • Layout Detection achieves an mAP@0.5 of 92.1%.
  • Information Extraction reports an Overall Accuracy of 93.1%.
  • Processing speed is reported as 1.22 s per page.
  • Supports many major global languages.
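To make the layout-detection metric concrete: mAP@0.5 counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows that matching rule and converts the reported per-page time into a rough throughput. The box coordinates are made-up examples, and the throughput figure assumes pages are processed one after another at the reported speed, which may not hold on other hardware.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """A detection counts as correct at mAP@0.5 when IoU >= 0.5."""
    return iou(pred, truth) >= threshold


truth = (0, 0, 100, 50)   # illustrative ground-truth box
good = (10, 0, 110, 50)   # shifted slightly: large overlap
poor = (60, 0, 160, 50)   # shifted far: small overlap

print(is_match(good, truth))  # True
print(is_match(poor, truth))  # False

# Sequential throughput implied by 1.22 s per page.
pages_per_minute = 60 / 1.22
print(round(pages_per_minute, 1))  # 49.2
```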

Maintenance & Community

The project includes a TODO list indicating ongoing development, such as optimizing model loading and enhancing vLLM deployment. No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the provided text.

Licensing & Compatibility

The project states it is licensed under "LICENSE" (apparently a reference to the repository's license file; the provided text does not name the license) and mentions plans to adopt more permissive licenses. The codebase is a refactored version produced to meet open-source compliance requirements, with internal PDF parsing and OCR components replaced by open-source alternatives. The licensing terms therefore need to be checked before commercial use or closed-source integration.

Limitations & Caveats

Because internal PDF parsing and OCR components were replaced with open-source alternatives in this refactored release, compatibility with the original implementation may be affected, and some features may not be fully functional. Items on the TODO list, such as model-loading optimization and vLLM deployment support, are still in progress.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 22 stars in the last 30 days
