langextract  by google

Extract structured data from text with LLMs

Created 3 months ago
16,357 stars

Top 2.9% on SourcePulse

Project Summary

LangExtract is a Python library designed for extracting structured data from unstructured text using Large Language Models (LLMs). It targets developers and researchers needing to process documents like clinical notes or reports, offering precise source grounding, interactive visualization, and flexible LLM support for adaptable domain-specific extraction without fine-tuning.

How It Works

LangExtract handles long documents by splitting the text into chunks, processing the chunks in parallel, and running multiple extraction passes. Few-shot examples and prompt instructions guide the LLM (for example, Gemini or OpenAI models) in identifying and structuring information, and each extraction is mapped to its exact location in the source text. The examples keep the output structure consistent across chunks, and the source mapping lets results be checked against the original text in an interactive HTML visualization.
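
The chunk-then-ground flow described above can be illustrated with a small standalone sketch. It does not use LangExtract and is not the library's implementation: `chunk_text`, `fake_model`, `ground`, and `extract` are hypothetical names, the "model" is a keyword matcher standing in for an LLM call, and the multiple-pass step is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Chunk:
    start: int  # character offset of this chunk within the full document
    text: str


@dataclass
class Grounded:
    label: str
    text: str
    start: int  # absolute character offsets into the full document
    end: int


def chunk_text(document: str, size: int, overlap: int = 0) -> list[Chunk]:
    """Split a document into fixed-size character windows, keeping offsets."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [Chunk(i, document[i:i + size]) for i in range(0, len(document), step)]


def fake_model(chunk: Chunk) -> list[tuple[str, str]]:
    """Hypothetical stand-in for an LLM call: returns (label, exact_text) pairs."""
    names = ["Romeo", "Juliet"]
    return [("character", name) for name in names if name in chunk.text]


def ground(chunk: Chunk, found: list[tuple[str, str]]) -> list[Grounded]:
    """Map each extracted string back to absolute offsets in the source."""
    grounded = []
    for label, text in found:
        pos = chunk.text.find(text)
        if pos >= 0:  # drop anything that is not a verbatim quote of the chunk
            begin = chunk.start + pos
            grounded.append(Grounded(label, text, begin, begin + len(text)))
    return grounded


def extract(document: str, size: int = 40, overlap: int = 10, workers: int = 4) -> list[Grounded]:
    """Chunk the document, process chunks in parallel, and merge the results."""
    chunks = chunk_text(document, size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: ground(c, fake_model(c)), chunks))
    # Overlapping chunks can report the same span twice, so deduplicate by span.
    unique = {(g.start, g.end, g.label): g for gs in per_chunk for g in gs}
    return [unique[key] for key in sorted(unique)]
```

Because every result carries absolute offsets, a caller can verify any extraction with `document[g.start:g.end] == g.text`. The overlap between chunks is one simple way to avoid losing an entity that straddles a chunk boundary.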

Quick Start & Requirements

  • Installation: pip install langextract
  • Prerequisites: An API key for cloud-hosted models (Gemini, OpenAI). Local models are supported via Ollama.
  • Setup: Providing the API key through the LANGEXTRACT_API_KEY environment variable or a .env file is recommended.
  • Documentation: Full Romeo and Juliet Extraction Example, RadExtract Demo
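
As an illustration of the setup step, the sketch below looks up LANGEXTRACT_API_KEY in the process environment and falls back to a .env file. It uses only the standard library; `read_dotenv` and `get_api_key` are hypothetical helpers written for this example, not part of LangExtract, and the .env parsing is deliberately minimal (plain KEY=VALUE lines only).

```python
import os
from pathlib import Path


def read_dotenv(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: dict[str, str] = {}
    file = Path(path)
    if not file.is_file():
        return values
    for line in file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(env=None, dotenv_path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    env = os.environ if env is None else env
    key = env.get("LANGEXTRACT_API_KEY") or read_dotenv(dotenv_path).get("LANGEXTRACT_API_KEY")
    if not key:
        raise RuntimeError("LANGEXTRACT_API_KEY is not set")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than a rejected request later on. If a .env file is used, it is usually kept out of version control.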

Highlighted Details

  • Precise source grounding with visual highlighting.
  • Supports cloud LLMs (Gemini, OpenAI) and local LLMs via Ollama.
  • Interactive HTML visualization of extracted entities.
  • Optimized for long documents with chunking and parallel processing.
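
To show how source grounding makes visual highlighting possible, here is a minimal sketch that wraps character spans in HTML `<mark>` tags. It is an assumption-laden illustration rather than LangExtract's own visualization code: `highlight` is a hypothetical function, and it expects non-overlapping `(start, end, label)` spans.

```python
from html import escape


def highlight(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span in a <mark> tag; spans must not overlap."""
    parts: list[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end, label)}")
        parts.append(escape(text[cursor:start]))  # plain text before the span
        parts.append(f'<mark title="{escape(label)}">{escape(text[start:end])}</mark>')
        cursor = end
    parts.append(escape(text[cursor:]))  # remaining plain text after the last span
    return "".join(parts)


print(highlight("Romeo & Juliet", [(0, 5, "character")]))
# prints: <mark title="character">Romeo</mark> &amp; Juliet
```

Escaping both the source text and the label keeps the output valid HTML even when the document contains characters such as `&` or `<`.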

Maintenance & Community

This is not an officially supported Google product. Contributions are welcome via pull requests after signing a Contributor License Agreement.

Licensing & Compatibility

Licensed under the Apache 2.0 License. Use in health-related applications is subject to the Health AI Developer Foundations Terms of Use.

Limitations & Caveats

Schema constraints are not yet implemented for OpenAI models. Users should consult model lifecycle documentation for Gemini versions.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 8
  • Issues (30d): 15
  • Star history: 1,398 stars in the last 30 days
