langextract  by google

Extract structured data from text with LLMs

created 1 month ago
11,437 stars

Top 4.4% on SourcePulse

Project Summary

LangExtract is a Python library for extracting structured data from unstructured text using Large Language Models (LLMs). It targets developers and researchers who process documents such as clinical notes or reports. It offers precise source grounding, interactive visualization, and support for multiple LLM backends, so extraction can be adapted to a specific domain without fine-tuning.

How It Works

LangExtract handles long documents through text chunking, parallel processing, and multiple extraction passes. It uses few-shot examples and prompt engineering to guide LLMs (such as Gemini or OpenAI models) in identifying and structuring information, and it maps each extraction to its exact location in the source text. This approach is intended to produce reliable, consistent outputs, and it enables interactive HTML visualizations for easy verification.
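
The chunk, extract in parallel, and ground steps described above can be sketched in plain Python. This is an illustrative sketch only, not LangExtract's code or API: `Extraction`, `chunk_text`, `fake_llm_extract`, `ground`, and `extract` are hypothetical names, the "LLM" is replaced by a regular expression so the example runs offline, and multiple passes are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class Extraction:
    label: str
    text: str
    start: int  # character offset in the full document
    end: int    # exclusive end offset in the full document


def chunk_text(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs, breaking on a space where possible."""
    chunks = []
    pos = 0
    while pos < len(text):
        end = min(pos + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            split = text.rfind(" ", pos, end)
            if split > pos:
                end = split + 1
        chunks.append((pos, text[pos:end]))
        pos = end
    return chunks


def fake_llm_extract(chunk: str) -> list[tuple[str, str]]:
    """Stand-in for a model call (assumption): label every capitalized word a 'name'."""
    return [("name", m.group()) for m in re.finditer(r"\b[A-Z][a-z]+\b", chunk)]


def ground(offset: int, chunk: str, items: list[tuple[str, str]]) -> list[Extraction]:
    """Locate each returned snippet in its chunk and convert to document offsets."""
    results = []
    cursor = 0
    for label, snippet in items:
        idx = chunk.find(snippet, cursor)
        if idx == -1:
            continue  # snippet does not appear verbatim, so it cannot be grounded
        start = offset + idx
        results.append(Extraction(label, snippet, start, start + len(snippet)))
        cursor = idx + len(snippet)
    return results


def extract(text: str, max_chars: int = 40, max_workers: int = 4) -> list[Extraction]:
    """Chunk the text, process chunks concurrently, and merge grounded results in order."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        groups = list(pool.map(lambda c: ground(c[0], c[1], fake_llm_extract(c[1])), chunks))
    return [e for group in groups for e in group]
```

Because every `Extraction` carries document-level offsets, a caller can check any result with `text[e.start:e.end] == e.text`, which is the property that makes later highlighting and verification possible.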

Quick Start & Requirements

  • Installation: pip install langextract
  • Prerequisites: An API key for cloud-hosted models (Gemini, OpenAI). Local LLMs are supported via Ollama.
  • Setup: Providing the API key through the LANGEXTRACT_API_KEY environment variable or a .env file is recommended.
  • Documentation: Full Romeo and Juliet Extraction Example, RadExtract Demo
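
As a rough illustration of the setup step, the sketch below shows one way a script might look up the key: the LANGEXTRACT_API_KEY environment variable first, then a simple KEY=VALUE .env file. The helper `load_api_key` is hypothetical and is not part of LangExtract; the library's own lookup may differ.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_file: str = ".env", name: str = "LANGEXTRACT_API_KEY") -> Optional[str]:
    """Return the key from the environment, falling back to a KEY=VALUE .env file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return raw.strip().strip("'\"") or None
    return None
```

Keeping the key out of source code this way also keeps it out of version control, provided the .env file itself is not committed.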

Highlighted Details

  • Precise source grounding with visual highlighting.
  • Supports cloud LLMs (Gemini, OpenAI) and local LLMs via Ollama.
  • Interactive HTML visualization of extracted entities.
  • Optimized for long documents with chunking and parallel processing.
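
To make the link between source grounding and visual highlighting concrete, here is a minimal sketch that turns character spans into highlighted HTML. It is an assumption-laden illustration, not LangExtract's visualization code: `highlight` is a hypothetical helper, spans are `(start, end, label)` tuples, and overlapping spans are rejected rather than handled.

```python
import html


def highlight(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span in a <mark> tag, escaping all text."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        # Escape the untouched text before the span, then the span itself.
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(text[start:end])}</mark>'
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)
```

For example, `highlight("Romeo & Juliet", [(0, 5, "character"), (8, 14, "character")])` marks both names and escapes the ampersand between them.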

Maintenance & Community

This is not an officially supported Google product. Contributions are welcome via pull requests after signing a Contributor License Agreement.

Licensing & Compatibility

Licensed under the Apache 2.0 License. Use in health-related applications is subject to the Health AI Developer Foundations Terms of Use.

Limitations & Caveats

Schema constraints are not yet implemented for OpenAI models. Users should consult model lifecycle documentation for Gemini versions.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 67
  • Issues (30d): 87
  • Star History: 11,527 stars in the last 30 days
