Extract structured data from text with LLMs
LangExtract is a Python library designed for extracting structured data from unstructured text using Large Language Models (LLMs). It targets developers and researchers needing to process documents like clinical notes or reports, offering precise source grounding, interactive visualization, and flexible LLM support for adaptable domain-specific extraction without fine-tuning.
How It Works
LangExtract handles long documents by splitting the text into chunks, processing the chunks in parallel, and running multiple extraction passes. It uses few-shot examples and prompt engineering to guide LLMs (such as Gemini or OpenAI models) in identifying and structuring information, and it maps each extraction to its exact location in the source text. This grounding makes outputs easier to check for consistency and supports interactive HTML visualizations for verification.
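The chunk, parallelize, and ground steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not LangExtract's implementation or API: `Extraction`, `chunk_text`, `extract_from_chunk`, `merge`, and `extract_document` are hypothetical names, a regular expression stands in for the LLM call, and the multiple-pass step is omitted. Overlapping windows are an assumed design choice so that a short span cut at one chunk boundary still appears whole in a neighbouring chunk.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    label: str
    text: str
    start: int  # absolute character offset in the full document
    end: int    # exclusive end offset


# Stand-in for an LLM: finds dosage mentions such as "400 mg".
DOSAGE = re.compile(r"\b\d+\s?mg\b")


def chunk_text(text, size=80, overlap=20):
    """Yield (offset, chunk) pairs that cover the text with overlapping windows."""
    step = size - overlap
    for offset in range(0, max(len(text) - overlap, 1), step):
        yield offset, text[offset:offset + size]


def extract_from_chunk(offset, chunk):
    """Extract from one chunk and shift chunk-relative spans to document offsets."""
    return [
        Extraction("dosage", m.group(), offset + m.start(), offset + m.end())
        for m in DOSAGE.finditer(chunk)
    ]


def merge(extractions):
    """Drop duplicates and partial matches contained in a longer kept span."""
    kept = []
    for ex in sorted(set(extractions), key=lambda e: (e.start, -e.end)):
        if kept and ex.start >= kept[-1].start and ex.end <= kept[-1].end:
            continue
        kept.append(ex)
    return kept


def extract_document(text, max_workers=4):
    """Process all chunks in parallel, then merge results by source position."""
    chunks = list(chunk_text(text))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = pool.map(lambda c: extract_from_chunk(*c), chunks)
        found = [ex for batch in per_chunk for ex in batch]
    return merge(found)
```

Because every extraction carries absolute offsets, a caller can check that `text[ex.start:ex.end] == ex.text` for each result, which is the property that makes highlighting extractions in the original document possible.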
Quick Start & Requirements
Install: pip install langextract
API key: setting it via an environment variable (LANGEXTRACT_API_KEY) or .env file is recommended.
Highlighted Details
Maintenance & Community
This is not an officially supported Google product. Contributions are welcome via pull requests after signing a Contributor License Agreement.
Licensing & Compatibility
Licensed under the Apache 2.0 License. Use in health-related applications is subject to the Health AI Developer Foundations Terms of Use.
Limitations & Caveats
Schema constraints are not yet implemented for OpenAI models. Users should consult model lifecycle documentation for Gemini versions.