langextract by google

Extract structured data from text with LLMs

Created 4 months ago

17,005 stars

Top 2.8% on SourcePulse

4 Experts Love This Project

jmorganca, Cofounder of Ollama
dguido, Cofounder of Trail of Bits
hammer (Jeff Hammerbacher), Cofounder of Cloudera
omarsar, Founder of DAIR.AI

Project Summary

LangExtract is a Python library designed for extracting structured data from unstructured text using Large Language Models (LLMs). It targets developers and researchers needing to process documents like clinical notes or reports, offering precise source grounding, interactive visualization, and flexible LLM support for adaptable domain-specific extraction without fine-tuning.

How It Works

LangExtract handles long documents by chunking the text, processing the chunks in parallel, and running multiple extraction passes. Few-shot examples and prompt engineering guide the LLM (for example, Gemini or OpenAI models) in identifying and structuring information, and each extraction is mapped to its exact location in the source text. This grounding keeps outputs consistent and lets users verify them in an interactive HTML visualization.
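As a rough illustration of the chunk, parallelize, and ground flow described above, the following Python sketch uses only the standard library. It is not LangExtract's implementation: `Span`, `chunk_text`, `stub_extract`, `ground`, and `extract_document` are hypothetical names, the stub stands in for a real LLM call, and the multiple-pass step is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One extraction plus its character offsets in the full document."""
    label: str
    text: str
    start: int
    end: int


def chunk_text(text, size, overlap):
    """Yield (offset, chunk) pairs; neighbouring chunks share `overlap` characters."""
    step = size - overlap  # assumes size > overlap
    for offset in range(0, max(len(text) - overlap, 1), step):
        yield offset, text[offset:offset + size]


def stub_extract(chunk):
    """Hypothetical stand-in for an LLM call: returns (label, exact text) pairs."""
    names = ("Romeo", "Juliet")
    return [("character", name) for name in names if name in chunk]


def ground(offset, chunk, found):
    """Map each extracted string to every matching span in the full document."""
    spans = []
    for label, text in found:
        pos = chunk.find(text)
        while pos != -1:
            start = offset + pos
            spans.append(Span(label, text, start, start + len(text)))
            pos = chunk.find(text, pos + 1)
    return spans


def extract_document(text, size=40, overlap=10, workers=4):
    """Chunk the text, process chunks in parallel, and merge grounded spans."""
    chunks = list(chunk_text(text, size, overlap))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda c: ground(c[0], c[1], stub_extract(c[1])), chunks)
        )
    # Overlapping chunks can report the same span twice, so deduplicate.
    unique = {span for spans in results for span in spans}
    return sorted(unique, key=lambda span: span.start)
```

Because every `Span` carries offsets into the original text, a caller can check `text[span.start:span.end] == span.text` for each result, which is the kind of verification that source grounding makes possible.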

Quick Start & Requirements

Installation: pip install langextract
Prerequisites: An API key for cloud-hosted models (Gemini, OpenAI). Local LLMs are supported via Ollama.
Setup: Supplying the API key via the LANGEXTRACT_API_KEY environment variable or a .env file is recommended.
Documentation: Full Romeo and Juliet Extraction Example, RadExtract Demo
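The key lookup described in the setup step can be sketched as below. This is a minimal standard-library sketch, not the library's own loading logic: `load_api_key` is a hypothetical helper, and the `.env` parsing assumes simple `NAME=value` lines. Only the variable name `LANGEXTRACT_API_KEY` comes from the page above.

```python
import os
from pathlib import Path


def load_api_key(env_file=".env", var="LANGEXTRACT_API_KEY"):
    """Return the key from the environment, else from a NAME=value line in env_file."""
    key = os.environ.get(var)
    if key:
        return key
    path = Path(env_file)
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == var:
                return value.strip().strip("'\"")
    return None  # caller decides how to handle a missing key
```

Checking the environment first means a key exported in the shell overrides the file, and returning `None` rather than raising leaves room for key-free setups such as a local model served by Ollama.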

Highlighted Details

Precise source grounding with visual highlighting.
Supports cloud LLMs (Gemini, OpenAI) and local LLMs via Ollama.
Interactive HTML visualization of extracted entities.
Optimized for long documents with chunking and parallel processing.

Maintenance & Community

This is not an officially supported Google product. Contributions are welcome via pull requests after signing a Contributor License Agreement.

Licensing & Compatibility

Licensed under the Apache 2.0 License. Use in health-related applications is subject to the Health AI Developer Foundations Terms of Use.

Limitations & Caveats

Schema constraints are not yet implemented for OpenAI models. Users should consult model lifecycle documentation for Gemini versions.

Health Check

Last Commit: 4 days ago
Responsiveness: Inactive
Pull Requests (30d): 9
Issues (30d): 10

Star History

328 stars in the last 30 days

Explore Similar Projects

Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 2 more.

llmparser by kyang6

LLM tool for structured data extraction and classification

Created 2 years ago

Updated 2 years ago

Starred by Gabriel Almeida (Cofounder of Langflow), Rodrigo Nader (Cofounder of Langflow), and 1 more.

doctran by finic-ai

LLM tool for document transformation using natural language instructions

Created 2 years ago

Updated 1 year ago

KG-Pipeline by FareedKhan-dev

LLM-powered pipeline for text-to-knowledge graph conversion

Created 7 months ago

Updated 7 months ago

Starred by Dharmesh Shah (Cofounder of HubSpot).

thepipe by emcf

SDK for extracting data from documents

Created 1 year ago

Updated 1 month ago

docstrange by NanoNets

Extract and convert data from any document to multiple formats

Created 4 months ago

Updated 1 month ago

Starred by Jerry Liu (Cofounder of LlamaIndex).

OpenContracts by Open-Source-Legal

LLM workspace for unstructured document analytics

Created 3 years ago

Updated 1 day ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

kg-gen by stair-lab

Knowledge graph generator for text analysis and RAG

Created 1 year ago

Updated 1 week ago

Starred by Andreas Jansson (Cofounder of Replicate).

dsRAG by D-Star-AI

RAG engine for unstructured data, excelling on dense text QA

Created 1 year ago

Updated 2 weeks ago

Starred by Xiaofan Luan (VP Engineering at Zilliz) and Bryan Helmig (Cofounder of Zapier).

open-parse by Filimoa

File parser for improved LLM document chunking

Created 1 year ago

Updated 1 year ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Rodrigo Nader (Cofounder of Langflow), and 2 more.

llmsherpa by nlmatics

Developer APIs for LLM project acceleration

Created 2 years ago

Updated 1 year ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera).

nv-ingest by NVIDIA

Microservice SDK for parsing unstructured documents into retrieval system inputs

Created 1 year ago

Updated 4 days ago

Starred by Tobi Lutke (Cofounder of Shopify), Rodrigo Nader (Cofounder of Langflow), and 9 more.

ragflow by infiniflow

Open-source RAG engine for deep document understanding

Created 2 years ago

Updated 2 days ago
