GoLLIE by hitz-zentroa

LLM for zero-shot information extraction using annotation guidelines

created 1 year ago
389 stars

Top 74.9% on sourcepulse

Project Summary

GoLLIE is a large language model for zero-shot information extraction (IE) that is trained to follow user-defined annotation guidelines. Schemas can be defined at inference time, and the model is reported to outperform prior zero-shot IE methods because it conditions on detailed instructions rather than relying solely on knowledge already encoded in the LLM. It is aimed at researchers and practitioners who need flexible, precise IE.

How It Works

GoLLIE takes a guideline-following approach: annotation schemas are defined as Python classes, and the annotation instructions are embedded in their docstrings. The model is trained to interpret these guidelines and extract information accordingly, so the schema can be changed on the fly without retraining. Explicitly conditioning the LLM on task-specific rules is what improves its zero-shot performance.
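To make the schema-as-code idea concrete, here is a minimal sketch of how guidelines written as Python classes might be rendered into a prompt. The `Launcher` and `Mission` classes, the helper functions, and the exact comment wording in the prompt are illustrative assumptions, not the project's actual API or template; the GoLLIE notebooks show the real one.

```python
import inspect
from dataclasses import dataclass, fields


# Hypothetical schema: each class is one label, its docstring is the guideline.
@dataclass
class Launcher:
    """Refers to a vehicle designed to carry a payload from Earth's surface to space."""

    mention: str        # name of the vehicle as written in the text
    space_company: str  # company that operates the vehicle


@dataclass
class Mission:
    """Any planned or accomplished journey beyond Earth's atmosphere."""

    mention: str  # name of the mission as written in the text
    date: str     # start date as written in the text


def render_guideline(cls) -> str:
    """Turn a schema class back into source-like text for the prompt."""
    lines = ["@dataclass", f"class {cls.__name__}:", f'    """{inspect.getdoc(cls)}"""']
    for f in fields(cls):
        type_name = f.type if isinstance(f.type, str) else f.type.__name__
        lines.append(f"    {f.name}: {type_name}")
    return "\n".join(lines)


def build_prompt(schema, text: str) -> str:
    """Concatenate the guidelines, the input text, and an open result list."""
    guidelines = "\n\n".join(render_guideline(cls) for cls in schema)
    return (
        "# The following lines describe the task definition\n"
        f"{guidelines}\n\n"
        "# This is the text to analyze\n"
        f"text = {text!r}\n\n"
        "# The annotation instances that take place in the text above are listed here\n"
        "result = ["
    )


prompt = build_prompt([Launcher, Mission], "The Ares 3 mission to Mars is scheduled for 2032.")
print(prompt)
```

Changing the schema is then just a matter of passing a different list of classes, for example `build_prompt([Mission], text)`, which is the on-the-fly adaptation described above.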

Quick Start & Requirements

  • Installation: run pip install --upgrade transformers peft bitsandbytes, then pip install flash-attn --no-build-isolation, then pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary. Additional dependencies: numpy, black, Jinja2, tqdm, rich, psutil, datasets, ruff, wandb, fschat.
  • Prerequisites: PyTorch >= 2.0.0 (2.1.0+ recommended), Flash Attention 2.0.
  • Models: 7B, 13B, and 34B parameter variants, based on Code Llama, are available on the HuggingFace Hub.
  • Usage: Refer to GoLLIE Notebooks.
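As a rough sketch of what loading one of the released checkpoints could look like with plain transformers: the Hub naming pattern `HiTZ/GoLLIE-<size>`, the helper names, and the bfloat16 setting are assumptions made for illustration. The GoLLIE notebooks use the project's own loading code and remain the authoritative reference.

```python
# Assumption: checkpoints follow the naming pattern HiTZ/GoLLIE-<size> on the Hub.
HUB_ORG = "HiTZ"
SIZES = ("7B", "13B", "34B")


def gollie_model_id(size: str = "7B") -> str:
    """Build the (assumed) Hub identifier for a given model size."""
    if size not in SIZES:
        raise ValueError(f"Unknown size {size!r}; expected one of {SIZES}")
    return f"{HUB_ORG}/GoLLIE-{size}"


def load_gollie(size: str = "7B"):
    """Download (on first call) and load the tokenizer and model weights."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = gollie_model_id(size)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return tokenizer, model


def complete(tokenizer, model, prompt: str, max_new_tokens: int = 128) -> str:
    """Greedy-decode a continuation and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `load_gollie("7B")` downloads several gigabytes of weights, so the smallest variant is the sensible starting point; moving the model to a GPU and enabling Flash Attention are left out of this sketch.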

Highlighted Details

  • Outperforms previous approaches in zero-shot Information Extraction.
  • Supports dynamic schema definition via Python classes and docstrings.
  • Offers three model sizes (7B, 13B, 34B), each with its own benchmark results.
  • Enables custom task creation and dataset generation.

Maintenance & Community

  • Models are released by the HITia (HiTZ) research group.
  • Links to a blog post, paper, and HuggingFace collection are provided.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the underlying datasets used for training may have different licensing terms or require manual acquisition and licensing. The README states, "We do not redistribute the datasets used to train and evaluate GoLLIE. Not all of them are publicly available; some require a license to access them."

Limitations & Caveats

The project does not redistribute training datasets, requiring users to acquire and potentially license certain datasets manually. Compatibility with commercial or closed-source applications may be impacted by the licensing of these external datasets.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 23 stars in the last 90 days
