GoLLIE by hitz-zentroa

LLM for zero-shot information extraction using annotation guidelines

Created 2 years ago

423 stars

Top 69.6% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

GoLLIE is a Large Language Model designed for zero-shot Information Extraction (IE) by strictly following user-defined annotation guidelines. It enables dynamic schema definition and inference, outperforming prior methods by leveraging detailed instructions rather than relying solely on pre-existing LLM knowledge. This is beneficial for researchers and practitioners needing flexible and precise IE capabilities.

How It Works

GoLLIE utilizes a guideline-following approach where annotation schemas are defined as Python classes and instructions are embedded in docstrings. The model is trained to interpret these guidelines and extract information accordingly, allowing for on-the-fly schema adaptation. This method enhances zero-shot performance by explicitly conditioning the LLM on task-specific rules.
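
To make the idea of "schemas as Python classes with guidelines in docstrings" concrete, here is a minimal sketch using only the standard library. The class names (Launcher, Mission), the helper names (render_schema, build_prompt), and the exact prompt wording are illustrative assumptions, not GoLLIE's actual base classes or template; the project's notebooks define the real format.

```python
import inspect
import textwrap
from dataclasses import dataclass, fields


# Hypothetical guideline classes: the docstring carries the annotation guideline.
@dataclass
class Launcher:
    """Refers to a vehicle designed primarily to transport payloads from the
    Earth's surface to space."""

    mention: str  # exact text span, e.g. "Falcon 9"


@dataclass
class Mission:
    """Any planned or accomplished journey beyond Earth's atmosphere with
    specific objectives."""

    mention: str


def render_schema(classes):
    """Render each guideline class back into Python source text for the prompt."""
    blocks = []
    for cls in classes:
        doc = inspect.cleandoc(cls.__doc__ or "")
        docstring = textwrap.indent(f'"""{doc}"""', "    ")
        attrs = "\n".join(
            f"    {f.name}: {getattr(f.type, '__name__', f.type)}" for f in fields(cls)
        )
        blocks.append(f"@dataclass\nclass {cls.__name__}:\n{docstring}\n{attrs}")
    return "\n\n".join(blocks)


def build_prompt(classes, text):
    """Assemble schema + input text, ending where the model should continue."""
    return (
        "# The following lines describe the task definition\n"
        f"{render_schema(classes)}\n\n"
        "# This is the text to analyze\n"
        f"text = {text!r}\n\n"
        "# The annotation instances that take place in the text above are listed here\n"
        "result = ["
    )


prompt = build_prompt([Launcher, Mission], "The Falcon 9 launched the CRS-2 mission.")
print(prompt)
```

Because the schema is ordinary text in the prompt, changing a guideline or adding a class only changes the prompt and requires no retraining, which is what makes on-the-fly schema adaptation possible.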

Quick Start & Requirements

Installation: run three pip commands in order: (1) pip install --upgrade transformers peft bitsandbytes; (2) pip install flash-attn --no-build-isolation; (3) pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
Additional dependencies: numpy, black, Jinja2, tqdm, rich, psutil, datasets, ruff, wandb, fschat.
Prerequisites: PyTorch >= 2.0.0 (2.1.0+ recommended), Flash Attention 2.0.
Models: Available on HuggingFace Hub (7B, 13B, 34B parameter variants based on CodeLLaMA).
Usage: Refer to GoLLIE Notebooks.
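
As a rough sketch of what loading one of the released checkpoints might look like with the transformers library: the Hub identifier "HiTZ/GoLLIE-7B", the need for trust_remote_code, and the helper names load_gollie and extract are assumptions made for illustration. The project's notebooks remain the authoritative usage reference.

```python
MODEL_ID = "HiTZ/GoLLIE-7B"  # assumed Hub identifier; verify against the HuggingFace collection


def load_gollie(model_id=MODEL_ID, device="cuda"):
    """Load tokenizer and model. Heavy imports are deferred until this is called."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # assumption: the checkpoint ships custom modeling code
    )
    return tokenizer, model.to(device)


def extract(prompt, tokenizer, model, max_new_tokens=128):
    """Greedy-decode a continuation of the prompt and return only the new text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling load_gollie() downloads several gigabytes of weights and expects a CUDA-capable GPU, so it is left uncalled here.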

Highlighted Details

Outperforms previous approaches in zero-shot Information Extraction.
Supports dynamic schema definition via Python classes and docstrings.
Offers three model sizes (7B, 13B, 34B); benchmark performance differs by size.
Enables custom task creation and dataset generation.
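
Since the schema is expressed as Python classes, a model's answer can be read back as a list of class instances. The sketch below assumes the output looks like a Python list of constructor calls; that output format, the parse_result helper, and the Launcher and Mission classes are hypothetical illustrations rather than the project's own parsing code. It uses the ast module instead of eval() so that unexpected text is never executed.

```python
import ast
from dataclasses import dataclass


# Hypothetical schema classes, matching what the prompt would have defined.
@dataclass
class Launcher:
    mention: str


@dataclass
class Mission:
    mention: str


def parse_result(text, allowed):
    """Turn '[Cls(field=...), ...]' into instances of the allowed classes."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list literal")
    registry = {cls.__name__: cls for cls in allowed}
    instances = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue  # ignore anything that is not a plain constructor call
        cls = registry.get(node.func.id)
        if cls is None:
            continue  # ignore classes that are not part of the schema
        # Only literal keyword values are accepted; '**kwargs' entries are skipped.
        kwargs = {
            kw.arg: ast.literal_eval(kw.value)
            for kw in node.keywords
            if kw.arg is not None
        }
        instances.append(cls(**kwargs))
    return instances


raw = '[Launcher(mention="Falcon 9"), Mission(mention="CRS-2"), Unknown(mention="x")]'
parsed = parse_result(raw, [Launcher, Mission])
print(parsed)
```

Dropping classes that are not in the schema is one possible design choice; a stricter pipeline could raise an error instead.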

Maintenance & Community

Models are released by the HiTZ research group (hitz-zentroa on GitHub).
Links to a blog post, paper, and HuggingFace collection are provided.

Licensing & Compatibility

The repository itself appears to be under a permissive license, but the datasets used for training carry their own terms, and some must be obtained and licensed manually. The README states, "We do not redistribute the datasets used to train and evaluate GoLLIE. Not all of them are publicly available; some require a license to access them."

Limitations & Caveats

The project does not redistribute its training and evaluation datasets, so reproducing training or evaluation requires acquiring them manually, and some require a license. The terms of these external datasets may also constrain use in commercial or closed-source applications.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

4 stars in the last 30 days