GoLLIE by hitz-zentroa

LLM for zero-shot information extraction using annotation guidelines

Created 1 year ago
395 stars

Top 73.0% on SourcePulse

Project Summary

GoLLIE is a Large Language Model designed for zero-shot Information Extraction (IE) by strictly following user-defined annotation guidelines. It enables dynamic schema definition and inference, outperforming prior methods by leveraging detailed instructions rather than relying solely on pre-existing LLM knowledge. This is beneficial for researchers and practitioners needing flexible and precise IE capabilities.

How It Works

GoLLIE utilizes a guideline-following approach where annotation schemas are defined as Python classes and instructions are embedded in docstrings. The model is trained to interpret these guidelines and extract information accordingly, allowing for on-the-fly schema adaptation. This method enhances zero-shot performance by explicitly conditioning the LLM on task-specific rules.
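The idea of writing a schema as Python classes with guidelines in docstrings can be sketched as follows. This is a minimal illustration, not GoLLIE's actual template code: the `Person` and `Organization` classes, their guideline wording, the `render_schema` and `build_prompt` helpers, and the `result = [` prompt ending are all assumptions made for the example.

```python
from dataclasses import dataclass, fields
import inspect


# Hypothetical entity types. The docstring carries the annotation guideline.
@dataclass
class Person:
    """A named individual human being. Do not label pronouns or job titles."""

    span: str  # the exact text of the mention


@dataclass
class Organization:
    """A company, institution, or agency referred to by name."""

    span: str  # the exact text of the mention


def render_schema(*classes) -> str:
    """Render dataclasses as Python-like text, one class per guideline."""
    blocks = []
    for cls in classes:
        lines = [
            "@dataclass",
            f"class {cls.__name__}:",
            f'    """{inspect.getdoc(cls)}"""',
        ]
        for field in fields(cls):
            # field.type is a class normally, or a string under postponed annotations
            type_name = getattr(field.type, "__name__", str(field.type))
            lines.append(f"    {field.name}: {type_name}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_prompt(text: str, *classes) -> str:
    """Combine the schema and the input text; the model would continue the list."""
    return f"{render_schema(*classes)}\n\ntext = {text!r}\n\nresult = ["


if __name__ == "__main__":
    print(build_prompt("Ada Lovelace worked with Charles Babbage.", Person, Organization))
```

Because the schema is ordinary text built at call time, changing a guideline or adding a class changes the prompt without retraining, which is what on-the-fly schema adaptation refers to.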

Quick Start & Requirements

  • Installation: run pip install --upgrade transformers peft bitsandbytes, then pip install flash-attn --no-build-isolation, then pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary. Additional dependencies: numpy, black, Jinja2, tqdm, rich, psutil, datasets, ruff, wandb, fschat.
  • Prerequisites: PyTorch >= 2.0.0 (2.1.0+ recommended), Flash Attention 2.0.
  • Models: Available on HuggingFace Hub (7B, 13B, 34B parameter variants based on CodeLLaMA).
  • Usage: Refer to GoLLIE Notebooks.
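Before installing the models, it can help to confirm that the environment meets the version prerequisites listed above. The sketch below uses only the standard library. The `parse_version` and `check_requirement` helpers are hypothetical names, the distribution names `torch` and `flash-attn` are taken from the pip commands, and reading "Flash Attention 2.0" as a minimum of 2.0.0 is an assumption. The version parsing is deliberately simple and treats pre-releases like the corresponding release.

```python
from importlib import metadata
import re


def parse_version(version: str) -> tuple:
    """Turn '2.1.0+cu121' into (2, 1, 0); local build suffixes are ignored."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def check_requirement(package: str, minimum: str) -> str:
    """Report whether an installed distribution meets a minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed (need >= {minimum})"
    ok = parse_version(installed) >= parse_version(minimum)
    status = "ok" if ok else f"too old (need >= {minimum})"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    # Minimums follow the prerequisites listed above.
    for package, minimum in [("torch", "2.0.0"), ("flash-attn", "2.0.0")]:
        print(check_requirement(package, minimum))
```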

Highlighted Details

  • Outperforms previous approaches in zero-shot Information Extraction.
  • Supports dynamic schema definition via Python classes and docstrings.
  • Offers three model sizes (7B, 13B, 34B); benchmark performance varies by size.
  • Enables custom task creation and dataset generation.

Maintenance & Community

  • Models are released by the HiTZ research group.
  • Links to a blog post, paper, and HuggingFace collection are provided.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the underlying datasets used for training may have different licensing terms or require manual acquisition and licensing. The README states, "We do not redistribute the datasets used to train and evaluate GoLLIE. Not all of them are publicly available; some require a license to access them."

Limitations & Caveats

Because the training and evaluation datasets are not redistributed, users must obtain them manually, and some require a license. The licensing terms of these external datasets may affect use in commercial or closed-source applications.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days
