KG-LLM-MDQA by yuwvandy

KG-LLM pipeline for multi-document question answering

created 1 year ago
308 stars

Top 88.2% on sourcepulse

Project Summary

This repository provides code and a demo for Knowledge Graph Prompting (KGP) applied to Multi-Document Question Answering (MDQA). It enables efficient and accurate answers from multiple text sources by leveraging knowledge graphs and large language models, targeting researchers and practitioners in NLP and information retrieval.

How It Works

The project implements a pipeline that first collects and processes documents for the target question-answering datasets. It then constructs knowledge graphs over the passages, linking them via lexical similarity (TF-IDF), embedding-based nearest neighbours (KNN), or entity linking (TAGME). Dense Passage Retrieval (DPR) and Multi-hop Dense Retrieval (MDR) models are fine-tuned for passage retrieval. Finally, these components are combined with instruction-tuned LLaMA or T5 models that traverse the graph and generate answers, aiming to improve QA performance through structured knowledge.
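The TF-IDF/KNN graph-construction step above can be sketched as follows. This is a minimal, self-contained illustration (the repo's actual implementation differs and TAGME entity linking is not shown): each passage becomes a node, and edges connect each passage to its k most TF-IDF-similar neighbours.

```python
import math
from collections import Counter

def tfidf_vectors(passages):
    """Compute simple TF-IDF vectors for a list of passages (illustrative only)."""
    tokenized = [p.lower().split() for p in passages]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n = len(passages)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(passages, k=2):
    """Link each passage node to its k most TF-IDF-similar neighbours."""
    vecs = tfidf_vectors(passages)
    edges = set()
    for i, u in enumerate(vecs):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vecs) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return edges
```

In practice one would use a proper tokenizer and a vectorizer library, but the resulting passage graph has the same shape: nodes are passages, edges encode similarity.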

Quick Start & Requirements

  • Install: conda install -c anaconda python=3.8, then pip install -r requirements.txt and other listed packages.
  • Prerequisites: Python 3.8, PyTorch with CUDA 11.8 (torch-scatter, Levenshtein), spacy with en_core_web_lg model, openai==0.28, langchain, nltk, rank_bm25, sentence-transformers, sentencepiece, transformers.
  • Data: Requires downloading model checkpoints and datasets from provided Dropbox links.
  • Links: Paper; PDFTriage

Highlighted Details

  • Supports fine-tuning of DPR and MDR models for passage retrieval.
  • Provides code for instruction fine-tuning of T5 and LLaMA models for graph traversal.
  • Includes a pipeline for reproducing KGP-LLM and other models from the paper.
  • Offers evaluation scripts via Jupyter notebooks.
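The LLM-guided graph traversal highlighted above can be sketched as a best-first expansion over the passage graph. In this hedged sketch, `score` is a placeholder for the fine-tuned T5/LLaMA agent that ranks candidate passages against the question; the function names and budget parameter are illustrative, not the repo's API.

```python
from collections import deque

def traverse(graph, seeds, score, budget=5):
    """Expand the passage graph from seed nodes, visiting the
    highest-scoring neighbours first; `score` stands in for the
    fine-tuned LLM traversal agent (hypothetical placeholder)."""
    visited = list(seeds)
    frontier = deque(seeds)
    while frontier and len(visited) < budget:
        node = frontier.popleft()
        neighbours = [n for n in graph.get(node, []) if n not in visited]
        # Rank unvisited neighbours by the agent's relevance score.
        for n in sorted(neighbours, key=score, reverse=True):
            if len(visited) >= budget:
                break
            visited.append(n)
            frontier.append(n)
    return visited  # passages handed to the answer-generation LLM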

Maintenance & Community

No specific community links (Discord/Slack) or details on maintainers/sponsorships are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The pinned openai==0.28 is the legacy pre-1.0 OpenAI Python SDK, whose interface is incompatible with openai>=1.0, so running the code against current SDK versions will require changes.

Limitations & Caveats

The project relies on specific versions of dependencies (e.g., openai==0.28, CUDA 11.8) which may require careful environment management. The README notes that parallel LLM API calls can incur significant costs, advising users to adjust CPU usage based on their budget. Access to all datasets and model checkpoints is via Dropbox links.
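The cost/parallelism trade-off above amounts to capping the number of concurrent API workers. A minimal sketch (not the repo's code; `call_llm` is a placeholder for the real API call):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_all(questions, call_llm, max_workers=4):
    """Fan out LLM calls with a bounded worker pool. Lowering
    `max_workers` throttles how fast, and therefore how expensively,
    API requests are issued."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, questions))
```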

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
