KG-LLM-MDQA by yuwvandy

KG-LLM pipeline for multi-document question answering

created 1 year ago
308 stars

Top 88.2% on sourcepulse

Project Summary

This repository provides code and a demo for Knowledge Graph Prompting (KGP) applied to Multi-Document Question Answering (MDQA). It enables efficient and accurate answers from multiple text sources by leveraging knowledge graphs and large language models, targeting researchers and practitioners in NLP and information retrieval.

How It Works

The project implements a pipeline that first collects and processes documents for the target question-answering datasets. It then constructs knowledge graphs over the passages, linking them via lexical similarity (TF-IDF), embedding-based nearest neighbours (KNN), or entity linking (TAGME). Dense Passage Retrieval (DPR) and Multi-hop Dense Retrieval (MDR) models are fine-tuned for passage retrieval. Finally, these components are combined with instruction-tuned LLaMA or T5 models that traverse the graph and generate answers, aiming to improve QA performance through structured knowledge.
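The TF-IDF/KNN graph-construction step above can be sketched as follows. This is a minimal, self-contained illustration (the repo's actual implementation differs and TAGME entity linking is not shown): each passage becomes a node, and edges connect each passage to its k most TF-IDF-similar neighbours.

```python
import math
from collections import Counter

def tfidf_vectors(passages):
    """Compute simple TF-IDF vectors for a list of passages (illustrative only)."""
    tokenized = [p.lower().split() for p in passages]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n = len(passages)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(passages, k=2):
    """Link each passage node to its k most TF-IDF-similar neighbours."""
    vecs = tfidf_vectors(passages)
    edges = set()
    for i, u in enumerate(vecs):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vecs) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return edges
```

In practice one would use a proper tokenizer and a vectorizer library, but the resulting passage graph has the same shape: nodes are passages, edges encode similarity.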

Quick Start & Requirements

  • Install: conda install -c anaconda python=3.8, then pip install -r requirements.txt and other listed packages.
  • Prerequisites: Python 3.8, PyTorch with CUDA 11.8 (torch-scatter, Levenshtein), spacy with en_core_web_lg model, openai==0.28, langchain, nltk, rank_bm25, sentence-transformers, sentencepiece, transformers.
  • Data: Requires downloading model checkpoints and datasets from provided Dropbox links.
  • Links: Paper; PDFTriage

Highlighted Details

  • Supports fine-tuning of DPR and MDR models for passage retrieval.
  • Provides code for instruction fine-tuning of T5 and LLaMA models for graph traversal.
  • Includes a pipeline for reproducing KGP-LLM and other models from the paper.
  • Offers evaluation scripts via Jupyter notebooks.
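The LLM-guided graph traversal highlighted above can be sketched as a best-first expansion over the passage graph. In this hedged sketch, `score` is a placeholder for the fine-tuned T5/LLaMA agent that ranks candidate passages against the question; the function names and budget parameter are illustrative, not the repo's API.

```python
from collections import deque

def traverse(graph, seeds, score, budget=5):
    """Expand the passage graph from seed nodes, visiting the
    highest-scoring neighbours first; `score` stands in for the
    fine-tuned LLM traversal agent (hypothetical placeholder)."""
    visited = list(seeds)
    frontier = deque(seeds)
    while frontier and len(visited) < budget:
        node = frontier.popleft()
        neighbours = [n for n in graph.get(node, []) if n not in visited]
        # Rank unvisited neighbours by the agent's relevance score.
        for n in sorted(neighbours, key=score, reverse=True):
            if len(visited) >= budget:
                break
            visited.append(n)
            frontier.append(n)
    return visited  # passages handed to the answer-generation LLM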

Maintenance & Community

No specific community links (Discord/Slack) or details on maintainers/sponsorships are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The pinned openai==0.28 is the legacy pre-1.0 OpenAI Python SDK, whose interface is incompatible with openai>=1.0, so running the code against current SDK versions will require changes.

Limitations & Caveats

The project relies on specific versions of dependencies (e.g., openai==0.28, CUDA 11.8) which may require careful environment management. The README notes that parallel LLM API calls can incur significant costs, advising users to adjust CPU usage based on their budget. Access to all datasets and model checkpoints is via Dropbox links.
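The cost/parallelism trade-off above amounts to capping the number of concurrent API workers. A minimal sketch (not the repo's code; `call_llm` is a placeholder for the real API call):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_all(questions, call_llm, max_workers=4):
    """Fan out LLM calls with a bounded worker pool. Lowering
    `max_workers` throttles how fast, and therefore how expensively,
    API requests are issued."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, questions))
```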

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
