Knowledge augmentation research project
KBLaM is a research project that augments Large Language Models (LLMs) with external knowledge bases without requiring a retrieval module. It targets researchers seeking to improve LLM grounding and factual accuracy; its computational overhead scales linearly with knowledge base size, unlike the quadratic scaling of in-context learning.
How It Works
KBLaM integrates knowledge by transforming knowledge base entries into special "knowledge tokens" that the LLM ingests through trained adapters. The base LLM's text-input path is left unmodified, so without a knowledge base the model behaves identically to its base counterpart. Prompt tokens attend over the knowledge tokens while the knowledge tokens themselves attend to nothing, so the attention cost grows linearly with KB size and responses can be grounded in the provided knowledge.
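To make the scaling concrete, here is a minimal single-head sketch of that rectangular attention pattern. It is an assumption-laden illustration, not the repository's code: the function name, shapes, and single-head simplification are all mine.

```python
import torch
import torch.nn.functional as F

def rectangular_attention(q, k_text, v_text, k_kb, v_kb):
    """Single-head attention where prompt tokens see [KB tokens + text tokens].

    q, k_text, v_text: (T, d) projections of the T prompt tokens
    k_kb, v_kb:        (M, d) one key/value pair per KB "knowledge token"

    KB tokens are attended to but never attend, so the cost is O(T*M) + O(T*T):
    linear in KB size M, unlike placing M entries in the context window.
    """
    T, d = q.shape
    M = k_kb.shape[0]
    k = torch.cat([k_kb, k_text], dim=0)            # (M + T, d)
    v = torch.cat([v_kb, v_text], dim=0)            # (M + T, d)
    scores = q @ k.T / d ** 0.5                     # (T, M + T)
    # Causal mask applies only to the text-to-text block; KB columns stay visible.
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores[:, M:] = scores[:, M:].masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v            # (T, d)

# Example: 4 prompt tokens, 3 KB entries, head dimension 8.
T, M, d = 4, 3, 8
out = rectangular_attention(torch.randn(T, d), torch.randn(T, d),
                            torch.randn(T, d), torch.randn(M, d), torch.randn(M, d))
print(out.shape)  # torch.Size([4, 8])
```

Because the score matrix is T x (M + T) rather than (M + T) x (M + T), adding KB entries adds columns only, which is where the linear overhead comes from.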
Quick Start & Requirements
pip install -e .
huggingface-cli login
Hugging Face login is needed to download the gated Llama checkpoints. An Azure OpenAI endpoint is required for synthetic dataset generation. Both text-embedding-ada-002 and all-MiniLM-L6-v2 are supported for KB embeddings, as in the sketch below.
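Illustrative only: one way to produce KB embeddings with all-MiniLM-L6-v2 through the sentence-transformers library. The (name, property, value) entry format follows the project's triple-style KB entries, but the key-string template here is an assumption, not necessarily KBLaM's exact pipeline.

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Assumed entry format: (name, property, value) triples; the key template
# below is illustrative, not necessarily the repo's exact phrasing.
entries = [
    ("KBLaM", "purpose", "augmenting LLMs with external knowledge bases"),
    ("KBLaM", "scaling", "linear in the number of KB entries"),
]
keys = [f"the {prop} of {name}" for name, prop, _ in entries]
values = [value for _, _, value in entries]

key_embeddings = encoder.encode(keys)      # shape: (2, 384)
value_embeddings = encoder.encode(values)  # shape: (2, 384)
```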
Highlighted Details
Supported base models: meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct, and Phi-3-mini-4k-instruct. train.py allows customization of dataset, batch size, steps, encoder, and embedding source, along the lines of the sketch below.
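Purely to illustrate those knobs: the flag names and defaults below are invented for this sketch and are not the script's actual CLI; consult train.py itself for the real interface.

```python
# Hypothetical sketch of train.py's configuration surface; every flag name
# here is invented for illustration and is NOT the repository's actual CLI.
import argparse

parser = argparse.ArgumentParser(description="KBLaM adapter training (illustrative)")
parser.add_argument("--dataset", default="synthetic_data",
                    help="which KB/QA dataset to train the adapters on")
parser.add_argument("--batch_size", type=int, default=20)
parser.add_argument("--total_steps", type=int, default=600)
parser.add_argument("--encoder", default="all-MiniLM-L6-v2",
                    help="sentence encoder used for KB embeddings")
parser.add_argument("--embedding_source", choices=["key", "value"], default="key",
                    help="which part of each KB entry is embedded")
args = parser.parse_args()
print(args)
```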
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
KBLaM is intended for research use and is not recommended for production settings. When the knowledge base at inference time differs significantly from the data the adapters were trained on, the model may produce incomplete, reworded, or incorrect answers; effective use requires the training and deployment knowledge bases to be sufficiently similar.