DALM by arcee-ai

RAG toolkit for domain-specific language modeling

created 2 years ago
325 stars

Top 85.0% on sourcepulse

Project Summary

The Arcee Domain Adapted Language Model (DALM) toolkit provides an end-to-end, fully differentiable Retrieval-Augmented Generation (RAG) framework for adapting large language models (LLMs) to specific domains. It targets developers and researchers who want to ground LLMs in proprietary or specialized knowledge bases, improving specificity and factual accuracy. The toolkit supports fine-tuning decoder-only LLMs with RAG and uses efficient in-batch negative sampling.

How It Works

DALM implements a novel end-to-end RAG architecture compatible with decoder-only LLMs such as Llama and Falcon, extending prior work that targeted encoder-decoder models. The retriever is trained with a contrastive objective and optimized jointly with the generator, so passage retrieval and text generation are tuned together within a single differentiable pipeline for efficient, domain-specific grounding.
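Conceptually, each query in a batch treats its own passage as the positive and every other passage in the batch as a negative, and the retriever loss is combined with the generator's token-level loss so that gradients reach both components. The sketch below only illustrates that idea; the function names, temperature, and loss weighting are assumptions, not DALM's actual implementation.

    import torch
    import torch.nn.functional as F

    def contrastive_retriever_loss(query_emb, passage_emb, temperature=0.05):
        # query_emb, passage_emb: (batch, dim). Row i of passage_emb is the
        # positive passage for query i; all other rows act as in-batch negatives.
        query_emb = F.normalize(query_emb, dim=-1)
        passage_emb = F.normalize(passage_emb, dim=-1)
        logits = query_emb @ passage_emb.T / temperature          # (batch, batch)
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

    def joint_rag_loss(query_emb, passage_emb, generator_nll, alpha=1.0):
        # generator_nll: per-example negative log-likelihood of the answer tokens
        # from the decoder-only generator conditioned on (query, retrieved passage).
        # Both terms are differentiable, so one backward pass updates the retriever
        # and the generator together.
        return contrastive_retriever_loss(query_emb, passage_emb) + alpha * generator_nll.mean()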

Quick Start & Requirements

  • Installation: pip install indomain, or clone the repository and run pip install --upgrade -e .
  • Prerequisites: Python, Hugging Face models (e.g., BAAI/bge-large-en retriever, meta-llama/Llama-2-7b-hf generator).
  • Data: requires a CSV with 'Passage', 'Query', and optionally 'Answer' columns (a minimal layout example follows this list). A data preprocessing script, dalm/datasets/qa_gen/question_answer_generation.py, is available.
  • Hardware: the training example (a 200k-row dataset) took 7 hours on a single A100 GPU (80 GB) with a batch size of 18.
  • Docs: Demo DALMs; Query example.
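
As a reference for the data format above, the expected CSV layout can be produced with a few lines of pandas; the column names come from the README, while the example rows and file name are placeholders.

    import pandas as pd

    # Minimal example of the expected training-data layout:
    # 'Passage' and 'Query' are required; 'Answer' is optional.
    df = pd.DataFrame(
        {
            "Passage": ["Arcee DALM adapts LLMs to a domain via end-to-end RAG."],
            "Query": ["What does the DALM toolkit do?"],
            "Answer": ["It fine-tunes a retriever and a decoder-only LLM jointly."],
        }
    )
    df.to_csv("domain_qa.csv", index=False)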

Highlighted Details

  • Achieves a retriever end-to-end recall of 0.73634, significantly outperforming the plain retriever baseline (0.45984).
  • Supports PEFT (Parameter-Efficient Fine-Tuning) for efficient adaptation; a rough sketch follows this list.
  • Includes scripts for retriever-only training, joint RAG-e2e training, and evaluation.
  • Works with any Hugging Face-compatible embedding model and any autoregressive (decoder-only) generator.
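
As a rough illustration of the PEFT point above, the generator could be wrapped in LoRA adapters via the Hugging Face peft library; the rank, alpha, and target modules below are illustrative assumptions rather than DALM's actual settings.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Wrap the generator in LoRA adapters so only a small set of weights is trained.
    base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora_config = LoraConfig(
        r=8,                                   # low-rank dimension (illustrative)
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],   # attention projections in Llama
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()         # only the adapter weights are trainable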

Maintenance & Community

  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The README notes that the transformers library may need to be downgraded to v4.30 on non-Intel Macs.
  • License information is absent, which may impact commercial adoption.
Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
