DALM by arcee-ai

RAG toolkit for domain-specific language modeling

created 2 years ago
325 stars

Top 85.0% on sourcepulse

Project Summary

The Arcee Domain Adapted Language Model (DALM) toolkit provides an end-to-end, fully differentiable Retrieval-Augmented Generation (RAG) framework for adapting large language models (LLMs) to specific domains. It targets developers and researchers who want to ground LLMs in proprietary or specialized knowledge bases, improving specificity and factual accuracy. The toolkit supports fine-tuning decoder-only LLMs with RAG and uses efficient in-batch negative sampling.

How It Works

DALM implements a novel end-to-end RAG architecture compatible with decoder-only LLMs such as Llama and Falcon, extending prior work that targeted encoder-decoder models. The retriever is trained with a contrastive objective and optimized jointly with the generator, so passage retrieval and text generation are tuned together within a single differentiable pipeline for efficient, domain-specific grounding.
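Conceptually, each query in a batch treats its own passage as the positive and every other passage in the batch as a negative, and the retriever loss is combined with the generator's token-level loss so that gradients reach both components. The sketch below only illustrates that idea; the function names, temperature, and loss weighting are assumptions, not DALM's actual implementation.

    import torch
    import torch.nn.functional as F

    def contrastive_retriever_loss(query_emb, passage_emb, temperature=0.05):
        # query_emb, passage_emb: (batch, dim). Row i of passage_emb is the
        # positive passage for query i; all other rows act as in-batch negatives.
        query_emb = F.normalize(query_emb, dim=-1)
        passage_emb = F.normalize(passage_emb, dim=-1)
        logits = query_emb @ passage_emb.T / temperature          # (batch, batch)
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

    def joint_rag_loss(query_emb, passage_emb, generator_nll, alpha=1.0):
        # generator_nll: per-example negative log-likelihood of the answer tokens
        # from the decoder-only generator conditioned on (query, retrieved passage).
        # Both terms are differentiable, so one backward pass updates the retriever
        # and the generator together.
        return contrastive_retriever_loss(query_emb, passage_emb) + alpha * generator_nll.mean()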

Quick Start & Requirements

  • Installation: pip install indomain, or clone the repository and run pip install --upgrade -e .
  • Prerequisites: Python, Hugging Face models (e.g., BAAI/bge-large-en retriever, meta-llama/Llama-2-7b-hf generator).
  • Data: requires a CSV with 'Passage', 'Query', and optionally 'Answer' columns (a minimal layout example follows this list). A data preprocessing script, dalm/datasets/qa_gen/question_answer_generation.py, is available.
  • Hardware: the training example (a 200k-row dataset) took 7 hours on a single A100 GPU (80 GB) with a batch size of 18.
  • Docs: Demo DALMs; Query example.
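
As a reference for the data format above, the expected CSV layout can be produced with a few lines of pandas; the column names come from the README, while the example rows and file name are placeholders.

    import pandas as pd

    # Minimal example of the expected training-data layout:
    # 'Passage' and 'Query' are required; 'Answer' is optional.
    df = pd.DataFrame(
        {
            "Passage": ["Arcee DALM adapts LLMs to a domain via end-to-end RAG."],
            "Query": ["What does the DALM toolkit do?"],
            "Answer": ["It fine-tunes a retriever and a decoder-only LLM jointly."],
        }
    )
    df.to_csv("domain_qa.csv", index=False)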

Highlighted Details

  • Achieves a retriever end-to-end recall of 0.73634, significantly outperforming the plain retriever baseline (0.45984).
  • Supports PEFT (Parameter-Efficient Fine-Tuning) for efficient adaptation; a rough sketch follows this list.
  • Includes scripts for retriever-only training, joint RAG-e2e training, and evaluation.
  • Works with any Hugging Face-compatible embedding model and any autoregressive (decoder-only) generator.
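
As a rough illustration of the PEFT point above, the generator could be wrapped in LoRA adapters via the Hugging Face peft library; the rank, alpha, and target modules below are illustrative assumptions rather than DALM's actual settings.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Wrap the generator in LoRA adapters so only a small set of weights is trained.
    base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora_config = LoraConfig(
        r=8,                                   # low-rank dimension (illustrative)
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],   # attention projections in Llama
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()         # only the adapter weights are trainable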

Maintenance & Community

  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The README notes that the transformers library may need to be downgraded to v4.30 on non-Intel Macs.
  • License information is absent, which may impact commercial adoption.
Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
