RAG toolkit for domain-specific language modeling
The Arcee Domain Adapted Language Model (DALM) toolkit provides an end-to-end, fully differentiable Retrieval Augmented Generation (RAG) framework for adapting large language models (LLMs) to specific domains. It targets developers and researchers aiming to ground LLMs in proprietary or specialized knowledge bases, enhancing specificity and factual accuracy. The toolkit enables fine-tuning of decoder-only LLMs with RAG, incorporating efficient in-batch negative sampling.
How It Works
DALM implements a novel end-to-end RAG architecture compatible with decoder-only LLMs like Llama and Falcon, extending prior work on encoder-decoder models. It integrates contrastive learning for retriever training and joint training of the retriever and generator. This approach aims for efficient, domain-specific grounding by optimizing both passage retrieval and text generation within a single, differentiable pipeline.
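The following is a minimal PyTorch sketch of that joint objective, not DALM's actual implementation: the variable names, batch shapes, temperature value, and placeholder generator loss are all illustrative assumptions. It shows how in-batch negative sampling turns a batch of query/passage embedding pairs into a contrastive retriever loss, and how that loss can be summed with a generator loss so both models receive gradients in one backward pass.

```python
# Illustrative sketch of a joint retriever/generator objective with
# in-batch negatives. Names and shapes are assumptions, not DALM's API.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              passage_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """For batch size B, each query's positive passage sits at the same
    index; the other B-1 passages in the batch serve as negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    logits = query_emb @ passage_emb.T / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)            # diagonal = positives

# Joint end-to-end objective: one differentiable pipeline, one backward pass.
B, D = 8, 1024
query_emb = torch.randn(B, D, requires_grad=True)    # pooled retriever output (queries)
passage_emb = torch.randn(B, D, requires_grad=True)  # pooled retriever output (passages)
gen_loss = torch.tensor(2.3, requires_grad=True)     # placeholder causal-LM loss

loss = in_batch_contrastive_loss(query_emb, passage_emb) + gen_loss
loss.backward()  # gradients flow to both retriever and generator parameters
```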
Quick Start & Requirements
Install from PyPI:

```bash
pip install indomain
```

or clone the repository and install it in editable mode:

```bash
pip install --upgrade -e .
```

The default models are `BAAI/bge-large-en` (retriever) and `meta-llama/Llama-2-7b-hf` (generator). A question-answer generation script, `dalm/datasets/qa_gen/question_answer_generation.py`, is available for producing training data.
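As a hedged sketch of that default pairing using plain Hugging Face `transformers` (DALM wraps model loading in its own training pipeline, so this only illustrates the retriever/generator combination, not the toolkit's API):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Retriever: dense embedding encoder (default: BAAI/bge-large-en).
retriever_tok = AutoTokenizer.from_pretrained("BAAI/bge-large-en")
retriever = AutoModel.from_pretrained("BAAI/bge-large-en")

# Generator: decoder-only LLM (default: meta-llama/Llama-2-7b-hf).
# Note: this model is gated and requires an approved Hugging Face token.
generator_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
generator = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```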
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project pins the `transformers` library to v4.30 and has known issues on non-Intel Macs. License information is absent, which may impact commercial adoption.