dolly by databrickslabs

Instruction-following LLM trained on the Databricks Machine Learning Platform

created 2 years ago
10,809 stars

Top 4.8% on sourcepulse

Project Summary

Databricks' Dolly is an open-source, instruction-following large language model derived from EleutherAI's Pythia-12b. It is fine-tuned on a ~15K instruction/response dataset created by Databricks employees, covering various capability domains. Dolly is designed for commercial use and aims to democratize access to instruction-tuned LLMs, offering a surprisingly capable model despite its limitations.

How It Works

Dolly-v2-12b is a 12 billion parameter causal language model. It leverages the Pythia-12b foundation model and is fine-tuned on a custom dataset of ~15,000 instruction-response pairs. This dataset was curated by Databricks employees, focusing on capabilities like brainstorming, classification, question answering, generation, information extraction, and summarization, inspired by the InstructGPT paper. This approach aims to imbue the model with strong instruction-following abilities.
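Instruction-tuned models of this kind wrap each request in a fixed template before generation. As a minimal sketch, the helper below builds an InstructGPT-style prompt of the sort dolly-v2 is understood to use; the exact template wording and the `build_prompt` helper are illustrative assumptions, not the repo's canonical code.

```python
# Illustrative sketch: InstructGPT-style prompt template (wording is an
# assumption, not necessarily dolly-v2's exact template).
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in an instruction/response template."""
    return f"{INTRO}\n\n### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Summarize the InstructGPT paper in one sentence.")
```

The model then generates text after the `### Response:` marker, and the pipeline strips the template to return only the answer.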

Quick Start & Requirements

  • Inference: Use the Hugging Face transformers library:
    from transformers import pipeline
    import torch

    instruct_pipeline = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
    
  • Hardware: A100 GPUs are recommended for best performance. Inference is also possible on A10 GPUs (with 8-bit loading) and V100 GPUs (using torch.float16).
  • Training: Requires a Databricks environment with 8 A100 GPUs (e.g., Standard_ND96asr_v4 or p4d.24xlarge). Training on A10 or V100 GPUs is possible with modifications.
  • Links: Hugging Face Model
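The hardware bullets above can be condensed into a small helper that picks pipeline settings per GPU family. The mapping mirrors the list (bfloat16 on A100, 8-bit weights on A10, float16 on V100), but the `build_pipeline_kwargs` helper itself is an illustrative assumption, not part of the repo.

```python
import torch

def build_pipeline_kwargs(gpu: str) -> dict:
    """Map a GPU family to dolly-v2 pipeline settings (illustrative helper).

    A100 -> bfloat16; A10 -> 8-bit weights (assumes bitsandbytes is
    installed); V100 -> float16 (Volta lacks bfloat16 support).
    """
    kwargs = {"trust_remote_code": True, "device_map": "auto"}
    if gpu == "A100":
        kwargs["torch_dtype"] = torch.bfloat16
    elif gpu == "A10":
        kwargs["model_kwargs"] = {"load_in_8bit": True}
    elif gpu == "V100":
        kwargs["torch_dtype"] = torch.float16
    else:
        raise ValueError(f"unrecognized GPU family: {gpu}")
    return kwargs

# Usage sketch:
# pipeline(model="databricks/dolly-v2-12b", **build_pipeline_kwargs("V100"))
```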

Highlighted Details

  • Licensed for commercial use (CC-BY-SA).
  • Fine-tuned on a proprietary, human-generated instruction dataset.
  • Demonstrates surprisingly high-quality instruction following for its base model.
  • Supports inference on various GPU types with specific configurations.

Maintenance & Community

The project is hosted by Databricks Labs. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

  • License: CC-BY-SA 3.0.
  • Compatibility: Permissive license allows for commercial use and linking with closed-source applications.

Limitations & Caveats

Dolly-v2-12b is not state-of-the-art and struggles with complex prompts, programming, math, factual accuracy, and nuanced tasks like humor or stylistic mimicry. The training data may reflect biases present in the internet and Wikipedia, as well as the specific demographics of Databricks employees.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 36 stars in the last 90 days

