dolly by databrickslabs

Instruction-following LLM trained on the Databricks Machine Learning Platform

created 2 years ago
10,809 stars

Top 4.8% on sourcepulse

Project Summary

Databricks' Dolly is an open-source, instruction-following large language model derived from EleutherAI's Pythia-12b. It is fine-tuned on a ~15K instruction/response dataset created by Databricks employees, covering various capability domains. Dolly is designed for commercial use and aims to democratize access to instruction-tuned LLMs, offering a surprisingly capable model despite its limitations.

How It Works

Dolly-v2-12b is a 12 billion parameter causal language model. It leverages the Pythia-12b foundation model and is fine-tuned on a custom dataset of ~15,000 instruction-response pairs. This dataset was curated by Databricks employees, focusing on capabilities like brainstorming, classification, question answering, generation, information extraction, and summarization, inspired by the InstructGPT paper. This approach aims to imbue the model with strong instruction-following abilities.
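Instruction-tuned models of this kind wrap each request in a fixed template before generation. As a minimal sketch, the helper below builds an InstructGPT-style prompt of the sort dolly-v2 is understood to use; the exact template wording and the `build_prompt` helper are illustrative assumptions, not the repo's canonical code.

```python
# Illustrative sketch: InstructGPT-style prompt template (wording is an
# assumption, not necessarily dolly-v2's exact template).
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in an instruction/response template."""
    return f"{INTRO}\n\n### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Summarize the InstructGPT paper in one sentence.")
```

The model then generates text after the `### Response:` marker, and the pipeline strips the template to return only the answer.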

Quick Start & Requirements

  • Inference: Use the Hugging Face transformers library:
    from transformers import pipeline
    import torch

    instruct_pipeline = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
    
  • Hardware: A100 GPUs are recommended for best performance. Inference is also possible on A10 GPUs (with 8-bit loading) and V100 GPUs (using torch.float16).
  • Training: Requires a Databricks environment with 8 A100 GPUs (e.g., Standard_ND96asr_v4 or p4d.24xlarge). Training on A10 or V100 GPUs is possible with modifications.
  • Links: Hugging Face Model
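The hardware bullets above can be condensed into a small helper that picks pipeline settings per GPU family. The mapping mirrors the list (bfloat16 on A100, 8-bit weights on A10, float16 on V100), but the `build_pipeline_kwargs` helper itself is an illustrative assumption, not part of the repo.

```python
import torch

def build_pipeline_kwargs(gpu: str) -> dict:
    """Map a GPU family to dolly-v2 pipeline settings (illustrative helper).

    A100 -> bfloat16; A10 -> 8-bit weights (assumes bitsandbytes is
    installed); V100 -> float16 (Volta lacks bfloat16 support).
    """
    kwargs = {"trust_remote_code": True, "device_map": "auto"}
    if gpu == "A100":
        kwargs["torch_dtype"] = torch.bfloat16
    elif gpu == "A10":
        kwargs["model_kwargs"] = {"load_in_8bit": True}
    elif gpu == "V100":
        kwargs["torch_dtype"] = torch.float16
    else:
        raise ValueError(f"unrecognized GPU family: {gpu}")
    return kwargs

# Usage sketch:
# pipeline(model="databricks/dolly-v2-12b", **build_pipeline_kwargs("V100"))
```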

Highlighted Details

  • Licensed for commercial use (CC-BY-SA).
  • Fine-tuned on a proprietary, human-generated instruction dataset.
  • Demonstrates surprisingly high-quality instruction following for its base model.
  • Supports inference on various GPU types with specific configurations.

Maintenance & Community

The project is hosted by Databricks Labs. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

  • License: CC-BY-SA 3.0.
  • Compatibility: Permissive license allows for commercial use and linking with closed-source applications.

Limitations & Caveats

Dolly-v2-12b is not state-of-the-art and struggles with complex prompts, programming, math, factual accuracy, and nuanced tasks like humor or stylistic mimicry. The training data may reflect biases present in the internet and Wikipedia, as well as the specific demographics of Databricks employees.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 36 stars in the last 90 days

