docprompting by shuyanzhou

Code generation via documentation retrieval

Created 3 years ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This project provides the official implementation of "DocPrompting: Generating Code by Retrieving the Docs," an approach to natural-language-to-code generation that explicitly leverages documentation. It addresses the challenge of keeping code generation models current with evolving APIs by retrieving relevant documentation before generating code. It targets researchers and engineers in NLP and software engineering who want to improve the accuracy and relevance of generated code.

How It Works

DocPrompting employs a two-stage pipeline: first, it retrieves relevant documentation snippets based on a natural language intent using either dense retrieval (e.g., CodeT5 with SimCSE) or sparse retrieval (e.g., BM25). Second, a generative model (e.g., FiD T5 or CodeT5) produces code conditioned on both the original natural language intent and the retrieved documentation. This retrieval-augmented generation approach aims to ground code generation in up-to-date API specifications.
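The two stages can be sketched in a few lines. This is a minimal illustration of the retrieve-then-generate flow, not the repository's code: a toy term-overlap scorer stands in for BM25/CodeT5 retrieval, and the generator is represented only by the prompt it would be conditioned on (the corpus and prompt format are invented for the example).

```python
from collections import Counter

# Toy documentation corpus (stand-in for real API docs / man pages).
DOCS = [
    "tar -x extracts files from an archive",
    "tar -c creates a new archive",
    "gzip compresses files using Lempel-Ziv coding",
]

def score(query: str, doc: str) -> int:
    """Crude term-overlap score, standing in for BM25 or dense retrieval."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Stage 1: return the top-k documentation snippets for the NL intent."""
    return sorted(DOCS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(intent: str, docs: list[str]) -> str:
    """Stage 2 input: condition the generator on intent + retrieved docs."""
    context = "\n".join(f"[doc] {d}" for d in docs)
    return f"{context}\n[intent] {intent}\n[code]"

intent = "extract files from a tar archive"
prompt = build_prompt(intent, retrieve(intent))
```

In the actual system, the prompt built in stage 2 is fed to a trained generator (FiD T5, CodeT5, or GPT-Neo) rather than a placeholder.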

Quick Start & Requirements

  • Datasets & Evaluation: Datasets (tldr, conala) and evaluation metrics are available via Huggingface datasets and evaluate libraries.
    import datasets
    tldr = datasets.load_dataset('neulab/tldr')
    conala = datasets.load_dataset('neulab/docprompting-conala')
    
  • Pre-trained Models: Several models are hosted on Huggingface, including neulab/docprompting-tldr-gpt-neo-1.3B.
    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("neulab/docprompting-tldr-gpt-neo-1.3B")
    model = AutoModelForCausalLM.from_pretrained("neulab/docprompting-tldr-gpt-neo-1.3B")
    
  • Prerequisites: Python, transformers library (version 3.0.2 specifically required for FiD), datasets, evaluate. GPU is recommended for training and inference. Elasticsearch is needed for BM25 retrieval.
  • Setup: Requires downloading data (docprompting_data.zip) and generator models (docprompting_generator_models.zip).
  • Links: Huggingface Datasets, Huggingface Models
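For the sparse-retrieval path, the README relies on Elasticsearch, whose default ranking function is Okapi BM25. As an illustration of what that scoring computes (a self-contained sketch, not the project's retrieval code; the example query and documents are invented), BM25 can be implemented directly:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25,
    the ranking function Elasticsearch uses by default."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            f = tf[term]  # term frequency in this document
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

scores = bm25_scores("extract tar archive",
                     ["tar extracts archives from disk", "gzip compresses files"])
```

Elasticsearch handles tokenization, analysis, and inverted-index lookups on top of this formula; the sketch only shows the scoring itself.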

Highlighted Details

  • Official implementation for an ICLR 2023 Spotlight paper.
  • Introduces new NL-to-bash (tldr) and NL-to-Python (CoNaLa) benchmarks with unseen function splits.
  • Supports both dense (CodeT5) and sparse (BM25) retrieval methods.
  • Utilizes Fusion-in-Decoder (FiD) architecture for generation.
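The Fusion-in-Decoder idea in the last bullet can be sketched abstractly: each (intent, passage) pair is encoded independently, and the resulting encoder states are concatenated so a single decoder attends over all of them at once. The snippet below is a toy illustration of that data flow only; the deterministic `encode` function stands in for T5's encoder and is not part of the project.

```python
def encode(text: str, dim: int = 4) -> list[list[float]]:
    """Toy encoder: one deterministic vector per token (stand-in for T5's encoder)."""
    return [[float(len(tok))] * dim for tok in text.split()]

def fid_fuse(intent: str, passages: list[str], dim: int = 4) -> list[list[float]]:
    """FiD-style fusion: encode each (intent, passage) pair independently,
    then concatenate all encoder states for the decoder to attend over."""
    fused = []
    for passage in passages:
        fused.extend(encode(f"{intent} {passage}", dim))
    return fused

fused = fid_fuse("list files", ["ls lists directory contents", "cat prints files"])
```

Encoding passages separately keeps the encoder's cost linear in the number of retrieved documents, while the decoder still sees all of them jointly; this is why FiD scales to many retrieved docs.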

Maintenance & Community

  • The project is associated with authors Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig.
  • No direct links to community channels (e.g., Discord, Slack) or a public roadmap are provided in the README.

Licensing & Compatibility

  • The license is not explicitly stated in the provided README.
  • No specific compatibility notes for commercial use or closed-source linking are mentioned.

Limitations & Caveats

  • The FiD generation component has a strict dependency on transformers version 3.0.2, which may hinder reproducibility if not precisely matched.
  • The absence of a stated license prevents immediate assessment of commercial use or integration compatibility.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
