docprompting by shuyanzhou

Code generation via documentation retrieval

Created 3 years ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This project provides the official implementation of "DocPrompting: Generating Code by Retrieving the Docs," an approach to natural-language-to-code generation that explicitly leverages documentation. It addresses the challenge of keeping code generation models current with evolving APIs by retrieving relevant documentation before generating code. It targets researchers and engineers in NLP and software engineering who want to improve the accuracy and relevance of generated code.

How It Works

DocPrompting employs a two-stage pipeline: first, it retrieves relevant documentation snippets based on a natural language intent using either dense retrieval (e.g., CodeT5 with SimCSE) or sparse retrieval (e.g., BM25). Second, a generative model (e.g., FiD T5 or CodeT5) produces code conditioned on both the original natural language intent and the retrieved documentation. This retrieval-augmented generation approach aims to ground code generation in up-to-date API specifications.
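The two stages can be sketched in a few lines. This is a minimal illustration of the retrieve-then-generate flow, not the repository's code: a toy term-overlap scorer stands in for BM25/CodeT5 retrieval, and the generator is represented only by the prompt it would be conditioned on (the corpus and prompt format are invented for the example).

```python
from collections import Counter

# Toy documentation corpus (stand-in for real API docs / man pages).
DOCS = [
    "tar -x extracts files from an archive",
    "tar -c creates a new archive",
    "gzip compresses files using Lempel-Ziv coding",
]

def score(query: str, doc: str) -> int:
    """Crude term-overlap score, standing in for BM25 or dense retrieval."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Stage 1: return the top-k documentation snippets for the NL intent."""
    return sorted(DOCS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(intent: str, docs: list[str]) -> str:
    """Stage 2 input: condition the generator on intent + retrieved docs."""
    context = "\n".join(f"[doc] {d}" for d in docs)
    return f"{context}\n[intent] {intent}\n[code]"

intent = "extract files from a tar archive"
prompt = build_prompt(intent, retrieve(intent))
```

In the actual system, the prompt built in stage 2 is fed to a trained generator (FiD T5, CodeT5, or GPT-Neo) rather than a placeholder.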

Quick Start & Requirements

  • Datasets & Evaluation: Datasets (tldr, conala) and evaluation metrics are available via Huggingface datasets and evaluate libraries.
    import datasets
    tldr = datasets.load_dataset('neulab/tldr')
    conala = datasets.load_dataset('neulab/docprompting-conala')
    
  • Pre-trained Models: Several models are hosted on Huggingface, including neulab/docprompting-tldr-gpt-neo-1.3B.
    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("neulab/docprompting-tldr-gpt-neo-1.3B")
    model = AutoModelForCausalLM.from_pretrained("neulab/docprompting-tldr-gpt-neo-1.3B")
    
  • Prerequisites: Python, transformers library (version 3.0.2 specifically required for FiD), datasets, evaluate. GPU is recommended for training and inference. Elasticsearch is needed for BM25 retrieval.
  • Setup: Requires downloading data (docprompting_data.zip) and generator models (docprompting_generator_models.zip).
  • Links: Huggingface Datasets, Huggingface Models
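For the sparse-retrieval path, the README relies on Elasticsearch, whose default ranking function is Okapi BM25. As an illustration of what that scoring computes (a self-contained sketch, not the project's retrieval code; the example query and documents are invented), BM25 can be implemented directly:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25,
    the ranking function Elasticsearch uses by default."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            f = tf[term]  # term frequency in this document
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

scores = bm25_scores("extract tar archive",
                     ["tar extracts archives from disk", "gzip compresses files"])
```

Elasticsearch handles tokenization, analysis, and inverted-index lookups on top of this formula; the sketch only shows the scoring itself.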

Highlighted Details

  • Official implementation for an ICLR 2023 Spotlight paper.
  • Introduces new NL-to-bash (tldr) and NL-to-Python (CoNaLa) benchmarks with unseen function splits.
  • Supports both dense (CodeT5) and sparse (BM25) retrieval methods.
  • Utilizes Fusion-in-Decoder (FiD) architecture for generation.
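The Fusion-in-Decoder idea in the last bullet can be sketched abstractly: each (intent, passage) pair is encoded independently, and the resulting encoder states are concatenated so a single decoder attends over all of them at once. The snippet below is a toy illustration of that data flow only; the deterministic `encode` function stands in for T5's encoder and is not part of the project.

```python
def encode(text: str, dim: int = 4) -> list[list[float]]:
    """Toy encoder: one deterministic vector per token (stand-in for T5's encoder)."""
    return [[float(len(tok))] * dim for tok in text.split()]

def fid_fuse(intent: str, passages: list[str], dim: int = 4) -> list[list[float]]:
    """FiD-style fusion: encode each (intent, passage) pair independently,
    then concatenate all encoder states for the decoder to attend over."""
    fused = []
    for passage in passages:
        fused.extend(encode(f"{intent} {passage}", dim))
    return fused

fused = fid_fuse("list files", ["ls lists directory contents", "cat prints files"])
```

Encoding passages separately keeps the encoder's cost linear in the number of retrieved documents, while the decoder still sees all of them jointly; this is why FiD scales to many retrieved docs.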

Maintenance & Community

  • The project is associated with authors Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig.
  • No direct links to community channels (e.g., Discord, Slack) or a public roadmap are provided in the README.

Licensing & Compatibility

  • The license is not explicitly stated in the provided README.
  • No specific compatibility notes for commercial use or closed-source linking are mentioned.

Limitations & Caveats

  • The FiD generation component has a strict dependency on transformers version 3.0.2, which may hinder reproducibility if not precisely matched.
  • The absence of a stated license prevents immediate assessment of commercial use or integration compatibility.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
