DS-1000 by xlang-ai

Benchmark for data science code generation

Created 3 years ago

267 stars

Top 96.1% on SourcePulse

2 Experts Love This Project

Eric Zhu

Coauthor of AutoGen; Research Scientist at Microsoft Research

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Project Summary

DS-1000 is a benchmark for evaluating code generation models on data science tasks. Each problem pairs a natural language prompt with execution-based checks, so scores reflect whether the generated code actually behaves correctly. It targets researchers and developers building AI assistants for data science and offers a standardized way to measure model performance across several libraries.

How It Works

The benchmark consists of 1,000 data science problems, each with a natural language prompt, an execution context, and evaluation logic. Generated code is run in a sandboxed environment against the problem's test-execution and string-validation functions. This ensures that a solution is not only syntactically valid but also produces the expected outputs for the given inputs and library states.
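
The two-stage check described above can be sketched as follows. This is a toy illustration, not the benchmark's own harness: the problem, the `TOY_CODE_CONTEXT` string, the `passes` helper, and the function names `test_execution` and `test_string` are assumptions chosen to mirror the "test execution" and "string validation" functions mentioned here.

```python
# Hypothetical evaluation logic for one toy problem (sort a list without a loop).
# Real DS-1000 records carry their own, more elaborate, evaluation code.
TOY_CODE_CONTEXT = '''
def test_execution(solution: str) -> None:
    # Run the candidate on a known input and compare with the expected output.
    env = {"x": [3, 1, 2]}
    exec(solution, env)
    assert env["result"] == [1, 2, 3]

def test_string(solution: str) -> None:
    # Surface-level check on the source text: no explicit loop allowed.
    assert "for " not in solution
'''


def passes(code_context: str, solution: str) -> bool:
    """Return True if the solution passes the execution check and, if present, the string check."""
    namespace = {}
    exec(code_context, namespace)  # defines the checking functions
    try:
        namespace["test_execution"](solution)
        if "test_string" in namespace:
            namespace["test_string"](solution)
    except Exception:
        # Any failed assertion or runtime error counts as a failed solution.
        return False
    return True


print(passes(TOY_CODE_CONTEXT, "result = sorted(x)"))
```

Note that `exec` on untrusted model output is unsafe outside an isolated environment; the sketch only shows the control flow of the two checks.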

Quick Start & Requirements

Create and activate the environment: conda env create -f environment.yml, then conda activate ds1000-3.10.
Install the additional dependencies: pip install datasets tqdm.
Load the dataset from Hugging Face with load_dataset("xlangai/DS-1000"), or from the local file data/ds1000.jsonl.gz.
Official project page: https://github.com/xlang-ai/DS-1000
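
For the local-file route, a minimal sketch of reading a gzip-compressed JSON Lines file with only the standard library is shown below. The helper name `load_jsonl_gz` is hypothetical, and the sketch assumes one JSON object per line; it makes no claim about which fields the real records contain.

```python
from __future__ import annotations

import gzip
import json
from pathlib import Path
from typing import Any


def load_jsonl_gz(path: str | Path) -> list[dict[str, Any]]:
    """Read one JSON object per non-empty line from a gzip-compressed file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the instructions above; requires a checkout of the repository.
    local_file = Path("data/ds1000.jsonl.gz")
    if local_file.exists():
        problems = load_jsonl_gz(local_file)
        print(f"loaded {len(problems)} problems")
```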

Highlighted Details

Evaluates models on seven popular data science libraries: Matplotlib, NumPy, Pandas, PyTorch, SciPy, Scikit-learn, and TensorFlow.
Simplified dataset format hosted on Hugging Face for improved usability.
Includes reference solutions and evaluation scripts for testing generated code.
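
Because the benchmark spans several libraries, results are often easier to read when broken down per library. The sketch below assumes you have already collected (library, passed) pairs from your own evaluation run; the function name `pass_rate_by_library` and the input shape are illustrative and are not part of the repository's scripts.

```python
from collections import defaultdict


def pass_rate_by_library(results):
    """Map each library name to the fraction of its problems that passed.

    `results` is an iterable of (library_name, passed) pairs.
    """
    totals = defaultdict(int)
    passed_counts = defaultdict(int)
    for library, passed in results:
        totals[library] += 1
        passed_counts[library] += int(bool(passed))
    return {library: passed_counts[library] / totals[library] for library in totals}


# Made-up outcomes, for illustration only.
example = [("Pandas", True), ("Pandas", False), ("NumPy", True)]
print(pass_rate_by_library(example))
```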

Maintenance & Community

The accompanying paper was published at ICML 2023, and the repository includes citation information for it.

Licensing & Compatibility

The repository does not explicitly state a license.

Limitations & Caveats

A small percentage of executions are stateful, requiring each problem to be run in an independent process. Minor inconsistencies with the original dataset may exist due to import handling. The dataset may contain a small number of errors inherent in human-labeled data.
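
One way to honor the independent-process requirement is to launch a fresh interpreter per problem, so that state left behind by one run cannot affect the next. The sketch below is an assumption about how this could be done with the standard library; `run_isolated` and its timeout default are hypothetical, and the repository's own scripts may take a different approach.

```python
import subprocess
import sys


def run_isolated(program: str, timeout_s: float = 30.0) -> bool:
    """Run `program` in a fresh Python process; True means it exited cleanly."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,  # keep the child's output off our stdout/stderr
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung solution is treated as a failure.
        return False
    return completed.returncode == 0


# Each call starts from a clean interpreter, so state does not carry over.
print(run_isolated("import builtins; builtins.leaked = 1"))
print(run_isolated("import builtins; assert not hasattr(builtins, 'leaked')"))
```

A separate process per problem costs some startup time, but it gives each run a clean interpreter and lets a timeout stop runaway code.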

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

1 star in the last 30 days