DS-1000 by xlang-ai

Benchmark for data science code generation

created 2 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

DS-1000 is a benchmark for evaluating code generation models on data science tasks. It pairs natural language prompts with execution-based checks, so a solution is judged by what it does when run rather than by how it reads. It targets researchers and developers building AI assistants for data science, and gives them a standardized way to measure model performance across several libraries.

How It Works

The benchmark consists of 1,000 data science problems, each with a natural language prompt, an execution context, and evaluation logic. Generated code is executed in a sandboxed environment against that evaluation logic, which includes test execution and string validation functions. This checks that a solution not only runs but also produces the expected outputs for the given inputs and library states.
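
To make the evaluation flow concrete, here is a minimal sketch of how one problem's evaluation logic might be applied to a generated solution. The toy context below is a hypothetical stand-in, not a real DS-1000 problem, and the function names `test_execution` and `test_string` are assumptions based on the description above. The sketch uses plain `exec` and is not sandboxed, so it is for illustration only.

```python
# Hypothetical stand-in for one problem's evaluation logic.
# A real problem would ship its own context; this one is invented.
TOY_CODE_CONTEXT = '''
def generate_test_case():
    return [3, 1, 2]

def test_execution(solution: str):
    # Run the candidate code against a generated input and check the output.
    env = {"data": generate_test_case()}
    exec(solution, env)
    assert env["result"] == [1, 2, 3]

def test_string(solution: str):
    # String validation: reject solutions that use a disallowed construct.
    assert "while" not in solution
'''


def passes(code_context: str, solution: str) -> bool:
    """Return True if the solution satisfies every check in the context."""
    namespace = {}
    exec(code_context, namespace)  # defines the check functions
    try:
        namespace["test_execution"](solution)
        if "test_string" in namespace:  # the string check is optional
            namespace["test_string"](solution)
    except Exception:
        return False
    return True
```

Calling `passes(TOY_CODE_CONTEXT, "result = sorted(data)")` exercises both checks. Untrusted model output should only be run this way inside an isolated environment.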

Quick Start & Requirements

  • Create the environment with conda env create -f environment.yml, then activate it with conda activate ds1000-3.10.
  • Install the additional dependencies with pip install datasets tqdm.
  • The dataset can be loaded from Hugging Face (load_dataset("xlangai/DS-1000")) or a local data/ds1000.jsonl.gz file.
  • Official project page: https://github.com/xlang-ai/DS-1000
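
As a rough illustration of the local loading path mentioned above, the sketch below reads a gzipped JSON Lines file using only the standard library. It assumes the file holds one JSON object per line, and the helper name `load_problems` is hypothetical. The Hugging Face route would instead go through load_dataset("xlangai/DS-1000") from the datasets package.

```python
import gzip
import json
from pathlib import Path


def load_problems(path):
    """Read one JSON object per non-empty line from a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Path taken from the quick start notes; adjust to your checkout.
    local_file = Path("data/ds1000.jsonl.gz")
    if local_file.exists():
        problems = load_problems(local_file)
        print(f"Loaded {len(problems)} problems")
```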

Highlighted Details

  • Evaluates models on 7 popular data science libraries: Matplotlib, NumPy, Pandas, PyTorch, SciPy, Scikit-learn, and TensorFlow.
  • Simplified dataset format hosted on Hugging Face for improved usability.
  • Includes reference solutions and evaluation scripts for testing generated code.
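
Because problems span several libraries, results are naturally reported per library. The following is a small sketch of that aggregation, assuming each evaluated problem has been reduced to a (library, passed) pair. The helper name and the input shape are hypothetical and not part of the project's scripts.

```python
from collections import defaultdict


def accuracy_by_library(outcomes):
    """Map each library name to the fraction of its problems that passed.

    `outcomes` is an iterable of (library, passed) pairs.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for library, ok in outcomes:
        totals[library] += 1
        passed[library] += int(ok)
    return {library: passed[library] / totals[library] for library in totals}
```

Averaging these per-library values and computing accuracy over all problems give different numbers when libraries have different problem counts, so it helps to state which one is being reported.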

Maintenance & Community

The project is associated with ICML 2023 and includes citation information for the original paper.

Licensing & Compatibility

The repository does not explicitly state a license.

Limitations & Caveats

A small percentage of executions are stateful, requiring each problem to be run in an independent process. Minor inconsistencies with the original dataset may exist due to import handling. The dataset may contain a small number of errors inherent in human-labeled data.
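
One way to satisfy the independent-process requirement is to launch a fresh interpreter per problem. The sketch below is a minimal version of that idea using the standard library. The helper name `run_isolated` and the 60-second default timeout are assumptions, not values taken from the project.

```python
import subprocess
import sys


def run_isolated(program: str, timeout_s: float = 60.0) -> bool:
    """Run a program in a fresh Python process and report whether it succeeded.

    Each call starts a new interpreter, so module-level state left behind by
    one problem cannot leak into the next.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung solution as a failure.
        return False
    return completed.returncode == 0
```

A separate process isolates interpreter state but is not a security boundary, so generated code still needs a sandbox around it.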

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
