ai-data-extraction  by 0xSero

AI coding assistant data extraction for ML training

Created 3 months ago
434 stars

Top 68.6% on SourcePulse

Project Summary

AI Coding Assistant Training Data Extraction Toolkit

This toolkit automates the extraction of complete conversation history, code context, and metadata from various AI coding assistants. It is designed for machine learning engineers and researchers who want to compile comprehensive datasets for fine-tuning models, and it converts the disparate local data stores of these tools into a single unified format.

How It Works

The project employs a suite of Python scripts, each tailored to a specific AI coding assistant. These scripts auto-discover installations by searching common operating system-specific locations (e.g., ~/Library/Application Support, ~/.config). They parse diverse storage formats, including JSONL session files and SQLite databases, to capture detailed interaction data such as user messages, AI responses, code snippets, diffs, tool usage, and timestamps. Applying the same discover-then-parse approach to every tool is intended to keep extraction consistent and as complete as the underlying data allows.
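The discover-then-parse flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the toolkit's actual code: the names discover, read_jsonl, list_tables, and CANDIDATE_ROOTS are hypothetical, and the real scripts may search more locations and read tool-specific tables.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical candidate roots, taken from the two examples in the text above.
# The toolkit's real search list may be longer and OS-dependent.
CANDIDATE_ROOTS = [
    Path.home() / "Library" / "Application Support",  # macOS
    Path.home() / ".config",                          # Linux
]


def discover(tool_dir_name, roots=CANDIDATE_ROOTS):
    """Return the existing directories named tool_dir_name under the given roots."""
    return [root / tool_dir_name for root in roots if (root / tool_dir_name).is_dir()]


def read_jsonl(path):
    """Yield one parsed object per non-empty line, skipping lines that fail to parse."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated or corrupted line


def list_tables(db_path):
    """List table names in a SQLite file, opened read-only so nothing is modified."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Opening the database through a read-only URI is one way to avoid writing to a file that the assistant itself may still have open.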

Quick Start & Requirements

  • Installation: No external dependencies are required; the scripts use only the Python 3.6+ standard library.
  • Primary install / run command: Run an individual script such as python3 extract_claude_code.py, or run ./extract_all.sh to extract from every supported tool.
  • Non-default prerequisites and dependencies: None beyond Python 3.6+.
  • Output: Extracted data is saved into an extracted_data/ directory as timestamped JSONL files.
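As a rough illustration of the output step, the sketch below writes a list of conversations to a timestamped JSONL file under extracted_data/. The helper name write_jsonl and the exact file-name pattern (tool name plus a YYYYMMDD_HHMMSS stamp) are assumptions made for this example; the summary does not specify how the toolkit names its files.

```python
import json
from datetime import datetime
from pathlib import Path


def write_jsonl(conversations, tool_name, out_dir="extracted_data", now=None):
    """Write one JSON object per line and return the path of the new file."""
    now = now or datetime.now()
    # Assumed naming scheme for illustration only: <tool>_<YYYYMMDD_HHMMSS>.jsonl
    file_name = "{}_{}.jsonl".format(tool_name, now.strftime("%Y%m%d_%H%M%S"))
    out_path = Path(out_dir) / file_name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as handle:
        for conversation in conversations:
            handle.write(json.dumps(conversation, ensure_ascii=False) + "\n")
    return out_path
```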

Highlighted Details

  • Supports extraction from: Claude Code, Codex, Cursor (all versions), Trae, Windsurf, Continue AI, Gemini CLI, and OpenCode.
  • Captures detailed data: User/AI messages, code context (file paths, snippets, line numbers), code diffs and edit histories, tool use and execution results, and metadata.
  • Handles multiple storage formats: JSONL, SQLite databases (.vscdb, .db), and Tauri .dat files.
  • Output format is JSONL, with each line representing a conversation containing a messages array with rich context fields like code_context and suggested_diffs.
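To make the output shape concrete, here is a small sketch that reads one output line and counts how many messages carry the context fields named above. Only messages, code_context, and suggested_diffs come from the description; the helper name summarize_conversation and every other key in the sample record are placeholders invented for the example.

```python
import json


def summarize_conversation(line):
    """Count messages in one JSONL record, and how many carry context fields."""
    record = json.loads(line)
    messages = record.get("messages", [])
    return {
        "messages": len(messages),
        "with_code_context": sum(1 for m in messages if m.get("code_context")),
        "with_suggested_diffs": sum(1 for m in messages if m.get("suggested_diffs")),
    }


# Placeholder record: "text" and the inner keys are illustrative, not the real schema.
sample_line = json.dumps({
    "messages": [
        {"text": "Rename this function", "code_context": [{"file": "app.py"}]},
        {"text": "Done", "suggested_diffs": ["--- a/app.py"]},
        {"text": "Thanks"},
    ]
})

print(summarize_conversation(sample_line))
```

Using .get() with defaults keeps the reader tolerant of records that lack a field, which matters when sessions are incomplete.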

Maintenance & Community

No specific details regarding contributors, sponsorships, or community channels (e.g., Discord, Slack) were found in the provided README.

Licensing & Compatibility

  • License type: MIT License.
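  • Compatibility notes: Compatible with Python 3.6+ on macOS, Linux, and Windows. The MIT license permits free use of the toolkit itself, including for building ML training datasets; it does not govern the content of the data you extract.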

Limitations & Caveats

The toolkit may extract only partial data from incomplete or deleted sessions, or from corrupted database entries. Extracted data can also raise privacy concerns, because proprietary code, API keys, secrets, and personal file paths may be captured. The detect-secrets tool is recommended for scanning and sanitizing extracted data before use or sharing. Troubleshooting guidance is provided for common issues such as installation discovery, database locking, and permission errors.
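To show the kind of sanitizing pass this caveat calls for, here is a deliberately simple regex-based sketch. It is not detect-secrets and does not use its API; the patterns, the SECRET_PATTERNS table, and the redact helper are illustrative assumptions, and a real scan should rely on a dedicated tool with far broader coverage.

```python
import re

# Two illustrative patterns only; real secret scanners ship many more detectors.
SECRET_PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Assignments such as api_key = "..." or token: ... with a value of 8+ characters.
    "generic_assignment": re.compile(
        r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]?([^\s'\"]{8,})"
    ),
}


def redact(text):
    """Replace matches with a placeholder and report (pattern name, count) pairs."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn("[REDACTED:" + name + "]", text)
        if count:
            findings.append((name, count))
    return text, findings
```

Running redact over each message's text before writing or sharing a dataset gives a quick first filter, but it will miss secrets that do not match these two shapes.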

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 255 stars in the last 30 days
