Discover and explore top open-source AI tools and projects—updated daily.
0xSeroAI coding assistant data extraction for ML training
Top 68.6% on SourcePulse
AI Coding Assistant Training Data Extraction Toolkit
This toolkit automates the extraction of complete conversation history, code context, and metadata from various AI coding assistants. It is designed for machine learning engineers and researchers seeking to compile comprehensive datasets for fine-tuning models, offering a unified format from disparate local AI tool data.
How It Works
The project employs a suite of Python scripts, each tailored to a specific AI coding assistant. These scripts auto-discover installations by searching common operating system-specific locations (e.g., ~/Library/Application Support, ~/.config). They parse diverse storage formats, including JSONL session files and SQLite databases, to capture detailed interaction data such as user messages, AI responses, code snippets, diffs, tool usage, and timestamps. This systematic approach ensures a consistent and complete data extraction process.
Quick Start & Requirements
python3 extract_claude_code.py or use the ./extract_all.sh script for comprehensive extraction.extracted_data/ directory as timestamped JSONL files.Highlighted Details
.vscdb, .db), and Tauri .dat files.messages array with rich context fields like code_context and suggested_diffs.Maintenance & Community
No specific details regarding contributors, sponsorships, or community channels (e.g., Discord, Slack) were found in the provided README.
Licensing & Compatibility
Limitations & Caveats
The toolkit may encounter limitations such as extracting partial data from incomplete or deleted sessions, or corrupted database entries. Users must be aware of potential privacy concerns, as proprietary code, API keys, secrets, and personal file paths may be extracted. A detect-secrets tool is recommended for scanning and sanitizing extracted data before use or sharing. Troubleshooting guidance is provided for common issues like installation discovery, database locking, and permission errors.
1 month ago
Inactive