peteromallet
Tool for transforming AI coding conversations into shareable datasets
DataClaw addresses the challenge of proprietary data policies hindering the sharing of AI coding collaboration history. It lets users convert their conversation logs from tools like Claude Code, Codex, and Gemini CLI into structured, privacy-redacted datasets. This enables users to reclaim ownership of their data and contribute to a growing, open-source repository of human-AI coding interactions.
How It Works
The project parses session logs and applies several layers of automated redaction: path anonymization, username hashing, regex-based secret detection, entropy analysis to flag high-entropy strings, email removal, and custom string/username filtering. Processed data, including messages, tool calls, and metadata, is written out in JSONL format. Its main advantage is this multi-stage redaction, which is designed to prepare sensitive conversation data for public sharing on Hugging Face.
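The layered redaction described above can be illustrated with a minimal sketch. This is not DataClaw's actual code: every regex, placeholder string, function name (`redact`, `hash_username`, `to_jsonl_line`), and the entropy threshold of 4.0 bits per character are assumptions chosen for illustration, and the record layout is only a guess at what "messages and metadata in JSONL" might look like.

```python
import hashlib
import json
import math
import re
from collections import Counter

# All patterns, placeholders, and thresholds here are illustrative assumptions,
# not the rules DataClaw ships with.
HOME_RE = re.compile(r"(/(?:home|Users)/)([^/\s]+)")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\b(?:sk|ghp|hf)_[A-Za-z0-9]{16,}\b")
TOKEN_RE = re.compile(r"[A-Za-z0-9+=_-]{24,}")


def shannon_entropy(s):
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())


def hash_username(name):
    """Replace a username with a short, stable pseudonym."""
    return "user_" + hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]


def redact(text, custom_strings=(), entropy_threshold=4.0):
    # 1. Path anonymization with username hashing: /home/alice -> /home/user_<hash>
    text = HOME_RE.sub(lambda m: m.group(1) + hash_username(m.group(2)), text)
    # 2. Email removal
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    # 3. Regex-based detection of secrets with known prefixes
    text = SECRET_RE.sub("[REDACTED_SECRET]", text)
    # 4. Entropy analysis: long runs that look random are treated as secrets
    text = TOKEN_RE.sub(
        lambda m: "[REDACTED_TOKEN]"
        if shannon_entropy(m.group(0)) >= entropy_threshold
        else m.group(0),
        text,
    )
    # 5. Custom string/username filtering supplied by the user
    for s in custom_strings:
        if s:
            text = re.sub(re.escape(s), "[REDACTED]", text, flags=re.IGNORECASE)
    return text


def to_jsonl_line(messages, metadata, **redact_kwargs):
    """Serialize one session as a single JSONL line with redacted message content."""
    record = {
        "messages": [
            {"role": m["role"], "content": redact(m["content"], **redact_kwargs)}
            for m in messages
        ],
        "metadata": metadata,
    }
    return json.dumps(record, ensure_ascii=False)
```

The stages run from most specific to most general, so a known secret prefix gets its own placeholder before the entropy pass sees it. A threshold-based entropy check will produce both false positives and false negatives, which is one reason a manual review step is still needed.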
Quick Start & Requirements
Install with pip: pip install dataclaw
Or from source: git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install /tmp/dataclaw
Requires huggingface-cli login --token YOUR_TOKEN.
Highlighted Details
dataclaw on Hugging Face for discoverability.
Maintenance & Community
No specific details regarding maintainers, community channels (e.g., Discord, Slack), or active development signals are present in the provided README.
Licensing & Compatibility
Limitations & Caveats
Automated redaction is not foolproof and may miss certain service-specific identifiers, third-party PII, or secrets in non-standard formats. Manual review of exported data before publishing is strongly advised. Users who decline to share their full name can opt out of exact-name privacy scans.
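One way to support the manual review recommended above is to search the exported records for strings you already know are sensitive. The sketch below is a hypothetical helper, not part of DataClaw: the function names `iter_strings` and `find_leftovers` are invented here, and it assumes only that each export line is a JSON object.

```python
import json


def iter_strings(obj):
    """Yield every string value nested anywhere inside a decoded JSON object."""
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from iter_strings(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_strings(item)


def find_leftovers(jsonl_lines, needles):
    """Return (line_number, needle) pairs for needles still present in a record.

    Matching is case-insensitive; line numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        haystack = "\n".join(iter_strings(json.loads(line))).lower()
        for needle in needles:
            if needle.lower() in haystack:
                hits.append((lineno, needle))
    return hits
```

Passing an open file object as `jsonl_lines` works, since iterating a text file yields its lines. An empty result only means these particular strings were not found; it does not show that the export is free of other sensitive content.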