dataclaw by peteromallet

Tool for transforming AI coding conversations into shareable datasets

Created 3 months ago

2,093 stars

Top 20.8% on SourcePulse

Project Summary

<2-3 sentences summarising what the project addresses and solves, the target audience, and the benefit.> DataClaw addresses the challenge of proprietary data policies hindering the sharing of AI coding collaboration history. It empowers users to convert their conversation logs from tools like Claude Code, Codex, and Gemini CLI into structured, privacy-redacted datasets. This enables users to reclaim ownership of their data and contribute to a growing, open-source repository of human-AI coding interactions.

How It Works

The project parses session logs, applying multiple layers of automated redaction including path anonymization, username hashing, regex-based secret detection, entropy analysis for high-entropy strings, email removal, and custom string/username filtering. Processed data, including messages, tool calls, and metadata, is structured into JSONL format. The core advantage lies in its robust, multi-stage privacy approach designed to prepare sensitive conversation data for public sharing on Hugging Face, fostering open collaboration.

Quick Start & Requirements

Primary install / run command: pip install dataclaw or git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install /tmp/dataclaw. Requires huggingface-cli login --token YOUR_TOKEN.
Non-default prerequisites: Python environment, Hugging Face account and token. No specific hardware (GPU/CUDA) requirements are mentioned.
Links: GitHub repo (implied), Hugging Face datasets tagged 'dataclaw'.

Highlighted Details

Supports export of conversation history from Claude Code, Codex, Gemini CLI, and OpenCode.
Exports detailed session data: user/assistant messages, extended thinking (optional), tool calls, token usage, model, and metadata.
Implements advanced privacy features: path anonymization, username hashing, secret detection (API keys, tokens, passwords), email redaction, and custom redactions.
Published datasets are automatically tagged dataclaw on Hugging Face for discoverability.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or active development signals are present in the provided README.

Licensing & Compatibility

License type: MIT.
Compatibility notes: The MIT license permits commercial use and integration into closed-source projects without copyleft restrictions.

Limitations & Caveats

Automated redaction is not foolproof and may miss certain service-specific identifiers, third-party PII, or secrets in non-standard formats. Manual review of exported data before publishing is strongly advised. Users can opt out of exact-name privacy scans if they decline sharing their full name.

dataclaw by peteromallet

Explore Similar Projects

she-love-me by 863401402

claude-council by hex

app by prem-research

kura by jxnl

awesome-human-distillation by mliu98

happier by happier-dev

ai-data-extraction by 0xSero

share by gityuanbao

anton by mindsdb

minutes by silverstein

claude-memory-compiler by coleam00

odysseus by pewdiepie-archdaemon