dataclaw  by peteromallet

Tool for transforming AI coding conversations into shareable datasets

Created 2 weeks ago

New!

1,921 stars

Top 22.3% on SourcePulse

GitHubView on GitHub
Project Summary

<2-3 sentences summarising what the project addresses and solves, the target audience, and the benefit.> DataClaw addresses the challenge of proprietary data policies hindering the sharing of AI coding collaboration history. It empowers users to convert their conversation logs from tools like Claude Code, Codex, and Gemini CLI into structured, privacy-redacted datasets. This enables users to reclaim ownership of their data and contribute to a growing, open-source repository of human-AI coding interactions.

How It Works

The project parses session logs, applying multiple layers of automated redaction including path anonymization, username hashing, regex-based secret detection, entropy analysis for high-entropy strings, email removal, and custom string/username filtering. Processed data, including messages, tool calls, and metadata, is structured into JSONL format. The core advantage lies in its robust, multi-stage privacy approach designed to prepare sensitive conversation data for public sharing on Hugging Face, fostering open collaboration.

Quick Start & Requirements

  • Primary install / run command: pip install dataclaw or git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install /tmp/dataclaw. Requires huggingface-cli login --token YOUR_TOKEN.
  • Non-default prerequisites: Python environment, Hugging Face account and token. No specific hardware (GPU/CUDA) requirements are mentioned.
  • Links: GitHub repo (implied), Hugging Face datasets tagged 'dataclaw'.

Highlighted Details

  • Supports export of conversation history from Claude Code, Codex, Gemini CLI, and OpenCode.
  • Exports detailed session data: user/assistant messages, extended thinking (optional), tool calls, token usage, model, and metadata.
  • Implements advanced privacy features: path anonymization, username hashing, secret detection (API keys, tokens, passwords), email redaction, and custom redactions.
  • Published datasets are automatically tagged dataclaw on Hugging Face for discoverability.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or active development signals are present in the provided README.

Licensing & Compatibility

  • License type: MIT.
  • Compatibility notes: The MIT license permits commercial use and integration into closed-source projects without copyleft restrictions.

Limitations & Caveats

Automated redaction is not foolproof and may miss certain service-specific identifiers, third-party PII, or secrets in non-standard formats. Manual review of exported data before publishing is strongly advised. Users can opt out of exact-name privacy scans if they decline sharing their full name.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
19
Issues (30d)
5
Star History
1,932 stars in the last 17 days

Explore Similar Projects

Feedback? Help us improve.