galactic by taylorai

Data cleaning/curation tool for unstructured text datasets

Created 2 years ago
328 stars

Top 83.2% on SourcePulse

Project Summary

Galactic provides tools for cleaning and curating large unstructured text datasets, targeting users building fine-tuning datasets, RAG collections, or deduplicating web-scale data for LLM pre-training. It offers familiar HuggingFace-like dataset methods alongside specialized text curation workflows, aiming to simplify data preparation and analysis.

How It Works

Galactic leverages a data processing pipeline that supports streaming, filtering, and deduplication. It integrates various techniques for data understanding, including token counting, language detection, PII scanning, and embedding generation (CPU-based or OpenAI). For advanced curation, it offers AI-driven labeling and classification, dimensionality reduction (PCA, UMAP, SVD) for embeddings, clustering (HDBSCAN), and semantic deduplication using cosine similarity.
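To make the semantic deduplication step concrete, here is a minimal sketch of threshold-based deduplication over embedding vectors using cosine similarity. It is not Galactic's implementation or API: the function names (cosine_similarity, semantic_dedup), the 0.95 threshold, and the toy two-dimensional vectors are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Greedy pass: keep an item only if it is below the similarity
    threshold against every item already kept. Returns kept indices."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy embeddings: item 1 points almost the same way as item 0.
embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
]
print(semantic_dedup(embeddings, threshold=0.95))  # [0, 2]
```

This greedy pass compares each item against every kept item, so it is quadratic in the worst case. At web scale one would typically cluster the embeddings first and deduplicate within each cluster, which fits the clustering step described above.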

Quick Start & Requirements

  • Install from source: pip install git+https://github.com/taylorai/galactic.git
  • Prerequisites: Python. OpenAI API key required for OpenAI embeddings.
  • Links: API Reference, OpenHermes Example

Highlighted Details

  • AI data labeling and classifier distillation using OpenAI or local models.
  • Embedding generation on CPU (gte-small) or via OpenAI API.
  • Dimensionality reduction and visualization of embeddings with clustering.
  • Large-scale deduplication capabilities, including semantic deduplication.
  • Streaming data loading with on-the-fly filtering and deduplication.
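The streaming bullet above can be illustrated with a small generator that filters and deduplicates records as they arrive, without loading the whole dataset into memory. This is a sketch under stated assumptions rather than Galactic's API: stream_filter_dedup, the keep predicate, and the whitespace and case normalization used to build the duplicate key are hypothetical choices.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Set


def stream_filter_dedup(
    records: Iterable[str], keep: Callable[[str], bool]
) -> Iterator[str]:
    """Yield records that pass `keep` and have not been seen before.

    Duplicates are detected by hashing a normalized form of the text
    (lowercased, whitespace collapsed), so only hashes are held in memory.
    """
    seen: Set[str] = set()
    for text in records:
        if not keep(text):
            continue
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield text


# Hypothetical stream: an iterator stands in for a streamed dataset.
docs = iter(["Hello world", "hello   WORLD", "hi", "A longer document"])
result = list(stream_filter_dedup(docs, keep=lambda t: len(t) >= 5))
print(result)  # ['Hello world', 'A longer document']
```

Exact-match hashing like this only catches duplicates that normalize to the same string; near-duplicates need a similarity-based method such as the semantic deduplication mentioned in the list.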

Maintenance & Community

This project is no longer actively maintained and has been removed from PyPI due to a trademark dispute. Future releases are not planned.

Licensing & Compatibility

  • License: Apache 2.0
  • Compatibility: Suitable for commercial use.

Limitations & Caveats

The repository is archived and will receive no further updates. Because the package was removed from PyPI following a trademark dispute, installation is from source, as listed under Quick Start.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), John Resig (Author of jQuery; Chief Software Architect at Khan Academy), and 9 more.

lilac by databricks

0.1%
1k
Data exploration tool for LLM dataset curation and quality control
Created 2 years ago
Updated 1 year ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jiaming Song (Chief Scientist at Luma AI), and 1 more.

Curator by NVIDIA-NeMo

1.3%
1k
Data curation toolkit for LLMs
Created 1 year ago
Updated 1 day ago
Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

datatrove by huggingface

0.9%
3k
Data processing library for large-scale text data
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Alex Atallah (Cofounder of OpenRouter), and 8 more.

cleanlab by cleanlab

0.2%
11k
Data-centric AI package for ML with messy data
Created 7 years ago
Updated 1 week ago