Data cleaning/curation tool for unstructured text datasets
Galactic provides tools for cleaning and curating large unstructured text datasets. It targets users who build fine-tuning datasets or RAG collections, or who deduplicate web-scale data for LLM pre-training. It offers familiar HuggingFace-like dataset methods alongside specialized text curation workflows, aiming to simplify data preparation and analysis.
How It Works
Galactic is built around a data processing pipeline that supports streaming, filtering, and deduplication. To help users understand their data, it integrates token counting, language detection, PII scanning, and embedding generation (computed on CPU or via OpenAI). For advanced curation, it offers AI-driven labeling and classification, dimensionality reduction (PCA, UMAP, SVD) for embeddings, clustering (HDBSCAN), and semantic deduplication using cosine similarity.
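The idea behind semantic deduplication with cosine similarity can be illustrated with a minimal, standard-library-only sketch. This is not Galactic's API: the function names `cosine_similarity` and `semantic_dedup`, the greedy keep-first strategy, the 0.95 threshold, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy near-duplicate removal (hypothetical helper).

    Keeps a record only if its similarity to every previously kept
    record is below `threshold`. Returns the indices of kept records.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (made up): rows 0 and 1 point in nearly the same
# direction, as do rows 2 and 3.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
]

kept_indices = semantic_dedup(toy_embeddings, threshold=0.95)
print(kept_indices)  # [0, 2]
```

This greedy pass compares each record against all kept records, so it scales quadratically in the worst case. Large datasets typically need an approximate nearest-neighbor index or clustering first to limit the number of comparisons.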
Quick Start & Requirements
pip install git+https://github.com/taylorai/galactic.git
Highlighted Details
Maintenance & Community
This project is no longer actively maintained and has been removed from PyPI due to a trademark dispute. Future releases are not planned.
Licensing & Compatibility
Limitations & Caveats
The repository is archived, so no fixes or updates should be expected. Because the package was removed from PyPI due to a trademark dispute, it can only be installed directly from the GitHub repository.