galactic by taylorai

Data cleaning/curation tool for unstructured text datasets

Created 2 years ago

329 stars

Top 83.4% on SourcePulse

2 Experts Love This Project

Jesse Clark

Cofounder of Marqo

Teknium

Cofounder of Nous Research

Project Summary

Galactic provides tools for cleaning and curating large unstructured text datasets. It targets users who are building fine-tuning datasets, assembling RAG collections, or deduplicating web-scale data for LLM pre-training. It pairs familiar HuggingFace-like dataset methods with specialized text curation workflows to simplify data preparation and analysis.

How It Works

Galactic is built around a data processing pipeline that supports streaming, filtering, and deduplication. To help users understand their data, it provides token counting, language detection, PII scanning, and embedding generation (on CPU or via the OpenAI API). For more advanced curation, it offers AI-driven labeling and classification, dimensionality reduction of embeddings (PCA, UMAP, SVD), clustering (HDBSCAN), and semantic deduplication based on cosine similarity.
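
To make the idea of semantic deduplication with cosine similarity concrete, here is a minimal sketch in plain Python. It is not Galactic's implementation: the function names `cosine_similarity` and `semantic_dedup`, the greedy "first occurrence wins" strategy, and the `threshold` value are all assumptions chosen for illustration, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, threshold=0.95):
    """Return indices of items to keep (hypothetical helper).

    An item is kept only if its similarity to every previously
    kept item is below `threshold`, so the first occurrence wins.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector is a near-duplicate of the first.
embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
]
kept = semantic_dedup(embeddings, threshold=0.95)
print(kept)  # [0, 2]
```

This pairwise loop is quadratic in the number of items, so it is only suitable for small samples; it is meant to show the comparison rule, not how a large-scale tool would organize the search.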

Quick Start & Requirements

Install from source: pip install git+https://github.com/taylorai/galactic.git
Prerequisites: Python. OpenAI API key required for OpenAI embeddings.
Links: API Reference, OpenHermes Example

Highlighted Details

AI data labeling and classifier distillation using OpenAI or local models.
Embedding generation on CPU (gte-small) or via OpenAI API.
Dimensionality reduction and visualization of embeddings with clustering.
Large-scale deduplication capabilities, including semantic deduplication.
Streaming data loading with on-the-fly filtering and deduplication.
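
The streaming item in the list above can be illustrated with a small generator that filters and deduplicates records as they arrive, without loading the whole dataset into memory. This is a sketch under stated assumptions, not Galactic's API: the function `stream_filter_dedup`, the `"text"` field name, the `min_chars` cutoff, and the exact-match hashing strategy are all hypothetical choices for illustration.

```python
import hashlib


def stream_filter_dedup(records, min_chars=20):
    """Lazily yield records, dropping short texts and exact duplicates.

    Hypothetical helper: each record is assumed to be a dict
    with a "text" field.
    """
    seen = set()
    for record in records:
        text = record["text"].strip()
        if len(text) < min_chars:
            continue  # filter: too short to be useful
        # Normalize case and whitespace so trivial variants collide.
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # dedup: already emitted an equivalent text
        seen.add(key)
        yield record


rows = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "too short"},
    {"text": "the quick  brown fox jumps over the lazy dog."},
    {"text": "A second, different sentence that is long enough."},
]
cleaned = list(stream_filter_dedup(iter(rows)))
print(len(cleaned))  # 2
```

Only the hashes of emitted texts are held in memory, which is what lets the filter run over a stream; the trade-off is that hashing catches exact (normalized) duplicates only, not paraphrases.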

Maintenance & Community

This project is no longer actively maintained and has been removed from PyPI due to a trademark dispute. Future releases are not planned.

Licensing & Compatibility

License: Apache 2.0
Compatibility: The Apache 2.0 license permits commercial use.

Limitations & Caveats

The project is archived, so users should not expect bug fixes or updates. Because the package was removed from PyPI over a trademark dispute, it must be installed from source.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

1 star in the last 30 days