galactic by taylorai

Data cleaning/curation tool for unstructured text datasets

created 1 year ago
328 stars

Top 84.4% on sourcepulse

1 Expert Loves This Project
Project Summary

Galactic provides tools for cleaning and curating large unstructured text datasets. It targets users who are building fine-tuning datasets, assembling RAG collections, or deduplicating web-scale data for LLM pre-training. It offers familiar HuggingFace-like dataset methods alongside specialized text curation workflows, aiming to simplify data preparation and analysis.

How It Works

Galactic leverages a data processing pipeline that supports streaming, filtering, and deduplication. It integrates various techniques for data understanding, including token counting, language detection, PII scanning, and embedding generation (CPU-based or OpenAI). For advanced curation, it offers AI-driven labeling and classification, dimensionality reduction (PCA, UMAP, SVD) for embeddings, clustering (HDBSCAN), and semantic deduplication using cosine similarity.
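
To make the semantic deduplication step concrete, here is a minimal sketch of greedy near-duplicate removal using cosine similarity over embedding vectors. It uses plain NumPy and is not Galactic's own API: the function name `semantic_dedup`, the `threshold` parameter, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices of rows to keep, greedily dropping near-duplicates.

    Hypothetical helper for illustration; not part of Galactic.
    """
    # Normalize each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    for i in range(len(unit)):
        # Keep row i only if it is not too similar to any row kept so far.
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return kept


# Toy embeddings: the second vector is almost parallel to the first.
vectors = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
kept = semantic_dedup(vectors, threshold=0.95)
print(kept)
```

This pairwise comparison is quadratic in the worst case, so at web scale it would likely need to be combined with clustering or an approximate nearest-neighbor index.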

Quick Start & Requirements

  • Install from source: pip install git+https://github.com/taylorai/galactic.git
  • Prerequisites: Python. OpenAI API key required for OpenAI embeddings.
  • Links: API Reference, OpenHermes Example

Highlighted Details

  • AI data labeling and classifier distillation using OpenAI or local models.
  • Embedding generation on CPU (gte-small) or via OpenAI API.
  • Dimensionality reduction and visualization of embeddings with clustering.
  • Large-scale deduplication capabilities, including semantic deduplication.
  • Streaming data loading with on-the-fly filtering and deduplication.
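
The streaming idea in the last bullet can be sketched as a generator that filters and removes exact duplicates while records flow through, without loading the full dataset into memory. This is a generic illustration, not Galactic's implementation: `stream_clean`, the `keep` predicate, and the normalization rule (lowercasing and collapsing whitespace) are assumptions.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def stream_clean(
    records: Iterable[str],
    keep: Callable[[str], bool],
) -> Iterator[str]:
    """Yield records that pass `keep` and have not been seen before.

    Hypothetical helper for illustration; not part of Galactic.
    """
    seen: set[str] = set()
    for text in records:
        # Filter on the fly, before doing any deduplication work.
        if not keep(text):
            continue
        # Normalize case and whitespace so trivial variants hash identically.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


docs = ["Hello  world", "hello world", "hi", "Another document here"]
cleaned = list(stream_clean(docs, keep=lambda t: len(t.split()) >= 2))
print(cleaned)
```

Storing only hash digests keeps memory use proportional to the number of unique records rather than their total size.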

Maintenance & Community

This project is no longer actively maintained and has been removed from PyPI due to a trademark dispute. Future releases are not planned.

Licensing & Compatibility

  • License: Apache 2.0
  • Compatibility: Suitable for commercial use.

Limitations & Caveats

The project is archived, so no fixes or updates should be expected. Because the package was removed from PyPI, it can only be installed from source.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

1 star in the last 90 days

Explore Similar Projects

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

NeumAI by NeumTry

0%
858
Data platform for retrieval-augmented generation (RAG)
created 1 year ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

LightRAG by HKUDS

1.0%
19k
RAG framework for fast, simple retrieval-augmented generation
created 10 months ago
updated 20 hours ago