gpt4all-datalake  by nomic-ai

API for the GPT4All Datalake

created 2 years ago
398 stars

Top 73.7% on sourcepulse

Project Summary

This project provides an API for the GPT4All Datalake, an open-source initiative to ingest, organize, and store data contributions for GPT4All. It targets developers and researchers contributing data for LLM training, offering a structured way to manage and access these datasets.

How It Works

The datalake exposes a FastAPI HTTP API that ingests data conforming to a fixed JSON schema. On receipt, it runs integrity checks and converts the data into storage-efficient Arrow/Parquet files. These files are organized into one subdirectory per day and written to a target filesystem or S3. Because Parquet is a widely supported format, the stored data can be read from many programming languages.
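The validation and day-based layout described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the field names in REQUIRED_FIELDS, the helper names validate and partition_path, and the exact directory naming are all hypothetical, and the real service writes Parquet files rather than just computing paths.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Hypothetical schema: field name -> required Python type.
# The real datalake schema lives in the repository and may differ.
REQUIRED_FIELDS = {"source": str, "prompt": str, "response": str}


def validate(record: dict) -> dict:
    """Reject records with missing, extra, or wrongly typed fields."""
    missing = REQUIRED_FIELDS.keys() - record.keys()
    extra = record.keys() - REQUIRED_FIELDS.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return record


def partition_path(root: str, received_at: datetime, batch_id: str) -> PurePosixPath:
    """Build a day-based path of the form <root>/<YYYY-MM-DD>/<batch_id>.parquet."""
    # Normalize to UTC so a record's day does not depend on the server's time zone.
    day = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return PurePosixPath(root) / day / f"{batch_id}.parquet"
```

A fixed schema lets the service reject malformed contributions before they reach storage. Grouping files by day keeps each directory small and makes it cheap to snapshot or reprocess a single day.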

Quick Start & Requirements

  • Install/Run: Clone the repository and run make testenv to build Docker images and launch the HTTP server.
  • Prerequisites: Docker.
  • Documentation: API documentation available at http://localhost/docs.
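Once the server is running, the available routes can also be listed programmatically instead of being read off the docs page. The sketch below assumes the server keeps FastAPI's default schema route, /openapi.json, enabled next to /docs; the helper names list_endpoints and fetch_openapi are illustrative and not part of the project.

```python
import json
from urllib.request import urlopen


def list_endpoints(openapi: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI document."""
    http_methods = {"get", "post", "put", "patch", "delete"}
    pairs = []
    for path, operations in openapi.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method in http_methods:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_openapi(base_url: str = "http://localhost") -> dict:
    """Download the OpenAPI document from a running server."""
    # Assumption: FastAPI's default /openapi.json route has not been disabled.
    with urlopen(f"{base_url}/openapi.json", timeout=5) as resp:
        return json.load(resp)
```

With the test environment up, calling list_endpoints(fetch_openapi()) should return the routes the server actually exposes, which avoids guessing ingest paths.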

Highlighted Details

  • Takes automatic snapshots of the raw Parquet data.
  • Makes the data available as raw exports, Atlas maps, and curated downloads for LLM training.
  • Data submitted is public and used for training open-source LLMs.
  • Contributors can opt for attribution with unique identifiers or submit anonymously.
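The choice between attributed and anonymous submission can be pictured as an optional identifier attached to each record. The following is a minimal sketch under assumptions: the field name submitter_id and both helper functions are hypothetical, and the actual schema may represent attribution differently.

```python
import uuid
from typing import Optional


def make_submitter_id() -> str:
    """Generate an opaque identifier that does not reveal who the contributor is."""
    return uuid.uuid4().hex


def attach_attribution(record: dict, submitter_id: Optional[str] = None) -> dict:
    """Return a copy of the record, tagged with an identifier only if one is given."""
    tagged = dict(record)  # copy, so the caller's record is left untouched
    if submitter_id is not None:
        # "submitter_id" is a hypothetical field name used for illustration.
        tagged["submitter_id"] = submitter_id
    return tagged
```

A contributor who wants credit would reuse the same identifier across submissions. Leaving it out keeps the contribution anonymous, although the submitted content itself remains public either way.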

Maintenance & Community

  • Managed and paid for by Nomic AI.
  • Source code is released under the Apache-2.0 license.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Data released under the same attribution terms if self-hosted.

Limitations & Caveats

Data sent to the datalake is public and intended for LLM training, with no expectation of privacy for submitted content.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 90 days
