aella-data-explorer  by context-labs

Visual explorer for scientific research papers

Created 2 months ago
710 stars

Top 48.3% on SourcePulse

Summary

This project is an interactive web application for exploring the Aella open science dataset, which comprises approximately 100 million scientific articles. Aimed at researchers and power users, it supports semantic exploration of scientific literature through embeddings, dimensionality reduction, and clustering.

How It Works

The application pairs a React/TypeScript frontend with a Python FastAPI backend. Data is stored in SQLite for local use and in Cloudflare D1/R2 in production. The core of the project is its data pipeline: scientific papers are embedded as 768-dimensional semantic vectors using SPECTER2, and these embeddings are reduced to 2D using UMAP with cosine distance, followed by K-Means clustering tuned automatically via silhouette scores. For interpretability, clusters carry LLM-curated, domain-specific labels, which go beyond basic TF-IDF analysis.
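
The clustering step can be illustrated with a minimal sketch. The project's actual pipeline code is not published, so everything here is an assumption: that "optimized via silhouette scores" means choosing the number of clusters k with the highest mean silhouette score, that clustering runs on the 2D points, and that the helper names (`farthest_point_init`, `kmeans`, `silhouette`, `pick_k`) are hypothetical. It uses only the Python standard library; a real pipeline would more likely rely on scikit-learn's `KMeans` and `silhouette_score`.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption, not the project's method):
    start at points[0], then repeatedly add the point farthest from
    every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm. Returns one cluster index per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette score over all points, in the range [-1, 1]."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def pick_k(points, candidates):
    """Run K-Means for each candidate k and keep the best-scoring one."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Calling `pick_k(points, range(2, 11))` on a list of `(x, y)` tuples returns the chosen k together with the score for every candidate, which makes it easy to inspect how sharply the silhouette score peaks.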

Quick Start & Requirements

Prerequisites include Python 3.11+, bun, and the Task runner. Install dependencies with task setup. Download the SQLite database using task db:setup. Run the backend with task backend:dev and the frontend with task frontend:dev in separate terminals. The live explorer is available at https://aella.inference.net.

Highlighted Details

  • Leverages SPECTER2 for generating 768-dimensional semantic embeddings.
  • Utilizes UMAP with cosine distance for 2D dimensionality reduction.
  • Applies K-Means clustering with automatic optimization based on silhouette scores.
  • Incorporates LLM-curated labels for enhanced scientific domain interpretability.
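
To make the "cosine distance" in the list above concrete, here is a small sketch of the metric itself, written for illustration rather than taken from the project. The function name `cosine_distance` is hypothetical. The point it shows is that cosine distance compares the direction of two embedding vectors and ignores their length.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v).

    0 means the vectors point the same way, 1 means they are orthogonal,
    and 2 means they point in opposite directions.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)
```

Because scaling a vector does not change the result, two papers whose embeddings differ only in magnitude are treated as identical under this metric.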

Maintenance & Community

This project is a collaboration between Inference.net and LAION, intentionally scoped as a one-time preview. Significant feature additions are not planned; users are encouraged to fork the repository for further development. Contributions for bug fixes and minor improvements are welcome via pull requests.

Licensing & Compatibility

The project is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The code for the data pipeline used to construct the dataset is not open-source. The project's scope is limited to a preview, and it is not intended for substantial feature expansion.

Health Check

Last commit: 2 months ago
Responsiveness: Inactive
Pull requests (30d): 0
Issues (30d): 1
Star history: 78 stars in the last 30 days

Explore Similar Projects

Starred by Dominik Moritz (Research Scientist at Apple; Professor at CMU) and Casey Caruso (Managing Partner of Topology Ventures).

latent-scope by enjalot

0.3%
747
Scientific tool for latent space investigation
Created 2 years ago
Updated 1 month ago
Starred by Tobi Lutke (Cofounder of Shopify), John Resig (Author of jQuery; Chief Software Architect at Khan Academy), and 9 more.

lilac by databricks

0%
1k
Data exploration tool for LLM dataset curation and quality control
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Gabriel Almeida (Cofounder of Langflow), and 5 more.

lit by PAIR-code

0.2%
4k
Interactive ML model analysis tool for understanding model behavior
Created 5 years ago
Updated 1 month ago