Open-source LLM with pretraining data, pipeline, scripts, and alignment code
Top 38.2% on SourcePulse
MAP-NEO is a fully open-source Large Language Model series trained from scratch on 4.5T tokens, offering transparent LLM training and performance comparable to proprietary models in reasoning, math, and coding. It targets researchers and developers seeking high-capability bilingual models with full access to training data, pipelines, and code.
How It Works
MAP-NEO is trained from scratch on a 4.5T-token bilingual corpus processed through a data pipeline called "Matrix." The project emphasizes full transparency by releasing the pretraining data, intermediate checkpoints, a custom tokenizer, and an optimized pretraining codebase, aiming to provide a comprehensive resource for understanding and replicating LLM training end to end.
Quick Start & Requirements
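This page does not reproduce the project's setup instructions. As a rough orientation only, the following minimal sketch shows how a released checkpoint could be loaded with Hugging Face transformers; the repo id m-a-p/neo_7b is an assumption based on the project's naming, so consult the official README for the actual model names and hardware requirements.

```python
# Hypothetical quick-start sketch, not the project's documented workflow.
# Assumes checkpoints are published on Hugging Face under the "m-a-p" org;
# the repo id below is an assumption and may differ from the real release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Simple generation to sanity-check the model loads and responds.
inputs = tokenizer("Explain the quicksort algorithm:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the series includes intermediate checkpoints, the same loading pattern should apply to those revisions as well, subject to whatever naming scheme the maintainers use.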
Highlighted Details
Maintenance & Community
Last commit: 7 months ago; the repository is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The README does not detail specific hardware requirements for running the models or the exact nature of the "Matrix" data processing pipeline beyond its name.