MAP-NEO by multimodal-art-projection

Open-source LLM with pretraining data, pipeline, scripts, and alignment code

Created 1 year ago
963 stars

Top 38.2% on SourcePulse

Project Summary

MAP-NEO is a fully open-source series of large language models trained from scratch on 4.5T tokens. It aims to make LLM training transparent while delivering performance in reasoning, math, and coding that approaches that of proprietary models. It targets researchers and developers who want capable bilingual models with full access to the training data, pipelines, and code.

How It Works

MAP-NEO is trained from scratch on a 4.5T-token bilingual corpus prepared with a data processing pipeline called "Matrix." The project emphasizes full transparency by releasing the pretraining data, intermediate checkpoints, a custom tokenizer, and an optimized pretraining codebase. The goal is a comprehensive resource for understanding and replicating LLM training.

Quick Start & Requirements

  • Models are available on Hugging Face; see the project page: https://map-neo.github.io/
  • Requires significant computational resources for training/fine-tuning; inference requirements depend on model size.
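Since inference requirements depend on model size, a rough lower bound on memory can be sketched as parameter count times bytes per parameter. The snippet below is an illustrative estimate only: the helper name `weight_memory_gib` and the dtype table are assumptions introduced here, not part of the MAP-NEO release, and the figure covers weights alone (KV cache, activations, and framework overhead come on top).

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Weights only; KV cache, activations, and runtime overhead are extra.

# Storage cost per parameter for common numeric formats (illustrative table).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # The smallest and largest sizes mentioned in the release (250M and 7B).
    for name, params in [("250M", 250e6), ("7B", 7e9)]:
        for dtype in BYTES_PER_PARAM:
            gib = weight_memory_gib(params, dtype)
            print(f"{name:>4} {dtype:>4}: {gib:6.2f} GiB")
```

Under these assumptions, a 7B-parameter model needs roughly 13 GiB for weights in bf16, so the real requirement on a given GPU will be somewhat higher once context length and batch size are accounted for.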

Highlighted Details

  • Performance comparable to LLaMA2 7B, outperforming peers in reasoning, mathematics, and coding.
  • Comprehensive release includes base models, intermediate checkpoints, and scaling law models (250M to 7B parameters).
  • Includes the "Matrix" data processing pipeline and pretraining scripts.
  • Trained on 4.5T English and Chinese tokens.
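To put the 7B-parameter, 4.5T-token figures in perspective, the training budget can be approximated with the widely used C ≈ 6·N·D rule of thumb (N parameters, D tokens). This is a sketch under stated assumptions: the rule comes from the general scaling-law literature rather than from the MAP-NEO README, the function names are hypothetical, and it assumes the 7B model saw the full 4.5T tokens.

```python
# Rough training-compute estimate using the common C ~= 6 * N * D heuristic.
# Assumption: the 7B model was trained on the full 4.5T-token corpus.


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * num_params * num_tokens


def tokens_per_param(num_params: float, num_tokens: float) -> float:
    """Ratio of training tokens to model parameters."""
    return num_tokens / num_params


if __name__ == "__main__":
    n_params = 7e9     # largest model size listed above
    n_tokens = 4.5e12  # corpus size listed above
    print(f"approx. training compute: {training_flops(n_params, n_tokens):.2e} FLOPs")
    print(f"tokens per parameter:     {tokens_per_param(n_params, n_tokens):.0f}")
```

With these inputs the estimate is on the order of 1.9e23 FLOPs at about 640 tokens per parameter, which gives a sense of why reproducing the full run requires substantial compute.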

Maintenance & Community

  • Active community support via Discord.
  • The project has a large author list, which suggests broad academic involvement.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Commercial usage is permitted.

Limitations & Caveats

The README does not detail specific hardware requirements for running the models or the exact nature of the "Matrix" data processing pipeline beyond its name.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days
