MAP-NEO by multimodal-art-projection

Open-source LLM with pretraining data, pipeline, scripts, and alignment code

Created 1 year ago
973 stars

Top 37.9% on SourcePulse

Project Summary

MAP-NEO is a fully open-source series of large language models trained from scratch on 4.5T tokens. It makes the LLM training process transparent and aims for performance close to that of proprietary models in reasoning, math, and coding. It targets researchers and developers who want capable bilingual models with full access to the training data, pipelines, and code.

How It Works

MAP-NEO is trained from scratch on a 4.5T-token bilingual corpus built with a data processing pipeline called "Matrix." The project emphasizes full transparency by releasing the pretraining data, intermediate checkpoints, a custom tokenizer, and an optimized pretraining codebase. The goal is a comprehensive resource for understanding and replicating LLM training.

Quick Start & Requirements

  • Models are available on HuggingFace; see the project page: https://map-neo.github.io/
  • Requires significant computational resources for training/fine-tuning; inference requirements depend on model size.
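
Loading a released checkpoint could look roughly like the sketch below, which uses the standard Hugging Face transformers API. The repository ID `m-a-p/neo_7b` and the use of `trust_remote_code=True` are assumptions made for illustration and are not stated in this summary; confirm the exact model name and loading options on the project page. The helper names `load_model` and `generate` are hypothetical.

```python
# Sketch: load a MAP-NEO checkpoint with Hugging Face transformers.
# ASSUMPTION: MODEL_ID is illustrative; check the project page for the
# exact repository name before use.
MODEL_ID = "m-a-p/neo_7b"


def load_model(model_id: str = MODEL_ID):
    """Return (tokenizer, model). Imports are deferred so this file can be
    imported even when transformers is not installed."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",      # use the dtype stored in the checkpoint
        device_map="auto",       # needs the `accelerate` package installed
        trust_remote_code=True,  # ASSUMPTION: may not be required
    )
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation of `prompt` and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example use (downloads the weights on first call):
#   tokenizer, model = load_model()
#   print(generate(tokenizer, model, "The capital of France is"))
```

Deferring the transformers import keeps the sketch importable on machines without the library; the first call to `load_model` is what triggers the download.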

Highlighted Details

  • Performance comparable to LLaMA2 7B, outperforming peers in reasoning, mathematics, and coding.
  • Comprehensive release includes base models, intermediate checkpoints, and scaling law models (250M to 7B parameters).
  • Includes the "Matrix" data processing pipeline and pretraining scripts.
  • Trained on 4.5T English and Chinese tokens.
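
To put the figures above in perspective, the sketch below computes tokens per parameter and a rough training-compute estimate for the largest model. It assumes the nominal sizes quoted here (7B parameters, 4.5T tokens) and the common rule of thumb that training takes about 6 FLOPs per parameter per token; that rule is an outside approximation, not a number reported by the project.

```python
# Back-of-the-envelope training scale for the largest MAP-NEO model.
# N_PARAMS and N_TOKENS are the nominal figures from the summary; the
# 6 * N * D FLOPs rule is a general approximation (an assumption here).
N_PARAMS = 7e9     # nominal parameter count ("7B")
N_TOKENS = 4.5e12  # training tokens ("4.5T")


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute using the 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    print(f"tokens per parameter: {tokens_per_param(N_TOKENS, N_PARAMS):.0f}")
    print(f"approx. training FLOPs: {approx_training_flops(N_PARAMS, N_TOKENS):.2e}")
```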

Maintenance & Community

  • Active community support via Discord.
  • The project has a large author list, indicating broad academic involvement.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Commercial usage is permitted.

Limitations & Caveats

The README does not detail specific hardware requirements for running the models or the exact nature of the "Matrix" data processing pipeline beyond its name.
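
In the absence of published hardware requirements, a lower bound on inference memory can be estimated from the parameter count alone. The sketch below covers only the model weights; activations, the KV cache, and framework overhead come on top. The function name `weight_memory_gib` and the dtype table are hypothetical helpers, and the parameter counts are the nominal sizes quoted in this summary.

```python
# Rough lower bound on memory needed to hold model weights.
# This ignores activations, KV cache, and runtime overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(n_params: float, dtype: str = "bfloat16") -> float:
    """Return the size of the weights in GiB for a given storage dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Nominal sizes from the summary: 250M (smallest) and 7B (largest).
    for name, n in [("250M", 250e6), ("7B", 7e9)]:
        for dtype in ("float32", "bfloat16", "int8"):
            print(f"{name:>4} {dtype:>8}: {weight_memory_gib(n, dtype):6.2f} GiB")
```

Under these assumptions the 7B model needs about 13 GiB for 16-bit weights alone, so a GPU with more memory than that, or a quantized variant, would be needed in practice.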

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days
