MAP-NEO by multimodal-art-projection

Open-source LLM with pretraining data, pipeline, scripts, and alignment code

created 1 year ago
950 stars

Top 39.5% on sourcepulse

View on GitHub
Project Summary

MAP-NEO is a fully open-source large language model series trained from scratch on 4.5T tokens, offering a transparent training process and performance approaching proprietary models in reasoning, math, and coding. It targets researchers and developers who want high-capability bilingual models with full access to the training data, pipeline, and code.

How It Works

MAP-NEO is trained from scratch on a 4.5T-token bilingual (English and Chinese) corpus prepared with a data processing pipeline called "Matrix." The project emphasizes full transparency by releasing the pretraining data, intermediate checkpoints, a custom tokenizer, and an optimized pretraining codebase, providing a comprehensive resource for understanding and replicating LLM training.
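
For a sense of what such a pipeline involves, here is a minimal sketch of a quality-filter and exact-deduplication pass of the kind pretraining pipelines like Matrix typically include. This is not the released Matrix code; the thresholds and heuristics below are invented for illustration.

```python
# Illustrative sketch of a filtering + exact-deduplication stage, of the
# kind a pretraining data pipeline such as Matrix typically performs.
# NOT the actual Matrix code; thresholds here are assumptions.
import hashlib

def passes_quality_filter(doc: str, min_chars: int = 200,
                          min_text_ratio: float = 0.6) -> bool:
    """Drop very short documents and documents dominated by markup/noise."""
    if len(doc) < min_chars:
        return False
    text_ratio = sum(c.isalnum() or c.isspace() for c in doc) / len(doc)
    return text_ratio >= min_text_ratio  # invented threshold

def exact_dedup_key(doc: str) -> str:
    """Hash of whitespace-normalized, lowercased text.
    (Real pipelines usually add near-dedup, e.g. MinHash.)"""
    normalized = " ".join(doc.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean_corpus(docs):
    """Yield documents that pass the filter and have not been seen before."""
    seen = set()
    for doc in docs:
        key = exact_dedup_key(doc)
        if key not in seen and passes_quality_filter(doc):
            seen.add(key)
            yield doc
```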

Quick Start & Requirements

  • Models are available on Hugging Face; see the project page for links: https://map-neo.github.io/ (a minimal loading sketch follows this list).
  • Requires significant computational resources for training/fine-tuning; inference requirements depend on model size.
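
As a quick sanity check, the released models should load with the standard Hugging Face transformers API. The checkpoint id "m-a-p/neo_7b" below is an assumption; confirm the exact model names on the project page before use.

```python
# Minimal inference sketch with Hugging Face transformers.
# "m-a-p/neo_7b" is an assumed checkpoint id; verify the exact name at
# https://map-neo.github.io/. device_map="auto" requires `accelerate`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumption: MAP-NEO 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Explain why the sum of two even numbers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```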

Highlighted Details

  • Performance comparable to LLaMA 2 7B, outperforming similarly sized open models in reasoning, mathematics, and coding.
  • Comprehensive release includes base models, intermediate checkpoints, and scaling law models (250M to 7B parameters).
  • Includes the "Matrix" data processing pipeline and pretraining scripts.
  • Trained on 4.5T English and Chinese tokens.

Maintenance & Community

  • Active community support via Discord.
  • Large author list, indicating broad academic involvement.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Commercial usage is permitted.

Limitations & Caveats

The README does not detail specific hardware requirements for running the models or the exact nature of the "Matrix" data processing pipeline beyond its name.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 21 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

  • Tiny pretraining project for a 1.1B Llama model
  • 9k stars · Top 0.3% on sourcepulse
  • created 1 year ago · updated 1 year ago