MAP-NEO by multimodal-art-projection

Open-source LLM with pretraining data, pipeline, scripts, and alignment code

created 1 year ago
950 stars

Top 39.5% on sourcepulse

View on GitHub
Project Summary

MAP-NEO is a fully open-source large language model series trained from scratch on 4.5T tokens, offering a transparent training process and performance approaching proprietary models in reasoning, math, and coding. It targets researchers and developers who want high-capability bilingual models with full access to the training data, pipeline, and code.

How It Works

MAP-NEO is trained from scratch on a 4.5T-token bilingual (English and Chinese) corpus prepared with a data processing pipeline called "Matrix." The project emphasizes full transparency by releasing the pretraining data, intermediate checkpoints, a custom tokenizer, and an optimized pretraining codebase, providing a comprehensive resource for understanding and replicating LLM training.
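
For a sense of what such a pipeline involves, here is a minimal sketch of a quality-filter and exact-deduplication pass of the kind pretraining pipelines like Matrix typically include. This is not the released Matrix code; the thresholds and heuristics below are invented for illustration.

```python
# Illustrative sketch of a filtering + exact-deduplication stage, of the
# kind a pretraining data pipeline such as Matrix typically performs.
# NOT the actual Matrix code; thresholds here are assumptions.
import hashlib

def passes_quality_filter(doc: str, min_chars: int = 200,
                          min_text_ratio: float = 0.6) -> bool:
    """Drop very short documents and documents dominated by markup/noise."""
    if len(doc) < min_chars:
        return False
    text_ratio = sum(c.isalnum() or c.isspace() for c in doc) / len(doc)
    return text_ratio >= min_text_ratio  # invented threshold

def exact_dedup_key(doc: str) -> str:
    """Hash of whitespace-normalized, lowercased text.
    (Real pipelines usually add near-dedup, e.g. MinHash.)"""
    normalized = " ".join(doc.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean_corpus(docs):
    """Yield documents that pass the filter and have not been seen before."""
    seen = set()
    for doc in docs:
        key = exact_dedup_key(doc)
        if key not in seen and passes_quality_filter(doc):
            seen.add(key)
            yield doc
```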

Quick Start & Requirements

  • Models are available on Hugging Face; see the project page for links: https://map-neo.github.io/ (a minimal loading sketch follows this list).
  • Requires significant computational resources for training/fine-tuning; inference requirements depend on model size.
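
As a quick sanity check, the released models should load with the standard Hugging Face transformers API. The checkpoint id "m-a-p/neo_7b" below is an assumption; confirm the exact model names on the project page before use.

```python
# Minimal inference sketch with Hugging Face transformers.
# "m-a-p/neo_7b" is an assumed checkpoint id; verify the exact name at
# https://map-neo.github.io/. device_map="auto" requires `accelerate`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumption: MAP-NEO 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Explain why the sum of two even numbers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```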

Highlighted Details

  • Performance comparable to LLaMA 2 7B, outperforming similarly sized open models in reasoning, mathematics, and coding.
  • Comprehensive release includes base models, intermediate checkpoints, and scaling law models (250M to 7B parameters).
  • Includes the "Matrix" data processing pipeline and pretraining scripts.
  • Trained on 4.5T English and Chinese tokens.

Maintenance & Community

  • Active community support via Discord.
  • Large author list, indicating broad academic involvement.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Commercial usage is permitted.

Limitations & Caveats

The README does not detail specific hardware requirements for running the models or the exact nature of the "Matrix" data processing pipeline beyond its name.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 21 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

  • Tiny pretraining project for a 1.1B Llama model
  • 9k stars · Top 0.3% on sourcepulse
  • created 1 year ago · updated 1 year ago