Awesome-Multimodal-Next-Token-Prediction by LMM101

Survey of next token prediction for multimodal intelligence

created 8 months ago
446 stars

Top 68.4% on sourcepulse

Project Summary

This repository serves as a comprehensive survey of Next Token Prediction (NTP) techniques applied to multimodal intelligence, covering advancements in vision and audio processing. It's a valuable resource for researchers and practitioners exploring the integration of language models with other modalities for enhanced understanding and generation tasks.

How It Works

The survey categorizes and details various approaches to multimodal NTP, focusing on tokenization strategies for vision and audio, state-of-the-art multimodal models, and prompt engineering techniques like In-Context Learning (ICL) and Chain-of-Thought (CoT). It highlights how NTP has become a versatile objective for tasks ranging from image captioning to speech synthesis.
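As a rough illustration of the objective the survey is organized around, the sketch below computes a next-token prediction loss over a single sequence that mixes text tokens and discrete image codes. It is not taken from the survey or any linked repository: the vocabulary sizes, the offset scheme in `image_code_to_token`, and the uniform logits standing in for a model's output are all assumptions chosen to keep the example small.

```python
import math

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 4      # number of text tokens
IMAGE_CODES = 4     # number of entries in a visual codebook
VOCAB = TEXT_VOCAB + IMAGE_CODES  # shared vocabulary over both modalities


def image_code_to_token(code):
    """Place a visual codebook index after the text ids in the shared vocabulary."""
    return TEXT_VOCAB + code


def log_softmax(logits):
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(step_logits, tokens):
    """Mean negative log-likelihood of tokens[t + 1] given the logits at step t."""
    assert len(step_logits) == len(tokens) - 1
    total = 0.0
    for logits, target in zip(step_logits, tokens[1:]):
        total -= log_softmax(logits)[target]
    return total / len(step_logits)


# Two text tokens followed by two image codes, all in one sequence.
sequence = [0, 2] + [image_code_to_token(c) for c in (1, 3)]

# Stand-in for model output: uniform logits at every step (an assumption).
uniform_logits = [[0.0] * VOCAB for _ in range(len(sequence) - 1)]

loss = next_token_loss(uniform_logits, sequence)
print(sequence, round(loss, 4))
```

With uniform logits every target has probability 1/8, so the loss equals the natural log of the shared vocabulary size. A trained model would replace `uniform_logits` with its own predictions, and the same loss would apply whether the next token is text or an image code.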

Quick Start & Requirements

This repository is a curated collection of papers and associated code repositories. There are no direct installation or execution commands for the repository itself. Users are directed to individual linked repositories for specific setup instructions and dependencies, which will vary by project.

Highlighted Details

  • Comprehensive survey of multimodal Next Token Prediction (NTP).
  • Covers tokenization, models, and prompt engineering for vision and audio.
  • Includes links to numerous research papers and their corresponding GitHub repositories.
  • Features recent advancements up to late 2024.

Maintenance & Community

The survey was released on arXiv and GitHub on December 30, 2024. The authors encourage pull requests for seasonal updates to include the latest research.

Licensing & Compatibility

The repository itself does not specify a license. Individual linked repositories will have their own licenses, which may include restrictions on commercial use or linking with closed-source software.

Limitations & Caveats

As a survey, this repository does not provide executable code or models directly. Users must refer to the linked external repositories for implementation details and potential compatibility issues. Because research in this area moves quickly, the survey may not cover work published after its release.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 19 stars in the last 90 days
