Awesome-LLM-Inference by xlite-dev

Curated list of LLM/VLM inference research papers with code

Created 2 years ago
4,533 stars

Top 10.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository is a curated list of research papers, with accompanying code, on Large Language Model (LLM) and Vision-Language Model (VLM) inference. It serves as a comprehensive resource for researchers and engineers looking to optimize inference performance, covering topics from attention mechanisms and quantization to parallelism and KV cache management.

How It Works

The project organizes entries by key LLM inference topic, linking each to its research paper and, where available, an associated code repository. It categorizes advances such as FlashAttention and PagedAttention, quantization techniques (WINT8/4, FP8), parallelism strategies (Tensor Parallelism, Sequence Parallelism), KV cache optimization, and efficient decoding methods. This structure lets users quickly find state-of-the-art solutions to specific inference challenges.
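To make the quantization category concrete, the sketch below shows weight-only INT8 (WINT8-style) quantization with per-output-channel absmax scaling, in plain Python/NumPy. It is illustrative only: the function names and the simple scheme are assumptions made for this summary, not code from the repository or any linked paper.

```python
# Illustrative weight-only INT8 quantization sketch (not from any linked repo).
# Weights are stored as int8 plus one float scale per output channel;
# activations stay in floating point and weights are dequantized at matmul time.
import numpy as np

def quantize_wint8(w: np.ndarray):
    """Quantize a 2D float weight matrix (out_features, in_features) to int8."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row absmax scale
    scales = np.where(scales == 0, 1.0, scales)            # guard all-zero rows
    w_int8 = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return w_int8, scales

def dequant_matmul(x: np.ndarray, w_int8: np.ndarray, scales: np.ndarray):
    """Compute y = x @ W^T, dequantizing the int8 weights on the fly."""
    return x @ (w_int8.astype(np.float32) * scales).T

# Usage: quantize once offline, reuse the compact int8 weights at inference time.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
x = rng.standard_normal((1, 4096)).astype(np.float32)
w_q, s = quantize_wint8(w)
err = np.abs(x @ w.T - dequant_matmul(x, w_q, s)).max()
print(f"int8 storage: {w_q.nbytes / w.nbytes:.0%} of fp32; max abs error: {err:.4f}")
```

The papers in the list's quantization sections refine this basic idea, for example by handling activation outliers or pushing weights down to 4 bits (WINT4) or FP8.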

Quick Start & Requirements

This repository is a collection of links and does not require installation or execution. Users can directly access the linked papers and code repositories for their specific needs.

Highlighted Details

  • Extensive coverage of recent advancements in LLM inference, including papers from 2024 and early 2025.
  • Detailed categorization of techniques such as quantization, attention mechanisms, KV cache optimization, and parallelism (see the KV cache sketch after this list).
  • Links to code repositories for many of the featured papers, facilitating practical implementation.
  • Inclusion of trending topics and specific model families like DeepSeek.
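As referenced above, the sketch below shows the KV cache idea in its simplest single-head form: during autoregressive decoding, each token's keys and values are computed once and appended to a cache, so every new token attends over the cached prefix instead of re-encoding it. All names here are assumptions made for illustration; this is not code from any linked paper.

```python
# Minimal single-head KV cache sketch (illustrative only).
# Per decode step: compute K/V for the new token once, append, then attend.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Growing store of past keys/values, shape (seq_len, d_head)."""
    def __init__(self, d_head: int):
        self.k = np.empty((0, d_head), dtype=np.float32)
        self.v = np.empty((0, d_head), dtype=np.float32)

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        self.k = np.concatenate([self.k, k_new[None]], axis=0)
        self.v = np.concatenate([self.v, v_new[None]], axis=0)

def decode_step(q: np.ndarray, cache: KVCache) -> np.ndarray:
    """Attention for one new query over all cached keys/values."""
    scores = cache.k @ q / np.sqrt(q.shape[-1])  # (seq_len,)
    return softmax(scores) @ cache.v             # (d_head,)

# Usage: one append + one attention call per generated token.
d = 64
rng = np.random.default_rng(0)
cache = KVCache(d)
for _ in range(5):  # five decode steps
    k, v, q = (rng.standard_normal(d).astype(np.float32) for _ in range(3))
    cache.append(k, v)
    out = decode_step(q, cache)
print("cached context length:", cache.k.shape[0], "| output dim:", out.shape[0])
```

The KV cache papers in the list optimize exactly this growing structure; PagedAttention, mentioned above, stores it in fixed-size blocks to reduce memory fragmentation.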

Maintenance & Community

The repository is maintained by xlite-dev and contributors. It welcomes contributions via pull requests.

Licensing & Compatibility

The repository itself is licensed under the GNU General Public License v3.0. Individual linked papers and code repositories will have their own licenses, which users must adhere to.

Limitations & Caveats

As a curated list, the repository provides no implementations or benchmarks of its own; users must evaluate the applicability and performance of the linked resources for their own use cases. Note also that the star ratings in the list's "Recom" (recommendation) column are subjective.

Health Check

Last Commit: 1 month ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 138 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462 stars
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k stars
Speculative decoding method for faster LLM inference
Created 1 year ago
Updated 1 week ago