Curated list of LLM/VLM inference research papers with code
This repository is a curated list of research papers and code related to Large Language Model (LLM) and Vision-Language Model (VLM) inference. It serves as a comprehensive resource for researchers and engineers looking to optimize LLM/VLM inference performance, covering topics from attention mechanisms and quantization to parallelism and KV cache management.
How It Works
The project organizes papers by key LLM inference topics, providing links to their respective research papers and associated code repositories. It categorizes advancements in areas like FlashAttention, PagedAttention, quantization techniques (WINT8/4, FP8), parallelism strategies (Tensor Parallelism, Sequence Parallelism), KV cache optimization, and efficient decoding methods. This structured approach allows users to quickly find and explore state-of-the-art solutions for specific inference challenges.
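To give a concrete sense of one of the listed topics, below is a minimal sketch of per-channel symmetric INT8 weight quantization, the basic idea behind weight-only "WINT8" schemes. This example is illustrative only and is not taken from the repository or any linked paper; the function names and the numpy-based implementation are assumptions for demonstration.

```python
# Illustrative sketch (not from the repository): per-channel symmetric INT8
# weight quantization, the core idea behind weight-only "WINT8" schemes.
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Quantize a (out_features, in_features) weight matrix per output channel."""
    # One scale per output channel, chosen so the largest magnitude maps to 127.
    # A small floor avoids division by zero for all-zero rows.
    scales = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_weights_int8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix for use at inference time."""
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_weights_int8(w)
print("max abs error:", np.abs(dequantize_weights_int8(q, s) - w).max())
```

Production weight-only quantization kernels (e.g., those discussed in the linked papers) fuse the dequantization into the matrix-multiply kernel rather than materializing float weights as done here.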
Quick Start & Requirements
This repository is a collection of links and requires no installation or execution. Users can directly access the linked papers and code repositories for their specific needs.
Highlighted Details
Maintenance & Community
The repository is maintained by xlite-dev and contributors. It welcomes contributions via pull requests.
Licensing & Compatibility
The repository itself is licensed under the GNU General Public License v3.0. Individual linked papers and code repositories will have their own licenses, which users must adhere to.
Limitations & Caveats
As a curated list, the repository does not provide direct implementations or benchmarks. Users are responsible for evaluating the applicability and performance of the linked resources in their specific use cases. The "Recom" (recommendation) column uses a star rating, which is subjective.