recsys-examples by NVIDIA

Optimized recommender system examples for accelerated training and inference

Created 1 year ago

290 stars

Top 90.6% on SourcePulse

Project Summary

This project provides recommender system examples optimized for NVIDIA accelerated infrastructure, with components that are easy to train and deploy for large-scale recommendation tasks. It targets researchers and engineers who need high-performance solutions for ranking, retrieval, and dynamic embedding management on NVIDIA GPUs.

How It Works

This project builds on TorchRec and NVIDIA's Megatron-Core for scalable training of HSTU (Hierarchical Sequential Transduction Unit) ranking and retrieval models, and for semantic-id based retrieval. Inference is optimized with a paged KV cache, Triton Inference Server integration, CUDA graphs, and C++ deployment via AOTInductor. DynamicEmb provides parallelized dynamic embedding tables, with zero-collision hashing, eviction policies, admission control, and table fusion for efficient parameter management.
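To make the DynamicEmb ideas above concrete, here is a minimal pure-Python sketch of a dynamic embedding table that combines admission control with LRU eviction. It is an illustration only: the class `ToyDynamicEmbedding`, its parameters (`capacity`, `dim`, `admit_after`), and the choice of a frequency-threshold admission rule are assumptions made for this example, not the DynamicEmb API or its actual policies. Keys are stored exactly in a dictionary, so two ids never share a row, which is the property zero-collision hashing aims for; a GPU hash table would achieve it very differently.

```python
from collections import OrderedDict
import random


class ToyDynamicEmbedding:
    """Toy dynamic embedding table: admission by frequency, eviction by LRU.

    Hypothetical illustration only; this is not the DynamicEmb API.
    """

    def __init__(self, capacity, dim, admit_after=2, seed=0):
        self.capacity = capacity        # maximum number of resident rows
        self.dim = dim                  # embedding width
        self.admit_after = admit_after  # sightings required before a key gets a row
        self.rows = OrderedDict()       # key -> vector, ordered oldest to newest use
        self.seen = {}                  # key -> sighting count for non-resident keys
        self.rng = random.Random(seed)

    def lookup(self, key):
        """Return the vector for `key`, or None if the key is not admitted yet."""
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return self.rows[key]

        # Admission control: count sightings and admit only frequent keys.
        count = self.seen.get(key, 0) + 1
        if count < self.admit_after:
            self.seen[key] = count
            return None
        self.seen.pop(key, None)

        # Eviction: when the table is full, drop the least recently used row.
        if len(self.rows) >= self.capacity:
            self.rows.popitem(last=False)

        vec = [self.rng.uniform(-0.01, 0.01) for _ in range(self.dim)]
        self.rows[key] = vec
        return vec
```

With `capacity=2` and `admit_after=2`, a key receives a row on its second lookup, and admitting a third key evicts whichever resident key was used least recently. Admission control keeps rare ids from consuming memory, while eviction bounds the table size as the id space grows.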

Quick Start & Requirements

Primary install/run: Not explicitly detailed, but examples are provided for HSTU training/inference, semantic-id retrieval, and Triton Inference Server integration.
Prerequisites: NVIDIA GPUs (the SM89 architecture is mentioned, and benchmarks are reported on B200), CUDA, TorchRec, and Megatron-Core.
Resource footprint: Not stated explicitly; likely substantial given the focus on large-scale model training and inference.
Relevant pages: HSTU inference overview, C++ inference guide, DynamicEmb documentation, HSTU training benchmark, E2E benchmark notes, HSTU inference benchmark, HSTU training setup, TritonServer for HSTU inference example, semantic‑id retrieval (sid_gr) documentation, releases page.

Highlighted Details

v26.03 (2026/4/14): Introduced Torch export and AOTInductor packaging for end-to-end HSTU C++ inference. Enhanced DynamicEmb with table fusion, relaxed alignment, and capacity sizing. Added HSTU end-to-end training benchmark suite and inference benchmarks on B200.
v26.01 (2026/2/13): Optimized HSTU KVCacheManager with a C++ implementation and compression support. Introduced workload-balanced batch shuffling for data parallel training. Added caching and prefetching for EmbeddingBagCollection.
v25.12 (2026/1/13): Added TritonServer support for HSTU inference and released the first semantic-id retrieval model example.
DynamicEmb: Features include admission control, table fusion, LRU score checkpointing, gradient clipping, distributed dumping, memory scaling, and cache support for hot embedding migration.
HSTU: Supports sequence parallelism, fbgemm_gpu_hstu migration, FP8 quantization, Tensor Parallelism, and pipeline execution.
Inference Optimizations: Leverages paged KV cache, CUDA graphs, Triton Inference Server, and C++ AOTInductor deployment.
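The paged KV cache listed above can be sketched as a small page allocator. The idea is that each sequence's cached keys and values live in fixed-size pages drawn from a shared pool, and a per-sequence page table maps logical token positions to physical pages, so memory is reserved in page-sized steps instead of one large contiguous buffer per sequence. The class `PagedKVCache` and its methods are hypothetical names for this illustration and do not reflect the project's KVCacheManager interface; only the bookkeeping is shown, not the tensors themselves.

```python
class PagedKVCache:
    """Toy page allocator for a paged KV cache (hypothetical illustration)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size                # tokens stored per page
        self.free_pages = list(range(num_pages))  # pool of unused physical pages
        self.page_table = {}                      # sequence id -> list of physical page ids
        self.lengths = {}                         # sequence id -> tokens cached so far

    def append(self, seq_id, num_tokens):
        """Reserve room for `num_tokens` more tokens; return the sequence's page list."""
        pages = self.page_table.get(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + num_tokens
        # Ceiling division gives the total pages required for `new_len` tokens.
        needed = -(-new_len // self.page_size) - len(pages)
        if needed > len(self.free_pages):
            raise MemoryError("KV cache is out of free pages")
        for _ in range(needed):
            pages.append(self.free_pages.pop())
        self.page_table[seq_id] = pages
        self.lengths[seq_id] = new_len
        return pages

    def locate(self, seq_id, token_idx):
        """Map a logical token position to (physical page, offset within page)."""
        if not 0 <= token_idx < self.lengths[seq_id]:
            raise IndexError("token index out of range")
        page = self.page_table[seq_id][token_idx // self.page_size]
        return page, token_idx % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

For example, with `page_size=16`, appending 20 tokens reserves two pages, appending 12 more fills the second page without allocating, and one further token triggers a third page. Releasing the sequence makes all of its pages available to other requests.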

Maintenance & Community

Frequent releases (e.g., v26.03, v26.01) indicate active development. Bug reports and feature requests are handled through GitHub Issues, and discussion also takes place on the NVIDIA Developer Forums. Resources include videos and blogs detailing optimization practices.

Licensing & Compatibility

This project is licensed under the Apache License 2.0, a permissive license that is compatible with commercial use and closed-source linking.

Limitations & Caveats

The project targets NVIDIA hardware, which suggests a strong dependency on specific GPU architectures and CUDA versions. The overview does not give detailed installation steps or performance benchmarks for every configuration, so setup is likely to be complex. The repository is a collection of examples, so users must integrate the components into their own workflows.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

13 stars in the last 30 days