OpenOneRec by Kuaishou-OneRec

Generative recommendation framework and benchmark

Created 1 week ago

365 stars

Top 77.1% on SourcePulse

Project Summary

OpenOneRec addresses the limitations of traditional recommendation systems by offering an open-source framework that unifies foundation models and a comprehensive benchmark (RecIF-Bench) for generative recommendation. It targets researchers and engineers seeking to bridge Large Language Models (LLMs) with recommendation tasks, providing a reproducible pipeline and SOTA models to accelerate development.

How It Works

The framework reframes recommendation as sequence modeling, treating items as distinct modalities via "Itemic Tokens" derived from hierarchical vector quantization. This allows LLMs to process interaction history cohesively. A multi-stage training pipeline integrates collaborative signals through Itemic-Text Alignment and co-pretraining, followed by supervised fine-tuning, on-policy distillation, and reinforcement learning for enhanced recommendation capabilities.
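The hierarchical vector quantization behind "Itemic Tokens" can be illustrated with a minimal residual-quantization sketch. This is an assumption-laden toy, not the repo's actual tokenizer: codebook sizes, dimensions, and the greedy nearest-codeword rule are illustrative only.

```python
import numpy as np

def quantize_item(embedding, codebooks):
    """Greedy residual quantization: each level encodes the residual
    left by the previous level, yielding a coarse-to-fine token tuple."""
    tokens = []
    residual = embedding.astype(np.float64)
    for codebook in codebooks:  # one codebook per hierarchy level
        # pick the codeword closest to the current residual
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]  # pass the remainder down a level
    return tokens

# Toy setup: 3 hierarchy levels, 16 codewords each, 8-dim item embeddings.
rng = np.random.default_rng(0)
dim, levels, codebook_size = 8, 3, 16
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(levels)]
item_embedding = rng.normal(size=dim)
tokens = quantize_item(item_embedding, codebooks)  # one token id per level
```

Each item thus maps to a short tuple of discrete ids that an LLM can consume like ordinary vocabulary tokens, with earlier levels capturing coarser semantics.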

Quick Start & Requirements

Models can be loaded using transformers>=4.51.0. The provided Python code demonstrates loading models and tokenizers, preparing inputs with itemic tokens, and generating text completions. Setting device_map="auto" places model weights on available GPUs, so GPU acceleration is assumed. Detailed usage instructions and the full code release are still pending.
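Pending the official usage docs, input preparation with itemic tokens might look like the following sketch. The <|item_k|> token format, the prompt layout, and the helper name are assumptions for illustration, not the released API:

```python
# Hypothetical sketch: interleave itemic tokens with a text instruction
# before handing the string to the model's tokenizer.
def build_prompt(instruction, item_token_ids):
    """Render an interaction history as a span of itemic tokens
    followed by a generation cue (format is assumed, not official)."""
    itemic_span = "".join(f"<|item_{i}|>" for i in item_token_ids)
    return f"{instruction}\nHistory: {itemic_span}\nNext item:"

prompt = build_prompt("Recommend the next video.", [512, 87, 2041])
```

The resulting string would then be tokenized and passed to model.generate() as with any causal LM in transformers.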

Highlighted Details

  • RecIF-Bench: The first holistic Recommendation Instruction-Following Benchmark, featuring 100M interactions across Short Video, Ads, and Product domains, structured into 8 tasks across a 4-layer capability hierarchy (Semantic Alignment to Reasoning).
  • Foundation Models: A family of 1.7B and 8B parameter models based on Qwen3, including Standard versions trained on open data and Pro versions enhanced with a large industrial corpus.
  • Performance: Achieves State-of-the-Art (SOTA) on RecIF-Bench tasks and demonstrates strong zero-shot/few-shot cross-domain transferability on the Amazon Benchmark, outperforming baselines by an average of 26.8% in Recall@10.
  • Full-Stack Pipeline: Open-sourced training pipeline for data processing, co-pretraining, and post-training, promoting reproducibility and scaling law research.
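For reference, the Recall@10 figure cited above is typically computed per user as the fraction of held-out relevant items that appear among the top-10 recommendations, then averaged across users. A minimal sketch (the exact evaluation protocol used by the benchmark may differ):

```python
def recall_at_k(recommended, relevant, k=10):
    """Fraction of relevant items found in the top-k recommendations."""
    top_k = set(recommended[:k])
    return len(top_k & set(relevant)) / len(relevant)

# One user: 2 of 3 held-out items appear in the top-10 list.
recs = [3, 17, 8, 42, 5, 99, 23, 61, 7, 12]
held_out = [17, 42, 100]
score = recall_at_k(recs, held_out)  # → 2/3
```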

Maintenance & Community

The project roadmap includes developing general-domain data scripts, reproducible environments (Docker/Apptainer), streamlined training recipes, improved documentation, and support for more model sizes. Contributions are welcomed. Community links (e.g., Discord, Slack) are not specified.

Licensing & Compatibility

The code is licensed under Apache 2.0, which is permissive for commercial use. However, the model weights are subject to separate, unspecified license agreements, requiring careful review for compatibility with closed-source applications.

Limitations & Caveats

Full code release and detailed usage instructions are explicitly stated as "coming soon." The specific licenses for model weights are not detailed, potentially posing adoption blockers. Key features like reproducible environments and one-click reproduction are still under development, indicating an early-stage project.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 7
  • Star History: 366 stars in the last 13 days
