Extra-CoT by Mwie1024

Extreme-ratio Chain-of-Thought compression for efficient LLM reasoning

Created 2 months ago
607 stars

Top 53.9% on SourcePulse

Project Summary

Extra-CoT introduces a novel three-stage framework for compressing Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) to extreme token budgets, targeting up to 80% reduction while preserving reasoning fidelity and achieving significant wall-clock speedups. It is designed for researchers and engineers seeking to deploy efficient LLM reasoning capabilities without sacrificing accuracy, enabling faster and more cost-effective inference.

How It Works

Extra-CoT employs a three-stage approach: Stage 1 (Compressor) generates high-fidelity compressed rationales by preserving critical information like formulas and anchors. Stage 2 (Mixed-ratio SFT) trains a single model to reliably follow multiple compression ratios, preventing "control collapse" at low budgets. Stage 3 (CHRPO) utilizes a hierarchical reinforcement learning algorithm to learn an adaptive policy, enabling the model to dynamically allocate tokens for ultra-low budgets. This method tackles the common failure mode of extreme CoT compression where symbolic consistency and controllability degrade.
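To make Stage 2 concrete, the sketch below shows one plausible way mixed-ratio SFT data could be constructed. The `<COMP_XX>` control-token names follow the convention documented elsewhere in this summary; the record layout, field names, and word-level truncation (standing in for the Stage 1 compressor) are illustrative assumptions, not the repository's actual format.

```python
# Hypothetical sketch of mixed-ratio SFT data construction (Stage 2).
# Control tokens like <COMP_20> follow the repo's documented <COMP_XX>
# convention; the record schema here is an assumption for illustration.

def make_sft_example(question, full_cot, answer, ratio):
    """Pair a question with a rationale truncated to `ratio` of its tokens.

    Word-level truncation is a stand-in for the Stage 1 compressor, which
    would instead preserve formulas and anchors at high fidelity.
    """
    words = full_cot.split()
    budget = max(1, int(round(len(words) * ratio)))
    compressed = " ".join(words[:budget])
    return {
        "instruction": f"<COMP_{int(round(ratio * 100)):02d}> {question}",
        "output": f"{compressed}\nAnswer: {answer}",
    }

# Mixing several ratios in a single dataset is what trains one model to
# follow any requested budget, countering "control collapse" at low ratios.
ratios = [0.2, 0.4, 0.6, 0.8]
examples = [
    make_sft_example("2+2?", "Add two and two to get four.", "4", r)
    for r in ratios
]
```

The key design point is that every example carries an explicit ratio token in its instruction, so the budget is a conditioning signal the model learns to obey rather than a post-hoc truncation.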

Quick Start & Requirements

This repository provides code for SFT (Supervised Fine-Tuning) and vLLM-based evaluation, along with a ratio-controlled inference interface.

  • SFT Training: Requires LLaMA-Factory. Navigate to the LLaMA-Factory directory and execute: FORCE_TORCHRUN=1 NNODES=1 NODE_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=12345 llamafactory-cli train examples/train_full/qwen3-1.7b_full_sft.yaml
  • Inference & Evaluation: Utilizes vLLM.
    1. Start vLLM server: vllm serve your_model_path --served-model-name local_core_model --host 0.0.0.0 --port 8000 --max-model-len 20000
    2. Run evaluation: python eval_all_ratios_vllm.py --host 127.0.0.1 --port 8000 --model local_core_model --output_dir outputs/qwen3-1.7b
  • Prerequisites: LLaMA-Factory, vLLM, Python. A GPU is effectively required for practical training and inference.
  • Links: Paper: arXiv:2602.08324
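Once the vLLM server from step 1 is running, it can be queried over vLLM's OpenAI-compatible `/v1/completions` endpoint. The sketch below assumes that ratio control works by prepending a `<COMP_XX>` token to the prompt, based on the interface described in this summary; the exact prompt format is an assumption, not confirmed by the repository.

```python
# Minimal sketch of ratio-controlled inference against the vLLM server
# started in step 1. Uses only the standard library; prepending a
# <COMP_XX> token to the prompt is an assumption about the interface.
import json
import urllib.request

def build_payload(prompt, ratio_token="<COMP_20>",
                  model="local_core_model", max_tokens=512):
    """Build an OpenAI-style completions payload with a ratio control token."""
    return {
        "model": model,
        "prompt": f"{ratio_token} {prompt}",
        "max_tokens": max_tokens,
    }

def query(prompt, ratio_token="<COMP_20>", host="127.0.0.1", port=8000):
    """Send the payload to vLLM's OpenAI-compatible completions endpoint."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=json.dumps(build_payload(prompt, ratio_token)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# query("What is 17 * 24?")  # requires the server from step 1 to be running
```

Swapping `ratio_token` for `<COMP_POLICY>` would, per the adaptive-policy mode described below, let the model allocate its own token budget instead of following a fixed ratio.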

Highlighted Details

  • Performance: Achieves extreme compression (e.g., 20% tokens) with high accuracy, such as 80.2% on GSM8K and 47.8% on MATH-500 with Qwen3-1.7B.
  • Latency Reduction: Demonstrates significant end-to-end latency improvements; for instance, GSM8K latency drops from 0.7298 s (Base) to 0.2254 s (Extra-CoT) at extreme compression, a roughly 3.2x speedup.
  • Adaptive Policy: The <COMP_POLICY> mode enables dynamic token allocation, achieving 85.8% accuracy on GSM8K at a realized compression ratio of only 0.24.
  • Features: Includes ratio-controlled inference via special tokens (<COMP_XX>, <COMP_POLICY>), a vLLM-based evaluation script, and LLaMA-Factory integration for SFT.

Maintenance & Community

No specific community channels (e.g., Discord, Slack), roadmap, or maintenance details are provided in the README. The project appears to be research-driven, with contributions from the authors of the associated paper.

Licensing & Compatibility

The project's license is not specified in the README. This omission is a significant blocker when evaluating commercial use or closed-source integration.

Limitations & Caveats

The repository primarily provides code for inference, evaluation, and SFT fine-tuning. It does not explicitly include the training code for the Stage 1 Compressor or the Stage 3 CHRPO policy. The absence of a specified license is a critical limitation for adoption. Setup requires familiarity with LLaMA-Factory and vLLM.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 406 stars in the last 30 days
