Abstract
Hybrid Policy Distillation combines forward and reverse KL divergence to improve the stability and efficiency of knowledge distillation across model sizes and tasks.
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
Community
🧭 A unified view of policy distillation methods
⚡ Efficient one-hot-style distillation
🧩 A hybrid KL objective with a masking mechanism
🪶 Lightweight sampling under an offline-prefix setting
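The hybrid KL objective above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names and the scalar mixing weight `alpha` are hypothetical, and the paper's actual objective operates on per-token model logits with a masking mechanism. The sketch only shows the core idea of interpolating between forward KL (mode-covering) and reverse KL (mode-seeking) over a single token's distribution.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kl(teacher, student, alpha=0.5):
    """Convex combination of forward KL(teacher || student), which
    encourages mode coverage, and reverse KL(student || teacher), which
    is mode-seeking. `alpha` is a hypothetical mixing weight; HPD's
    exact weighting and masking scheme may differ.
    """
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)
```

At `alpha=1.0` the loss reduces to pure forward KL and at `alpha=0.0` to pure reverse KL, so a single knob spans the two distillation regimes the paper unifies.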