Latest Knowledge Distillation Research Papers
The newest Knowledge Distillation papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Knowledge Distillation so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- The Role of Feedback Alignment in Self-Distillation
Semih Kara, Oğuzhan Ersoy · arXiv · Jun 9, 2026
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by…
- Itô maps for any-step SDEs
Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo et al. · arXiv · Jun 9, 2026
Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation proc…
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation
Mohammadreza Sadeghi, Sareh Soleimani, Zihan Wang, Narges Armanfard · arXiv · Jun 5, 2026
Unsupervised Continual Learning (UCL) aims to enable neural networks to learn sequential tasks without labels or access to past data. A major challenge in this setting is Catastrophic Forgetting, where models forget previously learned tasks…
- The Distillation Game: Adaptive Attacks & Efficient Defenses
Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri · arXiv · May 21, 2026
Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher …
- Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin et al. · arXiv · May 18, 2026
Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answe…
- Distilling Tabular Foundation Models for Structured Health Data
Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu et al. · arXiv · May 18, 2026
Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabul…
- Self-Distilled Agentic Reinforcement Learning
Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang et al. · arXiv · May 14, 2026
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements…
- Learning from Language Feedback via Variational Policy Distillation
Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Rayhan Joty · arXiv · May 14, 2026
Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing l…
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He et al. · arXiv · May 12, 2026
In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running …
- TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
Tom Sander, Hongyan Chang, Tomáš Souček, Tuan Tran et al. · arXiv · May 12, 2026
We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region locali…
- Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal et al. · arXiv · May 11, 2026
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, an…
- Compute Where it Counts: Self Optimizing Language Models
Yash Akhauri, Mohamed S. Abdelfattah · arXiv · May 11, 2026
Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice,…
- Normalizing Trajectory Models
Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot et al. · arXiv · May 8, 2026
Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, con…
- Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Mohamad Khajezade, Fatemeh H. Fard, Mohamed Sami Shehata · arXiv · May 4, 2026
Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic …
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation
Akshay Karjol, Darrin M. Hanna · arXiv · Apr 29, 2026
Deploying accurate object detection for Vulnerable Road User (VRU) safety on edge hardware requires balancing model capacity against computational constraints. Large models achieve high accuracy but fail under INT8 quantization required for…
- Random Cloud: Finding Minimal Neural Architectures Without Training
Javier Gil Blázquez · arXiv · Apr 29, 2026
I propose the Random Cloud method, a training-free approach to neural architecture search that discovers minimal feedforward network topologies through stochastic exploration and progressive structural reduction. Unlike post-training…
- PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
Zhiquan Tan, Yinrong Hong · arXiv · Apr 29, 2026
Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration…
- Aligning Dense Retrievers with LLM Utility via Distillation
Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid et al. · arXiv · Apr 24, 2026
Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior pe…
- VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang et al. · arXiv · Apr 21, 2026
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines…
- Multi-Hop Deep Joint Source-Channel Coding With Deep Hash Distillation for Semantically Aligned Image Recovery
D. E. BERGSTROM, Deniz Gündüz, Onur Günlü · OpenAlex · Apr 21, 2026
We consider image transmission via deep joint source-channel coding (DeepJSCC) over multi-hop additive white Gaussian noise (AWGN) channels by training a DeepJSCC encoder-decoder pair with a pre-trained deep hash distillation (DHD) module t…
- Thermal Desalination Technologies and Electromagnetic-Field-Assisted Approaches for Seawater Treatment: A Comprehensive Review
Noura Azzi, H. Labrim, Rachid El Bouayadi, Redouane Mghaiouini · Eng—Advances in Engineering · Apr 16, 2026
Seawater desalination has become a critical approach to mitigating the global scarcity of freshwater resources. This study aims to comprehensively review desalination methods based on thermal and electromagnetic methods, examining their pro…
- Language models transmit behavioural traits through hidden signals in data
Alex Cloud, Minh Hoang Le, James Chua, Jan Betley et al. · Nature · Apr 15, 2026
Here we show that distillation can lead to subliminal learning, the transmission of behavioural traits through semantically unrelated data. In our main experiments, a 'teacher' model with some trait T (such as disproportionately generating…
- Volatility Spillovers Between China’s Financial Markets and Strategic Metal Assets: Evidence from LLM Knowledge Distillation
Dian Sheng, Jining Wang, Lei Wang · Systems · Apr 7, 2026
This study employs a TVP-VAR-BK-DY framework to examine volatility spillovers between China’s financial markets and strategic metal assets. To capture retail investor sentiment, we construct a sentiment index using an LLM knowledge distilla…
- Process intensification of multi-stage heat pump assisted distillation with liquid injection
Chengtian Cui, Anton A. Kiss · Chemical Engineering and Pr... · Apr 3, 2026
• Liquid injection is proposed as an intensified strategy for multi-stage MVR distillation. • Four steady-state multi-stage MVR configurations are developed and compared. • Exchanger-based intercooling with heat recovery reduces compression…
- S^2-KD: Semantic-Spectral Knowledge Distillation Spatiotemporal Forecasting
Wenshuo Wang, Yaomin Shen, Yingjie Tan, Yihao Chen · Proceedings of the AAAI Con... · Mar 14, 2026
Spatiotemporal forecasting often relies on computationally intensive models to capture complex dynamics. Knowledge distillation (KD) has emerged as a key technique for creating lightweight student models, with recent advances like frequency…
- Attention-based and context-aware knowledge distillation for enhancing crop disease detection
Xiangyuan Zhu, Taotao Mao, Jianguo Chen, Feifan Peng et al. · Applied Soft Computing · Mar 10, 2026
- Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu et al. · ICLR 2026 Poster · Jan 26, 2026
Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowl…
- A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection
Yuxin Jiang, Yunkang Cao, Weiming Shen · arXiv.org · Dec 17, 2025
Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overly generalize, primarily due to the similarities between input and supervisory signals. …