Latest Reinforcement Learning Research Papers
The newest Reinforcement Learning papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Reinforcement Learning so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib · Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning · Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning · Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li et al. · arXiv · Jun 9, 2026
Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imita…
- An Agency-Transferring Model-Free Policy Enhancement Technique · Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko · arXiv · Jun 8, 2026
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal polic…
- Rethinking the Divergence Regularization in LLM RL · Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee et al. · arXiv · Jun 8, 2026
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential…
- RREDCoT: Segment-Level Reward Redistribution for Reasoning Models · Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter · arXiv · Jun 4, 2026
Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to …
- Drifting Preference Optimization for One-Step Generative Models · Zhou Jiang, Yandong Wen, Zhen Liu · arXiv · Jun 1, 2026
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denois…
- A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL · Lei Yang, Siyu Ding, Deyi Xiong · arXiv · Jun 1, 2026
Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades …
- Policy and World Modeling Co-Training for Language Agents · Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu et al. · arXiv · Jun 1, 2026
Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, …
- Fine-tuning Timeseries Predictors Using Reinforcement Learning · Hugo Cazaux, Ralph Rudd, Hlynur Stefansson, Sverrir Ólafsson et al. · Recent Applications in Deep Learning Book Chapter · Jun 1, 2026
This chapter presents three major reinforcement learning algorithms used for fine-tuning financial forecasters. We propose a clear implementation plan for backpropagating the loss of a reinforcement learning task to a model trained using su…
- Reasoning with Sampling: Cutting at Decision Points · Felix Zhou, Anay Mehrotra, Quanquan C. Liu · arXiv · May 28, 2026
Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power d…
- In-Context Reward Adaptation for Robust Preference Modeling · Zhenyu Sun, Zheng Xu, Ermin Wei · arXiv · May 28, 2026
Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model ofte…
- Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation · Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin · arXiv · May 27, 2026
A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like t…
- Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases · Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee · arXiv · May 26, 2026
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignm…
- Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders · Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao et al. · arXiv · May 26, 2026
Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model interna…
- BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning · Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan et al. · arXiv · May 26, 2026
Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value …
- It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty · Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown et al. · arXiv · May 26, 2026
Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesiz…
- FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation · Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li et al. · arXiv · May 26, 2026
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primaril…
- Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning · Zhaoyu Zhu, Rui Gao, Shuang Li · arXiv · May 25, 2026
Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditi…
- Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration · Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson et al. · arXiv · May 21, 2026
Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch be…
- Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning · Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza · arXiv · May 21, 2026
Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where ot…
- General Preference Reinforcement Learning · Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry et al. · arXiv · May 18, 2026
Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier …
- EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL · Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li et al. · arXiv · May 18, 2026
Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures impli…
- COOPO: Cyclic Offline-Online Policy Optimization Algorithm · Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu et al. · arXiv · May 18, 2026
Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online me…
- Self-Distilled Agentic Reinforcement Learning · Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang et al. · arXiv · May 14, 2026
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements…
- Learning from Language Feedback via Variational Policy Distillation · Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Rayhan Joty · arXiv · May 14, 2026
Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing l…
- AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward · Runhui Huang, Jie Wu, Rui Yang, Zhe Liu et al. · arXiv · May 12, 2026
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start st…
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He et al. · arXiv · May 12, 2026
In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running …
- Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning · Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng · arXiv · May 11, 2026
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills …
- Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis · Richie Yeung, Aleks Kissinger, Rob Cornish · arXiv · May 11, 2026
We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Cliffo…