Latest Reinforcement Learning Research Papers

The newest Reinforcement Learning papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Reinforcement Learning so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Motion Planning in Urban Environments via Self-Play Reinforcement Learning
Islem Kobbi, Fawzi Nashashibi, Tiago Rocha Gonçalves · HAL (Le Centre pour la Comm... · Sep 15, 2026
When to Parallelize Stochastic Exploration of Rare Rewards in Reinforcement Learning
Ernesto García, Daniel Mastropietro, Paola Bermolen, Matthieu Jonckheere · HAL (Le Centre pour la Comm... · Aug 31, 2026
Nicotine Aerosol Self-Administration and Reward Potentiation: Using a Preclinical Model of Vaping to Investigate the Stimulus-Enhancing Properties of Nicotine
Kiernan T. Callister · Digital Commons - USU (Utah... · Aug 1, 2026
Tobacco use is the leading cause of preventable disease and death in the United States and worldwide. Nicotine, the primary component of tobacco, is responsible for the addiction to tobacco and nicotine products, such as Electronic Nicotine…
Compact Latent Coordination for Autonomous Vehicles at Unsignalized Intersections
Gil Lifshits, Igal Bilik, Gilad Katz · arXiv · Jul 23, 2026
Coordinating autonomous vehicles at unsignalized intersections remains a critical challenge for multi-agent reinforcement learning (MARL) systems, which typically struggle with combinatorial action spaces, reliance on privileged information…
Towards Miniature Humanoid Tele-Loco-Manipulation Using Virtual Reality and Reinforcement Learning
Nicolas Kosanovic, Jordan Dowdy, Jean Chagas Vaz · arXiv · Jul 22, 2026
Full-sized humanoid robot capabilities have grown exponentially in recent years, aiming towards general-purpose deployment in human environments. A popular control method used by manufacturers utilizes Virtual Reality for upper-body teleope…
ISO: An RLVR-Native Optimization Stack
Hanqing Zhu, Wenyan Cong, Zhizhou Sha, Sagnik Mukherjee et al. · arXiv · Jul 21, 2026
Reinforcement learning with verifiable rewards (RLVR) is rapidly advancing the reasoning capabilities of language models, yet the optimization layer that converts reward feedback into weight-space updates remains poorly understood. Building…
Off-Context GRPO: Learning to Reason on Hard Problems using Privileged Information
Priyank Agrawal, Ankur Samanta, Shervin Ghasemlou, Jalaj Bhandari et al. · arXiv · Jul 21, 2026
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models. Yet, typical RLVR approaches fail on difficult problems: when a model cannot generate any correct solutions, it receives zero learni…
A Reinforcement-Learning-Augmented Liquid-Fueled Reactor Network Model for Predicting Lean Blowout in Gas Turbine Combustors
Philip John, Eloghosa Ikponmwoba, Pinaki Pal, Opeoluwa Owoyele · arXiv · Jul 21, 2026
This study introduces a reinforcement learning (RL) framework for generating optimal liquid-fueled reactors to improve lean blowout (LBO) predictions in gas turbine combustors. Existing approaches for determining cluster boundaries rely on …
MeanFlowNFT: Bringing Forward-Process RL to Average-Velocity Generators
Yushi Huang, Xiangxin Zhou, Jun Zhang, Liefeng Bo et al. · arXiv · Jul 16, 2026
MeanFlow generators achieve fast few-step sampling by predicting average velocities over time intervals, making them attractive for efficient generation. Reinforcement learning (RL) has become a powerful way to align diffusion and flow mode…
Mask-Aware Policy Gradients for Diffusion Language Models
Haran Raajesh, Kulin Shah, Adam Klivans, Philipp Krähenbühl · arXiv · Jul 16, 2026
Reinforcement learning has proven effective for improving reasoning in large language models, but extending it to Masked Diffusion Language Models (MDLMs) remains challenging due to the intractability of the log-likelihood estimation. Exist…
On-Policy Delta Distillation
Byeongho Heo, Jaehui Hwang, Sangdoo Yun, Dongyoon Han · arXiv · Jul 16, 2026
On-policy distillation is an alternative post-training method in reinforcement learning that alleviates the constraints imposed by reward models by providing token-level supervision from a teacher model. Although on-policy distillation has …
Concept-Guided Spatial Regularization for World Models in Atari Pong
Yukuan Lu, Zaishuo Xia, Weyl Lu, Yubei Chen · arXiv · Jul 16, 2026
World models are usually evaluated as components of model-based reinforcement learning (MBRL) systems, while the world models themselves are rarely studied in isolation. We examine five representative visual world-model agents in Atari Pong…
Evaluating covariate balance for long time horizon Markov decision processes
Joshua Spear, Rebecca Pope, Neil J Sebire · arXiv · Jul 16, 2026
This article explores the application of covariate balance diagnostics for detecting the presence of hidden confounding/model misspecification in studies applying offline reinforcement learning (RL) to deriving optimal treatment recommend…
TerraZero: Procedural Driving Simulation for Zero-Demonstration Self-Play at Scale
Zhouchonghao Wu, Akshay Rangesh, Weixin Li, Wei-Jer Chang et al. · arXiv · Jul 14, 2026
Training robust autonomous driving agents requires a simulator that is fast enough for reinforcement learning at scale, realistic enough to ground behavior in real-world map structure, and diverse enough to cover the safety-critical long ta…
Verifier-Based Reinforcement Fine-Tuning of Reasoning Models for Thermal Energy Storage Control
Takumi Shioda, Kohei Terashima, Tatsuo Nagai · arXiv · Jul 14, 2026
Buildings are expected to shift cooling loads in response to grid conditions. Thermal energy storage (TES) enables this shift, but scheduling it well requires planning hours ahead under storage constraints. Model predictive control (MPC) an…
Directional Constraints for Efficient Exploration in Safe Reinforcement Learning
Paolo Magliano, Puze Liu, Jan Peters, Davide Tateo et al. · arXiv · Jul 14, 2026
Reinforcement Learning has revolutionized the landscape of robotic research, allowing robust learning of complex robotic skills in simulation. However, real-world deployment in open-ended environments requires strong safety guarantees to pr…
A Minimalist Retargeting-Guided Reinforcement Learning Recipe for Dexterous Manipulation
Yunhai Feng, Natalie Leung, Jiaxuan Wang, Lujie Yang et al. · arXiv · Jul 13, 2026
Recent work in humanoid whole-body control has found success with a simple recipe: retarget human motion to robot kinematic references, then train policies via reinforcement learning (RL) to track them. But how does this recipe transfer to …
Time-Lag-Aware Deep Reinforcement Learning for Flexible Job-Shop Scheduling in PPVC Module Factories
Ziheng Zhang, Wei Zhang · arXiv · Jul 13, 2026
Prefabricated prefinished volumetric construction moves most building work into module factories, whose production floor operates as a flexible job shop. A major complication is decisive: long post-operation time-lags caused by concrete cur…
Active Offline-to-Online Reinforcement Learning
Alper Kamil Bozkurt, Shangtong Zhang, Yuichi Motai · arXiv · Jul 13, 2026
Background: Offline reinforcement learning (RL) enables effective policies to be trained from large, previously collected datasets and subsequently improved through limited online interaction. This offline-to-online RL (O2O-RL) paradigm is …
Diversified Multinomial Logit Contextual Bandits
Heesang Ann, Taehyun Hwang, Min-hwan Oh · arXiv · Jul 13, 2026
Existing contextual multinomial logit (MNL) bandits model relevance-driven choice but ignore the potential benefits of within-assortment diversity, while submodular/combinatorial bandits encode diversity in rewards but lack structured choic…
Semantic Pareto-DQN: A Multi-Objective Reinforcement Learning Framework for Financial Anomaly Detection
Cláudio Lúcio do Val Lopes, Lucca Machado da Silva · arXiv · Jul 10, 2026
Financial anomaly detection suffers from extreme class imbalance, causing traditional single-objective algorithms to exhibit "fraud collapse", defaulting to the majority class and failing to balance anomaly interdiction with customer fric…
Online behavior modification in Reinforcement Learning: Enforcing monotonicity between latent variables and behavioral axes
Fousseyni Sangare, Cédric Pradalier · ICRA26 Learning-HRI Workshop Poster · Jul 10, 2026
Online behavior modification in reinforcement learning (RL) allows robots to adjust behavioral style through latent variables without retraining. However, existing methods such as ACORD (Adjustable Control for RL Dynamics) often result in c…
MPFlow: Learning Budgeted Max-Flow Optimization on the Lightning Network with Deep Graph Reinforcement Learning
Harrison Rush, Vincent Davis, Simone Antonelli, Vikash Singh et al. · arXiv · Jul 9, 2026
We address liquidity placement in the Bitcoin Lightning Network (LN): given a fixed budget, which channels should a node open to maximize its routing capacity? We cast this as a budget-constrained combinatorial optimization problem on graph…
Multi-Modal, Multi-Environment Machine Teaching for Robust Reward Learning
Ali Larian, Qian Lin, Chang Zong Wu, Daniel S. Brown · arXiv · Jul 9, 2026
As autonomous agents are increasingly deployed across diverse operational contexts, aligning their behavior with human intent demands reward functions that remain robust to such changes rather than overfitting to any single environment. Inv…
Selective Timestep Weighting and Advantage-Based Replay for Sample-Efficient Diffusion RLHF
Eric Zhu, Abhinav Shrivastava, Soumik Mukhopadhyay · arXiv · Jul 8, 2026
Reinforcement learning from human feedback (RLHF) has emerged as a powerful paradigm for aligning generative models with human preferences. However, applying RLHF to diffusion models remains highly feedback inefficient, as existing approach…
Entanglement as a Structural Complexity Axis: A PAC-Bayesian View of Generalization in Quantum Policies and Value Functions
Jian Xu, Delu Zeng, John Paisley, Qibin Zhao · arXiv · Jul 7, 2026
Parameterized quantum circuits (PQCs) are increasingly used as policies and value functions in quantum reinforcement learning, yet it remains unclear when and why quantum policies generalize. We give a PAC-Bayesian account in which generali…
TriA Pipeline: A Large-Scale Automatic Audio Annotation Pipeline For Audio Classification In Specific Scenarios
Hong Lyu, Mingru Yang, Qianhua He, Yanxiong Li et al. · arXiv · Jul 7, 2026
There are some datasets of varying scales for audio classification (AC) applied to different tasks. However, annotated data is limited for most scenarios, such as domestic environments. To address this challenge, we propose an Au…
Digital Twin-Enabled Reinforcement Learning for Fault-Resilient Urban Traffic Signal Control
Marina Wiemers, Jones, Jaylen, Jordan, Mladen Čičić, Jonas Jostmann et al. · HAL (Le Centre pour la Comm... · Jul 7, 2026
DecompRL: Solving Harder Problems by Learning Modular Code Generation
Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen et al. · arXiv · Jul 2, 2026
How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attem…
Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li et al. · arXiv · Jul 1, 2026
Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all mo…
