Latest Chain-of-Thought Reasoning Research Papers
The newest Chain-of-Thought Reasoning papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Chain-of-Thought Reasoning so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao et al. · arXiv · Jun 9, 2026
Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and…
- Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li et al. · arXiv · Jun 9, 2026
Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet an…
- Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV et al. · arXiv · Jun 9, 2026
Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment b…
- Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia et al. · arXiv · Jun 9, 2026
The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal en…
- Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning
Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang · arXiv · Jun 9, 2026
Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression …
- N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization
Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu · arXiv · Jun 9, 2026
The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level samp…
- When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi · arXiv · Jun 9, 2026
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligne…
- REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs
Keer Lu, Liwei Chen, Guoqing Jiang, Zhiheng Qin et al. · arXiv · Jun 9, 2026
Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for s…
- How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs
Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang et al. · arXiv · Jun 9, 2026
Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatti…
- Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation
Jakub Masłowski, Jarosław A. Chudziak · arXiv · Jun 9, 2026
Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the process. During …
- WebChallenger: A Reliable and Efficient Generalist Web Agent
Jayoo Hwang, Xiaowen Zhang, Vedant Padwal · arXiv · Jun 9, 2026
Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We …
- KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
Sanghee Park, Geewook Kim, Kee-Eung Kim · arXiv · Jun 9, 2026
Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics…
- IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking
Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li et al. · arXiv · Jun 8, 2026
Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they s…
- SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang et al. · arXiv · Jun 8, 2026
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simul…
- When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following
Sai Adith Senthil Kumar · arXiv · Jun 8, 2026
Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models pr…
- Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani et al. · arXiv · Jun 8, 2026
Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-v…
- Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism
Kumar Thushalika, Sukumar Kishanthan, Asela Hevapathige · arXiv · Jun 8, 2026
Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism, a fund…
- Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents
Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu · arXiv · Jun 8, 2026
Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface …
- Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization
Lei Xu, Xin Quan, André Freitas · arXiv · Jun 8, 2026
Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal …
- Capacity, Not Format: Rethinking Structured Reasoning Failures
Hengxin Fan · arXiv · Jun 8, 2026
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradien…
- Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen et al. · arXiv · Jun 8, 2026
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative …
- Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs
Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li et al. · arXiv · Jun 8, 2026
Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extre…
- Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory
Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin et al. · arXiv · Jun 8, 2026
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing m…
- PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment
Yang Tian, Rui Wang, Xumeng Wen, Junjie Li et al. · arXiv · Jun 8, 2026
Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool …
- Multi-Hop Knowledge Composition is Bound by Pretraining Exposure
Yannis Karmim, Luis Marti, Djamé Seddah, Valentin Barrière · arXiv · Jun 8, 2026
Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts …
- One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems
Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao et al. · arXiv · Jun 8, 2026
Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also genera…
- TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning
Benjamin Stieger, Maximilian Terberger, Thomas Huber, Christina Niklaus · arXiv · Jun 8, 2026
We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving per…
- Symbolic and Abstractive Reasoning with Complex Visual Queries
Yichi Zhang, Jingdian Lu, Zhuo Chen, Lingbing Guo et al. · arXiv · Jun 8, 2026
Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe sy…
- SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance
Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin · arXiv · Jun 8, 2026
This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-20…