
Latest Chain-of-Thought Reasoning Research Papers

The newest Chain-of-Thought Reasoning papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Chain-of-Thought Reasoning so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

The Sociolinguistics of Machine Identity: LLM Personality and Ideology Propagation
Guangni Li · Open MIND · Dec 31, 2026
Do large language models (LLMs) possess a measurable "personality," and how do the linguistic properties of training corpora shape their cognitive style and downstream reasoning? This paper approaches these questions from a sociolinguistic …
OpenForgeRL: Train Harness-native Agents in Any Environment
Xiao Yu, Baolin Peng, Ruize Xu, Hao Zou et al. · arXiv · Jul 23, 2026
Modern AI agents rely on elaborate inference harnesses such as Claude Code, Codex, and OpenClaw to drive multi-turn reasoning, tool use, and access to external systems. While powerful, these complex harnesses also make agents hard to train …
RUMBA: Russian User Memory Benchmark
Elizaveta Shevtsova, Inna Glebkina, Mark Baushenko, Pavel Gulyaev et al. · arXiv · Jul 23, 2026
The ability to handle long-term memory in LLMs is becoming increasingly critical, yet existing benchmarks remain English-centric and rely on aggregate retrieval metrics, failing to capture interactions between long-range context, temporal i…
Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models
Renuka Oladri, Niveda Jawahar, Abdirisak Mohamed · arXiv · Jul 23, 2026
Chain-of-thought reasoning models such as DeepSeek-R1-Distill-Qwen-7B exhibit a bimodal convergence pattern: generations either terminate within a token budget (converged) or exhaust it without reaching a conclusion (non-converged). We char…
Euclid-MCP: A Model Context Protocol Server for Deterministic Logical Reasoning via Prolog
Bartolomeo Bogliolo · arXiv · Jul 23, 2026
Large Language Models (LLMs) excel at natural language understanding and generation but remain unreliable for multi-step logical reasoning, especially in safety-critical or compliance-sensitive domains. Recent neuro-symbolic approaches addr…
Adaptive Depth Sparse Framework: Similarity-Driven Resource Allocation for Pre-Trained LLMs
Yidu Wu, Xiang Wang, Kejie Zhao, Zhangchi Wang et al. · arXiv · Jul 23, 2026
Large language models (LLMs) achieve strong generation and reasoning performance, but the Transformer architecture incurs high inference cost. Existing acceleration methods often rely on task-specific fine-tuning or training from scratch, i…
Training Large Language Models for Self-Explanation Faithfulness
Yeoktatt Cheah, María Pérez-Ortiz, Noah Y. Siegel, Oana-Maria Camburu · arXiv · Jul 23, 2026
We propose a Reinforcement Learning (RL) method to directly optimize the faithfulness of self-explanations - the extent to which a model's generated reasoning accurately reflects its internal decision-making process. While existing work foc…
The Weight of Silence: A Causal Case for Weights Over the Scratchpad in Latent Chess Reasoning
Ishan S. Kshirsagar · arXiv · Jul 23, 2026
Latent, or silent, reasoning lets language models carry out intermediate computation in continuous vector space instead of words, and is widely assumed to function as an internal scratchpad the model actively consults during inference. Whet…
Chemical Chain-of-Thought Functions as a Hallucination-Prone Molecular Scratchpad
Jiatong Li, Yuxuan Ren, Weida Wang, Xiaoyong Wei et al. · arXiv · Jul 23, 2026
Chemical reasoning language models are expected to derive molecular answers through faithful chain-of-thought (CoT). However, across four reasoning model families and twelve chemistry tasks, hallucination is widespread and largely decoupled…
REFACT: Adaptive Fact Restatement for Compact and Faithful Chain-of-Thought Reasoning
Zhensheng Jin, Xin Dai, Zhenghao Liu, Chaojun Xiao et al. · arXiv · Jul 23, 2026
Large language models increasingly rely on long-form reasoning for complex tasks, yet their reasoning traces may drift away from the supplied context when evidence is sparse, noisy, or in conflict with parametric knowledge. Existing groundi…
WaveformQA: Benchmarking LLM Temporal Reasoning on Digital Waveforms
Yichuan Liu, Daniel Cummings, Nick Vadlamudi · arXiv · Jul 22, 2026
Large Language Models (LLMs) have demonstrated strong capabilities in code generation and reasoning, yet their ability to perform temporal reasoning over digital waveform data remains largely unexplored. Although reasoning over digital wave…
PyroDash: Cost-Efficient Token-Level Small-Large Language Model Collaborative Inference
Niqi Lyu, Pengtao Shi, Wei Qiu, Jianlin Zhong et al. · arXiv · Jul 22, 2026
Large language models (LLMs) provide strong reasoning capabilities but are expensive to serve at scale, whereas small language models (SLMs) are cheaper but less reliable on difficult problems. We introduce PyroDash, a cost-aware framework …
PoTRE: Test-Time Reasoning inspired by Cognitive Heterogeneity
Anmol Kankariya, Sercan Ö. Arık · arXiv · Jul 22, 2026
While Large Language Models (LLMs) excel at many tasks, they frequently struggle with complex reasoning that requires long-horizon planning and iterative error correction. Furthermore, standard single-stream prompting proves brittle when mo…
A Multi-Dimensional Evaluation of Explainability in Media Bias Detection
Ting Chen, Raina Zhang, Benjamin M. Ampel, Sagar Samtani · arXiv · Jul 22, 2026
Detecting media bias automatically is difficult because biased framing is often subtle, yet in domains such as news analysis, accurate predictions alone are insufficient without explanations that reflect the model's underlying reasoning. We…
Overview of FinMMEval 2026 Task 1: Multilingual Financial Multiple-Choice Question Answering
Zhuohan Xie, Yuyang Dai, Rania Elbadry, Vanshikaa Jani et al. · arXiv · Jul 22, 2026
FinMMEval 2026 Task 1 evaluates multilingual financial multiple-choice question answering in English, Chinese, Arabic, and Hindi. The task tests whether systems can select the correct answer to finance questions involving domain terminology…
Auto-Fill: Learning to Predict Missing Values Accurately with Specialist Language Models
Yurong Liu, Yeye He, Haoyu Dong, Junjie Xing et al. · arXiv · Jul 22, 2026
Predicting missing cell values in tabular data is a fundamental problem in data cleaning. While state-of-the-art reasoning models show great promise in predicting missing values in tables, by reasoning holistically across rows and columns, …
Rewarding Better Thinking for LLM Preference Alignment
Xubo Liu, Wenya Guo, Ruxue Yan, Xinying Qian et al. · arXiv · Jul 22, 2026
LLM preference alignment aims to optimize models toward human preferences across diverse user instructions. Reinforcement learning has become a major post-training approach for this goal, but existing proxy rewards are often outcome-level, …
SLPO: Scaling Latent Reasoning via a Surrogate Policy
Runyang You, Zhiyuan Liu, Yongqi Li, Wenjie Li · arXiv · Jul 22, 2026
Reinforcement learning with verifiable rewards has become the predominant recipe for eliciting test-time scaling in explicit Chain-of-Thought reasoners. Yet this scaling path remains computationally costly, since every intermediate step mus…
Reference-Free Evaluation of Reasoning in Open-Ended Question Answering
Guneet Singh Kohli, Yuxiang Zhou, Michael Sejr Schlichtkrull, Gregory E Dean et al. · arXiv · Jul 22, 2026
AI-generated answers in high-stakes domains are often fluent but difficult to verify, especially when they contain multi-step reasoning rather than a single final answer. We propose a reasoning-based, reference-free framework for auditing L…
Twin Agent: Context Residual Compression for Privilege Separated Agents
Zhanhao Hu, Dennis Jacob, Xiao Huang, Zhaorun Chen et al. · arXiv · Jul 21, 2026
Large language model (LLM) agents are vulnerable to security risks, such as prompt injection attacks from untrusted context that manipulate downstream reasoning and tool use. Existing secure-by-design approaches mitigate this risk by separa…
When Reasoning Narrows the Move: Diversity Collapse in LLM Game Play
Junyi Sha, Renfei Tan, David Simchi-Levi · arXiv · Jul 21, 2026
Supervised fine-tuning (SFT) is widely used to adapt large language models to downstream tasks, but its effect on behavioral diversity in sequential decision-making remains under-explored. We study this question in a controlled suite of det…
Copy Less, Ground More: Overcoming Repetitive Copying in Long-Context Reasoning via Evidence-Aware Reinforcement Learning
Lizhe Fang, Weizhou Shen, Tianyi Tang, Yisen Wang · arXiv · Jul 21, 2026
Large language models that generate step-by-step reasoning traces have achieved strong performance on complex tasks, and extending them to long-context settings has emerged as an important frontier. However, we identify a critical failure m…
Agents in the Wild: Where Research Meets Deployment
Grace Hui Yang, Pranav N. Venkit, Hooman Sedghamiz, Enrico Santus et al. · arXiv · Jul 21, 2026
Agentic systems, large language model (LLM) based architectures capable of reasoning, planning, acting, and coordinating with tools and other agents, are rapidly transitioning from research prototypes to production-scale deployments across do…
Selective State-Space Adaptation and Retrieval for Language Model Reasoning
Atahan Dokme, Larry Heck · arXiv · Jul 21, 2026
Low-rank adaptation introduces a static learned update applied identically to every input. The update provides task-level adaptation but does not explicitly represent token-level or instance-level state variation. A family of adapters is pr…
MeetingToM: Evaluating Multimodal LLMs on Theory-of-Mind Reasoning in Multi-Party Meetings
Ziyi Wang, Yuhang Wu, Dongxu Piao, Xingyu Liu et al. · arXiv · Jul 21, 2026
Theory of Mind (ToM), the ability to infer others' beliefs, intentions, and states of knowledge, is central to social interaction, yet remains challenging for current Multimodal Large Language Models (MLLMs), especially in multi-party meeti…
The Price of Reasoning: Cost-Quality Tradeoffs in Reinforcement Learning for Neural Machine Translation
Michael Jungo, Aixiu An · arXiv · Jul 21, 2026
Reinforcement learning with verifiable rewards (RLVR) has been established as a viable paradigm for the post-training of Large Language Models (LLMs), including downstream tasks, such as Neural Machine Translation (NMT). With the latest res…
MIRA-Ev: A Benchmark for Granular Evidence Detection and Relational Reasoning in Clinical Exams
Iker De la Iglesia, Johanna Ramirez-Romero, Jose Maria Villa-Gonzalez, Irune Urroz García et al. · arXiv · Jul 21, 2026
Clinical NLP evaluation remains dominated by multiple-choice question answering (MCQA), which scores only final-answer accuracy and cannot detect when a model reaches the correct diagnosis while grounding it in irrelevant, absent, or contra…
Reasoning Before Translation: Enhancing Legal Machine Translation with Structured Reasoning
Aixiu An, Michael Jungo, Eloi Eynard, Mark Drenhaus et al. · arXiv · Jul 21, 2026
Neural machine translation (NMT) in the legal domain is a linguistically and conceptually demanding task, primarily due to the complexity of legal language and the high level of precision it requires. The recent emergence of reasoning-capab…
Supra Cognitive Modes: A Routed Architecture for Agent Memory
Joshua Tobkin, David Yang · arXiv · Jul 21, 2026
Agent-memory workloads mix direct factual lookup, relation-chain and current-state reasoning, and broad synthesis over long histories. We describe Supra Cognitive Modes (SCM), an architecture that maps explicit or automatically selected per…
DAIS: Dependency-Aware Intermediate QA Supervision for Complex Reasoning
Yu Wang, Ming Fan, Xicheng Zhang, Zhiyong Li et al. · arXiv · Jul 21, 2026
Chain-of-thought (CoT) supervision exposes intermediate rationales, but flat rationale targets usually optimize a single reasoning sequence and provide limited supervision on how local conclusions should support later decisions. We introduc…
