Latest NLP Research Papers
The newest NLP papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks NLP so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An et al. · arXiv · Jun 9, 2026
Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot targe…
- Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu et al. · arXiv · Jun 9, 2026
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an an…
- Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov · arXiv · Jun 9, 2026
Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximi…
- Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth · arXiv · Jun 9, 2026
Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced e…
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- PhantomBench: Benchmarking the Non-existential Threat of Language Models
Haeji Jung, Hila Gonen · arXiv · Jun 9, 2026
Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavio…
- The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models
Hakan Mehmetcik · arXiv · Jun 9, 2026
This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis…
- VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation
Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu · arXiv · Jun 9, 2026
Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaning…
- A History-Aware Visually Grounded Critic for Computer Use Agents
Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen et al. · arXiv · Jun 9, 2026
Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, exi…
- Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du et al. · arXiv · Jun 9, 2026
With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality con…
- T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao et al. · arXiv · Jun 9, 2026
Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and…
- Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li et al. · arXiv · Jun 9, 2026
Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet an…
- Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV et al. · arXiv · Jun 9, 2026
Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment b…
- AuRA: Internalizing Audio Understanding into LLMs as LoRA
Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu et al. · arXiv · Jun 9, 2026
Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pr…
- Generative Archetype-Grounded Item Representations for Sequential Recommendation
Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen et al. · arXiv · Jun 9, 2026
Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models…
- Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer
Maria Milkova, Maksim Rudnev · arXiv · Jun 9, 2026
Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English …
- Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions
Parisa Suchdev, Juniper Lovato · arXiv · Jun 9, 2026
Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and re…
- Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia et al. · arXiv · Jun 9, 2026
The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal en…
- It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott et al. · arXiv · Jun 9, 2026
Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily s…
- Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering
Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang et al. · arXiv · Jun 9, 2026
Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. A…
- Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation
Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu et al. · arXiv · Jun 9, 2026
Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on h…
- Causally Evaluating the Learnability of Formal Language Tasks
Vésteinn Snæbjarnarson, Anej Svete, Josef Valvoda, Reda Boumasmoud et al. · arXiv · Jun 8, 2026
Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are h…
- SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation
Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang et al. · arXiv · Jun 8, 2026
Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool int…
- Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee · arXiv · Jun 8, 2026
Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodo…
- iOSWorld: A Benchmark for Personally Intelligent Phone Agents
Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang et al. · arXiv · Jun 8, 2026
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent be…
- Collaborative Human-Agent Protocol (CHAP)
Arsalan Shahid, Gordon Suttie, Philip Black · arXiv · Jun 8, 2026
Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, clai…
- Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan · arXiv · Jun 8, 2026
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two …
- The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
Wendy K. Tam · arXiv · Jun 8, 2026
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with ``human…
- IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking
Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li et al. · arXiv · Jun 8, 2026
Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they s…
- BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling
Gianluca Barmina, Annemette Broch Pirchert, Andrea Blasi Núñez, Lukas Galke Poech et al. · arXiv · Jun 8, 2026
As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and …