Latest Prompting & ICL Research Papers
The newest Prompting & ICL papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Prompting & ICL so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.
Recent papers
- TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang et al. · arXiv · Jun 9, 2026
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward cont…
- Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer
Maria Milkova, Maksim Rudnev · arXiv · Jun 9, 2026
Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English …
- Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee · arXiv · Jun 8, 2026
Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodo…
- When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following
Sai Adith Senthil Kumar · arXiv · Jun 8, 2026
Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models pr…
- End-to-End Context Compression at Scale
Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra et al. · arXiv · Jun 8, 2026
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time …
- How reliable are LLMs when it comes to playing dice?
Luca Avena, Gianmarco Bet, Bernardo Busoni · arXiv · Jun 5, 2026
We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of co…
- Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings
Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui et al. · arXiv · Jun 5, 2026
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding bench…
- Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
Sercan Karakaş, Yusuf Şimşek · arXiv · Jun 5, 2026
Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomati…
- When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations
Mahdi Alkaeed · arXiv · Jun 5, 2026
Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt pertur…
- Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill
Mehmet Iscan · arXiv · Jun 4, 2026
Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsific…
- Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation
Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas et al. · arXiv · Jun 4, 2026
Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific lan…
- A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation
Petr Parshakov · arXiv · Jun 4, 2026
We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 na…
- Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang et al. · arXiv · May 28, 2026
Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the s…
- Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests
Richard J. Young, Gregory D. Moody · arXiv · May 27, 2026
A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymme…
- FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents
Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang et al. · arXiv · May 26, 2026
Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges pe…
- When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Parth Darshan, Abhishek Divekar · arXiv · May 25, 2026
Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural…
- SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation
Michael Orme, Yanchao Yu, Zhiyuan Tan · arXiv · May 25, 2026
Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adapt…
- Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents
Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu et al. · arXiv · May 25, 2026
While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between …
- Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore et al. · arXiv · May 21, 2026
Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object …
- Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng et al. · arXiv · May 18, 2026
Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configurat…
- Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński et al. · arXiv · May 18, 2026
Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring too…
- RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
Andrea Morandi · arXiv · May 13, 2026
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce …
- Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas
Lukas Stähelin, Veronika Solopova, Max Upravitelev, David Kaplan et al. · arXiv · May 13, 2026
Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreements. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema.…
- Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
Anuj Sadani, Deepak Kumar · arXiv · May 13, 2026
Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) trainin…
- Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan et al. · arXiv · May 12, 2026
We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user qu…
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
Eric Bigelow, Raphaël Sarfati, Daniel Wurgaft, Owen Lewis et al. · arXiv · May 12, 2026
Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we p…
- Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
Eilam Shapira, Moshe Tennenholtz, Roi Reichart · arXiv · May 12, 2026
AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control l…
- Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
Abrar Majeedi, Viswanatha Reddy Gajjala, Sai Prasanna Teja Reddy Bogireddy, Siddhant Rai · arXiv · May 11, 2026
Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for …
- MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan et al. · arXiv · May 7, 2026
Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly…
- Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern · arXiv · May 6, 2026
We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method compares their free-form, multi-toke…