Latest Prompting & ICL Research Papers

The newest Prompting & ICL (in-context learning) papers from across the field, including arXiv, NeurIPS, CVPR, Nature, and more, refreshed daily and ranked by relevance. Distill AI tracks Prompting & ICL so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.

Recent papers

Social Capital and Regeneration in Historical Areas: Case Study of Jianguo Men Market, Xi’an, China
Min Wang · Open MIND · Jan 1, 2029
In recent years, China has increasingly adopted more inclusive and participatory approaches to urban regeneration, prompted by constrained government funding and the limitations of traditional top-down models. Although existing literature l…
The Impact of Virtual Discharge Interventions Compared to Traditional Methods on Adult Patient Population Readmission Rates: An Integrative Review
Tiffany Providence, Carrie Brooks, Nannono Kayondo, Courteney Lovell et al. · DigitalCommons - Kennesaw S... · Jan 1, 2027
Hospital readmissions are a persistent quality and cost burden to the healthcare system, prompting evaluation of modern strategies to strengthen transitional care. This integrative review synthesized findings from thirteen studies published…
Moral Consistency Variance: A Pilot Benchmark for Decision Stability under Moral Prompt Perturbations in Large Language Models
Qiao Liang · Open MIND · Dec 31, 2026
Static moral question-answering benchmarks do not test whether a model's decision distribution remains stable when the same dilemma is rephrased without changing the underlying facts. This paper introduces Moral Consistency Variance (MCV), …
Test-Time Training for Modality Order Consistency in Vision-Language Models
Aditi Gupta, Yossi Gandelsman · arXiv · Jul 22, 2026
We find that vision-language models are sensitive to a specific semantically irrelevant change: the order in which the image and question are presented. Across three models and three benchmarks, image-first prompting consistently outperform…
Sound Probabilistic Safety Bounds for Large Language Models
Mahdi Nazeri, Anne-Kathrin Schmuck, Sadegh Soudjani, Alessandro Abate · arXiv · Jul 22, 2026
We propose a novel framework for computing rigorous bounds on the probability that a large language model (LLM) generates harmful output to a given prompt. We study a new application of the Clopper-Pearson confidence intervals to obtain pro…
The Maskability Index: Predicting Task-Objective Alignment in Pretrained Language Models
Ahmad Pouramini, Mahsa Afsharzadeh · arXiv · Jul 22, 2026
Large-scale pretrained language models such as T5 and BERT have demonstrated strong capabilities for generating structured knowledge. However, their performance depends on how closely the prompting strategy matches the objectives used durin…
Understanding the Impact of Linguistic Realization Choices on LLM Stance with Causal Tracing
Langchen Huang, Sebastian Padó, Franziska Weeber · arXiv · Jul 22, 2026
Large language models (LLMs) are known to be sensitive to prompt and input formulations. However, existing studies have focused on lexical realization and largely ignored constructional choice. This paper studies whether linguistic construc…
CircuitKIT: Circuit Discovery, Evaluation, and Application Toolkit for Mechanistic Interpretability
Pratinav Seth, Hem Gosalia, Aditya Kasliwal, Vinay Kumar Sankarapu · arXiv · Jul 21, 2026
Circuit analysis can support not only model explanation but also downstream interventions such as pruning, editing, steering, and selective fine-tuning. However, conducting such analyses currently requires stitching together separate implem…
Prompt Design at Scale: How Format, Instruction Count, and Context Length Shape Instruction Adherence and Hallucination in Large Language Models
Netanel Eliav · arXiv · Jul 21, 2026
Practitioners make three prompt-design decisions with almost no controlled evidence behind them: how to format instructions and context (markdown, plain text, prose, or tabular), how many simultaneous instructions a system prompt can carry …
Inference-Time Steering for Cross-Lingual Factual Consistency in LLMs
Alexander Manev · arXiv · Jul 21, 2026
Although Large Language Models (LLMs) demonstrate remarkable multilingual fluency, their internal knowledge representations remain disproportionately biased toward high-resource languages. This leads to cross-lingual factual inconsistency, …
Beyond Score Prediction: LLM-Based Essay Scoring and Feedback Generation via Reinforcement Learning with Rubric Rewards
Xuefeng Jin, Jiashuo Zhang, Teng Cao, Bin Yang · arXiv · Jul 21, 2026
Large language models (LLMs) have been widely applied to automated essay scoring (AES) and automated feedback generation (AFG). However, existing studies rely primarily on prompt engineering or supervised fine-tuning, while systematic resea…
Partition, Prompt, Aggregate: Statistical Self-Consistency in Language Models
Patrik Wolf, Thomas Kleine Buening, Andreas Krause, Celestine Mendler-Dünner · arXiv · Jul 16, 2026
In-context learning is commonly interpreted as a form of conditional inference, in which the prompt specifies a context and the model's output is treated as an estimate of the corresponding conditional distribution. If this interpretation h…
The One-Word Census: Answer-Choice Conformity Across 44 Language Models
Tapan Parikh · arXiv · Jul 14, 2026
When a language model must pick one answer from a large space of equally valid options, which does it pick -- and how often is it the same answer every other model picks? Asked to "pick a word -- any word," 44 models chose "serendipity" 41%…
Tracing Agentic Failure from the Flow of Success
Samuel Yeh, Yiwen Zhu, Shaleen Deep, Sharon Li · arXiv · Jul 14, 2026
Failure attribution for LLM-based agentic systems, i.e., identifying which steps in a failure trajectory caused the task to fail, is critical for debugging and improving these systems. Existing approaches either rely on prompting-based pipe…
Epistemic Stance Flexibility Probing: Measuring Prompt-Conditioned Register Shift in Large Language Models
Binwen Liu, Yilin Ren · arXiv · Jul 14, 2026
A language model may be asked either what experts believe about a contested claim or what it believes about the claim itself. A trustworthy conversational agent should distinguish these two requests and respond in different epistemic regist…
Inside the Unfair Judge: A Mechanistic Interpretability Account of LLM-as-Judge Bias
Zixiang Xu, Sixian Li, Huaxing Liu, Xiang Wang et al. · arXiv · Jul 13, 2026
Existing studies of LLM-as-judge scoring bias work predominantly at the input-output level: they perturb inputs, measure score deltas, and propose prompt-level mitigations. We argue that the same biases admit a representation-level account …
Prompt Compression via Activation Aggregation
Thibaud Ardoin, Semira Einsele, Evis Bregu, Gerhard Wunder · arXiv · Jul 9, 2026
Large language models process prompts by propagating activations through dozens of layers before generating a response. We ask whether the task-relevant information contained in an instruction prompt can be compressed into a single activati…
From Application-Layer Simulation to Native Meta-Architecture: Structural Tension as an Endogenous Driver for Heterogeneous AI Evolution
Heting Mao · arXiv · Jul 7, 2026
Current large language models (LLMs) are fundamentally stateless: their behavior is fully determined by input at inference time, and any higher-order cognitive architecture must be simulated at the application layer through prompt engineeri…
What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates
Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh · arXiv · Jul 2, 2026
LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the pro…
Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent
Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui et al. · arXiv · Jun 29, 2026
We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and…
Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity
Yuanhe Zhao, Tianyu Zhang, Huafei Xing, Derek F. Wong et al. · arXiv · Jun 23, 2026
Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that…
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Khanak Khandelwal · arXiv · Jun 23, 2026
Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutat…
TriggerBench: Investigating Prospective Memory for Large Language Models
Tianhua Zhang, Xinjiang Wang, Qianxi Zhang, Qi Chen et al. · arXiv · Jun 22, 2026
While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously re…
CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
Helena Bonaldi, Genoveffa Martone, Marco Guerini · arXiv · Jun 18, 2026
Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot mo…
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
Aueaphum Aueawatthanaphisut · arXiv · Jun 18, 2026
Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generat…
NAMESAKES: Probing Identity Memorization in Text-to-Image Models
Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang et al. · arXiv · Jun 18, 2026
Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-tru…
From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models
Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang et al. · arXiv · Jun 18, 2026
Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hid…
Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring
Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer · arXiv · Jun 18, 2026
LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaf…
Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis
Soheyl Bateni, Maryam Abdolali · arXiv · Jun 17, 2026
Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but i…
IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages
Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George et al. · arXiv · Jun 17, 2026
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during p…
