
Latest Text-to-Image Research Papers

The newest Text-to-Image papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Text-to-Image so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.


Recent papers

Appearance Pointers -- Multimodal Region Control of Diffusion Transformers
Rahul Sajnani, Yulia Gryaditskaya, Radomír Měch, Srinath Sridhar et al. · arXiv · Jul 21, 2026
Controllable image generation remains challenging for creative professionals, who often require precise regional control over materials, object identities, and spatial arrangements that cannot be reliably achieved through text prompting alo…
ExpertVerse: A General-Purpose Benchmark for Expert-Level Reasoning in Knowledge-Intensive Visual Synthesis
Yuan Wang, Yongchao Du, Mengting Chen, Jinsong Lan et al. · arXiv · Jul 21, 2026
Recent advances in multimodal generative models have enabled instruction-based image generation to move beyond semantic manipulation to knowledge-driven visual reasoning. However, these methods focus on explicit commonsense reasoning, shall…
Text Template Tokens Are Implicit Semantic Registers in Diffusion Transformers
Maohua Li, Qirui Li, Yanke Zhou, Yiduo Li et al. · arXiv · Jul 21, 2026
Text-to-image diffusion transformers (DiTs) jointly process text and image tokens, yet their internal computation during denoising remains poorly understood. We introduce a causal interpretability framework for modern large-scale DiTs that …
Read It Back: Pretrained MLLMs Are Zero-Shot Reward Models for Text-to-Image Generation
Runhui Huang, Qihui Zhang, Zhe Liu, Yu Gao et al. · arXiv · Jul 13, 2026
In this paper, we propose SpectraReward, a training-free reward function that turns pretrained MLLMs into off-the-shelf reward models for image-generation reinforcement learning. Instead of asking the MLLM to judge a generated image or answ…
Latent-Identity Tuning in Text-to-Image Personalization Models
Daniel Garibi, Ronen Kamenetsky, Hadar Averbuch-Elor, Daniel Cohen-Or et al. · arXiv · Jul 13, 2026
Generating and editing a person's face demands high precision, as even minor modifications can significantly alter a subject's perceived identity. Current personalization and editing methods built on general-purpose text-to-image models, ho…
Feature-Space Guided Diffusion for Realistic Ultrasound Image Synthesis
Marina Domínguez, Nélida Mirabet-Herranz, Valery Naranjo · arXiv · Jul 13, 2026
Conditional diffusion models can generate anatomically plausible medical ultrasound (US) images, but anatomical plausibility alone does not ensure realistic B-mode appearance. Most US pipelines adapt standard generative architectures and co…
Vision as Unified Multimodal Generation
Xiaoyang Han, Jianhua Li, Kewang Deng, Zukai Chen et al. · arXiv · Jul 7, 2026
We formulate computer vision as unified multimodal generation, where heterogeneous visual tasks are expressed in the native text and image generation spaces of a unified multimodal model, without task-specific architectures. Under this form…
From RGB Generation to Dense Field Readout: Pixel-Space Dense Prediction with Text-to-Image Models
Zanyi Wang, Xin Lin, Haodong Li, Dengyang Jiang et al. · arXiv · Jul 7, 2026
Large-scale text-to-image models are attractive backbones for dense prediction because RGB generation pretraining learns rich semantic, structural, and geometric priors. Existing generative and editing approaches reuse these priors by casti…
PIPBench: A Profile-Inclusive Framework for Personalized Image Generation Evaluation
Yuhang Wu, Shuxiang Zhang, Wee Hian Ching, Chi Zhang et al. · arXiv · Jul 7, 2026
Recent text-to-image models such as DALLE-3 excel at following diverse prompts yet remain blind to individual aesthetic preferences. We study personalized image generation, where models must align outputs with a user's implicit visual prefe…
EquiSteer: Cross-Attention Steering Towards a Fairer Text-Guided Image Generation
Tatiana Gaintseva, Akshit Achara, Gregory Slabaugh, Jiankang Deng et al. · arXiv · Jul 1, 2026
Text-to-image diffusion models power everyday creative tasks, but they still reproduce the demographic biases in their training data. On common prompts such as "a photo of a nurse," "a photo of a CEO", they skew their outputs toward one…
DanceOPD: On-Policy Generative Field Distillation
Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong et al. · arXiv · Jun 25, 2026
Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, e…
FunPiQ: A New Benchmark for Pixel-Level Quality Assessment in Fundus Images
Pengwei Wang, José Morano, Virginia Mares, Hrvoje Bogunović · arXiv · Jun 24, 2026
Color fundus photography (CFP) is the most common ophthalmic imaging modality for large-scale screening. However, it is highly susceptible to degradations, making robust fundus image quality assessment (FIQA) crucial. The criteria for what …
DiffusionBench: On Holistic Evaluation of Diffusion Transformers
Xingjian Leng, Jaskirat Singh, Zhanhao Liang, Ethan Smith et al. · arXiv · Jun 23, 2026
Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflec…
IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
Zixuan Li, Haokun Lin, Yicheng Xiao, Zhiwei Li et al. · arXiv · Jun 23, 2026
Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layout…
Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping
Rishubh Parihar, Ayush Raina, R. Venkatesh Babu, Or Patashnik · arXiv · Jun 22, 2026
Reference-based diffusion models enable highly controllable image generation by leveraging elements from input images to guide prompt-driven synthesis. However, these models are computationally expensive in runtime, and their cost scales se…
Semantic Browsing: Controllable Diversity for Image Generation
Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik et al. · arXiv · Jun 22, 2026
Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve di…
GeoFidelity-Bench: Evaluating Segment-Level Geographic Fidelity in Text-to-Image Street-View Generation
Kaizhen Tan, Hanzhe Hong, Siru Tao · arXiv · Jun 22, 2026
Text-to-image models can generate visually plausible city streets, but whether their outputs correspond to a requested road segment rather than a generic city prior remains unclear. We introduce GeoFidelity-Bench, a reference-panel benchmar…
Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark
Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang et al. · arXiv · Jun 22, 2026
We propose Hedgementation: a new benchmark to evaluate machine learning models for hedgerow mapping from remote sensing data at country scale and 10 m² spatial resolution. We combine and harmonize multiple remote sensing data products and…
A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2
Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang · arXiv · Jun 17, 2026
Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual desi…
RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space
Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan et al. · arXiv · Jun 12, 2026
Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RA…
HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities
Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li et al. · arXiv · Jun 12, 2026
Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discrimina…
InterleaveThinker: Reinforcing Agentic Interleaved Generation
Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng et al. · arXiv · Jun 11, 2026
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-i…
Modality Forcing for Scalable Spatial Generation
Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson et al. · arXiv · Jun 11, 2026
Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for d…
Echo-Memory: A Controlled Study of Memory in Action World Models
Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang et al. · arXiv · Jun 8, 2026
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure i…
LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu et al. · arXiv · Jun 1, 2026
Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To ad…
Drifting Preference Optimization for One-Step Generative Models
Zhou Jiang, Yandong Wen, Zhen Liu · arXiv · Jun 1, 2026
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denois…
Colored Noise Diffusion Sampling
Hadar Davidson, Noam Issachar, Sagie Benaim · arXiv · May 28, 2026
Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stoc…
Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu · arXiv · May 27, 2026
Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we p…
Towards Controllable Image Generation through Representation-Conditioned Diffusion Models
Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen · arXiv · May 26, 2026
Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text p…
Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li et al. · arXiv · May 25, 2026
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-mod…

