Latest Text-to-Image Research Papers
The newest Text-to-Image papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Text-to-Image so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- Echo-Memory: A Controlled Study of Memory in Action World Models · Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang et al. · arXiv · Jun 8, 2026
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure i…
- LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models · Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu et al. · arXiv · Jun 1, 2026
Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To ad…
- Drifting Preference Optimization for One-Step Generative Models · Zhou Jiang, Yandong Wen, Zhen Liu · arXiv · Jun 1, 2026
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denois…
- Colored Noise Diffusion Sampling · Hadar Davidson, Noam Issachar, Sagie Benaim · arXiv · May 28, 2026
Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stoc…
- Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization · Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu · arXiv · May 27, 2026
Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we p…
- Towards Controllable Image Generation through Representation-Conditioned Diffusion Models · Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen · arXiv · May 26, 2026
Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text p…
- Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation · Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li et al. · arXiv · May 25, 2026
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-mod…
- Look Both Ways Before You Cross: Lifting Cross Fields From 2D Visual Priors · Dale Decatur, Jacob Serfaty, Oded Stein, Amir Vaxman et al. · arXiv · May 25, 2026
We present CrossLift, a technique for computing cross fields on meshes guided by visual features in images. We leverage powerful text-to-image priors that are capable of synthesizing images of feature-aligned quad meshes in 2D. We extract t…
- Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution · Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore et al. · arXiv · May 25, 2026
Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\text…
- SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers · Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell et al. · arXiv · May 21, 2026
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by m…
- PIXLRelight: Controllable Relighting via Intrinsic Conditioning · Miguel Farinha, Ronald Clark · arXiv · May 18, 2026
We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse a…
- EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos · Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen et al. · arXiv · May 18, 2026
Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchm…
- Does Synthetic Layered Design Data Benefit Layered Design Decomposition? · Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang et al. · arXiv · May 14, 2026
Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-ge…
- AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward · Runhui Huang, Jie Wu, Rui Yang, Zhe Liu et al. · arXiv · May 12, 2026
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start st…
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping · Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu et al. · arXiv · May 11, 2026
Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these…
- Flow-OPD: On-Policy Distillation for Flow Matching Models · Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao et al. · arXiv · May 8, 2026
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogene…
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation · Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang et al. · arXiv · May 8, 2026
While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to t…
- STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation · Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang et al. · arXiv · May 8, 2026
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive lang…
- HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models · Arani Roy, Shristi Das Biswas, Kaushik Roy · arXiv · May 8, 2026
Text-to-image diffusion models can generate visually stunning images, yet controlling what appears and how it appears remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. …
- A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping · Maxim V. Shugaev, Md Reshad Ul Hoque, Bridget Kennedy, Joseph T. Riley et al. · arXiv · May 6, 2026
Video sequences captured through refractive dynamic media, such as turbulent air or a water surface, often suffer from severe geometric distortions and temporal instability. While recent advances address mild atmospheric turbulence, no exis…
- Large Language Models are Universal Reasoners for Visual Generation · Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song et al. · arXiv · May 5, 2026
Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural unific…
- Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation · Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang et al. · arXiv · May 4, 2026
Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited envi…
- SEAL: Semantic-aware Single-image Sticker Personalization with a Large-scale Sticker-tag Dataset · Changhyun Roh, Yonghyun Jeong, Jonghyun Lee, Chanho Eom et al. · arXiv · Apr 29, 2026
Synthesizing a target concept from a single reference image is challenging in diffusion-based personalized text-to-image generation, particularly for sticker personalization where prompts often require explicit attribute edits. With only on…
- Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles · Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran · arXiv · Apr 28, 2026
Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address…
- Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models · Jiayi Guo, Linqing Wang, Jiangshan Wang, Yang Yue et al. · arXiv · Apr 28, 2026
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially exten…
- Diffusion Model as a Generalist Segmentation Learner · Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun et al. · arXiv · Apr 27, 2026
Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and ope…
- CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping · Md Shohel Rana, Tanoy Debnath · arXiv · Apr 27, 2026
Face swapping aims to optimize realistic facial image generation by leveraging the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often s…
- Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning · Yonatan Haile Medhanie, Yuanhua Ni · arXiv · Apr 22, 2026
Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez sc…
- RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking · Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel et al. · arXiv · Apr 22, 2026
Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained …
- VLA Foundry: A Unified Framework for Training Vision-Language-Action Models · Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang et al. · arXiv · Apr 21, 2026
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines…