Latest Robot Manipulation & VLA Research Papers
The newest Robot Manipulation & VLA papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Robot Manipulation & VLA so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation · Yujie Zang, Yuhang Zheng, Xian Nie, Yupeng Zheng et al. · arXiv · Jun 9, 2026
Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control …
- JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation · Drake Moore, Matt Cheng, Xiang Zhi Tan, Taşkın Padır · arXiv · Jun 9, 2026
Assistive mobility and manipulation platforms have received increasing attention as a means of restoring independence to individuals with disabilities. While effective for many basic activities of daily living (ADLs), a significant percenta…
- IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation · Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen et al. · arXiv · Jun 9, 2026
Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previou…
- MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models · Hao Shi, Weiye Li, Bin Xie, Yulin Wang et al. · arXiv · Jun 8, 2026
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore strug…
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing · Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu et al. · arXiv · Jun 8, 2026
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world predictio…
- Difference-Aware Retrieval Policies for Imitation Learning · Quinn Pfeifer, Ethan Pronovost, Paarth Shah, Khimya Khetarpal et al. · arXiv · Jun 8, 2026
Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric …
- Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models · Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi et al. · arXiv · Jun 8, 2026
Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene…
- ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models · Fan Zhang, Seongbin Park, Baharan Mirzasoleiman, Shariar Talebi et al. · arXiv · Jun 8, 2026
Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipulation within their training distribution, yet their generalization capabilities remain fundamentally limited. They lack the rob…
- ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies · Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang et al. · arXiv · Jun 8, 2026
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework…
- DexPIE: Stable Dexterous Policy Improvement from Real-World Experience · Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin et al. · arXiv · Jun 8, 2026
Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors durin…
- CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control · Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu et al. · arXiv · Jun 8, 2026
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this …
- Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping · Kaijie Shi, Wanglong Lu, Huiling Chen, Vinicius Prado da Fonseca et al. · arXiv · Jun 5, 2026
Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising…
- Spline Policy: A Structured Representation for Robot Policies · Mengze Tian, Yiming Li, Sichao Liu, Auke Ijspeert et al. · arXiv · Jun 5, 2026
Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spl…
- RhinoVLA Technical Report · Huixi Intelligence, Chen Zhang, Chenyang Zhou et al. · arXiv · Jun 5, 2026
Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment…
- Robotic Policy Adaptation via Weight-Space Meta-Learning · Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti et al. · arXiv · Jun 5, 2026
Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks…
- Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models · Jinhao Wu, Shiduo Zhang, Yicheng Liu, Xiaopeng Yu et al. · arXiv · Jun 5, 2026
Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute…
- LARA: Latent Action Representation Alignment for Vision-Language-Action Models · Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang et al. · arXiv · Jun 5, 2026
Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot ac…
- TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies · Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu et al. · arXiv · Jun 4, 2026
Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed fr…
- Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation · Rickmer Krohn, Erik Helmut, Niklas Funk, Jan Peters et al. · arXiv · Jun 4, 2026
Touch sensing is beneficial for solving a wide variety of manipulation tasks. While there exists a wide range of tactile sensors with different properties, exploiting the fusion of multiple heterogeneous tactile sensors to improve manipulat…
- MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action · Boyang Zhang, Lianlei Shan · arXiv · Jun 4, 2026
Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but intr…
- AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding · Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang et al. · arXiv · Jun 4, 2026
Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodie…
- AFUN: Towards an Affordance Foundation Model for Functionality Understanding · Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen et al. · arXiv · Jun 1, 2026
Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only …
- Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation · Shahram Najam Syed, Arthur Jakobsson, Haoran Hao, Jeffrey Ichnowski · arXiv · Jun 1, 2026
Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, …
- NDPP-Grasp: Non-Differentiable Physical Plausibility Constraint-Guided Task-Oriented Dexterous Grasp Generation · Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu · arXiv · Jun 1, 2026
Task-oriented dexterous grasp generation aims to produce dexterous grasp poses that are both physically plausible and functionally suitable for specified manipulation tasks. Existing diffusion-based methods often address these two requireme…
- Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO · Tianyang Chen, Wenjun Li, Xin Zhou, Yuze Wu et al. · arXiv · Jun 1, 2026
Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from da…
- FATE-VLA: Failure-aware test generation for vision-language-action models · Arusa Kanwal, Pablo Valle, Shaukat Ali, Aitor Arrieta · arXiv · Jun 1, 2026
Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are spars…
- RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models · Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian et al. · arXiv · Jun 1, 2026
Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-s…
- WALL-WM: Carving World Action Modeling at the Event Joints · Shalfun Li, Victor Yao, Charles Yang, Truth Qu et al. · arXiv · Jun 1, 2026
WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs …
- Co-training with Ego-centric Video and Demonstration for Robot Navigation Task · Shoya Kuno, Yumo Ouchi, Kanata Suzuki · arXiv · Jun 1, 2026
Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has…
- DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation · Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung et al. · arXiv · May 28, 2026
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, lea…