
Latest Robot Manipulation & VLA Research Papers

The newest Robot Manipulation & VLA papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Robot Manipulation & VLA so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.


Recent papers

AXIS: A Growable Community-Driven Data Engine for Scalable Robot Manipulation
Mengfei Zhao, Dihong Huang, Yikai Tang, Peihao Li et al. · arXiv · Jul 23, 2026
Learning effective robot manipulation policies requires diverse, high-quality demonstrations, yet existing data pipelines are often difficult to scale because they rely on specialized hardware, centralized operators, or fixed task suites. W…
Grasp, Handover, Rotate: Bimanual Object Reorientation via Compositional Diffusion and Energy-Based Optimization
Wun Lam Yeung, Wenjun Liu, Yui Cheung Yu, Zhengyan Lambo Qin et al. · arXiv · Jul 23, 2026
Bimanual object reorientation - picking an object, handing it over between two arms, and placing it in a desired target pose - is valuable when direct placement from the initial grasp is infeasible due to collisions, kinematic constraints, …
GuidedAttention: Interpretable and Correctable Visual Attention for OOD-Robust Robot Manipulation via Imitation Learning
Masaki Murooka, Ryoichi Nakajo, Keisuke Shirai, Tomohiro Motoda et al. · arXiv · Jul 23, 2026
End-to-end visuomotor policies provide little opportunity for humans to understand or correct the policy's visual attention. We propose GuidedAttention, a visuomotor imitation learning framework that introduces interpretable and correctable…
Closing the Lab-to-Store Gap: A Data-Efficient Post-Training and Experience-Driven Learning VLA Framework for Retail Humanoids
Roger Sala Sisó, Tiago Silvério, Jakob Sand, Tran Nguyen Le · arXiv · Jul 22, 2026
Closing the gap between benchmark performance and reliable real-world operation remains a central challenge for Vision-Language-Action (VLA) humanoid robots, which must handle execution errors, distribution shifts, and environmental variabi…
SeededGrasp: Language-Guided Grasping in Complex Scenes with Multiple Embodiments
Yang Xu, Gurpreet Singh Mukker, Raymond Wang, Jasper Gerigk et al. · arXiv · Jul 22, 2026
Practical robotic grasping in complex scenes requires both 3D spatial reasoning and alignment with task-specific requirements. Vision-language models (VLMs) offer a natural way to specify these requirements using language, but existing appr…
ReferTrack: Referring Then Tracking for Embodied Visual Tracking
Hanjing Ye, Tianle Zeng, Jiazhao Zhang, Shaoan Wang et al. · arXiv · Jul 22, 2026
Embodied visual tracking (EVT) requires a mobile agent to continuously follow a specific target described in natural language using only onboard vision. While recent vision-language-action (VLA) policies unify target identification and traj…
V2F: Vision-Informed Grasp Force Prediction for Damage-Aware Robotic Handling of Date Fruits
Shahd Shami, Obadah Wali, Eric Feron, Shinkyu Park · arXiv · Jul 22, 2026
This paper presents a vision-informed grasp force prediction framework for robotic handling of date fruits. Addressing the dual challenge of high detachment forces and low bruise thresholds, we first conduct mechanical characterization on d…
Beyond imitation: Robots that learn to work in the real world
Johannes A. Stork · Science Robotics · Jul 22, 2026
Real-world learning can push robot manipulation beyond imitation alone.
Design and stability analysis of an underactuated hand with passively rotating fingers
Léonie Plancoulaine, Sylvain Guégan, Franck Plestan, Damien Chablat · arXiv · Jul 21, 2026
This paper presents an innovative design and stability analysis of an underactuated robotic finger with spatial mobility, designed to enhance gripping dexterity in robotic hands. The finger architecture incorporates a revolute joint at its …
From Frames to Pouring: The CRAM Cognitive Architecture for Everyday Robot Manipulation
Michael Beetz, Michaela Kümpel · KI - Künstliche Intelligenz · Jul 20, 2026
The Cognitive Robot Abstract Machine (CRAM) was developed to enable robots to perform everyday manipulation tasks in human environments by transforming underspecified task instructions into context-sensitive, executable actions. Ra…
Knowledge-augmented embodied exploration for robotic grasping in constrained environments
Jin Liu, Sun Kai, Leibing Xiao, Ce Wang · Robotica · Jul 20, 2026
It has always been expected that robots can actively manipulate complex environments to fulfill human requirements. This process typically necessitates that the robot be equipped with the ability for embodied exploration and manipu…
AHEAD: Anticipatory Hand-Driven Teleoperation via Human Intent Prediction
Seok Joon Kim, Junho Lee, Federica Spinola, Taein Kwon et al. · arXiv · Jul 16, 2026
Direct hand-driven teleoperation maps an operator's hand motion to robot end-effector commands at every frame, enabling precise control, but it requires constant monitoring and correction during approach, grasp, and placement, which can be …
CosFly-VLA: A Spatially Aware Vision-Language-Action Model for UAV Tracking
Ruilong Ren, Songsheng Cheng, Yunpeng Zhou, Hanxuan Chen et al. · arXiv · Jul 16, 2026
Dynamic target tracking is essential for Unmanned Aerial Vehicles (UAVs) operating in complex urban environments, where both the target and the camera viewpoint change continuously. Existing Vision-Language-Action (VLA) policies can track v…
AeroAct: Action-Centered World-Action Models for Language-Conditioned Quadrotor Flight
Xinhong Zhang, Qiyuan Zhu, Yubo Huang, Haolin Chen et al. · arXiv · Jul 16, 2026
Language-conditioned quadrotor flight requires a policy to ground semantic goals, anticipate the visual consequences of ego-motion, and output control references that remain smooth and dynamically executable under rapidly changing first-per…
Towards Human-like Physical Intelligence: Lifelong Vision-Language-Action Learning for Robotic Manipulation
Yao He, Gan Sun, Wenqi Liang, Fazeng Li et al. · arXiv · Jul 16, 2026
Similar to the natural capabilities of humans to sequentially learn new tasks, robots with Vision-Language-Action (VLA) models should possess lifelong learning ability to learn a new task when deployed in open-world environments. However, m…
Macroscopic EEG reveals discriminative low-frequency oscillations in plan-to-grasp visuomotor tasks
Anna Cetera, Sima Ghafoori, Ali Rabiee, Mohammad Hassan Farhadi et al. · Journal of Neural Engineering · Jul 16, 2026
OBJECTIVE: The vision-based grasping brain network integrates visual perception with cognitive and motor processes for visuomotor tasks. While invasive recordings have successfully decoded localized neural activity related to grasp type pla…
Vision-Language Model-Guided Transparent Object Perception and Task-Oriented Grasping for Robotic Manipulation
Kejian Ni, Xiepeng Yang, Tao Chen, Minglu Zhu · Robotics · Jul 16, 2026
Transparent objects such as glass containers, test tubes, and plastic bottles are common in robotic manipulation scenarios, but their refractive and reflective surfaces produce incomplete RGB-D geometry and make task-specific grasp selectio…
Depth-augmented diffusion policy with pseudo-depth for robust robotic manipulation
Lesia Mochurad, Yaroslav Hladun, Ivan Tsmots, Oleh Bisikalo · Scientific Reports · Jul 16, 2026
Diffusion Policy has emerged as a powerful approach for imitation learning in robotic manipulation. However, policies trained solely on RGB observations often fail to capture the spatial structure of a scene, which limits performance on tas…
Underwater Grippers for Dexterous Manipulation: A Review on Design and Enabling Technologies
Canjun Yang, Zilin Xing, Mingwei Lin, Ri Lin et al. · Journal of Field Robotics · Jul 15, 2026
With the advancement of deep-sea resource exploration and scientific research, underwater operations are evolving from rough manipulation to precise and intelligent tasks, imposing higher demands on the dexterity and environmental …
Object-centric diffusion policies for real-world robotic-arm imitation learning
Prashant Reddy Kasu, Dugan Um · Frontiers in Robotics and AI · Jul 15, 2026
Imitation learning in complex, unstructured environments remains challenging due to the difficulty of grounding perception in physically meaningful representations and the need to model multimodal action distributions. Existing approaches o…
DenseReward: Dense Reward Learning via Failure Synthesis for Robotic Manipulation
Yu Fang, Wanxi Dong, Jiaqi Liu, Yue Yang et al. · arXiv · Jul 14, 2026
Reinforcement learning holds great promise for improving robot policies beyond the limits of imitation learning. However, its practical adoption remains bottlenecked by the lack of reliable vision-language reward models that provide dense a…
ChunkFlow: Towards Continuity-Consistent Chunked Policy Learning
Zhao Yang, Yinan Shi, Mingyuan Yao, Wenyao Xue et al. · arXiv · Jul 14, 2026
Vision-language-action (VLA) models increasingly adopt chunked action heads to satisfy real-time constraints; however, this introduces boundary jitter: overlapping regions between consecutive chunks often yield inconsistent predictions, deg…
ExToken: Structured Exploration for Efficient Vision-Language-Action Reinforcement Fine-tuning
Yilun Kong, Yunpeng Qing, Guozheng Ma, Haoyu Wang et al. · arXiv · Jul 14, 2026
Reinforcement Learning (RL) has demonstrated significant potential for improving Vision-Language-Action (VLA) models on complex manipulation tasks. However, its practical scalability remains severely limited by the substantial cost of envir…
Jetson-PI: Towards Onboard Real-Time Robot Control via Foresight-Aligned Asynchronous Inference
Zebin Yang, Qi Wang, Yunhe Wang, Xiurui Guo et al. · arXiv · Jul 14, 2026
Vision-Language-Action (VLA) models have achieved impressive performance on diverse embodied tasks. However, deploying VLA models on low-power onboard devices, such as the Jetson Orin, remains challenging due to their high computational com…
TrustVLA: Mechanism-Guided Inference-Time Defense Against Vision-Language-Action Backdoors
Pinhan Fu, Xianda Guo, Xuetao Li, Wenke Huang et al. · arXiv · Jul 14, 2026
Vision-Language-Action (VLA) models are deployed through pipelines that end users cannot audit, and a poisoned VLA can behave normally on clean observations while a small visual trigger redirects a long-horizon robot policy before any failu…
A Minimalist Retargeting-Guided Reinforcement Learning Recipe for Dexterous Manipulation
Yunhai Feng, Natalie Leung, Jiaxuan Wang, Lujie Yang et al. · arXiv · Jul 13, 2026
Recent work in humanoid whole-body control has found success with a simple recipe: retarget human motion to robot kinematic references, then train policies via reinforcement learning (RL) to track them. But how does this recipe transfer to …
From World Action Models to Embodied Brains: A Roadmap for Open-World Physical Intelligence
Yuanzhi Liang, Xufeng Zhan, Haibin Huang, Chi Zhang et al. · arXiv · Jul 13, 2026
Artificial general intelligence ultimately requires agents that can reason and act in the physical world. Action models, vision-language-action policies, and world models have advanced this goal, while World Action Models (WAMs) are particu…
Trajectory Planning and Certification for 3-DOF Robot Manipulators Using Real Quantifier Elimination Based on Comprehensive Gröbner Systems
Yu Nakai, Akira Terui, Masahiko Mikawa · arXiv · Jul 13, 2026
We propose an algorithm and its implementation for trajectory planning and certification for 3-DOF robot manipulators. The method uses Real Quantifier Elimination (QE) based on Comprehensive Gröbner Systems (CGS), also known as the CGS-QE m…
A world model as a judge: early, auditable verdicts on partial robot-manipulation rollouts.
Abdelhamid Bakhta · Zenodo (CERN European Organ... · Jul 13, 2026
We present LeWorldModel Judge, a locked benchmark and an auditable judging surface for scoring partial robot-manipulation rollouts before their episodes end. Sparse terminal reward answers no useful question about a trajectory prefix: wheth…
