Latest 3D Vision & NeRFs Research Papers
The newest 3D Vision & NeRFs papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks 3D Vision & NeRFs so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning · Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou et al. · arXiv · Jun 9, 2026
Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchm…
- WorldOlympiad: Can Your World Model Survive a Triathlon? · Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang et al. · arXiv · Jun 9, 2026
We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or s…
- Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football · Andrew Kang, Priya Narasimhan · arXiv · Jun 9, 2026
We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent t…
- AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects · Yiming Zhao, Haoyu Sun, Aoyu Wang · arXiv · Jun 9, 2026
While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for cate…
- Latent Spatial Memory for Video World Models · Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen et al. · arXiv · Jun 8, 2026
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE enco…
- Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction · Ewa Miazga, Jorge Condor, Piotr Didyk · arXiv · Jun 8, 2026
View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-bas…
- HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents · Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu et al. · arXiv · Jun 8, 2026
Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but…
- Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning · Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan et al. · arXiv · Jun 5, 2026
This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with …
- DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation · Tony Danjun Wang, Tolga Birdal, Nassir Navab · arXiv · Jun 5, 2026
Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this lea…
- PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding · Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang et al. · arXiv · Jun 4, 2026
Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remai…
- HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes · Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang et al. · arXiv · Jun 4, 2026
Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or…
- RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling · Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang et al. · arXiv · Jun 4, 2026
Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Exi…
- RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning · Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki · arXiv · Jun 4, 2026
Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is th…
- LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations · Mritula Chandrasekaran, Sanket Kachole, Jarik Francik, Dimitrios Makris · arXiv · Jun 4, 2026
Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The p…
- Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction · Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan et al. · arXiv · Jun 4, 2026
Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated cone-beam computed tomography (CBCT) …
- Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation · Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma · arXiv · Jun 4, 2026
Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ c…
- Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images · Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm · arXiv · Jun 4, 2026
Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which i…
- T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation · Jingkun Feng, Reza Sabzevari · arXiv · Jun 4, 2026
Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentatio…
- Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars · Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong et al. · arXiv · Jun 4, 2026
Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalabil…
- Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models · Haibo Wang, Lifu Huang · arXiv · Jun 4, 2026
Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of lar…
- LiAuto-GeoX: Efficient Grounded Driving Transformer · Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu et al. · arXiv · Jun 4, 2026
Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typ…
- GS-NFS: Bandwidth-adaptive Streaming of Dynamic Gaussian Splats and Point Clouds · Rajrup Ghosh, Haodong Wang, Haoran Hong, Eduardo Pavez et al. · arXiv · Jun 4, 2026
Dynamic 3D Gaussian Splatting (3DGS) holds great promise as a 3D video streaming technology since it can represent complex 3D scenes with high fidelity. In this approach, every frame in a 3D video represents the environment as a collection …
- KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion · Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo et al. · arXiv · Jun 4, 2026
Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target…
- Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models · Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor · arXiv · Jun 1, 2026
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language mod…
- HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image · Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang et al. · arXiv · Jun 1, 2026
In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-qu…
- Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation · Siyuan Bian, Congrong Xu, Jun Gao · arXiv · Jun 1, 2026
Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this art…
- GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction · Boyu Yuan, Jiamiao Lu, Weichuan Zhang, Benqing Wu et al. · arXiv · Jun 1, 2026
This study introduces an automated deep learning framework for predicting brain injury (BI) in preterm infants from T2-weighted MRI (dHCP dataset). We propose GloResNet, a lightweight 3D CNN based on ResNet-10, pretrained on MedicalNet to a…
- MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents · Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim et al. · arXiv · Jun 1, 2026
We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single repres…
- Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation · Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee et al. · arXiv · Jun 1, 2026
Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key de…
- MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence · Hilton Raj, Vishnuram AV · arXiv · Jun 1, 2026
In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs)…