Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, producing invalid states (e.g., floating objects or severe inter-penetration) that render downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs dynamically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization framework that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulated and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.
Our method generates physically consistent digital scenes even for highly cluttered environments.
We compare our results with two baselines: (1) SAM3D+ICP and (2) HoloScene.
The reconstructed scene is simulated under gravity using PyBullet.
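For reference, such a gravity stability check in PyBullet might look like the following minimal sketch (the `scene_objects` list, URDF paths, and poses are hypothetical placeholders for the reconstruction output, not the paper's released assets):

```python
import pybullet as p
import pybullet_data

# Hypothetical output of the reconstruction stage: one URDF per object
# plus its optimized base pose (position, quaternion).
scene_objects = [
    {"urdf": "recon/mug.urdf", "pos": [0.1, 0.0, 0.05], "orn": [0, 0, 0, 1]},
    {"urdf": "recon/bowl.urdf", "pos": [0.0, 0.2, 0.08], "orn": [0, 0, 0, 1]},
]

p.connect(p.DIRECT)                      # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")                 # ground plane

body_ids = [p.loadURDF(o["urdf"], o["pos"], o["orn"]) for o in scene_objects]

# Step the simulation and measure drift from the initial poses;
# a physically consistent reconstruction should remain nearly static.
initial = [p.getBasePositionAndOrientation(b)[0] for b in body_ids]
for _ in range(240 * 5):                 # 5 s at PyBullet's default 240 Hz
    p.stepSimulation()
for b, p0 in zip(body_ids, initial):
    pos = p.getBasePositionAndOrientation(b)[0]
    drift = sum((a - c) ** 2 for a, c in zip(pos, p0)) ** 0.5
    print(f"body {b}: drift {drift:.4f} m")
p.disconnect()
```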
Our method adapts to real-world environments and enables simulation of contact-rich manipulation.
Our method needs only a single RGB-D image to reconstruct complete and dynamically consistent scenes.
The same trajectory is executed in the real world and replayed in the reconstructed scenes.
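A replay along these lines could be sketched as follows, assuming the real trajectory was logged as joint positions at the simulator's step rate (`trajectory.npy`, the Panda URDF, and the logging format are assumptions, not the paper's exact setup):

```python
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

# Hypothetical assets: a robot URDF and a recorded joint-position
# trajectory of shape (T, num_actuated_joints) from the real robot.
robot = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)
trajectory = np.load("trajectory.npy")

joint_ids = [j for j in range(p.getNumJoints(robot))
             if p.getJointInfo(robot, j)[2] != p.JOINT_FIXED]

# Track each recorded waypoint with position control, stepping the
# simulation so contacts with the reconstructed scene are resolved.
for q in trajectory:
    p.setJointMotorControlArray(robot, joint_ids, p.POSITION_CONTROL,
                                targetPositions=q.tolist())
    p.stepSimulation()
p.disconnect()
```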
Real-world Toy4K scene 1.
Replaying the robot trajectory in the simulator.
Real-world Google Scanned Objects scene 1.
Replaying the robot trajectory in the simulator.
Real-world Toy4K scene 2.
Replaying the robot trajectory in the simulator.
Real-world Toy4K scene 3.
Replaying the robot trajectory in the simulator.
Method Overview. Our physics-constrained Real-to-Sim pipeline consists of four stages. (a) Initial Reconstruction: Given a single RGB-D image \(I_t\) and instance masks \(M_t\), we obtain an initial estimate of object geometry and appearance \(\theta\) using SAM3D and ICP-based pose refinement. (b) Contact Graph Construction: We construct a contact graph \(cg = (pt, E)\), where the parse tree \(pt\) represents support relationships and the edges \(E\) encode proximal relationships between objects. (c) Two-Stage Physics-Constrained Optimization: Guided by the contact graph, we optimize object properties in two stages. First, a geometry-aware optimization introduces SDF-based contact constraints and visual regularization to globally refine object poses, producing a penetration-free and contact-consistent initialization. Second, a hierarchical physics-constrained optimization, ordered by the parse tree, uses differentiable simulation to jointly refine the initial pose and physical parameters of each object for long-horizon physical stability. (d) Photometric Refinement: As a final post-processing step, object textures are refined using a differentiable renderer to achieve photometric consistency.
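To make the SDF-based contact constraints in stage (c) concrete, penalties of the following form are one plausible instantiation (a minimal sketch: the `sdf_b` callable and the surface samples are assumptions, and the paper's exact loss may differ):

```python
import torch

def penetration_loss(points_a, sdf_b, margin=0.0):
    """Penalize surface samples of object A that lie inside object B.

    points_a: (N, 3) surface points of A in the world frame, transformed
              by A's current (differentiable) pose estimate.
    sdf_b:    callable mapping (N, 3) world points to signed distances
              to B's surface (negative inside, by convention).
    """
    d = sdf_b(points_a)
    # Hinge on negative signed distance: zero when the objects are
    # separated, growing linearly with penetration depth.
    return torch.relu(margin - d).mean()

def contact_loss(points_a, sdf_b, eps=1e-3):
    """For a support edge of the contact graph, pull the closest point
    of A onto B's surface so the pair stays in contact, not floating."""
    d = sdf_b(points_a)
    return torch.relu(d.min() - eps) ** 2
```

Gradients of such terms with respect to object poses can be combined with the visual regularization and, in the second stage, propagated through the differentiable simulator.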