Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

Abstract

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs dynamically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.

Simulation Experiments

Our method generates physically consistent digital scenes for highly cluttered environments.

We compare our results against two baselines: (1) SAM3D + ICP; (2) HoloScene.

The reconstructed scene is simulated under gravity using PyBullet.

Google Scanned Objects Scene 1.

Google Scanned Objects Scene 2.

Google Scanned Objects Scene 3.

YCB Scene 1.

YCB Scene 2.

YCB Scene 3.

Real-world Experiments

Our method adapts to real-world environments and enables simulation of contact-rich manipulation.

Our method needs only a single RGB-D image to reconstruct complete and dynamically consistent scenes.

The same trajectory is executed in the real world and replayed in the reconstructed scenes.

Simulation under Gravity

Real-world Toy4K scene 1.

Simulation with Pushing Trajectory

Replaying the robot trajectory in the simulator.

Simulation under Gravity

Real-world Google Scanned Objects scene 1.

Simulation with Pushing Trajectory

Replaying the robot trajectory in the simulator.

Simulation under Gravity

Real-world Toy4K scene 2.

Simulation with Pushing Trajectory

Replaying the robot trajectory in the simulator.

Simulation under Gravity

Real-world Toy4K scene 3.

Simulation with Pushing Trajectory

Replaying the robot trajectory in the simulator.

More Simulation Examples

Scenes using YCB Dataset

Scenes using Google Scanned Objects Dataset

Methodology

Method overview

Method Overview. Our physics-constrained Real2Sim pipeline consists of four stages. (a) Initial Reconstruction: Given a single RGB-D image \(I_t\) and instance masks \(M_t\), we obtain an initial estimate of object geometry and appearance \(\theta\) using SAM3D and ICP pose refinement. (b) Contact Graph Construction: We construct a contact graph \(cg = (pt, E)\), where the parse tree \(pt\) encodes support relationships and the edge set \(E\) encodes proximal relationships between objects. (c) Two-Stage Physics-Constrained Optimization: Guided by the contact graph, we optimize object properties in two stages. First, a geometry-aware optimization introduces SDF-based contact constraints and visual regularization to globally refine object poses, producing a penetration-free, contact-consistent initialization. Second, a hierarchical physics-constrained optimization, guided by the traversal order of the parse tree, uses differentiable simulation to jointly refine the initial pose and physical parameters of each object for long-horizon physical stability. (d) Photometric Refinement: As a final post-processing step, object textures are refined with a differentiable renderer to achieve photometric consistency.
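To make the contact-graph stage concrete, here is a minimal sketch of constructing \(cg = (pt, E)\). All object names, sizes, and the bounding-sphere approximation are assumptions for illustration; the actual pipeline evaluates signed distance fields of the reconstructed meshes rather than sphere-to-sphere gaps.

```python
import itertools

# Hypothetical cluttered scene: each object approximated by a bounding
# sphere (center, radius). The real method uses per-object mesh SDFs.
objects = {
    "mug":    ((0.0, 0.0, 0.05), 0.05),
    "box":    ((0.0, 0.0, 0.15), 0.05),   # stacked on top of the mug
    "banana": ((0.4, 0.0, 0.05), 0.05),
}
EPS = 1e-3  # contact tolerance in metres

def gap(a, b):
    """Signed surface gap between two spheres; negative means penetration."""
    (ca, ra), (cb, rb) = a, b
    d = sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
    return d - (ra + rb)

# Edge set E: proximal object pairs whose surface gap is within tolerance.
E = sorted((i, j) for i, j in itertools.combinations(sorted(objects), 2)
           if gap(objects[i], objects[j]) < EPS)

# Parse tree pt: supporter of each object. Objects touching z = 0 rest on
# the table; otherwise the supporter is the contacting neighbour below.
pt = {}
for name, (c, r) in objects.items():
    if c[2] - r < EPS:
        pt[name] = "table"
for i, j in E:
    lower, upper = sorted((i, j), key=lambda k: objects[k][0][2])
    pt.setdefault(upper, lower)

print("E  =", E)
print("pt =", pt)
```

With these toy poses, the mug and banana rest on the table while the box is supported by the mug, and the single proximal edge connects the stacked pair; the hierarchical optimization then processes objects bottom-up along \(pt\).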