VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
License: arXiv.org perpetual non-exclusive license
arXiv:2601.05138v2 [cs.CV] 30 Mar 2026
Sixiao Zheng1,2, Minghao Yin3, Wenbo Hu4†, Xiaoyu Li4, Ying Shan4, Yanwei Fu1,2†
1 Fudan University  2 Shanghai Innovation Institute  3 HKU  4 ARC Lab, Tencent PCG
Project Page: https://sixiaozheng.github.io/VerseCrafter_page/
Abstract
Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently capture dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter , a geometry-driven video world model that generates dynamic, realistic videos from a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state as a static background point cloud and per-object 3D Gaussian trajectories. This representation captures each object’s motion path and probabilistic 3D occupancy over time, providing a flexible, category-agnostic alternative to rigid bounding boxes and parametric models. We render 4D Geometric Control into 4D control maps for a pretrained video diffusion model, enabling high-fidelity, view-consistent video generation that faithfully follows the specified dynamics. To enable training at scale, we develop an automatic data engine and construct VerseControl4D , a real-world dataset of 35K training samples with automatically derived prompts and rendered 4D control maps. Extensive experiments show that VerseCrafter achieves superior visual quality and more accurate control over camera and multi-object motion than prior methods.
Figure 1 :
VerseCrafter enables precise control of camera motion and multi-object motion via a 4D Geometric Control representation built from a static background point cloud and per-object 3D Gaussian trajectories, producing videos that better follow the desired motion than Yume [ 63 ] and Uni3C [ 11 ] and more closely match the ground-truth video.
† Corresponding authors.
1 Introduction
Video world models learn to simulate world dynamics by generating future frame sequences conditioned on past observations and control signals, such as actions or camera trajectories [ 48 , 30 , 13 , 41 , 63 ] . They provide a unified interface for visual prediction [ 32 ] , navigation [ 7 ] , and manipulation [ 23 ] . However, the reliance on video introduces a fundamental challenge: while an ideal world model should simulate the full 4D spatiotemporal space to reflect our physical reality, videos inherently capture dynamics in the projected 2D image plane.
To bridge this gap, recent works introduce camera control into video generation through explicit 3D geometry [ 117 , 11 , 127 ] , implicit pose embeddings [ 52 ] , or learned movement embeddings [ 9 , 70 , 12 ] . However, these methods are often limited to static scenes or leave multi-object motion uncontrolled. Existing approaches typically rely on 2D cues such as point trajectories [ 98 ] , optical flow [ 58 ] , masks [ 125 ] , or bounding boxes [ 93 ] , which lack 3D awareness and often fail under large viewpoint changes. More advanced 3D-aware methods use depth maps [ 124 ] , sparse 3D trajectories [ 15 ] , 3D bounding boxes [ 94 ] , or parametric human models like SMPL-X [ 11 ] to align camera and object motion in 3D space. Nevertheless, these control representations remain inadequate for modeling multi-object dynamics as a unified, compact, and editable 4D geometric scene state in a shared world coordinate frame. For instance, sparse trajectories are often noisy and incomplete, 3D bounding boxes impose rigid constraints ill-suited to natural objects, and SMPL-X representations are category-limited. Furthermore, several existing works focus on synthetic game environments [ 41 , 110 , 116 ] , where precise annotations are available for training. However, controllable modeling of complex, realistic 4D scenes with multi-object motion remains underexplored.
Thus we propose VerseCrafter , a realistic, dynamic video world model that enables precise control of camera and multi-object motion within a unified 4D geometric world state, as shown in Fig. 1 . At the core of VerseCrafter is our 4D Geometric Control representation, which represents the scene state as a static background point cloud for scene geometry and per-object 3D Gaussian trajectories to capture object dynamics. Each 3D Gaussian trajectory models an object’s probabilistic 3D occupancy over time: its mean defines the motion path, while its covariance captures the object’s spatial extent and orientation. This probabilistic formulation provides a soft, flexible, and category-agnostic way to model diverse object shapes and motions, overcoming the limitations of rigid 3D bounding boxes or category-specific parametric models. Crucially, the background point cloud and per-object 3D Gaussian trajectories share a common world coordinate frame, enabling coherent and unified control over both camera and object motion.
By rendering 4D Geometric Control into multi-channel 4D control maps, we condition a frozen Wan2.1-14B video diffusion backbone [ 88 ] via a lightweight GeoAdapter, an adapter-style branch inspired by ControlNet [ 120 ] . This conditioning enables the generation of high-fidelity videos that faithfully reflect the explicit 4D geometric world state. Unlike 2D control signals, our 4D Geometric Control is inherently 3D-aware, making it naturally more view-consistent and robust to occlusions, and thus a more effective and reliable interface for video world modeling. Training VerseCrafter requires large-scale paired data of real-world videos and corresponding 4D geometric control. To this end, we construct VerseControl4D , a real-world video dataset with automatically derived prompts and rendered 4D control maps. This dataset supports large-scale training on diverse real-world videos.
Our contributions are threefold:
• We introduce a novel 4D Geometric Control representation that unifies camera and multi-object motion in a shared world coordinate frame. By using 3D Gaussian trajectories, it provides a flexible and category-agnostic way to control object dynamics, overcoming the limitations of rigid, category-specific models.
• We present VerseCrafter, a geometry-driven video world model that leverages 4D Geometric Control for precise control over camera and multi-object motion. This enables the generation of high-fidelity, view-consistent videos that accurately follow complex 4D controls.
• We construct VerseControl4D, a real-world dataset of 35K training samples with automatically derived prompts and rendered 4D control maps. This addresses a key data bottleneck and supports large-scale training on diverse real-world videos.
2 Related Works
Video World Models . World models learn environment dynamics from observations by predicting future states for simulation, planning, and control [ 30 , 31 , 48 ] . Early visual world models adopt recurrent and latent-variable architectures [ 24 , 87 , 16 , 29 , 64 , 68 ] , while recent approaches use transformer and diffusion backbones to roll out realistic videos conditioned on actions, text, or camera trajectories [ 9 , 70 , 6 , 88 , 111 , 45 , 1 , 2 , 35 , 40 , 12 , 116 , 20 , 103 , 50 ] , and further extend temporal horizons with memories or long-sequence models [ 73 , 101 , 52 ] . Geometry-aware works such as DeepVerse [ 13 ] , Voyager [ 41 ] , and Yume [ 63 ] incorporate 3D geometry to support 4D video generation and world exploration, but are controlled via text, actions, or camera tokens and do not expose a compact, editable 4D geometric state for real-world multi-object dynamics. In contrast, VerseCrafter learns a geometry-driven mapping from 4D Geometric Control to dynamic, realistic videos, enabling disentangled control over camera and multi-object motion.
3D World Generation . Recent work leverages powerful 2D generative priors to synthesize explorable 3D environments from text, images, or videos [ 121 , 49 ] . Early methods mainly focus on object-level or single-scene generation [ 38 , 118 , 119 , 74 , 19 , 122 ] , distilling image diffusion models [ 79 ] into NeRFs [ 65 ] , implicit fields, meshes, or 3D Gaussian splats [ 44 ] , or optimizing scene geometry from multi-view or panoramic observations [ 80 , 115 , 18 , 114 , 113 ] . More recent approaches scale up to navigable 3D worlds [ 7 ] , combining depth estimation [ 108 ] , camera-guided video diffusion, iterative inpainting, and panoramic inputs to construct room- or city-scale Gaussian scenes for exploration [ 81 , 86 , 110 , 14 , 54 , 60 , 91 , 129 , 61 ] . However, these pipelines largely model static, synthetic-like scenes and provide limited explicit control over real-world multi-object dynamics. In contrast, VerseCrafter operates on real-world videos and represents the scene with a static background point cloud and per-object 3D Gaussian trajectories, forming an explicit 4D geometric scene state for geometry-consistent dynamic video generation.
Controllable Video Generation . Controllable video generation aims to steer camera and object motion via conditioning signals. Camera-controlled models [ 126 , 3 , 84 , 4 , 46 , 34 , 106 , 53 ] such as MotionCtrl [ 98 ] and CameraCtrl [ 33 ] inject camera extrinsics, Plücker-style encodings, or 3D priors [ 104 , 39 , 99 , 124 , 78 , 75 , 117 , 28 , 11 , 21 , 127 ] into video diffusion models for viewpoint control, but mostly assume static or weakly dynamic scenes. Object motion [ 85 , 96 , 107 , 55 , 66 , 67 , 82 , 76 , 62 , 51 , 128 , 123 , 90 , 89 , 36 , 69 , 97 , 25 , 57 , 83 , 10 , 109 , 112 , 100 ] is typically controlled using 2D cues (bounding boxes, masks, trajectories, strokes, optical flow) as in Boximator [ 93 ] , DragAnything [ 102 ] , and MotionCanvas [ 105 ] , or with more 3D-aware signals such as depth maps, sparse 3D trajectories, 3D boxes, or SMPL-X bodies in I2V3D [ 124 ] , Uni3C [ 11 ] , CineMaster [ 94 ] , Perception-as-Control [ 15 ] , and LongVie [ 26 ] . While these methods improve controllability, 2D controls remain view-dependent and fragile under large camera changes, and many 3D controls are category-specific, rigid, or tied to reconstruction-heavy pipelines. Recent approaches [ 109 , 27 , 58 , 105 , 94 , 22 , 98 , 127 , 15 ] begin to jointly control camera and object motion, but their control spaces are still fragmented rather than a unified, compact world state. VerseCrafter instead introduces 4D Geometric Control : a compact, category-agnostic 4D geometric scene state where a static background point cloud and per-object 3D Gaussian trajectories in a shared world coordinate frame jointly drive camera and multi-object motion.
Figure 2 : Framework of VerseCrafter. Given an input image and a text prompt, we estimate depth and obtain user-specified object masks to construct 4D Geometric Control consisting of a static background point cloud and per-object 3D Gaussian trajectories in a shared world coordinate frame. A camera trajectory is specified in the shared frame, and together with the 4D Geometric Control, rendered into per-frame background RGB/depth, 3D Gaussian trajectory RGB/depth, and a soft merged mask, forming multi-channel 4D control maps. The 4D control maps are encoded and fed into the proposed GeoAdapter, which conditions a frozen Wan2.1-14B backbone together with text embeddings from umT5, enabling geometry-consistent video generation with precise control over camera and multi-object motion.
3 Method
We propose VerseCrafter , a geometry-driven video world model that generates dynamic, realistic videos from an explicit 4D geometric scene state while enabling disentangled control over camera and multi-object motion. Our framework has two key components: (i) a unified 4D Geometric Control representation (Sec. 3.1 ), which represents the 4D geometric scene state in a shared world coordinate frame, and (ii) a lightweight GeoAdapter (Sec. 3.2 ), which injects encoded 4D control maps into a frozen Wan2.1-14B backbone while preserving its strong visual prior. Given an input image and a prompt, we construct 4D Geometric Control as a static background point cloud and per-object 3D Gaussian trajectories, specify a camera trajectory in the shared frame, render them into 4D control maps, and feed these maps into GeoAdapter to generate dynamic, realistic videos.
3.1 4D Geometric Control
We represent each scene as an explicit 4D geometric scene state, which we term 4D Geometric Control. This editable representation consists of a static background point cloud P^{\text{bg}} and per-object 3D Gaussian trajectories \{\mathcal{G}_{o}^{t}\}, all defined in a shared world coordinate frame.

Background point cloud. As shown in Fig. 2, we start from the input image, estimate monocular depth and camera intrinsics \mathbf{K} using MoGe-2 [95], and obtain object masks \{M_{o}\} with Grounded SAM2 [77], where the user selects one or more objects to control via text prompts or clicks. We take the input view as the reference world coordinate frame, so that the reference camera pose is given by \mathbf{R}_{1}=\mathbf{I} and \mathbf{t}_{1}=\mathbf{0}. Each pixel \mathbf{u}=(u,v,1)^{\top} with depth D_{1}(\mathbf{u}) is then back-projected as

\mathbf{p}(\mathbf{u})=\mathbf{R}_{1}^{\top}\big(D_{1}(\mathbf{u})\mathbf{K}^{-1}\mathbf{u}-\mathbf{t}_{1}\big). (1)

We use the object masks to partition the reconstructed point cloud into per-object point clouds

P_{o}=\big\{\mathbf{x}_{o,k}\,\big|\,\mathbf{x}_{o,k}=\mathbf{p}(\mathbf{u}_{k}),\ \mathbf{u}_{k}\in M_{o}\big\}, (2)

and a static background point cloud

P^{\text{bg}}=\big\{\mathbf{p}(\mathbf{u})\,\big|\,\mathbf{u}\notin\bigcup_{o}M_{o}\big\}=\{\mathbf{p}_{i}\}_{i=1}^{N_{\text{bg}}}. (3)

During generation, the background at frame t is rendered from P^{\text{bg}} under the camera pose, so viewpoint changes are realized as rigid camera motion in a fixed 3D world rather than by hallucinating a new background at every frame.
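As a concrete illustration, the back-projection of Eq. (1) and the mask-based partition of Eqs. (2)-(3) can be sketched in a few lines of NumPy. This is a minimal re-implementation for clarity, not the authors' code; array shapes and function names are our own.

```python
import numpy as np

def backproject_to_world(depth, K, R=np.eye(3), t=np.zeros(3)):
    """Back-project every pixel into the shared world frame, Eq. (1).

    depth: (H, W) monocular depth D_1; K: (3, 3) intrinsics; (R, t): reference pose.
    Returns an (H, W, 3) array of world-space points p(u).
    """
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(np.float64)  # homogeneous pixels u
    cam_pts = depth[..., None] * (pix @ np.linalg.inv(K).T)                 # D_1(u) K^{-1} u
    return (cam_pts - t) @ R                                                # R^T (D K^{-1} u - t)

def partition_point_cloud(points, object_masks):
    """Split points into per-object clouds P_o and the static background P^bg, Eqs. (2)-(3)."""
    per_object = [points[m > 0] for m in object_masks]                      # P_o
    background = points[~np.any(np.stack(object_masks) > 0, axis=0)]        # P^bg
    return per_object, background
```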
3D Gaussian trajectories. A single 3D Gaussian \mathcal{G}_{o}(\mathbf{x})=\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{o},\mathbf{\Sigma}_{o}) in the world coordinate frame compactly encodes an object's position (through \boldsymbol{\mu}_{o}), approximate shape and size (through the eigenvalues of \mathbf{\Sigma}_{o}), and orientation (through the eigenvectors of \mathbf{\Sigma}_{o}). A 3D Gaussian trajectory for an object o is then defined as a sequence of Gaussians

\{\mathcal{G}_{o}^{t}\}_{t=1}^{T},\quad\mathcal{G}_{o}^{t}(\mathbf{x})=\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{o}^{t},\mathbf{\Sigma}_{o}^{t}), (4)

whose means \{\boldsymbol{\mu}_{o}^{t}\} trace the motion path in 3D, while the covariances \{\mathbf{\Sigma}_{o}^{t}\} capture how the object's spatial extent and orientation evolve over time. This probabilistic formulation describes the object's 3D occupancy in a soft, continuous manner and yields a compact control space that is more flexible than rigid 3D bounding boxes and more category-agnostic than parametric body models.
To initialize the trajectory for each controllable object o, we fit a full-covariance Gaussian to its point cloud P_{o} obtained in the previous step:

\boldsymbol{\mu}_{o}=\frac{1}{N_{o}}\sum_{k}\mathbf{x}_{o,k},\quad\mathbf{\Sigma}_{o}=\frac{1}{N_{o}}\sum_{k}(\mathbf{x}_{o,k}-\boldsymbol{\mu}_{o})(\mathbf{x}_{o,k}-\boldsymbol{\mu}_{o})^{\top}, (5)

which gives an initial Gaussian \mathcal{G}_{o}(\mathbf{x}).
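The moment-matching fit of Eq. (5) is a one-liner in practice; the following sketch (our own naming, assuming an (N_o, 3) object point cloud) also exposes the eigendecomposition used to read off the extent and orientation mentioned above.

```python
import numpy as np

def fit_object_gaussian(P_o):
    """Fit the initial full-covariance Gaussian of Eq. (5) to an (N_o, 3) object point cloud."""
    mu = P_o.mean(axis=0)
    centered = P_o - mu
    sigma = centered.T @ centered / P_o.shape[0]        # maximum-likelihood 3x3 covariance
    extents, axes = np.linalg.eigh(sigma)               # principal spreads and orientation
    return mu, sigma, extents, axes
```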
The low-dimensional parameters \{\boldsymbol{\mu}_{o}^{t},\mathbf{\Sigma}_{o}^{t}\} naturally support flexible, user-driven editing. In practice, we convert each \mathcal{G}_{o}^{t} into an ellipsoid mesh for visualization in a 3D editor such as Blender, and let the user specify or refine the trajectory by dragging and keyframing this ellipsoid in world coordinate space. The edited poses and shapes are mapped back to \{\boldsymbol{\mu}_{o}^{t},\mathbf{\Sigma}_{o}^{t}\} as control signals. The ellipsoids are only a user interface; all conditioning maps used by the model are rendered directly from the underlying 3D Gaussians.
Rendering 4D control maps. Given our 4D Geometric Control, we render per-frame 4D control maps in the target camera views. For each frame t, we generate three types of maps: (i) background RGB/depth, \text{RGB}^{\text{bg}}_{t} and \text{Depth}^{\text{bg}}_{t}, by projecting the static background point cloud P^{\text{bg}} under the camera pose (\mathbf{R}_{t},\mathbf{t}_{t}); (ii) 3D Gaussian trajectory RGB/depth, \text{RGB}^{\text{traj}}_{t} and \text{Depth}^{\text{traj}}_{t}, by projecting the per-object Gaussians \{\mathcal{G}_{o}^{t}\} into soft elliptical footprints and taking depth from the corresponding ellipsoid surfaces; and (iii) a soft merged mask M_{t}, used as a control mask, that indicates regions where the diffusion model should synthesize or overwrite content, obtained by inverting background visibility and merging it with the projected 3D Gaussian footprints, followed by Gaussian smoothing. For the first frame t=1, we replace \text{RGB}^{\text{bg}}_{1} with the input image and set M_{1}=0, so that the first frame is preserved and only future frames are modified. Background and 3D Gaussian maps share the same 4D geometric scene state but are rendered through decoupled channels, disentangling camera motion from object motion while preserving geometric consistency.
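The paper does not spell out its rasterizer, but a common way to obtain such a soft elliptical footprint is to linearize the perspective projection at the Gaussian mean (as in EWA/3DGS-style splatting) and evaluate the resulting 2D Gaussian per pixel. The sketch below illustrates this idea under that assumption; the names and the constant-depth proxy for the ellipsoid surface are ours, not the paper's.

```python
import numpy as np

def splat_gaussian_footprint(mu_w, sigma_w, K, R, t, H, W):
    """Project one world-space 3D Gaussian into a soft footprint plus a depth proxy."""
    mu_c = R @ mu_w + t                                   # mean in camera coordinates
    sigma_c = R @ sigma_w @ R.T
    x, y, z = mu_c
    fx, fy = K[0, 0], K[1, 1]
    J = np.array([[fx / z, 0.0, -fx * x / z**2],          # Jacobian of perspective projection
                  [0.0, fy / z, -fy * y / z**2]])
    sigma_2d = J @ sigma_c @ J.T                          # 2x2 image-plane covariance
    center = (K @ (mu_c / z))[:2]                         # projected mean (u, v)
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    d = np.stack([us - center[0], vs - center[1]], axis=-1)
    maha = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(sigma_2d), d)
    footprint = np.exp(-0.5 * maha)                       # soft elliptical occupancy in [0, 1]
    depth = np.where(footprint > 0.05, z, np.inf)         # crude per-Gaussian depth proxy
    return footprint, depth
```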
3.2 VerseCrafter Architecture
Backbone. We adopt Wan2.1-14B [ 88 ] as a frozen latent video diffusion backbone with a Wan Encoder, a Wan-DiT denoiser and a Wan Decoder. VerseCrafter treats Wan2.1 as a generic video prior: we do not change its architecture or weights, and instead attach a lightweight geometric adapter that conditions the backbone on rendered 4D control maps.
GeoAdapter. We take the rendered background maps and 3D Gaussian trajectory maps, \text{RGB}^{\text{bg}}, \text{Depth}^{\text{bg}}, \text{RGB}^{\text{traj}}, \text{Depth}^{\text{traj}}, together with the soft merged mask M. The four RGB/depth maps are encoded by the same Wan Encoder, while M is reshaped and interpolated to the latent resolution, following the practice in [42, 88]. The encoded RGB/depth maps and the processed mask are concatenated channel-wise to form a spatio-temporal geometry tensor. GeoAdapter is a lightweight DiT-style branch that operates on this geometry tensor. It shares the same hidden dimensionality as the Wan-DiT blocks, but uses far fewer layers. We interleave GeoAdapter blocks with the frozen Wan-DiT: every k-th DiT block in Wan-DiT is paired with a GeoAdapter block whose output is linearly projected and added to the corresponding DiT block as a residual modulation. Text prompts are encoded by umT5 [17] into text embeddings, which are injected into both Wan-DiT and GeoAdapter blocks through the same text-conditioning interfaces. This adapter-based conditioning injects 4D geometric information into Wan2.1 with only a small number of extra parameters, while keeping all backbone weights fixed.
Inference. At inference time, VerseCrafter supports camera-only, object-only, and joint control within a unified framework. For camera-only control, we render the background control maps, while setting the 3D Gaussian trajectory RGB/depth maps to zero. The merged mask may still be nonzero due to viewpoint changes. For object-only control, we keep the camera pose fixed, render static background control maps from P^{\text{bg}}, and render the 3D Gaussian trajectory maps to control object motion. For joint control, both background and trajectory control maps are rendered from the same 4D geometric scene state, enabling coordinated and geometry-consistent control over camera and multi-object motion.
4 VerseControl4D Dataset
To train and evaluate VerseCrafter on complex real-world scenes with 4D Geometric Control, we construct VerseControl4D , a real-world dataset with automatically derived prompts and rendered 4D control maps. As shown in Fig. 3 , VerseControl4D is built through four stages: data collection, clip extraction, quality filtering, and data annotation.
Data collection. VerseControl4D is built from two recent world-exploration datasets, Sekai-Real-HQ [ 56 ] and SpatialVID-HQ [ 92 ] , which provide long in-the-wild videos with diverse outdoor and urban scenes, camera poses, and captions, but lack object-motion labels. We use their high-resolution videos as the raw video pool for constructing our 4D geometric control annotations.
Clip extraction. We apply PySceneDetect to detect shots in the videos. For each shot longer than 81 frames, we uniformly sample an 81-frame sub-clip and discard shorter shots, matching the default temporal length used by the Wan2.1 backbone.
Figure 3 : Construction pipeline of VerseControl4D. Starting from Sekai-Real-HQ and SpatialVID-HQ, we extract 81-frame clips and apply quality filtering. For each retained clip, Qwen2.5-VL-72B, Grounded-SAM2, and MegaSAM provide captions, object masks, depth, and camera trajectory, which are lifted into background/object point clouds, from which 3D Gaussian trajectories are fitted, and then rendered into background/trajectory maps and a soft merged mask that constitute our 4D control maps.
Quality filtering. We apply an object-centric filtering pipeline to retain clips with clean geometry and controllable foreground. Using Grounded-SAM2 with prompts such as “person . human . car . animal”, we first obtain object masks on the first frame, and keep only clips whose controllable object count lies in [1, 6]. We then discard clips where any object mask covers more than 20% of the image area. For human instances, we further remove clips whose masks touch image borders or whose aspect ratios fall outside [2, 4], as these typically correspond to severely truncated pedestrians. Finally, we apply visual-quality filtering based on aesthetic and luminance scores to exclude blurry or over-/under-exposed clips, yielding a set of visually clean, geometrically reliable videos.
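For concreteness, the object-centric rules above can be written as a simple predicate over the first-frame masks. This is an illustrative re-implementation with our own function name; it assumes binary masks and omits the aesthetic/luminance filters.

```python
import numpy as np

def keep_clip(object_masks, labels, H, W):
    """Return True if a clip passes the object-centric filtering rules described above."""
    if not (1 <= len(object_masks) <= 6):                 # controllable object count in [1, 6]
        return False
    for mask, label in zip(object_masks, labels):
        if mask.sum() > 0.20 * H * W:                     # any object covering > 20% of the image
            return False
        ys, xs = np.nonzero(mask)
        if label == "person":
            touches_border = (ys.min() == 0 or xs.min() == 0 or
                              ys.max() == H - 1 or xs.max() == W - 1)
            aspect = (ys.max() - ys.min() + 1) / max(xs.max() - xs.min() + 1, 1)
            if touches_border or not (2.0 <= aspect <= 4.0):  # severely truncated pedestrians
                return False
    return True
```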
Figure 4 : Qualitative comparison of joint camera and object motion control. Perception-as-Control and Uni3C exhibit noticeable human deformation, while Yume roughly follows the text-described motion but lacks precise camera control. Uni3C is also limited to single human. In contrast, VerseCrafter more faithfully follows both the camera trajectory and multi-object motion while maintaining sharp appearance and geometrically consistent backgrounds.
Figure 5 : Qualitative comparison of camera-only motion control on static scenes. ViewCrafter and Voyager exhibit distorted facades, drifting structures, or inaccurate camera motion, while FlashWorld tends to produce blurred scene boundaries and imprecise camera motion. In contrast, VerseCrafter better follows the target camera trajectory while preserving sharp details and globally consistent 3D geometry.
Table 1 : Joint camera and object motion control on VerseControl4D. We report VBench-I2V scores and 3D control metrics (RotErr, TransErr, ObjMC). VerseCrafter achieves the best overall video quality and the most accurate joint control of camera and object motion.
| Method | Overall Score ↑ | Imaging Quality ↑ | Aesthetic Quality ↑ | Dynamic Degree ↑ | Motion Smoothness ↑ | Background Consistency ↑ | Subject Consistency ↑ | I2V Background ↑ | I2V Subject ↑ | RotErr ↓ | TransErr ↓ | ObjMC ↓ |
| Perception-as-Control [15] | 83.66 | 66.81 | 53.34 | 73.91 | 96.89 | 93.19 | 94.02 | 96.35 | 94.78 | 5.006 | 8.767 | 6.556 |
| Yume [63] | 85.47 | 71.16 | 52.39 | 72.24 | 98.96 | 95.66 | 96.43 | 98.51 | 98.39 | 7.560 | 8.735 | 7.959 |
| Uni3C [11] | 83.55 | 68.06 | 53.16 | 66.09 | 98.94 | 93.74 | 94.19 | 97.19 | 97.05 | 1.361 | 7.731 | 5.883 |
| Ours | 88.10 | 72.70 | 57.49 | 86.26 | 98.79 | 95.69 | 96.48 | 98.76 | 98.65 | 0.890 | 3.103 | 2.507 |
Data annotation. For each retained clip, we automatically generate a text prompt and rendered 4D control maps. We first generate a descriptive caption using Qwen2.5-VL-72B [5], which serves as the text prompt. For geometry, we adopt MegaSAM as the base pipeline and replace its monocular and metric depth modules with MoGe-2 [95] and UniDepth V2 [72], respectively, to obtain more accurate and temporally consistent depth. Given the video frames, the estimated depth maps, and the camera trajectory, we reconstruct a 3D point cloud for each frame. Applying Grounded-SAM2 object masks to the per-frame point clouds yields per-object point clouds and a static background point cloud P^{\text{bg}}, as described in Sec. 3.1. For each object, we then fit per-frame 3D Gaussians and form a 3D Gaussian trajectory \{\mathcal{G}_{o}^{t}\}. Finally, we render the 4D Geometric Control into model-ready 4D control maps. The static background point cloud is rendered under the camera trajectory to obtain background RGB/depth maps. The 3D Gaussian trajectories are rendered to obtain 3D Gaussian trajectory RGB/depth maps. We then invert the background mask and merge it with the 3D Gaussian trajectory mask to produce a soft merged mask that marks regions where the video diffusion model should synthesize content.
In total, VerseControl4D contains 35,000 training samples and 1,000 validation samples. In the training set, about 26% of samples are sourced from Sekai-Real-HQ and 74% from SpatialVID-HQ, while 20% of the samples depict static scenes, encouraging VerseCrafter to learn both camera-only world exploration and coupled camera–object dynamics. The validation set includes 250 static-scene samples to specifically assess camera-only control.
5 Experiments
Figure 6 : Ablation on 3D representations for object motion control. We compare object control using 3D point trajectory (top), 3D bounding box (middle), and 3D Gaussian trajectory (bottom). 3D point trajectory and 3D bounding box often cause scale drift and misaligned motion (red boxes), whereas 3D Gaussian trajectory better follows the intended object motion while preserving plausible shapes and background interactions.
Implementation Details. We build VerseCrafter upon the Wan2.1 T2V-14B model. The Wan backbone is kept frozen, and only GeoAdapter is updated. Each GeoAdapter block is initialized from the weights of its paired DiT block in Wan-DiT to stabilize training, and we set k=5 so that every 5th DiT block in Wan-DiT is paired with a GeoAdapter block. We use the Adam optimizer with a learning rate of 2e-5 and a constant learning rate schedule with 100 warmup steps. All experiments are conducted on 16 96-GB GPUs with a global batch size of 16. Training is performed in two stages: we first train for 2,500 iterations on 480P clips, and then fine-tune the same model for another 2,500 iterations on 720P clips. The total wall-clock training time is about 380 hours. We adopt classifier-free guidance during training by randomly dropping the text condition with probability 0.1. At inference time, we use 50 denoising steps and a classifier-free guidance scale of 5.0. Generating an 81-frame 720P video clip on 8 96-GB GPUs takes about 1152 seconds, with a peak per-GPU memory usage of about 90 GB.
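For quick reference, the stated training and inference settings can be collected into a single configuration sketch; the field names below are our own and do not come from the released code.

```python
# Hedged summary of the reported VerseCrafter training/inference settings.
VERSECRAFTER_CONFIG = {
    "backbone": "Wan2.1 T2V-14B (frozen)",
    "trainable_module": "GeoAdapter only",
    "geoadapter_stride_k": 5,              # paired with every 5th Wan-DiT block
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "lr_schedule": "constant with 100 warmup steps",
    "global_batch_size": 16,
    "stage_1": {"iterations": 2500, "resolution": "480P"},
    "stage_2": {"iterations": 2500, "resolution": "720P"},
    "text_dropout_prob": 0.1,              # classifier-free guidance training
    "inference": {"denoising_steps": 50, "cfg_scale": 5.0, "frames": 81},
}
```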
Table 2 : Camera-only motion control on static scenes. On the static subset of VerseControl4D, we report VBench-I2V scores and camera control metrics (RotErr, TransErr). VerseCrafter achieves the best overall visual quality while substantially reducing camera pose errors.
| Method | Overall Score ↑ | Imaging Quality ↑ | Aesthetic Quality ↑ | Dynamic Degree ↑ | Motion Smoothness ↑ | Background Consistency ↑ | Subject Consistency ↑ | I2V Background ↑ | I2V Subject ↑ | RotErr ↓ | TransErr ↓ |
| ViewCrafter [117] | 84.04 | 69.56 | 55.52 | 68.02 | 97.86 | 92.09 | 94.25 | 97.70 | 97.29 | 2.101 | 9.868 |
| Voyager [41] | 78.12 | 55.48 | 49.80 | 65.34 | 99.39 | 92.31 | 91.55 | 86.02 | 85.03 | 3.557 | 3.880 |
| FlashWorld [54] | 85.33 | 71.68 | 58.74 | 73.46 | 98.35 | 94.27 | 92.47 | 95.38 | 98.32 | 1.792 | 3.257 |
| Ours | 86.80 | 74.57 | 54.78 | 80.34 | 97.62 | 94.88 | 95.55 | 97.86 | 98.79 | 0.650 | 2.587 |
Table 3 : Ablation study on 3D representation, depth, and decoupled controls. We compare different variants of VerseCrafter using VBench-I2V and 3D control metrics (RotErr, TransErr, ObjMC). Our full model with 3D Gaussian trajectories, depth-aware rendering, and decoupled background/foreground controls achieves the best visual quality and the most accurate camera and object motion control.
| Method | Overall Score ↑ | Imaging Quality ↑ | Aesthetic Quality ↑ | Dynamic Degree ↑ | Motion Smoothness ↑ | Background Consistency ↑ | Subject Consistency ↑ | I2V Background ↑ | I2V Subject ↑ | RotErr ↓ | TransErr ↓ | ObjMC ↓ |
| Ours (3D Bounding Box) | 85.45 | 69.23 | 55.70 | 78.57 | 98.70 | 92.92 | 93.27 | 97.74 | 97.48 | 1.350 | 3.805 | 4.520 |
| Ours (3D Point Trajectory) | 85.57 | 70.29 | 55.27 | 78.23 | 98.63 | 94.00 | 92.75 | 97.85 | 97.55 | 1.298 | 3.281 | 6.896 |
| Ours (w/o depth) | 85.64 | 70.19 | 55.00 | 80.60 | 98.66 | 92.07 | 92.83 | 98.07 | 97.69 | 1.177 | 3.900 | 4.929 |
| Ours (BG & FG Merged) | 85.72 | 69.19 | 54.86 | 83.72 | 98.65 | 91.15 | 92.86 | 97.93 | 97.41 | 1.080 | 3.803 | 3.726 |
| Ours | 88.10 | 72.70 | 57.49 | 86.26 | 98.79 | 95.69 | 96.48 | 98.76 | 98.65 | 0.890 | 3.103 | 2.507 |
Evaluation Metrics. We evaluate overall video quality using VBench-I2V. For camera control, we follow CameraCtrl [ 33 ] and report rotation error (RotErr) and translation error (TransErr). For object-motion control, we adopt ObjMC proposed in MotionCtrl [ 98 ] . Given a generated video, we apply the same geometry annotation pipeline used for VerseControl4D to estimate its camera trajectory and 3D Gaussian trajectories, and compare them with the corresponding ground-truth trajectories from our dataset. ObjMC is computed as the average Euclidean distance between the estimated and ground-truth 3D Gaussian means over all controlled objects and frames.
5.1 Joint Camera and Object Motion Control
We first evaluate joint control of camera and object motion on VerseControl4D. As shown in Table 1 , VerseCrafter achieves the best VBench-I2V scores among all compared methods, with clear gains in Overall Score, Imaging Quality, Aesthetic Quality, and both subject and background consistency. On 3D control metrics, VerseCrafter substantially reduces rotation, translation, and object-motion errors compared with the best-performing baseline, reflecting much tighter alignment with the target 4D geometric control. Qualitative comparisons in Fig. 4 further highlight these differences: Perception-as-Control and Uni3C exhibit noticeable human deformation, while Yume roughly follows the text-described motion but lacks precise camera control. Uni3C, relying on SMPL-X, is limited to single-person motion and struggles with other categories such as vehicles. In contrast, VerseCrafter more faithfully follows both the camera trajectory and 3D Gaussian trajectories while maintaining sharp appearance and geometrically consistent backgrounds.
Figure 7 : Ablation on depth-aware control. We compare VerseCrafter without depth inputs ( Ours (w/o depth) , top) and with RGB+depth inputs (middle) under the same camera trajectory. Without depth, the model often produces incorrect foreground-background ordering, e.g., lampposts are pulled in front of distant buildings, and occlusion boundaries drift over time (red boxes). With RGB+depth, the model recovers consistent parallax and occlusion, producing geometry much closer to the ground truth.
5.2 Camera-Only Motion Control
We evaluate camera-only control on the static-scene subset of VerseControl4D, where objects remain stationary and only the camera moves. As shown in Table 2 , VerseCrafter achieves the best VBench-I2V performance among all compared methods, with consistent gains in Overall Score, Imaging Quality, and both subject and background consistency, while maintaining motion smoothness comparable to prior methods. On 3D camera metrics, VerseCrafter substantially reduces rotation and translation errors relative to the best-performing baseline, indicating that it follows the target camera trajectory much more faithfully in static scenes. Qualitative comparisons in Fig. 5 further confirm these trends: ViewCrafter and Voyager exhibit distorted facades, drifting structures, or inaccurate camera motion, while FlashWorld tends to produce blurred scene boundaries and imprecise camera motion. In contrast, VerseCrafter preserves straight structures, stable depth relationships, and an appearance closer to the ground-truth video, evidencing precise camera control in a static scene.
Figure 8 : Ablation on decoupled background and foreground controls. We compare a variant that merges background and foreground controls into a single map ( Ours (BG & FG Merged) , top) with our default decoupled design (middle). When the controls are merged, object motion control degrades significantly (red boxes), whereas the decoupled design better preserves the static background and produces more accurate and stable object motion.
5.3 Ablation Study
We conduct ablations to analyze three key design choices in VerseCrafter: (i) the 3D representations for object motion, (ii) the use of depth in control maps, and (iii) the decoupling of background and foreground controls. All variants share the same training data, backbone, and optimization settings; only the control design is changed.
3D representations for object motion. To isolate the effect of our motion representation, we derive two alternatives from per-object 3D Gaussian trajectories: (1) an oriented 3D bounding box whose axes follow the Gaussian’s principal directions and whose side lengths are scaled by its principal spreads; and (2) a 3D point trajectory that retains only the Gaussian centroid. The rest of the pipeline is unchanged: we rasterize cuboids (for boxes) or tiny disks/spheres (for points) instead of Gaussian ellipses. As reported in Table 3 , replacing Gaussians with boxes slightly hurts both visual quality and control accuracy, while point trajectories give the weakest object-motion consistency. Qualitatively (Fig. 6 ), points and boxes often cause scale drift and misaligned motion, whereas 3D Gaussian trajectories better follow the intended object trajectories and preserve plausible object shapes.
Depth-aware Control. To evaluate the effect of depth, we remove the depth channels from the background and trajectory controls (“Ours (w/o depth)” in Table 3). This variant yields a lower Overall Score and significantly worse 3D control (higher RotErr and ObjMC values). As shown in Fig. 7, without depth, the model produces incorrect foreground-background ordering: vertical structures like streetlights appear beside shelves in the foreground, while buildings that should be behind the character are positioned elsewhere, and occlusion boundaries drift over time. With RGB+depth, VerseCrafter recovers more consistent parallax and occlusion, producing geometry much closer to the ground truth.
Decoupled vs. merged controls. We further compare our decoupled design with a variant that merges the background and 3D Gaussian trajectory maps into a single control stream (Ours (BG & FG Merged) in Table 3). Although this variant still benefits from the explicit 4D geometric scene state, it consistently underperforms the full model, with a particularly noticeable drop in object-motion accuracy. As shown in Fig. 8, merging the controls leads to clear degradation in object motion control. In contrast, keeping the decoupled design preserves static geometry while producing more precise and stable object motion, which is crucial for accurate and geometry-consistent control.
6 Conclusion
We present VerseCrafter , a geometry-driven video world model built upon an explicit 4D Geometric Control , represented by a static background point cloud and per-object 3D Gaussian trajectories in a shared world coordinate frame. Coupled with GeoAdapter, which conditions a frozen Wan2.1 backbone on rendered 4D control maps, this design enables high-fidelity video generation with precise, disentangled control over camera and multi-object motion. To support training and evaluation, we construct VerseControl4D , a real-world dataset with automatically derived prompts and rendered 4D control maps, comprising 35K training samples. Experiments and ablations show that VerseCrafter delivers superior visual quality and more accurate joint camera and object motion than existing controllable video generators and video world models, highlighting 4D Geometric Control as a promising interface for future work on dynamic world simulation and editing.
Acknowledgments.
This work is supported by the Shanghai Municipal Science and Technology Major Project (2025SHZDZX025G02).
Figure 9 : Detailed architecture of VerseCrafter. Background RGB & depth maps and 3D Gaussian trajectory RGB & depth maps are first encoded by the frozen Wan Encoder. The soft merged mask is rearranged into latent-aligned channels, and all geometry latents are then concatenated along the channel dimension to form a unified spatio-temporal geometry feature. This feature is patchified into tokens and processed by the GeoAdapter branch. At selected Wan-DiT blocks, GeoAdapter outputs are passed through zero-initialized linear layers and added to the backbone tokens as residual modulations, enabling geometry-consistent control over camera motion and multi-object motion.
Appendix A Preliminary: Video Diffusion Models
Modern video diffusion models operate in a compact latent space learned by a spatio-temporal VAE. Given a video x\in\mathbb{R}^{T\times H\times W\times 3}, the encoder E maps it to latent features z_{0}=E(x)\in\mathbb{R}^{T^{\prime}\times C\times H^{\prime}\times W^{\prime}}, on which the generative process is defined [79, 8]. A standard forward diffusion process gradually perturbs z_{0} into noisy variables z_{t} via

q(z_{t}\mid z_{0})=\sqrt{\alpha_{t}}\,z_{0}+\sqrt{1-\alpha_{t}}\,\epsilon,\quad\epsilon\sim\mathcal{N}(0,I), (6)

and a denoiser \epsilon_{\theta} is trained to predict the noise under time step t and conditioning signal c (e.g., text prompts or reference frames) as

\mathcal{L}_{\text{diff}}(\theta)=\mathbb{E}_{z_{0},t,\epsilon}\big[\|\epsilon_{\theta}(z_{t},t,c)-\epsilon\|_{2}^{2}\big], (7)

following the DDPM formulation [37]. Recent video generators further adopt continuous-time flow matching. Given clean latents z_{0} and a Gaussian sample z_{1}, one defines linear interpolants z_{\tau}=(1-\tau)z_{0}+\tau z_{1} with \tau\in[0,1] and learns a velocity field v_{\theta} by

\mathcal{L}_{\text{flow}}(\theta)=\mathbb{E}_{z_{0},\tau,\epsilon}\big[\|v_{\theta}(z_{\tau},\tau,c)-(z_{1}-z_{0})\|_{2}^{2}\big], (8)

as in flow-matching and related ODE-based generative formulations [59, 43]. These objectives are typically implemented with Diffusion Transformers (DiT), which operate on spatio-temporal latent tokens and inject (t,c) through attention [71], forming the backbone of current foundation video generators.
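As a small illustration of the flow-matching objective in Eq. (8), the following PyTorch-style sketch computes one training loss for a generic velocity network v_theta; it is a didactic example under our own naming, not Wan2.1's actual training code.

```python
import torch

def flow_matching_loss(v_theta, z0, cond):
    """One flow-matching step (Eq. 8): regress the velocity (z1 - z0) along a linear path."""
    z1 = torch.randn_like(z0)                              # Gaussian sample
    tau = torch.rand(z0.shape[0], device=z0.device)        # per-sample time in [0, 1]
    tau_b = tau.view(-1, *([1] * (z0.dim() - 1)))
    z_tau = (1.0 - tau_b) * z0 + tau_b * z1                # linear interpolant z_tau
    v_pred = v_theta(z_tau, tau, cond)                     # predicted velocity field
    return ((v_pred - (z1 - z0)) ** 2).mean()
```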
Wan2.1 instantiates the above latent video diffusion / flow-matching paradigm with a 3D VAE and a DiT-based denoiser, together with rich multi-modal conditioning trained on large-scale, diverse video–text data [88]. In VerseCrafter, we adopt Wan2.1-14B as a frozen latent video diffusion backbone and treat it as a generic video prior. Specifically, we keep the Wan Encoder, Wan-DiT, and Wan Decoder unchanged, and attach a lightweight geometry-aware control interface, namely GeoAdapter, to selected Wan-DiT blocks. The detailed architecture of VerseCrafter is provided in Sec. B.
Appendix B Model Architecture Details
VerseCrafter is built on top of the Wan2.1 T2V-14B backbone [ 88 ] , a latent video diffusion / flow-matching model with a 3D VAE (Wan Encoder and Wan Decoder) and a DiT-based denoiser (Wan-DiT). As shown in Fig. 9 , we keep the Wan2.1 backbone frozen and introduce a geometry-aware conditioning pathway with a lightweight GeoAdapter that conditions selected Wan-DiT blocks on rendered 4D control maps. Table 4 summarizes the input resolution, number of Wan-DiT layers, hidden dimension, GeoAdapter injection pattern, and fine-tuning configuration of VerseCrafter.
Table 4 : Model configuration of VerseCrafter. Settings include the final output resolution, number of Wan-DiT layers, GeoAdapter injection blocks, pre-trained backbone, and training schedule.
| Setting | VerseCrafter |
| Final resolution | 720P |
| Num. Wan-DiT layers | 40 |
| GeoAdapter injection blocks | [0, 5, 10, 15, 20, 25, 30, 35] |
| Pre-trained backbone | Wan2.1 T2V-14B |
| Hidden dimension | 5120 |
| Batch size | 16 |
| Training schedule | 2,500 it. @480P + 2,500 it. @720P |
Geometry encoding and tokenization. For each frame t, we render background RGB/depth \mathrm{RGB}^{\mathrm{bg}}_{t}, \mathrm{Depth}^{\mathrm{bg}}_{t}, 3D Gaussian trajectory RGB/depth \mathrm{RGB}^{\mathrm{traj}}_{t}, \mathrm{Depth}^{\mathrm{traj}}_{t}, and a soft merged mask M_{t} that marks regions where the diffusion model should synthesize or overwrite content. For t=1, we replace \mathrm{RGB}^{\mathrm{bg}}_{1} with the input image and set M_{1}=0. The four RGB/depth maps are encoded by the frozen Wan Encoder to obtain latent features at the VAE resolution, while the mask M\in\mathbb{R}^{1\times T\times H\times W} is rearranged to align with the latent grid of the Wan Encoder (the “Rearrange” module in Fig. 9). Let s_{t}, s_{h}, and s_{w} denote the temporal and spatial strides of the Wan Encoder (we use s_{t}=4 and s_{h}=s_{w}=8). Following the practice in [42, 88], we drop the singleton channel dimension, split the spatial dimensions into s_{h}\times s_{w} sub-cells, and fold these sub-cells into the channel dimension via a reshape–permute operation, yielding a tensor of shape C_{M}\times T\times H^{\prime}\times W^{\prime} with C_{M}=s_{h}s_{w}, H^{\prime}=H/s_{h}, and W^{\prime}=W/s_{w}. We then downsample the temporal dimension using nearest-neighbor interpolation to match the latent depth T^{\prime}=(T+s_{t}-1)/s_{t}, producing \hat{M}\in\mathbb{R}^{C_{M}\times T^{\prime}\times H^{\prime}\times W^{\prime}}. Finally, \hat{M} is concatenated channel-wise with the encoded background and 3D Gaussian trajectory latents to form a unified spatio-temporal geometry feature \mathcal{G}\in\mathbb{R}^{T^{\prime}\times H^{\prime}\times W^{\prime}\times C_{\mathcal{G}}}. Following Wan-DiT, we partition \mathcal{G} into non-overlapping 3D patches and linearly project each patch into a token embedding, yielding a sequence of geometry tokens \mathbf{g}\in\mathbb{R}^{L\times D}, where L=T^{\prime}H^{\prime}W^{\prime} is the number of spatio-temporal patches and D matches the hidden width of Wan-DiT. Because we use identical strides, positional encodings, and patch sizes, the geometry tokens are spatially and temporally aligned with the latent video tokens processed by Wan-DiT.
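The mask rearrangement described above is essentially a pixel-unshuffle into s_h × s_w channels followed by temporal nearest-neighbor downsampling. A minimal PyTorch sketch of that step, assuming the stated strides and with our own function name, is:

```python
import torch
import torch.nn.functional as F

def rearrange_mask(M, s_t=4, s_h=8, s_w=8):
    """Fold the soft mask into latent-aligned channels, as described above.

    M: (1, T, H, W) soft merged mask.
    Returns (s_h*s_w, T', H/s_h, W/s_w) with T' = (T + s_t - 1) // s_t.
    """
    _, T, H, W = M.shape
    m = M[0]                                               # drop singleton channel -> (T, H, W)
    m = m.view(T, H // s_h, s_h, W // s_w, s_w)            # split spatial dims into sub-cells
    m = m.permute(2, 4, 0, 1, 3).reshape(s_h * s_w, T, H // s_h, W // s_w)
    T_lat = (T + s_t - 1) // s_t
    m = F.interpolate(m.unsqueeze(0), size=(T_lat, H // s_h, W // s_w),
                      mode="nearest").squeeze(0)           # temporal nearest-neighbor downsample
    return m
```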
GeoAdapter integration. GeoAdapter is a lightweight DiT-style branch that operates on geometry tokens \mathbf{g}. It shares the same hidden dimensionality and positional encodings as Wan-DiT, but contains far fewer layers. Let \{\mathcal{B}_{1},\dots,\mathcal{B}_{N}\} denote N Wan-DiT blocks, and let \{\mathcal{G}_{1},\dots,\mathcal{G}_{M}\} denote M GeoAdapter blocks. We attach GeoAdapter as a residual modulation branch to a subset of Wan-DiT blocks. Concretely, we choose a stride k and inject GeoAdapter after every k-th Wan-DiT block; see Table 4 for the exact injection pattern and configuration. For each Wan-DiT block \mathcal{B}_{n} whose index n belongs to the injection set, with input tokens \mathbf{x}_{n}\in\mathbb{R}^{L\times D} and geometry tokens \mathbf{g}, we add a geometry-conditioned residual of the form

\mathbf{x}_{n+1}=\mathcal{B}_{n}(\mathbf{x}_{n})+\mathcal{G}_{m}(\mathbf{g})\,\mathbf{W}^{(m)}_{0}, (9)

where \mathcal{G}_{m} is the corresponding GeoAdapter block and \mathbf{W}^{(m)}_{0}\in\mathbb{R}^{D\times D} is its output projection. Each GeoAdapter block is initialized from the weights of its paired Wan-DiT block for stable training, while \mathbf{W}^{(m)}_{0} is initialized to zero. As a result, VerseCrafter behaves identically to the original Wan2.1 backbone at the beginning of training. During fine-tuning, \mathbf{W}^{(m)}_{0} gradually learns to inject geometry information through residual modulation, in the spirit of zero-initialized adapter designs such as ControlNet [120].
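Eq. (9) amounts to a standard zero-initialized adapter residual. A hedged PyTorch sketch of the injection (module names are illustrative; the real Wan-DiT and GeoAdapter blocks are far richer) could look like:

```python
import torch
import torch.nn as nn

class GeoInjection(nn.Module):
    """Residual modulation of Eq. (9): x_{n+1} = B_n(x_n) + G_m(g) W_0, with W_0 zero-initialized."""

    def __init__(self, dit_block: nn.Module, geo_block: nn.Module, dim: int):
        super().__init__()
        self.dit_block = dit_block            # frozen Wan-DiT block B_n
        self.geo_block = geo_block            # trainable GeoAdapter block G_m
        self.out_proj = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.out_proj.weight)  # model starts identical to the frozen backbone

    def forward(self, x, g):
        return self.dit_block(x) + self.out_proj(self.geo_block(g))
```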
Table 5 : VerseControl4D data split and scene-type statistics. We report the number of samples from each source dataset and split. Dynamic scenes contain coupled camera motion and foreground object motion, while static scenes have negligible object motion and are used for camera-only evaluation.
| Split | Sekai-Real-HQ (Dynamic Scenes) | SpatialVID-HQ (Static Scenes) | SpatialVID-HQ (Dynamic Scenes) |
| Train | 9,071 | 7,000 | 18,929 |
| Validation | 468 | 250 | 282 |
Figure 10 : VerseControl4D dataset examples. For each clip, we visualize the input image and target camera trajectory (left), followed by several frames of ground-truth video and our rendered control signals (right): background RGB/depth, 3D Gaussian trajectory RGB/depth for controlled objects, and the final merged mask. These signals are automatically derived by the annotation pipeline described in the main paper.
Appendix C VerseControl4D Dataset Details
We construct VerseControl4D , a large-scale real-world video dataset with automatically derived prompts and rendered 4D control maps. As described in the main paper, VerseControl4D is built through four stages: data collection, clip extraction, quality filtering, and data annotation. The rendered 4D control maps comprise background RGB/depth maps, 3D Gaussian trajectory RGB/depth maps, and a soft merged mask.
VerseControl4D contains 35,000 training samples and 1,000 validation samples. Table 5 summarizes the data distribution by source dataset and scene type. Overall, 26% of the samples come from Sekai-Real-HQ and 74% from SpatialVID-HQ, reflecting their complementary scene coverage. To support both camera-only world exploration and joint camera-object control, VerseControl4D includes dynamic scenes (clips with salient foreground object motion together with camera motion) and static scenes (clips with negligible object motion and only camera motion). About 20% of the training samples are static scenes , and the validation set includes 250 static-scene samples for dedicated camera-only evaluation. Representative samples and their rendered 4D control signals are shown in Fig. 10 .
Appendix D Evaluation Metrics
D.1 VBench-I2V
We evaluate image-to-video generation quality using the VBench Image-to-Video (I2V) evaluation suite, denoted as VBench-I2V. For each generated clip, we follow the official VBench-I2V protocol: the conditioning image and its corresponding generated video are fed into the evaluation pipeline, which computes a set of learned, human-aligned metrics that jointly capture video-image consistency and perceptual video quality. In our experiments, we report the following eight VBench-I2V dimensions, and define the Overall Score as the simple arithmetic mean of these eight normalized scores, where higher values indicate better performance:
• Imaging Quality. This metric measures low-level image fidelity, including sharpness and the absence of artifacts such as blur, noise, or overexposure. VBench uses an image quality predictor (e.g., MUSIQ) and averages its scores across frames to obtain a video-level imaging-quality score.
• Aesthetic Quality. This dimension assesses the artistic and aesthetic appeal of individual frames, including composition, color harmony, and realism. VBench applies an aesthetic quality predictor (e.g., the LAION aesthetic model) to each frame and averages the predictions over the clip.
• Dynamic Degree. This metric quantifies how dynamic the generated video is. Optical flow magnitudes (e.g., estimated by RAFT) are used to measure the amount of motion, and the score reflects whether the model produces sufficiently active (non-static) content.
• Motion Smoothness. This metric evaluates whether subject and camera motion evolve smoothly and follow reasonable physical dynamics. VBench leverages a pre-trained video frame interpolation prior to assess how well intermediate motion can be interpolated, with smoother and more physically plausible motion receiving higher scores.
• Background Consistency. This dimension measures the temporal stability of background layout and texture. Frame-level features (e.g., CLIP) are compared across time; large feature variations indicate flickering or unstable backgrounds and lead to lower scores.
• Subject Consistency. This dimension evaluates the temporal consistency of the foreground subject within the video, regardless of the input image. VBench computes subject-region features across frames and measures their similarity over time to penalize identity drift or sudden appearance changes.
• I2V Background (Video–Image Background Consistency). This metric evaluates how well the global background in the video matches the background in the input image, especially for scene-centric inputs. VBench uses background-sensitive features (e.g., DreamSim) and aggregates image–frame and inter-frame similarities into a single background consistency score.
• I2V Subject (Video–Image Subject Consistency). This metric measures how well the main subject in the generated video matches the subject in the input image. VBench extracts high-level visual features (e.g., DINO) from the conditioning image and from each video frame, and combines image–frame similarities with inter-frame similarities into a weighted average subject consistency score.
Formally, given these eight per-dimension scores \{s_{k}\}_{k=1}^{8} returned by VBench-I2V for a video, we define

\text{Overall Score}=\frac{1}{8}\sum_{k=1}^{8}s_{k}, (10)
which is the value reported as “Overall Score” in the main paper.
D.2 Rotation Error (RotErr)
To measure how well the generated camera motion follows the ground-truth camera trajectory, we adopt the camera-alignment metric from CameraCtrl [33]. For each generated video, we estimate its camera trajectory using the same geometry-annotation pipeline used for VerseControl4D, yielding rotation matrices \{\mathbf{R}^{j}_{\mathrm{gen}}\}_{j=1}^{n} and translation vectors \{\mathbf{T}^{j}_{\mathrm{gen}}\}_{j=1}^{n}, where n is the number of frames. Let \{\mathbf{R}^{j}_{\mathrm{gt}}\}_{j=1}^{n} denote the corresponding ground-truth rotation matrices. The rotation error is computed by comparing the ground-truth and generated rotation matrices at each frame:

\mathrm{RotErr}=\sum_{j=1}^{n}\arccos\!\left(\frac{\operatorname{tr}\!\big(\mathbf{R}^{j}_{\mathrm{gen}}{\mathbf{R}^{j}_{\mathrm{gt}}}^{\top}\big)-1}{2}\right), (11)

where \operatorname{tr}(\cdot) denotes the matrix trace. Lower RotErr indicates better alignment between the generated and ground-truth camera orientations.
D.3 Translation Error (TransErr)
We also evaluate the accuracy of the generated camera positions. Let \{\mathbf{T}^{j}_{\mathrm{gt}}\}_{j=1}^{n} and \{\mathbf{T}^{j}_{\mathrm{gen}}\}_{j=1}^{n} be the ground-truth and generated camera translation vectors for a video with n frames. Following CameraCtrl [33], the translation error is defined as the sum of per-frame Euclidean distances between the translation vectors:

\mathrm{TransErr}=\sum_{j=1}^{n}\bigl\|\mathbf{T}^{j}_{\mathrm{gt}}-\mathbf{T}^{j}_{\mathrm{gen}}\bigr\|_{2}. (12)
Lower TransErr indicates that the generated camera positions more closely match the ground-truth camera positions.
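Both camera metrics of Eqs. (11)-(12) reduce to a few lines of NumPy. The sketch below, with our own naming, sums the per-frame errors for one video and clips the arccos argument for numerical safety.

```python
import numpy as np

def camera_errors(R_gen, R_gt, T_gen, T_gt):
    """Per-video rotation/translation errors of Eqs. (11)-(12), summed over frames."""
    rot_err, trans_err = 0.0, 0.0
    for Rg, Rt, Tg, Tt in zip(R_gen, R_gt, T_gen, T_gt):
        cos = (np.trace(Rg @ Rt.T) - 1.0) / 2.0
        rot_err += np.arccos(np.clip(cos, -1.0, 1.0))      # clip guards against numerical overflow
        trans_err += np.linalg.norm(Tt - Tg)
    return rot_err, trans_err
```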
D.4 Object Motion Control (ObjMC)
For object-motion control, we follow the ObjMC metric proposed in MotionCtrl [ 98 ] and extend it to the multi-object setting under our 3D Gaussian trajectory representation. Given a generated video, we run the same geometry annotation pipeline used for VerseControl4D to estimate per-object 3D Gaussian trajectories, and compare them with the corresponding ground-truth trajectories from our dataset.
Let N_{\text{gt}} and N_{\text{pred}} denote the numbers of ground-truth and predicted controlled objects in a sample, and let T be the number of frames. For each ground-truth object o\in\{1,\dots,N_{\text{gt}}\} and frame t\in\{1,\dots,T\}, we denote the ground-truth 3D Gaussian mean by \boldsymbol{\mu}^{(t)}_{o}\in\mathbb{R}^{3} and the estimated mean from the generated video by \hat{\boldsymbol{\mu}}^{(t)}_{k}\in\mathbb{R}^{3} for a predicted object k.
Multi-object matching.
Since N_{\text{gt}} and N_{\text{pred}} may differ, we first define the trajectory distance between a ground-truth object o and a predicted object k as the average Euclidean distance between their 3D Gaussian means over time:

d(o,k)=\frac{1}{T}\sum_{t=1}^{T}\bigl\|\hat{\boldsymbol{\mu}}^{(t)}_{k}-\boldsymbol{\mu}^{(t)}_{o}\bigr\|_{2}. (13)

We then build a cost matrix \mathbf{C}\in\mathbb{R}^{N_{\text{gt}}\times N_{\text{pred}}} with entries C_{ok}=d(o,k). To handle unmatched objects, we pad this matrix with dummy rows and columns and fill them with a constant penalty \lambda (set to 10.0 m in our experiments). Finally, we apply the Hungarian algorithm [47] to obtain an optimal one-to-one matching between ground-truth and predicted trajectories. This step assigns each ground-truth object either to a predicted trajectory or to a dummy entry when no suitable match exists.
ObjMC score.
Given the optimal matching, we define the per-object trajectory error for a ground-truth object o as

d_{o}=\begin{cases}d(o,k)&\text{if }o\text{ is matched to a predicted object }k,\\ \lambda&\text{if }o\text{ is unmatched},\end{cases} (14)
and compute the final ObjMC score as the average over all ground-truth controlled objects:
\text{ObjMC}=\frac{1}{N_{\text{gt}}}\sum_{o=1}^{N_{\text{gt}}}d_{o}.
(15)
Lower ObjMC indicates more accurate multi-object 3D motion control, and the unmatched penalty $\lambda$ penalizes missed objects under the one-to-one matching formulation.
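To illustrate how the pieces fit together, the following sketch combines Eqs. (13)–(15) using SciPy's Hungarian solver (linear_sum_assignment). The (N, T, 3) array layout and the function name are assumptions for illustration; this is not the authors' released implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def objmc(mu_gt: np.ndarray, mu_pred: np.ndarray, penalty: float = 10.0) -> float:
    """ObjMC under one-to-one matching (Eqs. 13-15).

    mu_gt:   (N_gt, T, 3) ground-truth 3D Gaussian means per object and frame.
    mu_pred: (N_pred, T, 3) means estimated from the generated video.
    penalty: cost lambda assigned to unmatched ground-truth objects (in metres).
    """
    n_gt, n_pred = mu_gt.shape[0], mu_pred.shape[0]

    # Trajectory distance d(o, k): per-frame Euclidean distance averaged over time (Eq. 13).
    dist = np.linalg.norm(mu_gt[:, None] - mu_pred[None, :], axis=-1).mean(axis=-1)  # (N_gt, N_pred)

    # Pad to a square cost matrix so unmatched objects fall back to dummy entries.
    size = n_gt + n_pred
    cost = np.full((size, size), penalty)
    cost[:n_gt, :n_pred] = dist

    rows, cols = linear_sum_assignment(cost)

    # Per-object error d_o (Eq. 14): matched distance, or the penalty if unmatched.
    d_o = np.full(n_gt, penalty)
    for r, c in zip(rows, cols):
        if r < n_gt and c < n_pred:
            d_o[r] = dist[r, c]

    # Average over all ground-truth controlled objects (Eq. 15).
    return float(d_o.mean())
```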
Appendix E
Additional Qualitative Results
We provide additional qualitative comparisons on VerseControl4D, following the same evaluation settings and baselines as in the main paper. Fig. 11 and Fig. 12 present dynamic scenes with joint camera and multi-object motion control. Perception-as-Control and Uni3C often exhibit noticeable object deformation, while Yume roughly follows the text-described motion but lacks precise camera control. Uni3C is also limited to a single human and does not generalize well to diverse multi-object scenarios. In contrast, VerseCrafter more faithfully follows both the camera trajectory and multi-object motion while maintaining sharp appearance and geometrically consistent backgrounds.
Fig. 13 and Fig. 14 present static scenes for camera-only motion control. ViewCrafter, Voyager and FlashWorld often exhibit distorted facades, drifting structures, or inaccurate camera motion. In contrast, VerseCrafter better follows the target camera trajectory while preserving sharp details and globally consistent 3D geometry. These additional examples further demonstrate VerseCrafter’s robustness under real-world 4D Geometric Control in both dynamic and static settings.
Figure 11 : Additional qualitative comparison of joint camera and object motion control. Perception-as-Control and Uni3C exhibit noticeable object deformation, while Yume roughly follows the text-described motion but lacks precise camera control. Uni3C is also limited to a single human. In contrast, VerseCrafter more faithfully follows both the camera trajectory and multi-object motion while maintaining sharp appearance and geometrically consistent backgrounds.
Figure 12 : Additional qualitative comparison of joint camera and object motion control. Perception-as-Control and Uni3C exhibit noticeable object deformation, while Yume roughly follows the text-described motion but lacks precise camera control. Uni3C is also limited to a single human. In contrast, VerseCrafter more faithfully follows both the camera trajectory and multi-object motion while maintaining sharp appearance and geometrically consistent backgrounds.
Figure 13 : Additional qualitative comparison of camera-only motion control on static scenes. ViewCrafter, Voyager and FlashWorld exhibit distorted facades, drifting structures, or inaccurate camera motion. In contrast, VerseCrafter better follows the target camera trajectory while preserving sharp details and globally consistent 3D geometry.
Figure 14 : Additional qualitative comparison of camera-only motion control on static scenes. ViewCrafter, Voyager and FlashWorld exhibit distorted facades, drifting structures, or inaccurate camera motion. In contrast, VerseCrafter better follows the target camera trajectory while preserving sharp details and globally consistent 3D geometry.
Appendix F
Additional Analysis of Control Fidelity and Robustness
We further provide targeted qualitative analyses to clarify the fidelity, scope, and robustness of our 4D Geometric Control. Specifically, we analyze orientation controllability, dynamic background modeling, articulated and non-rigid object controllability, the effect of multi-view input, and robustness to monocular-depth errors.
F.1
Control Fidelity and Boundary Cases
Figure 15 : Additional analysis of control fidelity and boundary cases. (a) Orientation controllability. Two success cases on rigid anisotropic objects and one failure case on a human-like subject. (b) Dynamic background modeling. Two success cases on moderate background dynamics and one failure case on highly non-rigid background motion. (c) Articulated and non-rigid object controllability. Two examples showing effective coarse object-level control on articulated and non-rigid objects.
Orientation controllability. Our representation provides ellipsoid-level orientation control through the principal axes of each 3D Gaussian, rather than fine-grained 6D pose control. As shown in Fig. 15 (a), this control is reliable for strongly anisotropic rigid objects such as cars and buses, where rotation induces clear changes in rendered footprint and depth, leading to stable orientation cues after 3D-to-2D rendering. However, it can fail for human-like subjects approximated by a single ellipsoid. In such cases, heading changes mainly correspond to rotation around the ellipsoid’s major principal axis, and when the other two axes are similar, the projected footprint/depth variation becomes subtle and ambiguous. As a result, the geometric cue may be too weak to fully determine facing direction, and the diffusion prior may dominate, occasionally causing heading mismatches.
Dynamic background modeling. Our background point cloud is reconstructed from the first frame and therefore serves as a mostly static geometric scaffold. It anchors scene geometry under viewpoint changes, but does not explicitly model per-frame non-rigid background deformation. Fig. 15 (b) shows that this design still works well for moderate dynamic-background effects such as wind-swaying grass and a flowing river, where the diffusion prior can synthesize plausible temporal variation while the 4D controls maintain camera and object consistency. In contrast, the waterfall example exhibits weaker motion. This failure is expected because fine, texture-dominant, highly non-rigid dynamics are only weakly constrained by a static 3D scaffold after rendering to 2D control maps. Thus, VerseCrafter currently handles dynamic backgrounds mainly through the interaction between static geometry anchoring and the video prior, rather than through explicit background dynamics modeling.
Articulated and non-rigid object controllability. VerseCrafter uses a single 3D Gaussian per controlled object and guides its motion coarsely by changing its position, scale, and orientation over time. This representation is not designed to explicitly encode part-level articulation. Nevertheless, Fig. 15 (c) shows that it remains effective for object-level motion control in both articulated and non-rigid scenarios, including a robotic arm extension and wind-blown clothes. Although the guidance is coarse, the generated videos follow the intended object-level motion while remaining visually coherent. These examples suggest that even a simple object-level 3D Gaussian can provide a useful control signal for a broad range of dynamic objects, though finer articulation control remains an important direction for future work.
F.2
Geometry Coverage and Robustness
Figure 16 : Additional analysis of geometry coverage and robustness. (a) Single-view vs. multi-view input. Multi-view reconstruction improves geometry coverage and novel-view faithfulness. (b) Robustness to monocular-depth errors. Even with noisier depth and distorted point clouds, the generated videos remain visually similar and preserve the main scene structure.
Single-view vs. multi-view input. Multi-view input improves geometry coverage and therefore improves novel-view faithfulness. As shown in Fig. 16 (a), using multiple views to reconstruct the scene expands the point cloud to cover regions that are weakly observed or fully invisible from a single reference image, such as the tram side and rear structure. Consequently, the generated video is more faithful under larger viewpoint changes; for example, the rear door is recovered only in the multi-view case. By comparison, single-view reconstruction still produces plausible videos because the diffusion prior can fill in under-constrained regions, but the missing geometry leads to less faithful novel-view synthesis. This result supports our claim that more complete 3D reconstruction directly benefits controllable video generation when the target camera motion departs significantly from the reference view.
Table 6 : Memory–time trade-off under different inference settings. We report peak per-GPU memory and diffusion inference time for the 50-step setting. FSDP reduces peak memory from 90 GB to 70 GB with negligible runtime change, and FSDP + CPU offload further reduces it to 57 GB (36.7% reduction) with only a small increase in diffusion inference time.
| Inference setting | Peak GPU memory (GB) | Memory reduction (%) | Diffusion inference time (s) |
|---|---|---|---|
| Baseline | 90 | 0.0 | 866 |
| Baseline + FSDP | 70 | 22.2 | 870 |
| Baseline + FSDP + CPU offload | 57 | 36.7 | 880 |
Table 7 : Stage-wise end-to-end inference latency and cacheability. We report the runtime breakdown for generating an 81-frame 720P video on 8×96GB GPUs. The 4D geometric scene state is reusable across repeated edits of the same scene, and model loading is a one-time startup cost, whereas 4D control rendering and diffusion sampling must be rerun when the edited controls change.
| Stage | Time (s) ↓ | Cacheable? |
|---|---|---|
| 4D Geometric State Construction | ∼23 | ✓ |
| 4D Control Maps Rendering | ∼60 | ✗ |
| Diffusion Sampling: Model Loading | ∼203 | ✓ |
| Diffusion Sampling: Inference (50 steps) | ∼866 | ✗ |
| Diffusion Sampling: Inference (30 steps) | ∼715 | ✗ |
Robustness to monocular-depth errors. We also test robustness to imperfect monocular depth estimation in challenging conditions with heavy occlusion and strong illumination variation. In Fig. 16 (b), replacing MoGe-2 with MiDaS v2.1 produces visibly noisier depth and distorted point clouds in difficult regions such as the pillars highlighted by the red boxes. Despite these geometry errors, the generated videos remain visually similar and preserve the main building structure. This robustness is expected because the reconstructed point cloud acts as a coarse geometric scaffold rather than a per-pixel hard constraint. After 3D-to-2D rendering, the diffusion prior can compensate for moderate depth noise and still generate structurally plausible results. Therefore, while better monocular geometry generally improves controllability, VerseCrafter does not critically depend on perfectly accurate depth estimates.
Overall, these analyses suggest that VerseCrafter is most effective when the underlying 4D geometric cues are sufficiently informative, while failures mainly arise in under-constrained cases such as subtle human orientation changes or highly non-rigid background dynamics.
Appendix G
Inference Efficiency and Memory Usage
We further analyze the inference cost of VerseCrafter. For generating an 81-frame 720P video on 8×96GB GPUs, Table 7 shows that diffusion inference is the dominant bottleneck, while 4D geometric state construction is cacheable across repeated edits of the same scene and diffusion model loading is a one-time startup cost. Accordingly, the per-edit latency is substantially reduced for subsequent edits, and can be further lowered by using fewer denoising steps.
Table 6 summarizes the memory and runtime trade-off under different inference settings. FSDP substantially reduces peak per-GPU memory with negligible runtime overhead, and FSDP + CPU offload further lowers memory at only a small additional cost. These results suggest that the current practical bottleneck is diffusion inference rather than 4D geometric state construction.
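As a rough illustration of the settings in Table 6, the sketch below shows how parameter sharding and CPU offload can be enabled with PyTorch FSDP; the stand-in module and launch details are placeholders for illustration, not the paper's actual inference stack.

```python
"""Minimal sketch of the FSDP / CPU-offload settings in Table 6 (not the paper's code).
Launch with: torchrun --nproc_per_node=8 this_script.py
"""
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP


def main() -> None:
    dist.init_process_group("nccl")  # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Stand-in module; in practice this would be the frozen video diffusion backbone.
    model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()).cuda()

    # "Baseline + FSDP": shard parameters across the GPUs to lower peak memory.
    model = FSDP(model)

    # "Baseline + FSDP + CPU offload": additionally keep sharded parameters in host
    # memory between uses, trading a little inference time for further savings.
    # model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```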
Appendix H
Limitations and Future Work
Despite the encouraging results, VerseCrafter still has several limitations that suggest promising directions for future work.
First, our current object representation provides only ellipsoid-level control through a single 3D Gaussian per object, which limits fine-grained pose and part-level articulation, especially for human-like or near-symmetric objects. More expressive object representations, such as multiple Gaussians per object or articulated 3D structures, may improve fine-grained orientation and pose control.
Second, our background point cloud is reconstructed from the first frame and serves as a mostly static geometric scaffold, which limits controllability for highly non-rigid and texture-dominant scene dynamics such as waterfalls. Incorporating explicit dynamic background representations or temporally evolving scene geometry may improve controllability in such cases.
Third, although VerseCrafter enforces 4D geometric consistency through explicit camera control and 3D Gaussian trajectory control, it does not impose explicit physical constraints during generation. Integrating stronger physics priors, such as collision-aware losses, contact constraints, ground constraints, or differentiable physics guidance, could improve physical realism and controllability in complex interactions.
Finally, VerseCrafter remains computationally expensive at high resolution and long temporal horizons because it conditions a large frozen video diffusion backbone and renders multi-channel 4D controls for all frames. Future work may explore more efficient backbones, distilled sampling, cached control encoding, and streaming or long-video synthesis to enable faster and longer world rollouts.
References
[1] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. (2025)
Cosmos world foundation model platform for physical ai .
arXiv preprint arXiv:2501.03575 .
Cited by: §2 .
[2] H. A. Alhaija, J. Alvarez, M. Bala, T. Cai, T. Cao, L. Cha, J. Chen, M. Chen, F. Ferroni, S. Fidler, et al. (2025)
Cosmos-transfer1: conditional world generation with adaptive multimodal control .
arXiv preprint arXiv:2503.14492 .
Cited by: §2 .
[3] S. Bahmani, I. Skorokhodov, G. Qian, A. Siarohin, W. Menapace, A. Tagliasacchi, D. B. Lindell, and S. Tulyakov (2025)
Ac3d: analyzing and improving 3d camera control in video diffusion transformers .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 22875–22889 .
Cited by: §2 .
[4] S. Bahmani, I. Skorokhodov, A. Siarohin, W. Menapace, G. Qian, M. Vasilkovsky, H. Lee, C. Wang, J. Zou, A. Tagliasacchi, et al. (2024)
Vd3d: taming large video diffusion transformers for 3d camera control .
arXiv preprint arXiv:2407.12781 .
Cited by: §2 .
[5] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025)
Qwen2.5-VL technical report .
arXiv preprint arXiv:2502.13923 .
Cited by: §4 .
[6] P. J. Ball, J. Bauer, F. Belletti, B. Brownfield, A. Ephrat, S. Fruchter, A. Gupta, K. Holsheimer, A. Holynski, J. Hron, C. Kaplanis, M. Limont, M. McGill, Y. Oliveira, J. Parker-Holder, F. Perbet, G. Scully, J. Shar, S. Spencer, O. Tov, R. Villegas, E. Wang, J. Yung, C. Baetu, J. Berbel, D. Bridson, J. Bruce, G. Buttimore, S. Chakera, B. Chandra, P. Collins, A. Cullum, B. Damoc, V. Dasagi, M. Gazeau, C. Gbadamosi, W. Han, E. Hirst, A. Kachra, L. Kerley, K. Kjems, E. Knoepfel, V. Koriakin, J. Lo, C. Lu, Z. Mehring, A. Moufarek, H. Nandwani, V. Oliveira, F. Pardo, J. Park, A. Pierson, B. Poole, H. Ran, T. Salimans, M. Sanchez, I. Saprykin, A. Shen, S. Sidhwani, D. Smith, J. Stanton, H. Tomlinson, D. Vijaykumar, L. Wang, P. Wingfield, N. Wong, K. Xu, C. Yew, N. Young, V. Zubov, D. Eck, D. Erhan, K. Kavukcuoglu, D. Hassabis, Z. Gharamani, R. Hadsell, A. van den Oord, I. Mosseri, A. Bolton, S. Singh, and T. Rocktäschel (2025)
Genie 3: a new frontier for world models .
External Links: Link
Cited by: §2 .
[7] A. Bar, G. Zhou, D. Tran, T. Darrell, and Y. LeCun (2025)
Navigation world models .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 15791–15801 .
Cited by: §1 , §2 .
[8] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis (2023)
Align your latents: high-resolution video synthesis with latent diffusion models .
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ,
pp. 22563–22575 .
Cited by: Appendix A .
[9] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. (2024)
Genie: generative interactive environments .
In Forty-first International Conference on Machine Learning ,
Cited by: §1 , §2 .
[10] R. Burgert, Y. Xu, W. Xian, O. Pilarski, P. Clausen, M. He, L. Ma, Y. Deng, L. Li, M. Mousavi, et al. (2025)
Go-with-the-flow: motion-controllable video diffusion models using real-time warped noise .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 13–23 .
Cited by: §2 .
[11] C. Cao, J. Zhou, S. Li, J. Liang, C. Yu, F. Wang, X. Xue, and Y. Fu (2025)
Uni3C: unifying precisely 3d-enhanced camera and human motion controls for video generation .
arXiv preprint arXiv:2504.14899 .
Cited by: Figure 1 , Figure 1 , §1 , §2 , Table 1 .
[12] H. Che, X. He, Q. Liu, C. Jin, and H. Chen (2024)
Gamegen-x: interactive open-world game video generation .
arXiv preprint arXiv:2411.00769 .
Cited by: §1 , §2 .
[13] J. Chen, H. Zhu, X. He, Y. Wang, J. Zhou, W. Chang, Y. Zhou, Z. Li, Z. Fu, J. Pang, et al. (2025)
DeepVerse: 4d autoregressive video generation as a world model .
arXiv preprint arXiv:2506.01103 .
Cited by: §1 , §2 .
[14] L. Chen, Z. Zhou, M. Zhao, Y. Wang, G. Zhang, W. Huang, H. Sun, J. Wen, and C. Li (2025)
FlexWorld: progressively expanding 3d scenes for flexible-view synthesis .
arXiv preprint arXiv:2503.13265 .
Cited by: §2 .
[15] Y. Chen, Y. Men, Y. Yao, M. Cui, and L. Bo (2025)
Perception-as-control: fine-grained controllable image animation with 3d-aware motion representation .
arXiv preprint arXiv:2501.05020 .
Cited by: §1 , §2 , Table 1 .
[16] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed (2017)
Recurrent environment simulators .
arXiv preprint arXiv:1704.02254 .
Cited by: §2 .
[17] H. W. Chung, N. Constant, X. Garcia, A. Roberts, Y. Tay, S. Narang, and O. Firat (2023)
Unimax: fairer and more effective language sampling for large-scale multilingual pretraining .
arXiv preprint arXiv:2304.09151 .
Cited by: §3.2 .
[18] J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee (2023)
Luciddreamer: domain-free generation of 3d gaussian splatting scenes .
arXiv preprint arXiv:2311.13384 .
Cited by: §2 .
[19] D. Cohen-Bar, E. Richardson, G. Metzer, R. Giryes, and D. Cohen-Or (2023)
Set-the-scene: global-local training for generating controllable nerf scenes .
In Proceedings of the IEEE/CVF International Conference on Computer Vision ,
pp. 2920–2929 .
Cited by: §2 .
[20] E. Decart, Q. McIntyre, S. Campbell, X. Chen, and R. Wachen (2024)
Oasis: a universe in a transformer .
URL: https://oasis-model.github.io .
Cited by: §2 .
[21] W. Feng, J. Liu, P. Tu, T. Qi, M. Sun, T. Ma, S. Zhao, S. Zhou, and Q. He (2024)
I2VControl-camera: precise video camera control with adjustable motion strength .
arXiv preprint arXiv:2411.06525 .
Cited by: §2 .
[22] W. Feng, T. Qi, J. Liu, M. Sun, P. Tu, T. Ma, F. Dai, S. Zhao, S. Zhou, and Q. He (2024)
I2VControl: disentangled and unified video motion synthesis control .
arXiv preprint arXiv:2411.17765 .
Cited by: §2 .
[23] S. Ferraro, P. Mazzaglia, T. Verbelen, and B. Dhoedt (2025)
FOCUS: object-centric world models for robotic manipulation .
Frontiers in Neurorobotics 19 , pp. 1585386 .
Cited by: §1 .
[24] C. Finn, I. Goodfellow, and S. Levine (2016)
Unsupervised learning for physical interaction through video prediction .
Advances in neural information processing systems 29 .
Cited by: §2 .
[25] X. Fu, X. Liu, X. Wang, S. Peng, M. Xia, X. Shi, Z. Yuan, P. Wan, D. Zhang, and D. Lin (2024)
3dtrajmaster: mastering 3d trajectory for multi-entity motion in video generation .
arXiv preprint arXiv:2412.07759 .
Cited by: §2 .
[26] J. Gao, Z. Chen, X. Liu, J. Feng, C. Si, Y. Fu, Y. Qiao, and Z. Liu (2025)
Longvie: multimodal-guided controllable ultra-long video generation .
arXiv preprint arXiv:2508.03694 .
Cited by: §2 .
[27] D. Geng, C. Herrmann, J. Hur, F. Cole, S. Zhang, T. Pfaff, T. Lopez-Guevara, C. Doersch, Y. Aytar, M. Rubinstein, et al. (2024)
Motion prompting: controlling video generation with motion trajectories .
arXiv preprint arXiv:2412.02700 .
Cited by: §2 .
[28] Z. Gu, R. Yan, J. Lu, P. Li, Z. Dou, C. Si, Z. Dong, Q. Liu, C. Lin, Z. Liu, et al. (2025)
Diffusion as shader: 3d-aware video diffusion for versatile video generation control .
In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ,
pp. 1–12 .
Cited by: §2 .
[29] D. Ha and J. Schmidhuber (2018)
Recurrent world models facilitate policy evolution .
Advances in neural information processing systems 31 .
Cited by: §2 .
[30] D. Ha and J. Schmidhuber (2018)
World models .
arXiv preprint arXiv:1803.10122 2 ( 3 ).
Cited by: §1 , §2 .
[31] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019)
Learning latent dynamics for planning from pixels .
In International conference on machine learning ,
pp. 2555–2565 .
Cited by: §2 .
[32] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap (2023)
Mastering diverse domains through world models .
arXiv preprint arXiv:2301.04104 .
Cited by: §1 .
[33] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang (2024)
Cameractrl: enabling camera control for text-to-video generation .
arXiv preprint arXiv:2404.02101 .
Cited by: §D.2 , §D.3 , §2 , §5 .
[34] H. He, C. Yang, S. Lin, Y. Xu, M. Wei, L. Gui, Q. Zhao, G. Wetzstein, L. Jiang, and H. Li (2025)
Cameractrl ii: dynamic scene exploration via camera-controlled video diffusion models .
arXiv preprint arXiv:2503.10592 .
Cited by: §2 .
[35] X. He, C. Peng, Z. Liu, B. Wang, Y. Zhang, Q. Cui, F. Kang, B. Jiang, M. An, Y. Ren, et al. (2025)
Matrix-game 2.0: an open-source, real-time, and streaming interactive world model .
arXiv preprint arXiv:2508.13009 .
Cited by: §2 .
[36] X. He, S. Wang, J. Yang, X. Wu, Y. Wang, K. Wang, Z. Zhan, O. Ruwase, Y. Shen, and X. E. Wang (2024)
Mojito: motion trajectory and intensity control for video generation .
arXiv preprint arXiv:2412.08948 .
Cited by: §2 .
[37] J. Ho, A. Jain, and P. Abbeel (2020)
Denoising diffusion probabilistic models .
Advances in neural information processing systems 33 , pp. 6840–6851 .
Cited by: Appendix A .
[38] Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan (2023)
Lrm: large reconstruction model for single image to 3d .
arXiv preprint arXiv:2311.04400 .
Cited by: §2 .
[39] C. Hou, G. Wei, Y. Zeng, and Z. Chen (2024)
Training-free camera control for video generation .
arXiv preprint arXiv:2406.10126 .
Cited by: §2 .
[40] A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado (2023)
Gaia-1: a generative world model for autonomous driving .
arXiv preprint arXiv:2309.17080 .
Cited by: §2 .
[41] T. Huang, W. Zheng, T. Wang, Y. Liu, Z. Wang, J. Wu, J. Jiang, H. Li, R. W. Lau, W. Zuo, et al. (2025)
Voyager: long-range and world-consistent video diffusion for explorable 3d scene generation .
arXiv preprint arXiv:2506.04225 .
Cited by: §1 , §1 , §2 , Table 2 .
[42] Z. Jiang, Z. Han, C. Mao, J. Zhang, Y. Pan, and Y. Liu (2025)
Vace: all-in-one video creation and editing .
arXiv preprint arXiv:2503.07598 .
Cited by: Appendix B , §3.2 .
[43] T. Karras, M. Aittala, T. Aila, and S. Laine (2022)
Elucidating the design space of diffusion-based generative models .
Advances in neural information processing systems 35 , pp. 26565–26577 .
Cited by: Appendix A .
[44] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis (2023)
3D gaussian splatting for real-time radiance field rendering .
ACM Trans. Graph. 42 ( 4 ), pp. 139–1 .
Cited by: §2 .
[45] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. (2024)
Hunyuanvideo: a systematic framework for large video generative models .
arXiv preprint arXiv:2412.03603 .
Cited by: §2 .
[46] Z. Kuang, S. Cai, H. He, Y. Xu, H. Li, L. J. Guibas, and G. Wetzstein (2024)
Collaborative video diffusion: consistent multi-video generation with camera control .
Advances in Neural Information Processing Systems 37 , pp. 16240–16271 .
Cited by: §2 .
[47] H. W. Kuhn (1955)
The hungarian method for the assignment problem .
Naval research logistics quarterly 2 ( 1-2 ), pp. 83–97 .
Cited by: §D.4 .
[48] Y. LeCun (2022)
A path towards autonomous machine intelligence version 0.9.2, 2022-06-27 .
Open Review 62 ( 1 ), pp. 1–62 .
Cited by: §1 , §2 .
[49] H. Li, H. Shi, W. Zhang, W. Wu, Y. Liao, L. Wang, L. Lee, and P. Y. Zhou (2024)
Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling .
In European Conference on Computer Vision ,
pp. 214–230 .
Cited by: §2 .
[50] J. Li, J. Tang, Z. Xu, L. Wu, Y. Zhou, S. Shao, T. Yu, Z. Cao, and Q. Lu (2025)
Hunyuan-gamecraft: high-dynamic interactive game video generation with hybrid history condition .
arXiv preprint arXiv:2506.17201 .
Cited by: §2 .
[51] Q. Li, Z. Xing, R. Wang, H. Zhang, Q. Dai, and Z. Wu (2025)
Magicmotion: controllable video generation with dense-to-sparse trajectory guidance .
arXiv preprint arXiv:2503.16421 .
Cited by: §2 .
[52] R. Li, P. Torr, A. Vedaldi, and T. Jakab (2025)
VMem: consistent interactive video scene generation with surfel-indexed view memory .
arXiv preprint arXiv:2506.18903 .
Cited by: §1 , §2 .
[53] T. Li, G. Zheng, R. Jiang, T. Wu, Y. Lu, Y. Lin, X. Li, et al. (2025)
Realcam-i2v: real-world image-to-video generation with interactive complex camera control .
arXiv preprint arXiv:2502.10059 .
Cited by: §2 .
[54] X. Li, T. Wang, Z. Gu, S. Zhang, C. Guo, and L. Cao (2025)
FlashWorld: high-quality 3d scene generation within seconds .
arXiv preprint arXiv:2510.13678 .
Cited by: §2 , Table 2 .
[55] Y. Li, X. Wang, Z. Zhang, Z. Wang, Z. Yuan, L. Xie, Y. Zou, and Y. Shan (2024)
Image conductor: precision control for interactive video synthesis .
arXiv preprint arXiv:2406.15339 .
Cited by: §2 .
[56] Z. Li, C. Li, X. Mao, S. Lin, M. Li, S. Zhao, Z. Xu, X. Li, Y. Feng, J. Sun, et al. (2025)
Sekai: a video dataset towards world exploration .
arXiv preprint arXiv:2506.15675 .
Cited by: §4 .
[57] J. Liang, J. Zhou, S. Li, C. Cao, L. Sun, Y. Qian, W. Chen, and F. Wang (2025)
Realismotion: decomposed human motion control and video generation in the world space .
arXiv preprint arXiv:2508.08588 .
Cited by: §2 .
[58] X. Liao, X. Zeng, L. Wang, G. Yu, G. Lin, and C. Zhang (2025)
Motionagent: fine-grained controllable video generation via motion field agent .
arXiv preprint arXiv:2502.03207 .
Cited by: §1 , §2 .
[59] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022)
Flow matching for generative modeling .
arXiv preprint arXiv:2210.02747 .
Cited by: Appendix A .
[60] Y. Liu, Z. Min, Z. Wang, J. Wu, T. Wang, Y. Yuan, Y. Luo, and C. Guo (2025)
WorldMirror: universal 3d world reconstruction with any-prior prompting .
arXiv preprint arXiv:2510.10726 .
Cited by: §2 .
[61] T. Lu, T. Shu, J. Xiao, L. Ye, J. Wang, C. Peng, C. Wei, D. Khashabi, R. Chellappa, A. Yuille, et al. (2024)
Genex: generating an explorable world .
arXiv preprint arXiv:2412.09624 .
Cited by: §2 .
[62] W. K. Ma, J. P. Lewis, and W. B. Kleijn (2024)
Trailblazer: trajectory control for diffusion-based video generation .
In SIGGRAPH Asia 2024 Conference Papers ,
pp. 1–11 .
Cited by: §2 .
[63] X. Mao, S. Lin, Z. Li, C. Li, W. Peng, T. He, J. Pang, M. Chi, Y. Qiao, and K. Zhang (2025)
Yume: an interactive world generation model .
arXiv preprint arXiv:2507.17744 .
Cited by: Figure 1 , Figure 1 , §1 , §2 , Table 1 .
[64] W. Menapace, S. Lathuiliere, S. Tulyakov, A. Siarohin, and E. Ricci (2021)
Playable video generation .
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ,
pp. 10061–10070 .
Cited by: §2 .
[65] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2021)
Nerf: representing scenes as neural radiance fields for view synthesis .
Communications of the ACM 65 ( 1 ), pp. 99–106 .
Cited by: §2 .
[66] C. Mou, M. Cao, X. Wang, Z. Zhang, Y. Shan, and J. Zhang (2024)
ReVideo: remake a video with motion and content control .
arXiv preprint arXiv:2405.13865 .
Cited by: §2 .
[67] M. Niu, X. Cun, X. Wang, Y. Zhang, Y. Shan, and Y. Zheng (2025)
Mofa-video: controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model .
In European Conference on Computer Vision ,
pp. 111–128 .
Cited by: §2 .
[68] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015)
Action-conditional video prediction using deep networks in atari games .
Advances in neural information processing systems 28 .
Cited by: §2 .
[69] K. Pandey, M. Gadelha, Y. Hold-Geoffroy, K. Singh, N. J. Mitra, and P. Guerrero (2024)
Motion modes: what could happen next? .
arXiv preprint arXiv:2412.00148 .
Cited by: §2 .
[70] J. Parker-Holder, P. Ball, J. Bruce, V. Dasagi, K. Holsheimer, C. Kaplanis, A. Moufarek, G. Scully, J. Shar, J. Shi, S. Spencer, J. Yung, M. Dennis, S. Kenjeyev, S. Long, V. Mnih, H. Chan, M. Gazeau, B. Li, F. Pardo, L. Wang, L. Zhang, F. Besse, T. Harley, A. Mitenkova, J. Wang, J. Clune, D. Hassabis, R. Hadsell, A. Bolton, S. Singh, and T. Rocktäschel (2024)
Genie 2: a large-scale foundation world model .
External Links: Link
Cited by: §1 , §2 .
[71] W. Peebles and S. Xie (2023)
Scalable diffusion models with transformers .
In Proceedings of the IEEE/CVF international conference on computer vision ,
pp. 4195–4205 .
Cited by: Appendix A .
[72] L. Piccinelli, C. Sakaridis, Y. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool (2025)
Unidepthv2: universal monocular metric depth estimation made simpler .
arXiv preprint arXiv:2502.20110 .
Cited by: §4 .
[73] R. Po, Y. Nitzan, R. Zhang, B. Chen, T. Dao, E. Shechtman, G. Wetzstein, and X. Huang (2025)
Long-context state-space video world models .
arXiv preprint arXiv:2505.20171 .
Cited by: §2 .
[74] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall (2022)
Dreamfusion: text-to-3d using 2d diffusion .
arXiv preprint arXiv:2209.14988 .
Cited by: §2 .
[75] S. Popov, A. Raj, M. Krainin, Y. Li, W. T. Freeman, and M. Rubinstein (2025)
Camctrl3d: single-image scene exploration with precise 3d camera control .
arXiv preprint arXiv:2501.06006 .
Cited by: §2 .
[76] H. Qiu, Z. Chen, Z. Wang, Y. He, M. Xia, and Z. Liu (2024)
Freetraj: tuning-free trajectory control in video diffusion models .
arXiv preprint arXiv:2406.16863 .
Cited by: §2 .
[77] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, et al. (2024)
Grounded sam: assembling open-world models for diverse visual tasks .
arXiv preprint arXiv:2401.14159 .
Cited by: §3.1 .
[78] X. Ren, T. Shen, J. Huang, H. Ling, Y. Lu, M. Nimier-David, T. Müller, A. Keller, S. Fidler, and J. Gao (2025)
Gen3c: 3d-informed world-consistent video generation with precise camera control .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 6121–6132 .
Cited by: §2 .
[79] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022)
High-resolution image synthesis with latent diffusion models .
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ,
pp. 10684–10695 .
Cited by: Appendix A , §2 .
[80] K. Sargent, Z. Li, T. Shah, C. Herrmann, H. Yu, Y. Zhang, E. R. Chan, D. Lagun, L. Fei-Fei, D. Sun, et al. (2024)
Zeronvs: zero-shot 360-degree view synthesis from a single image .
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ,
pp. 9420–9429 .
Cited by: §2 .
[81] M. Schneider, L. Höllein, and M. Nießner (2025)
WorldExplorer: towards generating fully navigable 3d scenes .
arXiv preprint arXiv:2506.01799 .
Cited by: §2 .
[82] X. Shi, Z. Huang, F. Wang, W. Bian, D. Li, Y. Zhang, M. Zhang, K. C. Cheung, S. See, H. Qin, et al. (2024)
Motion-i2v: consistent and controllable image-to-video generation with explicit motion modeling .
In ACM SIGGRAPH 2024 Conference Papers ,
pp. 1–11 .
Cited by: §2 .
[83] X. Shuai, H. Ding, Z. Qin, H. Luo, X. Ma, and D. Tao (2025)
Free-form motion control: a synthetic video generation dataset with controllable camera and object motions .
arXiv preprint arXiv:2501.01425 .
Cited by: §2 .
[84] W. Sun, S. Chen, F. Liu, Z. Chen, Y. Duan, J. Zhang, and Y. Wang (2024)
Dimensionx: create any 3d and 4d scenes from a single image with controllable video diffusion .
arXiv preprint arXiv:2411.04928 .
Cited by: §2 .
[85] M. Tanveer, Y. Zhou, S. Niklaus, A. M. Amiri, H. Zhang, K. K. Singh, and N. Zhao (2024)
MotionBridge: dynamic video inbetweening with flexible controls .
arXiv preprint arXiv:2412.13190 .
Cited by: §2 .
[86] H. Team, Z. Wang, Y. Liu, J. Wu, Z. Gu, H. Wang, X. Zuo, T. Huang, W. Li, S. Zhang, et al. (2025)
Hunyuanworld 1.0: generating immersive, explorable, and interactive 3d worlds from words or pixels .
arXiv preprint arXiv:2507.21809 .
Cited by: §2 .
[87] R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee (2019)
High fidelity video prediction with large stochastic recurrent neural networks .
Advances in Neural Information Processing Systems 32 .
Cited by: §2 .
[88] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. (2025)
Wan: open and advanced large-scale video generative models .
arXiv preprint arXiv:2503.20314 .
Cited by: Appendix A , Appendix B , Appendix B , §1 , §2 , §3.2 , §3.2 .
[89] Z. Wan, S. Tang, J. Wei, R. Zhang, and J. Cao (2024)
Dragentity: trajectory guided video generation using entity and positional relationships .
In Proceedings of the 32nd ACM International Conference on Multimedia ,
pp. 108–116 .
Cited by: §2 .
[90] H. Wang, H. Ouyang, Q. Wang, W. Wang, K. L. Cheng, Q. Chen, Y. Shen, and L. Wang (2024)
LeviTor: 3d trajectory oriented image-to-video synthesis .
arXiv preprint arXiv:2412.15214 .
Cited by: §2 .
[91] J. Wang, L. Ye, T. Lu, J. Xiao, J. Zhang, Y. Guo, X. Liu, R. Chellappa, C. Peng, A. Yuille, et al. (2025)
EvoWorld: evolving panoramic world generation with explicit 3d memory .
arXiv preprint arXiv:2510.01183 .
Cited by: §2 .
[92] J. Wang, Y. Yuan, R. Zheng, Y. Lin, J. Gao, L. Chen, Y. Bao, Y. Zhang, C. Zeng, Y. Zhou, et al. (2025)
Spatialvid: a large-scale video dataset with spatial annotations .
arXiv preprint arXiv:2509.09676 .
Cited by: §4 .
[93] J. Wang, Y. Zhang, J. Zou, Y. Zeng, G. Wei, L. Yuan, and H. Li (2024)
Boximator: generating rich and controllable motions for video synthesis .
arXiv preprint arXiv:2402.01566 .
Cited by: §1 , §2 .
[94] Q. Wang, Y. Luo, X. Shi, X. Jia, H. Lu, T. Xue, X. Wang, P. Wan, D. Zhang, and K. Gai (2025)
Cinemaster: a 3d-aware and controllable framework for cinematic text-to-video generation .
In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ,
pp. 1–10 .
Cited by: §1 , §2 .
[95] R. Wang, S. Xu, Y. Dong, Y. Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang (2025)
MoGe-2: accurate monocular geometry with metric scale and sharp details .
arXiv preprint arXiv:2507.02546 .
Cited by: §3.1 , §4 .
[96] X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y. Zhang, Y. Shen, D. Zhao, and J. Zhou (2023)
Videocomposer: compositional video synthesis with motion controllability .
Advances in Neural Information Processing Systems 36 , pp. 7594–7611 .
Cited by: §2 .
[97] Z. Wang, Y. Lan, S. Zhou, and C. C. Loy (2024)
ObjCtrl-2.5D: training-free object control with camera poses .
arXiv preprint arXiv:2412.07721 .
Cited by: §2 .
[98] Z. Wang, Z. Yuan, X. Wang, Y. Li, T. Chen, M. Xia, P. Luo, and Y. Shan (2024)
Motionctrl: a unified and flexible motion controller for video generation .
In ACM SIGGRAPH 2024 Conference Papers ,
pp. 1–11 .
Cited by: §D.4 , §1 , §2 , §5 .
[99] Z. Wang, J. Cho, J. Li, H. Lin, J. Yoon, Y. Zhang, and M. Bansal (2025)
EPiC: efficient video camera control learning with precise anchor-video guidance .
arXiv preprint arXiv:2505.21876 .
Cited by: §2 .
[100] J. Wu, X. Li, Y. Zeng, J. Zhang, Q. Zhou, Y. Li, Y. Tong, and K. Chen (2024)
Motionbooth: motion-aware customized text-to-video generation .
Advances in Neural Information Processing Systems 37 , pp. 34322–34348 .
Cited by: §2 .
[101] T. Wu, S. Yang, R. Po, Y. Xu, Z. Liu, D. Lin, and G. Wetzstein (2025)
Video world models with long-term spatial memory .
arXiv preprint arXiv:2506.05284 .
Cited by: §2 .
[102] W. Wu, Z. Li, Y. Gu, R. Zhao, Y. He, D. J. Zhang, M. Z. Shou, Y. Li, T. Gao, and D. Zhang (2025)
Draganything: motion control for anything using entity representation .
In European Conference on Computer Vision ,
pp. 331–348 .
Cited by: §2 .
[103] J. Xiang, G. Liu, Y. Gu, Q. Gao, Y. Ning, Y. Zha, Z. Feng, T. Tao, S. Hao, Y. Shi, et al. (2024)
Pandora: towards general world model with natural language actions and video states .
arXiv preprint arXiv:2406.09455 .
Cited by: §2 .
[104] Z. Xiao, W. Ouyang, Y. Zhou, S. Yang, L. Yang, J. Si, and X. Pan (2024)
Trajectory attention for fine-grained video motion control .
arXiv preprint arXiv:2411.19324 .
Cited by: §2 .
[105] J. Xing, L. Mai, C. Ham, J. Huang, A. Mahapatra, C. Fu, T. Wong, and F. Liu (2025)
Motioncanvas: cinematic shot design with controllable image-to-video generation .
In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ,
pp. 1–11 .
Cited by: §2 .
[106] D. Xu, W. Nie, C. Liu, S. Liu, J. Kautz, Z. Wang, and A. Vahdat (2024)
CamCo: camera-controllable 3d-consistent image-to-video generation .
arXiv preprint arXiv:2406.02509 .
Cited by: §2 .
[107] T. Xu, Z. Chen, L. Wu, H. Lu, Y. Chen, L. Jiang, B. Liu, and Y. Chen (2024)
Motion dreamer: realizing physically coherent video generation through scene-aware motion reasoning .
arXiv preprint arXiv:2412.00547 .
Cited by: §2 .
[108] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao (2024)
Depth anything v2 .
Advances in Neural Information Processing Systems 37 , pp. 21875–21911 .
Cited by: §2 .
[109] S. Yang, L. Hou, H. Huang, C. Ma, P. Wan, D. Zhang, X. Chen, and J. Liao (2024)
Direct-a-video: customized video generation with user-directed camera movement and object motion .
In ACM SIGGRAPH 2024 Conference Papers ,
pp. 1–12 .
Cited by: §2 .
[110] Z. Yang, W. Ge, Y. Li, J. Chen, H. Li, M. An, F. Kang, H. Xue, B. Xu, Y. Yin, et al. (2025)
Matrix-3d: omnidirectional explorable 3d world generation .
arXiv preprint arXiv:2508.08086 .
Cited by: §1 , §2 .
[111] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. (2024)
Cogvideox: text-to-video diffusion models with an expert transformer .
arXiv preprint arXiv:2408.06072 .
Cited by: §2 .
[112] S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan (2023)
Dragnuwa: fine-grained control in video generation by integrating text, image, and trajectory .
arXiv preprint arXiv:2308.08089 .
Cited by: §2 .
[113] H. Yu, H. Duan, C. Herrmann, W. T. Freeman, and J. Wu (2025)
Wonderworld: interactive 3d scene generation from a single image .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 5916–5926 .
Cited by: §2 .
[114] H. Yu, H. Duan, J. Hur, K. Sargent, M. Rubinstein, W. T. Freeman, F. Cole, D. Sun, N. Snavely, J. Wu, et al. (2024)
Wonderjourney: going from anywhere to everywhere .
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ,
pp. 6658–6667 .
Cited by: §2 .
[115] J. J. Yu, F. Forghani, K. G. Derpanis, and M. A. Brubaker (2023)
Long-term photometric consistent novel view synthesis with diffusion models .
In Proceedings of the IEEE/CVF International Conference on Computer Vision ,
pp. 7094–7104 .
Cited by: §2 .
[116] J. Yu, Y. Qin, X. Wang, P. Wan, D. Zhang, and X. Liu (2025)
Gamefactory: creating new games with generative interactive videos .
arXiv preprint arXiv:2501.08325 .
Cited by: §1 , §2 .
[117] W. Yu, J. Xing, L. Yuan, W. Hu, X. Li, Z. Huang, X. Gao, T. Wong, Y. Shan, and Y. Tian (2024)
Viewcrafter: taming video diffusion models for high-fidelity novel view synthesis .
arXiv preprint arXiv:2409.02048 .
Cited by: §1 , §2 , Table 2 .
[118] K. Zhang, S. Bi, H. Tan, Y. Xiangli, N. Zhao, K. Sunkavalli, and Z. Xu (2024)
Gs-lrm: large reconstruction model for 3d gaussian splatting .
In European Conference on Computer Vision ,
pp. 1–19 .
Cited by: §2 .
[119] L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu (2024)
Clay: a controllable large-scale generative model for creating high-quality 3d assets .
ACM Transactions on Graphics (TOG) 43 ( 4 ), pp. 1–20 .
Cited by: §2 .
[120] L. Zhang, A. Rao, and M. Agrawala (2023)
Adding conditional control to text-to-image diffusion models .
In Proceedings of the IEEE/CVF international conference on computer vision ,
pp. 3836–3847 .
Cited by: Appendix B , §1 .
[121] Q. Zhang, C. Wang, A. Siarohin, P. Zhuang, Y. Xu, C. Yang, D. Lin, B. Zhou, S. Tulyakov, and H. Lee (2024)
Towards text-guided 3d scene composition .
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ,
pp. 6829–6838 .
Cited by: §2 .
[122] S. Zhang, J. Li, X. Fei, H. Liu, and Y. Duan (2025)
Scene splatter: momentum 3d scene generation from single image with video diffusion model .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 6089–6098 .
Cited by: §2 .
[123] Z. Zhang, J. Liao, M. Li, Z. Dai, B. Qiu, S. Zhu, L. Qin, and W. Wang (2024)
Tora: trajectory-oriented diffusion transformer for video generation .
arXiv preprint arXiv:2407.21705 .
Cited by: §2 .
[124] Z. Zhang, D. Chen, and J. Liao (2025)
I2V3D: controllable image-to-video generation with 3d guidance .
arXiv preprint arXiv:2503.09733 .
Cited by: §1 , §2 .
[125] Z. Zhang, F. Long, Z. Qiu, Y. Pan, W. Liu, T. Yao, and T. Mei (2025)
MotionPro: a precise motion controller for image-to-video generation .
In Proceedings of the Computer Vision and Pattern Recognition Conference ,
pp. 27957–27967 .
Cited by: §1 .
[126] G. Zheng, T. Li, R. Jiang, Y. Lu, T. Wu, and X. Li (2024)
Cami2v: camera-controlled image-to-video diffusion model .
arXiv preprint arXiv:2410.15957 .
Cited by: §2 .
[127] S. Zheng, Z. Peng, Y. Zhou, Y. Zhu, H. Xu, X. Huang, and Y. Fu (2025)
Vidcraft3: camera, object, and lighting control for image-to-video generation .
arXiv preprint arXiv:2502.07531 .
Cited by: §1 , §2 .
[128] H. Zhou, C. Wang, R. Nie, J. Liu, D. Yu, Q. Yu, and C. Wang (2024)
Trackgo: a flexible and efficient method for controllable video generation .
arXiv preprint arXiv:2408.11475 .
Cited by: §2 .
[129] H. Zhu, Y. Wang, J. Zhou, W. Chang, Y. Zhou, Z. Li, J. Chen, C. Shen, J. Pang, and T. He (2025)
Aether: geometric-aware unified world modeling .
In Proceedings of the IEEE/CVF International Conference on Computer Vision ,
pp. 8535–8546 .
Cited by: §2 .