πŸ“Š Dataset: OmniWorld
by InternRobotics
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling


⬇️ Downloads: 27,605 Β· ❀️ Likes: 81

πŸ‘οΈ Data Preview

πŸ“Š

Row-level preview not available for this dataset.

Schema structure is shown in the Field Logic panel when available.

πŸ”— Explore Full Dataset β†—

🧬 Field Logic

🧬

Schema not yet indexed for this dataset.

Dataset Specification

OmniWorld-Game Benchmark Detailed Guide

The OmniWorld-Game Benchmark is a curated subset of test splits drawn from the OmniWorld-Game dataset, selected to serve as a challenging evaluation platform, as detailed in our paper.

| Task | Sequence Length | Duration | Key Modalities |
|---|---|---|---|
| Geometric Prediction | 384 frames | 16 seconds | RGB, Depth, Camera Poses |
| Video Generation | 81 frames | 3.4 seconds | RGB, Depth, Camera Poses, Text |

Each benchmark sequence features rich dynamics that reflect real-world complexity, and is accompanied by high-fidelity ground-truth annotations for camera poses and depth.

Data Access and Organization

The benchmark annotation data is packaged into .tar.gz files located under the OmniWorld/benchmark directory. Each archive is named in the format <UID>_<split_index>.tar.gz.
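The archives can be unpacked with Python's standard tarfile module. A minimal sketch; the archive name below is a placeholder, not a real `<UID>_<split_index>`:

```python
import tarfile

def extract_archive(archive_path: str, out_dir: str) -> None:
    """Unpack one benchmark archive (<UID>_<split_index>.tar.gz) into out_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=out_dir)

# extract_archive("OmniWorld/benchmark/UID_0.tar.gz", "benchmark_extracted")  # placeholder paths
```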

Extracted Directory Structure

<UID>_<split_index>/
β”œβ”€ depth/
β”‚  β”œβ”€ 000000.npy        # (H, W) depth map, stored using the OmniWorld-Game depth reading method
β”‚  β”œβ”€ 000001.npy
β”‚  └─ ...
β”œβ”€ image/               # High-resolution RGB frames (720Γ—1280 pixels)
β”‚  β”œβ”€ 000000.png
β”‚  β”œβ”€ 000001.png
β”‚  └─ ...
β”œβ”€ camera_poses.npy     # (num_frames, 4, 4) camera-to-world (C2W) transformation matrices
β”œβ”€ intrinsics.npy       # (num_frames, 3, 3) intrinsic camera matrices in pixel space
β”œβ”€ text_caption.json    # Structured text caption associated with the sequence
└─ video.mp4            # MP4 video corresponding to the PNG frames in image/

The depth maps are already processed and stored using the OmniWorld-Game Depth reading method.
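Since camera_poses.npy stores camera-to-world matrices, a world-to-camera transform (needed for projecting points into the image) is obtained by inverting each rigid pose. A sketch on a synthetic pose, not real annotation values:

```python
import numpy as np

# Synthetic C2W pose for illustration: camera translated to (0, 0, 2)
c2w = np.eye(4)
c2w[:3, 3] = [0.0, 0.0, 2.0]

# Invert the rigid transform analytically: rotation becomes R^T, translation -R^T t
R, t = c2w[:3, :3], c2w[:3, 3]
w2c = np.eye(4)
w2c[:3, :3] = R.T
w2c[:3, 3] = -R.T @ t
```

The analytic inverse is cheaper and numerically safer than np.linalg.inv for rigid transforms.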

OmniWorld-CityWalk Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-CityWalk dataset.

OmniWorld-CityWalk Organisation and File Structure

The OmniWorld-CityWalk dataset is a collection of re-annotated data derived from a subset of the Sekai-Real-Walking-HQ dataset. You need to download the original videos and extract the video clips yourself.

Important Note: In this repository, we only provide the annotated data (e.g., camera poses, dynamic masks), and do not include the raw RGB image files due to licensing and size constraints. Please refer to the original project for instructions on downloading and splitting the raw video data. Our annotations are designed to align with the original video frames.

Annotation Files

The camera annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-CityWalk/.

  • Naming Convention: omniworld_citywalk_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Scene and Split Specifications

  • Video Length: Each source video scene is 60 seconds long.
  • Frame Rate: 30 FPS.
  • Total Frames: 1800 frames per scene.
  • Split Strategy: Each scene is divided into 6 splits of 300 frames each for detailed annotation.
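Under these specifications, a global frame index maps to a split deterministically. A small sanity-check sketch; the authoritative grouping lives in each scene's split_info.json:

```python
FPS = 30
SCENE_SECONDS = 60
FRAMES_PER_SPLIT = 300

total_frames = FPS * SCENE_SECONDS             # 1800 frames per scene
num_splits = total_frames // FRAMES_PER_SPLIT  # 6 splits

def frame_to_split(frame_idx: int) -> tuple[int, int]:
    """Map a global frame index (0-1799) to (split_index, frame_within_split)."""
    return divmod(frame_idx, FRAMES_PER_SPLIT)

print(num_splits, frame_to_split(750))  # 6 (2, 150)
```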

Metadata Explained (omniworld_citywalk_metadata.csv)

| Field Name | Description |
|---|---|
| index | The sequential index number of the scene. |
| videoFile | The video file name, formatted as <scene_id>_<start_frame>_<end_frame>. The corresponding source video on YouTube can be accessed via https://www.youtube.com/watch?v=<scene_id>. |
| cameraFile | The directory name for the camera annotation data, named after the video file. |
| caption | The dense text description/caption for the video segment. |
| location | The geographical location where the video was filmed. |
| crowdDensity | An assessment of the crowd/people density within the video. |
| weather | The general weather condition (e.g., sunny, overcast). |
| timeOfDay | The time of day when the video was recorded (e.g., morning, afternoon). |
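A sketch of reading the metadata and reconstructing the YouTube URL from videoFile. The CSV row below is fabricated for illustration; rsplit is used because YouTube IDs may themselves contain underscores:

```python
import csv
import io

# Fabricated sample row mimicking omniworld_citywalk_metadata.csv
sample = (
    "index,videoFile,cameraFile,caption,location,crowdDensity,weather,timeOfDay\n"
    "0,xpPEhccDNak_0023550_0025350,xpPEhccDNak_0023550_0025350,"
    "A city walk.,Tokyo,medium,sunny,afternoon\n"
)

for row in csv.DictReader(io.StringIO(sample)):
    # videoFile is <scene_id>_<start_frame>_<end_frame>; split off the last two fields
    scene_id, start, end = row["videoFile"].rsplit("_", 2)
    url = f"https://www.youtube.com/watch?v={scene_id}"
    print(url, start, end)
```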

OmniWorld-CityWalk Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one omniworld_citywalk_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

xpPEhccDNak_0023550_0025350/  # Example scene name (videoFile)
β”œβ”€ gdino_mask/          # Per-frame dynamic-object masks (.png)
β”œβ”€ recon/               # Camera and 3D reconstruction data per split
β”‚  β”œβ”€ split_0/
β”‚  β”‚  β”œβ”€ extrinsics.npz   # Per-frame camera extrinsics: (frame_num, 3, 4), OpenCV world-to-camera format
β”‚  β”‚  β”œβ”€ intrinsics.npz   # Per-frame camera intrinsics: (frame_num, 3, 3), pixel units
β”‚  β”‚  └─ points3D_ba.ply  # Sparse, accurate point cloud after Bundle Adjustment (BA) for this split
β”‚  β”œβ”€ split_1/
β”‚  β”‚  └─ ...
β”‚  └─ ...
β”œβ”€ image_list.json      # Defines the frame naming convention (e.g., 000000.png to 001799.png)
└─ split_info.json      # Records how frames are grouped into 300-frame splits

2. Modality Details

2.1. Split Information (split_info.json)

Scene frames are segmented into 300-frame splits for annotation. The mapping and division information is stored in split_info.json.

2.2. Camera Poses (recon/split_<idx>/...)

Camera poses are provided as NumPy compressed files (.npz) containing the extrinsics (world-to-camera rotation and translation) and intrinsics (focal length and principal point).

Minimal Reader

```python
import numpy as np

# Load extrinsics (world-to-camera transform, OpenCV convention)
extrinsics = np.load("recon/split_0/extrinsics.npz")["extrinsics"]  # Shape: (frame_num, 3, 4)

# Load intrinsics (in pixel units)
intrinsics = np.load("recon/split_0/intrinsics.npz")["intrinsics"]  # Shape: (frame_num, 3, 3)

print("Extrinsics shape:", extrinsics.shape)
print("Intrinsics shape:", intrinsics.shape)
```
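The (frame_num, 3, 4) world-to-camera extrinsics can be promoted to full 4Γ—4 matrices and inverted to camera-to-world poses. A sketch on synthetic identity poses, not real annotation values:

```python
import numpy as np

n = 5
w2c_34 = np.tile(np.eye(4)[:3], (n, 1, 1))  # synthetic (n, 3, 4) W2C extrinsics

# Append the homogeneous bottom row [0, 0, 0, 1] to each matrix, then invert
bottom = np.tile(np.array([[0.0, 0.0, 0.0, 1.0]]), (n, 1, 1))
w2c = np.concatenate([w2c_34, bottom], axis=1)   # (n, 4, 4)
c2w = np.linalg.inv(w2c)                         # camera-to-world poses
```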

OmniWorld-HOI4D Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-HOI4D dataset.

OmniWorld-HOI4D Organisation and File Structure

The OmniWorld-HOI4D dataset is a collection of re-annotated data derived from the HOI4D dataset. You need to download the original videos yourself.

Important Note: In this repository, we only provide the annotated data (e.g., camera poses, flow, depth, text), and do not include the raw RGB image files due to licensing and size constraints. Please refer to the original project for instructions on downloading the raw video data. Our annotations are designed to align with the original video frames.

Annotation Files

The annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-HOI4D/.

  • Naming Convention: omniworld_hoi4d_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Scene and Split Specifications

  • Total Frames: 300 frames per scene.
  • Split Strategy: Each scene consists of a single 300-frame split for detailed annotation.

Metadata Explained (omniworld_hoi4d_metadata.csv)

| Field Name | Description |
|---|---|
| Index | The sequential index number of the scene. |
| Video Path | The relative path of the scene in the original HOI4D dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: ZY20210800001/H1/C1/N19/S100/s02/T1 |
| Annotation Path | The directory name for this scene's annotations inside the extracted .tar.gz archive, generated by replacing all / in the Video Path with _. Example: ZY20210800001_H1_C1_N19_S100_s02_T1 |
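The metadata's path convention can be applied mechanically:

```python
def video_to_annotation_path(video_path: str) -> str:
    """Derive the annotation directory name by replacing every '/' with '_'."""
    return video_path.replace("/", "_")

print(video_to_annotation_path("ZY20210800001/H1/C1/N19/S100/s02/T1"))
# ZY20210800001_H1_C1_N19_S100_s02_T1
```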

OmniWorld-HOI4D Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one omniworld_hoi4d_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

<Annotation Path>
# e.g., ZY20210800001_H1_C1_N19_S100_s02_T1
|
β”œβ”€β”€ camera/
β”‚   β”œβ”€β”€ recon/
β”‚   β”‚   └── split_0/
β”‚   β”‚       └── info.json        # Camera intrinsics and extrinsics for all 300 frames.
β”‚   β”œβ”€β”€ image_list.json          # Ordered list of corresponding image filenames.
β”‚   └── split_info.json          # Defines the frame segmentation (HOI4D is one 300-frame split).
|
β”œβ”€β”€ flow/                        # Just like OmniWorld-Game.
β”‚   β”œβ”€β”€ 00000/
β”‚   β”‚   β”œβ”€β”€ flow_u_16.png        # Optical flow (horizontal component). 
β”‚   β”‚   β”œβ”€β”€ flow_v_16.png        # Optical flow (vertical component).
β”‚   β”‚   └── flow_vis.png         # Visualization of the optical flow.
β”‚   β”œβ”€β”€ 00001/
β”‚   ... (up to frame 299)
|
β”œβ”€β”€ prior_depth/
β”‚   β”œβ”€β”€ 00000.png               # Monocular depth map for frame 0.
β”‚   β”œβ”€β”€ 00001.png               # Monocular depth map for frame 1.
β”‚   ... (up to frame 299)
|
└── text/                        # Just like OmniWorld-Game.
    β”œβ”€β”€ 0_80.txt                 # Text description for frames 0-80.
    β”œβ”€β”€ 120_200.txt              # Text description for frames 120-200.
    ...
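The caption filenames encode their frame range. A small parser sketch, assuming the <start>_<end>.txt convention shown above:

```python
def caption_range(filename: str) -> tuple[int, int]:
    """Parse '<start>_<end>.txt' into an inclusive frame range."""
    start, end = filename.removesuffix(".txt").split("_")
    return int(start), int(end)

print(caption_range("120_200.txt"))  # (120, 200)
```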

2. Modality Details

2.1. Split Information (split_info.json)

Scene frames are segmented into 300-frame splits for annotation. The mapping and division information is stored in split_info.json. Each HOI4D scene consists of a single 300-frame split.

2.2 Camera Poses (info.json)

Minimal Reader

```python
import json
import torch

def load_camera_info(info_json_path: str):
    """Parses an info.json file to extract camera intrinsics and extrinsics."""
    with open(info_json_path, "r") as f:
        info_data = json.load(f)

    # Extrinsics are provided as a list of 4x4 world-to-camera matrices (OpenCV convention)
    extrinsics = torch.tensor(info_data["extrinsics"])  # Shape: (num_frames, 4, 4)

    num_frames = extrinsics.shape[0]

    # Build a single 3x3 intrinsic matrix from the stored parameters
    fx, fy, cx, cy = info_data["crop_intrinsic"].values()
    intrinsic = torch.eye(3)
    intrinsic[0, 0] = fx
    intrinsic[0, 2] = cx
    intrinsic[1, 1] = fy
    intrinsic[1, 2] = cy

    # Repeat the intrinsic matrix for each frame
    intrinsics = intrinsic.unsqueeze(0).repeat(num_frames, 1, 1)  # Shape: (num_frames, 3, 3)

    return intrinsics, extrinsics

# Example usage:
annotation_path = "ZY20210800001_H1_C1_N19_S100_s02_T1"
info_path = f"{annotation_path}/camera/recon/split_0/info.json"
intrinsics, extrinsics = load_camera_info(info_path)

print("Intrinsics shape:", intrinsics.shape)
print("Extrinsics shape:", extrinsics.shape)
```

OmniWorld-DROID Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-DROID dataset.

OmniWorld-DROID Organisation and File Structure

The OmniWorld-DROID dataset is a collection of re-annotated data derived from the DROID dataset. You need to download the original videos yourself.

Important Note: In this repository, we only provide the annotated data (e.g., flow, depth, text, mask), and do not include the raw RGB image files due to licensing and size constraints. Please refer to the original project for instructions on downloading the raw video data. Our annotations are designed to align with the original video frames.

Annotation Files

The annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-DROID/.

  • Naming Convention: omniworld_droid_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Metadata Explained (omniworld_droid_metadata.csv)

| Field Name | Description |
|---|---|
| Index | The sequential index number of the scene. |
| Video Path | The relative path of the scene in the original DROID dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: droid_raw/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/ |
| Annotation Path | The directory name for this scene's annotations inside the extracted .tar.gz archive. Example: droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/ |
| Img Num | The total number of image frames from one camera perspective in the scene. |

OmniWorld-DROID Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one omniworld_droid_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

<Annotation Path>/
# e.g., droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/
|
β”œβ”€β”€ flow/                        # Just like OmniWorld-Game
β”‚   └── <camera_serial_id>/      # e.g., 18026681, 22008760, etc.
β”‚       β”œβ”€β”€ 0/
β”‚       β”‚   β”œβ”€β”€ flow_u_16.png    # Optical flow (horizontal component) for frame 0
β”‚       β”‚   β”œβ”€β”€ flow_v_16.png    # Optical flow (vertical component) for frame 0
β”‚       β”‚   └── flow_vis.png     # Visualization of the optical flow for frame 0
β”‚       β”œβ”€β”€ 1/
β”‚       ... (up to Img Num - 1)
|
β”œβ”€β”€ foundation_stereo/
β”‚   └── <camera_serial_id>/
β”‚       β”œβ”€β”€ 0.png                # Monocular depth map for frame 0
β”‚       β”œβ”€β”€ 1.png                # Monocular depth map for frame 1
β”‚       ... (up to Img Num - 1)
|
β”œβ”€β”€ robot_masks/                 # Just like OmniWorld
β”‚   └── <camera_serial_id>/
β”‚       β”œβ”€β”€ mask_prompt.json
β”‚       └── tracked_masks_coco.json
|
β”œβ”€β”€ text/
β”‚   └── <camera_name>/           # e.g., ext1_cam_serial, wrist_cam_serial
β”‚       β”œβ”€β”€ 0-161.txt            # Short caption for frames 0-161
β”‚       └── 40-201.txt           # Short caption for frames 40-201
|
β”œβ”€β”€ recordings/
β”‚   └── camera_info_dict.npy         # Camera intrinsics
|
β”œβ”€β”€ <camera_name>_totalcaption.txt   # Long-form, summary caption for the entire scene from one camera's perspective
β”œβ”€β”€ meta_info.json                   # General metadata for the scene
...


2. Modality Details

2.1. Depth

Minimal Reader

```python
import imageio.v2
import numpy as np

_MAX_DEPTH = 10.0

def load_depth(depthpath):
    """
    Returns
    -------
    depthmap : (H, W) float32
    valid : (H, W) bool
        True for reliable pixels
    """
    # Depth is stored as 16-bit PNG; rescale to metric depth in [0, _MAX_DEPTH]
    depthmap = imageio.v2.imread(depthpath).astype(np.float32) / 65535.0 * _MAX_DEPTH
    valid = (depthmap > 0) & (depthmap < _MAX_DEPTH)
    return depthmap, valid

# ---------------------------- example ----------------------------------------
if __name__ == "__main__":
    d, valid = load_depth(
        "droid/droid_processed/1.0.1/REAL/success/2023-05-27/Sat_May_27_11:22:57_2023/foundation_stereo/23960472/160.png"
    )
    print("Depth shape:", d.shape, "valid pixels:", valid.mean() * 100, "%")
```

2.2 Camera Pose

To streamline the data loading process, we have pre-extracted camera intrinsics from the official DROID metadata and consolidated them into camera_info_dict.npy. Alternatively, you may parse these parameters directly from the raw DROID metadata files.

Note on Camera Extrinsics: In the DROID dataset, the wrist camera pose data is often inaccurate. Consequently, we do not provide extrinsic loading for wrist-mounted views. For fixed-view cameras, the extrinsic matrix can be initialized as an identity matrix.

```python
import numpy as np

camera_info_dict_path = "droid/droid_processed/1.0.1/REAL/success/2023-05-27/Sat_May_27_11:22:57_2023/camera_info_dict.npy"
camera_info = np.load(camera_info_dict_path, allow_pickle=True).item()

# Example: accessing intrinsics for specific camera serials
camera_serial_ids = ["18026681", "22008760", "24400334"]
for cam_id in camera_serial_ids:
    intrinsics = camera_info[cam_id]["cam_matrix"]
    print(f"Camera {cam_id} Intrinsics Shape: {intrinsics.shape}")  # Output: (3, 3)
```
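With a 3Γ—3 intrinsic matrix in hand, projecting a camera-space point into pixel coordinates follows the standard pinhole model. The values below are illustrative, not DROID calibration:

```python
import numpy as np

# Illustrative pinhole intrinsics (fx = fy = 600, principal point at (320, 240))
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

point_cam = np.array([0.1, -0.05, 2.0])  # 3D point in camera coordinates
u, v, w = K @ point_cam
pixel = np.array([u / w, v / w])         # perspective divide
print(pixel)  # [350. 225.]
```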

OmniWorld-RH20TRobot Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-RH20TRobot dataset.

OmniWorld-RH20TRobot Organisation and File Structure

The OmniWorld-RH20TRobot dataset is a collection of re-annotated data derived from the RH20T dataset. You need to download the original videos yourself.

Annotation Files

The annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-RH20TRobot/.

  • Naming Convention: rh20t_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Metadata Explained (omniworld_rh20t_robot_metadata.csv)

| Field Name | Description |
|---|---|
| Index | The sequential index number of the scene. |
| Video Path | The relative path of the scene in the original RH20T dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: RH20T/RH20T_cfg1/task_0030_user_0010_scene_0004_cfg_0001/cam_035622060973/color/ |
| Annotation Path | The directory name for this scene's annotations inside the extracted .tar.gz archive. Example: RH20T/RH20T_cfg1/task_0030_user_0010_scene_0004_cfg_0001/cam_035622060973/ |

OmniWorld-RH20TRobot Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one rh20t_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

<Annotation Path>/
# e.g., RH20T_cfg1/task_0030_user_0010_scene_0004_cfg_0001/cam_035622060973/
|
β”œβ”€β”€ robot_masks/                 # Read like OmniWorld
β”‚   β”œβ”€β”€ mask_prompt.json
β”‚   β”œβ”€β”€ tracked_masks_coco_v2.json
β”‚   └── tracked_masks_coco.json
|
β”œβ”€β”€ text/
β”‚   β”œβ”€β”€ 0-161.txt            # caption for frames 0-161
β”‚   └── 40-201.txt           # caption for frames 40-201
|
...
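Here the caption filenames use a <start>-<end>.txt convention (note the hyphen, versus the underscore used by OmniWorld-HOI4D). A sketch for collecting and ordering the spans:

```python
import os

def caption_span(filename: str) -> tuple[int, int]:
    """Parse '<start>-<end>.txt' into an inclusive frame span."""
    start, end = os.path.splitext(filename)[0].split("-")
    return int(start), int(end)

spans = sorted(caption_span(f) for f in ["40-201.txt", "0-161.txt"])
print(spans)  # [(0, 161), (40, 201)]
```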

OmniWorld-RH20THuman Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-RH20THuman dataset.

OmniWorld-RH20THuman Organisation and File Structure

The OmniWorld-RH20THuman dataset is a collection of re-annotated data derived from the RH20T dataset. You need to download the original videos yourself.

Annotation Files

The annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-RH20TTHuman/.

  • Naming Convention: rh20t_human_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Metadata Explained (omniworld_rh20t_human_metadata.csv)

| Field Name | Description |
|---|---|
| Index | The sequential index number of the scene. |
| Video Path | The relative path of the scene in the original RH20T dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: RH20T/RH20T_cfg1/task_0062_user_0001_scene_0010_cfg_0001_human/cam_035622060973/color/ |
| Annotation Path | The directory name for this scene's annotations inside the extracted .tar.gz archive. Example: RH20T/RH20T_cfg1/task_0062_user_0001_scene_0010_cfg_0001_human/cam_035622060973/ |

OmniWorld-RH20THuman Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one rh20t_human_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

<Annotation Path>/
# e.g., RH20T_cfg1/task_0062_user_0001_scene_0010_cfg_0001_human/cam_035622060973/
|
β”œβ”€β”€ text/
β”‚   β”œβ”€β”€ 0-161.txt            # caption for frames 0-161
β”‚   └── 40-201.txt           # caption for frames 40-201
|
...

OmniWorld-EgoExo4D Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-EgoExo4D dataset.

OmniWorld-EgoExo4D Organisation and File Structure

The OmniWorld-EgoExo4D dataset is a collection of re-annotated data derived from the Ego-Exo4D dataset. You need to download the original videos yourself.

Annotation Files

The annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-EgoExo4D/.

  • Naming Convention: omniword_egoexo4d_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Metadata Explained (omniworld_egoexo4d_metadata.csv)

| Field Name | Description |
|---|---|
| Index | The sequential index number of the scene. |
| Video Path | The relative path of the scene in the original Ego-Exo4D dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: egoexo4d-processed/takes/cmu_bike01_2/frame_aligned_videos/aria01_214-1-undistorted/ |
| Annotation Path | The directory name for this scene's annotations inside the extracted .tar.gz archive. Example: egoexo4d-processed/takes/cmu_bike01_2/ |

OmniWorld-EgoExo4D Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one omniworld_egoexo4d_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

<Annotation Path>/
# e.g., egoexo4d-processed/takes/cmu_bike01_2/
|
β”œβ”€β”€ text/
β”‚   β”œβ”€β”€ 0-161.txt            # caption for frames 0-161
β”‚   └── 40-201.txt           # caption for frames 40-201
|
...

OmniWorld-EgoDex Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the OmniWorld-EgoDex dataset.

OmniWorld-EgoDex Organisation and File Structure

The OmniWorld-EgoDex dataset is a collection of re-annotated data derived from the EgoDex dataset. You need to download the original videos yourself.

Annotation Files

The annotation data is packaged in .tar.gz files located under OmniWorld/annotations/OmniWorld-EgoDex/.

  • Naming Convention: omniword_egodex_<start_scene_index>_<end_scene_index>.tar.gz, where the indices correspond to the scene index range within the metadata file.

Metadata Explained (omniworld_egodex_metadata.csv)

| Field Name | Description |
|---|---|
| Index | The sequential index number of the scene. |
| Video Path | The relative path of the scene in the original EgoDex dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: egodex/part1/assemble_disassemble_legos/2338/ |
| Annotation Path | The directory name for this scene's annotations inside the extracted .tar.gz archive. Example: egodex/part1/assemble_disassemble_legos/2338/ |

OmniWorld-EgoDex Usage Guide

1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding .tar.gz archive. After extracting one omniworld_egodex_<start_scene_index>_<end_scene_index>.tar.gz file, the resulting folder structure for each individual scene within the archive is as follows:

<Annotation Path>/
# e.g., egodex/part1/assemble_disassemble_legos/2338/
|
β”œβ”€β”€ text/
β”‚   β”œβ”€β”€ 0-80.txt            # caption for frames 0-80
β”‚   └── 40-120.txt           # caption for frames 40-120
|
...

License

The OmniWorld dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). By accessing or using this dataset, you agree to be bound by the terms and conditions outlined in this license, as well as the specific provisions detailed below.

  • Special Note on Third-Party Content:
    A portion of this dataset is derived from third-party game content. All intellectual property rights pertaining to these original game assets (including, but not limited to, RGB and depth images) remain with their respective original game developers and publishers.

  • Permitted Uses:
    You are hereby granted permission, free of charge, to use, reproduce, and share the OmniWorld dataset and any adaptations thereof, solely for non-commercial research and educational purposes. This includes, but is not limited to: academic publications, algorithm benchmarking, reproduction of scientific results.

Under this license, you are expressly forbidden from:

  • Using the dataset, in whole or in part, for any commercial purpose, including but not limited to its incorporation into commercial products, services, or monetized applications.

  • Redistributing the original third-party game assets contained within the dataset outside the scope of legitimate research sharing.

  • Removing or altering any copyright, license, or attribution notices.

The authors of the OmniWorld dataset provide this dataset "as is" and make no representations or warranties regarding the legality of the underlying data for any specific purpose. Users are solely responsible for ensuring that their use of the dataset complies with all applicable laws and the terms of service or license agreements of the original game publishers (sources of third-party content).

For the full legal text of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, please visit: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

Citation

If you find this dataset useful, please cite our paper:

@article{zhou2025omniworld,
      title={OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling}, 
      author={Yang Zhou and Yifan Wang and Jianjun Zhou and Wenzheng Chang and Haoyu Guo and Zizun Li and Kaijing Ma and Xinyue Li and Yating Wang and Haoyi Zhu and Mingyu Liu and Dingning Liu and Jiange Yang and Zhoujie Fu and Junyi Chen and Chunhua Shen and Jiangmiao Pang and Kaipeng Zhang and Tong He},
      journal={arXiv preprint arXiv:2509.12201},
      year={2025}
}

πŸ›‘οΈ Dataset Transparency Report

Verified data manifest for traceability and transparency.


πŸ†” Identity & Source

- id: hf-dataset--internrobotics--omniworld
- source: huggingface
- author: InternRobotics
- tags: task_categories:text-to-video, task_categories:image-to-video, task_categories:image-to-3d, task_categories:robotics, task_categories:other, language:en, license:cc-by-nc-sa-4.0, size_categories:1b, format:webdataset, modality:image, modality:text, library:datasets, library:webdataset, library:mlcroissant, arxiv:2509.12201, region:us

βš™οΈ Technical Specs

architecture
null
params billions
null
context length
null

πŸ“Š Engagement & Metrics

likes
81
downloads
27,605

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)