Chandan Yeshwanth*, Yueh-Cheng Liu*, Matthias Nießner and Angela Dai
ICCV 2023
- Requirements
- DSLR
- iPhone
- 3D and 2D Semantics
- Prepare 3D Semantics Training Data
- Split PTH files into chunks for training
- Visualize training data
- Prepare Semantic/Instance Ground Truth Files for Evaluation
- 3D Semantic Segmentation Evaluation
- 3D Instance Segmentation Evaluation
- Rasterize 3D Meshes onto 2D Images
- Get Semantics on 2D Images
- Select images with best coverage
- Novel View Synthesis
- Benchmarks
- Contributing
- Citation
Please refer to the official dataset documentation which describes the files in the dataset.
The recommended way of accessing individual files and directories is through the scene class.
For evaluation and submission, refer to the submission instructions.
conda create -n scannetpp python=3.10
conda activate scannetpp
pip install -r requirements.txt
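Once the environment is set up, individual files can be accessed through the scene class mentioned above. A minimal sketch (the class and attribute names below are assumptions; check `common/scene_release.py` in this repository for the actual interface):

```python
# Minimal sketch (assumed API) of accessing per-scene paths via the scene class.
# Class and attribute names are assumptions -- see common/scene_release.py.
from common.scene_release import ScannetppScene_Release

scene = ScannetppScene_Release("SCENE_ID", data_root="/path/to/scannetpp/data")

# Typical per-scene assets (attribute names may differ in your version):
print(scene.scan_mesh_path)   # laser-scan mesh of the scene
print(scene.dslr_dir)         # directory with DSLR images
print(scene.iphone_rgb_dir)   # directory with iPhone RGB frames
```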
This is the official undistortion script that generates the undistortion ground truth used in the benchmark. It produces the undistorted images, masks, and the corresponding transforms.json file for NeRF training. This is particularly useful if your method supports only images with a pinhole camera model (e.g., Gaussian Splatting).
Insert `data_root` in `dslr/configs/undistort.yml` and run:
python -m dslr.undistort dslr/configs/undistort.yml
Additionally, the input and output paths can be specified in the config file.
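As a quick sanity check, you can load the generated `transforms.json` and inspect its frames. A small sketch, assuming the usual Nerfstudio-style keys and an example output path:

```python
import json
from pathlib import Path

# Example path to the undistorted output of one scene (adjust to your config).
transforms_path = Path("DATA_ROOT/SCENE_ID/dslr/nerfstudio/transforms.json")

with open(transforms_path) as f:
    meta = json.load(f)

# Nerfstudio-style transforms typically store shared intrinsics at the top level
# and one entry per image under "frames".
print({k: meta.get(k) for k in ("fl_x", "fl_y", "cx", "cy", "w", "h")})
print("num frames:", len(meta["frames"]))
print("first frame:", meta["frames"][0]["file_path"])
```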
If you need to downscale the DSLR images to reduce the memory overhead during NeRF training, you can run the following script. The configuration is similar to the undistortion script.
python -m dslr.downscale dslr/configs/downscale.yml
Alternatively, you can use COLMAP to undistort the DSLR images (and masks) so that the output images follow a pinhole camera model. Note that the results differ from the ones generated by OpenCV.
You will need COLMAP installed to run this script.
Insert `data_root` and `output_dir` in `dslr/configs/undistort_colmap.yml` and run:
python -m dslr.undistort_colmap dslr/configs/undistort_colmap.yml
The output will be saved in `output_dir` with the following structure:
output_dir/SCENE_ID
├── colmap
│   ├── cameras.txt
│   ├── images.txt
│   └── points3D.txt
├── images
├── masks
└── nerfstudio/transforms.json
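The text files follow the standard COLMAP model format, so they can be inspected directly. A minimal, illustrative sketch for reading the camera poses from `images.txt`:

```python
from pathlib import Path

def read_colmap_images_txt(path):
    """Parse image name, quaternion and translation from a COLMAP images.txt.

    Each image occupies two lines: a header
    'IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME'
    followed by a line of 2D keypoints (skipped here).
    """
    poses = {}
    lines = [l for l in Path(path).read_text().splitlines() if not l.startswith("#")]
    for header in lines[0::2]:  # every other non-comment line is an image header
        elems = header.split()
        if len(elems) < 10:
            continue
        name = elems[9]
        qvec = tuple(map(float, elems[1:5]))  # (qw, qx, qy, qz), world-to-camera
        tvec = tuple(map(float, elems[5:8]))  # world-to-camera translation
        poses[name] = (qvec, tvec)
    return poses

poses = read_colmap_images_txt("output_dir/SCENE_ID/colmap/images.txt")
print(len(poses), "registered images")
```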
Install the python package from https://github.com/liu115/renderpy in addition to the requirements.
python -m common.render common/configs/render.yml
The output will be saved in `output_dir` with the following structure:
output_dir/SCENE_ID/[dslr, iphone]
├── render_rgb
└── render_depth
The rendered depth maps are single-channel uint16 PNGs, where the unit is mm and 0 means invalid depth.
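For example, a rendered depth map can be read back and converted to meters like this (a small sketch using OpenCV; the file name is a placeholder, and any reader that preserves 16-bit PNGs works):

```python
import cv2
import numpy as np

# Read the 16-bit depth PNG without converting it to 8-bit.
depth_mm = cv2.imread("output_dir/SCENE_ID/dslr/render_depth/IMAGE_NAME.png",
                      cv2.IMREAD_UNCHANGED)
assert depth_mm.dtype == np.uint16

valid = depth_mm > 0                            # 0 marks invalid depth
depth_m = depth_mm.astype(np.float32) / 1000.0  # millimeters -> meters
print("valid pixels:", int(valid.sum()),
      "median depth [m]:", float(np.median(depth_m[valid])))
```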
python -m iphone.prepare_iphone_data iphone/configs/prepare_iphone_data.yml
The meshes may not have a uniform distribution of vertices, and voxelizing them could lead to holes in the data. Hence, the vertices must not be treated as a point cloud.
Instead, please sample points on the surface of the mesh and use these as inputs for voxelization, etc.
An example of how to do this is given below. This script samples points on the mesh and maps the 1.5k+ raw labels to the benchmark classes. The mapping file is at `metadata/semantic_benchmark/map_benchmark.csv`.
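To illustrate the sampling idea (this is not the official prep script), a minimal Open3D sketch that samples surface points instead of using the raw vertices; the mesh path is a placeholder:

```python
import numpy as np
import open3d as o3d

# Load the scene mesh (placeholder path).
mesh = o3d.io.read_triangle_mesh("DATA_ROOT/SCENE_ID/scans/mesh.ply")

# Sample points uniformly on the surface rather than taking mesh.vertices directly,
# so that sparsely tessellated regions do not leave holes after voxelization.
pcd = mesh.sample_points_uniformly(number_of_points=1_000_000)
points = np.asarray(pcd.points)
colors = np.asarray(pcd.colors)  # in [0, 1], Open3D convention (if the mesh has colors)
print(points.shape, colors.shape)
```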
Configure the paths in semantic/configs/prepare_training_data.yml
Then run
python -m semantic.prep.prepare_training_data semantic/configs/prepare_training_data.yml
The resulting PTH files have these fields:
- `scene_id` - str, scene ID
- `sampled_coords` - `(n_samples, 3)`, coordinates of points sampled on the mesh
- `sampled_colors` - `(n_samples, 3)`, RGB colors of the points in the range [0, 1] (Open3D format)
- `sampled_labels` - `(n_samples,)`, semantic IDs 0-N according to the specified labels file
- `sampled_instance_labels` - `(n_samples,)`, instance IDs
- `sampled_instance_anno_id` - `(n_samples,)`, instance IDs corresponding to `segments_anno.json['segGroups']['id']`
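A small sketch of loading one of these PTH files and inspecting the fields (assuming the usual `torch.save`/`torch.load` serialization for `.pth` files; the path is a placeholder):

```python
import torch

# Load one prepared scene (placeholder path).
data = torch.load("PTH_DATA_DIR/SCENE_ID.pth")

print(data["scene_id"])
print(data["sampled_coords"].shape)           # (n_samples, 3)
print(data["sampled_colors"].min(), data["sampled_colors"].max())  # within [0, 1]
print(data["sampled_labels"].shape)           # (n_samples,)
print(data["sampled_instance_labels"].shape)  # (n_samples,)
```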
Split the PTH files into smaller chunks of fixed size. For training, use overlapping chunks; for validation, set the overlap to 0.
python -m semantic.prep.split_pth_data semantic/configs/split_pth_data_train.yml
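For reference, the chunking idea looks roughly like this (an illustrative sketch, not the official script): slide a fixed-size window over the scene, with the stride reduced by the overlap for training and equal to the chunk size for validation.

```python
import numpy as np

def chunk_point_cloud(coords, chunk_size=3.0, overlap=1.0):
    """Return one index array per axis-aligned XY chunk.

    chunk_size and overlap are in meters; stride = chunk_size - overlap.
    Illustrative only -- the official script is semantic.prep.split_pth_data.
    """
    stride = chunk_size - overlap
    mins, maxs = coords.min(0), coords.max(0)
    chunks = []
    x = mins[0]
    while x < maxs[0]:
        y = mins[1]
        while y < maxs[1]:
            mask = ((coords[:, 0] >= x) & (coords[:, 0] < x + chunk_size) &
                    (coords[:, 1] >= y) & (coords[:, 1] < y + chunk_size))
            if mask.any():
                chunks.append(np.nonzero(mask)[0])
            y += stride
        x += stride
    return chunks

coords = np.random.rand(10000, 3) * 10.0  # stand-in for sampled_coords
print(len(chunk_point_cloud(coords)), "chunks")
```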
Configure the PTH data dir, scene list and required outputs in semantic/configs/viz_pth_data.yml
python -m semantic.viz.viz_pth_data semantic/configs/viz_pth_data.yml
Prepare PTH files similarly to the training data step, but without point sampling. Then configure the PTH data dir, scene list and required outputs in `semantic/configs/prepare_semantic_gt.yml` and run:
python -m semantic.prep.prepare_semantic_gt semantic/configs/prepare_semantic_gt.yml
For this, you need to prepare the semantic ground truth and predictions in the following format: one file per scene named `<scene_id>.txt`, where each line contains the label(s) for the corresponding vertex in the mesh. You can specify either a single label or multiple comma-separated labels per line. Each line must have the same number of labels, i.e., each file should be an `N x 1` or `N x 3` array for 1 or 3 predictions respectively.
Configure the paths to GT, predictions, label list and downloaded data in semantic/configs/eval_semantic.yml
Then run
python -m semantic.eval.eval_semantic semantic/configs/eval_semantic.yml
See `semantic/eval/eval_instance.py` for details on the input formats.
Configure the paths to GT, predictions, label list and downloaded data in semantic/configs/eval_instance.yml
Then run
python -m semantic.eval.eval_instance semantic/configs/eval_instance.yml
(Requires PyTorch3D and a GPU)
Use this to rasterize the mesh onto DSLR or iPhone images and save the 2D-3D mappings (pixel-to-face) to file. This can later be used to get the 3D semantic and instance annotations on the 2D images.
Useful params to configure:
- `image_type`: `dslr` or `iphone`
- `image_downsample_factor`: rasterize onto downsampled images, since the 2D images have a very high resolution
- `subsample_factor`: rasterize every Nth image
- `batch_size`: batch size for rasterization
python -m semantic.prep.rasterize
(Note: this script uses a Hydra config, so there is no need to specify the config path.)
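For orientation, mesh rasterization with PyTorch3D boils down to the following (a simplified sketch with dummy geometry and an identity camera pose; the actual script additionally handles the ScanNet++ camera conventions, batching, and saving the resulting buffers):

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import PerspectiveCameras, RasterizationSettings, MeshRasterizer

device = torch.device("cuda")

# Dummy geometry; in practice, load verts/faces from the scene mesh.
verts = torch.rand(100, 3, device=device)
faces = torch.randint(0, 100, (50, 3), device=device)
meshes = Meshes(verts=[verts], faces=[faces])

H, W = 584, 876  # example (downsampled) image size
# R, T and intrinsics must follow PyTorch3D's camera conventions,
# which differ from COLMAP/OpenCV.
cameras = PerspectiveCameras(
    focal_length=((1000.0, 1000.0),), principal_point=((W / 2, H / 2),),
    image_size=((H, W),), in_ndc=False,
    R=torch.eye(3, device=device)[None], T=torch.zeros(1, 3, device=device),
    device=device,
)

raster_settings = RasterizationSettings(image_size=(H, W), blur_radius=0.0, faces_per_pixel=1)
rasterizer = MeshRasterizer(cameras=cameras, raster_settings=raster_settings)
fragments = rasterizer(meshes)

pix_to_face = fragments.pix_to_face[0, ..., 0]  # (H, W), -1 where no face is hit
zbuf = fragments.zbuf[0, ..., 0]                # (H, W) z-buffer depth
```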
Use the rasterization from the previous step to get semantic and instance annotations on the 2D images, either iPhone or DSLR. Then visualize the object IDs on the image, crop individual objects from the images, etc.
The rasterization data also contains the `zbuf` depth, which can be used for backprojection, depth estimation tasks, or filtering by distance from the camera.
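For instance, the `zbuf` depth of an undistorted (pinhole) image can be backprojected to camera-space 3D points as follows (a small sketch; the intrinsics and function name are illustrative):

```python
import numpy as np

def backproject(zbuf, fx, fy, cx, cy):
    """Backproject an (H, W) z-buffer depth map to (H, W, 3) camera-space points.

    Pixels with invalid depth (<= 0) are returned as NaN.
    """
    H, W = zbuf.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = np.where(zbuf > 0, zbuf, np.nan)
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1)

# Dummy example; use the per-scene intrinsics (e.g., from transforms.json).
points_cam = backproject(np.full((480, 640), 2.0), fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(points_cam.shape)
```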
Configure the `image_type` and `subsample_factor` as before. Use `undistort_dslr` to get semantics on the undistorted images.
python -m semantic.prep.semantics_2d
Visualized object IDs should look like this
We provide 2 useful functions to select 2D images and save the selection to a cache file:
- `scannetpp.common.utils.anno.get_best_views_from_cache`: order the subsampled images by the next best view that increases the coverage of the scene.
- `scannetpp.common.utils.anno.get_visiblity_from_cache`: find the visibility of each object in each subsampled image, to filter by the desired visibility.
Both functions save the visibility/image list to a cache file so that they don't have to be recomputed each time.
The evaluation script here is the same one that runs on the benchmark server. Therefore, it is highly encouraged to run it on the val set before submitting results to the benchmark server.
python -m eval.nvs --data_root DATA_ROOT --split SPLIT_FILE --pred_dir PRED_DIR
The PRED_DIR should have the following structure:
SCENE_ID0/
├── DSC00001.JPG
├── DSC00002.JPG
├── ...
SCENE_ID1/
├── ...
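As a rough local check before running the full evaluation, you can compare a predicted rendering against the corresponding ground-truth DSLR image, e.g. with PSNR (an illustrative sketch only; the paths are placeholders and the official script computes the complete metric set):

```python
import numpy as np
from PIL import Image

def psnr(pred_path, gt_path):
    """PSNR in dB between two images of the same resolution."""
    pred = np.asarray(Image.open(pred_path), dtype=np.float32) / 255.0
    gt = np.asarray(Image.open(gt_path), dtype=np.float32) / 255.0
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(1.0 / mse))

print(psnr("PRED_DIR/SCENE_ID0/DSC00001.JPG", "GT_DIR/SCENE_ID0/DSC00001.JPG"))
```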
This table presents the Top-1 IoU and Top-3 IoU results for different models on validation and test sets.
Method | Top-1 IoU (Val) | Top-3 IoU (Val) | Top-1 IoU (Test) | Top-3 IoU (Test) | Checkpoint | Logs |
---|---|---|---|---|---|---|
PTV3 | 0.488 | 0.733 | 0.488 | 0.725 | TBA | Wandb |
CAC | 0.484 | 0.740 | 0.483 | 0.717 | TBA | Wandb |
OACNN | 0.476 | 0.762 | 0.470 | 0.726 | TBA | Wandb |
Octformer | 0.477 | 0.737 | 0.460 | 0.691 | TBA | Wandb |
SpUNet | 0.478 | 0.723 | 0.456 | 0.683 | TBA | Wandb |
PTV2 | 0.466 | 0.741 | 0.445 | 0.688 | TBA | Wandb |
Notes:
- All Model Checkpoints will be released soon.
- Implementation code can be found on Pointcept.
- Configuration files can be found in Pointcept PR 412 (now merged into the main branch!).
- A compiled report for all methods can be found on Wandb.
This table presents the AP50 results for different models on validation and test sets.
Method | AP50 (Val) | AP50 (Test) | Checkpoint | Logs |
---|---|---|---|---|
SGIFormer | 0.411 | 0.457 | TBA | Wandb |
SPFormer | 0.421 | 0.435 | TBA | Wandb |
OneFormer3D | 0.411 | 0.433 | TBA | Wandb |
SPFormer-Pretrained Scannet | 0.419 | 0.432 | TBA | Wandb |
PointGroup | 0.147 | 0.152 | TBA | Wandb |
Notes:
- All Model Checkpoints will be released soon.
- Implementation and configuration code will be released soon (now in open PRs on Pointcept).
- A compiled report for all methods can be found on Wandb.
- Logs whose title contains `freq` report the metric averaged over 500 training steps.
Please open a PR and we'll be happy to review it!
If you find our code, dataset or paper useful, please consider citing:
@inproceedings{yeshwanth2023scannet++,
  title={Scannet++: A high-fidelity dataset of 3d indoor scenes},
  author={Yeshwanth, Chandan and Liu, Yueh-Cheng and Nie{\ss}ner, Matthias and Dai, Angela},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={12--22},
  year={2023}
}