[INLG2023] The High-Level (HL) dataset is a Vision and Language (V&L) resource that aligns object-centric descriptions from COCO with high-level descriptions crowdsourced along three axes: scene, action, and rationale.
dataset image-captioning image2text vision-and-language multimodal-data huggingface-datasets multimodal-grounding
Updated Nov 13, 2023