Utilities to convert datasets to JSON and access RESTful data on NeuroJSON.io
- Maintainer: Qianqian Fang (q.fang at neu.edu)
- License: BSD 3-Clause License
- Version: 0.5
- Website: https://neurojson.org
- Docker image: https://hub.docker.com/r/openjdata/neuroj
Table of contents
- Terminology
- Introduction
- Data conversion strategies and recommendations
- Utility overview
- neuroj
- njprep
- Dependencies
- "database": a collection of datasets, such as OpenNeuro, Dandi, BrainLife etc
- "dataset" : a group of data files that are acquired for a specific study/purpose, may have multiple subjects folders
- "subject folder" : a folder storing data obtained from a study subject (human or animal)
- "data file": a single data file that is part of a dataset
- "BIDS dataset": a dataset that is organized following the BIDS standard https://bids.neuroimaging.io/
The NeuroJSON project (https://neurojson.org), funded by the US National Institutes of Health (NIH), is aimed at promoting and curating scalable, searchable, and reusable neuroimaging datasets among the communities. The NeuroJSON project adopts JSON and binary JSON formats as the primary underlying data formats to reinforce searchability and scalability. JSON is an internationally standardized format that is universally supported across a wide range of programming environments. JSON has a vast toolchain ecosystem that can be readily applied for processing neuroimaging data once converted. Specifically, modern document-store and NoSQL databases, such as Redis, CouchDB, MongoDB, or JSON type support in MySQL or SQLite, provide rapid and extensive search capability that can easily handle millions of datasets at scale.
This toolbox provides a set of lightweight, easy-to-use shell-based utilities to convert neuroimaging datasets from native modality-specific data formats to JSON, and subsequently allow users to upload their JSON-encoded data to NeuroJSON.io, our primary document-store database built upon an open-source CouchDB server, for sharing and data publication. The utilities also provide convenient functions to query all free/public datasets provided on NeuroJSON.io, search specific datasets and data records for reusing those in secondary data analysis or testing. The CouchDB server exposes all provided dataset information using intuitive RESTful APIs for neuroimaging end-users and tool developers to query, combine and download JSON-encoded data that are relevant to their project.
The migration from zip-file based neuroimaging data sharing to modern NoSQL database based data dissemination not only greatly enhances the scalability of neuroimaging datasets, but also makes datasets findable, searchable, and easy to integrate with diverse data analysis tools. This also prepares the community towards building complex data analysis pipelines that require interoperable data exchange between complex software tools running on the cloud or web-based applications.
We map neuroimaging datasets and dataset collections to CouchDB/NoSQL database object hierarchies. The table below shows the conceptual mapping of the data logical structures to the CouchDB object hierarchies.
Data logical structure | CouchDB object | Examples |
---|---|---|
a dataset collection | a CouchDB database | openneuro, dandi, openfnirs,... |
a dataset | a CouchDB document | ds000001, ds000002, ... |
files and folders related to a subject | JSON keys inside a document | sub-01, sub-1/anat/scan.tsv,... |
human-readable binary content (small) | an attachment to a document | .png, .jpg, .pdf, ... |
non-searchable binary content (large) | _DataLink_ JSON key | "_DataLink_":"http://url/to/ds/filehash.jbd" |
A CouchDB server (like other NoSQL database engines) can hold and process enormous numbers of databases (i.e. collections) and documents (i.e. datasets) in each database. However, for high-performance search capabilities, our CouchDB server follows the best-practice recommendations and sets the maximum document size to 8 MB. That means the searchable JSON-encoded content of a dataset should be limited to ~10-15 MB if stored in raw JSON text files (after parsing, the data size will be reduced). In practice, the searchable content in most existing datasets can fit within this limit (for example, over 90% of OpenNeuro datasets have less than 10 MB of raw JSON after conversion). After capping the sizes of large .tsv files, nearly all OpenNeuro datasets can be stored in a CouchDB document.
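The size constraint above can be checked before upload with a few lines of shell. This is a minimal sketch, not part of the toolbox: the function name `fits_in_document` is hypothetical, the default limit mirrors the 8 MB figure quoted above, and counting 1 MB as 1024*1024 bytes is an assumption. Since raw JSON text shrinks after parsing, comparing the raw file size against the limit is a conservative test; pass a larger limit as the second argument if that is too strict.

```shell
# Hypothetical helper (not part of this toolbox).
# usage: fits_in_document FILE [LIMIT_BYTES]
# Succeeds if FILE is no larger than LIMIT_BYTES (default: 8 MB, assumed
# to mean 8*1024*1024 bytes); returns 2 if FILE does not exist.
fits_in_document() {
    [ -f "$1" ] || return 2
    limit=${2:-$((8 * 1024 * 1024))}
    # wc -c counts bytes; tr removes the padding some wc versions print
    size=$(wc -c < "$1" | tr -d '[:space:]')
    [ "$size" -le "$limit" ]
}

# Example: report converted dataset files that exceed the limit
# (the folder path is a placeholder):
#   for f in /path/to/output/json/folder/*.json; do
#       fits_in_document "$f" || echo "too large: $f"
#   done
```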
It is highly recommended to encode only human-readable and searchable data in the JSON-encoded datasets and offload the non-searchable binary data to externally linked files. This way, the JSON document can be small and easy to query, download and manipulate. CouchDB can perform complex searches of a database containing millions of small documents (kB) in a fraction of a second (MongoDB can be even faster; Redis offers the fastest speed if the entire database can fit in the memory of the server).
Primary tools:
- neuroj: NeuroJSON client - the primary utility that calls other tools to convert and query NeuroJSON.io
- njprep: a bash script to convert databases, single dataset or single data file
Helper functions (called by neuroj and njprep)
- bids2json: a utility to merge converted dataset JSON files to a single datasetname.json file for upload
- link2json: a utility to create a JSON file for symbolic links
- listdatalink: a utility to list/extract all URL/externally linked data files (_DataLink_) for batch download
- mergejson: a bash/jq script to merge all converted files under a subject-folder to a single subject.jbids file
- tsv2json: a Perl script to convert tsv/csv to JSON
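To illustrate what listing externally linked files involves, the sketch below pulls the values of all `_DataLink_` keys out of a JSON file using only grep and sed. It is an illustration under stated assumptions, not the implementation of the bundled listdatalink utility: the function name `list_links` is hypothetical, and it assumes link values contain no escaped double quotes.

```shell
# Hypothetical helper (not the bundled listdatalink utility).
# Print the value of every "_DataLink_" key found in a JSON file, one per line.
# Assumes the link values contain no escaped double quotes.
list_links() {
    grep -o '"_DataLink_"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
        sed 's/^"_DataLink_"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Example: download every http(s) link referenced by a dataset file
# (requires curl and network access; the file name is a placeholder):
#   list_links dataset.json | grep '^http' | xargs -n 1 curl -O
```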
The diagram below shows the input/output folder structures of the data conversion.
[input folder] [output folder]
/orig/data/collection/root => /converted/json/root
| |----------------------------- **database1.json** -> push to NeuroJSON.io CouchDB
|-- dataset1/ |-- dataset1/ ^
| |-- dataset_description.json => (copy) | |-- dataset_description.json | merge by `bids2json`
| |-- README => (convert) | |-- README.jbids |
| | | |---------------------- **subj-01.jbids**
| |-- sub-01/ | |-- sub-01/ \ ^
| | |-- sub-01_scans.tsv => (convert) | | |-- sub-01_scans.tsv.json | |
| | |-- anat/ | | |-- anat/ |--| merge by `mergejson`
| | |-- sub-01_T1w.nii.gz => (convert) | | |-- sub-01_T1w.nii.gz.json |
| | |-- sub-01_events.tsv | | |-- sub-01_events.tsv.json
| | |-- sub-01-file <symlink> -> git/annex/...| |-- sub-01-file.json: ["_DataLink_":"symlink:git/annex/..."]
| | | |---------------------- **subj-02.jbids**
| |-- sub-02/ | |-- sub-02/
| | |-- ... => | | |-- ...
| |----------------------------- **database2.json** -> push to NeuroJSON.io CouchDB
|-- dataset2/ |-- dataset2/
| |-- ... => | |-- ...
|... |...
|.att/ # attachment data files -> upload to your preferred or NeuroJSON server
|-- dataset1/
| |-- md5_pathhash_file1-zlib.jdb
| |-- md5_pathhash_file2-zlib.jdb
| |-- ...
|-- dataset2/
| |-- md5_pathhash_file1-zlib.jdb
| |-- md5_pathhash_file2-zlib.jdb
| |-- ...
...
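The file naming shown in the diagram can be summarized as a small mapping function. This is only a sketch of the cases drawn above (dataset_description.json is copied, README becomes README.jbids, and the .tsv and .nii.gz examples gain a .json suffix); the function name `converted_name` is hypothetical, and applying the .json suffix to every other file is an assumption rather than documented njprep behavior.

```shell
# Hypothetical helper: map an input file path (relative to the input root)
# to the converted file path (relative to the output root), following the
# cases drawn in the diagram.
converted_name() {
    case "${1##*/}" in
        dataset_description.json) printf '%s\n' "$1" ;;        # copied as-is
        README)                   printf '%s.jbids\n' "$1" ;;  # README -> README.jbids
        *)                        printf '%s.json\n' "$1" ;;   # assumed default: append .json
    esac
}

converted_name dataset1/README
converted_name dataset1/sub-01/sub-01_scans.tsv
converted_name dataset1/sub-01/anat/sub-01_T1w.nii.gz
```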
neuroj is the NeuroJSON client script that provides most of the functionalities. It calls njprep to perform batched and parallel dataset/datafile conversion to JSON, and also supports listing, searching, and downloading databases and datasets from NeuroJSON.io, our open data dissemination portal. NeuroJSON.io shares open datasets publicly and permits anonymous access.
For neuroimaging dataset creators, uploaders and collection administrators, neuroj can also perform administrative tasks such as uploading new JSON-encoded datasets to an existing database, updating a dataset JSON document with a new revision, deleting old versions, and other maintenance commands supported by the CouchDB REST API.
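In the CouchDB REST API, a database is addressed at `/{db}` and a document at `/{db}/{docid}`, so a dataset collection and a single dataset each have their own URL. The sketch below builds those URLs; it is an illustration, not how neuroj constructs its requests. The helper name `nj_url` is hypothetical, and the root URL defaults to the server address listed with the -u flag unless the NEUROJSON_IO variable is set.

```shell
# Hypothetical helper (not part of this toolbox).
# usage: nj_url DATABASE [DATASET]
# Print the REST URL of a collection (CouchDB database) or of one dataset
# (CouchDB document) under the server root.
nj_url() {
    root=${NEUROJSON_IO:-https://neurojson.io:7777}
    if [ -n "${2:-}" ]; then
        printf '%s/%s/%s\n' "${root%/}" "$1" "$2"
    else
        printf '%s/%s\n' "${root%/}" "$1"
    fi
}

nj_url openneuro            # URL of the 'openneuro' collection
nj_url openneuro ds000001   # URL of a single dataset document

# To fetch a dataset document directly (requires curl and network access):
#   curl -s "$(nj_url openneuro ds000001)"
```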
Command format: neuroj -flag1 <param1> -flag2 <param2> ...
Supported flags include
-i/--input folderpath path to the top folder of a data collection (such as OpenNeuro)
-o/--output folderpath path to the output folder storing the converted JSON files
-db/--database dbname database name (such as openneuro, openfnirs, dandi etc)
-ds/--dataset dataset dataset name (a single dataset in a collection, such as ds000001)
-v/--rev revision dataset revision key hash
-r/--convert convert database (-db) or dataset (-db .. -ds ..) (in parallel) to JSON
-t/--threads num set the thread number for parallel conversion (4 by default)
-l/--list list all database if -db is given; or the dataset if both -db/-ds are given
-q/--info query database info if -db is given; or dataset info if both -db/-ds are given
-f/--find '{"selector":{...},...}' use the CouchDB _find API to search datasets
retrieve/upload NeuroJSON.io (except -g, other functions require -U/-P or NEUROJSON_IO environment variable)
-g/--get/--pull retrieve and display JSON encoded dataset, or complete database (slow)
-p/--put/--push dataset.json upload JSON data to a database (-db) and dataset (if -ds is missing, use file name), (admin only)
-c/--create create a specified database (-db), (admin only)
-d/--delete delete specified dataset (-ds) from a database (-db), (admin only)
-u/--url https://... CouchDB REST API root url, use https://neurojson.io:7777 (default) or use NEUROJSON_IO env variable
-n read from $HOME/.netrc (Linux/MacOS) or %HOME%/_netrc for username/password for admin tasks
--netrc-file /path/netrcfile same as -n, specify netrc file path (see https://everything.curl.dev/usingcurl/netrc)
-U/--user username set username for admin tasks (unless use -n or -c or NEUROJSON_IO URL has user info)
-P/--pass password set password for admin tasks (unless use -n or -c or NEUROJSON_IO URL has password info)
neuroj accepts 3 ways to set username/password if you are running admin tasks (create/upload/update/delete datasets). Using curl with -n/--netrc-file is the recommended approach as it does not leave passwords in the commands or system logs.
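A netrc file is a plain-text file holding `machine <host> login <user> password <pass>` entries. The sketch below writes one such entry with owner-only permissions and reads the password from standard input so that it does not appear on the command line. The helper name `write_netrc` and the example host and user are hypothetical; the function overwrites the target file and assumes the login and password contain no whitespace.

```shell
# Hypothetical helper (not part of this toolbox).
# usage: write_netrc FILE HOST USER      (password is read from standard input)
# Writes a single netrc entry to FILE, replacing any previous content, and
# restricts the file to the current user.
write_netrc() (
    umask 077
    IFS= read -r password || [ -n "$password" ] || exit 1
    printf 'machine %s login %s password %s\n' "$2" "$3" "$password" > "$1" &&
        chmod 600 "$1"
)

# Example (placeholders; type the password followed by Enter):
#   write_netrc /path/netrcfile neurojson.io myname
#   neuroj --netrc-file /path/netrcfile -db openneuro --info
```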
If curl cannot be installed, neuroj attempts to use the Perl module LWP::UserAgent to communicate with the server. In this case, users may set an environment variable NEUROJSON_IO in the form of https://user:pass@example.com:port. If user/pass contains special characters, they must be URL-encoded. This way, the neuroj command will not show any password in the log. If you are on a secure computer, using -U/-P will also allow LWP::UserAgent to authenticate.
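URL-encoding (percent-encoding) replaces every character outside the unreserved set `A-Z a-z 0-9 - . _ ~` with `%` followed by its two-digit hexadecimal code. The following is a minimal sketch for ASCII credentials only; the function name `urlencode`, the credentials, host and port are all placeholders chosen for illustration.

```shell
# Hypothetical helper: percent-encode an ASCII string for use in a URL.
# Non-ASCII input is not handled by this sketch.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}         # everything after the first character
        c=${s%"$rest"}      # the first character
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;   # e.g. '@' -> %40
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Placeholder credentials, host and port, for illustration only.
NEUROJSON_IO="https://$(urlencode 'my user'):$(urlencode 'p@ss:w/rd')@example.com:7777"
export NEUROJSON_IO
```

In practice the password would be read from a prompt or a protected file rather than typed into a script.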
# print help information and flags
neuroj
# preview commands (dry-run, does not execute) to convert a database (including all included datasets) to JSON
neuroj -i /path/to/database/rootfolder -o /path/to/output/json/folder -db openneuro
# convert a database (a collection of datasets) to JSON in batch with 12 parallel threads (1 thread per dataset)
neuroj -i /path/to/database/rootfolder -o /path/to/output/json/folder -db openneuro --convert -t 12
# convert a single dataset (including all files under sub-folders) to JSON in parallel (1 thread per file)
neuroj -i /path/to/database/rootfolder -o /path/to/output/json/folder -db openneuro -ds ds000001 --convert
# convert a database to JSON, assuming the last segment of the input path as database name
neuroj -i /path/to/database/databasename -o /path/to/output/json/folder
# convert a single dataset ds000001 to JSON (dry-run only, add --convert to run)
neuroj -i /path/to/database/databasename -o /path/to/output/json/folder -ds ds000001
# list all databases currently hosted on NeuroJSON.io
neuroj --list
# list all databases currently hosted on NeuroJSON.io, format the output with jq
neuroj --list | jq '.'
# list all datasets stored under the 'openneuro' database
neuroj --list -db openneuro
# list all datasets stored under the 'openneuro' database and print only the dataset IDs using a jq filter
neuroj --list -db openneuro | jq '.rows[] | .id'
# query server-level information of the database openneuro (return dataset count, file sizes etc)
neuroj --info -db openneuro
# search the IDs of the first 10 (limit:10) datasets in the openneuro database starting from the 3rd (skip:2) dataset
neuroj -db openneuro --find '{"selector":{},"fields":["_id"],"limit":10,"skip":2}'
# query the dataset name "_id" and "dataset_description.json" records of the first two datasets in "openneuro" and format with jq
neuroj -db openneuro --find '{"selector":{},"fields":["_id","bids_dataset_info.dataset_description\\\\.json"],"limit":2}' | jq '.'
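The `limit` and `skip` fields used in the --find examples above make it possible to page through a large database. The sketch below generates the request body for one page of dataset IDs; the helper name `find_ids_query` is hypothetical, and the commented neuroj calls need neuroj and network access.

```shell
# Hypothetical helper: build a CouchDB _find request body that returns only
# the "_id" field of LIMIT datasets, skipping the first SKIP.
# usage: find_ids_query LIMIT SKIP
find_ids_query() {
    printf '{"selector":{},"fields":["_id"],"limit":%d,"skip":%d}\n' "$1" "$2"
}

find_ids_query 10 2
# → {"selector":{},"fields":["_id"],"limit":10,"skip":2}

# Paging through 'openneuro' 100 IDs at a time:
#   neuroj -db openneuro --find "$(find_ids_query 100 0)"
#   neuroj -db openneuro --find "$(find_ids_query 100 100)"
```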
njprep is a neuroimaging-data file to JSON converter following the general principles of the NeuroJSON project - that is to separate a dataset into a human-readable/searchable part and a non-searchable/binary data part.
The human-readable part is stored in the JSON format and can be readily uploaded to modern document-store databases to allow data analyses to scale to large datasets, making the data searchable, findable and universally accessible and parsable. The human-readability of the data format also ensures future reusability.
The non-searchable data are stored in binary JSON, or their original formats and can be stored externally while still being associated with the searchable JSON data using links, URLs or stored as "attachments" to the JSON document. They can be "re-united" with the searchable data on-demand to restore the full dataset for data analysis.
For conversion of human-readable data files, njprep currently supports .json, .tsv, .csv, and various text files (.txt/.md/.rst); for a limited set of neuroimaging data files, such as .nii.gz and .snirf, njprep parses the file header into JSON while storing the rest in binary files. njprep also converts symbolic links to a special JSON element to maintain the linkage. Other human-readable documentation files, such as .png, .jpg, and .pdf, are stored as attachments.
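The handling rules above amount to a dispatch on file type. The sketch below expresses them as a shell `case` statement; it is not njprep's code. The function name `classify_file` and the category labels are hypothetical, and only the extensions named in this document are listed.

```shell
# Hypothetical helper: report how a file would be treated according to the
# rules described above.
classify_file() {
    if [ -L "$1" ]; then
        echo link                    # symbolic link -> special JSON element
        return
    fi
    case "$1" in
        *.json|*.tsv|*.csv|*.txt|*.md|*.rst) echo text ;;           # converted to JSON
        *.nii|*.nii.gz|*.snirf)              echo header+binary ;;  # header to JSON, rest to binary
        *.png|*.jpg|*.pdf)                   echo attachment ;;     # stored as attachment
        *)                                   echo unknown ;;        # not covered by this sketch
    esac
}

classify_file sub-01/sub-01_scans.tsv
classify_file sub-01/anat/sub-01_T1w.nii.gz
```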
# convert all datasets in a database to JSON
njprep /database/root/ /output/json/root/ database_name
# convert a specific dataset "dataset_name" in a specific database to JSON
njprep /database/root/ /output/json/root/ database_name dataset_name
# convert a single file or subject-folder in a given dataset
njprep /database/root/ /output/json/root/ database_name dataset_name /path/to/a/file
For Linux and Mac OS:
- jq
- curl
- GNU Octave
- jbids (https://github.com/NeuroJSON/jbids, including 4 submodules under tools)
- libparallel-forkmanager-perl (for Parallel::ForkManager)
- libwww-perl (for LWP::UserAgent)
- libjson-xs-perl (for JSON::XS)
For Windows: please first install cygwin64 (http://cygwin.com/) or MSYS2 (http://msys2.org/) and also install the above packages in the corresponding cygwin64/msys2 installers.
When converting datasets with neuroj or njprep, conversion of some data files, such as .snirf or .nii/.nii.gz, requires octave and the jbids toolbox (including its submodules). Other functionalities do not require octave.
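Before running a conversion, it can help to confirm that the command-line dependencies are on the PATH. The sketch below is one way to do that; the function name `missing_tools` is hypothetical, and the executable names (`jq`, `curl`, `octave`, `perl`) are the usual ones for these packages.

```shell
# Hypothetical helper: print the name of every listed command that cannot
# be found on the PATH (prints nothing if all are available).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing_tools jq curl octave perl

# Perl modules can be probed in a similar way, for example:
#   perl -MJSON::XS -e1 2>/dev/null || echo "JSON::XS missing"
```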