Streamlined Searching in GA4GH-Standard Phenotypic and Clinical Data Repositories and Beyond
Documentation: https://mrueda.github.io/pheno-search
Docker Hub Image: https://hub.docker.com/r/manuelrueda/pheno-search/tags
Note that Elasticsearch is distributed under its own license; please review its terms before use.
To pull the Docker image, use the following command:
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.10.0
To run the image, execute:
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.0
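Before ingesting data, it may help to confirm the node is reachable. A request to the root endpoint should return the cluster name and version (a minimal check, not part of the repository's scripts):
curl -X GET "http://localhost:9200"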
To install jq, run:
sudo apt-get install jq
Suppose you have a file named data/individuals.json containing 100 entries. First, you'll need to convert it into the newline-delimited format expected by the Elasticsearch Bulk API:
jq -c '.[] | {"index": {"_index": "dataset1"}}, .' data/individuals.json > dataset1.json
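The resulting dataset1.json is newline-delimited JSON, with one action line preceding each document. Assuming the individuals follow a GA4GH-style schema with a diseases array, the first two lines would look roughly like this (field values are illustrative):
{"index":{"_index":"dataset1"}}
{"id":"ind001","diseases":[{"diseaseCode":{"id":"OMIM:104300","label":"Alzheimer disease, susceptibility to"}}]}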
Now perform the data ingestion:
curl -H "Content-Type: application/json" -XPOST "http://localhost:9200/index_name/_bulk?pretty" --data-binary "@dataset1.json"
With this approach, Elasticsearch's default dynamic mapping flattens arrays of objects, so the association between fields within each nested object (for example, a disease's id and its label) is lost. If maintaining nestedness is crucial, you'll need to supply a data/mapping.json
file that tells Elasticsearch which fields are nested.
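For reference, a minimal mapping that declares the diseases array as nested might look like the sketch below; the actual data/mapping.json shipped with the repository will be more complete:
{
  "mappings": {
    "properties": {
      "diseases": {
        "type": "nested",
        "properties": {
          "diseaseCode": {
            "properties": {
              "id":    { "type": "keyword" },
              "label": { "type": "text" }
            }
          }
        }
      }
    }
  }
}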
First, delete the old index:
curl -X DELETE "http://localhost:9200/dataset1"
Then, create the index with the correct structure:
curl -X PUT "http://localhost:9200/dataset1" -H 'Content-Type: application/json' --data-binary "@data/mapping.json"
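You can confirm that the nested mapping was applied with:
curl -X GET "http://localhost:9200/dataset1/_mapping?pretty"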
Now perform the data ingestion:
curl -H "Content-Type: application/json" -XPOST "http://localhost:9200/index_name/_bulk?pretty" --data-binary "@dataset1.json"
To query for "Alzheimer disease, susceptibility to", use curl:
curl -X GET "http://localhost:9200/dataset1/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "diseases",
      "query": {
        "bool": {
          "must": [
            { "match": { "diseases.diseaseCode.label": "Alzheimer disease, susceptibility to" } }
          ]
        }
      }
    }
  }
}
'
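Piping the response through jq makes it easy to pull out just the matches. For example, the following prints the total hit count and the id of each matching individual (this assumes the documents carry an id field, as in the illustrative record above):
curl -s -X GET "http://localhost:9200/dataset1/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "diseases",
      "query": { "match": { "diseases.diseaseCode.label": "Alzheimer disease, susceptibility to" } }
    }
  }
}
' | jq '.hits.total.value, (.hits.hits[]._source.id)'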
To install the required modules, run:
pip install -r requirements.txt
To execute the code, run:
python3 pheno-search.py