Automatically split downloads in chunks for queries with >4000 records #29
Open
Description
Just a small possible enhancement, but would it be possible to have the download function automatically split queries into chunks when the length of list_of_accession_ids is >4000?
At the moment I do this myself, e.g. to fetch the most recently uploaded records:
```r
df = query(
  credentials = credentials,
  from_subm = as.character(GISAID_max_submdate),
  to_subm = as.character(today),
  fast = TRUE
)
dim(df) # 103356 1
```
```r
# function to split a vector into chunks of at most chunk_length elements
chunk = function(x, chunk_length = 4000) split(x, ceiling(seq_along(x) / chunk_length))
chunks = chunk(df$accession_id)

# download each chunk and row-bind the results
downloads = do.call(rbind, lapply(seq_along(chunks), function(i) {
  message(paste0("Downloading batch ", i, " out of ", length(chunks)))
  Sys.sleep(3)
  download(credentials = credentials,
           list_of_accession_ids = chunks[[i]])
}))

dim(downloads) # 103356 29
names(downloads)
#  [1] "strain"                "virus"                 "accession_id"
#  [4] "genbank_accession"     "date"                  "region"
#  [7] "country"               "division"              "location"
# [10] "region_exposure"       "country_exposure"      "division_exposure"
# [13] "segment"               "length"                "host"
# [16] "age"                   "sex"                   "Nextstrain_clade"
# [19] "pangolin_lineage"      "GISAID_clade"          "originating_lab"
# [22] "submitting_lab"        "authors"               "url"
# [25] "title"                 "paper_url"             "date_submitted"
# [28] "purpose_of_sequencing" "sequence"
```
Even better would be to also have this parallelized (if GISAID allows that), as the above is still relatively slow: it currently takes about 1.5 hours to download these 103K records from the last 5 days. When I tried a chunk size of 5000 I got a server error, so I reduced it to 4000 and that seemed to work.
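If GISAID does tolerate concurrent requests, a minimal sketch of parallelizing the same chunks would be something like the following (using the base parallel package via forking, so Linux/macOS only, and reusing the chunks object from above; whether one set of credentials can be used from several workers at once is an open question, so I have kept the number of cores small on purpose):

```r
library(parallel)

# Hypothetical sketch: download a few chunks at a time in parallel via
# forking (mclapply), then row-bind the per-chunk results as before.
downloads = do.call(rbind, mclapply(seq_along(chunks), function(i) {
  download(credentials = credentials,
           list_of_accession_ids = chunks[[i]])
}, mc.cores = 2))
```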