
Automatically split downloads in chunks for queries with >4000 records #29

Open
@tomwenseleers

Description

Just a small possible enhancement: could the download function automatically split queries into chunks whenever the length of list_of_accession_ids is >4000?
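Something along these lines could hide the chunking from the caller (just a rough sketch; the download_chunked name, the chunk_size default of 4000 and the pause between requests are my assumptions, not part of the current API):

# Rough sketch of an auto-chunking wrapper around download(); the
# chunk_size default of 4000 and the 3 s pause are assumptions based
# on my experiments below, not documented limits.
download_chunked = function(credentials, list_of_accession_ids,
                            chunk_size = 4000, pause = 3, ...) {
  chunks = split(list_of_accession_ids,
                 ceiling(seq_along(list_of_accession_ids) / chunk_size))
  parts = lapply(seq_along(chunks), function(i) {
    message("Downloading batch ", i, " out of ", length(chunks))
    if (i > 1) Sys.sleep(pause)   # pause between requests to spare the server
    download(credentials = credentials,
             list_of_accession_ids = chunks[[i]], ...)
  })
  do.call(rbind, parts)
}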

At the moment I do this myself, e.g. to fetch the most recently uploaded records, using:

library(GISAIDR)

# query for records submitted since the last download
# (GISAID_max_submdate and today are Date values set beforehand)
df = query(
  credentials = credentials,
  from_subm = as.character(GISAID_max_submdate),
  to_subm = as.character(today),
  fast = TRUE
)
dim(df) # 103356      1

# split a vector into chunks of at most chunk_length elements
chunk = function(x, chunk_length = 4000) split(x, ceiling(seq_along(x) / chunk_length))

chunks = chunk(df$accession_id)

# download each chunk in turn, pausing briefly between requests,
# and bind the results into a single data frame
downloads = do.call(rbind, lapply(seq_along(chunks), function(i) {
  message("Downloading batch ", i, " out of ", length(chunks))
  Sys.sleep(3)
  download(credentials = credentials, list_of_accession_ids = chunks[[i]])
}))
dim(downloads) # 103356     29
names(downloads)
# [1] "strain"                "virus"                 "accession_id"         
# [4] "genbank_accession"     "date"                  "region"               
# [7] "country"               "division"              "location"             
# [10] "region_exposure"       "country_exposure"      "division_exposure"    
# [13] "segment"               "length"                "host"                 
# [16] "age"                   "sex"                   "Nextstrain_clade"     
# [19] "pangolin_lineage"      "GISAID_clade"          "originating_lab"      
# [22] "submitting_lab"        "authors"               "url"                  
# [25] "title"                 "paper_url"             "date_submitted"       
# [28] "purpose_of_sequencing" "sequence"  

Even better would be to also have this parallelized (if GISAID allows that), as the above is still relatively slow: it currently takes about 1.5 hours to download these 103K records from the last 5 days. When I tried a chunk size of 5000 I received a server error, so I reduced it to 4000 and that seemed to work...
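For reference, if GISAID's API tolerated a handful of concurrent sessions, the batch loop above could be parallelized with the base parallel package along these lines (untested sketch; whether download() can safely be called from several workers sharing one credentials object is an open question):

# Untested sketch: parallel downloads over 4 workers using the base
# 'parallel' package. Assumes the GISAIDR package is available on each
# worker and that sharing one credentials object across sessions is allowed.
library(parallel)
cl = makeCluster(4)                     # number of workers; tune to taste
clusterExport(cl, c("credentials", "chunks"))
clusterEvalQ(cl, library(GISAIDR))
downloads = do.call(rbind, parLapply(cl, seq_along(chunks), function(i) {
  download(credentials = credentials, list_of_accession_ids = chunks[[i]])
}))
stopCluster(cl)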
