Description
The idea behind the RcloneListDirIter implementation on the agent/rclone side is that it should stream dir entries on the fly.
This approach has a few benefits:
- easier timeout handling
- better performance (entries can be processed while the listing is still in progress)
- less memory pressure on the agent with limited resources (no need to keep all dir entries in memory, both as Go objects and as their JSON encodings)
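To make that shape concrete, here is a minimal sketch of the streaming approach. All names (DirEntry, listFn, streamList) are hypothetical stand-ins rather than the actual scylla-manager/rclone types: every batch of entries is handed to a callback and encoded right away, so the agent never has to hold the whole listing in memory.

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // DirEntry is a hypothetical, simplified directory entry.
    type DirEntry struct {
        Name string `json:"name"`
        Size int64  `json:"size"`
    }

    // listFn receives entries as soon as they are available; returning an error aborts the listing.
    type listFn func(entries []DirEntry) error

    // streamList is the streaming shape: every batch goes straight to the callback,
    // so the agent never has to hold the whole directory in memory.
    func streamList(batches [][]DirEntry, fn listFn) error {
        for _, b := range batches {
            if err := fn(b); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        enc := json.NewEncoder(os.Stdout)
        batches := [][]DirEntry{
            {{Name: "a", Size: 1}, {Name: "b", Size: 2}},
            {{Name: "c", Size: 3}},
        }
        // Each batch is JSON-encoded and written out as soon as it arrives.
        err := streamList(batches, func(entries []DirEntry) error {
            for _, e := range entries {
                if encErr := enc.Encode(e); encErr != nil {
                    return encErr
                }
            }
            return nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }

With this shape, easier timeout handling also falls out naturally: if no chunk arrives within the deadline, the caller can abort instead of waiting for the full directory.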
The problem is that currently the "stream dir entries on the fly" part is done with whole-dir granularity.
Take a look at the rclone listing code:
    entries, err := listDir(ctx, f, includeAll, job.remote) // <- lists the whole dir
    var jobs []listJob
    if err == nil && job.depth != 0 {
        entries.ForDir(func(dir fs.Directory) {
            // Recurse for the directory
            jobs = append(jobs, listJob{
                remote: dir.Remote(),
                depth:  job.depth - 1,
            })
        })
    }
    mu.Lock()
    err = fn(job.remote, entries, err) // <- callback function responsible for streaming dir entries to SM
    mu.Unlock()
The callback is invoked only after all entries in the dir have been listed, so when listing just a single large dir (which is the most common way of using RcloneListDirIter during backup), it completely defeats the "stream dir entries on the fly" idea.
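For contrast, a sketch of the current whole-dir granularity, reusing the hypothetical DirEntry/listFn types from the sketch above: the callback fires exactly once, after everything has been collected, so for a single large dir nothing reaches SM until the very end and all entries sit in the agent's memory at the same time.

    // bufferedList mirrors the current whole-dir granularity: all entries are
    // accumulated first and the callback fires only once, at the very end.
    func bufferedList(batches [][]DirEntry, fn listFn) error {
        var all []DirEntry
        for _, b := range batches {
            all = append(all, b...) // every entry stays in memory until the end
        }
        return fn(all) // single callback - nothing is streamed along the way
    }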
On the other hand, most backends' APIs naturally support chunked listing. E.g., by default we list up to 1000 objects with a single API call to S3. Take a look at the rclone S3 listing code:
    for {
        // FIXME need to implement ALL loop
        req := s3.ListObjectsInput{
            Bucket:    &bucket,
            Delimiter: &delimiter,
            Prefix:    &directory,
            MaxKeys:   &f.opt.ListChunk, // <- 1000 by default
            Marker:    marker,
        }
        ...
        if !aws.BoolValue(resp.IsTruncated) { // <- Keep on making API calls until all objects have been listed
            break
        }
So in order to make RcloneListDirIter useful, the agent/rclone implementation should call the callback after each chunked call to S3, and not only after listing the whole dir.
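A rough sketch of the proposed behaviour, again with hypothetical names (listPage stands in for a single paginated backend call such as S3 ListObjects; DirEntry/listFn are the toy types from the earlier sketches): the callback is invoked once per page rather than once per directory.

    // listChunked calls fn after every paginated backend call instead of after
    // the whole directory has been listed.
    // listPage is a stand-in for a single "ListObjects"-style API call; it
    // returns one chunk of entries and a continuation marker ("" when done).
    func listChunked(listPage func(marker string) ([]DirEntry, string, error), fn listFn) error {
        marker := ""
        for {
            entries, next, err := listPage(marker)
            if err != nil {
                return err
            }
            // Stream this chunk to the caller right away.
            if err := fn(entries); err != nil {
                return err
            }
            if next == "" {
                return nil // no more pages - the listing is complete
            }
            marker = next
        }
    }

With this shape, SM starts receiving entries after the first 1000-object page instead of waiting for the whole prefix to be listed, which restores the benefits listed above even for a single large dir.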
Ref: https://github.com/scylladb/scylla-enterprise/issues/4861