RcloneListDirIter: call callback before listing the whole dir #4132

Closed as not planned
@Michal-Leszczynski

The idea of the RcloneListDirIter implementation on the agent/rclone side is that it should stream dir entries on the fly.
This approach has a few benefits:

  • easier timeout handling
  • better performance
  • less memory pressure on an agent with limited resources (no need to hold all dir entries in memory as both Go objects and JSON encodings)

The problem is that currently the "stream dir entries on the fly" part is done with whole-dir granularity.
Take a look at the rclone listing code:

entries, err := listDir(ctx, f, includeAll, job.remote) // <- lists the whole dir
var jobs []listJob
if err == nil && job.depth != 0 {
	entries.ForDir(func(dir fs.Directory) {
		// Recurse for the directory
		jobs = append(jobs, listJob{
			remote: dir.Remote(),
			depth:  job.depth - 1,
		})
	})
}
mu.Lock()
err = fn(job.remote, entries, err) // <- callback function responsible for streaming dir entries to SM
mu.Unlock()

The callback is called only after all entries in the dir have been listed, so when listing just a single large dir (which is the most common way of using RcloneListDirIter during backup), the "stream dir entries on the fly" benefit is lost entirely.

On the other hand, most backends' APIs naturally support chunked listing. E.g., by default we list up to 1000 objects with a single API call to S3. Take a look at the rclone s3 listing code:

for {
	// FIXME need to implement ALL loop
	req := s3.ListObjectsInput{
		Bucket:    &bucket,
		Delimiter: &delimiter,
		Prefix:    &directory,
		MaxKeys:   &f.opt.ListChunk, // <- 1000 by default
		Marker:    marker,
	}
...
	if !aws.BoolValue(resp.IsTruncated) { // <- Keep on making API calls until all objects have been listed
		break
	}

So in order to make RcloneListDirIter useful, the agent/rclone implementation should call the callback after each chunked call to s3, and not only after listing the whole dir.

Ref: https://github.com/scylladb/scylla-enterprise/issues/4861

Labels

backup, bug (Something isn't working)
