Description
The idea behind the RcloneListDirIter implementation on the agent/rclone side is that it should stream dir entries on the fly.
This approach has a few benefits:
- easier timeout handling
- better performance (entries can be processed while the listing is still in progress)
- less memory pressure on the agent with limited resources (no need to keep all dir entries in memory, both as Go objects and as their JSON encodings)
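To make that shape concrete, here is a minimal sketch of the streaming approach. All names (DirEntry, listFn, streamList) are hypothetical stand-ins rather than the actual scylla-manager/rclone types: every batch of entries is handed to a callback and encoded right away, so the agent never has to hold the whole listing in memory.

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // DirEntry is a hypothetical, simplified directory entry.
    type DirEntry struct {
        Name string `json:"name"`
        Size int64  `json:"size"`
    }

    // listFn receives entries as soon as they are available; returning an error aborts the listing.
    type listFn func(entries []DirEntry) error

    // streamList is the streaming shape: every batch goes straight to the callback,
    // so the agent never has to hold the whole directory in memory.
    func streamList(batches [][]DirEntry, fn listFn) error {
        for _, b := range batches {
            if err := fn(b); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        enc := json.NewEncoder(os.Stdout)
        batches := [][]DirEntry{
            {{Name: "a", Size: 1}, {Name: "b", Size: 2}},
            {{Name: "c", Size: 3}},
        }
        // Each batch is JSON-encoded and written out as soon as it arrives.
        err := streamList(batches, func(entries []DirEntry) error {
            for _, e := range entries {
                if encErr := enc.Encode(e); encErr != nil {
                    return encErr
                }
            }
            return nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }

With this shape, easier timeout handling also falls out naturally: if no chunk arrives within the deadline, the caller can abort instead of waiting for the full directory.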
The problem is that currently the "stream dir entries on the fly" part is done with whole-dir granularity.
Take a look at the rclone listing code:
    entries, err := listDir(ctx, f, includeAll, job.remote) // <- lists the whole dir
    var jobs []listJob
    if err == nil && job.depth != 0 {
        entries.ForDir(func(dir fs.Directory) {
            // Recurse for the directory
            jobs = append(jobs, listJob{
                remote: dir.Remote(),
                depth:  job.depth - 1,
            })
        })
    }
    mu.Lock()
    err = fn(job.remote, entries, err) // <- callback function responsible for streaming dir entries to SM
    mu.Unlock()
The callback is invoked only after all entries in the dir have been listed, so when listing just a single large dir (which is the most common way of using RcloneListDirIter during backup), it completely defeats the "stream dir entries on the fly" idea.
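For contrast, a sketch of the current whole-dir granularity, reusing the hypothetical DirEntry/listFn types from the sketch above: the callback fires exactly once, after everything has been collected, so for a single large dir nothing reaches SM until the very end and all entries sit in the agent's memory at the same time.

    // bufferedList mirrors the current whole-dir granularity: all entries are
    // accumulated first and the callback fires only once, at the very end.
    func bufferedList(batches [][]DirEntry, fn listFn) error {
        var all []DirEntry
        for _, b := range batches {
            all = append(all, b...) // every entry stays in memory until the end
        }
        return fn(all) // single callback - nothing is streamed along the way
    }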
On the other hand, most backends' APIs naturally support chunked listing. E.g., by default we list up to 1000 objects with a single API call to S3. Take a look at the rclone S3 listing code:
    for {
        // FIXME need to implement ALL loop
        req := s3.ListObjectsInput{
            Bucket:    &bucket,
            Delimiter: &delimiter,
            Prefix:    &directory,
            MaxKeys:   &f.opt.ListChunk, // <- 1000 by default
            Marker:    marker,
        }
        ...
        if !aws.BoolValue(resp.IsTruncated) { // <- Keep on making API calls until all objects have been listed
            break
        }
So in order to make RcloneListDirIter useful, the agent/rclone implementation should call the callback after each chunked call to S3, and not only after listing the whole dir.
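A rough sketch of the proposed behaviour, again with hypothetical names (listPage stands in for a single paginated backend call such as S3 ListObjects; DirEntry/listFn are the toy types from the earlier sketches): the callback is invoked once per page rather than once per directory.

    // listChunked calls fn after every paginated backend call instead of after
    // the whole directory has been listed.
    // listPage is a stand-in for a single "ListObjects"-style API call; it
    // returns one chunk of entries and a continuation marker ("" when done).
    func listChunked(listPage func(marker string) ([]DirEntry, string, error), fn listFn) error {
        marker := ""
        for {
            entries, next, err := listPage(marker)
            if err != nil {
                return err
            }
            // Stream this chunk to the caller right away.
            if err := fn(entries); err != nil {
                return err
            }
            if next == "" {
                return nil // no more pages - the listing is complete
            }
            marker = next
        }
    }

With this shape, SM starts receiving entries after the first 1000-object page instead of waiting for the whole prefix to be listed, which restores the benefits listed above even for a single large dir.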
Ref: https://github.com/scylladb/scylla-enterprise/issues/4861