Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: Ec optimisations main collaboration branch #60782

Open
wants to merge 53 commits into
base: main
Choose a base branch
from

Conversation

aainscow
Copy link

We have multiple people working EC optimisations. This pull request is here to check that the basic acceptance tests are passing. Anyone is welcome to review it!

I will add more detail to the PR as and when it is expected to no longer be a draft.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

aainscow and others added 30 commits November 19, 2024 11:32
The align function does not need to determine whether an interval overlaps an existing interval,
because the insert code has been changed recently to allow multiple inserts of overlapping
regions.

Also worth noting that the original code did not quite do the right thing and may have led to
overlapping inserts anyway.

Also, added a unit test and fixed a template issue with align.

Signed-off-by: Alexander H Ainscow <[email protected]>

interval_set: Allow interval_sets to be aligned
The current implementation of interval set only allows inserts to be made when the new interval does not overlap the old.  The "union_insert" method works around this by creating a new interval map, then performing a union of the two sets - an inefficient process.

This commit adapts the insert method to have the flexibility of union_insert, but with the efficiency of the old insert method.

We should add a similar erase method in the future, although there is no need for this PR.

Signed-off-by: Alexander H Ainscow <[email protected]>
Before this commit, the erase function would assert that intervals to be erased entirely
exist within an existing interval.  This leads to a number of issues:

1. Easy to create hard-to-test edge conditions for client code.
2. Often calling code would do inefficient thing, such a execute intersect, then subtract.
3. It is a particular surprise that this issue also applies to subtract.

This new version of the erase method allows for any sized interval to be used in erase. This
allows for both non or partially overlapping erases, but also can recurse over multiple
contained regions, making it much more convenient to erase.

The new code should be as efficient for existing erases supported by the code.  For a single
non-overlapping erase, the complexity is O(log n) and the constant penalty should be
roughly the same.

For multiple erases, the complexity is somewhere between O(log n) and O(n), scaling with
the number of entries which must be erased.

The complexity of subtract is the same as before O(n log m), but should have a lower constant
than previously held.

I have also removed various workarounds used in interval_set required by the limitations
of the old erase and insert functions.

The added unit tests should be extensive enough to give confidence that the new code has a
functional superset of the old code.

Signed-off-by: Alexander H Ainscow <[email protected]>
* More fixes for multi-region reads
* Enable partial stripe reads that span stripes

Signed-off-by: Bill Scales <[email protected]>
As preperation for later commits, here we move the decision about which
extents are read to an earlier point in the read operation.

Later commits will then extend this to perform fewer reads.

Signed-off-by: Alexander H Ainscow <[email protected]>
In a previous commit, there was some inefficient code, which:
1. Creates a pair
2. Passes pair to function
3. Function does some trivial maths, then populates a new pair
4. Pair is returned and immediately deconstructed.

This is inefficient, but possibly more importantly generated messy code.  This code is a minor refactor to remove some of the pairs.

Signed-off-by: Alexander H Ainscow <[email protected]>
The restrictions here assert that IOs are rounded to chunk size, which is no longer a requirement.

Signed-off-by: Alexander H Ainscow <[email protected]>
These are utilities for consumption by future commits. These
equality operators will largely be used by test harnesses and
are intented to work intuitively.

Signed-off-by: Alexander H Ainscow <[email protected]>
Add a harness to mock up ECCommon::ReadPipeline, thus allowing more testing
of functions with read pipeline.

This is a very minimal stub and will be extended when required for tests.

Signed-off-by: Alexander H Ainscow <[email protected]>
Adapt the functions which determine the extents of each shard that
need to be read, such that they pick the minimal reads possible.

This code was tested with subsequent test harness commits only and is
not expected to work outside such an environment, as the IO complete
path cannot yet cope with minimal reads.

There is quite a bit of fiddly integer maths going on here and this
will be neatened up in a later commit once further testing has taken
place and this maths is no longer undergoing frequent changes.

Signed-off-by: Alexander H Ainscow <[email protected]>
Further code-neatness enhancements to remove temporary pair objects.

It is noted that the compiler likely optimises out such pairs,
however, where they reduce readability, they should be removed.

Signed-off-by: Alexander H Ainscow <[email protected]>
…tests.

Fix some warnings from clang-tidy. Also fix a unit test which was broken
due to a rebase.

Signed-off-by: Alexander H Ainscow <[email protected]>
This test was added to debug a problem found in an more complex test.

Signed-off-by: Alexander H Ainscow <[email protected]>
…equests.

FIXME: Need more detail

Signed-off-by: Alexander H Ainscow <[email protected]>
This adds a new setting called osd_ec_partial_reads_experimental.

When enabled, the EC code will issue much more optimal reads,
however there are known scenarios in which reads will not be
recovered correctly. This most likely will lead to OSD processes
crashing and data becoming inaccessible.

When disabled, most of the new code is still being exercised, but
much of the efficiency is lost, particularly for large chunk sizes.

Signed-off-by: Alexander H Ainscow <[email protected]>
…function.

There is already a FIXME comment in this function, but asserting here is a
good prompt to make sure this is fixed in the future.

Signed-off-by: Alexander H Ainscow <[email protected]>
…shards

Following a review comment, observing that read_request is never null and
as such can be passed by reference, rather than by pointer.

A further review request suggested that this function should return
read_request in a tl::expected wrapper. However, this ends up with passing
this function more parameters to enable it to construct the read request,
thus making the code less efficient, so I did not implement this.

Signed-off-by: Alexander H Ainscow <[email protected]>
This method returns true if the interval entirely contains the specified range, without gaps.

Signed-off-by: Alexander H Ainscow <[email protected]>
Signed-off-by: Alexander H Ainscow <[email protected]>
Major changes
1. Move the bulk of the logic to ECUtil, for re-use in the write path
2. Remove references to reads in the method.
3. Move from using vectors to maps for want to read shards.

The vectors initially were thought to be more efficient, but with multiple
loops over the vector, this probably wasnt the case (although I never
measured any performance).

Signed-off-by: Alexander H Ainscow <[email protected]>
When performing optimised reads, prior to this commit, if the experimental
flag was enabled, then recovery would be impossible if it involved re-reading
an existing chunk.

For example. A client performs a read which covers end of chunk A and the start of chunk B. If chunk B fails, then the start of chunk A is required for recovery.

Signed-off-by: Alexander H Ainscow <[email protected]>
This utility tracks an extent map per shard for erasure coding.  It provides a number of capabilities tailored to the requirements of the EC write path.

This commit is starting a move to a new convention with erasure coding referring to which address space offsets/lengths refer to. The convention will be:

"Rados object offset/size/start/end/length" refers to offsets within the object itself. This will be shortened to a prefix of "ro_" for the most part.

"Shard offset/size/start/end/length" refers to offsets within the individual shards. Typically both "shard" and "offset" are required. No shortening has been established as yet.

Examples:

insert_in_shard(shard, offset, bufferlist)
  Insert a buffer into a particular shard.

insert_ro_extent_map?(ro_extent_map)
  Convert ro_extent_map to shard addressing, then insert (merge) it into the sharded extent maps.

Additionally, the (generalised) get_min_shards function has been renamed to reflect its primary role of converting between ro and shard address spaces.

Signed-off-by: Alexander H Ainscow <[email protected]>
The sinfo structure is more efficient to access and now stashes:
* k
* m
* shard mapping (aka chunk mapping) - map of shard indexing used in plugin to the k-then-m indexing assumed by many utilities.

Signed-off-by: Alexander H Ainscow <[email protected]>
This commit should disappear on a reabase.

Signed-off-by: Alexander H Ainscow <[email protected]>
Reorganisers the main function to be more modular, moving functionality into a TestRunner object and anonymous namespaces.
Adds a new "interactive" test mode that allows you to direct the order of operations rather than using the preset sequences. This allows complete precise control of the tool when you want to test specific IOs.

Signed-off-by: Jon Bailey <[email protected]>
Signed-off-by: Alexander H Ainscow <[email protected]>
1. Add interval_set.erase_after, to avoid fiddling around with interval_set.erase(X, large_value).
2. Optimise "subtract" so that it ignores non-overlapping ranges efficiently.
3. Add interval_map interface to get the first extent, without the last.
1. Do not send a dummy write if there are any IOs or commit
in flight at all to that PG.
2. Change the request interface to a two phase approach, to allow the caller to schedule multuple cache requests.
3. Remove some unnecessary debug code (check)buffers_pinned call)
1. Remove some dead code
2. Tweak some variable names
3. Move new methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants