
DB Backup / Restore Options #4

Open
gitspeaks opened this issue Apr 28, 2020 · 31 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@gitspeaks

Please provide instructions on how to do point in time online backup of a KVDB and how to do a complete KVDB restore.

@hse-root added the question (Further information is requested) label on Apr 29, 2020
@smoyerx

smoyerx commented Apr 29, 2020

HSE works with any SSD-backed volume (e.g., LVM, SAN array, cloud volume), so you can take a snapshot of the volume you configured for your capacity media class using the method appropriate for your volume manager. Best practice is to halt the KVDB application to flush all data prior to taking the snapshot; otherwise you will end up with a crash-consistent snapshot.

If you configure both a capacity and staging media class, you need to snapshot the volume associated with each. In this case, you need to halt the KVDB application to ensure the two volumes are in sync when you take the snapshots.

@gitspeaks
Author

Thanks. This method is effectively an “offline backup”. Please consider enhancing the engine to support backing up the database to a file while the engine is running, so the file can simply be moved off the machine after the backup completes.

@hse-root added the enhancement (New feature or request) label on Apr 29, 2020
@smoyerx

smoyerx commented Apr 29, 2020

We anticipate that HSE will most often be used as part of an application that has its own backup method. For example, we integrated HSE with MongoDB as a proof point, and users of MongoDB with HSE would then likely use one of the several backup methods that MongoDB provides (e.g., mongodump).

That said, there is certainly utility in a native dump/restore method, which we'll consider for future inclusion.

@gitspeaks
Author

We anticipate that HSE will most often be used as part of an application that has its own backup method.

As far as HSE is concerned, what would be the correct way (API) to enumerate all KVs in all DBs in order to read a consistent snapshot of the entire dataset?

@gitspeaks
Author

Do you include a KVDB/KVS snapshot version in each individual KVDB/KVS/KV object?

@smoyerx

smoyerx commented May 1, 2020

There is an experimental API (include/hse/hse_experimental.h) to dump and restore a KVDB that likely does what you want. Keep in mind that we use it primarily as an internal utility, so it hasn't been through rigorous testing.

@gitspeaks
Author

Thanks. I’ll run some tests. Is this export API designed to run concurrently alongside other reader/writer threads, or does it lock out writer threads while it writes out the KVDB to the file?

@davidboles

As alluded to above, the import/export APIs are experimental. For something designed to be embedded within something else, where the boundaries of a backup interface should lie isn't entirely clear. Input from the community would be greatly appreciated.

As to what is currently implemented, there are two experimental API entry points: hse_kvdb_export_exp() and hse_kvdb_import_exp() (both declared in hse_experimental.h). These are effectively only usable for offline backup. The first takes an open KVDB handle, a params argument, and a target path for the output. The second takes an mpool name and a source path.
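For illustration only, here is a minimal sketch of how the two calls might be wired together, based solely on the description above; the params type, the hse_err_t return type, the paths and mpool name, and the exact signatures are assumptions to verify against hse_experimental.h for your release.

#include <stdio.h>

#include <hse/hse.h>
#include <hse/hse_experimental.h>   /* declares the *_exp() entry points */

/* Sketch: offline export of an open KVDB, then import into a new mpool.
 * Parameter shapes follow the description above and may not match the
 * header exactly. */
static hse_err_t
backup_then_restore(struct hse_kvdb *kvdb, struct hse_params *params)
{
    hse_err_t err;

    /* Export every KVS in the KVDB to files under /backups/kvdb1
     * (hypothetical path). The KVSs must not already be open; the call
     * blocks until the export completes. */
    err = hse_kvdb_export_exp(kvdb, params, "/backups/kvdb1");
    if (err) {
        fprintf(stderr, "export failed: %ld\n", (long)err);
        return err;
    }

    /* Restore into the mpool named "mp_restore" (hypothetical) from the
     * exported files. */
    err = hse_kvdb_import_exp("mp_restore", "/backups/kvdb1");
    if (err)
        fprintf(stderr, "import failed: %ld\n", (long)err);

    return err;
}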

For hse_kvdb_export_exp(), all of the KVSs in the KVDB are exported, and as of the current code none of them may already be open. Each KVS is opened in turn, a cursor over it is created, and a work job is started to export that KVS. The export call then waits until all the exports complete, closes the KVSs, and returns.

To your earlier question, "Do you include a KVDB/KVS snapshot version in each individual KVDB/KVS/KV object?", the answer is "yes". The hse_kvdb_export_exp() function can be enhanced, either a little or a lot ... community feedback is key in charting that path.

One possible enhancement would be to allow it to be called on a KVDB with open KVSs and take an ephemeral snapshot at the beginning of the call so that each KVS would be dumped at the same view. That call would have to return after the export work has started, and there would have to be a status interface for the client to check on export progress (including cancelling it, etc.). The enclosing application would then only have to quiesce itself across the call to hse_kvdb_export_exp().

@gitspeaks
Author

@davidboles Thanks for clarifying!

One possible enhancement would be to allow it to be called on a KVDB with open KVSs and take an ephemeral snapshot at the beginning of the call so that each KVS would be dumped at the same view. That call would have to return after the export work has started, and there would have to be a status interface for the client to check on export progress (including cancelling it, etc.).

Yes, these would be key requirements defining the backup part of the solution, but I have yet to understand how you manage the snapshot version per object in a way that supports creating such an ephemeral snapshot.

Alternatively, going back to the initial suggestion of taking a snapshot of the disk volume: assuming I can tolerate write downtime for the duration of a disk snapshot (which should be several seconds at worst) and can ensure in my application code that only reader threads, not writer threads, interact with the engine API, what should I do to ensure ALL memory buffers affecting the integrity of the stored data are flushed to disk before I invoke the snapshot operation?

Also, can you relax hse_kvdb_export_exp() to work with a pre-opened KVS?

@smoyerx

smoyerx commented May 2, 2020

You can call hse_kvdb_flush() to ensure that the data from all committed transactions and completed standalone operations is on stable media.
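As a rough sketch of that flush-then-snapshot sequence, assuming the application has already quiesced its writers; the single-argument hse_kvdb_flush() signature shown is an assumption to check against hse.h, and the volume snapshot itself is left to the external volume manager.

#include <stdio.h>

#include <hse/hse.h>

/* Sketch: push all committed/completed data to stable media before an
 * external volume snapshot is taken. Writers are assumed to be paused
 * by the application before this is called. */
static hse_err_t
flush_for_snapshot(struct hse_kvdb *kvdb)
{
    hse_err_t err;

    err = hse_kvdb_flush(kvdb);
    if (err) {
        fprintf(stderr, "hse_kvdb_flush failed: %ld\n", (long)err);
        return err;
    }

    /* ... take the volume snapshot here (LVM, EBS, SAN array, ...) and
     * resume writers once it has been created ... */

    return 0;
}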

@gitspeaks
Author

@smoyergh Thanks. Regarding my last query about relaxing hse_kvdb_export_exp() to work with a pre-opened KVS, I should have added that this may allow an additional backup option if the dump process turns out to be quick, resulting in low write downtime. The plus here is of course avoiding the additional complexity of dealing with volume snapshot programs and copying volumes to off-machine storage.

@smoyerx

smoyerx commented May 5, 2020

A method to dump a consistent snapshot of a live (online) KVDB is to create a transaction, which creates an ephemeral snapshot of all KVSs in the KVDB, and then use transaction snapshot cursors to iterate over all KV pairs in each KVS.
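A minimal sketch of that approach for a single KVS follows; the opspec-based binding of the cursor to the transaction, the hypothetical dump_kvs_snapshot() helper, and the exact cursor/transaction signatures are assumptions to check against hse.h for your release.

#include <stdbool.h>
#include <stdio.h>

#include <hse/hse.h>

/* Sketch: scan one KVS through a cursor bound to a transaction so that
 * every KV pair is read from the same ephemeral snapshot. The
 * serialization to 'out' is a placeholder. */
static hse_err_t
dump_kvs_snapshot(struct hse_kvdb *kvdb, struct hse_kvs *kvs, FILE *out)
{
    struct hse_kvdb_opspec  os;
    struct hse_kvdb_txn    *txn;
    struct hse_kvs_cursor  *cur;
    const void *key, *val;
    size_t klen, vlen;
    bool eof = false;
    hse_err_t err;

    HSE_KVDB_OPSPEC_INIT(&os);

    txn = hse_kvdb_txn_alloc(kvdb);
    err = hse_kvdb_txn_begin(kvdb, txn);
    if (err)
        goto out_free;

    os.kop_txn = txn;    /* bind the cursor to the transaction's view */

    err = hse_kvs_cursor_create(kvs, &os, NULL, 0, &cur);
    if (err)
        goto out_abort;

    /* Iterate every KV pair visible in the transaction's snapshot. */
    while (!(err = hse_kvs_cursor_read(cur, NULL, &key, &klen, &val, &vlen, &eof)) && !eof) {
        fwrite(key, 1, klen, out);    /* placeholder: key bytes */
        fwrite(val, 1, vlen, out);    /* placeholder: value bytes */
    }

    hse_kvs_cursor_destroy(cur);

out_abort:
    hse_kvdb_txn_abort(kvdb, txn);    /* read-only, so nothing to commit */
out_free:
    hse_kvdb_txn_free(kvdb, txn);
    return err;
}

To cover a whole KVDB, the same transaction would be shared across one cursor per KVS rather than allocated per call as in this simplified sketch.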

Long-term we can consider stable APIs for both live dump and pause (allowing time to take volume snapshots).

@gitspeaks
Author

A method to dump a consistent snapshot of a live (online) KVDB is to create a transaction, which creates an ephemeral snapshot of all KVSs in the KVDB, and then use transaction snapshot cursors to iterate over all KV pairs in each KVS.

Having zero knowledge about how things work internally, I can only speculate that executing such “backup transactions” on a live DB may delay reclamation of the storage associated with objects that are modified while traversing and writing out the dump. If that’s true, how can I monitor that impact in terms of increased KVDB/KVS size and I/O?

@davidboles

Having zero knowledge about how things work internally, I can only speculate that executing such “backup transactions” on a live DB may delay reclamation of the storage associated with objects that are modified while traversing and writing out the dump. If that’s true, how can I monitor that impact in terms of increased KVDB/KVS size and I/O?

The impact would be in terms of space amplification, not I/O. Newer data is found first, so the older, as-yet-unreclaimed data generally wouldn't be accessed by query activity. We do not currently expose space amplification data. The engine does keep estimates of garbage levels using HyperLogLog-based mechanisms, but we don't publish that info.

Taking a few steps back: if you have a database that is substantially overwritten in the time it takes to perform an export, then there will be a space-amp penalty that you will have to account for. Such a database is unlikely to be very large; if it were, you wouldn't be able to overwrite most of it in the export interval.

@gitspeaks
Author

About reclaiming the garbage, I assume this would be done during compaction (e.g., the hse_kvdb_compact() API). Is this the only way available to force GC “now”? Does it block access to KVDBs? Otherwise, how can I know when garbage is actually reclaimed?

@smoyerx

smoyerx commented May 6, 2020

Compaction is done in the background, as needed, both for GC and to optimize storage layout. And there are multiple flavors of compaction tied to how we organize storage. The hse_kvdb_compact() API exists to proactively put the system into a known state of "compactedness". We use it internally primarily for benchmarking to get consistent results.

@gitspeaks
Author

Aside from having things neat for benchmarking, I'm not clear on when one would use this API.

From the hse_kvdb_compact() description in hse.h:

In managing the data within an HSE KVDB, there are maintenance activities that occur
as background processing. The application may be aware that it is advantageous to do
enough maintenance now for the database to be as compact as it ever would be in
normal operation.

What info does the engine publish that can be used by the application to determine if it's advantageous to do enough maintenance "now"?

@smoyerx

smoyerx commented May 6, 2020

Some applications may have a natural period of time when the load is low, and could choose to call hse_kvdb_compact() as a proactive action to take care of maintenance at that time. The API only compacts down to certain thresholds related to garbage and elements of the storage organization that influence performance, so it won't do any more work than necessary to achieve these thresholds. Hence a metric of when it is advantageous wouldn't be all that beneficial.

But again, we implemented it to support consistency in benchmarks, not because we really need the application to call it for regular operation.

@gitspeaks
Author

I think it would be beneficial if you exposed some performance counters for this process, including what you consider the “performance thresholds”, to allow users to observe the engine dynamics during a particular workload.

@davidboles

I think it would be beneficial if you exposed some performance counters for this process, including what you consider the “performance thresholds”, to allow users to observe the engine dynamics during a particular workload.

If you call hse_kvdb_compact() on a KVDB, access to that KVDB is in no way restricted. Further, if you initiate such a compaction operation on an approximately idle database, wait until that completes, and then invoke it again, the call is effectively a no-op. Looking in hse/include/hse/hse.h we find this structure:

struct hse_kvdb_compact_status {
    unsigned int kvcs_samp_lwm;  /**< space amp low water mark (%) */
    unsigned int kvcs_samp_hwm;  /**< space amp high water mark (%) */
    unsigned int kvcs_samp_curr; /**< current space amp (%) */
    unsigned int kvcs_active;    /**< is an externally requested compaction underway */
};

That struct can be retrieved via hse_kvdb_compact_status().
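As a hedged illustration, a sketch that requests a compaction and polls that status struct until the externally requested work completes; the flags argument, the hse_kvdb_compact_status() call shape (named as in the comment above), and the compact_and_wait() helper are assumptions to verify against hse.h.

#include <stdio.h>
#include <unistd.h>

#include <hse/hse.h>

/* Sketch: proactively compact a KVDB during a quiet period and wait for
 * the externally requested compaction to finish, reporting the space-amp
 * watermarks along the way. */
static hse_err_t
compact_and_wait(struct hse_kvdb *kvdb)
{
    struct hse_kvdb_compact_status st = { 0 };
    hse_err_t err;

    err = hse_kvdb_compact(kvdb, 0 /* assumed: default thresholds */);
    if (err)
        return err;

    do {
        sleep(1);
        err = hse_kvdb_compact_status(kvdb, &st);
        if (err)
            return err;
        printf("samp: curr=%u%% lwm=%u%% hwm=%u%% active=%u\n",
               st.kvcs_samp_curr, st.kvcs_samp_lwm, st.kvcs_samp_hwm,
               st.kvcs_active);
    } while (st.kvcs_active);

    return 0;
}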

There are in fact many performance counters in the system that are exposed via a REST API against a UNIX socket. We will be documenting that interface in the future. See hse/include/hse/kvdb_perfc.h for those. Not all are enabled by default. You can see usage of the REST interface in hse/cli/cli_util.c.

@gitspeaks
Author

@davidboles Thanks! This is all reassuring. Hopefully more documentation will arrive soon.
BTW, are you aware of any gotchas building the project on Ubuntu?

@smoyerx

smoyerx commented May 7, 2020

HSE currently supports RHEL 7.7 and 8.1. We have not tested other platforms.

Porting HSE will likely require changes to the mpool kernel module. We are working with the Linux community to get mpool upstream, but that will take some time.

In advance of that, we are looking at adding support for Ubuntu 18.04.4. However, we cannot commit to a specific time frame.

@gitspeaks
Author

we are looking at adding support for Ubuntu 18.04.4

That would be great!! My environment is based on Ubuntu, so I'll continue to experiment with the project once you provide the build instructions.

@victorstewart

@smoyergh I use Clear Linux, so I'll report any issues if I face them. I assume running inside a CentOS/RHEL Docker container would not eliminate any potential issues, since such issues would arise from inside the kernel?

@victorstewart

@davidboles

To add my perspective here: I'm currently building a Redis Enterprise-like clone based around HSE. Because distributed databases rely on active replication, and thus you get "backups for free", my only whole-database backup needs revolve around seeding new instances/clusters (most likely in new datacenters, over the network).

So I've been researching how to best accomplish this today after being pointed in the right direction by @smoyergh.

My intuition is that no serialization scheme such as hse_kvdb_export_exp() could compare to the performance (and zero downtime for free) of 1) creating an LVM snapshot, 2) dd-ing the volume to a file, and then 3) concurrently compressing it to make it sparse. Is that right?

@smoyerx

smoyerx commented Aug 16, 2020

A container uses the host kernel, and we have never tested mpool with the Clear Linux kernel. We are very close to posting our next release, which simplifies mpool considerably and makes it run with a wider range of kernels. So if the current version of mpool doesn't work with Clear Linux, the soon-to-be-released version might.

That said, the upcoming release has only been tested with RHEL 7, RHEL 8, and Ubuntu 18.04.

@smoyerx

smoyerx commented Aug 16, 2020

Regarding backup performance, I agree that backing up an mpool volume snapshot via whatever mechanism is native to the environment (whether LVM, AWS EBS, a SAN array volume, etc.) is going to be faster than the HSE serialization APIs.

@victorstewart

https://github.com/datto/dattobd

Dattobd seems like a better option than creating a snapshot volume: you just create a snapshot file.

@victorstewart

Another variable here is that if you're trying to send your backup over the network (or just replicate your entire dataset to another machine), serialization might actually be faster when the occupancy of your volume is low.

Maybe you only have 1 GB of data in a 500 GB volume. If you snapshot that, then dd + compress + ssh it over the network, you still have to run through those 500 GB.

Also, if you want to orchestrate this in applications (with requester and sender applications on opposing machines), it's much simpler to just hse_kvdb_export_exp(), pass it over the network, and hse_kvdb_import_exp() it than to snapshot on demand with dattobd, transfer the block data, then create the volumes, mount them, etc.

@victorstewart

Some new thoughts on backups: I'm going to be running my database shards over replicated Portworx volumes, so there is implicit duplication at the storage layer. But one could also run over Btrfs and take snapshots of the data directory while the application is running and unaffected (read: no downtime).

@alexttx
Contributor

alexttx commented Jan 3, 2023

Filesystem snapshots taken while a KVDB is open may result in a corrupt snapshot. HSE would need a "pause" method as described in an earlier comment by @smoyergh.
