DB Backup / Restore Options #4
HSE works with any SSD-backed volume (e.g., LVM, SAN array, cloud volume), so you can take a snapshot of the volume you configured for your capacity media class using the method appropriate for your volume manager. Best practice is to halt the KVDB application to flush all data prior to taking the snapshot; otherwise you will have a crash-consistent snapshot. If you configure both a capacity and a staging media class, you need to snapshot the volume associated with each. In this case, you need to halt the KVDB application to ensure the two volumes are in sync when you take the snapshots.
Thanks. This method is effectively an "offline backup". Please consider enhancing the engine to support backing up the database, while the engine is running, to a file that can simply be moved off the machine after the backup completes.
We anticipate that HSE will most often be used as part of an application that has its own backup method. For example, we integrated HSE with MongoDB as a proof point, and users of MongoDB with HSE would then likely use one of the several backup methods that MongoDB provides (e.g., mongodump). That said, there is certainly utility in a native dump/restore method, which we'll consider for future inclusion.
As far as HSE is concerned, what would be the correct way (API) to enumerate all KVs in all DBs in order to read a consistent snapshot of the entire dataset?
Do you include a KVDB/KVS snapshot version in each individual KVDB/KVS/KV object?
There is an experimental API (include/hse/hse_experimental.h) to dump and restore a KVDB that likely does what you want. Keep in mind that we use it primarily as an internal utility, so it hasn't been through rigorous testing.
Thanks. I'll run some tests. Is this export API designed to run concurrently alongside other reader/writer threads, or does it lock out writer threads while it writes out the KVDB to the file?
As alluded to above, the import/export APIs are experimental. For something designed to be embedded within something else, where the boundaries of a backup interface should be isn't entirely clear. Input from the community would be greatly appreciated.

As to what is currently implemented, there are two experimental API entry points: hse_kvdb_export_exp() and hse_kvdb_import_exp() (these are declared in hse_experimental.h). These are effectively only usable for offline backup. The first takes an open KVDB handle, a params object, and a target path for the output. The second takes an mpool name and a source path. For hse_kvdb_export_exp(), all of the KVSs in the KVDB are exported, and none of them may already be open as of the current code. In turn, each KVS is opened, a cursor over it created, and a work job started to export that KVS. The call to export then waits until all the exports complete, closes the KVSs, and returns. To your earlier question, "Do you include KVDB/KVS snapshot version in each individual KVDB/KVS/KV object?", the answer is "yes".

The hse_kvdb_export_exp() function can be enhanced, either a little or a lot; community feedback is key in charting that path. One possible enhancement would be to allow it to be called on a KVDB with open KVSs and take an ephemeral snapshot at the beginning of the call so that each KVS would be dumped at the same view. That call would have to return after the export work has started, and there would have to be a status interface for the client to check on export progress (including cancelling it, etc.). The enclosing application would then only have to quiesce itself across the call to hse_kvdb_export_exp().
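To make the calling convention concrete, here is a minimal sketch of an offline export followed by an import, assuming the HSE 1.x experimental signatures (an hse_err_t return, a struct hse_params pointer for the params argument, and plain path strings). The mpool name ("mpool1") and backup path ("/backup/kvdb1") are placeholders; hse_experimental.h in your tree is the authoritative reference, since these entry points are explicitly experimental.

```c
#include <stdio.h>

#include <hse/hse.h>
#include <hse/hse_experimental.h>

/* Sketch only: signatures assumed from HSE 1.x hse_experimental.h; the mpool
 * name and backup path are placeholders and error handling is abbreviated. */
int backup_then_restore(struct hse_kvdb *kvdb)
{
    hse_err_t err;

    /* Export every KVS in the KVDB; per the current implementation, none of
     * the KVSs may already be open when this is called. */
    err = hse_kvdb_export_exp(kvdb, NULL /* params */, "/backup/kvdb1");
    if (err) {
        fprintf(stderr, "export failed\n");
        return -1;
    }

    /* Later: rebuild the KVDB in the target mpool from the exported files. */
    err = hse_kvdb_import_exp("mpool1", "/backup/kvdb1");
    if (err) {
        fprintf(stderr, "import failed\n");
        return -1;
    }

    return 0;
}
```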
@davidboles Thanks for clarifying!
Yes, these would be key requirements defining the backup part of the solution, but I have yet to understand how you manage a snapshot version per object in a way that supports creating such an ephemeral snapshot. Alternatively, going back to the initial suggestion of taking a snapshot of the disk volume: assuming I can tolerate write downtime for the duration of a disk snapshot (which should be several seconds at worst), and I ensure in my application code that no write thread interacts with the engine API while reader threads continue, what should I do to ensure ALL memory buffers affecting the integrity of the stored data are flushed to disk before I invoke the snapshot operation?
You can call hse_kvdb_flush() to ensure that the data from all committed transactions, and completed standalone operations, is on stable media.
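As a sketch of that sequence, assuming hse_kvdb_flush() takes just the open KVDB handle (as in the HSE 1.x header), and where the pause/resume hooks and the snapshot trigger are hypothetical pieces of the surrounding application rather than HSE APIs:

```c
#include <hse/hse.h>

/* Hypothetical application hooks; only hse_kvdb_flush() below is an HSE call. */
extern void pause_writers(void);          /* quiesce application write threads  */
extern void resume_writers(void);
extern int  take_volume_snapshot(void);   /* e.g., LVM/EBS/SAN snapshot trigger */

int snapshot_kvdb_volumes(struct hse_kvdb *kvdb)
{
    int rc = -1;

    pause_writers();

    /* Push data from committed transactions and completed standalone
     * operations to stable media before snapshotting the volume(s). */
    if (!hse_kvdb_flush(kvdb))
        rc = take_volume_snapshot();

    resume_writers();
    return rc;
}
```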
@smoyergh Thanks. Regarding my last query about relaxing hse_kvdb_export_exp() to work with a pre-opened KVS, I should have added that this may allow an additional backup option if the dump process turns out to be quick, resulting in low write downtime. The plus here is of course avoiding the additional complexity of dealing with volume snapshot programs and volume copying for off-machine storage.
A method to dump a consistent snapshot of a live (online) KVDB is to create a transaction, which creates an ephemeral snapshot of all KVSs in the KVDB, and then use transaction snapshot cursors to iterate over all KV pairs in each KVS. Long-term we can consider stable APIs for both live dump and pause (allowing time to take volume snapshots).
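A rough sketch of that pattern, assuming the HSE 1.x opspec-based cursor API (struct hse_kvdb_opspec, the HSE_KVDB_KOP_FLAG_BIND_TXN flag, and hse_kvs_cursor_read()). write_pair() is a hypothetical serialization callback, and the exact flag and function names should be checked against hse.h for your release:

```c
#include <stdbool.h>
#include <stddef.h>

#include <hse/hse.h>

/* Hypothetical callback that serializes one KV pair to the dump target. */
extern void write_pair(const void *key, size_t klen, const void *val, size_t vlen);

/* Dump all KV pairs of one KVS at the snapshot view established by the
 * transaction, while other readers and writers continue against the live
 * KVDB. Error handling is abbreviated for brevity. */
int dump_kvs_at_txn_view(struct hse_kvdb *kvdb, struct hse_kvs *kvs)
{
    struct hse_kvdb_txn   *txn = hse_kvdb_txn_alloc(kvdb);
    struct hse_kvs_cursor *cur = NULL;
    struct hse_kvdb_opspec os;
    const void            *key, *val;
    size_t                 klen, vlen;
    bool                   eof = false;
    int                    rc  = -1;

    HSE_KVDB_OPSPEC_INIT(&os);

    hse_kvdb_txn_begin(kvdb, txn);              /* establishes the ephemeral snapshot */

    os.kop_txn   = txn;
    os.kop_flags = HSE_KVDB_KOP_FLAG_BIND_TXN;  /* bind the cursor to that view */

    if (hse_kvs_cursor_create(kvs, &os, NULL, 0, &cur))
        goto out;

    while (!hse_kvs_cursor_read(cur, NULL, &key, &klen, &val, &vlen, &eof) && !eof)
        write_pair(key, klen, val, vlen);

    rc = 0;
    hse_kvs_cursor_destroy(cur);
out:
    hse_kvdb_txn_abort(kvdb, txn);              /* read-only use, so abort is fine */
    hse_kvdb_txn_free(kvdb, txn);
    return rc;
}
```

Repeating this per KVS within a single transaction would give every KVS the same view, which is the ephemeral-snapshot behavior described above.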
Having zero knowledge about how things work internally, I can only speculate that executing such "backup transactions" on a live DB may delay reclamation of the storage associated with objects that are modified while traversing and writing out the dump. If that's true, how can I monitor that impact in terms of increased KVDB/KVS size and I/O?
The impact would be in terms of space amplification, not I/O. Newer data is found first, so the older, as yet un-reclaimed data generally wouldn't be accessed by query activity. We do not currently expose space amplification data. The engine does keep estimates of garbage levels using HyperLogLog-based mechanisms, but we don't publish that info. Taking a few steps back: if you have a database that is substantially overwritten in the time it takes to perform an export, then there will be a space-amp penalty that you will have to account for. Such a database is unlikely to be very large; if it was, you wouldn't be able to overwrite most of it in the export interval.
About reclaiming the garbage: I assume this would be done during compaction (e.g., the hse_kvdb_compact() API). Is this the only way available to force GC "now"? Does it block access to KVDBs? Otherwise, how can I know when garbage is actually reclaimed?
Compaction is done in the background, as needed, both for GC and to optimize storage layout. And there are multiple flavors of compaction tied to how we organize storage. The hse_kvdb_compact() API exists to proactively put the system into a known state of "compactedness". We use it internally primarily for benchmarking to get consistent results.
Aside from having things neat for benchmarking, I'm not clear on when one would use this API (see the hse_kvdb_compact() description in hse.h).
What info does the engine publish that can be used by the application to determine if it's advantageous to do enough maintenance "now"?
Some applications may have a natural period of time when the load is low, and could choose to call hse_kvdb_compact() as a proactive action to take care of maintenance at that time. The API only compacts down to certain thresholds related to garbage and elements of the storage organization that influence performance, so it won't do any more work than necessary to achieve these thresholds. Hence a metric of when it is advantageous wouldn't be all that beneficial. But again, we implemented it to support consistency in benchmarks, not because we really need the application to call it for regular operation.
I think it would be beneficial if you expose some performance counters for this process, including what you consider as "performance thresholds", to allow users to observe the engine dynamics during a particular workload.
If you call hse_kvdb_compact() on a KVDB, access to that KVDB is in no way restricted. Further, if you initiate such a compaction operation on an approximately idle database, wait until it completes, and then invoke it again, the second call is effectively a no-op. Looking in hse/include/hse/hse.h you will find a struct hse_kvdb_compact_status, which can be retrieved via hse_kvdb_compact_status(). There are in fact many performance counters in the system that are exposed via a REST API against a UNIX socket; see hse/include/hse/kvdb_perfc.h for those (not all are enabled by default). We will be documenting that interface in the future. You can see usage of the REST interface in hse/cli/cli_util.c.
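For illustration, here is a sketch of a proactive-compaction call that polls that status struct until the work drains. It assumes HSE 1.x-era details, namely a flags argument of 0 for hse_kvdb_compact() and field names kvcs_samp_lwm, kvcs_samp_hwm, kvcs_samp_curr, and kvcs_active; the definitions in hse.h for your release are authoritative.

```c
#include <stdio.h>
#include <unistd.h>

#include <hse/hse.h>

/* Start a proactive compaction and wait for it to finish, printing the
 * space-amp estimates along the way. Flag value and field names are assumed
 * from the HSE 1.x header and should be verified against hse.h. */
int compact_and_wait(struct hse_kvdb *kvdb)
{
    struct hse_kvdb_compact_status st = { 0 };

    if (hse_kvdb_compact(kvdb, 0))          /* 0 flags: assumed default request */
        return -1;

    do {
        sleep(1);
        if (hse_kvdb_compact_status(kvdb, &st))
            return -1;
        printf("samp cur=%u lwm=%u hwm=%u active=%u\n",
               st.kvcs_samp_curr, st.kvcs_samp_lwm, st.kvcs_samp_hwm, st.kvcs_active);
    } while (st.kvcs_active);

    return 0;
}
```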
@davidboles Thanks! This is all reassuring. Hopefully more documentation will arrive soon.
HSE currently supports RHEL 7.7 and 8.1. We have not tested other platforms. Porting HSE will likely require changes to the mpool kernel module. We are working with the Linux community to get mpool upstream, but that will take some time. In advance of that, we are looking at adding support for Ubuntu 18.04.4. However, we cannot commit to a specific time frame.
That would be great! My environment is based on Ubuntu, so I'll continue to experiment with the project once you provide the build instructions.
@smoyergh I use Clear Linux, so I'll report any issues if I face them. I assume running inside a CentOS/RHEL Docker container would not eliminate any potential issues, since such issues would arise from inside the kernel?
To add my perspective here: I'm currently building a Redis Enterprise-like clone based around HSE. Because distributed databases rely on active replication, and thus you get "backups for free", my only whole-database backup needs orbit around seeding new instances/clusters (most likely in new datacenters, over the network). So I've been researching how to best accomplish this today, after being pointed in the right direction by @smoyergh. My intuition is that no serialization scheme such as …
A container uses the host kernel, and we have never tested mpool with the Clear Linux kernel. We are very close to posting our next release, which simplifies mpool considerably and makes it run with a wider range of kernels. So if the current version of mpool doesn't work with Clear Linux, the soon-to-be-released version might. That said, the upcoming release has only been tested with RHEL 7, RHEL 8, and Ubuntu 18.04.
Regarding backup performance, I agree that backing up an mpool volume snapshot via whatever mechanism is native to the environment (whether LVM, AWS EBS, a SAN array volume, etc.) is going to be faster than the HSE serialization APIs.
dattobd (https://github.com/datto/dattobd) seems like a better option than creating a snapshot volume: you just create a snapshot file.
Another variable here is that if you're trying to send your backup over the network (or just replicate your entire dataset to another machine), serialization might actually be faster when the occupancy of your volume is low; maybe you only have 1GB of data in a 500GB volume, so if you snapshot that, … Then also, if you want to orchestrate this in applications (with requester + sender applications on opposing machines), it's way simpler to just …
Some new thoughts on backups: I'm going to be running my database shards over replicated Portworx volumes, so there is implicit duplication at the storage layer. But one could also run over Btrfs and take snapshots of the data directory while the application is running and unaffected (read: no downtime).
Filesystem snapshots taken while a KVDB is open may result in a corrupt snapshot. HSE would need a "pause" method as described in an earlier comment by @smoyergh.
Please provide instructions on how to do a point-in-time online backup of a KVDB and how to do a complete KVDB restore.