Using ArchiveBox with git-annex #1592
I have a collection about 18 TiB in size, mostly of YouTube videos. It's managed by a pile of homegrown scripts that have become too messy to handle, and I just came across ArchiveBox, so I'm thinking of migrating to it. I'll worry about the reorganization myself (most likely I'll just extract the URLs from the current collection, redownload everything that still exists publicly, and move anything that doesn't into a separate manually-managed section), but I also use and like git-annex, and I'm wondering if that will pose any problems for ArchiveBox.

If you aren't aware, git-annex is a tool that finds large files in a git repository, hashes their contents, moves them into a subfolder of

The main issues that I can imagine popping up are:
But I don't know ArchiveBox's internals well enough to say for sure if these are the only issues. Is anyone else doing this? Has anyone tried and failed?
Replies: 1 comment 2 replies
I would not recommend migrating that collection to ArchiveBox for a couple reasons:
I think this may be more error-prone than you're expecting. There isn't a great way to tell whether something has redownloaded "properly", as requests will often respond with 200 OK but return a page that has a modal/popup/ad/CAPTCHA/login window hiding the content. Don't be so quick to throw out your original copies just because a new tool appears, at first glance, to have saved the URLs. I have built an LLM-powered QA system to detect issues like that in captures (for one of our paying clients), but it's not yet publicly available. (WebRecorder.net also provides good QA tooling to assess capture success rate.) If you do decide to go for it, definitely try it at a small scale first with ~1,000 videos and see if it works well for your needs.
Ok, makes sense, just wanted to make sure you're aware of the limitations for this use case.
As for your specific concerns:
Yeah I would do this out-of-band with a periodic cronjob or inotify watcher.
ArchiveBox exclusively uses `os.path.isfile(...)` for file existence, which returns `False` for broken symlinks. However, ArchiveBox won't try to re-download files once a success is recorded in the DB, even if i…
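
To make the symlink point concrete, here is a minimal standalone sketch (not ArchiveBox code) of how Python's `os.path.isfile` treats a dangling symlink, which is roughly what an annexed file looks like when its content isn't present locally. The `classify` and `demo` helpers and all file names are hypothetical, made up for illustration.

```python
import os
import tempfile


def classify(path: str) -> str:
    """Label a path the way an isfile()-based existence check would see it."""
    # os.path.exists() follows symlinks, so a link whose target is gone
    # is a link that "does not exist".
    if os.path.islink(path) and not os.path.exists(path):
        return "broken-symlink"
    # os.path.isfile() also follows symlinks: True for a regular file
    # or for a link that resolves to one.
    if os.path.isfile(path):
        return "file"
    return "other"


def demo() -> dict:
    """Build a real file, a working link, and a dangling link; classify each."""
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "real.mp4")          # hypothetical file names
        with open(real, "wb") as f:
            f.write(b"video bytes")

        good = os.path.join(tmp, "good_link.mp4")
        os.symlink(real, good)

        dangling = os.path.join(tmp, "dangling.mp4")
        os.symlink(os.path.join(tmp, "content-not-present"), dangling)

        return {
            "real": (os.path.isfile(real), classify(real)),
            "good": (os.path.isfile(good), classify(good)),
            "dangling": (os.path.isfile(dangling), classify(dangling)),
        }


results = demo()
for name, (is_file, label) in results.items():
    print(f"{name}: isfile={is_file} -> {label}")
```

An out-of-band checker could use something like `classify` to list files whose content is missing, without relying on ArchiveBox noticing them itself.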