help.org

How to read this document

It is highly recommended you view this page by clicking the Help button in the extension’s own UI. Doing that will make this page interactive: the settings popup will be displayed on the right on this page and hovering over or clicking on any links pointing to popup.html will highlight those elements in the popup.

See screenshots if you want to see how it will look.

You can still read this page outside of the extension’s UI, but be prepared for all links pointing to popup.html to be useless. Also, the version hosted on the author’s web site is superior to what GitHub’s web UI renders: this page is written in org-mode markup language and uses a lot of its advanced markup features to simplify things, so converting it to GitHub Markdown would make things much harder, and GitHub does not render org-mode files very well at the moment.

What?

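Hoardy-Web is a browser extension (add-on) that passively captures and collects dumps of HTTP requests and responses as you browse the web, and then archives them using one or more of the archival methods described in “Step 4: Archival” below (exporting via saveAs, submitting to an archiving server via HTTP, or saving into the browser’s local storage).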

To view your archived data, see the accompanying hoardy-web CLI tool (also there).

Conventions

When you open this document by clicking the Help button in extension’s UI, this page has two parts: this help text, and an iframe with a completely unrolled popup UI in it.

The whole page will switch between single- and two-column layouts depending on available viewport width (which depends on device width and zoom level). In single-column layout the popup UI is placed after the end of the help text. In two-column layout they are placed side-by-side.

In both layouts:

  • links that look like this are references to elements of the popup UI,
    • in single-column layout, clicking such a link will scroll the whole page to the corresponding element in the popup UI iframe and then highlight it;
    • in two-column layout, clicking or hovering over such a link will scroll only the popup UI iframe around to put the corresponding referenced element into view and then highlight it;
    • the highlighted element will stop being highlighted if you click anywhere else on the page;
  • links that look like this are references to other parts of this page; clicking these links will scroll this page to the place they point to and then highlight the relevant part;
  • links that look like this are references to other internal pages of Hoardy-Web; clicking them will navigate this tab there;
  • finally, links like this are references to external URLs.

In cases when clicking on a link scrolls this page around or navigates to another page, pressing the “Back” button of your browser will get you back to the exact link you clicked and then highlight it, making it easy to get back to reading from the exact place you left off.

**Go forth and try it by clicking one or more of the above links.**

The above rules also apply on all other internal pages of Hoardy-Web, e.g. the Changelog page.

General operation

Glossary

  • A reqres (REQuest + RESponse) is an internal object containing captured information about an HTTP request and its response, including their headers and data, and some meta-information (whether it originates from an extension, tabId it originates from, its state, etc).

State Diagram

Reqres change their internal states according to the following state diagram (which is explained below):

(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body received)
   |                           |                              |             |
   |                           v                              v             v
   |                     (no_response)                   (incomplete)   (complete)
   |                           |                              |             |
   |                           \                              |             |
   |\---> (canceled) ----\      \                             |             |
   |                      \      \                            \             |
   |\-> (incomplete_fc) ---\      \                            \            v
   |                        >------>---------------------------->-----> (finished)
   |\--> (complete_fc) ----/                                             /  |
   |                      /                                             /   |
   \----> (snapshot) ----/       /- (collected) <--------- (picked) <--/    |
                                /        ^                     |            |
               (stashIO?) <----/         |                     v            v
                   |                     \-- (in_limbo) <- (stashIO?) <- (dropped)
                   v                              |                         |
                (queued) <--------------------\   |                         |
                / |  ^ \                       \  \-----> (discarded) <-----/
  (exported) <-/  |  |  \-------------------\   \              ^
      |           |  |                       \   \             |
      |       /---/  \-----------------\      \   \            |
      |       |                        |       \   \           |
      |       v                        |        \   \          |
      |\-> (srvIO) -> (stashIO?) -> (unarchived) |   \         |
      |       |                        ^        /    |         |
      |       |                        |    /--/     |         |
      |       v                        |    v        |         |
      |   (submitted) --------------> (saveIO) --> (saved)     | {{!saving}}
      |       \                                                |
      \-------->-----------------------------------------------/

Step 1: Tracking

Hoardy-Web attaches to your browser’s runtime and tracks progress of HTTP requests and their responses, capturing both their request and response headers and data at appropriate times in the browser’s request and response processing pipeline.

Whether Hoardy-Web will track a given request depends on the Track new requests toggles in the settings popup, e.g:

  • this toggle allows you to disable tracking of newly spawned HTTP requests globally, thus essentially disabling Hoardy-Web,
  • this one controls whether Hoardy-Web will track new requests originating from the currently active tab,
  • this one controls whether it will track new requests originating from new tabs opened from the currently active tab (aka “children tabs”, e.g. via middle mouse click, context menu, etc),
  • while this one controls whether it will track new requests originating from new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
  • and so forth for the others (press ? symbols to see a tooltip explaining what each of them does).

Disabling any of these toggles does not stop tracking of already initiated requests, it only stops new requests controlled by that toggle from being tracked.
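
The point that toggles only gate the start of tracking can be illustrated with a toy model. This is a minimal Python sketch, not Hoardy-Web's actual code (the extension runs inside the browser); the class `Tracker`, its attributes, and the assumption that a new request needs both the global and the per-tab toggle enabled are all made up for illustration.

```python
class Tracker:
    """Toy model: toggles gate only the start of tracking, not its continuation."""

    def __init__(self):
        self.track_globally = True   # stand-in for the global toggle
        self.track_tab = {}          # hypothetical per-tab toggle values, keyed by tab id
        self.tracked = set()         # ids of requests that are being tracked

    def on_request_start(self, request_id, tab_id):
        # Assumption: a new request is tracked only if both the global
        # and the per-tab toggle allow it.
        if self.track_globally and self.track_tab.get(tab_id, True):
            self.tracked.add(request_id)
        return request_id in self.tracked

    def on_request_event(self, request_id):
        # Later events are processed for anything already tracked,
        # regardless of the current toggle values.
        return request_id in self.tracked
```

In this model, flipping `track_globally` to `False` after a request has started leaves that request tracked, while any request started afterwards is ignored.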

The networking states of the State Diagram

As shown on the above diagram, a new reqres proceeds through the following networking states:

  • start: the starting state;
  • request sent, (response) headers received, (response) body received: these are the normal HTTP request stages (stages of the webRequest sub-API of the WebExtensions API);
  • nIO: normal network IO performed by the browser in between HTTP request stages;
  • canceled: request was canceled before it was sent:
    • by you, the user, manually, via the Stop button;
    • by the browser when redirecting an http:// URL to an https:// URL in HTTPS-only mode;
    • by an ad-blocking extension like uBlock Origin;
    • etc;

    unsent would have probably been a better name for this, but all browsers call it canceled internally, so Hoardy-Web follows that convention;

  • no_response: request was sent, but no response was received:
    • you canceled it manually via the Stop button before it got a response;
    • connection to the server was rejected;
    • the server decided to ignore the request completely;
    • network timeout was reached;
    • etc;
  • incomplete: request was sent, response headers were received, but then the loading was interrupted before all of the response body was received;
  • incomplete_fc: only on Firefox-based browsers: the browser loaded the response data of this reqres directly from its cache, but did not give it to Hoardy-Web;

    this is just how Firefox handles things sometimes; usually, this only happens for images;

    this is a separate state, because usually this means this URL was successfully archived before; if it was not, reload the page with Control+F5;

  • complete: request was completed successfully;
  • complete_fc: request was completed successfully from browser’s cache;
  • snapshot: this reqres was produced by taking a DOM (Document Object Model) snapshot (using one of the appropriate-buttons in the popup), i.e. it was produced by capturing a raw HTML or XML of the current state of the tab/frame, not by capturing a network request;
  • finished: the terminal state of this step, no new events for this reqres will come from the browser.
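
The networking part of the diagram can also be read as a small transition table. The following Python sketch is only an illustration of that reading, not the extension's implementation; the names `NETWORK_TRANSITIONS` and `is_valid_path` are hypothetical, and the intermediate nIO states are folded into the edges.

```python
# Transitions between networking states, transcribed from the state diagram above.
# The nIO states are omitted: they only mark network IO between stages.
NETWORK_TRANSITIONS = {
    "start": {"request sent", "canceled", "incomplete_fc", "complete_fc", "snapshot"},
    "request sent": {"headers received", "no_response"},
    "headers received": {"body received", "incomplete"},
    "body received": {"complete"},
    "canceled": {"finished"},
    "no_response": {"finished"},
    "incomplete": {"finished"},
    "incomplete_fc": {"finished"},
    "complete": {"finished"},
    "complete_fc": {"finished"},
    "snapshot": {"finished"},
    "finished": set(),
}


def is_valid_path(states):
    """Return True if each consecutive pair of states is an allowed transition."""
    return all(
        after in NETWORK_TRANSITIONS.get(before, set())
        for before, after in zip(states, states[1:])
    )
```

For example, `["start", "canceled", "finished"]` is a valid path in this table, while `["start", "complete"]` is not, because a response can only complete after its headers and body were received.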

The states after the finished state

In principle, on reaching the finished state the reqres can be serialized and saved to disk, but Hoardy-Web provides more states and UI for convenience and to work around limitations of various browser APIs (e.g., a WebExtensions API function call that writes a data chunk into a file on a local file system while reporting out-of-disk-space errors does not exist).

Glossary

  • An /in-flight reqres/ (current tab) is a reqres that did not reach the finished state yet; in the history-log such reqres will be shown to be in in_flight state.

    These two stats are represented as sums of two numbers:

    • the number of reqres that are still being tracked via webRequest or debugger API; and
    • the number of reqres that have finished being tracked and are now waiting for all their events to finish processing.

    On Firefox, nothing should ever get stuck; if something seems to be stuck in in_flight state, it’s probably still loading (or it is a bug in the browser, which does happen, very rarely).

    On Chromium, limitations of Chromium’s debugging interface mean a request can get stuck among the reqres represented by the first number above. If the first number is zero, however, then the second should also rapidly become zero, at most after two times this many seconds.

    If some reqres got stuck in one of the in_flight states, you can forcefully move them out of that state using this and/or that popup buttons.

  • A finished reqres is a reqres that reached the finished state.
  • Final networking state is the last state a reqres had before it finished: i.e. complete, incomplete, canceled, etc.

Step 2: Classification

On reaching the finished state, Hoardy-Web performs reqres classification controlled by =Mark reqres as ‘problematic’ when they finish= and =Pick reqres for archival when they finish= settings. The former set decides if the reqres in question should be marked as problematic. The latter set decides whether the reqres in question should be picked or dropped, which influences the actions Hoardy-Web will perform in the next step.

Problematic reqres

The problematic reqres status is a flag (NOT a state) that does not influence archival or any actions discussed in the latter steps. It exists because browsers provide no indication when some parts of the page failed to load properly — they expect you to actually look at the page with your eyes to notice something looking broken (and reload it manually) instead — which is counterproductive when you want to be sure that the whole page with all its resources was archived.

After all, parts of a dynamically loaded page might simply silently fail to be rendered by associated JavaScript because some of the HTTP requests that JavaScript did in background failed, or, on a static web page, layout and CSS might have made some of the incompletely loaded parts of the page invisible (by design or by accident).

So, to provide an indicator for such cases, Hoardy-Web keeps the log of problematic reqres and displays the number of elements in the log in its toolbar button’s badge.

By default, HTTP requests that failed to get a response, those that have incomplete response bodies, and those for which the browser reported potentially problematic errors but then Hoardy-Web picked them anyway, will be marked as problematic.

Problematic errors are errors like

  • “this request failed because of a networking issue”,
  • “this request was aborted because the JavaScript function making it decided to cancel it when you moved your mouse cursor away from a video thumbnail it was needed for”,
  • and similar things that probably imply some part of the page was left unfetched,

but NOT errors like

  • “fetching of this request was aborted because the server redirected it to a URL blocked by uBlock Origin”,
  • “the browser decided against rendering of this data”,
  • “the browser failed to render this data because this image file is broken”,
  • and similar errors where the data was properly fetched.

(In principle, Hoardy-Web could have been designed to never record the errors of the latter category in the first place, thus simplifying the above bit, but Hoardy-Web is designed to follow the philosophy of “collect everything as browser gives it, as raw as possible, do all the post-processing logic separately, allow for no logic at all, if the user asks for it”.)

The raw error strings reported by the browser for each reqres can be seen in the history-log.
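
The default marking rule described above can be summarized as a small predicate. This is a hedged Python sketch of the defaults as worded in this section, not the extension's real logic; the function name and its parameters are hypothetical, and it deliberately covers only the no_response and incomplete states plus the "problematic error but picked anyway" case.

```python
def is_problematic(final_state, has_problematic_error, picked):
    """Decide the 'problematic' flag roughly as the default settings are described.

    final_state: the final networking state of the reqres, e.g. "complete".
    has_problematic_error: True if the browser reported an error that probably
        means some part of the page was left unfetched.
    picked: True if the reqres was picked for archival.
    """
    if final_state == "no_response":
        return True  # the request failed to get a response
    if final_state == "incomplete":
        return True  # the response body is incomplete
    # The browser reported a potentially problematic error,
    # but the reqres was picked anyway.
    return has_problematic_error and picked
```

Note that the flag is computed independently of the state: a reqres with `final_state == "complete"` can still come out as problematic when both `has_problematic_error` and `picked` are true.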

If this option is enabled, Hoardy-Web will generate a notification each time a new problematic reqres gets produced. If you don’t care about the problematic flag and it annoys you, you should disable that option, not options under =Mark reqres as ‘problematic’ when they finish= settings. This way you can still see the number of problematic reqres in extension’s toolbar button’s badge.

Glossary

Step 3: Collection, Discarding, and Limbo

On exit from the finished state each reqres gets split into

  • a loggable, which is a hollow reqres structure without any request or response data, i.e. it only keeps the metadata used by history-log, and
  • a dump, which is a serialized CBOR-formatted dump of the original reqres structure.

Since these (loggable, dump) tuples can be reconstructed back into the original reqres structures, the rest of this document will continue to refer to them as reqres whenever the fact that they are now internally represented by such tuples is not relevant.
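
The split into a loggable and a dump can be sketched as follows. This is a simplified Python illustration under loud assumptions: Hoardy-Web serializes dumps as CBOR, while this sketch uses JSON from the standard library purely as a stand-in, and the field names in `DATA_FIELDS` are hypothetical.

```python
import json

# Fields treated as bulky request/response data in this sketch (hypothetical names).
DATA_FIELDS = ("request_body", "response_body")


def split_reqres(reqres):
    """Split a reqres dict into a (loggable, dump) tuple."""
    # The loggable keeps only metadata, i.e. everything except the data fields.
    loggable = {key: value for key, value in reqres.items() if key not in DATA_FIELDS}
    # Stand-in serialization: the real dumps are CBOR-formatted, not JSON.
    dump = json.dumps(reqres, sort_keys=True).encode("utf-8")
    return loggable, dump


def restore_reqres(dump):
    """Reconstruct the original reqres structure; in this sketch the dump alone suffices."""
    return json.loads(dump.decode("utf-8"))
```

The loggable is cheap to keep around for display purposes, while the dump is an opaque byte string that can be stashed, saved, exported, or submitted as is.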

Normally, picked reqres proceed to the collected state and get queued for archival while dropped reqres proceed to being discarded from memory.

When =Archive ‘collected’ reqres= toggle is enabled, those queued reqres proceed directly to the next step.

“Limbo” mode

However, sometimes you might want to actually look at a web page before deciding if you want to archive it or not. The naive way to do it would be to load a page with capture disabled first, look at it, and then, if you want to save it, enable capture and reload the page again with browser’s cache disabled via Control+F5 (and it has to be Control+F5, not just F5, because otherwise, on Firefox, some URLs might produce reqres in incomplete_fc state, while on Chromium their fetching could be silently skipped).

Obviously, this is both annoying and will force you to fetch everything twice.

Which is why Hoardy-Web implements “limbo mode”. With one of the limbo mode options enabled, Hoardy-Web will instead capture everything as normal, but then, instead of sending the reqres in question to collected or discarded states immediately, it will put them into in_limbo state where they will linger until you collect or discard them manually by pressing the appropriate-buttons, or until =Automatic actions for recently closed tabs= options make a decision semi-automatically for you.

A picked reqres will be put into in_limbo when =Pick into limbo= setting is enabled in the currently active tab or when one-of-the-other settings is enabled for other reqres sources.

Similarly, a dropped reqres will be put into in_limbo when =Drop into limbo= setting is enabled in the currently active tab or when one-of-the-other settings is enabled for other reqres sources. (This latter option mainly exists for debugging.)
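
The routing of a finished reqres described in this step fits in a few lines. The following Python sketch only restates the rules above; the function name and parameter names are hypothetical, and the per-source selection of which limbo setting applies is left out.

```python
def state_after_finished(picked, pick_into_limbo=False, drop_into_limbo=False):
    """Return the state a finished reqres moves to after classification.

    picked: True if the reqres was picked, False if it was dropped.
    pick_into_limbo / drop_into_limbo: values of the limbo settings that
        apply to the source of this reqres.
    """
    if picked:
        return "in_limbo" if pick_into_limbo else "collected"
    return "in_limbo" if drop_into_limbo else "discarded"
```

With both limbo settings disabled this reduces to the normal behavior: picked reqres become collected and dropped reqres get discarded.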

If this option is enabled and there are more than this number reqres in_limbo or the total size of all dumps in_limbo is more than this size (in MiB), Hoardy-Web will complain to remind you to collect or discard some of them so that your browser does not waste too much memory (and so that you won’t lose too much data if something crashes while =Stash ‘collected’ reqres into local storage= option discussed below is disabled).
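
The limbo reminder condition is a plain "count or total size" check. A minimal Python sketch, assuming dump sizes are tracked in bytes; the function and parameter names are hypothetical.

```python
MIB = 1024 * 1024  # bytes in one MiB


def limbo_needs_attention(dump_sizes, max_count, max_mib):
    """Return True if Hoardy-Web should remind the user about limbo.

    dump_sizes: sizes, in bytes, of all dumps currently in_limbo.
    max_count: the configured maximum number of reqres in_limbo.
    max_mib: the configured maximum total size of dumps in_limbo, in MiB.
    """
    return len(dump_sizes) > max_count or sum(dump_sizes) > max_mib * MIB
```

Either limit alone is enough to trigger the reminder, so a single very large dump can trigger it just as well as many small ones.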

Glossary

Step 3.5: Stashing

The stashed reqres status is, essentially, a flag that says this reqres was temporarily backed up to browser’s local storage. In other words, stashing exists to prevent loss of successfully captured but yet unarchived data in situations where

  • you quit or restart your browser,
  • Hoardy-Web crashes or gets reloaded unexpectedly, or
  • your computer unexpectedly loses power,

before you collected or discarded everything from in_limbo or Hoardy-Web has successfully archived everything from its archiving queue.

In particular:

Moreover, the following section will discuss how Hoardy-Web will try stashing unarchived reqres into browser’s local storage too.

Note, however, that even with stashing enabled Hoardy-Web will skip disk IO whenever possible: e.g., if both =Archive ‘collected’ reqres= and =Submit dumps via ‘HTTP’= options discussed below are enabled, Hoardy-Web will first try to archive each new collected reqres straight from memory to the archiving server and only if that process fails will it attempt stashing them to local storage instead.

Meaning that

  • stashing of non-in_limbo reqres is usually completely free and so you should probably keep that option always enabled;
  • stashing of in_limbo reqres via one of those options is not free, so if you almost never archive from limbo then keeping those options enabled will waste disk IO, so you might want to disable at least some of them in that case.

The above also implies that, technically, stashing is not a silver bullet against data loss. To try and make it such would mean unconditional immediate stashing of all captured data, which would waste a lot of disk IO on most Hoardy-Web configurations.

When both =Archive ‘collected’ reqres= option and =Stash ‘collected’ reqres into local storage= option are disabled, then, after a new reqres gets queued, Hoardy-Web will generate a new notification complaining about it, unless that option is disabled too.

You can also forcefully stash all currently queued, in_limbo, and unarchived reqres by pressing this button. It stashes everything immediately and unconditionally, ignoring all other stashing settings. When reloading the extension via the Reload button or via =Auto-reload on updates= option, this action will be run automatically.

Glossary

  • A stuck queued reqres is a queued reqres that got stuck in the archival queue, e.g. because it got queued while =Archive ‘collected’ reqres= option was disabled.
  • A /stashed reqres/ is a reqres that was temporarily stashed (backed-up) into browser’s local storage while it is still being kept in Hoardy-Web’s memory. I.e., the stash is a persistent on-disk backup for in-memory reqres.
  • A /failed to stash reqres/ is a reqres that is currently unstashed, i.e. a reqres that failed to be stashed into browser’s local storage. Note that reqres for which stashing was not even attempted are not included in this set. It is also a part of the sum of the “Failed” part of the Queued/Failed line.

    You can retry stashing these by pressing this button.

Step 3.75: Logging

On entering collected or discarded state, loggable metadata of each reqres is copied into the recent reqres history-log and is kept there until the size of the log reaches this many elements, at which point the older elements of the log start being elided automatically.

You can also ask Hoardy-Web to forget all history manually by pressing this button, or to forget history of reqres generated by the currently active tab by pressing that button instead, or do the same by using similar buttons in the-log. Using the-log will also allow the use of reqres filtering options for doing this, allowing you to selectively forget parts of history.

Note, however, that problematic reqres will not get automatically elided from the log, nor forgotten by using the above buttons. To forget about them, you will have to unset the problematic flag on the respective reqres via this button, or that button, or use similar buttons in the-log.
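
The eliding behavior of the history-log, including the exemption for problematic reqres, can be sketched like this. It is an illustrative Python model, not the extension's code; the function name, the entry layout, and the assumption that problematic entries still count towards the size limit are all hypothetical.

```python
def elide_log(log, max_elements):
    """Drop the oldest non-problematic entries until the log fits into max_elements.

    log: list of dicts ordered from oldest to newest; each entry has a
        boolean "problematic" key.
    Problematic entries are never elided, so the result can stay larger
    than max_elements if too many entries are problematic.
    """
    excess = len(log) - max_elements
    kept = []
    for entry in log:
        if excess > 0 and not entry["problematic"]:
            excess -= 1  # elide this old entry
            continue
        kept.append(entry)
    return kept
```

Unsetting the problematic flag on an entry makes it an ordinary entry again, so the next eliding pass can remove it.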

Step 4: Archival

When =Archive ‘collected’ reqres= toggle is enabled, Hoardy-Web will pop queued reqres from the archival queue one by one and then perform one or more of the following (in order they are listed):

  • if =Export dumps via ‘saveAs’= option is enabled, Hoardy-Web will
    • append the dump, as a byte string, to a (per-bucket, see below) bundle,
    • and then
      • if the bundle gets larger than this or
      • after a delay controlled by that and this options

      export the resulting bundle via browser’s saveAs mechanism (i.e. generate a fake-Download);

  • if =Submit dumps via ‘HTTP’= option is enabled, Hoardy-Web will submit the dump to the archiving server at =Server URL= setting by making an HTTP POST request with the dump as request body (which is denoted by srvIO states on the diagram above);
  • if any of the above fails Hoardy-Web will
    • move the reqres into the unarchived state,
    • if =Stash ‘collected’ reqres into local storage= option is enabled, it will try stashing the (loggable, dump) tuple into browser’s local storage (which is denoted by stashIO states on the diagram above) and record but ignore any errors produced while doing that, and
    • stop processing this reqres;
  • otherwise, if =Save reqres into local storage= option is enabled, Hoardy-Web will
    • try to save the (loggable, dump) tuple into browser’s local storage (which is denoted by saveIO states on the diagram above),
    • if saving fails, it will move the reqres into the unarchived state instead, and stop processing this reqres;
  • finally, if =Save reqres into local storage= option is disabled or if saving to local storage succeeds, Hoardy-Web will discard the reqres from memory.

You can enable more than one archival method at the same time. For a given loggable, Hoardy-Web will remember and skip previously successful archival methods if the loggable ever returns to the archival queue again (e.g., when one of the archival methods fails and you later ask Hoardy-Web to retry the archival, or when you re-queue a reqres from local storage from the Saved in Local Storage page).
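
One archival attempt, including the skipping of previously successful methods, can be sketched as below. This is a simplified Python model under stated assumptions, not Hoardy-Web's implementation: the option names in `config`, the `methods` callables, and the `done` and `stashed` keys are hypothetical, and the sketch assumes both export and submission are attempted before failures are checked.

```python
def archive_one(reqres, config, methods):
    """Run one archival attempt for a queued reqres and return the outcome.

    reqres: a dict; reqres["done"] is the set of methods that already succeeded.
    config: maps option names ("export", "submit", "save", "stash") to booleans.
    methods: maps method names to callables that return True on success.
    """
    done = reqres.setdefault("done", set())
    failed = False
    for name in ("export", "submit"):  # in the order they are listed above
        if config.get(name) and name not in done:
            if methods[name](reqres):
                done.add(name)  # remembered, so a later retry skips this method
            else:
                failed = True
    if failed:
        if config.get("stash"):
            reqres["stashed"] = True  # stashing errors are recorded but ignored
        return "unarchived"
    if config.get("save") and "save" not in done:
        if not methods["save"](reqres):
            return "unarchived"
        done.add("save")
    return "discarded from memory"
```

Calling `archive_one` again on an unarchived reqres models a retry: only the methods that are enabled and not yet in `reqres["done"]` get invoked.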

Note the difference between stashed and saved reqres:

  • stashed reqres are kept in memory until they get successfully archived by all configured archival methods (or until you manually discard them, in case they were stashed in_limbo);
  • saved reqres get dumped into browser’s local storage and, if that succeeds, discarded from memory (until you manually load them back from there).

Buckets

Sometimes you might want to split your archivals into separate buckets to simplify future hoarding and sharing of collected archives. E.g., say, by default you might want to put everything into the “default” bucket, but then you might want to put reqres produced by a select tab where you just logged into your personal account into the “private” bucket instead.

To implement this, for each reqres in the archival queue, Hoardy-Web computes a bucket parameter from the appropriate “Bucket” setting, e.g.

  • this one will be used for requests originating from the currently active tab,
  • this one will be used for requests originating from new child tabs opened from the currently active tab (e.g. via middle mouse click, context menu, etc),
  • while this one will be used for new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
  • and so forth for the others (press ? symbols to see a tooltip explaining what each of them does).

Evaluation of the bucket parameter is done just before each archival attempt, so if the queue is not yet empty, and you disable =Archive ‘collected’ reqres=, edit some of the “Bucket” settings, and enable it again, Hoardy-Web will start using the new setting immediately.

When exporting via saveAs, bucket value will be used in the file name of the generated fake-Download .wrrb file and the dumps will be split into separate fake-Download files by said bucket. I.e., internally, the bundle discussed above is actually a set of per-bucket bundles.
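
The per-bucket bundling can be pictured with a toy bundler. This Python sketch is an assumption-laden illustration: the class `Bundler`, its size limit parameter, and the `exported` list are hypothetical, the timer-based flush is left out, and the real extension hands bundles to the browser's saveAs mechanism instead of a list.

```python
class Bundler:
    """Toy per-bucket bundler: concatenates dumps and flushes large bundles."""

    def __init__(self, max_bundle_size):
        self.max_bundle_size = max_bundle_size  # hypothetical size limit, in bytes
        self.bundles = {}                       # bucket name -> list of dumps
        self.exported = []                      # (bucket, bundle bytes) pairs "saved" so far

    def append(self, bucket, dump):
        """Append a dump to its bucket's bundle, flushing if it grew too large."""
        self.bundles.setdefault(bucket, []).append(dump)
        if sum(len(d) for d in self.bundles[bucket]) > self.max_bundle_size:
            self.flush(bucket)

    def flush(self, bucket):
        """Export the bundle of one bucket, if it has any dumps."""
        dumps = self.bundles.pop(bucket, [])
        if dumps:
            self.exported.append((bucket, b"".join(dumps)))
```

Because each bucket has its own bundle, a large "default" bundle being flushed does not affect dumps waiting in the "private" bundle.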

When submitting to an HTTP server, Hoardy-Web will specify bucket as a query parameter (named “profile”, for historical reasons) to each HTTP POST request.
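
A submission request with the bucket passed as the "profile" query parameter could be built like this. The sketch uses Python's standard `urllib` only to show the shape of the request; the server URL in the usage note and the Content-Type value are assumptions, not something this document specifies, and the request is built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_submit_request(server_url, bucket, dump):
    """Build (but do not send) an HTTP POST request carrying one dump.

    server_url: the value of the Server URL setting (without a query string).
    bucket: the bucket value, sent as the "profile" query parameter.
    dump: the serialized dump, as bytes.
    """
    url = server_url + "?" + urlencode({"profile": bucket})
    return Request(
        url,
        data=dump,
        method="POST",
        # Assumption: the exact Content-Type used by Hoardy-Web is not stated here.
        headers={"Content-Type": "application/octet-stream"},
    )
```

For a hypothetical server at `http://localhost:8080/dump` and the bucket "private", the resulting request URL is `http://localhost:8080/dump?profile=private`.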

When stashing or saving to local storage, Hoardy-Web will record the value of bucket into each loggable before saving data to disk. If you restart your browser, thus starting a new Hoardy-Web session, Hoardy-Web will use the old stashed/saved bucket values for all new attempted archivals of old reqres generated by previous sessions.

Glossary

  • An /exported reqres/ is a reqres that was successfully exported by generating a fake-Download containing its dump.
  • A /submitted reqres/ is a reqres that was successfully submitted to the archiving server and thus was discarded from memory.
  • A /saved reqres/ is a reqres that was successfully saved by being archived into browser’s local storage.
  • An archived reqres is either exported, submitted, or saved reqres.

Handling of failures

As noted above, if any of the archival methods fail, the reqres in question will be moved into the unarchived state.

Submissions of reqres that became unarchived because of networking issues will be retried automatically every 60 seconds. Archivals of reqres rejected by the archiving server or those that failed to be saved to browser’s local storage will not be retried automatically as those usually happen when there is no space left on the device you are archiving to.
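
The retry policy above distinguishes failures by their cause. A minimal Python sketch of that rule; the function name and the failure kind labels are hypothetical, only the 60 second interval comes from the text.

```python
RETRY_INTERVAL = 60  # seconds, as stated above


def should_retry_automatically(failure_kind, seconds_since_failure):
    """Decide whether an unarchived reqres should be retried without user action.

    failure_kind: "network" for networking issues, "rejected" for reqres
        rejected by the archiving server, "save_failed" for failures to
        save into local storage.
    """
    if failure_kind != "network":
        # Probably no space left on the target device; wait for a manual retry.
        return False
    return seconds_since_failure >= RETRY_INTERVAL
```

Everything that is not retried automatically stays unarchived until a manual retry is requested.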

You can retry all archiving failures by pressing one of this or that buttons. You can also use them to nudge the archiving sub-process awake if some things got stuck in the queue by accident, e.g., after the extension got reloaded with a non-empty queue, or if you previously quit your browser before everything was archived.

If this option is enabled and a new reqres recently moved to the unarchived state, a new notification will be generated. If this option is enabled, a new notification will be generated when the archival queue gets empty the very first time or after any failures.

Glossary

Re-archival

If you archived some data by saving it into local storage and you now want to re-archive the same data using another method, do the following:

  • enable the option for your desired archival method (e.g., =Export dumps via ‘saveAs’=),
  • but keep the =Save reqres into local storage= option enabled,
  • if you are re-archiving by =exporting them via ‘saveAs’= option, you should probably temporarily set this timeout to 0 to prevent idle waiting,
  • press the Show button on Saved in LS line in the popup to open the Saved in Local Storage page;
  • set the filter of your desired archival method there to false (red) to make it only display reqres that were not yet archived using that archival method (e.g., set Exported via 'saveAs' to false);
  • then re-queue the data saved in local storage:
    • press the Re-queue button;
    • wait for Hoardy-Web to finish archival of newly re-queued reqres;
    • (if you are re-archiving by =exporting them via ‘saveAs’= option while running on a truly ancient hardware and the above process is slow, you can disable the GZip outputs option; though, the resulting WRR bundles will take a lot of disk space in this case);
    • (also, if you are re-archiving by =exporting them via ‘saveAs’= option, then after each Re-queue you should wait for the browser to save the resulting generated WRR bundles to disk and then confirm that each generated fake-Download did not fail; why? because if you re-archive a lot of data, thus generating many WRR bundles at once, and you run out of disk space in the process, the browser might fail a random subset of the generated fake-Downloads without telling Hoardy-Web anything about it; this is not an issue when archiving by =… submitting them via ‘HTTP’=, because archiving servers report their errors properly);
    • Re-queue more data, repeat until everything is re-archived.
  • set this timeout to its previous value, if you changed it.

If, after confirming everything was properly re-archived, you now want to wipe that re-archived data from local storage, do the following:

  • press the Show button on Saved in LS line in the popup to open the Saved in Local Storage page;
  • set the filter of your desired archival method there to true (green) to make it only display reqres that were already archived using that archival method;
  • press the Delete button there repeatedly, until everything is deleted.

“Work offline” mode

Sometimes, you might want to block a select tab from performing new HTTP requests.

Say, for instance, you opened a URL in a new tab, then you forgot about that tab for a while, but then you returned to it again, and you now want to read it, but the font size is too small for you, so you want to change that tab’s zoom level. Changing zoom level will change tab’s viewport size, which, if the page uses responsive CSS, will likely force your browser to generate new HTTP requests to fetch data used by previously inactive parts of the page. If you don’t want your browser to notify the page’s origin server you are interacting with the page now, you will want to block these requests. I.e., you will want to block that tab from sending new HTTP requests to the Internet.

Desktop versions of Firefox-based browsers have File > Work Offline option that can do this, but it disables all new requests browser-wide, which is quite inconvenient and error-prone if you want to keep some of your tabs offline while not restricting others. Chromium-based browsers do not appear to have such a feature at all.

Also, by default, pages generated by =hoardy-web export mirror= (also there) from your archived data will have all their URLs remapped to refer to local files, not the Internet. Though, if you want, you can make hoardy-web export mirror keep some of the links and references to page requisites on exported pages pointing to their original URLs, by running it with non-default command line arguments. But even when you ask hoardy-web to remap everything to local files, it can still fail to remap some of the URLs because it does not support all possible ways those URLs can be encoded in HTML pages yet. Also, unexpected interactions between an exported page and one of your browser extensions rewriting its DOM on-the-fly can accidentally make the page refer to the Internet. Also, hoardy-web can have bugs in its remapping code.

Meanwhile, sometimes, you might want to ensure your browser does not try to access the Internet when you open one of those exported pages. And in some of those cases you might even want to prevent your browser from opening non-remapped jump-links (a href) even when you click them.

To solve all of the above issues — and to add an equivalent of File > Work Offline to Chromium-based browsers — Hoardy-Web implements its own Work offline mode controlled via the following toggles:

  • the global toggle is pretty much equivalent to Firefox’s own option and enables canceling of all new requests browser-wide;
  • this toggle enables “Work offline” mode in the currently active tab, thus also preventing you from navigating to any Internet URLs by clicking any links that open in the same tab;
  • this toggle enables it for the currently active tab’s new children, thus also preventing you from opening any Internet URLs by spawning new tabs from it;
  • there is also a toggle for controlling the default value of the above two options in newly spawned root tabs,
  • as well as toggles controlling “Work offline” mode for background requests and requests generated by extensions.

Unlike the File > Work Offline option of Firefox, enabling any of these toggles:

  • does not break any requests that are already in-flight;
  • does not prevent generation of new canceled reqres when a corresponding Track new requests toggle is also enabled, and they can be seen in the history-log.

In the latter case, those newly generated canceled reqres will also be marked as problematic if that option is enabled. So, for convenience, there is also a toggle that controls whether toggling Work offline options (from the popup or with keyboard shortcuts) should also automatically set the corresponding Track new requests option to the opposite value.

Finally, there is also a bunch of options that automatically enable “Work offline” mode in tabs with various classes of URLs. By default, “Work offline” mode is enabled for file: URLs to stop any pages generated by hoardy-web export mirror from accessing the Internet.

Shortcuts

Hoardy-Web provides a bunch of keyboard and context menu shortcuts to allow using it in more efficient ways.

  • On Firefox-based browsers, you can see and edit all keyboard shortcuts via Add-ons and themes (about:addons) -> the gear icon -> Manage Extension Shortcuts.
  • On Chromium-based browsers, you can see and edit all keyboard shortcuts via the menu -> Extensions -> Manage Extensions (chrome://extensions/) -> Keyboard shortcuts (on the left).

Keyboard shortcuts

Hoardy-Web provides shortcuts to:

Context menu actions

Hoardy-Web provides context menu actions to:

  • open a given link in a new tab with currently active tab’s tracking in children tabs setting negated. I.e.,
    • right-mouse clicking while pointing at a link and
    • selecting Hoardy-Web > Open Link in New Tracked/Untracked Tab menu item,

    is equivalent to

    • toggling this,
    • middle-mouse clicking a link,
    • toggling this again.
  • do the same thing, but opening it in a new window.

Error messages and codes

Error messages, as seen in generated notifications

  • Failed to archive <N> items because `Hoardy-Web` can't establish a connection to the archiving server at <URL>

    Are you running the archiving server script?

  • Failed to archive <N> items because requests to the archiving server failed with: <STATUS> <REASON>: <RESPONSE>

    Your archiving server is returning HTTP errors when Hoardy-Web is trying to archive data to it. See your archiving server’s console for more information.

    Some common reasons it could be failing:

    • No space left on the device you are archiving to.
    • It’s a bug, {{{reportit()}}}.
  • Failed to stash <N> items because <reason> or Failed to archive <N> items because <reason>

    Stashing or archiving failed for some other reason.

    Some common reasons it could be failing:

    • No space left on the device your browser saves its local storage to.
    • It’s a bug, {{{reportit()}}}.
  • Failed to open/create a database via `IndexedDB` API, all data persistence will be done via `storage.local` API instead. This is not ideal, but not particularly bad. However, the critical issue is that it appears Hoardy-Web previously used `IndexedDB` for archiving and/or stashing reqres.

    So, it worked before, but why doesn’t it work now? The most likely reason is: you are running Hoardy-Web under a browser based on an older version of Firefox and you have recently enabled Always use private browsing mode setting in your browser’s config. Older versions of Firefox forbid the use of IndexedDB API when that setting is set.

    To make archives currently saved in IndexedDB accessible to Hoardy-Web under Always use private browsing mode you need to:

    All old data should be available from the Saved in Local Storage page now.

  • Failed to process <N> items because <reason>

    It’s a bug, {{{reportit()}}}.

  • Other error notifications should be completely self-descriptive. If they are not, {{{reportit()}}}.

Errors recorded in reqres, as seen in the-log

Most error codes are produced by attaching one of the following prefixes to the raw error code given by the browser:

  • webRequest:: prefix is prepended to errors produced by the code working with webRequest API;
  • debugger:: prefix is prepended to errors produced by the code working with Chromium’s Debugger API;
  • filterResponseData:: prefix is prepended to errors produced by webRequest.filterResponseData API (these can usually be ignored, since Firefox generates normal webRequest:: codes for those reqres too, when it was an actual error; but Hoardy-Web still collects them, adhering to “collect everything as browser gives it, when possible” philosophy).

In particular, webRequest::NS_ prefix on Firefox, and webRequest::net:: and debugger::net:: prefixes on Chromium signify various issues produced by the networking stacks of those browsers. For instance:

  • webRequest::NS_ERROR_ABORT on Firefox and webRequest::net::ERR_ABORTED on Chromium signify that this request was aborted before it finished, e.g. because the originator tab was closed before it was fully loaded; Firefox also uses this code to mean what Chromium signifies with various BLOCKED codes;
  • webRequest::net::ERR_BLOCKED_BY_CLIENT on Chromium signifies that an extension blocked it;
  • debugger::net::ERR_BLOCKED:: is a prefix for other errors when the request was blocked, e.g. by CSP;
  • webRequest::NS_ERROR_NET prefix on Firefox and webRequest::net::ERR_FAILED error on Chromium signify various networking issues.

The exception to the above rule of keeping everything as raw as possible are webRequest::capture:: and debugger::capture:: prefixes which signify various errors produced by Hoardy-Web itself in its webRequest- or debugger-handling code, respectively. In particular:

  • webRequest::capture::EMIT_FORCED::BY_USER and debugger::capture::EMIT_FORCED::BY_USER are produced when you forcefully advance a reqres from in-flight state by pressing this or that button;
  • debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER is produced when Chromium debugger gets detached from its tab while a reqres inside that tab is still in flight;
  • debugger::capture::EMIT_FORCED::BY_CLOSED_TAB is produced when a tab gets closed while a reqres inside of it is still in flight;
  • debugger::capture::NO_RESPONSE_BODY:: is a prefix for errors produced when getting request’s response body from Chromium’s debugger fails for various reasons;
  • webRequest::capture::CANCELED::NO_DEBUGGER is produced when a non-main-frame request is canceled by Hoardy-Web because no debugger is available to capture it; in the case of a main frame request, Hoardy-Web will cancel the request and reload the tab, as discussed there, so this error will not be produced; but it can happen if a page tries to load a sub-frame (like iframe) while the debugger for the tab (and, thus, the main frame) did not attach yet (which only happens for pages where Chromium disallows debugging, or when Hoardy-Web gets enabled after the page in question already started loading, e.g. the very first page after the browser starts); also, this can happen when the debugger gets detached after the main frame was captured but its resources are still loading.
  • webRequest::capture::CANCELED::BY_WORK_OFFLINE is produced when the reqres was canceled by one of “Work offline” options, i.e. as a result of one or more of this-this-this-this-or-that options being set.
  • webRequest::capture::RESPONSE::BROKEN is produced when some response metadata is unavailable.

    At the moment, this only appears to happen on Firefox when a request gets fulfilled by a service or shared worker after Firefox had already sent it to the server. Firefox then interrupts the networking code and generates NS_ERROR_NET_ON_* error about the event failing to supply the response metadata generated by the service/shared worker.
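The prefix scheme described in this section can be taken apart mechanically. The following is a minimal sketch, not code from Hoardy-Web itself: `parse_error`, `ParsedError`, and their field names are hypothetical, and the only inputs assumed are error strings shaped like the ones listed above.

```python
from typing import NamedTuple, Optional

# API prefixes named in the text above.
API_PREFIXES = ("webRequest", "debugger", "filterResponseData")


class ParsedError(NamedTuple):
    api: Optional[str]   # which API's handling code reported the error, if recognized
    by_hoardy_web: bool  # True for "capture::" codes produced by Hoardy-Web itself
    code: str            # the remainder: raw browser code or capture sub-code


def parse_error(error: str) -> ParsedError:
    """Split a string like 'webRequest::net::ERR_ABORTED' into its parts."""
    api, sep, rest = error.partition("::")
    if not sep or api not in API_PREFIXES:
        # Not one of the documented prefixes: keep the string unchanged.
        return ParsedError(None, False, error)
    if rest.startswith("capture::"):
        return ParsedError(api, True, rest[len("capture::"):])
    return ParsedError(api, False, rest)
```

For example, `parse_error("debugger::capture::EMIT_FORCED::BY_CLOSED_TAB")` separates the `debugger` prefix from the Hoardy-Web-generated sub-code `EMIT_FORCED::BY_CLOSED_TAB`, while `parse_error("webRequest::NS_ERROR_ABORT")` leaves the raw Firefox code `NS_ERROR_ABORT` untouched.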

Quirks and Bugs

If you are reading this page outside of the extension’s UI be sure to read the very top of this page first.

Known Hoardy-Web’s own issues

  • Hoardy-Web does not implement collection of WebSockets data on any of the supported browsers.

    (Firefox does not support it. Chromium does support it, in theory, but I have not tried using that API, so I have no idea how well it works.)

    This is a low-priority issue since you can simply take a DOM snapshot instead of capturing and later replaying WebSocket messages to in-page JavaScript. Also, capturing and archiving a DOM snapshot will free you from needing to run any JavaScript at all when you decide to return to view the archived page later, which is nice.

  • On Chromium, response data of background requests and requests made by other extensions does not get collected, since there’s no tab to attach a debugger to, and I have not figured out how to attach debugger to other things yet.
  • On Firefox, fetches that spawn new downloads will be marked as problematic by default, since Firefox’s implementation of webRequest.filterResponseData API does not provide their contents to the extension and I have not figured out how to distinguish them from other fetches yet.

Known issues that are consequences of issues of all supported browsers

  • When Hoardy-Web is reloaded without using the Reload button or =Auto-reload on updates= option, i.e. when Hoardy-Web is reloaded by clicking the “Reload” button in the browser’s extension list, all per-tab settings of all tabs will be reset to the values used by newly spawned root tabs.

    This issue does not apply when the reload happens because the extension was updated: in that case the browser will notify Hoardy-Web about it and Hoardy-Web will handle it properly, see the help string of the Reload button for more info.

    But in the case of Reload buttons, the browser does not ask the extension nicely, so all unsaved internal state will be lost.

  • If an HTTP server supplies the same header multiple times — which happens sometimes, most commonly with Set-Cookie headers — then the archived response headers will usually become weird, with multiple headers squished into a single value, separated by newline symbols.

    This is just the way both Firefox (usually) and Chromium (always) supply those headers to extensions and Hoardy-Web does not try to undo it.
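If you post-process archived headers yourself, the squished values can be expanded again. Below is a minimal sketch under two assumptions: headers are available as a list of `(name, value)` pairs (a hypothetical representation, not necessarily the one used in Hoardy-Web dumps), and a newline inside a value only ever marks the boundary between merged headers.

```python
from typing import List, Tuple

Header = Tuple[str, str]


def split_squished_headers(headers: List[Header]) -> List[Header]:
    """Expand values holding newline-joined duplicates into separate headers."""
    result: List[Header] = []
    for name, value in headers:
        # Assumption: a newline in a value separates two merged headers.
        for part in value.split("\n"):
            # Drop a stray carriage return left over from a CRLF separator.
            result.append((name, part.strip("\r")))
    return result
```

Calling it on `[("Set-Cookie", "a=1; Path=/\nb=2; Path=/")]` yields two separate `Set-Cookie` entries, while headers without newlines pass through unchanged.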

Known issues that are consequences of issues of Firefox-based desktop browsers: Firefox, Tor Browser, LibreWolf, etc

  • On Firefox-based browsers, without the patch (also there), the browser only supplies formData to webRequest.onBeforeRequest handlers, thus making it impossible to recover the actual request body for a POST request.

    Hoardy-Web will mark such requests as having a “partial request body” and try its best to recover the data from formData structure, but if a POST request was uploading files, they won’t be recoverable from formData (in fact, it is not even possible to tell if there were any files attached there), and so your archived request data will be incomplete even after Hoardy-Web did its best.

    Disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.

    With the above patch applied, small POST requests will be archived completely and correctly. POST requests that upload large files and only those will be marked as having a “partial request body”.

  • If-Modified-Since and If-None-Match headers never get archived, because the browser never supplies them to the extensions. Thus, you can get 304 Not Modified reqres response to a seemingly normal GET request.
  • Reqres of already cached media files (images, audio, video, except for svg and favicons) will end in incomplete_fc state because webRequest.filterResponseData API does not provide response bodies for such requests. This toggle controls if such reqres should be picked.

    By default, Hoardy-Web will drop them. Usually this is not a problem since such media will be archived on first (non-cached) access. But if you want to force everything on the page to be archived, you can reload the page without the cache with Control+F5.

  • Firefox fails to run onstop method for webRequest.filterResponseData filter for the very first HTTP/2 request the browser makes after you start it, thus making the reqres of that request incomplete. If this option is enabled, Hoardy-Web will transparently work around this bug by redirecting the very first navigation request to about:blank and then reloading the tab with its original URL.
  • Firefox-based browsers provide no API for archiving WebSockets data at the moment, unfortunately.

Known issues that are consequences of issues of Firefox-based mobile browsers: Fenix aka Firefox for Android, Fennec, Mull, etc

All of the above apply, moreover:

Known issues that are consequences of issues of Chromium-based desktop browsers: Chromium, Chrome, etc

On Chromium-based browsers, there is no way to get HTTP response data without attaching Chromium’s debugger to the tab from which a request originates. This makes things a bit tricky, for instance:

  • With this and this option enabled, new tabs will be reset to this value (about:blank by default) because the default of chrome://newtab/ does not allow attaching debugger to the tabs with chrome: URLs.
  • Requests made before the debugger is attached will get canceled by Hoardy-Web. So, for instance, when you middle-click a link, Chromium will open a new tab, but Hoardy-Web will block the requests from there until the debugger gets attached and then automatically reload the tab after. As a side effect of this, Chromium will show Request blocked page until the debugger is attached and the page is reloaded, meaning it will get visually stuck on Request blocked page if fetching the request ended up spawning a download instead of showing a page. The download will proceed as normal, though.
  • You will get an annoying notification bar constantly displayed in the browser while =Hoardy-Web= is enabled. Closing that notification will detach the debugger. Hoardy-Web will reattach it immediately because it assumes you don’t want to lose data and closing that notification by accident is, unfortunately, quite easy.

    However, closing the notification will make all in-flight requests lose their response data.

    All alternatives to Hoardy-Web that work with Chromium suffer from the same issue.

    If you disable this option the debuggers will get detached only after all requests finish. But even if there are no requests in-flight the notification will not disappear immediately. Chromium takes its time updating the UI after the debugger is detached.

Moreover, Chromium has the following long-standing issues/bugs making things difficult:

  • Chromium will automatically detach a debugger from a tab if it tries to save too much data into its debugger state. Which means that a tab that loads too much data too fast will get its debugger detached. Chromium does this to try and save memory, but this, among other issues, means that large images will fail to be properly archived, and any page that loads such files is likely to fail to be archived too.

    This is a design limitation of Chromium debugging interface, there appears to be no work-around for this at the moment.

    Meanwhile, on Firefox, Hoardy-Web uses webRequest.filterResponseData API (not available on Chromium, because it greatly enhances browser’s ad-blocking capabilities) which does not suffer from this problem.

  • Chromium will occasionally detach debuggers from some tabs at random. It just happens. Fortunately, Hoardy-Web will mark the resulting broken reqres as problematic by default as they match the conditions of at least one of this, this, or that options.
  • Chromium’s handling of media files (audio and video) within its debugging interface is very strange. When Chromium encounters a media file, it immediately loads the first few frames of it, then cancels the rest of the download, generates a networking error debugging event, but forgets to give the already loaded data to it, and then, when the user clicks the play button, continues the download by requesting the rest of the file as normal. Thus, on Chromium, for media files Hoardy-Web will only ever get 206 Partial Content HTTP responses with the first few kilobytes of file data missing. This bug has no good workaround; all alternatives to Hoardy-Web that work with Chromium work around it by silently re-downloading the file a second time in the background.
  • Similarly to unpatched Firefox, Chromium-based browsers do not supply contents of files in POST request data. They do, however, provide a way to see if files were present in the request, so Hoardy-Web will mark such and only such requests as having a “partial request body”. There is no patch for Chromium to fix this, nor do I plan to make one (feel free to contribute one, though).

    As with Firefox, disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.

  • Chromium fails to provide openerTabId to tabs created with chrome.tabs.create API, so in the unlikely case of opening two or more new tabs/windows in rapid succession via Hoardy-Web context menu actions without giving them time to initialize, Hoardy-Web could end up mixing up settings between the newly created tabs/windows. This bug is impossible to trigger unless your system is very slow or you are clicking things with automation tools like AutoHotKey or xnee.
  • To properly collect all the data about a reqres, Hoardy-Web has to use both the data generated by webRequest API and Chromium’s own debugging API events, using only one of those is usually insufficient. But Chromium generates different request IDs for events generated by these two different APIs and also generates those events in arbitrary order. Therefore, Hoardy-Web tracks reqres generated by both sets of APIs separately and then matches those two lists against each other heuristically, merging matching reqres together. Which is ugly enough. But then Chromium sometimes generates debugging API events and forgets to produce the corresponding webRequest API events, or vice versa, thus leaving some of those reqres unmatched.

    To work around that, Hoardy-Web waits this many seconds for new events to arrive, and if none do, forcefully finishes all unmatched but network-complete in_flight reqres. Yes, this means that some minor metadata fields (like document_url) of those reqres might be missing, but waiting more time usually won’t fix it, so Hoardy-Web can’t do anything else there.

  • However, sometimes Chromium forgets to generate both loading-complete and loading-failed debugging events. This usually happens when a request gets started and then canceled by a page’s JavaScript, or when you navigate between pages too fast.

    In that case, Hoardy-Web can’t tell if a reqres is just slow at being loaded or if Chromium forgot about it, so those reqres will get stuck in the in_flight state indefinitely, at least until their originator tab gets closed, or until you press one of this or that buttons.

    Hoardy-Web might get another workaround for this bug later.
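The heuristic merging of webRequest and debugger events described earlier in this list can be illustrated with a small sketch. It is not Hoardy-Web’s actual algorithm: the event shape, the `match_events` name, and the choice of `(method, url)` as the matching key are all assumptions made for illustration. It only shows why some reqres end up unmatched when one API never reports a request.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

# Hypothetical event shape: {"id": ..., "method": ..., "url": ...}.
Event = Dict[str, str]
Key = Tuple[str, str]


def match_events(
    web_request: List[Event], debugger: List[Event]
) -> Tuple[List[Tuple[Event, Event]], List[Event], List[Event]]:
    """Pair events from the two sources by (method, url), oldest first."""
    pending: Dict[Key, Deque[Event]] = defaultdict(deque)
    for ev in web_request:
        pending[(ev["method"], ev["url"])].append(ev)

    matched: List[Tuple[Event, Event]] = []
    unmatched_debugger: List[Event] = []
    for ev in debugger:
        queue = pending[(ev["method"], ev["url"])]
        if queue:
            # The two APIs use different request IDs, so IDs cannot be compared.
            matched.append((queue.popleft(), ev))
        else:
            unmatched_debugger.append(ev)

    # Whatever is still queued was never reported by the debugger side.
    unmatched_web_request = [ev for q in pending.values() for ev in q]
    return matched, unmatched_web_request, unmatched_debugger
```

In this picture, the two unmatched lists correspond to the reqres that would be left waiting for a counterpart and, after the timeout, finished forcefully with some metadata missing.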

Frequently Asked Questions

If you are reading this page outside of the extension’s UI be sure to read the very top of this page first.

General

Does Hoardy-Web send any of my captured web browsing data anywhere?

Hoardy-Web only ever sends your data to the archiving Server URL you specify when the Submit dumps via 'HTTP' option is enabled.

Nowhere else. Never else.

Does Hoardy-Web collect and send any telemetry anywhere?

For your convenience, Hoardy-Web saves some global stats across restarts (e.g., the Collected, Discarded, Picked, and Dropped lines).

However, none of those are ever sent anywhere and you can reset them at any time.

Will the answers to the above two questions ever change in a future version of Hoardy-Web?

No. I (the author) hate non-consensual data collection.

In fact, as you might have noticed, Hoardy-Web, unlike most other browser extensions, is almost trivial to reproducible-build from source on a POSIX-compliant system with a Nix package manager installed, and it has a privately operated source code mirror.

This is by design: I expect a chunk of Hoardy-Web users to be paranoid enough to only ever build it from source and install the results manually into their LibreWolf or some such, leaving zero telemetry fingerprints anywhere.

Hoardy-Web asks for a lot of permissions, what does it use all those permissions for?

Capture

Can I use Hoardy-Web to capture web pages while my browser runs with JavaScript disabled?

Yes.

Can I use Hoardy-Web to capture web pages that use a lot of JavaScript?

This is why =DOM=-snapshot buttons exist, see the following question.

In principle, Hoardy-Web will capture everything your browser fetches from the network as you browse the web, except for, at the moment, WebSockets data. So, web pages using only simple UI-related JavaScript code will work fine when you start replaying them “from scratch” via =hoardy-web export mirror= (also there) or some such.

However, in the most general case, “from scratch” replay of pages dynamically generated via JavaScript is not guaranteed. For example, consider a web page with a JavaScript code that generates a random number, then queries a remote server with that number, and then renders the result somehow. Obviously, such a web page can not be replayed “from scratch” since it will generate a new random number and your archive probably won’t have the corresponding server’s response for it.
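The random-number example can be made concrete with a toy model. This is only a sketch of the reasoning: the archive is modeled as a plain dictionary from URL to response body, and `page_request_url`, `replay`, and the `https://example.org/api` endpoint are hypothetical names invented for the illustration.

```python
from typing import Dict, Optional

# Hypothetical archive: maps each captured URL to its response body.
Archive = Dict[str, bytes]


def page_request_url(n: int) -> str:
    """URL the page's script would fetch for a given random number."""
    return f"https://example.org/api?n={n}"


def replay(archive: Archive, n: int) -> Optional[bytes]:
    """Look up the response for the number drawn at replay time."""
    return archive.get(page_request_url(n))


# At capture time the script happened to draw 17, so only that URL was archived.
archive: Archive = {page_request_url(17): b"result for 17"}
```

Here `replay(archive, 17)` finds the archived body, but any other number drawn during a “from scratch” replay, e.g. `replay(archive, 42)`, returns `None` because that URL was never fetched during capture.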

Can I use Hoardy-Web to capture a web page as it currently is, after all JavaScript was run, not as it was when it was last fetched from the network?

Yes, you can capture DOM (Document Object Model) snapshots of all frames of the currently active tab by pressing this button in the popup.

Doing that will generate and capture snapshots of the raw HTML or XML of each frame contained in the currently active tab. (Reqres-wise they will be 200 OK responses, but with protocol set to SNAPSHOT and method set to DOM.)

You can also do that for all open tabs for which this setting is enabled all at once by pressing that button.

How can I make Hoardy-Web capture a web page completely, especially when parts of it are loaded lazily?

In the most general case, you will have to scroll the page around and click random buttons and media elements.

Hoardy-Web has no “autopilot” for doing this, nor will it ever get one, at least as part of Hoardy-Web extension, since “autopiloting” is very website-specific. So, at the moment, the most general semi-automated solution is to run a website-specific UserScript via Tampermonkey or some such, wait until everything finishes loading, and then take a snapshot. (Hoardy-Web will get an integration for automating that, eventually.)

On the other hand, if you

  • run Hoardy-Web under Firefox,
  • just want to load all lazily-loaded images the page already has (NOT load more stuff), and
  • the page in question uses modern HTML5 lazy loading attributes instead of using JavaScript to do the same,

then you can simply go to about:config and toggle dom.image-lazy-loading.enabled to false. All images will start being loaded eagerly after that.

Can I use Hoardy-Web to capture a web page without archiving it, look at it, decide if I want to save it, and archive it only if I do, all without reloading the page a second time?

Yes. This is why =Pick into limbo= setting exists. See above for more info.

In combination with =Automatic actions for recently closed tabs= options you can implement any of the following workflows:

  • archive everything by default, but allow excluding some things by manually discarding them from limbo;
  • only archive things that are explicitly manually collected, discard everything else by default.

Why can’t pages under https://addons.mozilla.org/ and https://chromewebstore.google.com/ be captured by Hoardy-Web?

Browsers prevent extensions from running on extension store pages to stop them from manipulating ratings, reviews, and the like. However, you can archive https://addons.mozilla.org/ pages by running Hoardy-Web under Chromium and https://chromewebstore.google.com/ pages by running Hoardy-Web under Firefox.

When running Hoardy-Web under Chromium, a lot of my captures fail with debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER, debugger::capture::NO_RESPONSE_BODY::DETACHED_DEBUGGER, webRequest::capture::CANCELED::NO_DEBUGGER, and similar errors. What do I do?

You are either

  • pressing the Cancel or Close (cross) buttons in the Chromium’s popup-toolbar telling you about the debugger being enabled, and so Chromium detaches it, breaking everything (see there);
  • pressing Space or Escape keyboard keys when doing things in Chromium’s UI, but nothing at that particular moment reacts to the key you pressed, except there is that popup-toolbar… and so Chromium decides it must mean you want to press Cancel button there … and detaches the debugger, breaking everything (again);

    yes, this is really annoying, and this is a common problem for me, since I usually page-down using Space and press Escape a lot (usually to cancel selection, but sometimes also as a trauma of a long-time Vim user);

    the only solution to this I know of is to just not touch the keyboard at all, at least while things are still loading; i.e. just click on stuff using the mouse/track-point/touch-pad/touchscreen/etc, wait for the T (“Tracking”) to vanish from the extension’s badge, and only then let your (grabby and impatient for exercise via keyboard shortcuts) fingers touch the keyboard;

    even then, Chromium will detach debuggers from time to time seemingly at random, but at least it will be rare enough that you won’t need to reload much;

  • trying to capture large or media files; as discussed there, this has no workaround, run Hoardy-Web under Firefox instead.

Also, Chromium will occasionally detach its debugger at random, it just happens.

When running Hoardy-Web under Firefox, some of my captures fail with webRequest::capture::RESPONSE::BROKEN. What do I do?

This is a rare error caused by a race condition between webpage’s service/shared worker and browser’s networking code.

Usually, you can ignore this error, since loading another related page is likely to fulfill the same URL.

However, if this happens a lot to you, or if it annoys you, you can go to about:config, toggle dom.serviceWorkers.enabled to false, and restart the browser. Alternatively, you can use NoScript or some such extension to disable JavaScript, and thus the offending service/shared workers, on the page in question.

Why does a (specific) URL or some part of it fail to be properly captured by Hoardy-Web?

Did you read the notes on the bugs of the browser you are using?

Most notably:

  • both Firefox- and Chromium-based browsers in their default builds fail to properly supply POST request data to their extensions; for Firefox-based browsers there exists a patch that fixes it, mostly; Chromium users are out of luck at the moment;
  • on a Chromium-based browser, because of limitations of Chromium’s debugging interface, it is impossible to properly capture media files (both audio and video) and large files in general; this issue has no good work-around and, AFAIK, all alternatives to Hoardy-Web running on Chromium-based browsers suffer from it (and work around it by silently re-downloading said files a second time in the background); try using Hoardy-Web under a Firefox-based browser instead.

Archival

The documentation claims that all Hoardy-Web archival methods except for submission via =HTTP= are unsafe. Why?

Archival by exporting using =saveAs= (generation of fake-Downloads) can fail and **lose a bit of your collected data at a time** if you press a wrong button in your browser’s UI, misconfigure your browser a bit, or your disk runs out of space unexpectedly.

Archival to browser’s local storage (which is what Hoardy-Web is doing by default) can **lose all your collected data at the same time** if you uninstall the extension by accident.

Meanwhile, archival by submission via =HTTP= has none of these problems:

  • Hoardy-Web will keep each reqres in memory until the archiving server responds with 200 OK for that reqres;
  • the archiving server will only respond with 200 OK to Hoardy-Web after the dump is written and fsync-ed to disk;
  • the archiving server never deletes any of your archived data; by using an archiving server, you can only lose your archived data if you go to its directory and delete some of it yourself, or if your disk dies, or if your file system gets corrupted; all of those problems are solved by regular backups.
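The “acknowledge only after the data is durable” ordering from the list above can be sketched with Python’s standard library. This is an illustration of the idea, not the actual archiving server shipped with Hoardy-Web: `write_durably`, `DumpHandler`, the `archive` directory, and the content-hash file naming are all hypothetical choices.

```python
import hashlib
import os
from http.server import BaseHTTPRequestHandler

ARCHIVE_DIR = "archive"  # hypothetical destination directory


def write_durably(directory: str, name: str, data: bytes) -> str:
    """Write data to directory/name and fsync it before returning the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        # Ask the OS to push the file contents to the storage device.
        os.fsync(f.fileno())
    return path


class DumpHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Name files by content hash so a resubmitted dump maps to the same file.
        name = hashlib.sha256(body).hexdigest() + ".dump"
        write_durably(ARCHIVE_DIR, name, body)
        # Only now, with the data on disk, acknowledge the submission.
        self.send_response(200)
        self.end_headers()

# To try it out (address and port are arbitrary):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8080), DumpHandler).serve_forever()
```

If the write or the fsync raises, the handler never reaches `send_response(200)`, so a client that keeps each item until it sees 200 OK would still hold it and could retry.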

Archival to browser’s local storage was added because it was very easy to implement after the-stash was added. It is the default because it usually works fine, it properly reports errors, has the most consistent behaviour across all browsers, and does not require the user to install any Python code, which helps with on-boarding.

In the ideal world, browsers would provide a better saveAs API which would have a less annoying UI for the user and would return out-of-disk-space errors to the extension, in which case exporting via =saveAs= would be the default.

As it is now, the only way to be absolutely sure your data is properly forever-saved to disk when the extension reports it archived is to use submission via =HTTP=.

When running Hoardy-Web under Firefox, enabling export via =saveAs= makes the browser’s UI quite annoying. Can it be fixed?

Yes, go to about:config and toggle browser.download.alwaysOpenPanel to false.

This page does not answer my question. What do I do?

If the whole content of this page (not just this section, did you try searching for stuff with Control+F? there’s a lot of info here) does not explain your problem, {{{reportit()}}}.