Commit

docs: Reorganize user guide (webrecorder#2050)
Reorganizes user guide to be more solutions based

---------

Co-authored-by: Henry Wilkinson <[email protected]>
Co-authored-by: Emma Segal-Grossman <[email protected]>
Co-authored-by: Tessa Walsh <[email protected]>
4 people authored Aug 28, 2024
1 parent ea252e8 commit ecac4f6
Showing 22 changed files with 299 additions and 174 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -6,7 +6,7 @@

&nbsp;

Browsertrix is an open-source cloud-native high-fidelity browser-based crawling service designed
Browsertrix is a cloud-native, high-fidelity, browser-based crawling service designed
to make web archiving easier and more accessible for everyone.

The service provides an API and UI for scheduling crawls and viewing results, and managing all aspects of crawling process. This system provides the orchestration and management around crawling, while the actual crawling is performed using [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) containers, which are launched for each crawl.
@@ -17,6 +17,8 @@ See [browsertrix.com](https://browsertrix.com) for a feature overview and inform

The full docs for using, deploying, and developing Browsertrix are available at: [https://docs.browsertrix.com](https://docs.browsertrix.com)

Our docs are created with [Material for MKDocs](https://squidfunk.github.io/mkdocs-material/).

## Deployment

The latest deployment documentation is available at: [https://docs.browsertrix.com/deploy](https://docs.browsertrix.com/deploy)
6 changes: 5 additions & 1 deletion docs/deploy/customization.md
@@ -143,6 +143,10 @@ Text

The `~~~` is used to separate the sections. If only two sections are provided, the email template is treated as plain text, if three, an HTML email with plain text fallback is sent.
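
The section-counting rule above can be illustrated with a minimal Python sketch. It does not appear in the changed files and is not how Browsertrix itself parses templates: the function name `classify_email_template`, the requirement that `~~~` sits on its own line, and the error raised for any other section count are all assumptions made for illustration.

```python
def classify_email_template(template: str) -> str:
    """Classify a template by how many `~~~`-separated sections it has.

    Hypothetical helper: two sections are treated as a plain-text email,
    three as an HTML email with a plain-text fallback.
    """
    # Start with one empty section and open a new one at each separator line.
    sections = [[]]
    for line in template.splitlines():
        if line.strip() == "~~~":  # assumption: separator is on its own line
            sections.append([])
        else:
            sections[-1].append(line)

    count = len(sections)
    if count == 2:
        return "plain-text"
    if count == 3:
        return "html-with-text-fallback"
    # Assumption: any other count is treated as an error in this sketch.
    raise ValueError(f"expected 2 or 3 sections, got {count}")
```

The sketch deliberately does not assign a meaning to each section's position, since the text above only states how the number of sections is interpreted.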

## Signing WACZ files
## Signing WACZ Files

Browsertrix has the ability to cryptographically sign WACZ files with [Authsign](https://github.com/webrecorder/authsign). The ``signer`` setting can be used to enable this feature and configure Authsign.

## Enable Open Registration

You can enable sign-ups by setting `registration_enabled` to `"1"`. Once enabled, your users can register by visiting `/sign-up`.
38 changes: 27 additions & 11 deletions docs/stylesheets/extra.css
@@ -25,34 +25,50 @@
src: url('../assets/fonts/Inter-Italic.var.woff2') format('woff2');
font-feature-settings: "ss03";
}

@font-face {
font-family: 'Konsole';
font-weight: 100 900;
font-display: swap;
font-style: normal;
src: url('https://wr-static.sfo3.cdn.digitaloceanspaces.com/fonts/konsole/Konsolev1.1-VF.woff2') format('woff2');
}

:root {
--md-display-font: "Konsole", "Helvetica", sans-serif;
--md-code-font: "Recursive", monospace;
--md-text-font: "Inter", "Helvetica", "Arial", sans-serif;
--wr-blue-primary: #0891B2;
--wr-orange-primary: #C96509;
--wr-blue-primary: #088eaf;
--wr-orange-primary: #bb4a00;
}

[data-md-color-scheme="webrecorder"] {
--md-primary-fg-color: #4D7C0F;
--md-primary-fg-color--light: #0782A1;
--md-primary-fg-color--dark: #066B84;
--md-primary-fg-color--light: #057894;
--md-primary-fg-color--dark: #035b71;
--md-typeset-color: black;
--md-accent-fg-color: #0782A1;
--md-typeset-a-color: #066B84;
--md-accent-fg-color: #057894;
--md-typeset-a-color: #035b71;
--md-code-bg-color: #F9FAFB;
}

/* Nav changes */

.md-header__title {
font-family: var(--md-code-font);
font-variation-settings: "MONO" 0.51;
.md-header__title, .md-nav__title {
font-family: var(--md-display-font);
text-transform: uppercase;
font-variation-settings: "wght" 750, "wdth" 87;
margin-left: 0 !important;
}

.md-header__title--active {
font-family: var(--md-text-font);
font-weight: 600;
font-family: var(--md-display-font);
text-transform: none;
font-variation-settings: "wght" 550, "wdth" 90;
}

.md-header__button {
margin-right: 0 !important;
}

/* Custom menu item hover */
22 changes: 13 additions & 9 deletions docs/user-guide/archived-items.md
@@ -1,6 +1,6 @@
# Archived Items
# Intro to Archived Items

Archived Items consist of one or more WACZ files created by a crawl workflow, or uploaded to Browsertrix. They can be individually replayed, or combined with other archived items in a [collection](collections.md). The Archived Items page lists all items in the organization.
Archived items consist of one or more WACZ files created by a crawl workflow or uploaded to Browsertrix. They can be individually replayed, or combined with other archived items in a [collection](collections.md). The **Archived Items** page lists all items in the organization.

## Uploading Web Archives

@@ -26,13 +26,13 @@ The archived item details page is composed of the following sections, though som

### Overview

The Overview tab displays the item's metadata and statistics associated with its creation process.
View metadata and statistics associated with how the archived item was created.

Metadata can be edited by pressing the pencil icon at the top right of the metadata section to edit the item's description, tags, and collections it is associated with.

### Quality Assurance

The Quality Assurance tab displays crawl quality information collected from analysis runs and user assessment of pages. This is where you can start new analysis runs, view quality metrics from older runs, and delete previous analysis runs. This tab is not available for uploaded archived items and not accessible for users with [viewer permissions](org-settings.md#permission-levels).
View crawl quality information collected from analysis runs, review crawled pages, and start new analysis runs. QA is only available for crawls and org members with [crawler permissions](org-members.md).

The pages list provides a record of all pages within the archived item, as well as any ratings or notes given to the page during review. If analysis has been run, clicking on a page in the pages list will go to that page in the review interface.

@@ -50,22 +50,26 @@ Like running a crawl workflow, running crawl analysis also uses execution time.

### Replay

The Replay tab displays the web content contained within the archived item.
View a high-fidelity replay of the website at the time it was archived.

For more details on navigating web archives within ReplayWeb.page, see the [ReplayWeb.page user documentation.](https://replayweb.page/docs/user-guide/exploring/)

### Exporting Files

While crawling, Browsertrix will output one or more WACZ files — the crawler aims to output files in consistently sized chunks, and each [crawler instance](workflow-setup.md#crawler-instances) will output separate WACZ files.

The Files tab lists the individually downloadable WACZ files that make up the archived item as well as their file sizes and backup status. To combine one or more archived items and download them all as a single WACZ file, add them to a collection and [download the collection](collections.md#downloading-collections).
The files tab lists the individually downloadable WACZ files that make up the archived item as well as their file sizes and backup status.

To download an entire archived item as a single WACZ file, click the _Download Item_ button at the top of the files tab or the _Download Item_ entry in the crawl's _Actions_ menu.

To combine multiple archived items and download them all as a single WACZ file, add them to a collection and [download the collection](collections.md#downloading-collections).

### Error Logs

The Error Logs tab displays a list of errors encountered during crawling. Clicking an errors in the list will reveal additional information.
View a list of errors that may have been encountered during crawling. Clicking an error in the list will reveal additional information.

All log entries with that were recorded in the creation of the Archived Item can be downloaded in JSONL format by pressing the _Download Logs_ button.
All log entries with that were recorded in the creation of the archived item can be downloaded in JSONL format by pressing the _Download Logs_ button.
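
For readers who want to inspect a downloaded log file, here is a minimal Python sketch of reading JSONL, where each non-empty line is one JSON object. It is an illustration only and does not appear in the changed files. The helper name `read_jsonl` and the sample keys `logLevel` and `message` are hypothetical; the actual fields in Browsertrix logs may differ.

```python
import json
from typing import Iterator


def read_jsonl(text: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of JSONL text."""
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines between entries
            yield json.loads(line)


# Hypothetical sample data; these keys are placeholders, not a documented schema.
sample = '{"logLevel": "error", "message": "Page load timed out"}\n\n{"logLevel": "warn", "message": "Retrying"}\n'
entries = list(read_jsonl(sample))
```

Parsing line by line keeps memory use low for large log files when the text is streamed from disk instead of loaded as one string.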

### Crawl Settings

The Crawl Settings tab displays the crawl workflow configuration options that were used to generate the resulting archived item. Many of these settings also apply when running crawl analysis.
View the crawl workflow configuration options that were used to generate the resulting archived item. Many of these settings also apply when running crawl analysis.
4 changes: 2 additions & 2 deletions docs/user-guide/browser-profiles.md
@@ -1,4 +1,4 @@
# Browser Profiles
# Intro to Browser Profiles

Browser profiles are saved instances of a web browsing session that can be reused to crawl websites as they were configured, with any cookies, saved login sessions, or browser settings. Using a pre-configured profile also means that content that can only be viewed by logged in users can be archived, without archiving the actual login credentials.

@@ -18,7 +18,7 @@ Browser profiles are saved instances of a web browsing session that can be reuse

## Creating New Browser Profiles

New browser profiles can be created on the Browser Profiles page by pressing the _New Browser Profile_ button and providing a starting URL.
New browser profiles can be created on the **Browser Profiles** page by pressing the _New Browser Profile_ button and providing a starting URL.

Press the _Finish Browsing_ button to save the browser profile with a _Name_ and _Description_ of what is logged in or otherwise notable about this browser session.

14 changes: 14 additions & 0 deletions docs/user-guide/collection.md
@@ -0,0 +1,14 @@
# Intro to Collections

## Create a Collection

You can create a collection from the Collections page, or the _Create New ..._ shortcut from the org overview.

## Sharing Collections

Collections are private by default, but can be made public by marking them as sharable in the Metadata step of collection creation, or by toggling the _Collection is Shareable_ switch in the share collection dialogue.

After a collection has been made public, it can be shared with others using the public URL available in the share collection dialogue. The collection can also be embedded into other websites using the provided embed code. Un-sharing the collection will break any previously shared links.

For further resources on embedding archived web content into your own website, see the [ReplayWeb.page docs page on embedding](https://replayweb.page/docs/embedding).

11 changes: 2 additions & 9 deletions docs/user-guide/collections.md
@@ -1,4 +1,4 @@
# Collections
# Add to Collection

Collections are the primary way of organizing and combining archived items into groups for presentation.

@@ -7,19 +7,12 @@ Collections are the primary way of organizing and combining archived items into

After adding the crawl and the upload to a collection, the content from both will become available in the replay viewer.

## Adding Content to Collections
## Adding Archived Items to Collections

Crawls and uploads can be added to a collection after creation by selecting _Select Archived Items_ from the collection's actions menu.

A crawl workflow can also be set to [automatically add any completed archived items to a collection](workflow-setup.md#collection-auto-add) in the workflow's settings.

## Sharing Collections

Collections are private by default, but can be made public by marking them as sharable in the Metadata step of collection creation, or by toggling the _Collection is Shareable_ switch in the share collection dialogue.

After a collection has been made public, it can be shared with others using the public URL available in the share collection dialogue. The collection can also be embedded into other websites using the provided embed code. Un-sharing the collection will break any previously shared links.

For further resources on embedding archived web content into your own website, see the [ReplayWeb.page docs page on embedding](https://replayweb.page/docs/embedding).

## Downloading Collections

9 changes: 9 additions & 0 deletions docs/user-guide/contribute.md
@@ -0,0 +1,9 @@
# Contribute

We hope our user guide is a useful tool for you. Like Browsertrix itself, our user guide is open source. We greatly appreciate any feedback and open source contributions to our docs [on GitHub](https://github.com/webrecorder/browsertrix).

Other ways to contribute:

1. Answer questions from the web archiving community on the [community help forum](https://forum.webrecorder.net/c/help/5).
2. [Let us know](mailto:[email protected]) how we can improve our documentation.
3. If you encounter any bugs while using Browsertrix, please open a [GitHub issue](https://github.com/webrecorder/browsertrix/issues/new/choose) or [contact support](mailto:[email protected]).
55 changes: 21 additions & 34 deletions docs/user-guide/crawl-workflows.md
@@ -1,54 +1,41 @@
# Crawl Workflows
# Intro to Crawl Workflows

Crawl Workflows consist of a list of configuration options that instruct the crawler what it should capture.
Crawl workflows are the bread and butter of automated browser-based crawling. A crawl workflow enables you to specify how and what the crawler should capture on a website.

## Creating and Editing Crawl Workflows
A finished crawl results in an [archived item](./archived-items.md) that can be downloaded and shared. To easily identify and find archived items within your org, you can automatically name and tag archived items through custom workflow metadata.

New Crawl Workflows can be created from the Crawling page. A detailed breakdown of available settings can be found [here](workflow-setup.md).
You can create, view, search for, and run crawl workflows from the **Crawling** page.

## Status

Crawl Workflows inherit the [status of the last item they created](archived-items.md#status). When a workflow has been instructed to run it can have have five possible states:

| Status | Description |
| ---- | ---- |
| <span class="status-waiting">:bootstrap-hourglass-split: Waiting</span> | The workflow can't start running yet but it is queued to run when resources are available. |
| <span class="status-waiting">:btrix-status-dot: Starting</span> | New resources are starting up. Crawling should begin shortly.|
| <span class="status-success">:btrix-status-dot: Running</span> | The crawler is finding and capturing pages! |
| <span class="status-waiting">:btrix-status-dot: Stopping</span> | A user has instructed this workflow to stop. Finishing capture of the current pages.|
| <span class="status-waiting">:btrix-status-dot: Finishing Crawl</span> | The workflow has finished crawling and data is being packaged into WACZ files.|
| <span class="status-waiting">:btrix-status-dot: Uploading WACZ</span> | WACZ files have been created and are being transferred to storage.|
## Create a Crawl Workflow

## Running Crawl Workflows
Create new crawl workflows from the **Crawling** page, or the _Create New ..._ shortcut from **Overview**.

Crawl workflows can be run from the actions menu of the workflow in the crawl workflow list, or by clicking the _Run Crawl_ button on the workflow's details page.
### Choose what to crawl

While crawling, the Watch Crawl page displays a list of queued URLs that will be visited, and streams the current state of the browser windows as they visit pages from the queue.
The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **URL List** or **Seeded Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.

Running a crawl workflow that has successfully run previously can be useful to capture content as it changes over time, or to run with an updated [Crawl Scope](workflow-setup.md#scope).
#### Known URLs `URL List`{ .badge-blue }

### Live Exclusion Editing
Choose this option if you already know the URL of every page you'd like to crawl. The crawler will visit every URL specified in a list, and optionally every URL linked on those pages.

While [exclusions](workflow-setup.md#exclusions) can be set before running a crawl workflow, sometimes while crawling the crawler may find new parts of the site that weren't previously known about and shouldn't be crawled, or get stuck browsing parts of a website that automatically generate URLs known as ["crawler traps"](https://en.wikipedia.org/wiki/Spider_trap).
A URL list is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.

If the crawl queue is filled with URLs that should not be crawled, use the _Edit Exclusions_ button on the Watch Crawl page to instruct the crawler what pages should be excluded from the queue.
#### Automated Discovery `Seeded Crawl`{ .badge-orange }

Exclusions added while crawling are applied to the same exclusion table saved in the workflow's settings and will be used the next time the crawl workflow is run unless they are manually removed.
Let the crawler automatically discover pages based on a domain or start page that you specify.

### Changing the Amount of Crawler Instances
Seeded crawls are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.

Like exclusions, the [crawler instance](workflow-setup.md#crawler-instances) scale can also be adjusted while crawling. On the Watch Crawl page, press the _Edit Crawler Instances_ button, and set the desired value.
After deciding what type of crawl you'd like to run, you can begin to set up your workflow. A detailed breakdown of available settings can be found in the [workflow settings guide](workflow-setup.md).

Unlike exclusions, this change will not be applied to future workflow runs.
## Run Crawl

## Ending a Crawl
Run a crawl workflow by clicking _Run Crawl_ in the actions menu of the workflow in the crawl workflow list, or by clicking the _Run Crawl_ button on the workflow's details page.

If a crawl workflow is not crawling websites as intended it may be preferable to end crawling operations and update the crawl workflow's settings before trying again. There are two operations to end crawls, available both on the workflow's details page, or as part of the actions menu in the workflow list.
While crawling, the **Watch Crawl** section displays a list of queued URLs that will be visited, and streams the current state of the browser windows as they visit pages from the queue. You can [modify the crawl live](./running-crawl.md) by adding URL exclusions or changing the number of crawling instances.

### Stopping
Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated [crawl scope](workflow-setup.md#scope).

Stopping a crawl will throw away the crawl queue but otherwise gracefully end the process and save anything that has been collected. Stopped crawls show up in the list of Archived Items and can be used like any other item in the app.

### Canceling
## Status

Canceling a crawl will throw away all data collected and immediately end the process. Canceled crawls do not show up in the list of Archived Items, though a record of the runtime and workflow settings can be found in the crawl workflow's list of crawls.
Finished crawl workflows inherit the [status of the last archived item they created](archived-items.md#status). Crawl workflows that are in progress maintain their [own statuses](./running-crawl.md#crawl-workflow-status).