Description
Provide a screenshot and describe the bug
With the current default settings, Chrome hangs while trying to save https://www.rover.com/. The hang is triggered by the `--virtual-time-budget=15000` argument; removing that argument allows Chrome to archive the DOM as desired.
However, there's a pathological case here! When Chrome hangs forever trying to save Rover.com, the persona gets locked. ArchiveBox times out the action and moves on to the next one, which runs into the singleton file and fails. As mentioned in #1599 (comment), this pathological case would go away if we used a temp persona directory each time.
I think this argument was added to work around full page screenshot issues. I just wanted to share this example in case this is a more common problem and the default might be reconsidered. I'm not arguing one way or another, but it's probably worth tracking how many websites hit this issue.
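To illustrate the temp-directory idea mentioned above, here is a minimal sketch, not ArchiveBox's actual implementation. The function name `run_with_temp_profile` is hypothetical. It assumes a POSIX system and that the browser accepts `--user-data-dir` as in the logged commands. Each run gets a throwaway profile directory, and on timeout the whole process group is killed, so a hung browser cannot leave a lock behind in a shared profile.

```python
import os
import shutil
import signal
import subprocess
import tempfile


def run_with_temp_profile(base_cmd, url, timeout=60):
    """Run a browser command with a throwaway --user-data-dir (hypothetical helper).

    base_cmd is a list such as
    ["/usr/bin/chromium-browser", "--headless=new", "--dump-dom"].
    Returns (returncode, timed_out).
    """
    profile_dir = tempfile.mkdtemp(prefix="chrome_profile_")
    try:
        cmd = [*base_cmd, f"--user-data-dir={profile_dir}", url]
        # start_new_session=True puts the browser and its children in their
        # own process group, so the whole tree can be killed on timeout.
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,
        )
        try:
            proc.wait(timeout=timeout)
            return proc.returncode, False
        except subprocess.TimeoutExpired:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process group exited on its own in the meantime
            proc.wait()
            return proc.returncode, True
    finally:
        # The profile is discarded after every run, so no lock file survives.
        shutil.rmtree(profile_dir, ignore_errors=True)
```

The trade-off is that a fresh profile carries no cookies or logged-in sessions, so this only suits extractors that do not depend on persona state.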
Steps to reproduce
1. `docker run -it -v $PWD/data:/data archivebox/archivebox:dev init --setup`
2. `docker run -it -v $PWD/data:/data archivebox/archivebox:dev manage createsuperuser`
3. `docker run -it -p 8000:8000 -v $PWD/data:/data archivebox/archivebox:dev server`
4. Archive https://www.rover.com/
Logs or errors
[+] [2024-11-24 22:41:56] Adding 1 links to index (crawl depth=0)...
> Saved verbatim input to sources/1732488116-import.txt
> Parsed 1 URLs from input (Generic TXT)
> Found 1 new URLs not already in index
[*] [2024-11-24 22:41:57] Writing 1 links to main index...
√ ./index.sqlite3
[*] [2024-11-24 22:41:57] Archiving 1/2 URLs from added set...
[▶] [2024-11-24 22:41:57] Starting archiving of 1 snapshots in index...
[+] [2024-11-24 22:41:57] "rover.com#2024-11-24T22:41:56+00:00"
https://rover.com#2024-11-24T22:41:56+00:00
> ./archive/1732488117.005615
> favicon
> headers
> singlefile
+ creating new Chrome profile in:
./personas/Default/chrome_profile/Default
> pdf
Extractor timed out after 60s.
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1732488117.005615;
/usr/bin/chromium-browser --virtual-time-budget=15000
--disable-features=DarkMode --run-all-compositor-stages-before-draw
--hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run
--use-fake-ui-for-media-stream --use-fake-device-for-media-stream
"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --headless=new
--no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer
--disable-sync --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0
Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
--user-data-dir=/data/personas/Default/chrome_profile
--profile-directory=Default --print-to-pdf
"https://rover.com#2024-11-24T22:41:56+00:00"
> screenshot
Extractor timed out after 60s.
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1732488117.005615;
/usr/bin/chromium-browser --virtual-time-budget=15000
--disable-features=DarkMode --run-all-compositor-stages-before-draw
--hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run
--use-fake-ui-for-media-stream --use-fake-device-for-media-stream
"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --headless=new
--no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer
--disable-sync --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0
Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
--user-data-dir=/data/personas/Default/chrome_profile
--profile-directory=Default --screenshot
"https://rover.com#2024-11-24T22:41:56+00:00"
> dom
Extractor timed out after 60s.
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1732488117.005615;
/usr/bin/chromium-browser --virtual-time-budget=15000
--disable-features=DarkMode --run-all-compositor-stages-before-draw
--hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run
--use-fake-ui-for-media-stream --use-fake-device-for-media-stream
"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --headless=new
--no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer
--disable-sync --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0
Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
--user-data-dir=/data/personas/Default/chrome_profile
--profile-directory=Default --dump-dom
"https://rover.com#2024-11-24T22:41:56+00:00"
> wget
> title
> readability
> mercury
Extractor failed:
Mercury was not able to get article text from the URL
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1732488117.005615;
/home/archivebox/.npm/bin/postlight-parser --format=text
"https://rover.com#2024-11-24T22:41:56+00:00"
> htmltotext
> media
> archive_org
Extractor failed:
Failed to find "content-location" URL header in Archive.org
response.
Run to see full output:
docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
cd /data/archive/1732488117.005615;
/usr/bin/curl --silent --location --compressed --head --max-time 60
--user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 ArchiveBox/{VERSION}
(+https://github.com/ArchiveBox/ArchiveBox/)"
"https://web.archive.org/save/https://rover.com#2024-11-24T22:41:56+00:00"
69 files (4.9 MB) in 0:03:22s
[√] [2024-11-24 22:45:19] Update of 1 pages complete (3.38 min)
- 0 links skipped
- 2 links updated
- 2 links had errors
Hint: To manage your archive in a Web UI, run:
archivebox server 0.0.0.0:8000
[2024-11-24 22:45:19,643] INFO:huey:Worker-4:archivebox.queues.tasks.bg_add: 89d64116-dc2e-43df-b4b3-9c2356e326d9 executed in 202.978s
[2024-11-24 22:45:19] INFO huey archivebox.queues.tasks.bg_add: api.py:425
89d64116-dc2e-43df-b4b3-9c2356e326d9
executed in 202.978s
INFO huey_monitor.tasks Store Task tasks.py:58
89d64116-dc2e-43df-b4b3-9c2356e326d9
signal 'complete' (finished: True)
ArchiveBox Version
0.8.5rc51
ArchiveBox v0.8.5rc51 COMMIT_HASH=63bf902 BUILD_TIME=2024-10-24 06:30:40 1729751440
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.10.14-linuxkit-aarch64-with-glibc2.36 PYTHON=Cpython
EUID=911:911 UID=911:911 PUID=911:911 FS_UID=911:911 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=True
DEBUG=False IS_TTY=False SUDO=False ID=9f373648:f07b3960 SEARCH_BACKEND=ripgrep LDAP=False
Binary Dependencies:
√ python 3.11.10 sys_pip /usr/local/bin/python3.11
√ django 5.1.2 sys_pip /usr/local/lib/python3.11/site-packages/django/__init__.py
√ sqlite 2.6.0 sys_pip /usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py
√ pip 24.0.0 sys_pip /usr/local/bin/pip
√ pipx 1.1.0 sys_pip /bin/pipx
√ node 22.10.0 apt /usr/bin/node
√ npm 10.9.0 apt /usr/bin/npm
√ npx 10.9.0 apt /usr/bin/npx
√ playwright 1.48.0 sys_pip /usr/local/bin/playwright
√ puppeteer 23.6.0 lib_npm ~/.npm/bin/puppeteer
√ ldap 3.4.4 sys_pip /usr/local/lib/python3.11/site-packages/ldap/__init__.py
√ rg 13.0.0 apt /usr/bin/rg
√ sonic 1.4.9 env /usr/local/bin/sonic
√ chrome 130.0.6723 env /usr/bin/chromium-browser
√ curl 8.10.1 apt /usr/bin/curl
√ git 2.39.5 apt /usr/bin/git
√ postlight-parser 2.2.3 sys_npm ~/.npm/bin/postlight-parser
√ readability-extractor 0.0.11 lib_npm ~/.npm/bin/readability-extractor
√ single-file 1.1.54 lib_npm ~/.npm/bin/single-file
√ wget 1.21.3 apt /usr/bin/wget
√ yt-dlp 2024.10.22 sys_pip /usr/local/bin/yt-dlp
√ ffmpeg 5.1.6 env /usr/bin/ffmpeg
Package Managers:
√ env /usr/bin/which UID=911 P…
√ apt /usr/bin/apt-get UID=0 P…
- brew not available UID=911 P…
√ sys_pip /usr/local/bin/pip UID=911 P…
- venv_pip not available UID=911 P…
- lib_pip not available UID=911 P…
√ sys_npm /usr/bin/npm UID=911 P…
- lib_npm /usr/bin/npm UID=911 P…
√ playwright /usr/local/bin/playwright UID=0 P…
√ puppeteer /usr/bin/npx UID=911 P…
Code locations:
√ PACKAGE_DIR 39 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates
- CUSTOM_TEMPLATES_DIR missing unused ./user_templates
- USER_PLUGINS_DIR missing unused ./user_plugins
√ LIB_DIR 0 files valid /usr/share/archivebox/lib
Data locations:
√ DATA_DIR 16 files @ valid /data
√ CONFIG_FILE 139.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 392.0 KB valid ./index.sqlite3
√ QUEUE_DATABASE 92.0 KB valid ./queue.sqlite3
√ ARCHIVE_DIR 2 files valid ./archive
√ SOURCES_DIR 2 files valid ./sources
√ PERSONAS_DIR 1 files valid ./personas
√ LOGS_DIR 5 files valid ./logs
√ TMP_DIR 0 files valid /tmp/archivebox
How did you install the version of ArchiveBox you are using?
Docker (or other container system like podman/LXC/Kubernetes or TrueNAS/Cloudron/YunoHost/etc.)
What operating system are you running on?
macOS (including Docker on macOS)
What type of drive are you using to store your ArchiveBox data?
- data/ is on a local SSD or NVMe drive
- data/ is on a spinning hard drive or external USB drive
- data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
- data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)