Bug: Chrome times out saving https://www.rover.com/ because of --virtual-time-budget=15000, causing SingletonFile lock contention breaking future launches #1604

@nguyenmp

Provide a screenshot and describe the bug

With the current default settings, Chrome hangs when trying to save https://www.rover.com/. The hang is triggered by the --virtual-time-budget=15000 argument. Removing this argument allows Chrome to archive the DOM as desired.
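For anyone who wants to reproduce the comparison, here is a minimal sketch of dropping the flag from an argument list before launching Chrome. The helper name `strip_flag` and the shortened argument list are hypothetical, used only for illustration. This is not how ArchiveBox builds its Chrome command.

```python
def strip_flag(args, flag_name):
    """Return a copy of `args` without `flag_name` or `flag_name=value` entries."""
    return [
        arg for arg in args
        if arg != flag_name and not arg.startswith(flag_name + "=")
    ]


# Shortened, illustrative subset of the arguments shown in the logs.
chrome_args = [
    "--virtual-time-budget=15000",
    "--headless=new",
    "--no-sandbox",
    "--dump-dom",
]

print(strip_flag(chrome_args, "--virtual-time-budget"))
# → ['--headless=new', '--no-sandbox', '--dump-dom']
```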

However, there's a pathological case here! When Chrome hangs forever trying to save Rover.com, the persona gets locked. ArchiveBox times out the extractor and moves on to the next one, which runs into the singleton file and fails. As mentioned in #1599 (comment), this pathological case would go away if we used a temp persona directory each time.
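To make the temp-directory idea concrete, here is a rough sketch, not ArchiveBox's actual implementation. It assumes each extractor run gets its own throwaway `--user-data-dir`, and that the process is killed and the directory removed when the timeout hits, so a hung run cannot leave a lock behind for the next one. The function names `build_chrome_cmd` and `run_with_temp_profile` are hypothetical. A throwaway profile also means cookies or logins stored in the persona are not available, unless they are copied in first.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_chrome_cmd(chrome_binary, url, user_data_dir, extra_args=()):
    """Assemble a headless Chrome command that dumps the DOM of `url`."""
    return [
        chrome_binary,
        "--headless=new",
        f"--user-data-dir={user_data_dir}",
        *extra_args,
        "--dump-dom",
        url,
    ]


def run_with_temp_profile(make_cmd, timeout=60):
    """Run a command against a throwaway profile dir and always clean it up.

    `make_cmd` receives the temp profile path and returns the argv list.
    Returns a tuple (stdout, timed_out).
    """
    profile_dir = Path(tempfile.mkdtemp(prefix="chrome_profile_"))
    try:
        proc = subprocess.Popen(
            make_cmd(profile_dir),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
        try:
            stdout, _ = proc.communicate(timeout=timeout)
            return stdout, False
        except subprocess.TimeoutExpired:
            # Kill the hung process and reap it before removing its profile.
            proc.kill()
            proc.communicate()
            return "", True
    finally:
        # The profile (and any singleton lock inside it) never outlives the run.
        shutil.rmtree(profile_dir, ignore_errors=True)
```

A call might look like `run_with_temp_profile(lambda d: build_chrome_cmd("/usr/bin/chromium-browser", "https://www.rover.com/", d), timeout=60)`. Note that `proc.kill()` only signals the process that was launched directly. If Chrome leaves helper processes running, they would need separate handling, for example by killing the whole process group.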

I think the flag was added to work around full-page screenshot issues. I just wanted to share this example in case this is a more common problem, so that the default can be reconsidered. I'm not arguing one way or another, but it's probably worth tracking how many websites hit this issue.

Steps to reproduce

1. `docker run -it -v $PWD/data:/data archivebox/archivebox:dev init --setup`
2. `docker run -it -v $PWD/data:/data archivebox/archivebox:dev manage createsuperuser`
3. `docker run -it -p 8000:8000 -v $PWD/data:/data archivebox/archivebox:dev server`
4. Archive https://www.rover.com/

Logs or errors

[+] [2024-11-24 22:41:56] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1732488116-import.txt
    > Parsed 1 URLs from input (Generic TXT)
    > Found 1 new URLs not already in index

[*] [2024-11-24 22:41:57] Writing 1 links to main index...
    √ ./index.sqlite3

[*] [2024-11-24 22:41:57] Archiving 1/2 URLs from added set...

[▶] [2024-11-24 22:41:57] Starting archiving of 1 snapshots in index...

[+] [2024-11-24 22:41:57] "rover.com#2024-11-24T22:41:56+00:00"
    https://rover.com#2024-11-24T22:41:56+00:00
    > ./archive/1732488117.005615
      > favicon
      > headers
      > singlefile
        + creating new Chrome profile in: 
./personas/Default/chrome_profile/Default
      > pdf
        Extractor timed out after 60s.
        Run to see full output:
          docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
            cd /data/archive/1732488117.005615;
            /usr/bin/chromium-browser --virtual-time-budget=15000 
--disable-features=DarkMode --run-all-compositor-stages-before-draw 
--hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run 
--use-fake-ui-for-media-stream --use-fake-device-for-media-stream 
"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --headless=new 
--no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer 
--disable-sync --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh; 
Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 
Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
--user-data-dir=/data/personas/Default/chrome_profile 
--profile-directory=Default --print-to-pdf 
"https://rover.com#2024-11-24T22:41:56+00:00"

      > screenshot
        Extractor timed out after 60s.
        Run to see full output:
          docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
            cd /data/archive/1732488117.005615;
            /usr/bin/chromium-browser --virtual-time-budget=15000 
--disable-features=DarkMode --run-all-compositor-stages-before-draw 
--hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run 
--use-fake-ui-for-media-stream --use-fake-device-for-media-stream 
"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --headless=new 
--no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer 
--disable-sync --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh; 
Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 
Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
--user-data-dir=/data/personas/Default/chrome_profile 
--profile-directory=Default --screenshot 
"https://rover.com#2024-11-24T22:41:56+00:00"

      > dom
        Extractor timed out after 60s.
        Run to see full output:
          docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
            cd /data/archive/1732488117.005615;
            /usr/bin/chromium-browser --virtual-time-budget=15000 
--disable-features=DarkMode --run-all-compositor-stages-before-draw 
--hide-scrollbars --autoplay-policy=no-user-gesture-required --no-first-run 
--use-fake-ui-for-media-stream --use-fake-device-for-media-stream 
"--simulate-outdated-no-au='Tue, 31 Dec 2099 23:59:59 GMT'" --headless=new 
--no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer 
--disable-sync --window-size=1440,2000 "--user-agent=Mozilla/5.0 (Macintosh; 
Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 
Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
--user-data-dir=/data/personas/Default/chrome_profile 
--profile-directory=Default --dump-dom 
"https://rover.com#2024-11-24T22:41:56+00:00"

      > wget
      > title
      > readability
      > mercury
        Extractor failed:
             Mercury was not able to get article text from the URL
        Run to see full output:
          docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
            cd /data/archive/1732488117.005615;
            /home/archivebox/.npm/bin/postlight-parser --format=text 
"https://rover.com#2024-11-24T22:41:56+00:00"

      > htmltotext
      > media
      > archive_org
        Extractor failed:
             Failed to find "content-location" URL header in Archive.org 
response.
        Run to see full output:
          docker run -it -v $PWD/data:/data archivebox/archivebox /bin/bash
            cd /data/archive/1732488117.005615;
            /usr/bin/curl --silent --location --compressed --head --max-time 60 
--user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 ArchiveBox/{VERSION} 
(+https://github.com/ArchiveBox/ArchiveBox/)" 
"https://web.archive.org/save/https://rover.com#2024-11-24T22:41:56+00:00"

        69 files (4.9 MB) in 0:03:22s 

[√] [2024-11-24 22:45:19] Update of 1 pages complete (3.38 min)
    - 0 links skipped
    - 2 links updated
    - 2 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000
[2024-11-24 22:45:19,643] INFO:huey:Worker-4:archivebox.queues.tasks.bg_add: 89d64116-dc2e-43df-b4b3-9c2356e326d9 executed in 202.978s
[2024-11-24 22:45:19] INFO     huey archivebox.queues.tasks.bg_add:   api.py:425
                               89d64116-dc2e-43df-b4b3-9c2356e326d9             
                               executed in 202.978s                             
                      INFO     huey_monitor.tasks Store Task         tasks.py:58
                               89d64116-dc2e-43df-b4b3-9c2356e326d9             
                               signal 'complete' (finished: True)

ArchiveBox Version

0.8.5rc51
ArchiveBox v0.8.5rc51 COMMIT_HASH=63bf902 BUILD_TIME=2024-10-24 06:30:40 1729751440
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.10.14-linuxkit-aarch64-with-glibc2.36 PYTHON=Cpython
EUID=911:911 UID=911:911 PUID=911:911 FS_UID=911:911 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=True
DEBUG=False IS_TTY=False SUDO=False ID=9f373648:f07b3960 SEARCH_BACKEND=ripgrep LDAP=False

 Binary Dependencies:
 √  python                3.11.10      sys_pip    /usr/local/bin/python3.11
 √  django                5.1.2        sys_pip    /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  sqlite                2.6.0        sys_pip    /usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py
 √  pip                   24.0.0       sys_pip    /usr/local/bin/pip
 √  pipx                  1.1.0        sys_pip    /bin/pipx
 √  node                  22.10.0      apt        /usr/bin/node
 √  npm                   10.9.0       apt        /usr/bin/npm
 √  npx                   10.9.0       apt        /usr/bin/npx
 √  playwright            1.48.0       sys_pip    /usr/local/bin/playwright
 √  puppeteer             23.6.0       lib_npm    ~/.npm/bin/puppeteer
 √  ldap                  3.4.4        sys_pip    /usr/local/lib/python3.11/site-packages/ldap/__init__.py
 √  rg                    13.0.0       apt        /usr/bin/rg
 √  sonic                 1.4.9        env        /usr/local/bin/sonic
 √  chrome                130.0.6723   env        /usr/bin/chromium-browser
 √  curl                  8.10.1       apt        /usr/bin/curl
 √  git                   2.39.5       apt        /usr/bin/git
 √  postlight-parser      2.2.3        sys_npm    ~/.npm/bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    ~/.npm/bin/readability-extractor
 √  single-file           1.1.54       lib_npm    ~/.npm/bin/single-file
 √  wget                  1.21.3       apt        /usr/bin/wget
 √  yt-dlp                2024.10.22   sys_pip    /usr/local/bin/yt-dlp
 √  ffmpeg                5.1.6        env        /usr/bin/ffmpeg

 Package Managers:
 √  env         /usr/bin/which                                       UID=911  P…
 √  apt         /usr/bin/apt-get                                     UID=0    P…
 -  brew        not available                                        UID=911  P…
 √  sys_pip     /usr/local/bin/pip                                   UID=911  P…
 -  venv_pip    not available                                        UID=911  P…
 -  lib_pip     not available                                        UID=911  P…
 √  sys_npm     /usr/bin/npm                                         UID=911  P…
 -  lib_npm     /usr/bin/npm                                         UID=911  P…
 √  playwright  /usr/local/bin/playwright                            UID=0    P…
 √  puppeteer   /usr/bin/npx                                         UID=911  P…

 Code locations:
 √  PACKAGE_DIR           39 files        valid     /app/archivebox             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates   
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates            
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins              
 √  LIB_DIR               0 files         valid     /usr/share/archivebox/lib   

 Data locations:
 √  DATA_DIR              16 files @      valid     /data                       
 √  CONFIG_FILE           139.0 Bytes     valid     ./ArchiveBox.conf           
 √  SQL_INDEX             392.0 KB        valid     ./index.sqlite3             
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3             
 √  ARCHIVE_DIR           2 files         valid     ./archive                   
 √  SOURCES_DIR           2 files         valid     ./sources                   
 √  PERSONAS_DIR          1 files         valid     ./personas                  
 √  LOGS_DIR              5 files         valid     ./logs                      
 √  TMP_DIR               0 files         valid     /tmp/archivebox

How did you install the version of ArchiveBox you are using?

Docker (or other container system like podman/LXC/Kubernetes or TrueNAS/Cloudron/YunoHost/etc.)

What operating system are you running on?

macOS (including Docker on macOS)

What type of drive are you using to store your ArchiveBox data?

  • data/ is on a local SSD or NVMe drive
  • data/ is on a spinning hard drive or external USB drive
  • data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
  • data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)
