Bug: "Archive Again" with multiple URLs breaks all Chromium based archival methods, and others #1599

@nguyenmp

Provide a screenshot and describe the bug

When I submit multiple URLs for archival, they are processed serially, which I think is intentional and good.

However, when I select multiple snapshots and click "Archive Again", the snapshots are noticeably archived in parallel, which breaks the Chromium profile. It sometimes leaves Chromium lock files behind in /data/personas/Default/chrome_profile/Singleton*, which prevents future Chromium launches. Nearly all archival attempts fail on the second run, and even single-URL submissions fail once the Chromium lock file issue has been triggered.
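To illustrate the kind of serialization that would avoid the collision described above, here is a minimal sketch that uses an advisory file lock so only one worker launches Chromium against a given profile at a time. This is not ArchiveBox's implementation: `profile_lock`, the lock file path, and the `launch` callable are hypothetical names chosen for the example, and advisory locks may not behave reliably on network or FUSE mounts.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def profile_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the with-block (hypothetical helper)."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def run_with_profile(lock_path, launch):
    """Call launch() while holding the profile lock, so launches never overlap."""
    with profile_lock(lock_path):
        return launch()


if __name__ == "__main__":
    # Stand-in for a real Chromium invocation; the path is an assumption for the demo.
    demo_lock = os.path.join(tempfile.mkdtemp(), "chrome_profile.lock")
    print(run_with_profile(demo_lock, lambda: "launched"))
```

With a wrapper like this around each Chromium-based extractor, parallel "Archive Again" jobs would queue for the profile instead of racing for it.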

I'm not sure what all the knock-on effects are, but the following extractors fail consistently once I get into this state:

  • archive_org
  • htmltotext
  • readability
  • title
  • dom
  • screenshot
  • pdf
  • singlefile

The workaround is to delete the Chrome profile and submit only one URL at a time:

rm -r data/personas/
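A narrower cleanup might be to remove only the leftover Singleton* entries rather than the whole personas directory. The sketch below assumes that clearing those entries is enough to unblock Chromium (I have not verified this against ArchiveBox) and that no Chromium process is still using the profile. `clear_singleton_files` is a hypothetical helper, not part of ArchiveBox.

```python
import os


def clear_singleton_files(profile_dir):
    """Delete Singleton* entries in profile_dir and return the removed names (hypothetical helper)."""
    removed = []
    for name in sorted(os.listdir(profile_dir)):
        if name.startswith("Singleton"):
            # os.unlink removes the directory entry itself, so a dangling
            # symlink is deleted without following its target.
            os.unlink(os.path.join(profile_dir, name))
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Path taken from the report, relative to the host-side data directory.
    profile = "data/personas/Default/chrome_profile"
    if os.path.isdir(profile):
        print(clear_singleton_files(profile))
```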

Steps to reproduce

  1. docker run -v "./data/:/data/" archivebox/archivebox:dev archivebox init
  2. docker run -v "./data/:/data/" -it archivebox/archivebox:dev archivebox manage createsuperuser
  3. docker run -p "8000:8000" -v "./data/:/data/" archivebox/archivebox:dev
  4. Visit http://localhost:8000/add/ and add 3 URLs, all at once:
    1. https://google.com
    2. https://wikipedia.com
    3. https://reddit.com
  5. Wait about 2 minutes for those to finish archiving
  6. Select all and "Archive Again" from http://localhost:8000/admin/core/snapshot/

Logs or errors

From `worker_scheduler.log`:

      > screenshot
        Extractor failed:
             Failed to save screenshot
            [1437:1437:1118/224805.720253:ERROR:process_singleton_posix.cc(340)] Failed to create /data/personas/Default/chrome_profile/SingletonLock: File exists (17)
            [1437:1437:1118/224805.720577:ERROR:chrome_main_delegate.cc(594)] Failed to create a ProcessSingleton for your profile directory. This means that running multiple instances would start multiple browser processes rather than opening a new window in the existing process. Aborting now to avoid profile corruption.
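To check whether a given extractor failure is this lock problem rather than something else, the log text can be scanned for the SingletonLock message quoted above. This is a minimal sketch: `hit_profile_lock` is a hypothetical helper, and the pattern is based only on the log excerpt in this report, so other Chromium versions may word the error differently.

```python
import re

# Pattern derived from the error text in this report; the profile path is left open.
SINGLETON_ERROR = re.compile(r"Failed to create \S*SingletonLock: File exists")


def hit_profile_lock(log_text):
    """Return True if log_text contains the SingletonLock 'File exists' error."""
    # Collapse all whitespace first so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return bool(SINGLETON_ERROR.search(flat))


if __name__ == "__main__":
    sample = (
        "Failed to create /data/personas/Default/chrome_profile/SingletonLock: File \n"
        "exists (17)"
    )
    print(hit_profile_lock(sample))
```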

ArchiveBox Version

0.8.5rc51
ArchiveBox v0.8.5rc51 COMMIT_HASH=63bf902 BUILD_TIME=2024-10-24 06:30:40 1729751440
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.10.4-linuxkit-aarch64-with-glibc2.36 PYTHON=Cpython
EUID=911:0 UID=911:0 PUID=911:0 FS_UID=911:0 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=True
DEBUG=False IS_TTY=False SUDO=False ID=9f373648:efbea00e SEARCH_BACKEND=ripgrep LDAP=False

 Binary Dependencies:
 √  python                3.11.10      sys_pip    /usr/local/bin/python3.11
 √  django                5.1.2        sys_pip    /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  sqlite                2.6.0        sys_pip    /usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py
 √  pip                   24.0.0       sys_pip    /usr/local/bin/pip
 √  pipx                  1.1.0        sys_pip    /usr/bin/pipx
 √  node                  22.10.0      apt        /usr/bin/node
 √  npm                   10.9.0       apt        /usr/bin/npm
 √  npx                   10.9.0       apt        /usr/bin/npx
 √  playwright            1.48.0       sys_pip    /usr/local/bin/playwright
 √  puppeteer             23.6.0       lib_npm    ~/.npm/bin/puppeteer
 √  ldap                  3.4.4        sys_pip    /usr/local/lib/python3.11/site-packages/ldap/__init__.py
 √  rg                    13.0.0       apt        /usr/bin/rg
 √  sonic                 1.4.9        env        /usr/local/bin/sonic
 √  chrome                130.0.6723   env        /usr/bin/chromium-browser
 √  curl                  8.10.1       apt        /usr/bin/curl
 √  git                   2.39.5       apt        /usr/bin/git
 √  postlight-parser      2.2.3        sys_npm    ~/.npm/bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    ~/.npm/bin/readability-extractor
 √  single-file           1.1.54       lib_npm    ~/.npm/bin/single-file
 √  wget                  1.21.3       apt        /usr/bin/wget
 √  yt-dlp                2024.10.22   sys_pip    /usr/local/bin/yt-dlp
 √  ffmpeg                5.1.6        env        /usr/bin/ffmpeg

 Package Managers:
 √  env         /usr/bin/which                                       UID=911  P…
 √  apt         /usr/bin/apt-get                                     UID=0    P…
 -  brew        not available                                        UID=911  P…
 √  sys_pip     /usr/local/bin/pip                                   UID=911  P…
 -  venv_pip    not available                                        UID=911  P…
 -  lib_pip     not available                                        UID=911  P…
 √  sys_npm     /usr/bin/npm                                         UID=911  P…
 -  lib_npm     /usr/bin/npm                                         UID=911  P…
 √  playwright  /usr/local/bin/playwright                            UID=0    P…
 √  puppeteer   /usr/bin/npx                                         UID=911  P…

 Code locations:
 √  PACKAGE_DIR           39 files        valid     /app/archivebox             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates   
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates            
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins              
 √  LIB_DIR               0 files         valid     /usr/share/archivebox/lib   

 Data locations:
 √  DATA_DIR              17 files @      valid     /data                       
 √  CONFIG_FILE           139.0 Bytes     valid     ./ArchiveBox.conf           
 √  SQL_INDEX             476.0 KB        valid     ./index.sqlite3             
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3             
 √  ARCHIVE_DIR           9 files         valid     ./archive                   
 √  SOURCES_DIR           6 files         valid     ./sources                   
 √  PERSONAS_DIR          2 files         valid     ./personas                  
 √  LOGS_DIR              5 files         valid     ./logs                      
 √  TMP_DIR               0 files         valid     /tmp/archivebox

How did you install the version of ArchiveBox you are using?

Docker (or other container system like podman/LXC/Kubernetes or TrueNAS/Cloudron/YunoHost/etc.)

What operating system are you running on?

macOS (including Docker on macOS)

What type of drive are you using to store your ArchiveBox data?

  • data/ is on a local SSD or NVMe drive
  • data/ is on a spinning hard drive or external USB drive
  • data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
  • data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)

Docker Compose Configuration

N/A

ArchiveBox Configuration

# Converted from INI to TOML format: https://toml.io/en/

[SERVER_CONFIG]
SECRET_KEY = "abcdefg"
