Describe the bug
When using pywb (wb-manager reindex, cdx-indexer) and cdxj-indexer, a WARC file cannot be indexed. All indexing methods return an error (“Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte”).
WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc
The WARC Record causing problems seems to be a POST Request, with a payload containing query data in JSON.
Identified WARC Records causing the error:
- request: urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90
- response: urn:uuid:78218a84-3c12-11ee-804f-0242c0a89008
cdxj-indexer Error Message
cdxj-indexer -p [warc file] > [index file]
Error parsing: {"context":{"client":{"hl":"en","gl":"US","clientName":1,"clientVersion":"2.20230815.00.00","configInfo": [...]
The error refers to the payload of the Request Record urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90. (Full payload attached above)
wb-manager reindex Error Message
wb-manager reindex [collection]
Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
cdx-indexer Error Message
cdx-indexer -p [WARC file]
[...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mona/.local/bin/cdx-indexer", line 8, in <module>
sys.exit(main())
File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 468, in main
write_multi_cdx_index(cmd.output, cmd.inputs,
File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 306, in write_multi_cdx_index
for entry in entry_iter:
File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 342, in __call__
for entry in entry_iter:
File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 215, in join_request_records
for entry in entry_iter:
File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 188, in create_record_iter
post_query = MethodQueryCanonicalizer(method,
File "/home/mona/.local/lib/python3.10/site-packages/pywb/warcserver/inputrequest.py", line 281, in __init__
sys.stderr.write("Ignoring query, error parsing as json: " + query.decode("utf-8") + "\n")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
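A note on the error text: byte 0x8b at position 1 is what one would see if the payload started with the gzip magic bytes 0x1f 0x8b. The following is a minimal sketch, using only the Python standard library and a made-up stand-in payload (not the real request body), that reproduces the same message. It is an assumption that the request payload in this WARC is gzip-compressed; this has not been confirmed against the record.

```python
import gzip
import json

# Hypothetical stand-in for the POST body; the real payload is in the WARC file.
payload = json.dumps({"context": {"client": {"hl": "en", "gl": "US"}}}).encode("utf-8")
compressed = gzip.compress(payload)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
print(compressed[:2].hex())

# Decoding the compressed bytes directly fails on the second byte.
try:
    compressed.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# A tolerant decode never raises, so a log message built from it cannot fail.
preview = compressed.decode("utf-8", errors="replace")

# Decompressing first recovers the JSON text.
recovered = json.loads(gzip.decompress(compressed).decode("utf-8"))
print(recovered)
```

If the assumption holds, the exception in the traceback comes from building the diagnostic message itself (query.decode("utf-8") on undecoded bytes), not from the JSON parsing that the message is reporting on.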
Steps to reproduce the bug
- Download the WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc
- Use the tools and commands described above to index the file
Environment
- OS: Ubuntu 22.04
- Version pywb 2.7.4
- Version cdxj-indexer 1.4.5
- Version warcio 1.7.4
Additional context
Identification of Error Records
When using the indexing methods, an index.cdxj is partially written. By comparing the entries of the WARC file and the index file in order, the first entry in the WARC file that is missing from the index file was identified as the record causing problems. To verify this, the record was removed using warcio. After that, indexing worked.
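The removal step can be sketched roughly as follows. This is an illustration of one way to do it with warcio's ArchiveIterator and WARCWriter, not necessarily the exact script that was used; the function names normalize_record_id and copy_without_records are made up for this example.

```python
def normalize_record_id(record_id):
    """Strip the angle brackets that WARC-Record-ID values carry in the file."""
    return record_id.strip().lstrip("<").rstrip(">")


def copy_without_records(src_path, dst_path, skip_ids):
    """Copy src_path to dst_path, leaving out records whose ID is in skip_ids."""
    # Imported here so normalize_record_id stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    skip = {normalize_record_id(i) for i in skip_ids}
    kept = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # gzip=False because the source file is an uncompressed .warc
        writer = WARCWriter(dst, gzip=False)
        for record in ArchiveIterator(src):
            record_id = record.rec_headers.get_header("WARC-Record-ID") or ""
            if normalize_record_id(record_id) in skip:
                continue  # drop the suspected record
            writer.write_record(record)
            kept += 1
    return kept
```

For example, copy_without_records("in.warc", "out.warc", ["urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90"]) would write every record except the request record listed above (the file names here are placeholders).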
WARC-Processing with warcio
The WARC file and the identified error records were processed with warcio and no UTF-8 decoding error occurred.
import sys

from warcio.archiveiterator import ArchiveIterator

warc1_path = sys.argv[1]

# Iterate over every record and print its target URI and record ID.
with open(warc1_path, 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        print(i, record.rec_headers.get_header('WARC-Record-ID'))
        # Decode each request payload as UTF-8; this raised no error.
        if record.rec_type == 'request':
            content = record.content_stream().read()
            print(content.decode('utf-8'))
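One possible reason the script above raises no error: warcio's content_stream() automatically decompresses and de-chunks the HTTP payload, so a gzip-encoded request body would already be plain text by the time it is decoded. The sketch below reads raw_stream instead, which leaves the payload as stored, and reports the Content-Encoding header and whether the body starts with the gzip magic bytes. The helper names looks_like_gzip and report_request_encodings are hypothetical, and the gzip explanation is a guess that has not been checked against this file.

```python
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def report_request_encodings(warc_path):
    """List (record ID, Content-Encoding header, gzip-magic flag) per request record."""
    # Imported here so looks_like_gzip stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    findings = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "request":
                continue
            encoding = None
            if record.http_headers is not None:
                encoding = record.http_headers.get_header("Content-Encoding")
            # raw_stream yields the body as stored, without content decoding.
            raw = record.raw_stream.read()
            record_id = record.rec_headers.get_header("WARC-Record-ID")
            findings.append((record_id, encoding, looks_like_gzip(raw)))
    return findings
```

If the entry for urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90 shows a gzip encoding, that would explain why decoding works after content_stream() but fails in the indexers.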