warcio fails to parse this valid WARC record correctly:
```python
import gzip
import io

import warcio.archiveiterator

noise0 = b'WARC/1.1\r\nWARC-Record-ID: <urn:uuid:fe4275e8-87bd-435c-a3ff-9586e86427be>\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\nWARC-Date: 2021-07-04T17:52:55Z\r\n'
signal = b'WARC-Filename: "foo\\\nbar"\r\n'
noise1 = b'\r\n\r\n\r\n'  # End of headers and end of record
f = io.BytesIO(gzip.compress(noise0 + signal + noise1))
for record in warcio.archiveiterator.WARCIterator(f):
    print(repr(record.rec_headers.get_header('WARC-Filename')))
```

The critical part here is the escaped line feed in the WARC-Filename value. WARC's quoted-string definition permits any valid UTF-8 character to be escaped with a backslash, including an LF. The correct value for the header would be, in Python notation, `'foo\nbar'`. Instead, the code above prints `'"foo\\'`.
WARC-Filename and Content-Type are the only official fields that may contain quoted-strings, but any unofficial field can also use them.
Cf. iipc/warc-specifications#71 and iipc/warc-specifications#72 for bugs in the standard related to this.
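For illustration, the expected decoding of a quoted-string could look roughly like the sketch below. `unescape_quoted_string` is a hypothetical helper, not part of warcio, and it assumes the parser has already captured the complete raw field value (including the escaped LF) rather than cutting it off at the first line break. It simply applies the rule described above: a backslash escapes whatever character follows it.

```python
def unescape_quoted_string(raw: str) -> str:
    """Hypothetical helper (not a warcio API): decode a WARC quoted-string.

    Assumes `raw` is the full field value, surrounding double quotes included.
    """
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        raise ValueError('not a quoted-string')
    out = []
    chars = iter(raw[1:-1])
    for ch in chars:
        if ch == '\\':
            # A backslash escapes the next character, whatever it is (even LF).
            try:
                ch = next(chars)
            except StopIteration:
                raise ValueError('dangling backslash') from None
        elif ch == '"':
            raise ValueError('unescaped quote inside quoted-string')
        out.append(ch)
    return ''.join(out)


# The raw value from the reproduction above, as a str.
print(repr(unescape_quoted_string('"foo\\\nbar"')))  # 'foo\nbar'
```

This only covers the value decoding; the parser would still need to recognise that the LF following the backslash does not end the header line.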