quoted-string WARC header values are not parsed correctly #130

@JustAnotherArchivist

warcio fails to parse this valid WARC record correctly:

import gzip
import io
import warcio.archiveiterator


# Minimal valid warcinfo record headers surrounding the header under test
noise0 = b'WARC/1.1\r\nWARC-Record-ID: <urn:uuid:fe4275e8-87bd-435c-a3ff-9586e86427be>\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\nWARC-Date: 2021-07-04T17:52:55Z\r\n'
# Quoted-string value containing a backslash-escaped LF (raw bytes: "foo\<LF>bar")
signal = b'WARC-Filename: "foo\\\nbar"\r\n'
noise1 = b'\r\n\r\n\r\n' # End of headers and end of record
f = io.BytesIO(gzip.compress(noise0 + signal + noise1))
for record in warcio.archiveiterator.WARCIterator(f):
    print(repr(record.rec_headers.get_header('WARC-Filename')))

The critical part here is the escaped line feed in the WARC-Filename value. WARC's quoted-string definition permits any valid UTF-8 character to be escaped with a backslash, including an LF. The correct value for the header would be, in Python notation, 'foo\nbar'. Instead, the code above prints '"foo\\'.
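For illustration, here is a minimal sketch of the unescaping step that would turn the raw value into the expected one. The helper name `unquote_warc_value` is hypothetical and not part of warcio, and the sketch assumes the complete raw header value (including the escaped LF) has already been read as a `str`. It does not address how the header reader finds the end of the value; the truncated output above suggests the value is cut at the LF before any unquoting could happen, so that would need handling as well.

```python
def unquote_warc_value(raw: str) -> str:
    """Decode a quoted-string header value; return other values unchanged.

    Hypothetical helper for illustration only; not warcio API.
    It does not validate the input (e.g. an unescaped '"' inside the value).
    """
    # Only values wrapped in double quotes are treated as quoted-strings.
    if len(raw) < 2 or not (raw.startswith('"') and raw.endswith('"')):
        return raw

    out = []
    chars = iter(raw[1:-1])  # strip the surrounding quotes
    for ch in chars:
        if ch == '\\':
            # quoted-pair: drop the backslash and take the next character
            # literally, whatever it is (including LF). A trailing lone
            # backslash is kept as-is.
            ch = next(chars, '\\')
        out.append(ch)
    return ''.join(out)


# The raw value from the record above: "foo\<LF>bar"
print(repr(unquote_warc_value('"foo\\\nbar"')))
```

Applied to the raw value from the reproducer, this yields the `'foo\nbar'` described above; unquoted values pass through untouched.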

WARC-Filename and Content-Type are the only official fields that may contain quoted-strings, but any unofficial field can use them as well.

Cf. iipc/warc-specifications#71 and iipc/warc-specifications#72 for bugs in the standard related to this.
