warcio mangles non-ASCII HTTP headers #128

Open
@JustAnotherArchivist

I discovered today that warcio mangles the HTTP header data when it isn't pure ASCII. Specifically, I am dealing with a server that returns ISO-8859-1 headers.

As far as I can tell, this behaviour was introduced by #45 in warcio 1.6.0. I suspect that it's fine for reading WARCs, but it's absolutely horrible for writing because the WARCs will not contain the data as sent by the server (nor even data that would be considered equivalent per HTTP!), hence violating the WARC specification.

Example:

import io
import warcio


# Write an uncompressed WARC into an in-memory buffer.
output = io.BytesIO()
writer = warcio.warcwriter.WARCWriter(output, gzip=False)

# Raw HTTP response whose last header value is the single non-ASCII byte 0xFF.
payload = io.BytesIO(b'HTTP/1.1 200 OK\r\nDate: Thu, 27 May 2021 22:03:54 GMT\r\nContent-Length: 0\r\nX-non-utf8-value: \xff\r\n\r\n')

record = writer.create_warc_record('http://example.org/', 'response', payload=payload)
writer.write_record(record)

# Dump the serialised record to inspect the header as written.
print(output.getvalue())

Expected output for the relevant header: X-non-utf8-value: <0xFF> (where <0xFF> is that literal byte)
Actual output: X-non-utf8-value: %C3%BF
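
For reference, the `%C3%BF` in the actual output is what you get if the byte `0xFF` is decoded as ISO-8859-1, re-encoded as UTF-8, and then percent-encoded. The sketch below reproduces that transformation with the standard library only. It is an assumption about how the bytes end up this way, not a claim about warcio's internal code path, and the variable names are purely illustrative.

```python
from urllib.parse import quote

# The header value exactly as sent by the server: one byte, 0xFF.
raw = b"\xff"

# Decoding as ISO-8859-1 maps each byte to the code point of the same value.
text = raw.decode("iso-8859-1")   # U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS

# Re-encoding that character as UTF-8 yields two bytes instead of one.
utf8 = text.encode("utf-8")

# Percent-encoding the character (quote() uses UTF-8 by default)
# gives the value seen in the written WARC.
quoted = quote(text)

# ISO-8859-1 round-trips losslessly, so the original byte is still
# recoverable from the decoded string at this point.
restored = text.encode("iso-8859-1")

print(utf8, quoted, restored == raw)
```

The last line of the sketch shows that the original byte could still be recovered from the decoded string, so writing the header back out byte-for-byte would in principle be possible.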
