I discovered today that warcio mangles the HTTP header data when it isn't pure ASCII. Specifically, I am dealing with a server that returns ISO-8859-1 headers.
As far as I can tell, this behaviour was introduced by #45 in warcio 1.6.0. I suspect that it's fine for reading WARCs, but it's absolutely horrible for writing: the resulting WARCs do not contain the data as sent by the server (nor even data that would be considered equivalent per HTTP!), which violates the WARC specification.
Example:
import io
import warcio

# Write an uncompressed WARC into an in-memory buffer.
output = io.BytesIO()
writer = warcio.warcwriter.WARCWriter(output, gzip = False)

# Raw HTTP response whose last header value is the single non-UTF-8 byte 0xFF.
payload = io.BytesIO()
payload.write(b'HTTP/1.1 200 OK\r\nDate: Thu, 27 May 2021 22:03:54 GMT\r\nContent-Length: 0\r\nX-non-utf8-value: \xff\r\n\r\n')
payload.seek(0)

record = writer.create_warc_record('http://example.org/', 'response', payload = payload)
writer.write_record(record)

# Inspect the header bytes that actually ended up in the WARC.
print(output.getvalue())
Expected output for the relevant header: X-non-utf8-value: <0xFF> (where <0xFF> is that literal byte)
Actual output: X-non-utf8-value: %C3%BF
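For what it's worth, the %C3%BF value is consistent with the byte being decoded as ISO-8859-1 and then percent-encoded as UTF-8 on write. The following standard-library sketch reproduces that transformation; it is only my guess at what happens, not warcio's actual code, and the variable names are made up for illustration. It also shows that re-encoding with the same codec would have preserved the original byte.

```python
from urllib.parse import quote

# Header value bytes exactly as sent by the server.
raw_value = b"\xff"

# Assumption: the header is decoded as ISO-8859-1 when parsed,
# which maps every byte to the code point of the same number.
decoded = raw_value.decode("iso-8859-1")  # U+00FF

# Assumption: on write, non-ASCII characters are percent-encoded
# using their UTF-8 representation (0xC3 0xBF for U+00FF).
mangled = quote(decoded, encoding="utf-8")

# Lossless alternative: encode with the same codec used for decoding.
preserved = decoded.encode("iso-8859-1")

print(mangled)                 # %C3%BF
print(preserved == raw_value)  # True
```

If that guess is right, the original bytes are still recoverable at write time, so emitting them unchanged would be possible.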