Fully define multipart/form-data and allow for streaming #6424

Open
annevk opened this issue Feb 27, 2021 · 2 comments
Labels
integration Better coordination across standards needed topic: forms

Comments

@annevk
Member

annevk commented Feb 27, 2021

#3223 is part of this, but to properly integrate with Fetch we need more. In particular, I think we want a serialization operation that returns a tuple containing the boundary and a list in which each item is either a byte sequence or a Blob. That allows Fetch to compute the total size (go through the list and increment by either the byte sequence's length or the blob's size) and to enqueue chunks into a stream lazily without blocking on I/O. We cannot really pretend the I/O is synchronous and leave it to user agents to optimize, as the I/O might fail, whereas obtaining the size should not fail (thanks to @mkruisselbrink for pointing that out).
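
To make the shape of that operation concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `BlobRef`, `serialize`, `total_size`, and `stream` are hypothetical names, not anything defined by HTML or Fetch, and the part headers are simplified (no filename, no Content-Type, no escaping of names).

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple, Union


@dataclass
class BlobRef:
    """Hypothetical stand-in for a Blob: size is known up front, bytes are not."""
    size: int
    read: Callable[[], bytes]  # may raise OSError; only called while streaming


Chunk = Union[bytes, BlobRef]
Entry = Tuple[str, Union[str, BlobRef]]


def serialize(boundary: str, entries: List[Entry]) -> Tuple[str, List[Chunk]]:
    """Return (boundary, chunks) without touching any blob contents."""
    chunks: List[Chunk] = []
    for name, value in entries:
        # Simplification: names are not escaped and file parts carry no
        # filename or Content-Type header.
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
        if isinstance(value, BlobRef):
            chunks.append((header + "\r\n\r\n").encode("utf-8"))
            chunks.append(value)  # kept as a reference, not read here
            chunks.append(b"\r\n")
        else:
            chunks.append(f"{header}\r\n\r\n{value}\r\n".encode("utf-8"))
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return boundary, chunks


def total_size(chunks: List[Chunk]) -> int:
    """Sum byte lengths and blob sizes; performs no I/O."""
    return sum(len(c) if isinstance(c, bytes) else c.size for c in chunks)


def stream(chunks: List[Chunk]) -> Iterator[bytes]:
    """Lazily yield bytes; a blob is read only when its turn comes."""
    for c in chunks:
        yield c if isinstance(c, bytes) else c.read()
```

The point of the split is that `total_size` can run before any read happens, while a failing read only surfaces later, when the stream reaches that blob.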

We should also point out that this is a potentially lossy format, as the boundary necessarily has to be computed ahead of time without knowing the contents of the blobs. There is no way to avoid this, as the boundary is part of the headers and exposed through something like `new Response(formData).headers.get("content-type")`. I suppose it was possible to avoid this before there was an API if you did not care about streaming, but here we are.

There's a separate question of where we want to define this format. At the moment it's mostly in HTML but FormData is in XMLHttpRequest. Status quo is fine with me.

cc @andreubotella

@annevk added the "topic: forms" and "integration" (Better coordination across standards needed) labels on Feb 27, 2021
@andreubotella
Member

Isn't #3223 basically solved? The corresponding PR (#3276) was left untouched for years, and then I opened #6282 to replace it. It looks like @domenic edited the commit message of my PR to close the original PR, but he forgot to close the issue.

As for the boundary, every browser computes it ahead of time, and even the initial definition of multipart/* in RFC 1341 allows for a probabilistic choice of boundary. Requiring it to be computed ahead of time isn't ideal, but if it's necessary, the spec would have to require a lower bound on entropy to ensure the probability of a collision is negligible.

RFC 2046 requires multipart boundaries to be 1 to 70 bytes in the range [0-9A-Za-z'()+,./:=?_ -], except that the final byte cannot be a space. But since the Content-Type value generated by the form submission algorithm requires the boundary parameter not to be a quoted string, the boundary should only contain bytes which are safe in an unquoted parameter value: [0-9A-Za-z'+._-]

Additionally, a comment in WebKit's (and Chromium's) implementation of the boundaries reads:

// The RFC 2046 spec says the alphanumeric characters plus the
// following characters are legal for boundaries:  '()+_,-./:=?
// However the following characters, though legal, cause some sites
// to fail: (),./:=+

Assuming that is still the case, the remaining safe bytes would be [0-9A-Za-z'_-]
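
The narrowing above can be reproduced with plain set operations. This is only a sketch of the reasoning in this comment: the names `RFC2046`, `TOKEN_SAFE`, `TROUBLESOME`, `SAFE`, and `is_conservative_boundary` are made up for illustration, and the "troublesome" set is taken verbatim from the quoted WebKit/Chromium comment rather than verified against current sites.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# RFC 2046 boundary characters (space is allowed, but not as the final byte).
RFC2046 = ALNUM | set("'()+_,-./:=? ")

# Subset that is also valid in an unquoted Content-Type parameter value.
TOKEN_SAFE = ALNUM | set("'+._-")

# Characters the quoted WebKit/Chromium comment reports as breaking some sites.
TROUBLESOME = set("(),./:=+")

# What survives all three constraints.
SAFE = (RFC2046 & TOKEN_SAFE) - TROUBLESOME


def is_conservative_boundary(boundary: str) -> bool:
    """True if the boundary is 1 to 70 characters long and uses only SAFE characters."""
    return 1 <= len(boundary) <= 70 and all(ch in SAFE for ch in boundary)
```

For example, `is_conservative_boundary("a.b")` is false because `.` is in the troublesome set, even though RFC 2046 itself allows it.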


If we want to take cues from the implementations, Firefox's boundary string contains a constant prefix of 27 bytes (all hyphens) plus a random part of between 3 and 30 ASCII digits. I don't quite know how to calculate its entropy, but it's probably close to, though lower than, 96 bits. WebKit and Chromium's boundary string has a constant prefix of 22 bytes (hyphens and ASCII alpha) plus a random part of 16 ASCII alphanumeric bytes, with 95 bits of entropy.
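
As a quick check of the 95-bit figure, here is a small Python sketch. It assumes each of the 16 random bytes is drawn uniformly and independently from a 62-symbol alphanumeric alphabet, which may not match the real implementations exactly; `entropy_bits` and `chars_needed` are hypothetical helper names.

```python
import math


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy of `length` symbols drawn uniformly and independently."""
    return length * math.log2(alphabet_size)


def chars_needed(min_bits: float, alphabet_size: int) -> int:
    """Smallest number of uniform random symbols giving at least `min_bits`."""
    return math.ceil(min_bits / math.log2(alphabet_size))


# 16 random alphanumeric bytes: a little over 95 bits under the assumptions above.
webkit_bits = entropy_bits(62, 16)
```

The second helper is the direction a spec requirement would take: given a lower bound on entropy, it gives the minimum length of the random part for a chosen alphabet.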

If I'm doing my math right, with a fixed length l and an entropy h, the expected length of a form payload before the boundary occurs in it is (l * (2^h - 1))/2 bytes, which for the boundary strings generated by browsers is over a yottabyte.
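
Plugging the WebKit/Chromium numbers into that formula, as a sketch: the formula is taken from this comment as-is and not independently derived here, the function name is hypothetical, and a yottabyte is taken to be 10^24 bytes.

```python
def expected_bytes_before_collision(length: int, entropy_bits: int) -> int:
    """(l * (2^h - 1)) / 2 from the comment, in exact integer arithmetic.

    Floor division is used, so an odd product is rounded down by half a byte.
    """
    return length * (2**entropy_bits - 1) // 2


YOTTABYTE = 10**24

# WebKit/Chromium: 22-byte prefix + 16 random bytes = 38 bytes, about 95 bits.
webkit_expected = expected_bytes_before_collision(38, 95)
ratio = webkit_expected // YOTTABYTE  # how many yottabytes that is
```

Under these assumptions the result is several hundred thousand yottabytes, which is consistent with the "over a yottabyte" estimate.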

@andreubotella
Member

For the record, I'm working on defining multipart/form-data in https://github.com/andreubotella/multipart-form-data.
