Copyright © 2009 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This specification defines the handling of Web addresses for Hypertext Markup Language (HTML) 5, the fifth major revision of the core language of the World Wide Web. In this version, special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability.
This is a start at factoring out the URL material in the HTML 5 draft as a separate draft for consideration by the W3C HTML Working Group (see ACTION-68).
See also a URI desk calculator.
Pending feedback:
This specification defines the term Web address, and defines various algorithms for dealing with Web addresses, because for historical reasons the rules defined by the URI and IRI specifications are not a complete description of what HTML user agents need to implement to be compatible with Web content.
A Web address is a string used to identify a resource.
The term "Web address" in this specification is used to include not only Uniform Resource Identifiers (URIs) as they are defined by RFC 3986 and Internationalized Resource Identifiers (IRIs) as they are defined by RFC 3987, but also other strings of characters which can be used to identify Web resources when processed appropriately.
The Web address is a valid URI reference (i.e. it matches the grammar for <URI-reference> given in RFC 3986).
The Web address is a valid IRI reference (i.e. it matches the grammar for <IRI-reference> given in RFC 3987), and it has no query component.
The Web address is a valid IRI reference and its query component contains no unescaped non-ASCII characters [RFC3987].
The Web address is a valid IRI reference and the character encoding of the Web address's Document is UTF-8 or UTF-16 [RFC3987].
Document, and the URL character encoding is the document's character encoding.
This 2nd step probably needs to be laid out in more detail.
://
"[" and "]") following the first occurrence of "/", "?", or "#" which follows the first occurrence of "//".
Otherwise, percent-encode all left and right square brackets.
Percent-encode all occurrences of U+0023 (Number sign, "#") after the first.
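As an illustration only, the number-sign step could be sketched as follows. The function name `encode_extra_hashes` is hypothetical and is not defined by this specification; the sketch covers only the "#" rule, not the square-bracket rule.

```python
def encode_extra_hashes(w: str) -> str:
    """Percent-encode every "#" in w except the first one.

    Hypothetical helper for illustration; the name is not from the spec.
    """
    # Split at the first "#": head is everything before it, tail everything after.
    head, sep, tail = w.partition("#")
    # Any "#" remaining in the tail is a second or later occurrence.
    return head + sep + tail.replace("#", "%23")
```

For example, `encode_extra_hashes("a#b#c")` keeps the first number sign as the fragment delimiter and escapes the second one.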
Parse w using the grammar in RFC 3986.
If w doesn't match the <URI-reference> production, even after the above changes are made to it, then parsing the Web address fails with an error. [RFC3986]
As in the algorithm previously given, Web addresses containing percent-encoded characters here have components which similarly contain percent-encoded characters.
N.B. the rules given above will parse not only valid Web addresses but a variety of invalid ones as well. The point of making the algorithm have a scope different from that of the definition of valid Web address is not clear and needs to be discussed in the WG.
The parsing process described here should be more closely aligned with the rules given in RFC 3987.
How does this compare to just parsing using the IRI grammar of RFC 3987?
Let w be the Web address being resolved.
Let encoding be the character encoding of the Web address.
If encoding is UTF-16, then change it to UTF-8.
If the algorithm was invoked with an absolute Web address to use as the base Web address, let base be that absolute Web address.
Otherwise, let base be the base URI of the element, as defined by the XML Base specification, with the base URI of the document entity being defined as the document base Web address of the Document that owns the element. [XMLBASE]
For the purposes of the XML Base specification, user agents must act as if all Document objects represented XML documents.
It is possible for xml:base attributes to be present even in HTML fragments, as such attributes can be added dynamically using script. (Such scripts would not be conforming, however, as xml:base attributes are not allowed in HTML documents.)
Document is the absolute Web address obtained by running these substeps:
If fallback base url is about:blank, and the Document's browsing context has a creator browsing context, then let fallback base url be the document base Web address of the creator Document instead.
If there is no base element that is both a child of the head element and has an href attribute, then the document base Web address is fallback base url.
Otherwise, let w be the value of the href attribute of the first such element.
Resolve w relative to fallback base url (thus, the base href attribute isn't affected by xml:base attributes).
The document base Web address is the result of the previous step if it was successful; otherwise it is fallback base url.
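The substeps above might be sketched roughly as follows. This is an illustration under stated assumptions, not the normative algorithm: the initial value of fallback base url is assumed to be the document's own address, Python's `urllib.parse.urljoin` stands in for this specification's resolve algorithm, and the function and parameter names are hypothetical.

```python
from typing import Optional
from urllib.parse import urljoin


def document_base(document_address: str,
                  base_href: Optional[str] = None,
                  creator_base: Optional[str] = None) -> str:
    """Sketch of the document base Web address substeps.

    document_address: assumed starting value of fallback base url.
    base_href: href of the first base element in head, or None if absent.
    creator_base: document base Web address of the creator Document, if any.
    """
    fallback = document_address
    # about:blank documents inherit the creator document's base.
    if fallback == "about:blank" and creator_base is not None:
        fallback = creator_base
    # No suitable base element: the fallback is the answer.
    if base_href is None:
        return fallback
    try:
        # Stand-in for "resolve w relative to fallback base url".
        return urljoin(fallback, base_href)
    except ValueError:
        # Resolution failed: fall back, as the last substep requires.
        return fallback
```

Note that xml:base plays no part here, which matches the remark that the base href attribute is not affected by xml:base attributes.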
Parse w into its component parts.
If parsing w resulted in a <host> component, then replace the matching substring of w with the string that results from expanding any sequences of percent-encoded octets in that component that are valid UTF-8 sequences into Unicode characters as defined by UTF-8.
If any percent-encoded octets in that component are not valid UTF-8 sequences, then return an error and abort these steps.
Apply the IDNA ToASCII algorithm to the matching substring, with both the AllowUnassigned and UseSTD3ASCIIRules flags set. Replace the matching substring with the result of the ToASCII algorithm.
If ToASCII fails to convert one of the components of the string, e.g. because it is too long or because it contains invalid characters, then return an error and abort these steps [RFC3490].
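A minimal sketch of the host steps, assuming Python's built-in `idna` codec as an approximation of the RFC 3490 ToASCII operation. That codec does not expose the AllowUnassigned and UseSTD3ASCIIRules flags, so its behaviour may differ from what is required here in edge cases. The function name `host_to_ascii` is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def host_to_ascii(host: str) -> str:
    """Expand percent-encoded UTF-8 in a host, then apply IDNA ToASCII.

    Raises UnicodeError (which covers UnicodeDecodeError) when the
    percent-encoded octets are not valid UTF-8 or when ToASCII fails;
    this corresponds to "return an error and abort these steps".
    """
    # Strict decoding rejects octet sequences that are not valid UTF-8.
    decoded = unquote_to_bytes(host).decode("utf-8")
    # The built-in codec applies ToASCII label by label (approximation).
    return decoded.encode("idna").decode("ascii")
```

For instance, a host written as `b%C3%BCcher.example` is first expanded to `bücher.example` and then converted to its ASCII-compatible form.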
//example.com/a^b☺c%FFd%z/?e", then the <path> component's substring would be "/a^b☺c%FFd%z/" and the two characters that would have to be escaped would be "^" and "☺". The result after this step was applied would therefore be that w now had the value "//example.com/a%5Eb%E2%98%BAc%FFd%z/?e".
Apply the algorithm described in RFC 3986 section 5.2 Relative Resolution, using w as the potentially relative URI reference (R), and base as the base URI (Base). [RFC3986]
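The path-escaping example and the relative resolution step can be illustrated with the Python standard library. This is a sketch under assumptions: the set of characters left unescaped (`PATH_SAFE`) is taken from the RFC 3986 path grammar plus "%" so that existing escapes are not re-encoded, UTF-8 is assumed for escaping the path, and `urljoin` is used as an approximation of RFC 3986 section 5.2. The helper names are hypothetical.

```python
from urllib.parse import quote, urljoin

# Assumption: besides letters, digits and "-._~" (always kept by quote),
# leave "/", sub-delims, ":", "@" and "%" untouched in a path.
PATH_SAFE = "/%:@!$&'()*+,;="


def escape_path(path: str) -> str:
    """Percent-encode characters not allowed in a path, as UTF-8 octets."""
    return quote(path, safe=PATH_SAFE, encoding="utf-8")


def resolve(w: str, base: str) -> str:
    """Resolve the reference w against base (approximates RFC 3986 5.2)."""
    return urljoin(base, w)
```

With these helpers, `escape_path("/a^b☺c%FFd%z/")` escapes only "^" and "☺", and `resolve("../g", "http://a/b/c/d;p?q")` reproduces one of the reference resolution examples from RFC 3986.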
Apply any relevant conformance criteria of RFC 3986 and RFC 3987, returning an error and aborting these steps if appropriate. [RFC3986] [RFC3987]
For instance, if an absolute URI that would be returned by the above algorithm violates the restrictions specific to its scheme, e.g. a data: URI using the "//" server-based naming authority syntax, then user agents are to treat this as an error instead.
Let result be the target URI (T) returned by the Relative Resolution algorithm.
If result uses a scheme with a server-based naming authority, replace all U+005C REVERSE SOLIDUS (\) characters in result with U+002F SOLIDUS (/) characters.
Return result.
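The backslash replacement in the final steps could look like the following sketch. The set of schemes treated as having a server-based naming authority (`SERVER_BASED_SCHEMES`) is an assumption made for illustration; this specification does not enumerate them here. The function name is likewise hypothetical.

```python
# Assumption: an illustrative, non-exhaustive set of schemes that use a
# server-based naming authority.
SERVER_BASED_SCHEMES = {"http", "https", "ftp"}


def normalize_backslashes(result: str) -> str:
    """Replace U+005C with U+002F when the scheme is server-based."""
    # The scheme is everything before the first ":" and is case-insensitive.
    scheme = result.partition(":")[0].lower()
    if scheme in SERVER_BASED_SCHEMES:
        return result.replace("\\", "/")
    # Other schemes are returned unchanged.
    return result
```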
A Web address is an absolute Web address if resolving it results in the same Web address without an error.