This is [not yet] a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C working drafts can be found at: http://www.w3.org/pub/WWW/TR
Note: since working drafts are subject to frequent change, you are advised to reference the above address, rather than the addresses of working drafts themselves.
The HTML 2.0 specification, RFC1866, defines an SGML application and an Internet media type. The specification notes that extensions are planned, but only the text/html; level=2 internet media type and the "-//IETF//DTD HTML 2.0//EN" document type are defined. This document suggests the use of URIs as system identifiers for document type definitions, allowing decentralized evolution of the language. The use of marked sections as a transition technique and the continued use of the level mechanism for standardized points in the evolution path are discussed.
The goal of any HTML specification should be to promote confidence in the fidelity of communications using HTML. This means:
HTML 2.0 specifies a set of idioms widely used and supported as of June 1994. But HTML and the web are still in a stage of rapid innovation and evolution, and will be for the foreseeable future. The HTML 2.0 specification fails to accommodate this evolution: it fails to meet goal #5, and goal #6 cannot be met by any frozen document, as "contemporary communications idioms" evolve over time.
Examples of this evolution include the introduction of forms and tables. In each case, information providers suddenly had two kinds of clients: those with support for the new feature, and those without. They were faced with the following choices:
Optimally, the system should obviate the need for information providers and consumers to deal with this issue explicitly. Interoperability between new and old components should be automatic.
This document proposes a mechanism that obviates the need for consumers to explicitly deal with the issue. The mechanism does not alleviate the information provider's burden, but it does increase reliability even in the case that information providers are unwilling to invest the effort necessary to support old clients.
Consider the following documents:
<title>Example: Simple HTML</title> <p>A paragraph with a <a href="#dest">link</a>. <ul> <li>a list <li>of <a name="dest">items</a> </ul>
<title>Example: Phrase Markup, Nested Lists, and Images</title> <p>A paragraph with <em>emphasis</em> and an <img alt="image" src="foo.png">. <ol> <li>Section 1 <li>Section 2 <ol> <li>Section 2.1 <li>Section 2.2 </ol> <li>Section 3 </ol>
<title>Example: Forms</title> <h1>Forms</h1> <form action="/cgi-bin/test" method=POST> <p><input name=x> <p><input name=y> <p><input name=z> </form>
<title>Example: Tables, Inserts, and Figures</title> <table> <tr><th>Col 1 <th>Col 2 <th>Col 3 <tr><td>A <td>B <td>C <tr><td>1 <td>2 <td>3 </table> <fig> <caption>Figure 1: A Movie</caption> <object data="movie.mpg"> [Movie elided] </object> </fig>
There is a convention among HTML user agents to ignore unrecognized markup. Given the above documents, HTML user agents will behave reliably for documents containing only markup they support. In the face of unrecognized markup, the reliability varies:
Document: | Level 0 | Level 1 | Level 2 | Level 3 |
---|---|---|---|---|
Level 0 User Agent | 100% fidelity | phrase markup and images lost | forms shown as noise | tables and figure captions shown as noise |
Level 1 User Agent | 100% fidelity | 100% fidelity | forms shown as noise | tables and figure captions shown as noise |
Level 2 User Agent | 100% fidelity | 100% fidelity | 100% fidelity | tables and figure captions shown as noise |
Level 3 User Agent | 100% fidelity | 100% fidelity | 100% fidelity | 100% fidelity |
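The off-diagonal cells of the table above can be illustrated with a toy user agent. What follows is a minimal sketch, not a model of any real browser: the class name, the `render` helper, and the two tag sets are hypothetical, and "rendering" is reduced to collecting text into lines. An agent that does not know the table elements keeps the cell text but loses the row structure, which is the "shown as noise" outcome.

```python
from html.parser import HTMLParser

# Hypothetical capability sets; real user agents support many more elements.
LEVEL_2_TAGS = {"title", "p", "a", "ul", "li", "form", "input"}
LEVEL_3_TAGS = LEVEL_2_TAGS | {"table", "tr", "th", "td"}


class ToyUserAgent(HTMLParser):
    """Ignores tags it does not recognize but still emits their text content."""

    def __init__(self, known_tags):
        super().__init__()
        self.known_tags = known_tags
        self.lines = [[]]

    def handle_starttag(self, tag, attrs):
        if tag not in self.known_tags:
            return  # the "ignore unrecognized markup" convention
        if tag == "tr":
            self.lines.append([])  # a table-aware agent starts a new row

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.lines[-1].append(text)


def render(document, known_tags):
    agent = ToyUserAgent(known_tags)
    agent.feed(document)
    agent.close()
    return "\n".join(" ".join(line) for line in agent.lines if line)


TABLE_DOC = "<table> <tr><th>Col 1 <th>Col 2 <tr><td>A <td>B </table>"
```

With these assumptions, `render(TABLE_DOC, LEVEL_3_TAGS)` keeps one line per row, while `render(TABLE_DOC, LEVEL_2_TAGS)` runs every cell together on a single line.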
Actually, none of the above documents conforms to the specification for the text/html media type given in [RFC1866] -- they are missing a document type declaration, e.g.:
<!doctype html public "-//IETF//DTD HTML 2.0//EN">
The HTML 2.0 specification advises implementors to infer the above declaration if none is given. This is poor advice since in practice, the chance that such a document conforms to the HTML 2.0 DTD is very small [Adams95] (cite Tim Bray at opentext, regarding %age of valid HTML docs?)
Rather than binding text/html to any particular DTD, we define it to be an SGML document type that includes HTML level 1, as defined by [RFC1866]. (An SGML document type t1 includes t2 if every document conforming to t2 also conforms to t1.)
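The "includes" relation can be made concrete with a deliberately crude model. In this sketch a document type is reduced to the set of element names it permits, and a document conforms if it uses only those names. Real DTD conformance also involves content models and attributes, and the two vocabularies below are abbreviated, hypothetical stand-ins for the level 1 and level 2 element sets.

```python
# Abbreviated, hypothetical vocabularies. Level 2 adds the forms elements.
LEVEL_1 = frozenset({"title", "p", "a", "em", "img", "ul", "ol", "li"})
LEVEL_2 = LEVEL_1 | {"form", "input", "select", "textarea"}


def conforms(elements_used, doctype):
    """A document conforms (in this toy model) if it uses only permitted elements."""
    return set(elements_used) <= doctype


def includes(t1, t2):
    """t1 includes t2 if every document conforming to t2 also conforms to t1.

    In the element-set model this holds exactly when t2's vocabulary
    is a subset of t1's.
    """
    return t2 <= t1
```

Under this model level 2 includes level 1 but not the reverse, so a document using only level 1 elements is acceptable to either kind of user agent.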
We define a text/html body to be an SGML document entity whose DTD is externally referenced; i.e. the body begins with one of
<!doctype html public "..." system "..."> <!doctype html public "..."> <!doctype html system "..."> <!doctype html>
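As a rough sketch, a user agent could recognize the four forms above and extract the identifiers as follows. The function name and the regular expression are illustrative assumptions. The pattern handles only double-quoted literals and only these four forms, so it is not a substitute for a real SGML prolog parser.

```python
import re

# Simplification: double-quoted literals only, and only the four forms listed above.
DOCTYPE_RE = re.compile(
    r'\s*<!doctype\s+html'
    r'(?:\s+public\s+"(?P<public>[^"]*)")?'
    r'(?:\s+system\s+"(?P<system>[^"]*)")?'
    r'\s*>',
    re.IGNORECASE,
)


def parse_doctype(body):
    """Return the public and system identifiers, or None if the body
    does not begin with a recognized document type declaration."""
    match = DOCTYPE_RE.match(body)
    if match is None:
        return None
    return {"public": match.group("public"), "system": match.group("system")}
```

An identifier that is absent from the declaration comes back as `None`, so the bare `<!doctype html>` form yields `None` for both.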
And we remove the default from the level parameter:
The widely deployed methods for submitting forms requests -- HTTP and SMTP -- provide little assurance of confidentiality. Information providers who request sensitive information via forms -- especially by way of the `PASSWORD' type input field -- should be aware and make their users aware of the lack of confidentiality.
The optional parameters are defined as follows:
The expectation is that in addition to the standard DTDs, the HTML processing capabilities of a user agent are described by some DTD, and that this DTD has a formal public identifier, a Uniform Resource Identifier (URI or URL), or both.
Most documents will be prepared for standard HTML user agents, and their document type will be declared ala:
<!doctype html public "-//IETF//DTD HTML 2.0//EN">
A document prepared for a user agent with support for some other HTML dialect would have its document type declared using one of the following:
<!doctype html public "-//VendorCo Inc.//DTD HTML v1.4//EN" system "http://www.vendor.com/html-public-text/v1.4.dtd"> <!doctype html system "http://www.vendor.com/html-public-text/v1.4.dtd">
All user agents would have built-in support for the standard DTDs, plus a few popular du jour DTDs. Some user agents would be able to accommodate new DTDs at runtime by fetching them from the network. User agents without this capability, on encountering an unknown DTD identifier, could warn that the document might not be processed as intended by the information provider.
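The three behaviours described above might be sketched as a small decision function. The function name, the returned labels, and the contents of `BUILTIN_DTDS` are assumptions made for illustration. Treating a declaration with no identifier at all as "warn" is also an assumption, since the text does not address that case.

```python
# Hypothetical set of identifiers this user agent supports out of the box.
BUILTIN_DTDS = {"-//IETF//DTD HTML 2.0//EN"}


def choose_dtd_action(public_id, system_id, can_fetch):
    """Decide how to handle a document's declared DTD.

    Returns "use-builtin", "fetch", or "warn".
    """
    if public_id in BUILTIN_DTDS or system_id in BUILTIN_DTDS:
        return "use-builtin"
    if can_fetch and system_id is not None:
        # The system identifier is a URI, so the DTD can be retrieved at runtime.
        return "fetch"
    # Unknown identifier (or, by assumption, none at all): the document
    # might not be processed as the information provider intended.
    return "warn"
```

For the VendorCo declaration shown earlier, an agent that can fetch DTDs would retrieve the one named by the system identifier, while an agent that cannot would fall through to the warning.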
The "ignore unrecognized markup" convention is unacceptably unreliable in cases such as forms and tables.
The improved convention is that marked sections are processed as per [ISO8879] (see @@marked sections primer). Additionally, parameter entity references of the form %if-xxx are presumed to resolve to IGNORE, and those of the form %no-xxx are presumed to resolve to INCLUDE, unless the DTD in effect has a declaration for those names.
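The defaulting rule for these parameter entities is small enough to state as code. This is a sketch under the assumption that the DTD in effect is represented as a plain mapping from entity names to their replacement text. The function name and the error raised for other names are hypothetical.

```python
def resolve_status(name, dtd_entities):
    """Resolve a marked-section status keyword for a parameter entity name.

    dtd_entities maps names declared by the DTD in effect to "INCLUDE"
    or "IGNORE". A declaration always wins over the presumed defaults.
    """
    if name in dtd_entities:
        return dtd_entities[name]
    if name.startswith("if-"):
        return "IGNORE"   # feature presumed unsupported
    if name.startswith("no-"):
        return "INCLUDE"  # fallback content presumed wanted
    raise KeyError("undeclared parameter entity: %" + name)
```

An agent whose DTD declares nothing about tables therefore ignores `%if-table` sections and includes `%no-table` sections. An agent whose DTD declares both names gets whatever its DTD says.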
Using this convention, consider the following enhanced document:
<!doctype html system "http://www.w3.org/html-pubtext/960212/html.dtd"> <title>Example: Conditional Table</title> <![ %if-table [ <table> <tr><th>Col 1 <th>Col 2 <th>Col 3 <tr><td>A <td>B <td>C <tr><td>1 <td>2 <td>3 </table> ]]> <![ %no-table [ <pre> Col 1 Col 2 Col 3 A B C 1 2 3 </pre> ]]>
Assuming support for marked sections, an HTML 2.0 user agent will process the table marked up using <pre>, whereas a user agent that supports the 960212 DTD will process the <table> markup. A user agent that does not support the 960212 DTD, but does support tables, is likely to process the <table> markup reliably, since its DTD is likely to have declarations ala:
<!entity % if-table "INCLUDE">
<!entity % no-table "IGNORE">
and declarations for <table>, <tr>, <td>, etc. that match the 960212 DTD.
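To see the two outcomes side by side, the following sketch expands the marked sections of a shortened version of the conditional-table document. It is an approximation built on a regular expression: the function name is hypothetical, nested marked sections are not handled, and the DTD is again modelled as a mapping of declared parameter entities.

```python
import re

# Matches <![ %name [ ... ]]> with an optional ";" after the entity name.
# Simplification: nested marked sections are not supported.
MARKED_SECTION_RE = re.compile(
    r"<!\[\s*%(?P<name>[\w.-]+);?\s*\[(?P<body>.*?)\]\]>", re.DOTALL
)


def expand_marked_sections(text, dtd_entities):
    """Keep INCLUDE sections, drop IGNORE sections."""

    def replace(match):
        name = match.group("name")
        status = dtd_entities.get(name)
        if status is None:
            # Presumed defaults when the DTD has no declaration.
            if name.startswith("if-"):
                status = "IGNORE"
            elif name.startswith("no-"):
                status = "INCLUDE"
            else:
                raise ValueError("undeclared parameter entity: %" + name)
        return match.group("body") if status == "INCLUDE" else ""

    return MARKED_SECTION_RE.sub(replace, text)


DOC = (
    "<![ %if-table [<table><tr><td>A<td>B</table>]]>"
    "<![ %no-table [<pre>A B</pre>]]>"
)
HTML_2_0_ENTITIES = {}  # no declarations for these names
TABLE_AWARE_ENTITIES = {"if-table": "INCLUDE", "no-table": "IGNORE"}
```

With `HTML_2_0_ENTITIES` only the `<pre>` fallback survives. With `TABLE_AWARE_ENTITIES` only the `<table>` markup does.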
This convention would have dealt gracefully with FORM and TABLE. It has the potential to deal gracefully with SCRIPT, MATH, APPLET, etc.
While the marked section markup may seem unwieldy, it is necessary only when both of the following conditions hold:
Here are some cases to mull over, in roughly historical order:
DOCTYPE | Features Used in Doc | Features in Marked Section? | Browser Capabilities | Result |
---|---|---|---|---|
1.0 | 1.0 | no | 1.0 | 100% reliable *1 |
1.x | 1.0+phrase markup | no | 1.0 | some signal loss *2 |
2.0 | 2.0lev1 (no forms) | no | 2.0lev1 | 100% reliable *1 |
2.0 | 2.0 incl forms | no | 2.0lev1 | some form noise *3 |
2.0 | 2.0 incl forms | no | 2.0 | 100% reliable *1 |
3.x(tables) | 2.0+tables | no (tables) | 2.0 | some table noise *3 |
3.x(tables) | 2.0+tables | no (tables) | 3.x (tables) | 100% reliable *1 |
3.x(tables) | 2.0+tables | yes, incl apology | 2.0+marked sections | 100% reliable *4 (apology shown) |
3.x(tables) | 2.0+tables | yes, incl apology | 2.0 | some table noise *5, apology |
3.x(tables) | 2.0+tables | yes, incl apology | 2.0+tables | 98% reliable *6, apology (unneeded) |
3.x(tables) | 2.0+tables | yes, incl apology | 3.x(tables)+marked sections | 100% reliable *1 (table shown) |
In the table above, substitute any of script, style, math, embed, etc. for forms/tables with the same result.
The HTML 2.0 "ignore unknown tags" convention absorbs changes along the lines of phrase markup and new IMG attributes ala *2. But for novel features like forms and tables, we see *3. Note that without marked sections, each non-trivial feature introduced causes a transitional period involving lots of interactions ala *3, with most things settling down ala *1, but with an indefinite burden of *3 style interactions due to outdated software.
Until marked sections are supported, providers who use marked sections are rewarded ala *5, but penalized ala *6. (They are apparently already willing to live with this, as evidenced by the "if your browser doesn't support forms, ..." apologies we see, even on forms-capable browsers.)
With marked sections, non-trivial new features can be introduced with interactions ala *4, with graceful transition back to style *1.
@@information provider maintains several variants; one corresponds to the capabilities of most of his/her readership, and that's the one that's shipped by default. It has links to the other variants, so that remedial clients can downgrade at runtime.
@@see: tables deployment document
The combination of relying on internal labelling (with external labelling in the content type as an optimization) and marked sections is a viable medium-to-long term solution.
The internal labelling/marked section strategy is the equivalent of the color TV solution: send the color signal to everybody, and the folks that can't show the color just throw it away.
The external labelling/format negotiation strategy is like having the broadcasters send black-and-white signal to folks that request it, and color to the rest. In some cases (like inline graphics formats), this is the right thing to do. But it appears that in the vast majority of cases involving new HTML features, it's just not worth the trouble.
@@discuss negotiation based on user-agent, caching, etc.
See: "Marked Sections" in TEI Gentle Intro to SGML
Date: Thu, 9 Nov 95 13:03:39 EST Message-Id:<[email protected]> From: Glenn Adams<[email protected]> To: Multiple recipients of list<[email protected]>
To: [email protected] cc: Multiple recipients of list <[email protected]> Subject: Reliable Interoperability [was: LiveScript and HTML ] In-reply-to: Your message of "Mon, 16 Oct 1995 23:00:26 EDT." <[email protected]> Date: Tue, 17 Oct 1995 00:32:12 -0400 From: "Daniel W. Connolly" <[email protected]>
Date: Sun, 7 Jan 1996 23:45:23 -0800 (PST)
From: Brian Behlendorf <[email protected]>
To: [email protected]
Subject: HTML variants and content negotiation
Message-Id: <[email protected]>