HTML5 support
IMPORTANT NOTE FROM MAINTAINERS: The HTML parser in libxml2 was written 20+ years ago. It does not implement HTML5. Maybe it will some day, maybe it won't. Don't use libxml2 to parse HTML for anything serious. If you maintain a downstream project that uses libxml2's HTML parser, please forward this message to your users.
This probably won't be completed soon but here's an outline.
✅ Tokenization
HTML5 specifies exactly how to parse broken HTML. For the most part, the handling of errors and other corner cases has to be checked and possibly adjusted. Some work in this direction has already been completed. This shouldn't cause major regressions since valid HTML isn't affected. Changes can be implemented directly in the current HTML4 parser. An immediate benefit is that the security of several HTML sanitization libraries based on libxml2 (often through language bindings) is improved.
Some specifics that have to be addressed:
- Tag and attribute names.
- HTML5 treats U+000C FORM FEED (FF) as whitespace.
- Named character references
- Doctype declaration
- Special content modes
  - Script data
  - RCDATA
  - raw text
- Many quirks of the parsing rules, see for example https://htmlparser.info/parser/
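As an illustration of the form feed item in the list above, here is a minimal sketch of an HTML5-style whitespace test. The function name `is_html5_whitespace` is hypothetical and not part of libxml2. The character set follows the WHATWG definition of ASCII whitespace (TAB, LF, FF, CR, SPACE).

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not a libxml2 function.
 * Returns true for the five code points that WHATWG calls
 * "ASCII whitespace": TAB, LF, FF, CR and SPACE.
 */
static bool is_html5_whitespace(int c) {
    return c == 0x09 ||  /* CHARACTER TABULATION */
           c == 0x0A ||  /* LINE FEED */
           c == 0x0C ||  /* FORM FEED */
           c == 0x0D ||  /* CARRIAGE RETURN */
           c == 0x20;    /* SPACE */
}
```

Note that this differs from C's `isspace()`, which in the "C" locale also accepts U+000B (vertical tab). An explicit comparison avoids both that mismatch and any locale dependence.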
❌ CDATA sections in foreign content
Unfortunately, foreign content (SVG and MathML) allows CDATA sections and can't be tokenized correctly unless some parts of the tree construction stage are executed as well.
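The dependency on tree construction can be sketched as follows. When the tokenizer sees `<![CDATA[`, the WHATWG rules only start a CDATA section if the adjusted current node exists and is not in the HTML namespace. Otherwise the input is treated as a bogus comment. All type and function names below are hypothetical and only illustrate the decision, not libxml2's implementation.

```c
#include <stdbool.h>

/* Hypothetical types for this sketch; not libxml2 API. */
typedef enum { NS_HTML, NS_SVG, NS_MATHML } ElemNamespace;
typedef enum { TOK_CDATA_SECTION, TOK_BOGUS_COMMENT } CdataAction;

/*
 * Decide what the tokenizer does after matching "<![CDATA[".
 * has_current_node is false when the stack of open elements is empty.
 * adjusted_current_ns is the namespace of the adjusted current node,
 * which only the tree builder knows.
 */
static CdataAction on_cdata_open(bool has_current_node,
                                 ElemNamespace adjusted_current_ns) {
    if (has_current_node && adjusted_current_ns != NS_HTML)
        return TOK_CDATA_SECTION;   /* inside SVG or MathML */
    return TOK_BOGUS_COMMENT;       /* parse error in HTML content */
}
```

The second parameter is the crux: a tokenizer running on its own has no stack of open elements, so it cannot supply this value.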
❌ Tree builder
At some point, a separate database of HTML5 elements has to be added. Tree construction involves:
- Tree construction state machine
- Implied end tags (this could be used by the old parser as well)
- Adoption agency algorithm
- Foster parenting
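Of the items above, "generate implied end tags" is the simplest to illustrate. The sketch below assumes the stack of open elements is represented as an array of element names, which is a simplification. The function names are hypothetical, and the list of element names reflects a reading of the WHATWG specification that should be checked against the current text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Elements whose end tag may be implied (assumed list, verify against the spec). */
static bool has_implied_end_tag(const char *name) {
    static const char *const names[] = {
        "dd", "dt", "li", "optgroup", "option",
        "p", "rb", "rp", "rt", "rtc"
    };
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (strcmp(name, names[i]) == 0)
            return true;
    }
    return false;
}

/*
 * Pop elements from the top of the stack while they are in the list above.
 * `stack[depth - 1]` is the current node. `except` names an element that
 * must not be popped, or is NULL. Returns the new stack depth.
 */
static size_t generate_implied_end_tags(const char *const *stack,
                                        size_t depth,
                                        const char *except) {
    while (depth > 0) {
        const char *cur = stack[depth - 1];
        if (except != NULL && strcmp(cur, except) == 0)
            break;
        if (!has_implied_end_tag(cur))
            break;
        depth--;
    }
    return depth;
}
```

For a stack of `html, body, ul, li, p`, calling the function without an exception pops `p` and `li`, while passing `"li"` as the exception pops only `p`.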
Some algorithms require extensive tree manipulations which means that a streaming parser is impossible to implement. In HTML5 mode, the SAX interface can only be used for the tokenization stage. It will receive unbalanced calls to startElement and endElement corresponding directly to tokens.
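A consumer of such token-level events therefore cannot assume that every end call matches the most recent start call. The following sketch shows one way a handler could tolerate this. `TokenSink`, `sink_start` and `sink_end` are hypothetical names that only mimic the shape of SAX callbacks; they are not libxml2 API. The sketch also stores the name pointers it receives, which assumes they stay valid, something a real SAX handler would have to ensure by copying.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPEN 64

/* Hypothetical consumer state; not part of libxml2. */
typedef struct {
    const char *open[MAX_OPEN];  /* start tags seen and not yet closed */
    size_t depth;                /* number of entries in `open` */
    unsigned stray_end_tags;     /* end tags with no matching start tag */
} TokenSink;

/* Record a start tag token. Tags beyond MAX_OPEN are silently dropped. */
static void sink_start(TokenSink *s, const char *name) {
    if (s->depth < MAX_OPEN)
        s->open[s->depth++] = name;
}

/*
 * Handle an end tag token. Search from the innermost open tag outwards.
 * If a match is found, close it together with everything opened after it.
 * If no match is found, count the end tag as stray and ignore it.
 */
static void sink_end(TokenSink *s, const char *name) {
    size_t i = s->depth;
    while (i > 0) {
        if (strcmp(s->open[i - 1], name) == 0) {
            s->depth = i - 1;
            return;
        }
        i--;
    }
    s->stray_end_tags++;
}
```

For the token sequence `<p>`, `<b>`, `</p>`, `</b>`, the end tag for `p` closes both open tags, and the later end tag for `b` is counted as stray. This is a deliberately crude recovery policy and not what the HTML5 tree builder does.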
❌ Misc
- Encoding sniffing algorithm
- Only allow supported encodings: https://encoding.spec.whatwg.org
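Byte order mark detection is one early step in encoding sniffing and can be sketched in a few lines. This covers only the BOM check as described in the Encoding Standard. The full algorithm also considers transport-level information and prescans the document for `meta` declarations. The function name `sniff_bom` is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a libxml2 function.
 * Returns the encoding label indicated by a byte order mark at the start
 * of `buf`, or NULL if there is none.
 */
static const char *sniff_bom(const unsigned char *buf, size_t len) {
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    return NULL;
}
```

A NULL result means the caller has to fall back to the remaining sniffing steps.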