QSX: a CSS selector language for DOM extraction
CSS selectors are an expressive way to match a set of elements in the DOM with the querySelector*() methods, but the API is geared towards fetching a flat list of DOM elements.
QSX (Query Selector eXtended) is a lightweight extension to the selector syntax that’s useful for extracting things from the DOM into structured JSON data. It introduces nested structures, the ability to grab HTML attributes and DOM properties, and basic reshaping of the resulting JSON — all in a compact format that’s ideal for command-line usage.
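To make the motivation concrete, here is a minimal sketch of the hand-written mapping that plain querySelectorAll() calls require before you get structured JSON. It does not use QSX syntax or any QSX API. The names ElementLike, LinkRecord, ItemRecord, extractLinks and extractItems are hypothetical, introduced only for this illustration, and the choice of li and a elements is an assumed example.

```typescript
// Minimal structural stand-in for a DOM element, so the sketch runs
// without a browser. Hypothetical type, not part of QSX or the DOM API.
interface ElementLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
  querySelectorAll(selector: string): ArrayLike<ElementLike>;
}

interface LinkRecord {
  href: string | null; // read from an HTML attribute
  text: string | null; // read from a DOM property
}

interface ItemRecord {
  title: string | null;
  links: LinkRecord[];
}

// Flat step: one record per element matched by the selector.
function extractLinks(root: ElementLike, selector: string): LinkRecord[] {
  return Array.from(root.querySelectorAll(selector), (el) => ({
    href: el.getAttribute("href"),
    text: el.textContent,
  }));
}

// Nested step: run a second query inside each matched element.
function extractItems(root: ElementLike): ItemRecord[] {
  return Array.from(root.querySelectorAll("li"), (li) => ({
    title: li.getAttribute("title"),
    links: extractLinks(li, "a"),
  }));
}
```

In a browser, passing an element such as document.body to extractItems() and serializing the result with JSON.stringify() should yield the nested output. This kind of boilerplate is what QSX aims to fold into the selector itself; the language specification describes the actual syntax.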
Explainer
An informal QSX language specification is available.
Implementation
An initial, slightly out-of-date implementation is available on GitHub at danburzo/qsx.
This implementation is used in the hred command-line tool, which works pretty well for day-to-day scraping from HTML and XML. Its features depend on the DOM environment provided by jsdom, whose querySelector*() implementation lags behind browsers in CSS selector support.
A full realization of QSX 1.0, and further iteration on the spec, will be possible when I manage to wrap up selery, my CSS selector parser engine.
Feedback
Feedback on the specification and/or reference implementation is appreciated. You can contact me or open an issue on GitHub.
Colophon: The QSX logo is built with glyphs from LTR Principia, the brutally seriffed typeface by Erik van Blokland.