This document is also available in this non-normative format: ePub
Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
The Model for Tabular Data and Metadata on the Web describes mechanisms for extracting metadata from CSV documents starting with either a tabular data file, or a metadata description. In the case of starting with a CSV document, a procedure is followed to locate metadata describing that CSV (see Locating Metadata in [tabular-data-model]). Alternatively, processing may begin with a metadata file directly, which references the tabular data file(s). However, in some cases, it is preferred to publish datasets using HTML rather than starting with either CSV or metadata files.
Additionally, tabular data is often contained within HTML in the form of HTML table elements (see [html5]). This document describes a means of identifying such tables from [tabular-metadata] and extracting annotated tabular data from HTML tables.
This document does not attempt to address the full range of ways in which tabular datasets can be used within browser-based applications, e.g. related JavaScript efforts such as IndexedDB and Web Components. It is concerned primarily with providing additional information about tabular data. Discussion of deeper integration into Web-based apps is encouraged via the CSVW Community Group.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This non-normative document describes extensions for discovering [tabular-metadata] within HTML documents, and for extracting annotated tables from HTML tables. The normative standards are:
This document was published by the CSV on the Web Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to [email protected] (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 September 2015 W3C Process Document.
Metadata may be exposed in an HTML document in a couple of different ways.
script Element
This section describes mechanisms similar to Embedding JSON-LD in HTML Documents (see [json-ld]) for embedding metadata within an HTML document.
HTML script elements can be used to embed data blocks in documents (see Scripting in [html5]). Metadata [tabular-metadata] describing one or more tabular data files can be embedded in HTML, which can be used as an alternative way to publish datasets.
The content should be placed in a script element with the type set to application/csvm+json. The character encoding of the embedded metadata will match the HTML document's encoding.
<html>
  <head>
    <script type="application/csvm+json">
    {
      "@context": "http://www.w3.org/ns/csvw",
      "tables": [{
        "url": "countries.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryCode",
            "titles": "countryCode",
            "datatype": "string",
            "propertyUrl": "http://www.geonames.org/ontology{#_name}"
          }, {
            "name": "latitude",
            "titles": "latitude",
            "datatype": "number"
          }, {
            "name": "longitude",
            "titles": "longitude",
            "datatype": "number"
          }, {
            "name": "name",
            "titles": "name",
            "datatype": "string"
          }],
          "aboutUrl": "countries.csv{#countryCode}",
          "propertyUrl": "http://schema.org/{_name}",
          "primaryKey": "countryCode"
        }
      }, {
        "url": "country_slice.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryRef",
            "titles": "countryRef",
            "valueUrl": "countries.csv{#countryRef}"
          }, {
            "name": "year",
            "titles": "year",
            "datatype": "gYear"
          }, {
            "name": "population",
            "titles": "population",
            "datatype": "integer"
          }],
          "foreignKeys": [{
            "columnReference": "countryRef",
            "reference": {
              "resource": "countries.csv",
              "columnReference": "countryCode"
            }
          }]
        }
      }]
    }
    </script>
    ...
  </head>
  <body>
    ...
  </body>
</html>
Depending on how the HTML document is served, script content may need to be escaped. See Restrictions for contents of script elements in [html5] for more information.
Processing embedded metadata is the same as processing Overriding Metadata where the retrieved document type is text/html or application/xhtml+xml instead of a JSON document type. The base URI of the encapsulating HTML document provides a "Base URI Embedded in Content" per [RFC3986] section 5.1.1; metadata is extracted from the first script element having @type application/csvm+json. Metadata documents parsed from an HTML DOM will be a stream of character data rather than a stream of UTF-8 encoded bytes. No decoding is necessary if the HTML document has already been parsed into DOM. Each matching script data block is considered to be its own metadata document.
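As an illustration of the extraction step described above, the following is a minimal sketch in Python using only the standard library. The names MetadataScriptParser and extract_embedded_metadata are hypothetical and not defined by any specification. The sketch assumes the HTML has already been decoded to character data, and it does not resolve relative URLs against the base URI of the HTML document.

```python
import json
from html.parser import HTMLParser

METADATA_TYPE = "application/csvm+json"


class MetadataScriptParser(HTMLParser):
    """Collects the text of every script data block typed as tabular metadata."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # raw text of each matching script element
        self._chunks = None  # text chunks collected while inside a match

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = (dict(attrs).get("type") or "").strip().lower()
            if type_attr == METADATA_TYPE:
                self._chunks = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.blocks.append("".join(self._chunks))
            self._chunks = None


def extract_embedded_metadata(html_text):
    """Returns one parsed JSON document per matching script element, in document order."""
    parser = MetadataScriptParser()
    parser.feed(html_text)
    parser.close()
    return [json.loads(block) for block in parser.blocks]


# Hypothetical page with one metadata block and one unrelated script.
page = """<html><head>
<script type="application/csvm+json">
{"@context": "http://www.w3.org/ns/csvw", "url": "countries.csv"}
</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>"""

documents = extract_embedded_metadata(page)
print(documents[0]["url"])
```

The first element of the returned list corresponds to the first matching script element. Returning every match keeps each data block available as its own metadata document.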
An alternative to embedding metadata within a script element is linking to the metadata using an HTTP Link header and/or an HTML link element using the equivalent mechanism described for CSV files by Link Header in [tabular-data-model]. Linked metadata provides an alternate mechanism for referencing metadata that would otherwise be discovered by Locating Metadata as defined in [tabular-data-model]. See The link element in [html5] for more information.
HTTP/1.1 200 OK
Link: <metadata.json>; rel="describedby"
Content-Type: text/html

<html>
  <head>
    <link rel="describedby" type="application/csvm+json" href="metadata.json"/>
    ...
  </head>
</html>
The preceding example shows an HTTP response for an HTML document containing a link element referencing external metadata, along with an HTTP Link header referencing the same metadata.
Best Practice 1: HTML and HTTP Link references must be consistent
If using both an HTML link element and an HTTP Link header, it is important that both reference the same metadata URI.
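One way to check this consistency is sketched below in Python. The function and class names are hypothetical. The Link header parsing is deliberately simplified (it handles the common `<uri>; rel="..."` form only), and the sketch assumes the document has no HTML base element, so both references are resolved against the document URL.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class DescribedByLinkParser(HTMLParser):
    """Collects href values of link elements whose rel includes describedby."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "describedby" in rels and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def described_by_from_header(link_header):
    """Returns target URIs of describedby links in a Link header value (simplified parsing)."""
    targets = []
    for match in re.finditer(r"<([^>]*)>([^<]*)", link_header):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r';\s*rel\s*=\s*(?:"([^"]*)"|([^;,\s]+))', params, re.IGNORECASE)
        if rel:
            values = (rel.group(1) or rel.group(2) or "").lower().split()
            if "describedby" in values:
                targets.append(uri)
    return targets


def linked_metadata_is_consistent(document_url, link_header, html_text):
    """True unless both mechanisms are used and resolve to different metadata URIs."""
    parser = DescribedByLinkParser()
    parser.feed(html_text)
    parser.close()
    from_header = {urljoin(document_url, u) for u in described_by_from_header(link_header)}
    from_html = {urljoin(document_url, u) for u in parser.hrefs}
    return not from_header or not from_html or from_header == from_html


# Hypothetical document URL and markup.
html_doc = ('<html><head><link rel="describedby" type="application/csvm+json" '
            'href="metadata.json"/></head></html>')
same = linked_metadata_is_consistent(
    "http://example.org/data/index.html", '<metadata.json>; rel="describedby"', html_doc)
different = linked_metadata_is_consistent(
    "http://example.org/data/index.html", '<other.json>; rel="describedby"', html_doc)
print(same, different)
```

Comparing resolved absolute URIs rather than the raw attribute values avoids false mismatches when one reference is relative and the other absolute.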
Best Practice 2: Prefer embedded metadata
To avoid inconsistencies, do not both embed metadata and link metadata as differences in the embedded representation and the linked representation can cause processing inconsistencies.
This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified table element, and processing the tabular data to create annotated tables.
In addition to tabular data files, a metadata table id may reference an HTML table within an HTML document. A reference within an HTML document is described using a document-relative fragment identifier which is defined using the @id attribute on an HTML table element.
Best Practice 3: Include metadata and referenced HTML tables in a single HTML document
Self-contained HTML documents, which include both the embedded metadata and the HTML tables that the metadata references, are preferred to HTML tables or CSV files defined externally.
Consideration must be given to the generation of URLs. The standard forms of both JSON [csv2json] and RDF [csv2rdf] generate URLs by appending a fragment identifier to the table URL to identify rows. Unless an explicit propertyUrl is defined, RDF properties are also generated using a fragment of the table URL.
Best Practice 4: Avoid automatically generated URLs
Explicitly define aboutUrl, propertyUrl, and valueUrl, where appropriate, to avoid using automatically generated URL fragments which conflict with using fragments to identify tables.
Raw tabular data may be extracted from HTML tables with use of the dialect description as with CSV tables.
Table rows are numbered starting from 1, as with CSV files.
The in scope language of the table element is used as the lang inherited property for embedded metadata.
Rows containing only th elements have their text content used as the column titles in the embedded metadata.
Rows containing td are used as row content with the text content of each td element used as the cell string value; such rows may also contain th elements which are treated as data elements.
caption elements within a table element are ignored.
th and td elements contained within thead, tbody, or tfoot elements are processed as if they were child elements of the table element.
Processing extracted tables is otherwise handled in a similar manner to CSV as defined in Parsing Tabular Data in [tabular-data-model].
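The row-handling rules above can be sketched as follows in Python, using only the standard library. This is an illustrative sketch under stated assumptions, not a conforming processor: the names TableExtractor and extract_table are hypothetical, every th, td, and tr is assumed to have an explicit end tag, nested tables are not handled, surrounding whitespace in cells is trimmed (an assumption of this sketch), and language and dialect handling are omitted.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Extracts column titles and cell string values from the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.titles = []   # one list of titles per header row (rows with only th)
        self.rows = []     # one list of cell string values per data row
        self._in_table = False
        self._row = None   # (tag, text) pairs for the current tr
        self._cell = None  # [tag, text chunks] for the current th or td

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self._in_table = True
        elif not self._in_table:
            return
        elif tag == "tr":
            # thead, tbody, and tfoot need no handling: rows are collected wherever they occur.
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        # Text outside a cell (for example inside caption) is ignored.
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if not self._in_table:
            return
        if tag in ("th", "td") and self._cell is not None:
            self._row.append((self._cell[0], "".join(self._cell[1]).strip()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            texts = [text for _, text in self._row]
            if self._row and all(kind == "th" for kind, _ in self._row):
                self.titles.append(texts)
            elif self._row:
                # Any row with a td is data; th cells in it are treated as data too.
                self.rows.append(texts)
            self._row = None
        elif tag == "table":
            self._in_table = False


def extract_table(html_text, table_id):
    parser = TableExtractor(table_id)
    parser.feed(html_text)
    parser.close()
    return parser.titles, parser.rows


# Hypothetical input resembling an HTML table with a caption and a tbody.
sample = """<table id="countries"><caption>Countries</caption>
<thead><tr><th>countryCode</th><th>name</th></tr></thead>
<tbody><tr><td>AD</td><td>Andorra</td></tr>
<tr><th>AE</th><td>United Arab Emirates</td></tr></tbody></table>"""

titles, rows = extract_table(sample, "countries")
print(titles, rows)
```

Selecting the table by its id mirrors the fragment-identifier reference used in the metadata. A DOM-based processor would reach the same result by walking the tr descendants of the identified table element.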
Processors using a Document Object Model [DOM] may have the table content coerced to a normalized form that includes optional elements such as tbody.
Best Practice 5: Header rows precede content rows
Tables should be organized with the first rows containing only th elements to describe column headers. Subsequent rows should contain only td elements to describe table data.
Best Practice 6: Avoid use of @colspan and @rowspan attributes
The processing algorithm for tabular data does not account for differences in column counts and row counts that might be present in HTML tables using the @colspan and/or @rowspan attributes; use of these attributes should be avoided. Note that a header row containing @colspan, or a data column containing @rowspan may be ignored using appropriate dialect descriptions.
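A publisher could check a document for these attributes before publication. The following Python sketch is one possible approach; the names SpanAttributeFinder and find_span_attributes are hypothetical, and treating a span value of 1 as harmless is an assumption of this sketch.

```python
from html.parser import HTMLParser


class SpanAttributeFinder(HTMLParser):
    """Records th and td cells that carry a colspan or rowspan other than 1."""

    def __init__(self):
        super().__init__()
        self.findings = []  # (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            for name, value in attrs:
                if name in ("colspan", "rowspan") and (value or "").strip() != "1":
                    self.findings.append((tag, name, value))


def find_span_attributes(html_text):
    finder = SpanAttributeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings


# Hypothetical table whose header cell spans two columns.
spanning = ('<table><tr><th colspan="2">Position</th></tr>'
            '<tr><td>42.5</td><td>1.6</td></tr></table>')
print(find_span_attributes(spanning))
```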
The following tables are identified using #countries and #country_slice:
countryCode | latitude | longitude | name |
---|---|---|---|
AD | 42.5 | 1.6 | Andorra |
AE | 23.4 | 53.8 | United Arab Emirates |
AF | 33.9 | 67.7 | Afghanistan |

countryRef | year | population |
---|---|---|
AF | 1960 | 9616353 |
AF | 1961 | 9799379 |
AF | 1962 | 9989846 |
<table id="countries">
  <caption>Countries</caption>
  <tr><th>countryCode</th><th>latitude</th><th>longitude</th><th>name</th></tr>
  <tr><td>AD</td><td>42.5</td><td>1.6</td><td>Andorra</td></tr>
  <tr><td>AE</td><td>23.4</td><td>53.8</td><td>United Arab Emirates</td></tr>
  <tr><td>AF</td><td>33.9</td><td>67.7</td><td>Afghanistan</td></tr>
</table>
<table id="country_slice">
  <caption>Country Slice</caption>
  <tr><th>countryRef</th><th>year</th><th>population</th></tr>
  <tr><td>AF</td><td>1960</td><td>9616353</td></tr>
  <tr><td>AF</td><td>1961</td><td>9799379</td></tr>
  <tr><td>AF</td><td>1962</td><td>9989846</td></tr>
</table>
The metadata is described here in a script element:
Generating Minimal JSON from this document should result in the following:
[
  {
    "@id": "http://example.org/#countries-AD",
    "http://www.geonames.org/ontology#countryCode": "AD",
    "schema:latitude": 42.5,
    "schema:longitude": 1.6,
    "schema:name": "Andorra"
  },
  {
    "@id": "http://example.org/#countries-AE",
    "http://www.geonames.org/ontology#countryCode": "AE",
    "schema:latitude": 23.4,
    "schema:longitude": 53.8,
    "schema:name": "United Arab Emirates"
  },
  {
    "@id": "http://example.org/#countries-AF",
    "http://www.geonames.org/ontology#countryCode": "AF",
    "schema:latitude": 33.9,
    "schema:longitude": 67.7,
    "schema:name": "Afghanistan"
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1960",
    "http://www.geonames.org/ontology/population": 9616353
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1961",
    "http://www.geonames.org/ontology/population": 9799379
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1962",
    "http://www.geonames.org/ontology/population": 9989846
  }
]
Generating Minimal RDF from this document should result in the following:
@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/#countries-AD> schema:latitude 4.25e1;
  schema:longitude 1.6e0;
  schema:name "Andorra";
  geonames:countryCode "AD" .

<http://example.org/#countries-AE> schema:latitude 2.34e1;
  schema:longitude 5.38e1;
  schema:name "United Arab Emirates";
  geonames:countryCode "AE" .

<http://example.org/#countries-AF> schema:latitude 3.39e1;
  schema:longitude 6.77e1;
  schema:name "Afghanistan";
  geonames:countryCode "AF" .

[
  <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
  <http://dbpedia.org/property/urbanAreaDate> "1962"^^xsd:gYear;
  <http://www.geonames.org/ontology/population> 9989846
] .

[
  <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
  <http://dbpedia.org/property/urbanAreaDate> "1961"^^xsd:gYear;
  <http://www.geonames.org/ontology/population> 9799379
] .

[
  <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
  <http://dbpedia.org/property/urbanAreaDate> "1960"^^xsd:gYear;
  <http://www.geonames.org/ontology/population> 9616353
] .
This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified script element, and processing the tabular data to create annotated tables.
In addition to embedded metadata, CSV data may also be embedded within HTML using a script element. The general provisions and access patterns described in section 2. Extracting Tabular Data from HTML Tables apply for embedded CSV data.
The following CSV script elements are identified using #countries and #country_slice:
<script id="countries" type="text/csv">
countryCode,latitude,longitude,name
AD,42.5,1.6,Andorra
AE,23.4,53.8,"United Arab Emirates"
AF,33.9,67.7,Afghanistan
</script>
<script id="country_slice" type="text/csv">
countryRef,year,population
AF,1960,9616353
AF,1961,9799379
AF,1962,9989846
</script>
The metadata shown in section 2.2 Example can be used to access embedded CSV as well as HTML tables.
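Extracting embedded CSV can be sketched in Python as follows. The names CsvScriptParser and extract_embedded_csv are hypothetical. The sketch assumes each CSV line starts at the beginning of a line inside the script element, that script elements without an id can be skipped, and that the default CSV dialect applies; a full processor would take dialect settings from the metadata.

```python
import csv
import io
from html.parser import HTMLParser


class CsvScriptParser(HTMLParser):
    """Collects the text of script elements typed text/csv, keyed by their id."""

    def __init__(self):
        super().__init__()
        self.blocks = {}         # id -> raw CSV text
        self._current_id = None
        self._chunks = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        type_attr = (attributes.get("type") or "").strip().lower()
        if tag == "script" and type_attr == "text/csv" and attributes.get("id"):
            self._current_id = attributes["id"]
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.blocks[self._current_id] = "".join(self._chunks)
            self._current_id, self._chunks = None, None


def extract_embedded_csv(html_text):
    """Returns a mapping from script id to a list of rows, each a list of strings."""
    parser = CsvScriptParser()
    parser.feed(html_text)
    parser.close()
    tables = {}
    for script_id, text in parser.blocks.items():
        # Strip the line breaks that surround the data inside the script element.
        reader = csv.reader(io.StringIO(text.strip(), newline=""))
        tables[script_id] = list(reader)
    return tables


# Hypothetical page with two embedded CSV data blocks.
csv_page = """<script id="countries" type="text/csv">
countryCode,latitude,longitude,name
AD,42.5,1.6,Andorra
AE,23.4,53.8,"United Arab Emirates"
</script>
<script id="country_slice" type="text/csv">
countryRef,year,population
AF,1960,9616353
</script>"""

embedded = extract_embedded_csv(csv_page)
print(embedded["countries"][2])
```

Keying the result by id lets a table reference such as #countries be looked up after removing the leading fragment marker.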