Gleaning Resource Descriptions from Dialects of Languages (GRDDL)

W3C Team Submission 16 May 2005

This Version:
http://www.w3.org/TeamSubmission/2005/SUBM-grddl-20050516/
Latest Version:
http://www.w3.org/TeamSubmission/grddl/
Previous Version:
http://www.w3.org/TR/2004/NOTE-grddl-20040413/
Authors:
Dominique Hazaël-Massieux, Dan Connolly

Abstract

This document presents GRDDL, a mechanism for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT.

Status of This Document

The previous version of this work was released in April 2004 as a W3C Coordination Group Note by the Semantic Web Coordination Group, as it was relevant to issues that were postponed by the RDF Core Working Group: rdfms-validating-embedded-rdf and faq-html-compliance. It turns out to be relevant to Web Architecture issues such as RDFinXHTML-35 and namespaceDocument-8 as well. A related design history and rationale discusses the contribution of this design to those TAG issues. This 16 May 2005 version is released as a W3C Team Submission for consideration by the community.

This design started with a sketch in May 2003. There are now multiple implementations including an online service and a growing test suite. A log of changes is appended.

Please send review comments, implementation experience reports, etc. to [email protected], the mailing list of the RDF in XHTML task-force of the Semantic Web Best Practices and Deployment Working Group and the HTML Working Group; the mailing list has a public archive.

By publishing this document, Dan Connolly and Dominique Hazaël-Massieux have made a formal submission to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. Please consult the complete list of acknowledged W3C Team Submissions.

Contents

  1. Introduction
  2. The GRDDL profile for XHTML
  3. The GRDDL transformation attribute in XML
  4. GRDDL for XML Namespace and HTML Profile Documents
  5. GRDDL Transformations
  6. Security Considerations
  7. References

1. Introduction: Data and Documents

Data formats like XML and XHTML are used on the Web for a large spectrum of purposes, from poetry and drama to spreadsheets and databases. The information in a poem may be rich and subtle; we might use a computer to pick out the author's name, but themes and opposing forces are not readily computable. When extracting data from documents, preserving meaning is important: if a document says "It is highly unlikely that the king was over twenty years old" and a computation returns "the king was over twenty years old," that computation does not preserve meaning.

The Resource Description Framework[RDFC04] codifies certain forms of data—simple logical statements like age(king, 20)—and specifies basic rules for preserving meaning. The framework includes a constrained XML concrete syntax, but it also includes an abstract syntax. GRDDL is a mechanism for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents.

For example, Dublin Core meta-data can be written in an HTML dialect[RFC2731] that has a clear correspondence to an encoding in RDF/XML[DCRDF]. The correspondence can be expressed in an XSLT transformation, dc-extract.xsl:

[Figure: Transforming HTML meta-data to RDF/XML via dc-extract.xsl]

The transformation preserves the author's meaning, provided the author understood the conventions of this dialect. But an author may have accidentally conformed to the syntactic conventions without any knowledge of Dublin Core at all. In that case, the mapping most likely does not preserve the author's meaning. In GRDDL, documents contain explicit references to the conventions that the author used to encode data.

2. The GRDDL profile for XHTML

A reference to http://www.w3.org/2003/g/data-view from the profile attribute (cf. section 7.4.4.3 Meta data profiles of [HTML4]) of an XHTML document[XHTML] indicates that links of type transformation relate the document to transformations that preserve its meaning.

For example, this document not only follows the conventions of [RFC2731], but it explicitly uses the GRDDL profile and links to a transformation that extracts the meta-data in RDF/XML in a way that preserves the meaning of the document:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Some Document</title>
    <link rel="transformation"
       href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />
    <meta name="DC.Subject"
       content="ADAM; Simple Search; Index+; prototype" />
    ...
  </head>
  ...
</html>

In the figure below, the arrow labelled info relates a document to an abstract notion of the information contained in the document. It shows that the RDF data extracted via the dc-extract.xsl transformation is part of the information contained in the document:

[Figure: Decoding HTML meta-data to RDF via a link to a transformation]

This is what the data looks like in RDF/XML:

<rdf:RDF
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  >
  <rdf:Description rdf:about="">
    <dc:subject>ADAM; Simple Search; Index+; prototype</dc:subject>
  </rdf:Description>
</rdf:RDF>
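
To make the correspondence concrete, the following is a minimal Python sketch of the kind of mapping such a transformation performs. It is not dc-extract.xsl itself: the function name glean_dublin_core, the triple representation, and the rule of lower-casing the part after "DC." to form the property name are assumptions made for illustration, and the sample omits the elided parts of the example document so that it parses.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"


def glean_dublin_core(xhtml_text, base=""):
    """Return (subject, predicate, object) triples for DC.* meta elements.

    Illustrative stand-in for a real transformation; 'base' plays the role
    of rdf:about="" (the document itself).
    """
    root = ET.fromstring(xhtml_text)
    triples = []
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name", "")
        content = meta.get("content")
        if name.startswith("DC.") and content is not None:
            # Assumption: DC.Subject maps to the dc:subject property.
            prop = name[len("DC."):].lower()
            triples.append((base, DC_NS + prop, content))
    return triples


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Some Document</title>
    <meta name="DC.Subject"
       content="ADAM; Simple Search; Index+; prototype" />
  </head>
  <body/>
</html>"""

for triple in glean_dublin_core(SAMPLE):
    print(triple)
```

The single triple produced corresponds to the dc:subject statement in the RDF/XML above.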

Note that an XHTML document may conform to a number of dialects simultaneously and link to more than one decoding algorithm:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Joe Lambda's Home page [an example of RDF in XHTML]</title>
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl" />
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokGeoURL.xsl" />
...

[Figure: link to multiple transformations]
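
The discovery step described in this section can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than a reference implementation: the function name profile_transformations is hypothetical, the profile attribute is treated as a whitespace-separated list of URIs, link types are compared case-insensitively, and the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def profile_transformations(xhtml_text):
    """List the href of each rel="transformation" link, but only when the
    head element references the GRDDL profile."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # Assumption: profile holds a whitespace-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        rels = link.get("rel", "").lower().split()
        href = link.get("href")
        if "transformation" in rels and href:
            hrefs.append(href)
    return hrefs


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Joe Lambda's Home page [an example of RDF in XHTML]</title>
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl" />
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokGeoURL.xsl" />
</head>
<body/>
</html>"""

for href in profile_transformations(SAMPLE):
    print(href)
```

Without the profile reference the function returns an empty list, reflecting the point that a document must explicitly opt in before its links are treated as meaning-preserving transformations.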

3. The GRDDL transformation attribute in XML

The GRDDL profile mechanism is a special case of GRDDL designed to fit within the DTD-based syntax of XHTML. The general form of GRDDL is an attribute suitable for use with a wide variety of XML dialects.

The transformation attribute in the http://www.w3.org/2003/g/data-view# namespace on the root element of an XML document refers to a list of transformations that preserve the document's meaning.

The value of the data-view:transformation attribute designates a list of algorithms by URI reference (cf. section 4.4.1. URI references in [WEBARCH]).

In some dialect of XHTML not constrained by DTD syntax, the above example can be written:

<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:data-view="http://www.w3.org/2003/g/data-view#"
  data-view:transformation="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl
    http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl
    http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokGeoURL.xsl">
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Joe Lambda's Home page [an example of RDF in XHTML]</title>
...
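
As a minimal sketch of how a consumer might read this attribute, the Python below looks up the transformation attribute in the GRDDL namespace on the root element and splits its value on whitespace. The function name root_transformations and the base parameter are hypothetical; resolving each URI reference against the document's own URI is an assumption about how relative references would be handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def root_transformations(xml_text, base=""):
    """Return the transformation URIs named on the root element."""
    root = ET.fromstring(xml_text)
    value = root.get(f"{{{GRDDL_NS}}}transformation", "")
    # The attribute value is a whitespace-separated list of URI references.
    return [urljoin(base, ref) for ref in value.split()]


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:data-view="http://www.w3.org/2003/g/data-view#"
  data-view:transformation="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl
    http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl
    http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokGeoURL.xsl">
  <head><title>Joe Lambda's Home page</title></head>
</html>"""

for uri in root_transformations(SAMPLE):
    print(uri)
```

Because the lookup uses the namespace name rather than the prefix, it works regardless of which prefix a document binds to http://www.w3.org/2003/g/data-view#.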

4. GRDDL for XML Namespace and HTML Profile Documents

Transformations can be associated not only with individual documents but also with whole dialects that share an XML namespace or XHTML profile. Consider this privacy policy written in P3Q, a contrived analog to P3P[P3P]:

<POLICIES xmlns="http://www.w3.org/2004/01/rdxh/p3q-ns-example">
	<EXPIRY max-age="604800"/>
...

The namespace document for P3Q relates the grokP3Q.xsl transformation to all P3Q documents:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dataview="http://www.w3.org/2003/g/data-view#">
 <rdf:Description
    rdf:about="http://www.w3.org/2004/01/rdxh/p3q-ns-example">
   <dataview:namespaceTransformation
       rdf:resource="http://www.w3.org/2004/01/rdxh/grokP3Q.xsl"/>
 </rdf:Description>
</rdf:RDF>

[Figure: transformation applied to profile (glean via profile)]

That is an example of the general case:

Likewise for XHTML profiles:

Note that statements gleaned from namespace documents and profile documents are a part of their meaning; these documents need not be written in RDF/XML directly.

Consider a purchase order whose namespace document is an XML Schema, where the XML Schema bears a data-view:transformation attribute licensing extraction of statements that include namespaceTransformation statements:

[Figure: transformation applied to profile (glean via profile)]

Analogously, consider a profile document whose information content includes, by way of a GRDDL transformation, a profileTransformation relationship:

[Figure: transformation applied to profile (glean via profile)]
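
The P3Q namespace example earlier in this section can be sketched in Python as below. This is only an approximation under several assumptions: the function name namespace_transformations is hypothetical, the namespace document is passed in as text rather than retrieved from the Web, and the code recognizes only the particular RDF/XML shape used in the example (a general consumer would use an RDF parser, and the namespace document might itself need a GRDDL transformation before any statements can be read from it).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def namespace_transformations(instance_text, namespace_doc_text):
    """Find namespaceTransformation values stated about the namespace
    of the instance document's root element."""
    root = ET.fromstring(instance_text)
    # ElementTree spells qualified names as "{namespace}local".
    if not root.tag.startswith("{"):
        return []
    ns_name = root.tag[1:].split("}", 1)[0]

    ns_doc = ET.fromstring(namespace_doc_text)
    found = []
    for desc in ns_doc.iter(f"{{{RDF_NS}}}Description"):
        if desc.get(f"{{{RDF_NS}}}about") != ns_name:
            continue
        for prop in desc.findall(f"{{{GRDDL_NS}}}namespaceTransformation"):
            resource = prop.get(f"{{{RDF_NS}}}resource")
            if resource:
                found.append(resource)
    return found


POLICY = """<POLICIES xmlns="http://www.w3.org/2004/01/rdxh/p3q-ns-example">
  <EXPIRY max-age="604800"/>
</POLICIES>"""

NAMESPACE_DOC = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dataview="http://www.w3.org/2003/g/data-view#">
 <rdf:Description
    rdf:about="http://www.w3.org/2004/01/rdxh/p3q-ns-example">
   <dataview:namespaceTransformation
       rdf:resource="http://www.w3.org/2004/01/rdxh/grokP3Q.xsl"/>
 </rdf:Description>
</rdf:RDF>"""

print(namespace_transformations(POLICY, NAMESPACE_DOC))
```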

5. GRDDL Transformations

The transformation link type refers to a transformation algorithm that should have available representations in widely-supported formats. We expect most consumers to support XSLT version 1[XSLT1] for the foreseeable future, though XSLT2[XSLT2] deployment is increasing. While JavaScript, C, or any other programming language technically expresses the relevant information, XSLT is specifically designed to express XML to XML transformations and has some good safety characteristics.

Transformation algorithms should be well-defined functions whose only input is the source document. The use of the XSLT document() function to incorporate other data at transformation time is an error.
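
One way a consumer might flag stylesheets that break this rule is a simple static scan, sketched below in Python. It is a heuristic, not a full XPath analysis: the function name uses_document_function is hypothetical, and the scan inspects every attribute value, so a string literal that merely contains the text document( would also be flagged.

```python
import re
import xml.etree.ElementTree as ET

# Match an unprefixed call such as document('x.xml'), but not
# my:document(...) or some-document(...).
DOCUMENT_CALL = re.compile(r"(?<![\w.:-])document\s*\(")


def uses_document_function(xslt_text):
    """Return True if any attribute value appears to call document()."""
    root = ET.fromstring(xslt_text)
    for element in root.iter():
        for value in element.attrib.values():
            if DOCUMENT_CALL.search(value):
                return True
    return False


SUSPECT = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:copy-of select="document('other.xml')"/>
  </xsl:template>
</xsl:stylesheet>"""

print(uses_document_function(SUSPECT))
```

A consumer could refuse to run a flagged stylesheet, or run it with external document access disabled if its XSLT processor offers such a setting.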

6. Security Considerations

RFC 2046, in section 9. Security Considerations says:

Implementors should pay special attention to the security implications of any media types that can cause the remote execution of any actions in the recipient's environment. In such cases, the discussion of the "application/postscript" type may serve as a model for considering other media types with remote execution capabilities.

Given the expressive power of XSLT, and the possibility of accessing external resources from an XSLT style sheet (e.g. through the document function or the xsl:import mechanism), implementors should take appropriate measures to prevent malicious use of this mechanism.
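
What counts as an appropriate measure is left to implementors. As one hedged illustration, the Python sketch below shows two checks a consumer might apply before running a transformation: an allow-list test on the transformation URI, and a listing of the external style sheets a transformation pulls in. The names TRUSTED_HOSTS, is_allowed_transformation, and external_stylesheet_refs, and the policy itself, are assumptions for illustration; xsl:include is checked alongside xsl:import because it is a second XSLT 1.0 element that loads an external style sheet.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical policy: only fetch transformations from these hosts.
TRUSTED_HOSTS = frozenset({"www.w3.org"})


def is_allowed_transformation(uri, trusted_hosts=TRUSTED_HOSTS):
    """Accept only http(s) URIs whose host is on the allow-list."""
    parts = urlsplit(uri)
    return parts.scheme in ("http", "https") and parts.hostname in trusted_hosts


def external_stylesheet_refs(xslt_text):
    """List the href values of xsl:import and xsl:include elements."""
    root = ET.fromstring(xslt_text)
    refs = []
    for element in root.iter():
        if element.tag in (f"{{{XSL_NS}}}import", f"{{{XSL_NS}}}include"):
            href = element.get("href")
            if href:
                refs.append(href)
    return refs


print(is_allowed_transformation(
    "http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl"))
print(is_allowed_transformation("file:///etc/passwd"))
```

The same allow-list test could be applied to each reference returned by external_stylesheet_refs before the style sheet is handed to an XSLT processor.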

7. References

HTML4
HTML 4.01 Specification, D. Raggett, A. Le Hors, I. Jacobs, Editors, W3C Recommendation, 24 December 1999, http://www.w3.org/TR/1999/REC-html401-19991224. Latest version available at http://www.w3.org/TR/html401.
RDFC04
Resource Description Framework (RDF): Concepts and Abstract Syntax, G. Klyne, J. J. Carroll, Editors, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/. Latest version available at http://www.w3.org/TR/rdf-concepts/.
XSLT1
XSL Transformations (XSLT) Version 1.0, J. Clark, Editor, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/1999/REC-xslt-19991116. Latest version available at http://www.w3.org/TR/xslt.
XHTML
Modularization of XHTML™, S. Schnitzenbaumer, F. Boumphrey, T. Wugofski, S. McCarron, M. Altheim, S. Dooley, Editors, W3C Recommendation, 10 April 2001, http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/. Latest version available at http://www.w3.org/TR/xhtml-modularization/.
XSLT2
XSL Transformations (XSLT) Version 2.0, M. Kay, Editor, W3C Working Draft (work in progress), 11 February 2005, http://www.w3.org/TR/2005/WD-xslt20-20050211/. Latest version available at http://www.w3.org/TR/xslt20.
WEBARCH
Architecture of the World Wide Web, Volume One, N. Walsh, I. Jacobs, Editors, W3C Recommendation, 15 December 2004, http://www.w3.org/TR/2004/REC-webarch-20041215/. Latest version available at http://www.w3.org/TR/webarch/.

Informative references

RFC2731
Encoding Dublin Core Metadata in HTML, J. Kunze, 1999.
DCRDF
Expressing Simple Dublin Core in RDF/XML, Beckett, Miller, Brickley, 2002-07-31.
P3P
The Platform for Privacy Preferences 1.0 (P3P1.0) Specification, M. Marchiori, Editor, W3C Recommendation, 16 April 2002, http://www.w3.org/TR/2002/REC-P3P-20020416/. Latest version available at http://www.w3.org/TR/P3P/.

Extended Example

An example homepage with Dublin Core, GeoURL, RSS, Creative Commons, etc. demonstrates several transformations and dialects.

Available Software and Services

The authors provide a pair of online services on an experimental, best-effort basis:

Client-side implementations are also in development:

Implementation experience to date suggests investigating the following issues:

Test Cases

A collection of test cases is in development. The original announcement was 02 Feb 2005. As of this writing ($Revision: 1.9 $ of $Date: 2005/05/16 20:37:52 $) they include:

Change History

Changes since the Apr 2004 release:


$Log: Overview.html,v $
Revision 1.9  2005/05/16 20:37:52  connolly
SOTD tweak w.r.t. prev ver

Revision 1.8  2005/05/16 20:36:33  connolly
added previous version link
noted author's draft in changes section

Revision 1.7  2005/05/16 20:32:49  connolly
- figure markup tweak
- SOTD CG to WG
- hid bib fodder

Revision 1.6  2005/05/16 20:25:34  connolly
SOTD

Revision 1.5  2005/05/16 20:15:34  connolly
copyright years

Revision 1.4  2005/05/16 20:14:52  connolly
standard TeamSubmission icon markup, copyright markup

Revision 1.3  2005/05/16 20:12:57  connolly
- pubrules:
 - this version/latest version
 - move CVS keywords to meta
 - added team submission stylesheet

Revision 1.2  2005/05/16 20:02:29  connolly
copy of http://www.w3.org/2004/01/rdxh/spec.html 1.74 2005/04/20 20:54:10

Revision 1.74  2005/04/20 20:54:10  connolly
tm fix

Revision 1.73  2005/04/20 20:43:31  connolly
- a computation, not an

Revision 1.72  2005/04/20 20:37:44  connolly
- revised abstract
- added P3P ref
- spell-check

Revision 1.71  2005/04/20 17:53:59  connolly
"Implementation Experience" section becomes "Software and Services"

Revision 1.70  2005/04/20 17:50:31  connolly
- re-worked namespace doc section
- moved GRDDL transformations section down near security considerations
- moved open issues to implementation experience section
- noted May 2003 sketch in change history
- reduced scope of "Example Use Cases" section
- added missing . in SOTD; removed extra blank line in example

Revision 1.69  2005/04/20 17:25:15  connolly
brought grddl-xml section inline with revised intro etc.
added example

Revision 1.68  2005/04/20 17:13:54  connolly
re-worked "GRDDL for XHTML" section w.r.t. information content
separated "GRDDL transformations" out as its own section

Revision 1.67  2005/04/20 16:27:54  connolly
smoothed out intro a bit

Revision 1.66  2005/04/20 16:07:59  connolly
re-worked SOTD in preparation for release as team submission

Revision 1.65  2005/03/25 23:03:14  connolly
- reworking intro...
  - not done... changing platforms to work on another figure
- reduced indentation for examples
- succeeded in usinig object for svg/png illustration

Revision 1.64  2005/03/24 16:05:55  connolly
- TR base stylesheet
- note some issues

Revision 1.63  2005/03/24 05:08:19  connolly
merged a couple paras in SOTD

Revision 1.62  2005/03/24 04:47:59  connolly
oops; figure was inside example div; fixed
removed extra paren in test case appendix

Revision 1.61  2005/03/24 04:35:59  connolly
added several illustrations which clarify quite
a bit and suggest different ways of explaining/specifying things

Revision 1.60  2005/03/24 03:23:49  connolly
getting feedback on diagrams

Revision 1.59  2005/03/23 23:50:28  connolly
working on examples, figures

Revision 1.58  2005/03/23 18:53:51  connolly
removed overly specific reference to HTML in the XML section

Revision 1.57  2005/03/22 23:49:22  connolly
- moved supplementary material from under TOC to appendixes

Revision 1.56  2005/03/22 23:34:46  connolly
- dropped issue about namespace-qualifying the rel value;
  conflicting profiles doesn't seem like a big concern
- dropped issue about xsl:import; seems covered elsewhere
- specified URI by reference to webarch
- added References section
- removed "a few issues remain" from SOTD
- added a class for editorial issues

Revision 1.55  2005/03/22 23:07:05  connolly
- specified profileTransformation along with namespaceTransformation
- added test cases appendix
  (thought about linking testable assertions to test cases,
   but didn't follow thru)

- linked changelog from "This Version"
- demangled Dom's name
   (again. argh. wish nxml-mode and mule would get along)
- copyright 2005 too.
- moved bulk of implementation stuff from SOTD to an appendix

Revision 1.54  2005/03/22 22:17:42  connolly
added changelog

----------------------------
revision 1.53
date: 2004/12/07 23:19:58;  author: connolly;  state: Exp;  lines: +5 -5
interpreter renamed transformation
----------------------------
revision 1.52
date: 2004/06/09 14:06:20;  author: connolly;  state: Exp;  lines: +3 -3
demangle Dom's name
----------------------------
revision 1.51
date: 2004/06/09 13:59:00;  author: connolly;  state: Exp;  lines: +3 -3
typo
----------------------------
revision 1.50
date: 2004/06/09 13:57:46;  author: connolly;  state: Exp;  lines: +10 -10
rework abstract
----------------------------
revision 1.49
date: 2004/06/09 13:47:20;  author: connolly;  state: Exp;  lines: +27 -8
- beefed up abstract
- pointed to client-side implementations
----------------------------
revision 1.48
date: 2004/04/13 20:54:41;  author: connolly;  state: Exp;  lines: +4 -17
remove some SOTD boilerplate
----------------------------
revision 1.47
date: 2004/04/13 20:53:37;  author: connolly;  state: Exp;  lines: +11 -8
now that the TR version is published, revert status, stylesheet, changelog
----------------------------