Semantic Web Activity: Advanced Development
This document presents the "TR Automation" project. Based on Semantic Web tools and technologies, this project has made it possible to streamline the publication paper trail of W3C Technical Reports, to maintain an RDF-formalized index of these specifications, and to create a number of tools using these newly available data.
The most visible part of W3C's work is its main deliverables: the Technical Reports published by W3C Working Groups. These Technical Reports are published following a well-defined process, defined by the Process Document and detailed in the publication rules (also known as "pubrules") and in the Recommendation Track transition document.
While there are still plenty of opportunities to automate the process behind the publication of W3C Technical Reports, the core of this project has been realized. This translates into the following deliverables:
Previously done by hand, the process of updating the list of Technical Reports (referred to as the TR page) is now entirely automated; that is, the system is able to extract all the necessary information from a given Technical Report and to process it as described by the W3C Process to produce an updated version of the TR page.
This works as follows:
But going into a bit more detail reveals some interesting points.
To be published as a W3C Technical Report, a document has to comply with a set of rules, often referred to as pubrules. While these rules were developed to enforce requirements from the Process Document and a certain visual consistency between Technical Reports, they turn out to be formal enough that:
Since W3C Technical Reports are published normatively as valid HTML or XHTML, and since RDF has an XML serialization, XSLT works pretty well for the actual work of checking the rules and extracting the metadata - noting that valid HTML can be transformed into XHTML on the fly using, for instance, HTML Tidy.
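To give a feel for that pipeline - a minimal sketch only, not the actual W3C tooling; the input file name and the tr2rdf.xsl style sheet are made up - HTML Tidy and an XSLT processor could be chained as follows, here in Python with lxml:

# Sketch: normalize an HTML Technical Report to XHTML with tidy,
# then apply a (hypothetical) XSLT extraction style sheet with lxml.
import subprocess
from io import BytesIO
from lxml import etree

def html_to_xhtml(html_bytes):
    # "tidy -asxhtml" converts valid HTML to XHTML on the fly;
    # tidy exits non-zero on mere warnings, hence check=False
    result = subprocess.run(
        ["tidy", "-asxhtml", "-quiet", "--show-warnings", "no"],
        input=html_bytes, capture_output=True, check=False)
    return result.stdout

def extract_metadata(xhtml_bytes, stylesheet_path):
    # apply the extraction style sheet; the result is an RDF/XML tree
    transform = etree.XSLT(etree.parse(stylesheet_path))
    return transform(etree.parse(BytesIO(xhtml_bytes)))

# "REC-xml-20040204.html" and "tr2rdf.xsl" are hypothetical file names
with open("REC-xml-20040204.html", "rb") as f:
    print(str(extract_metadata(html_to_xhtml(f.read()), "tr2rdf.xsl")))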
Also, a fair number of the pubrules amount to checking that some properties of the document are properly and consistently reflected in its text and formatting; that means there is a common base between extracting the metadata and checking compliance with the pubrules.
Thus, there are 3 XSLT style sheets at work:
For instance, applying the RDF/XML Formatter to XML 1.0 (a pubrules-compliant document) outputs:
<rdf:RDF xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#"
         xmlns:org="http://www.w3.org/2001/04/roadmap/org#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rec="http://www.w3.org/2001/02pd/rec54#"
         xmlns="http://www.w3.org/2001/02pd/rec54#"
         xmlns:mat="http://www.w3.org/2002/05/matrix/vocab#">
  <REC rdf:about="http://www.w3.org/TR/2004/REC-xml-20040204">
    <dc:date>2004-02-04</dc:date>
    <dc:title>Extensible Markup Language (XML) 1.0 (Third Edition)</dc:title>
    <cites>
      <ActivityStatement rdf:about="http://www.w3.org/XML/Activity"/>
    </cites>
    <doc:versionOf rdf:resource="http://www.w3.org/TR/REC-xml"/>
    <org:deliveredBy rdf:parseType="Resource">
      <contact:homePage rdf:resource="http://www.w3.org/XML/Group/Core"/>
    </org:deliveredBy>
    <doc:obsoletes rdf:resource="http://www.w3.org/TR/2003/PER-xml-20031030"/>
    <previousEdition rdf:resource="http://www.w3.org/TR/2004/REC-xml-20040204"/>
    <mat:hasErrata rdf:resource="http://www.w3.org/XML/xml-V10-3e-errata"/>
    <mat:hasTranslations rdf:resource="http://www.w3.org/2003/03/Translations/byTechnology?technology=REC-xml"/>
    <editor rdf:parseType="Resource">
      <contact:fullName>Tim Bray</contact:fullName>
      <contact:mailbox rdf:resource="mailto:[email protected]"/>
    </editor>
    <editor rdf:parseType="Resource">
      <contact:fullName>Jean Paoli</contact:fullName>
      <contact:mailbox rdf:resource="mailto:[email protected]"/>
    </editor>
    <editor rdf:parseType="Resource">
      <contact:fullName>C. M. Sperberg-McQueen</contact:fullName>
      <contact:mailbox rdf:resource="mailto:[email protected]"/>
    </editor>
    <editor rdf:parseType="Resource">
      <contact:fullName>Eve Maler</contact:fullName>
      <contact:mailbox rdf:resource="mailto:[email protected]"/>
    </editor>
    <editor rdf:parseType="Resource">
      <contact:fullName>François Yergeau</contact:fullName>
      <contact:mailbox rdf:resource="mailto:[email protected]"/>
    </editor>
    <mat:hasImplReport rdf:resource="http://www.w3.org/XML/2003/09/xml10-3e-implementation.html"/>
  </REC>
  <FirstEdition rdf:about="http://www.w3.org/TR/2004/REC-xml-20040204"/>
</rdf:RDF>
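Once available in this form, the metadata can be consumed by any RDF toolkit. As a hypothetical example (the project itself relies on XSLT and cwm, not on rdflib), reading the title, date and editors back out of the description above could look like:

# Sketch: read the extracted metadata back with rdflib (an assumption;
# "xml10-3e.rdf" is a hypothetical file holding the RDF/XML above)
from rdflib import Graph, Namespace

DC = Namespace("http://purl.org/dc/elements/1.1/")
CONTACT = Namespace("http://www.w3.org/2000/10/swap/pim/contact#")
REC54 = Namespace("http://www.w3.org/2001/02pd/rec54#")

g = Graph()
g.parse("xml10-3e.rdf", format="xml")

for tr in g.subjects(DC.title, None):
    print(g.value(tr, DC.title), "-", g.value(tr, DC.date))
    for editor in g.objects(tr, REC54.editor):
        print("  editor:", g.value(editor, CONTACT.fullName))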
Open questions:
- how to distinguish assertions (e.g. this is a Rec) from guesses (e.g. this is a Last Call)?
- … transformation?
- … reference (but see issue about inferencing)?

The current publication process uses the RDF data at its core as follows:
This process is a good example of a paper trail machine.
Note: the freezing of the TR page happens regularly (every 6 months); at some point, it could be approved by the AC Forum as part of the process (at least the first time).
@@@
@@@
The publication process (through its many variations) had been enforced mostly by human-only interactions since the start of W3C, but with growing pains as the number of Working Groups and Technical Reports rose over time.
The main bottleneck that had started to appear was the work done by the W3C Webmaster, who, in this process, is in charge of tasks such as updating the list of Technical Reports at http://www.w3.org/TR/. While these tasks may not seem overwhelming, the detailed analysis that some of the "pubrules" require and the ever-growing size of the Technical Reports list made the exercise error-prone, particularly at peak times, when the number of (rather large) documents published reached 15 per day.
The automation needs were divided [member only] into 3 separate steps:
The idea that this should be automated goes back at least to September 1997 (see Dan Connolly's email on this topic, and the follow-up meeting series - Team-only), and tools that helped the Webmaster assess the readiness of a document grew in parallel with the matching rules. For instance, the now independent W3C Linkchecker comes from a tool initially developed by one of the W3C Webmasters to help find broken links in documents about to be published.
The culmination of these tools came with the pubrules checker, an XSLT-based tool that shows at a glance which rules are not met by the document being checked.
With the pubrules checker, it became possible to check semi-automatically whether a document could be published and to extract the data that had to be added to the Technical Reports list.
To automate the publication process, the first step was to formalize these data - in RDF, since the extracted metadata are in RDF. Dan Connolly had started to work on this step in March 2000 (Team-only), developing a fairly simple style sheet to extract RDF data about all the latest-version information given in the TR list at that time.
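The spirit of that extraction can be sketched as follows - a hypothetical reconstruction in Python rather than XSLT, with made-up assumptions about the markup of the TR page:

# Sketch: scrape the TR page and emit one triple per entry, linking
# each latest-version URI to its title. The XPath assumes every entry
# is an <a> pointing at an absolute http://www.w3.org/TR/ URI, which
# is an assumption about the markup, not a description of it.
from lxml import html
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")

page = html.parse("http://www.w3.org/TR/")  # lxml can parse straight from a URL
g = Graph()
for anchor in page.xpath('//a[starts-with(@href, "http://www.w3.org/TR/")]'):
    g.add((URIRef(anchor.get("href")), DC.title,
           Literal(anchor.text_content().strip())))
print(g.serialize(format="xml"))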
As always, the devil was in the details, and some edge cases had to be taken into account in this process. Some rare cases were handled on the side.
But this only captured information about the latest versions; to make a reasonably useful system, the dated-version URIs were needed too.
This meant getting the data from the filesystem, which was back then the only official encoding of the latest-version/this-version relationships. This proved to be quite challenging, for various reasons, but mainly because the filesystem usage (usually symbolic links) had changed over time and finding consistency was not necessarily easy. First we had to extract the core data from the filesystem, and then specify the data that were incorrectly deduced from it.
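A sketch of this filesystem-mining step (hypothetical: the actual scripts and directory layout are not shown here), assuming the convention where /TR/<shortname> is a symbolic link into the dated directory of the latest published version:

# Sketch: recover latest-version/dated-version pairs from symbolic
# links; TR_ROOT and the layout are assumptions, not the real setup.
import os

TR_ROOT = "/www/TR"  # hypothetical document root for http://www.w3.org/TR/

for name in sorted(os.listdir(TR_ROOT)):
    path = os.path.join(TR_ROOT, name)
    if os.path.islink(path):
        # the symlink target encodes the dated latest version
        dated = os.path.relpath(os.path.realpath(path), TR_ROOT)
        print("http://www.w3.org/TR/%s/ -> http://www.w3.org/TR/%s/" % (name, dated))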
@@@@
Once all those data were collected, they just needed to be aggregated and sorted, which was done using cwm and a filter, as specified in a Makefile. The result was the first version of an RDF-formalized list of the W3C digital library.
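Conceptually, this aggregation boils down to merging all the RDF sources into a single graph, since the merge of RDF graphs is simply their union. A rough equivalent of the cwm step, with hypothetical file names and using rdflib for brevity:

# Sketch: aggregate several RDF sources into one graph (the project
# used cwm driven by a Makefile; the file names here are made up).
from rdflib import Graph

sources = ["tr-latest.rdf", "tr-dated.rdf", "tr-fixes.rdf"]

merged = Graph()
for src in sources:
    merged.parse(src, format="xml")  # RDF aggregation is graph union

merged.serialize(destination="tr.rdf", format="xml")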
This makes it possible to build the TR page from the list, using a style sheet to create a human-readable HTML version of the RDF data. Other views of the page can be generated fairly easily with the appropriate style sheet:
With a little more work and interaction with other RDF data, a list of Technical Reports by W3C Activity has also been produced.
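As a sketch of how such a cross-view can be obtained (assuming the rec54 vocabulary shown earlier; the actual views were generated with XSLT style sheets), grouping Technical Reports by the Activity they cite is a simple query over the aggregated data:

# Sketch: one possible "view" over the TR data - Technical Reports
# grouped by the Activity they cite, using SPARQL for brevity.
from rdflib import Graph

g = Graph()
g.parse("tr.rdf", format="xml")  # the hypothetical aggregated list from above

query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rec: <http://www.w3.org/2001/02pd/rec54#>
SELECT ?activity ?title WHERE {
    ?tr rec:cites ?activity ;
        dc:title ?title .
}
ORDER BY ?activity ?title
"""
for activity, title in g.query(query):
    print(activity, "|", title)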
See also the ideas of what else could be automated in the TR publication process.