May 05, 1999
A Revolution on the Web
Quietly, a revolution is coming to the World Wide Web. It will change the way web sites deliver information. It will change the way you browse. It will undermine portals and search engines.
You may have heard of it already. It is called XML, or eXtensible Markup Language. Existing only as potential for the last year or two, XML is now coming into its own.
Sure, practical web browsers for XML may not exist yet. But web sites such as NetCenter are using XML-flavoured data structures to personalize your online experience. And it's just the beginning.
Even smaller sites like NewsTrolls are climbing aboard. NewsTrolls now offers an RDF (Resource Description Framework) version of itself (see http://www.newstrolls.com/newstrolls.rdf), allowing it to be listed as a 'channel' on NetCenter, Slashdot, and other sites.
XML allows content creators to define their own classifications and categories. Browsers and web sites supporting XML - and its sisters, RDF, XSL and more - allow users to customize their display, filter unwanted information, define searches by interest and context, and more.
This column will tell you how it works.
Resource Descriptions
Anything can be a resource: people, towns, documents, maps, or buildings. Think of each of these things as having its own web page. Each page refers to the resource in question and contains a description of that resource.
To a large degree, we already do this in our paper-based records. People get their own listings in phone books. Buildings are recorded at the Lands Office. Documents are indexed in card catalogues. And towns are listed in the provincial register and on road maps.
The purpose of these records is to provide a description of the resource in question. Descriptions of objects consist of a list of their properties. A property is some aspect, characteristic, attribute, or relation of a resource.
Think of a town, for example. The properties of a town include its name, its population (these are usually indexed on a sign on the main road), its mayor, its fire chief, and its location (in longitude and latitude).
A description of the properties of a town would look like a list, as follows:
Town of Slave Lake
- Name = Slave Lake
- Population = 4235
- Mayor = Fred Penner
- Fire Chief = Jules Verne
- Latitude = 140
- Longitude = 58
All descriptions of properties have two parts: the relation and the object.
The relation is the name of the property. In the example above, 'Name', 'Population' and 'Mayor' are relations.
The object is the value of the property. In the example above, 'Slave Lake', '4235', and 'Fred Penner' are the objects of their respective properties.
Thus, in a resource description, we have three components:
- The resource
- The relation
- The object
These three components assembled together form a statement, as follows:
Resource | Relation | Object |
Slave Lake | is named | 'Slave Lake' |
Slave Lake | has a population of | 4235 |
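As a rough illustration, the statements in the table can be held in a program as three-part records. This is a minimal Python sketch, not part of any RDF specification; the names Statement and describe are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One statement: a resource, a relation, and an object."""
    resource: str
    relation: str
    obj: str


# The two statements from the table, using the values given in the column.
statements = [
    Statement("Slave Lake", "Name", "Slave Lake"),
    Statement("Slave Lake", "Population", "4235"),
]


def describe(resource: str, stmts: list[Statement]) -> dict[str, str]:
    """Collect every relation/object pair recorded for one resource."""
    return {s.relation: s.obj for s in stmts if s.resource == resource}


description = describe("Slave Lake", statements)
print(description)
```

Gathering all the statements about one resource gives back the property list shown earlier for the town.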
Descriptions of this type form the foundation of the Resource Description Framework (RDF), a standard being adopted on the World Wide Web for internet data. For more information on RDF, see Resource Description Framework (RDF) Model and Syntax Specification at http://www.w3.org/TR/1999/PR-rdf-syntax-19990105/.
The RDF for Slave Lake could be diagrammed like this:
Objects
One of the features of RDF that makes it a powerful information retrieval tool lies in its definition of objects. Objects may be one of two types of things:
- String literals
- Resources
A string literal is a sequence of letters, numbers or other symbols. A string literal is best thought of as a sign, a name or identification. String literals are arbitrary, in the sense that any unique string of characters would do just as well. However, most string literals convey a meaning in a natural language (such as English) which makes them useful.
In the example above, the name 'Slave Lake' is a string literal. Personal names, like 'Stephen Downes', are string literals. Unique identification numbers, like your social insurance number or your driver's license number, are string literals.
A resource is, as described above, anything which can be described: people, towns, documents, maps, or buildings, to name a few. In the example above, two of the objects are resources: the mayor of Slave Lake, Fred Penner, and the fire chief, Jules Verne.
This is significant because it means that resources can be related to each other. Indeed, the diagram describing Slave Lake should be redrawn to reflect this:
Because they are resources, Fred Penner and Jules Verne can also be described. For example, they have names, telephone numbers, and email addresses. Each of Fred Penner and Jules Verne would have his own RDF including this data. Additionally, this data may be included in our diagram:
This sort of diagram can be multiplied for any number of resources. It tracks relations between resources. A similar diagram, for example, could show that Jules Verne is the Fire Chief in the town that has Fred Penner for its mayor. Or it could show that the email address for the mayor of Slave Lake is [email protected].
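The idea that an object may be either a string literal or another resource can be sketched in Python as follows. This is only an illustration under stated assumptions: the Ref class, the graph dictionary, the follow function, the resource keys, and the example.org email addresses are all hypothetical and do not come from the column or from RDF itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An object that is itself a resource, identified by a key."""
    key: str


# Each resource maps relation names to objects. An object is either a plain
# string (a string literal) or a Ref pointing at another resource.
graph = {
    "slave_lake": {
        "Name": "Slave Lake",
        "Mayor": Ref("fred_penner"),
        "Fire Chief": Ref("jules_verne"),
    },
    # Hypothetical contact details, invented for this sketch.
    "fred_penner": {"Name": "Fred Penner", "Email": "mayor@example.org"},
    "jules_verne": {"Name": "Jules Verne", "Email": "firechief@example.org"},
}


def follow(start: str, *relations: str):
    """Walk a chain of relations, hopping to a new resource at each Ref."""
    current = Ref(start)
    for relation in relations:
        if not isinstance(current, Ref):
            raise ValueError(f"cannot follow {relation!r} from a string literal")
        current = graph[current.key][relation]
    return current


mayor_email = follow("slave_lake", "Mayor", "Email")
fire_chief_name = follow("slave_lake", "Fire Chief", "Name")
```

Following 'Mayor' and then 'Email' answers the question "what is the email address for the mayor of Slave Lake?" without the town's own record storing that address.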
RDF / XML
Resource descriptions need to be machine readable. That is, it should be possible for a computer to read a resource description and understand what to do with it. Machine-readable descriptions on the World Wide Web are coded in a language called XML (eXtensible Markup Language; for the official World Wide Web Consortium version see http://www.w3.org/XML/, and please note that this account is a survey only). XML is a very free-form language that defines how properties and objects are represented.
RDF (Resource Description Framework) is an application of XML. It is more structured than XML in general and has a very precise purpose: to describe resources on the World Wide Web. In RDF, two major formatting rules are used to describe properties:
If the object is a string literal, the string literal is placed between tags bearing the name of the relation. Thus, for Slave Lake's name, we have:
If the object is another resource, the location of the resource description is stored within the tag bearing the name of the relation. Thus, for Slave Lake's mayor we have:
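A minimal sketch of the two rules, written in Python so that the markup can be checked with the standard xml.etree.ElementTree parser. The prefix t, the schema URI http://example.org/town-schema#, and the location given for the mayor's description are assumptions made for illustration, and rdf:resource is one plausible spelling of the attribute; the column's own examples may have looked different.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TOWN_NS = "http://example.org/town-schema#"  # hypothetical schema URI

snippet = f"""
<rdf:Description xmlns:rdf="{RDF_NS}" xmlns:t="{TOWN_NS}">
  <t:Name>Slave Lake</t:Name>
  <t:Mayor rdf:resource="http://example.org/people/fred_penner.rdf"/>
</rdf:Description>
""".strip()

root = ET.fromstring(snippet)
name = root.find(f"{{{TOWN_NS}}}Name")
mayor = root.find(f"{{{TOWN_NS}}}Mayor")

# Rule 1: a string literal sits between the opening and closing tags.
literal_value = name.text

# Rule 2: another resource is named by a location stored inside the tag.
resource_location = mayor.get(f"{{{RDF_NS}}}resource")
```

The Name element carries its value as text, while the Mayor element is empty and only points to where the mayor's own description can be found.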
And thus, the RDF description for Slave Lake is:
As the diagram shows, the description of Slave Lake includes four string literals and two resources. A fuller description of Slave Lake would actually incorporate the information contained in the RDF files for those resources.
The RDF documentation is currently a bit fuzzy on how this would proceed. But the logical approach is to merge the information in the additional RDF files with the original RDF file, producing a single, longer file (the merged data is indented for clarity):
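One way such a merge might be carried out is sketched below in Python. It is only a guess at the approach described here, not a procedure taken from the RDF documentation: the remote_files dictionary stands in for files that would be fetched from the web, and the schema URI, file locations, and the merge function are all hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TOWN_NS = "http://example.org/town-schema#"  # hypothetical schema URI
NS = {"rdf": RDF_NS, "t": TOWN_NS}
RESOURCE = f"{{{RDF_NS}}}resource"

town_rdf = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:t="{TOWN_NS}">
  <rdf:Description rdf:about="http://example.org/towns/slave_lake">
    <t:Name>Slave Lake</t:Name>
    <t:Mayor rdf:resource="http://example.org/people/fred_penner.rdf"/>
  </rdf:Description>
</rdf:RDF>
""".strip()

# Stand-in for the additional RDF files that would be retrieved by location.
remote_files = {
    "http://example.org/people/fred_penner.rdf": f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:t="{TOWN_NS}">
  <rdf:Description rdf:about="http://example.org/people/fred_penner">
    <t:Name>Fred Penner</t:Name>
  </rdf:Description>
</rdf:RDF>
""".strip(),
}


def merge(root: ET.Element) -> ET.Element:
    """Nest each referenced description inside the property pointing to it."""
    # Copy the element list first so the tree is not changed mid-iteration.
    for prop in list(root.iter()):
        location = prop.get(RESOURCE)
        if location in remote_files:
            other = ET.fromstring(remote_files[location])
            referenced = other.find("rdf:Description", NS)
            del prop.attrib[RESOURCE]
            prop.append(referenced)
    return root


merged = merge(ET.fromstring(town_rdf))
mayor_name = merged.find(".//t:Mayor/rdf:Description/t:Name", NS).text
```

After the merge, the mayor's name is available inside the town's own file, which is the "single, longer file" idea in miniature.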
If the additional RDF files for the mayor and the fire chief contain resources as well, the RDF file could be expanded further still.
RDF as Metadata
It is very unlikely that the majority of data stored on the World Wide Web will be in RDF format. Much of the data will be in incompatible file formats. For example, some documents will be stored in MS-Word or Excel format. Other documents will be images in .gif or .jpeg format. Even day-to-day data that is added to the web databases will most probably be in some other format.
In such cases, an RDF data record is data about data. Data about data is called metadata. Metadata may be recorded in two major ways:
- Embedded as part of the document file itself
- Separate and referring to the original document
In order to make this structure explicit, let's return to the diagrams used above. Any resource in Muni Mall will have data stored in, say, an MS-SQL format. We can think of that data as being a single file (say, a .cfm file). Thus, metadata in the form of an RDF file actually describes that .cfm file. We might diagram it as follows:
In the actual RDF file, this relationship is expressed with an ABOUT tag:
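A minimal sketch of how a description might name the file it is about, checked in Python with xml.etree.ElementTree. The attribute is written here as the namespace-qualified rdf:about, and the .cfm location and the t schema URI are hypothetical; the exact form used in the column's original example is not known.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TOWN_NS = "http://example.org/town-schema#"  # hypothetical schema URI

metadata = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:t="{TOWN_NS}">
  <rdf:Description rdf:about="http://example.org/munimall/slave_lake.cfm">
    <t:Name>Slave Lake</t:Name>
  </rdf:Description>
</rdf:RDF>
""".strip()

description = ET.fromstring(metadata).find(f"{{{RDF_NS}}}Description")

# The 'about' value names the separate file that this metadata describes.
described_file = description.get(f"{{{RDF_NS}}}about")
```

The RDF file stays separate from the .cfm file; only the 'about' value ties the two together.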
Thus, each resource listed on the web will have its own description, and these descriptions may be merged to create more complex data elements.
Schemas
Resources described in an RDF file are described in terms of properties. For the Town of Slave Lake, the properties listed included 'mayor', 'fire chief' and 'population'. But not all resources will have these same properties. A different type of resource - say, a public library - will not have a 'mayor' or a 'population'.
In a traditional database, entities are classified into categories or classes. Members of a particular class inherit a 'template' of properties for that class. For example, if the entity in question is a 'town', then it will inherit a template with blank fields for 'mayor', 'fire chief', and the rest.
The problem with the traditional approach is two-fold: first, there is no unique system of categorization available. What constitutes a town in one jurisdiction may be only a village in another. And second, there is no unique set of properties that could be applied to resources. Some towns define 'population' to be the number of humans permanently residing within the town's corporate limits. Other towns may include non-permanent residents in their definition.
Instead of attempting to force the data into predetermined categories, RDF works around this problem by using schemas. A schema defines not only the properties of the resource but may also define the kinds of resources being described. When asserting that some resource has a property, say, a 'population', RDF refers to a schema which defines what a 'population' is, what sort of things can have populations, and what sort of things can be populations.
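To make the role of a schema concrete, here is a small Python sketch of a property definition that records what sort of things can have a 'population' and what sort of thing a population can be. It is an analogy only: real RDF schemas are documents on the web, and the PropertyDef class, the schema dictionary, and the is_valid function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    """What a schema might say about one property."""
    name: str
    domain: frozenset  # kinds of resource that may have the property
    value_type: type   # kind of value the property may take


# A one-entry schema, assumed for illustration.
schema = {
    "population": PropertyDef(
        name="population",
        domain=frozenset({"town", "village"}),
        value_type=int,
    ),
}


def is_valid(resource_kind: str, prop: str, value) -> bool:
    """Check a statement against the schema's definition of the property."""
    definition = schema.get(prop)
    if definition is None:
        return False
    return (resource_kind in definition.domain
            and isinstance(value, definition.value_type))


town_ok = is_valid("town", "population", 4235)
library_ok = is_valid("library", "population", 4235)
```

A different jurisdiction could publish its own schema with a different definition of 'population', and a description would simply point to whichever schema it uses.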
Schemas are specially formatted DTD (Document Type Definition) documents located on the internet. An RDF document which uses a schema will define that schema according to its location, or URI. For example:
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/metadata/dublin_core#">
Two schemas are identified:
- first, the main RDF schema. The XML namespace (xmlns) for RDF is identified by http://www.w3.org/1999/02/22-rdf-syntax-ns#
- and second, the Dublin Core. The xmlns for the Dublin Core is identified by http://purl.org/metadata/dublin_core#
When either of these two schemas is used later in the document to identify a property, the schema prefix will be included with the name of the property. For example, if the property 'title' is a Dublin Core property, then the property will be identified as 'dc:title'.
Like this:
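A minimal sketch of a prefixed property, checked in Python with xml.etree.ElementTree. The two namespace URIs are the ones declared above; the rdf:about location and the title text are assumptions chosen for illustration, not the column's original example.

```python
import xml.etree.ElementTree as ET

document = """
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/metadata/dublin_core#">
  <rdf:Description rdf:about="http://example.org/columns/xml-revolution">
    <dc:title>A Revolution on the Web</dc:title>
  </rdf:Description>
</rdf:RDF>
""".strip()

root = ET.fromstring(document)
title = root[0][0]  # first property of the first Description

# The parser replaces the 'dc:' prefix with the schema it stands for.
expanded_tag = title.tag
title_text = title.text
```

The parser reports the tag as the Dublin Core URI followed by 'title', which shows that the prefix is only shorthand for the schema's location.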
The property 'Description' is defined by RDF, and so is identified with a tag reading 'rdf:Description'.
Types
Using schemas, we can now define what type of thing a resource is. We use the 'type' attribute in the 'Description' tag. The 'type' attribute points to a schema where the type or category being employed is defined. Suppose, for example, that we wanted to say that the resource in question is a 'person'. Then in the RDF file describing that person, we would use the following code:
To define Ora Lassila as a 'person' we refer, using a 'type' tag, to a schema where the term 'person' is defined. The relevant line points to an external resource at description.org, where the term 'person' is defined as part of a larger schema.
Additional type and category specifications are found in RDFS, the core RDF Schema. Using these specifications, a hierarchy of classes and subclasses may be constructed. The RDFS specification is in proposal stage only at this point; see the documentation from March, 1999, at http://www.w3.org/TR/PR-rdf-schema/. The full set of RDFS specifications is depicted as follows:
In practice, schemas will not be created from scratch. A number of good schema editors already exist, such as XML Authority (from a company called Extensibility - see http://www.extensibility.com/index_net.htm). Schema editors allow schemas to be defined with considerable precision, and also provide 'tree' type views of defined schemas. For example, consider the price list schema created by XML Authority:
The schema represented by this diagram defines the type of tags an XML or RDF file may have, their relationship to each other, and what sort of values those tags may contain.
Displaying RDF Documents
To date, there are no commercial browsers that display RDF documents. At best, we have some programs, written in PERL, Java or JavaScript, that interpret these documents, usually rendering them as a sort of tree diagram.
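A tree-style rendering of this kind can be sketched in a few lines of Python. This is an illustration only and does not reproduce how any particular tool works; the render_tree function and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def render_tree(element: ET.Element, depth: int = 0) -> list[str]:
    """Return one indented line per element, children under their parents."""
    label = element.tag.split("}", 1)[-1]  # drop any '{namespace}' prefix
    text = (element.text or "").strip()
    lines = ["  " * depth + label + (f": {text}" if text else "")]
    for child in element:
        lines.extend(render_tree(child, depth + 1))
    return lines


sample = """
<town>
  <name>Slave Lake</name>
  <population>4235</population>
</town>
""".strip()

lines = render_tree(ET.fromstring(sample))
print("\n".join(lines))
```

Each level of nesting in the markup becomes one level of indentation in the output.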
Xparse (http://www.jeremie.com/Dev/XML/) and xmlTree (http://www.xmltree.com/index.cfm) are examples of this approach. Other sites use a proprietary back-end, such as Netscape's NetCenter (http://my.netscape.com) or Microsoft's Active Channels (http://www.microsoft.com). Because of this, some commentators have warned of a 'Balkanization' of the World Wide Web (see, for example, The Power of Babel by Mark Pesce at http://www.feedmag.com/cgi-in/FeedlineLoop/deliverance.cgi?areanum=13:13, or The Balkanization of the Web by David Siegel at http://www.dsiegel.com/balkanization/).
What will be required - and is being worked on now - is a system of style sheets that interpret XML or RDF documents and render them in HTML code for browsers to read. These approaches fall under the general heading of the Extensible Stylesheet Language (XSL; see http://www.w3.org/TR/WD-xsl/).
The principle of XSL is simple. For a list of properties, as may be found in a schema or DTD, it defines a corresponding set of HTML tags. Then, when an RDF document is interpreted, for each property tag encountered, the appropriate HTML is inserted. Thus, for example, suppose we had the following definition in XSL:
And suppose this is the piece of XML being used:
The interpreter would produce the following line of HTML:
The relation between XSL, XML, and HTML may thus be diagrammed as follows:
See http://www.datachannel.com/news/wp_xsl_client.shtml for more details and diagrams.
This may seem like an unnecessary complication - why not include the style information with the document? By separating the style from the data, however, the same data may be viewed in different ways. Users may define their own custom displays. Or the same data may be used to create different types of documents.
Where to From Here?
A large cottage industry has grown up around the development of schemas (http://www.schema.net/), ontologies (http://www.ontology.org), and XML generally (http://www.xml.com).
A proliferation of specialized applications is being developed in specific industries. Thus we see the Mathematical Markup Language (http://www.w3c.org/Math/), Music Markup Language (http://www.tcf.nl/trends/trends6-en.html), and the Chemical Markup Language (http://www.venus.co.uk/omf/cml/intro.html).
The internet is transforming itself from a large pile of books to a large series of neatly stacked and indexed collections, complete with translations, summaries, and cross-references. While browsing will still be a popular pastime for many, just as browsing through the stacks in a library may be, our use of the internet will in the future be much more directed. We will read news from custom-tailored news feeds, listen to personalized radio stations, select movies and videos from online menus, or conduct research in directed and powerful topic searches.
Automated resource description also means that the internet can talk more easily (and more safely) with our other appliances. We would never let our microwave download and run instructions for cooking turkey today (it would probably retrieve information about the middle-eastern nation, or a movie review of Waterworld). But our microwave of the future, able to define a context of inquiry and able to define authoritative sources, could handle the task on its own on an XML-powered web.
And that's just the beginning….