RdfThesaurus
This page is maintained for historical reasons - these issues have been mostly concluded. Visit SkosDev for latest developments and resources.
Links
SWAD-Europe Thesaurus Activity
Mailing list [email protected] (archives).
RDF Thesaurus Design Issues
In SWAD-Europe we're working on ways of encoding thesauri and similar Knowledge Organisation Systems (KOS) using RDF. Here is a description of some of the unresolved design issues. Please feel free to add comments to this page.
Issue 1 - Specialised vocab vs. extensible modular vocabs?
Although most thesauri are pretty similar, there are important variations, and many thesauri deviate from the standards. Also, thesauri are very similar to other KOS, e.g. classification schemes, taxonomies, topic maps. How do we cope with this?
Option 1 - Define a specialised vocabulary that covers only thesauri that comply with the standards.
Option 2 - Define a core vocab that captures what is common to all thesauri. Then define extension modules to cope with different flavours of thesauri.
Option 3 - Define a core vocab that captures what is common to all KOS (thesauri, taxonomies, classification schemes, topic maps etc.). Define first level extension module for thesauri. Define second level extension for flavours.
Comments on Issue 1
AJM>>
What we did previously (early draft of 8.1) was half way between (1) and (2).
I would like to go for (3), but am prepared to backtrack towards (2), which may happen when we hit interop with this and OWL. (3) would mean we have a way of fitting all these KOS together on the semantic web, which would be a good thing.
Going for (3) means we have to define a core vocab. I've kind of assumed this is what we are doing (tell me if you think it's a bad idea), and issues below relate first to this core vocab. We need a name for this core vocab, so at least we can refer to it. For now, I'm going to call it the core vocab. In code, I'm using the prefix soks. Why soks? Short for SuperKOS! Got any ideas about a better name?
Issue 2 - To concept or not to concept?
A thesaurus is a collection of concepts. So for the core-vocab we need to model abstract concepts in RDF.
Option 1 - We define an rdfs:Class called soks:Concept. We use this to type resources that are intended to refer to abstract concepts.
Option 2 - We define no such class. We use some other way to determine whether a resource is a concept or not, if we need to at all.
Comments on Issue 2
AJM>>
At the recent SWAD meeting at HP, Chaals said (correct me if I'm wrong) in RDF every resource with a URI that has a fragment identifier necessarily is an abstract concept. Therefore we don't need a type for concepts.
I say:
- My reading of the debate & TimBL's writeups is that resources with an http:// URI and a frag ID MAY (but do NOT necessarily) refer to an abstract concept. Resources with an http:// URI and without a frag ID cannot refer to an abstract concept (they must necessarily be documents).
- If we have a soks:Concept class, we can type b-nodes as concepts. So we can use reference by description to make statements about abstract concepts without URIs.
- It makes the format look nicer if it starts with 'Concept' rather than 'rdf:Description' all the time. This may be a serious point for KOS & DL people.
Issue 3 - How to label concepts?
In a thesaurus, every concept has one preferred term (label) and 0 or more alternative terms.
The obvious way to model this in RDF is to have one property for linking a resource to a preferred label (I'll call this soks:prefLabel for now) and one property for linking a resource to any alternative labels (I'll call this soks:altLabel for now).
This raises two design questions:-
Question 1: domain restriction? - Do we (a) restrict the domain of these properties to soks:Concept or do we (b) allow them to be used with any resource?
Question 2: resources or literals? - Do we (a) restrict the range of these properties to rdfs:Literal, or do we (b) restrict the range to some type of resource?
We may be able to re-use and/or extend existing properties, e.g. rdfs:label, but what we choose to re-use depends on the resolution of these questions, so I'm saving a discussion of that for later.
In relation to question 2, if we choose 2(a) we get data that could look like ...
<Concept>
  <prefLabel>Bangers and mash</prefLabel>
  <altLabel>Sausage and mash</altLabel>
</Concept>
... and if we choose 2(b) we get data that could look like ...
<Concept>
  <prefLabel rdf:parseType="Resource">
    <rdf:value>Bangers and mash</rdf:value>
  </prefLabel>
  <altLabel rdf:parseType="Resource">
    <rdf:value>Sausage and mash</rdf:value>
  </altLabel>
</Concept>
Although 2(a) might seem the obvious choice because it is the most economical, when it comes to multilingual data there are further implications. Some multilingual data under 2(a) might look like ...
<Concept>
  <prefLabel xml:lang="en">Bangers and mash</prefLabel>
  <prefLabel xml:lang="fr">Saucisson et pomme de terre anglais</prefLabel>
</Concept>
... and the same data under 2(b) might look like ...
<Concept>
  <prefLabel rdf:parseType="Resource">
    <rdf:value>Bangers and mash</rdf:value>
    <dc:language>
      <dcterms:RFC1766>
        <rdf:value>EN</rdf:value>
      </dcterms:RFC1766>
    </dc:language>
  </prefLabel>
  <prefLabel rdf:parseType="Resource">
    <rdf:value>Saucisson et pomme de terre anglais</rdf:value>
    <dc:language>
      <dcterms:RFC1766>
        <rdf:value>FR</rdf:value>
      </dcterms:RFC1766>
    </dc:language>
  </prefLabel>
</Concept>
The fact that I used the dc:language property and construct in the example is not important. The point is that some property has been used to specify the language of the label, and therefore this information is part of the RDF graph. Under 2(a) the language of the label is embedded in the literal.
Comments on Issue 3
AJM>>
Dave B. has suggested that it's hard in RDF systems today to query things inside a literal - i.e. ask for everything in one language, for example. This would push towards 2(b).
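To make Dave's point a bit more concrete, here is a rough sketch in Python that models the two options as plain tuples. Everything in it is an assumption for illustration: the `match` helper, the b-node names, and the simplification of pointing dc:language straight at a code rather than at a dcterms:RFC1766 node. It is not how any particular RDF store behaves.

```python
# 2(a): the language tag travels inside the literal, modelled here as (text, lang).
triples_2a = [
    ("_:c1", "soks:prefLabel", ("Bangers and mash", "en")),
    ("_:c1", "soks:prefLabel", ("Saucisson et pomme de terre anglais", "fr")),
]

# 2(b): each label is a node, and its language is an ordinary triple.
triples_2b = [
    ("_:c1", "soks:prefLabel", "_:l1"),
    ("_:l1", "rdf:value", "Bangers and mash"),
    ("_:l1", "dc:language", "en"),
    ("_:c1", "soks:prefLabel", "_:l2"),
    ("_:l2", "rdf:value", "Saucisson et pomme de terre anglais"),
    ("_:l2", "dc:language", "fr"),
]


def match(triples, s=None, p=None, o=None):
    """Hypothetical exact-match triple pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def labels_2a(triples, lang):
    # No pattern can select on language alone: every label literal has to be
    # fetched and then opened up by the application.
    return [o[0] for _, _, o in match(triples, p="soks:prefLabel") if o[1] == lang]


def labels_2b(triples, lang):
    # Language is part of the graph, so two plain patterns are enough.
    nodes = [s for s, _, _ in match(triples, p="dc:language", o=lang)]
    return [o for n in nodes for _, _, o in match(triples, s=n, p="rdf:value")]
```

Both functions return the same labels for a given language. The difference is only in where the language test happens: inside the pattern matcher for 2(b), outside it for 2(a).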
Issue 4 - Concepts as language-embedded, language independent, or both?
There are multilingual thesauri. When modelling multilingual data in RDF, we can choose one of the following options:
Option 1: Concepts in a language - allow language properties only on nodes typed as Concepts.
Option 2: Labels in a language - allow language properties (or tags) only on nodes (or literals) which represent labels.
Option 3: And/or - allow concepts and/or labels to have language properties.
The choice of solution has bold implications. If we choose option 1 we are assuming that all abstract concepts are embedded in a language; there can be no language independent concepts. If we choose option 2 we model all concepts as 'language-independent'.
Comments on Issue 4
AJM>> NB. The resolution of this issue is closely tied to issue 3.
Issue 5 - Using Concepts for subject-based indexing and classification
We want to be able to say, 'my article is about Concept x from Thesaurus y'. These statements could then be used for subject-based indexing.
The question is, what property do we recommend for this sort of thing? Do we invent a new one, or do we try to re-use something?
Comments on Issue 5
AJM>>
We could go with qualified Dublin Core, and recommend using dc:subject.
Then, if the concepts in the thesaurus have been given URIs, we allow statements like ...
<rdf:Description rdf:about="http://www.bigal.com/penguins.html">
  <dc:subject rdf:resource="http://www.bigal.com/thesaurus#penguins"/>
</rdf:Description>
... or if the concepts have been defined by their properties, we use reference by description ...
<rdf:Description rdf:about="http://www.bigal.com/penguins.html">
  <dc:subject>
    <soks:Concept>
      <soks:externalID>AN001</soks:externalID>
      <soks:prefLabel>Penguins</soks:prefLabel>
      <rdfs:isDefinedBy rdf:resource="http://www.bigal.com/thesaurus"/>
    </soks:Concept>
  </dc:subject>
</rdf:Description>
... now if we make soks:prefLabel --sub-property-of--> rdfs:label, and soks:externalID --sub-property-of--> rdf:value, what we have is entirely consistent with the examples in the qualified dc spec.
E.g. from dcq spec ...
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <Description>
    <dc:subject>
      <Description>
        <value>19D10</value>
        <rdfs:label>Algebraic K-Theory of spaces</rdfs:label>
        <rdfs:isDefinedBy rdf:resource="URI2"/>
      </Description>
    </dc:subject>
  </Description>
</RDF>
... taking this a step further, we could make soks:descriptor a sub-property-of BOTH rdfs:label AND rdf:value, and so allow this property to act as both the value and the label as in the above example from dcq, allowing e.g.
<rdf:Description rdf:about="http://www.bigal.com/penguins.html">
  <dc:subject>
    <soks:Concept>
      <soks:descriptor>Penguins (animals)</soks:descriptor>
      <rdfs:isDefinedBy rdf:resource="http://www.bigal.com/thesaurus"/>
    </soks:Concept>
  </dc:subject>
</rdf:Description>
This would mean we have a standard way of dealing with concepts in thesauri that have not been given an external (non-lexical) identifier.
Finally, dcq has things like ...
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description>
    <dc:subject>
      <dcterms:MESH>
        <rdf:value>D08.586.682.075.400</rdf:value>
        <rdfs:label>Formate Dehydrogenase</rdfs:label>
      </dcterms:MESH>
    </dc:subject>
  </rdf:Description>
</rdf:RDF>
To be consistent with this type of construct, we could allow a thesaurus owner to define an rdfs:Class which is a sub-class-of soks:Concept, and use this new class to type all nodes that are concepts within a specific conceptual scheme (i.e. the owner's thesaurus). So then, for example, I could say ...
<rdf:Description rdf:about="http://www.bigal.com/penguins.html">
  <dc:subject>
    <bigal:Concept>
      <soks:descriptor>Penguins (animals)</soks:descriptor>
    </bigal:Concept>
  </dc:subject>
</rdf:Description>
NB. this provides an alternative to using the rdfs:isDefinedBy property.
So in summary, I have suggested an approach to subject-based indexing using thesaurus concepts that is entirely consistent with all variations in the qualified Dublin Core approach. The question is, do we like dcq?
Issue 6 - Defining semantic relationships
Description: A thesaurus consists of concepts, labels for concepts, and semantic relationships between concepts. A semantic relationship is a relationship of meaning. Most thesauri use a similar set of semantic relationships, which they label 'broader' 'narrower' and 'related'.
Problem 1: 'broader/narrower' means different things in different thesauri. In some thesauri it means strictly class-subsumption. In other thesauri it can mean either is-a, instance-of, or part-of. Also 'related' is not consistently used. For example, some thesauri model part-of relations with 'related', others use 'broader/narrower'.
=> We must invent some mechanism for providing clear definitions of semantic relationships, and for removing any scope for ambiguity.
Problem 2: some thesauri have semantic relations other than 'broader/narrower' and 'related'. Some overcome the 'broader/narrower' fuzziness by using 'BTI', 'BTG' and 'BTP', which stand for 'broader-term-instantive' 'broader-term-generic' and 'broader-term-partitive' respectively. In others there are custom relationships like 'related-broader'.
=> We must provide some mechanism by which users can extend the given relationship set and define their own semantic relations.
Comments on Issue 6
AJM>
A solution to problem 1 is to define all semantic relationships by reference to international standards documents, and we encourage others to do the same. So for 'broader/narrower' and 'related', we define three properties within a namespace that indicates these properties are defined by reference to the ISO 2788:1985 standard for thesauri. I.e.
<soks:Concept>
  <soks:descriptor>Jungle (environment)</soks:descriptor>
  <rdfs:isDefinedBy rdf:resource="http://www.bigal.com/thesaurus"/>
  <iso2788:broader>
    <soks:Concept>
      <soks:descriptor>Tropical Environments</soks:descriptor>
      <rdfs:isDefinedBy rdf:resource="http://www.bigal.com/thesaurus"/>
    </soks:Concept>
  </iso2788:broader>
</soks:Concept>
Although ISO2788 isn't the best standard in the world, the principle remains sound, and we would replace these properties as soon as a new and better standard is written.
A solution to problem 2 is to define a super-property for all semantic relationships, extend it for the standard relationships, and recommend that others extend it if they want to define further extensions or customisations. I.e.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix soks: <http://www.w3c.rl.ac.uk/2003/10/31-kos-core#> .
@prefix iso2788: <http://www.w3c.rl.ac.uk/2003/10/31-kos-iso2788#> .

soks:semanticRelation a rdf:Property;
    rdfs:domain soks:Concept;
    rdfs:range soks:Concept.

iso2788:broader a rdf:Property;
    rdfs:subPropertyOf soks:semanticRelation.

iso2788:narrower a rdf:Property;
    rdfs:subPropertyOf soks:semanticRelation.

iso2788:related a rdf:Property;
    rdfs:subPropertyOf soks:semanticRelation.
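As a rough illustration of why the super-property helps, here is a Python sketch of the rdfs:subPropertyOf entailment over plain tuples. The `bigal:broaderPartitive` property and the `ex:` concept names are made up for the example, and the single-parent dictionary is a simplification (RDFS allows a property to have several super-properties).

```python
# Sub-property declarations, mirroring the N3 above plus one custom extension.
SUB_PROPERTY_OF = {
    "iso2788:broader": "soks:semanticRelation",
    "iso2788:narrower": "soks:semanticRelation",
    "iso2788:related": "soks:semanticRelation",
    "bigal:broaderPartitive": "iso2788:broader",  # hypothetical user extension
}


def super_properties(prop):
    """Return prop followed by every property it is transitively a sub-property of."""
    chain = []
    while prop is not None and prop not in chain:
        chain.append(prop)
        prop = SUB_PROPERTY_OF.get(prop)
    return chain


def entailed(triples):
    """Add the statements implied by the sub-property declarations."""
    result = set()
    for s, p, o in triples:
        for q in super_properties(p):
            result.add((s, q, o))
    return result


asserted = {("ex:Jungle", "bigal:broaderPartitive", "ex:TropicalEnvironments")}
inferred = entailed(asserted)
```

An application that only understands soks:semanticRelation (or only iso2788:broader) can still make some use of data that was written with the custom property.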
For other possible solutions, see the document Review of RDF Thesaurus Work.
Issue 7 - Thesaurus Linking vs. Thesaurus Mapping
There are two distinct situations.
Thesaurus Linking: An indexer wants a thesaurus for subject-based indexing of their collection. They find no one thesaurus offers adequate coverage, but a combination of two or more thesauri does. So they want to create and use a hybrid thesaurus.
Thesaurus Mapping: A collection has been indexed with concepts from thesaurus A. A user wants to search this collection, but using concepts from thesaurus B. Thus some mechanism is required for either transforming the query or transforming the index.
The purpose of this issue is to distinguish these two scenarios. If we use the same mechanism for both linking and mapping, there is a lot of scope for confusion.
Comments on Issue 7
SC> Cause confusion? How so? (maybe just give an example, and then delete this)
Issue 8 - Mechanisms for Thesaurus Linking
The Problem: A user wants to create a hybrid thesaurus by plugging bits of existing thesauri together. By what mechanism should they do this?
Solution 1: Express thesaurus links using the normal semantic relations of a thesaurus, 'broader/narrower' and 'related'. So if a user wants to prune a branch from a thesaurus, they remove the 'broader/narrower' statements. If they want to add a branch, they add 'broader/narrower' statements. This solution treats thesaurus linking as a special case of altering the organisational structure of a set of thesaurus concepts.
Solution 2: Create custom linking properties that are not 'semantic relations'. In this way the integrity and internal structure of each thesaurus is maintained.
Comments on Issue 8
AJM>
I favour solution 1 for simplicity.
SC> Doesn't seem as if there'd be any harm adding an option for (2) - even just one semantically empty property analogous to rdfs:seeAlso. Don't have to use it. The exact/inexact equivalence terms would be nice but do add extra complexity.
Issue 9 - Inter-thesaurus mapping
For an introduction to the problem space, a good reference is Semantic Problems of Thesaurus Mapping.
The problem: How to express a mapping from concepts in one thesaurus to concepts in another?
The current (non-semweb) solution: A mapping relationship between concepts from different thesauri is usually called an "equivalence relationship". The following equivalence relationships are used:
Concepts A, B.
A exact-equivalent B.
A inexact-equivalent B.
A narrower-equivalent B.
A broader-equivalent B.
Also, combinations of concepts are allowed, in the following ways:
Concepts A, B, C, D, E.
A exact-equivalent B AND C.
A exact-equivalent B OR C.
A exact-equivalent B AND NOT C.
This set of mapping relationships has so far been deemed to be sufficient by the KOS community, as far as I am aware.
I'm going to focus the rest of this discussion on the following questions:
- What do these mapping relationships actually mean? (Implied semantics?)
- How should they be used for real world mappings? (Best practices?)
- How do we encode mapping relationships in RDF? (RDF encoding?)
Concept Mapping - Implied Semantics?
This is going to get a bit philosophical, and I'm on shaky ground here, so I welcome any corrections.
There are two possible ways to interpret a statement that includes one of the above mapping relationships, such as Concept A has exact-equivalent Concept B.
- Pure Semantic Interpretation.
- Set Theoretic Interpretation.
Interpretation 1 - Pure Semantic (Intensional):
Concepts A, B.
A exact-equivalent B => The meaning (intension) of concept A is identical to the meaning of concept B.
A inexact-equivalent B => The meaning (intension) of concept A overlaps in some way with the meaning of concept B.
A broader-equivalent B => The meaning of concept A is narrower (more specific) than the meaning of concept B (directly analogous to the 'broader/narrower' semantic-relation).
A narrower-equivalent B => The meaning of concept A is broader (more general) than the meaning of concept B (directly analogous to the 'broader/narrower' semantic-relation).
Under the Pure Semantic interpretation, the meaning of the combinations 'A AND B', 'A OR B' and 'A AND NOT B' is not clear.
Interpretation 2 - Set Theoretic (Extensional):
There exists a set of Resources R. For each concept X, the functor x(R) corresponds to the subset of resources that are properly indexed (classified) against concept X.
Concepts A, B.
A exact-equivalent B => a(R) = b(R)
A inexact-equivalent B => There exists some resource r such that r is a member of both a(R) and b(R).
A narrower-equivalent B => a(R) is a super-set of b(R)
A broader-equivalent B => a(R) is a sub-set of b(R)
Under the Set Theoretic Interpretation the concept combination expressions have a clear interpretation ...
A AND B => The intersection of sets a(R) and b(R)
A OR B => The union of the sets a(R) and b(R)
A AND NOT B => The intersection of the sets a(R) and (the complement of b(R))
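The set theoretic reading maps directly onto ordinary set operations. The following Python sketch spells this out. The document identifiers and the contents of the sets are invented for illustration, and the function names are just my labels for the four relationships.

```python
def exact_equivalent(a, b):
    # a(R) = b(R)
    return a == b


def inexact_equivalent(a, b):
    # Some resource is a member of both a(R) and b(R).
    return bool(a & b)


def narrower_equivalent(a, b):
    # a(R) is a super-set of b(R).
    return a >= b


def broader_equivalent(a, b):
    # a(R) is a sub-set of b(R).
    return a <= b


# A made-up collection R and the resources indexed against concepts A, B and C.
R = {"doc1", "doc2", "doc3", "doc4", "doc5"}
a = {"doc1", "doc3"}
b = {"doc1", "doc2", "doc3"}
c = {"doc2", "doc4"}

# Concept combinations.
b_and_c = b & c              # B AND C
b_or_c = b | c               # B OR C
b_and_not_c = b & (R - c)    # B AND NOT C
```

With these made-up sets, A is exact-equivalent to 'B AND NOT C', and broader-equivalent to B on its own.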
A big problem is this. The set theoretic interpretation is good, because it gives perfect information about how to transform a query in order to guarantee recall of all relevant documents. However, every statement of the form 'A [equivalence-relation] B' is understood to be a true proposition about the world. But this statement must be made a priori, and inferred from the meaning of the concepts. That is to say, not all resources have been classified against concepts from all conceptual schemes, so I cannot examine the sets and discover the equivalence relationships a posteriori.
You can see that I'm not very clear on this. However, the point I'm trying to get at is this. We have to be clear about what these equivalence relationships mean in order to give clear guidance and instructions on how they should be used.
Concept Mapping - Best Practices?
The following best practices can be derived from the set-theoretic interpretation of the equivalence relationships:
When mapping from concepts in thesaurus Ta to concepts in thesaurus Tb ...
- For every concept x in thesaurus Ta, attempt to find an exact-equivalent concept or combination of concepts from Tb.
- If an exact equivalent cannot be found, find BOTH a narrower- and a broader- equivalent concept or combination of concepts from Tb.
- If a close approximation may be made to concept x in Ta (i.e. a concept that is not identical but is very close in meaning), assert an inexact-equivalence relation. This will constitute the "best match" mapping.
NB. For an explanation of point 2, see the paper Semantic Problems of Thesaurus Mapping.
Basically, point 2 ensures that complete recall may be guaranteed under all searches.
If these guidelines are implemented, then a complete mapping is said to have been achieved. Note that a complete mapping is uni-directional. An incomplete mapping in the opposite direction may be inferred. However an incomplete mapping cannot guarantee recall under all searches.
Concept Mapping - RDF Encoding?
What's nice about the set theoretic interpretation is that all equivalence relations can be represented using constructs already present in OWL. See the following examples ...
<soks:Concept rdf:ID="A"/>
<soks:Concept rdf:ID="B"/>
<soks:Concept rdf:ID="C"/>

<!-- A exact-equivalent B -->
<owl:Restriction>
  <owl:onProperty rdf:resource="&dc;subject"/>
  <owl:hasValue rdf:resource="#A"/>
  <owl:equivalentClass>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&dc;subject"/>
      <owl:hasValue rdf:resource="#B"/>
    </owl:Restriction>
  </owl:equivalentClass>
</owl:Restriction>

<!-- A broader-equivalent B AND C -->
<owl:Restriction>
  <owl:onProperty rdf:resource="&dc;subject"/>
  <owl:hasValue rdf:resource="#A"/>
  <rdfs:subClassOf rdf:parseType="Resource">
    <owl:intersectionOf rdf:parseType="Collection">
      <owl:Restriction>
        <owl:onProperty rdf:resource="&dc;subject"/>
        <owl:hasValue rdf:resource="#B"/>
      </owl:Restriction>
      <owl:Restriction>
        <owl:onProperty rdf:resource="&dc;subject"/>
        <owl:hasValue rdf:resource="#C"/>
      </owl:Restriction>
    </owl:intersectionOf>
  </rdfs:subClassOf>
</owl:Restriction>
... i.e. all the set operators are already present in OWL.
However this is not a particularly compact or simple way of expressing the equivalence statements. So perhaps some sort of shorthand vocabulary is in order, with some rules that generate the OWL set expressions that are the entailments of the equivalence expressions?
So, users assert statements such as ...
<soks:Concept rdf:about="#A">
  <soks-eq:broadMatch rdf:resource="#B"/>
  <soks-eq:narrowMatch rdf:resource="#C"/>
</soks:Concept>
Which is in fact nothing more than a shorthand representation of the statements ...
<owl:Restriction>
  <owl:onProperty rdf:resource="&dc;subject"/>
  <owl:hasValue rdf:resource="#A"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&dc;subject"/>
      <owl:hasValue rdf:resource="#B"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Restriction>

<owl:Restriction>
  <owl:onProperty rdf:resource="&dc;subject"/>
  <owl:hasValue rdf:resource="#C"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&dc;subject"/>
      <owl:hasValue rdf:resource="#A"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Restriction>
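The "rules" that turn the shorthand into the OWL statements could be as simple as a lookup per property. Here is a Python sketch of that idea. The tuple representation of a restriction, the `expand` function and its error handling are all assumptions for illustration, not a proposal for how an actual rule engine should do it.

```python
def restriction(concept):
    """Stand-in for: owl:Restriction on dc:subject with owl:hasValue <concept>."""
    return ("dc:subject", "owl:hasValue", concept)


def expand(subject, prop, obj):
    """Rewrite one shorthand mapping statement as a statement between restrictions."""
    a, b = restriction(subject), restriction(obj)
    if prop == "soks-eq:exactMatch":
        return (a, "owl:equivalentClass", b)
    if prop == "soks-eq:broadMatch":
        # A broader-equivalent B: a(R) is a sub-set of b(R).
        return (a, "rdfs:subClassOf", b)
    if prop == "soks-eq:narrowMatch":
        # A narrower-equivalent B: b(R) is a sub-set of a(R).
        return (b, "rdfs:subClassOf", a)
    raise ValueError("no expansion rule for " + prop)


shorthand = [
    ("#A", "soks-eq:broadMatch", "#B"),
    ("#A", "soks-eq:narrowMatch", "#C"),
]
expanded = [expand(s, p, o) for s, p, o in shorthand]
```

The two expanded statements correspond to the two owl:Restriction blocks above. Concept combinations (AND, OR, AND NOT) would need further rules, which this sketch does not attempt.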
Comments on Issue 9
AJM>
Perhaps there should be a property called 'best-match' rather than 'inexact-equivalent' to express the fact that two concepts are pretty close, where 'inexact' only implies there is some overlap.
I haven't come up with a good shorthand for expressions that use concept combinations, like ...
<!-- A exact-equivalent B AND C -->
<!-- Shorthand expression? -->
<soks:Concept rdf:about="#A">
  <soks-eq:exactMatch>
    <soks-eq:AND>
      <rdf:li rdf:resource="#B"/>
      <rdf:li rdf:resource="#C"/>
    </soks-eq:AND>
  </soks-eq:exactMatch>
</soks:Concept>
<!-- ??? -->

<!-- A inexact-equivalent B OR C -->
<!-- Shorthand expression? -->
<soks:Concept rdf:about="#A">
  <soks-eq:bestMatch>
    <soks-eq:OR>
      <rdf:li rdf:resource="#B"/>
      <rdf:li rdf:resource="#C"/>
    </soks-eq:OR>
  </soks-eq:bestMatch>
</soks:Concept>
<!-- ??? -->
Also if we don't restrict the domain or range of the mapping properties, they could be used to express mappings between things other than soks:Concepts, e.g. Topics in DMOZ and topicexchange.
Issue 10 - Pure Lexical Relationships
By stating that relationships like 'broader' and 'narrower' are 'semantic relationships' we have been very clear about the fact that these are not relationships between terms, but relationships between the meanings of the terms, i.e. the concepts.
However, there are some relationships which could be considered to exist purely between the terms. For example, "stimuli" is the plural-form-of "stimulus". "RDF" is an acronym-for "Resource Description Framework".
By what mechanism do we allow these kinds of statements to be expressed as part of an RDF thesaurus?
Suggested solution 1: We use b-nodes to represent terms, with the property rdf:value pointing to the literal value. Statements may then be made connecting two such b-nodes, e.g.
<rdf:Description>
  <rdf:value>stimuli</rdf:value>
  <soks:pluralFormOf>
    <rdf:Description>
      <rdf:value>stimulus</rdf:value>
    </rdf:Description>
  </soks:pluralFormOf>
</rdf:Description>

<rdf:Description>
  <rdf:value>NIH</rdf:value>
  <soks:acronymFor>
    <rdf:Description>
      <rdf:value>National Institutes for Health</rdf:value>
    </rdf:Description>
  </soks:acronymFor>
</rdf:Description>
Suggested solution 2: We don't bother with them. Instead we offer the recommendation that all acronyms be included as possible labels for a concept. Plural forms probably don't need to be included, as modern stemming algorithms can identify the root of the term. So e.g. ...
<soks:Concept>
  <soks:descriptor>National Institutes for Health</soks:descriptor>
  <rdfs:label>NIH</rdfs:label>
</soks:Concept>
NB. It could be argued that there is no such thing as a purely lexical relationship. All so called 'lexical relationships' are dependent on the context in which the term is interpreted, i.e. the meaning of the term. Therefore allowing statements to be made between terms is a poor representation of the information.