BigData Tutorial Part1
Tutorial Agenda
Introduction to Linked Data (45 m 60 m) Andreas
Consuming Norwegian Linked Data (30 m) Titi
Large Scale Linked Data Management (30 m) Andreas
Big Data Intro and Analytics (60 m 90 m) Marko
Questions & Answers Session (30 m) all
Ontology Languages
RDF Vocabulary Description Language (RDFS)
Web Ontology Language (OWL)
Application Architectures
Summary
MOTIVATION
Motivation
With increased use of computers more and more data is
being stored
Organisations rely on data for business decisions
Data drives policy decisions in government
Individuals rely on data from the Web for information and
communication
[Figures: Linking Open Data cloud diagrams for 2007-10, 2007-11, 2008-02, 2008-03, 2008-09, 2009-03, 2009-07, 2010-09 and 2011-09]
Marko Grobelnik, Andreas Harth, Dumitru Roman, Big Linked Data
Scenario Overview
Semantic Technologies facilitate access to data
1. Query
2. Answer
DBpedia
Linked Data version of Wikipedia
Scripts that extract data (text, links, infoboxes) from
Wikipedia
Published as Linked Data
Interlinking hub in the Linked Data web
Berlin
http://dbpedia.org/resource/Berlin
Hegel
http://dbpedia.org/resource/Georg_Wilhelm_Friedrich_Hegel
Marlene Dietrich
http://dbpedia.org/resource/Marlene_Dietrich
BBC Music
Data about BBC (radio) programmes, artists, songs
Combination of BBC-internal data (playlists), MusicBrainz
(artists, albums), Wikipedia (artists)
Underpinning the BBC Music website
Data published according to Linked Data principles
Marlene Dietrich
http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf#artist
Marlene Dietrich
http://viaf.org/viaf/97773925/
Semantic Technologies
Semantic Web technologies,
standardised by the W3C, are
mature:
RDF recommendation in 1999,
update in 2004
RDFa (RDF in HTML) note in 2008
RDFS recommendation in 2004
SPARQL recommendation in 2008
OWL recommendation in 2004,
update in 2009
http://www.w3.org/DesignIssues/LinkedData
Name?
Creator?
Birth date?
Last change date?
License?
Copyright?
[Figure: to look up http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf#artist, the User Agent sends an HTTP GET to the Web Server, which returns the RDF document http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf]
RESPONSE
REQUEST
HTTP/1.1 200 OK
Date: Tue, 08 May 2012 07:12:19 GMT
Server: Apache/2.2.3 (Red Hat)
Content-Type: application/rdf+xml
Content-Length: 1956
[data not shown]
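The lookup of a URI with a fragment can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name build_lookup and the Accept value are assumptions, and the request is only prepared, not sent.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def build_lookup(uri, accept="application/rdf+xml"):
    """Prepare (but do not send) an HTTP GET for the document describing `uri`."""
    # The fragment (#artist) is stripped by the client and never sent to the server.
    document_url, fragment = urldefrag(uri)
    request = Request(document_url, headers={"Accept": accept}, method="GET")
    return request, fragment


request, fragment = build_lookup(
    "http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf#artist"
)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
print("Fragment identifying the artist:", fragment)
```

The server only ever sees the document URL; the fragment selects the artist resource inside the returned RDF.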
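[Figure: the User Agent sends an HTTP GET for http://dbpedia.org/resource/Marlene_Dietrich; the Web Server answers with a 303 redirect; a second HTTP GET retrieves RDF from http://dbpedia.org/data/Marlene_Dietrich (or the HTML page http://dbpedia.org/page/Marlene_Dietrich)]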
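The 303 lookup can be sketched with a simulated server, so that no network access is needed. The routing rule in simulated_server is an assumption made for illustration, not DBpedia's actual configuration; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def simulated_server(url, accept):
    """Toy stand-in for the web server: returns (status, headers)."""
    path = urlsplit(url).path
    if path.startswith("/resource/"):
        name = path[len("/resource/"):]
        # Assumed rule: RDF-capable clients go to /data/, all others to /page/.
        prefix = "/data/" if "application/rdf+xml" in accept else "/page/"
        return 303, {"Location": "http://dbpedia.org" + prefix + name}
    return 200, {}


def lookup(url, accept, max_redirects=3):
    """Follow 303 redirects until a document (status 200) is reached."""
    for _ in range(max_redirects + 1):
        status, headers = simulated_server(url, accept)
        if status == 303:
            url = headers["Location"]
            continue
        return url
    raise RuntimeError("too many redirects")


rdf_url = lookup("http://dbpedia.org/resource/Marlene_Dietrich", "application/rdf+xml")
html_url = lookup("http://dbpedia.org/resource/Marlene_Dietrich", "text/html")
print(rdf_url)
print(html_url)
```

The resource URI identifies the person; the two document URLs describe her in RDF and in HTML respectively.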
RDF Example
dbpedia:Georg_Wilhelm_Friedrich_Hegel rdf:type foaf:Person .
dbpedia:Georg_Wilhelm_Friedrich_Hegel rdf:type yago:PoliticalPhilosophers .
dbpedia:Georg_Wilhelm_Friedrich_Hegel rdfs:comment "Georg Wilhelm Friedrich Hegel var en tysk filosof."@no .
dbpedia:Georg_Wilhelm_Friedrich_Hegel dbpedia-owl:influenced dbpedia:Francis_Fukuyama .
dbpedia:Georg_Wilhelm_Friedrich_Hegel dbpedia-owl:influenced dbpedia:Friedrich_Nietzsche .
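The prefixed names in the triples above abbreviate full URIs. A minimal sketch of the expansion, assuming the conventional prefix bindings (the slides do not show the prefix declarations, so the PREFIXES table is an assumption):

```python
# Assumed prefix bindings; the slides do not list them explicitly.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "yago": "http://dbpedia.org/class/yago/",
}


def expand(term):
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


# Each triple is a (subject, predicate, object) tuple.
triples = [
    ("dbpedia:Georg_Wilhelm_Friedrich_Hegel", "rdf:type", "foaf:Person"),
    ("dbpedia:Georg_Wilhelm_Friedrich_Hegel", "dbpedia-owl:influenced",
     "dbpedia:Friedrich_Nietzsche"),
]

for s, p, o in triples:
    print("<%s> <%s> <%s> ." % (expand(s), expand(p), expand(o)))
```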
http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5#artist
http://dbpedia.org/resource/Marlene_Dietrich
http://viaf.org/viaf/97773925/
http://dbpedia.org/resource/Marlene_Dietrich .
http://d-nb.info/gnd/118525565
http://libris.kb.se/resource/auth/238817
http://www.idref.fr/027561844/id
http://dbpedia.org/resource/Berlin
http://mpii.de/yago/resource/Berlin
http://data.nytimes.com/N50987186835223032381 - Berlin (Germany)
http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Berlin
http://data.nytimes.com/16057429728088573361 - Gaspe Peninsula (Quebec) (?)
SPARQL
SPARQL Protocol and RDF Query Language
Query language for RDF graphs
SQL for RDF
SPARQL specification consists of
Query language
Result formats (representation of results in RDF and XML)
Query protocol (mechanisms to pose queries and retrieve results)
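As a sketch of the query protocol: a query is sent to an endpoint as the value of a "query" parameter in an HTTP GET. The endpoint URL and the query text below are assumptions for illustration (the slides do not show the query behind the results table); the URL is only built, not requested.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # assumed endpoint

# Hypothetical query: resources with a Norwegian label.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?name WHERE {
  ?s rdfs:label ?name .
  FILTER (lang(?name) = "no")
} LIMIT 10
"""


def build_query_url(endpoint, query):
    """Encode the query text as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

A client would then issue an HTTP GET on this URL and ask for a result format via the Accept header.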
Query Results
Table with one row per result
?s                                                 ?name
http://dbpedia.org/resource/Erik_Nevland           "Erik Nevland"@no
http://dbpedia.org/resource/Jan_Simonsen           "Jan Simonsen"@no
http://dbpedia.org/resource/Laila_Goody            "Laila Goody"@no
http://dbpedia.org/resource/Henriette_Henriksen    "Henriette Henriksen"@no
http://dbpedia.org/resource/Guri_Hjeltnes          "Guri Hjeltnes"@no
http://dbpedia.org/resource/Johan_E._Holand        "Johan E. Holand"@no
http://dbpedia.org/resource/Kristian_Valen         "Kristian Valen"@no
Further Functionality
Optional triple patterns (e.g., return name and optionally
birthdate if available)
Unions (e.g., return material scientists and also physicists)
Filter (e.g., only return scientists born before 1970)
Result formats (e.g., return RDF triples instead of results
table)
Solution modifiers (e.g., sort results, only return unique results)
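What OPTIONAL, FILTER and the modifiers do to a result table can be imitated in plain Python. This is an illustration of the semantics only, not a SPARQL engine; the small data set is hypothetical and chosen so that one person has no birth year.

```python
# Hypothetical bindings: every person has a name, not every person has a birth year.
names = {
    "dbpedia:Georg_Wilhelm_Friedrich_Hegel": "Georg Wilhelm Friedrich Hegel",
    "dbpedia:Friedrich_Nietzsche": "Friedrich Nietzsche",
    "dbpedia:Francis_Fukuyama": "Francis Fukuyama",
}
birth_years = {
    "dbpedia:Georg_Wilhelm_Friedrich_Hegel": 1770,
    "dbpedia:Friedrich_Nietzsche": 1844,
}

# OPTIONAL: keep every name row, bind the birth year only where one exists.
rows = [
    {"s": s, "name": name, "born": birth_years.get(s)}
    for s, name in names.items()
]

# FILTER: keep rows whose birth year is known and lies before 1800.
born_before_1800 = [
    r for r in rows if r["born"] is not None and r["born"] < 1800
]

# Modifiers: the set removes duplicates (DISTINCT), sorted() orders (ORDER BY).
ordered_names = sorted({r["name"] for r in rows})

print(rows)
print(born_before_1800)
print(ordered_names)
```

Rows without a birth year survive the optional step but are dropped by the filter, which mirrors how an unbound variable behaves in a comparison.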
Distributed System
Decentralised distributed ownership and control facilitates adoption and
scalability
Cross-referencing
Allows for linking and referencing of existing data, via reuse of URIs
Challenges (I)
Ramp-up cost for data conversion
May be alleviated by semi-automatic mappings and adequate tool
support for manual conversion
Challenges (II)
RDF data is often very much oriented towards individuals
Few possibilities for expressing schema knowledge
Different data sources have different ways of representing the same
facts
Ontology languages (RDFS, OWL) solve that drawback
RDFS and OWL are layered on top of RDF
ONTOLOGY LANGUAGES
Ontology in Philosophy
Term exists only in singular (there are no
ontologies)
Ontology is concerned with the study of the
nature of being, existence or reality as such
Discussed by Aristotle (Socrates), Thomas
Aquinas, Descartes, Kant, Hegel,
Wittgenstein, Heidegger, Quine, ...
Ontology in Informatics
An ontology is a
formal specification       > interpretable by machines
of a shared                > based on consensus
conceptualisation          > describes terminology
of a domain of interest    > covers a specific topic
Schema Knowledge
RDF provides universal mechanism for the representation
of facts using triples
Possible to describe individuals and their relations
Required: describe generic sets of individuals (classes),
e.g., people, chemical compounds, organisations
Required: specification of logical connections between
individuals, classes and properties to describe their
meaning, e.g., researchers write papers, materials are
chemical compounds
In database-speak: schema knowledge
Subclasses - Motivation
Given triple
dbpedia:Georg_Wilhelm_Friedrich_Hegel rdf:type
yago:PoliticalPhilosophers .
and a query for all foaf:Person instances
we do not get any results
Subclasses
Solution:
Make one statement which says that every political philosopher is a person:
yago:PoliticalPhilosophers rdfs:subClassOf foaf:Person .
This means that every instance of class
yago:PoliticalPhilosophers is also an instance of class
foaf:Person
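The effect of such a subclass statement on a query for foaf:Person instances can be sketched as follows. This is a minimal forward-chaining illustration of type propagation along rdfs:subClassOf, with triples held as Python tuples; the function names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("dbpedia:Georg_Wilhelm_Friedrich_Hegel", RDF_TYPE, "yago:PoliticalPhilosophers"),
    ("yago:PoliticalPhilosophers", SUBCLASS_OF, "foaf:Person"),
}


def infer_types(triples):
    """Add (x rdf:type D) whenever (x rdf:type C) and (C rdfs:subClassOf D) hold."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (s, RDF_TYPE, sup)
                if o == sub and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


def instances_of(cls, triples):
    """Return all subjects explicitly typed with `cls`, sorted."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == cls)


before = instances_of("foaf:Person", triples)
after = instances_of("foaf:Person", infer_types(triples))
print(before)
print(after)
```

Without inference the query returns nothing; with the subclass statement applied, Hegel is returned as a foaf:Person.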
Subclasses
rdfs:subClassOf is reflexive, that is, every class is a
subclass of itself
Example:
yago:PoliticalPhilosophers rdfs:subClassOf
yago:PoliticalPhilosophers .
Possible to equate two classes via reciprocal subclass
relations:
Example:
dbpedia:Person rdfs:subClassOf foaf:Person .
foaf:Person rdfs:subClassOf dbpedia:Person .
Class Hierarchies
Typically, ontologies contain not only single subclass relations, but class
hierarchies
Example:
yago:PoliticalPhilosophers rdfs:subClassOf
yago:Philosophers .
yago:Philosophers rdfs:subClassOf dbpedia:Person .
dbpedia:Person rdfs:subClassOf dbpedia:Mammal .
Transitivity of rdfs:subClassOf is part of the RDFS semantics, which
means e.g., the following holds:
Example:
yago:Philosophers rdfs:subClassOf dbpedia:Mammal .
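Transitivity over a class hierarchy can be sketched as a simple closure computation. This is a naive illustration, not an optimised reasoner; reflexive pairs (every class being a subclass of itself) are left out to keep the output short.

```python
def subclass_closure(pairs):
    """Return all (sub, super) pairs implied by transitivity of rdfs:subClassOf."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a subClassOf b and b subClassOf d implies a subClassOf d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


hierarchy = {
    ("yago:PoliticalPhilosophers", "yago:Philosophers"),
    ("yago:Philosophers", "dbpedia:Person"),
    ("dbpedia:Person", "dbpedia:Mammal"),
}

closure = subclass_closure(hierarchy)
for sub, sup in sorted(closure - hierarchy):
    print(sub, "rdfs:subClassOf", sup, ".")
```

The printed statements are the ones that were not stated explicitly but follow from the three given subclass relations.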
RDFS Summary
RDFS can be used to describe semantic aspects of
specific domains
On the basis of RDFS it is possible to infer implicit
knowledge
However, the primitives of RDFS have limited expressivity
Equivalence
OWL allows for specification of equivalence; needed in data integration
scenarios
Between individuals: owl:sameAs
Example:
<http://viaf.org/viaf/97773925/> owl:sameAs
<http://dbpedia.org/resource/Marlene_Dietrich> .
Between properties: owl:equivalentProperty
Between classes: owl:equivalentClass
Example:
dbpedia:Person owl:equivalentClass foaf:Person .
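In a data integration setting, owl:sameAs statements are typically used to group URIs that denote the same individual. A minimal union-find sketch; only the VIAF to DBpedia link comes from the example above, the other two pairs are assumptions added for illustration.

```python
def same_as_classes(pairs):
    """Group URIs into equivalence classes given owl:sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


same_as = [
    ("http://viaf.org/viaf/97773925/", "http://dbpedia.org/resource/Marlene_Dietrich"),
    # Assumed links, for illustration only:
    ("http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5#artist",
     "http://dbpedia.org/resource/Marlene_Dietrich"),
    ("http://dbpedia.org/resource/Berlin", "http://mpii.de/yago/resource/Berlin"),
]

classes = same_as_classes(same_as)
for group in classes:
    print(sorted(group))
```

Because owl:sameAs is symmetric and transitive, the VIAF and BBC URIs end up in the same class even though no statement links them directly.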
However, equivalences are often implicitly stated in the data
Integration
[Figure: a query (1. Query) is posed against an integration layer, which accesses Source 1, Source 2, ..., Source n through Wrapper 1, Wrapper 2, ..., Wrapper n and returns the result (2. Answer)]
Architecture Styles
Warehousing / Crawl-Index-Serve: 0. Crawl and index, 1. Query, 2. Answer
Virtual Integration / Distributed Querying: 1. Query, 2. Answer
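The difference between the two styles can be sketched with in-memory stand-ins for the sources. Everything here (source names, the ex: data, the class and function names) is hypothetical; real systems would crawl over HTTP and query remote endpoints.

```python
# Hypothetical sources, each a list of (subject, predicate, object) triples.
SOURCES = {
    "source1": [("ex:a", "ex:name", "A")],
    "source2": [("ex:b", "ex:name", "B")],
}


def match(triples, predicate):
    """Return all triples with the given predicate, sorted."""
    return sorted(t for t in triples if t[1] == predicate)


class Warehouse:
    """Warehousing: crawl and index once (step 0), then answer from the local copy."""

    def __init__(self, sources):
        self.index = [t for triples in sources.values() for t in triples]

    def query(self, predicate):
        return match(self.index, predicate)


def virtual_query(sources, predicate):
    """Virtual integration: contact every source at query time."""
    results = []
    for triples in sources.values():
        results.extend(match(triples, predicate))
    return sorted(results)


warehouse = Warehouse(SOURCES)
same_before = warehouse.query("ex:name") == virtual_query(SOURCES, "ex:name")

# A source changes after the crawl.
SOURCES["source2"].append(("ex:c", "ex:name", "C"))
stale = warehouse.query("ex:name")
fresh = virtual_query(SOURCES, "ex:name")
print(same_before, len(stale), len(fresh))
```

The warehouse answers from its local index and misses the later update, while the virtual approach sees current data at the cost of contacting every source per query.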
SUMMARY
Summary
The Linked Data Web is a large, decentralised, complex system built
on simple principles
identify resources via HTTP URIs
provide RDF that links to other URIs upon lookup
Attribution
Slides from my SWT-2 lectures and WWW 2010 SILD tutorial
Slides about RDFS and OWL adapted from SWT-1 lecture (Rudolph,
Kroetzsch, Harth)
Linking Open Data cloud diagrams, by Richard Cyganiak and Anja
Jentzsch. http://lod-cloud.net/
Images of Berlin, Hegel and Dietrich via Wikipedia
Hendler 97: http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html
Borst 97: Construction of Engineering Ontologies, Ph.D. Thesis,
University of Twente 1997.
Studer, Benjamins, Fensel 98: Knowledge Engineering: Principles and
Methods, DKE 25(1-2):161-198.
Gruber 93: Towards principles for the design of ontologies used for
knowledge sharing, Formal Ontology in Conceptual Analysis and
Knowledge Representation, Kluwer.