The Graph Traversal Programming Pattern

Marko A. Rodriguez
Graph Systems Architect
http://markorodriguez.com
http://twitter.com/twarko

WindyCityDB - Chicago, Illinois – June 26, 2010

June 25, 2010

Abstract
A graph is a structure composed of a set of vertices (i.e. nodes, dots)
connected to one another by a set of edges (i.e. links, lines). The concept
of a graph has been around since the late 19th century; however, only in
recent decades has there been a strong resurgence in the development of
both graph theories and applications. In applied computing, since the late
1960s, the interlinked table structure of the relational database has been
the predominant information storage and retrieval paradigm. With the
growth of graph/network-based data and the need to efficiently process
such data, new data management systems have been developed. In
contrast to the index-intensive, set-theoretic operations of relational
databases, graph databases make use of index-free traversals. This
presentation will discuss the graph traversal programming pattern and its
application to problem-solving with graph databases.

Outline

• Graph Structures

• Graph Databases

• Graph Traversals
Artificial Example
Real-World Examples

Dots and Lines

There are dots and there are lines.
Let's call them vertices and edges, respectively.

Constructions from Dots and Lines

It's possible to arrange the dots and lines into various
configurations.
Let's call such configurations graphs.

The Undirected Graph

1. Vertices
• All vertices denote the same
type of object.

2. Edges
• All edges denote the same type
of relationship.
• All edges denote a symmetric
relationship.

Denoting an Undirected Structure in the Real World

Collaborator graph is an undirected graph. Road graph is an undirected graph.

A Line with a Triangle

Dots and lines are boring.
Let's add a triangle to one side of each line.
Let's call a triangle-tipped line a directed edge.

The Directed Graph

1. Vertices
• All vertices denote the same
type of object.

2. Edges
• All edges denote the same type
of relationship.
• All edges denote an
asymmetric relationship.

Denoting a Directed Structure in the Real World

Twitter follow graph is a directed graph. Web href-citation graph is a directed graph.

Single Relational Structures

• Without a way to demarcate edges, all edges have the same
meaning/type. Such structures are called single-relational graphs.

• Single-relational graphs are perhaps the most common graph type
in graph theory and network science.

How Do You Model a World with Multiple Structures?
[Figure: a single diagram combining several kinds of structure; labels in the figure include I-25, I-40, lives_in, is, follows, created, and cites.]

The Limitations of the Single-Relational Graph

• A single-relational graph is only able to express a single type of vertex
(e.g. person, city, user, webpage).1

• A single-relational graph is only able to express a single type of edge
(e.g. collaborator, road, follows, citation).2

• For modelers, these are very limiting graph types.

1
This is not completely true. All n-partite single-relational graphs allow for the division of the vertex set
into n subsets, where $V = \bigcup_{i=1}^{n} A_i$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. Thus, it's possible to implicitly type the vertices.
2
This is not completely true. There exists an injective, information-preserving function that maps any
multi-relational graph to a single-relational graph, where edge types are denoted by topological structures.
Rodriguez, M.A., “Mapping Semantic Networks to Undirected Networks,” International Journal of Applied
Mathematics and Computer Sciences, 5(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]

The Gains of the Multi-Relational Graph
• A multi-relational graph allows for the explicit typing of edges
(e.g. “follows,” “cites,” etc.).

• By labeling edges, edges can have different meanings and vertices
can have different types.
follows : user → user
created : user → webpage
cites : webpage → webpage
...

Increasing Expressivity with Multi-Relational Graphs
[Figure: a multi-relational graph whose edges are labeled follows, created, and cites.]

The Flexibility of the Property Graph
• A property graph extends a multi-relational graph by allowing for both
vertices and edges to maintain a key/value property map.

• These properties are useful for expressing non-relational data (i.e. data
not needing to be graphed).

• This allows for the further refinement of the meaning of an edge.
Peter Neubauer created the Neo4j webpage on 2007/10.

[Figure: a vertex (name=peterneubauer) connected by a created edge (date=2007/10) to a vertex (name=neo4j, views=56781).]
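
The property graph described on this slide can be sketched as a small data structure. The Python below is only an illustration under stated assumptions: the class names `Vertex` and `Edge` and their field names are invented for this sketch and are not the API of Neo4j or any other graph database.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class Vertex:
    """A vertex with a key/value property map and direct references to its edges."""
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)
    in_edges: list = field(default_factory=list)


@dataclass(eq=False, repr=False)
class Edge:
    """A directed, labeled edge that carries its own key/value property map."""
    out_vertex: Vertex
    label: str
    in_vertex: Vertex
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        # Register the edge on both endpoints so each vertex can reach it directly.
        self.out_vertex.out_edges.append(self)
        self.in_vertex.in_edges.append(self)


# "Peter Neubauer created the Neo4j webpage on 2007/10."
peter = Vertex({"name": "peterneubauer"})
neo4j = Vertex({"name": "neo4j", "views": 56781})
created = Edge(peter, "created", neo4j, {"date": "2007/10"})

for edge in peter.out_edges:
    print(edge.out_vertex.properties["name"], edge.label,
          edge.in_vertex.properties["name"], edge.properties)
```

The date lives on the edge rather than on either vertex, which is what lets a property refine the meaning of one specific relationship.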

Increasing Expressivity with Property Graphs
[Figure: the multi-relational graph extended with properties. Vertices carry properties such as name, views, page_rank, gender, and age (e.g. name=neo4j, views=56781); created edges carry a date (e.g. date=2007/10).]

Property Graph Instance Schema/Ontology

user : name=<string>, age=<integer>, gender=<string>
webpage : name=<string>, views=<integer>, page_rank=<double>

created : user → webpage, with date=<string>
follows : user → user
cites : webpage → webpage

There is no standard convention, but in general a schema specifies the types of vertices
and edges and the properties they may contain. Look into the world of RDFS and OWL for
more rigorous, expressive specifications of graph-based schemas.

Property Graphs Can Model Other Graph Types
[Figure: a diagram of the transformations that turn a property graph into other graph types: weighted graph, labeled graph, semantic graph, directed graph, RDF graph, multi-graph, simple graph, and undirected graph. The arrows are labeled with the operations "add weight attribute", "remove attributes", "remove edge labels", "make labels URIs", "remove directionality", "remove loops, directionality, and multiple edges", and "no op".]

NOTE: Given that a property graph is a binary edge graph, it is difficult to model an n-ary edge graph (i.e. a hypergraph).

Persisting a Graph Data Structure

• A graph is a relatively simple data structure. It can be
seen as the most fundamental data structure—something is
related to something else.

• Almost every database can model a graph.3

3
For the sake of simplicity, the following examples are with respect to a directed, single-relational graph.
However, note that property graphs can be easily modeled by such databases as well.

Representing a Graph in a Relational Database

outV | inV
------------
A | B
A | C
C | D
D | A
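
As a rough illustration of the edge table on this slide, the sketch below stores the same four edges in an in-memory SQLite database using Python's standard `sqlite3` module. The table name `edges`, the index name, and the helper `adjacent` are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (outV TEXT, inV TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")],
)
# The index on outV is what lets the database find a vertex's rows
# without scanning the whole table.
conn.execute("CREATE INDEX edges_outV ON edges (outV)")


def adjacent(vertex):
    """Return the vertices reachable from `vertex` over one outgoing edge."""
    rows = conn.execute(
        "SELECT inV FROM edges WHERE outV = ? ORDER BY inV", (vertex,)
    )
    return [row[0] for row in rows]


print(adjacent("A"))
print(adjacent("B"))
```

The graph is there, but only implicitly: adjacency is recovered by querying the table, not by following a reference held by the vertex.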

Representing a Graph in a JSON Database

{
  "A": { "out": ["B", "C"], "in": ["D"] },
  "B": { "in": ["A"] },
  "C": { "out": ["D"], "in": ["A"] },
  "D": { "out": ["A"], "in": ["C"] }
}

Representing a Graph in an XML Database

<graphml>
  <graph edgedefault="directed">
    <node id="A" />
    <node id="B" />
    <node id="C" />
    <node id="D" />
    <edge source="A" target="B" />
    <edge source="A" target="C" />
    <edge source="C" target="D" />
    <edge source="D" target="A" />
  </graph>
</graphml>

Defining a Graph Database

“If any database can represent a graph, then what
is a graph database?”


A graph database is any storage system that
provides index-free adjacency.4,5

4
There is no “official” definition of what makes a database a graph database. The one provided is my
definition. However, hopefully the following argument will convince you that this is a necessary definition.
5
There is adjacency between the elements of an index, but if the index is not the primary data structure
of concern (to the developer), then there is indirect/implicit adjacency, not direct/explicit adjacency. A
graph database exposes the graph as an explicit data structure (not an implicit data structure).


• Every element (i.e. vertex or edge) has a direct pointer to
its adjacent elements.

• No O(log2(n)) index lookup required to determine which
vertex is adjacent to which other vertex.

• If the graph is connected, the graph as a whole is a single
atomic data structure.
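
The contrast between the two kinds of adjacency can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular database is implemented: the sorted `edge_table` stands in for an index, the `Vertex` class stands in for index-free adjacency, and the edges are the A, B, C, D graph used in the storage examples earlier.

```python
from bisect import bisect_left

# Index-based adjacency: one globally sorted edge table. Finding a vertex's
# edges starts with a binary search whose cost grows with the table size.
edge_table = sorted([("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")])


def adjacent_via_index(name):
    i = bisect_left(edge_table, (name,))
    found = []
    while i < len(edge_table) and edge_table[i][0] == name:
        found.append(edge_table[i][1])
        i += 1
    return found


# Index-free adjacency: each vertex holds direct references to its neighbors.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out = []  # adjacent Vertex objects


# This dict is only used to find a starting vertex by id.
vertices = {name: Vertex(name) for name in "ABCD"}
for out_name, in_name in edge_table:
    vertices[out_name].out.append(vertices[in_name])


def adjacent_via_pointers(vertex):
    # Cost depends only on how many edges leave this vertex.
    return [neighbor.name for neighbor in vertex.out]


print(adjacent_via_index("A"))
print(adjacent_via_pointers(vertices["A"]))
```

Both functions return the same neighbors; what differs is whether the work done per step depends on the whole dataset or only on the current vertex.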

Defining a Graph Database by Example

[Figure: a toy graph with vertices A, B, C, D, and E, and a gremlin (the stuntman) who will traverse it.]

Graph Databases and Index-Free Adjacency
[Figure: the toy graph with vertices A, B, C, D, and E.]

• Our gremlin is at vertex A.
• In a graph database, vertex A has direct references to its adjacent vertices.
• Constant time cost to move from A to B and C. The cost depends only on the number
of edges emanating from vertex A (local), not on the size of the graph.

Graph Databases and Index-Free Adjacency

[Figure: the toy graph, captioned “The Graph (explicit)”.]

Non-Graph Databases and Index-Based Adjacency

[Figure: an index listing each vertex's adjacent vertices (e.g. A: B,C; B: E; C: D,E), shown next to the toy graph.]

• Our gremlin is at vertex A.
• In a non-graph database, the gremlin needs to look at an index to determine what
is adjacent to A.
• O(log2(n)) time cost to move to B and C. The cost depends on the total number of
vertices and edges in the database (global).

Non-Graph Databases and Index-Based Adjacency

[Figure: the same index and toy graph, captioned “The Index (explicit)” and “The Graph (implicit)”.]

Index-Free Adjacency

• While any database can implicitly represent a graph, only a
graph database makes the graph structure explicit.

• In a graph database, each vertex serves as a “mini index”
of its adjacent elements.6

• Thus, as the graph grows in size, the cost of a local step
remains the same.7
6
Each vertex can be interpreted as a “parent node” in an index with its children being its adjacent
elements. In this sense, traversing a graph is analogous in many ways to traversing an index—albeit the
graph is not an acyclic connected graph (tree).
7
A graph, in many ways, is like a distributed index.

Graph Databases Make Use of Indices

[Figure: an index of vertices (by id) over A, B, C, D, and E, drawn on top of the graph it points into.]

• There is more to the graph than the explicit graph structure.

• Indices index the vertices by their properties (e.g. ids).

Graph Databases and Endogenous Indices

• Many indices are trees.8

• A tree is a type of constrained graph.9

• You can represent a tree with a graph.10

8
Even an “index” that is simply an O(n) container can be represented as a graph (e.g. linked list).
9
A tree is an acyclic connected graph with each vertex having at most one parent.
10
This follows as a consequence of a tree being a graph.

• Graph databases allow you to explicitly model indices
endogenous to your domain model. Your indices and
domain model are one atomic entity—a graph.11

• This has benefits in designing special-purpose index
structures for your data.
Think about all the numerous types of indices in the
geo-spatial community.12
Think about all the indices that you have yet to think
about.
11
Originally, Neo4j used itself as its own indexing system before moving to Lucene.
12
Craig Taverner explores the use of graph databases in GIS-based applications.
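
One way to picture an endogenous index is a binary search tree built from ordinary edges that live in the same graph as the domain data. The sketch below is a simplified illustration, not how Neo4j or any other product stores its indices. The edge labels `index_left` and `index_right`, the helper names, and the pages `page_x` and `page_y` with their view counts are hypothetical.

```python
from collections import defaultdict

# views per webpage vertex; page_x and page_y are hypothetical extras
views = {"neo4j": 56781, "graph_blog": 1000, "page_x": 20000, "page_y": 300}
out_edges = defaultdict(list)  # vertex -> list of (label, adjacent vertex)


def add_edge(out_v, label, in_v):
    out_edges[out_v].append((label, in_v))


def step(vertex, label):
    """Return the first vertex reached from `vertex` over `label`, if any."""
    for edge_label, in_v in out_edges[vertex]:
        if edge_label == label:
            return in_v
    return None


# Domain data and index data share one graph.
add_edge("peterneubauer", "created", "neo4j")


def index_insert(root, vertex):
    current = root
    while True:
        label = "index_left" if views[vertex] < views[current] else "index_right"
        child = step(current, label)
        if child is None:
            add_edge(current, label, vertex)
            return
        current = child


def index_lookup(root, wanted_views):
    current = root
    while current is not None:
        if views[current] == wanted_views:
            return current
        label = "index_left" if wanted_views < views[current] else "index_right"
        current = step(current, label)
    return None


root = "page_x"
for page in ["neo4j", "graph_blog", "page_y"]:
    index_insert(root, page)

print(index_lookup(root, 300))
```

Looking up a page by its view count is itself a traversal: the lookup walks index edges with the same `step` operation used for domain edges.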

[Figure: the property graph, captioned “The Graph Dataset”, with a name property index, a views property index, and a gender property index pointing at its vertices.]

Graph Traversals as the Foundation

• Question: Once I have my data represented as a
graph, what can I do with it?

• Answer: You can traverse over the graph to
solve problems.

Graph Database vs. Relational Database

• While any database can represent a graph, it takes time to
make what is implicit explicit.

• The graph database represents an explicit graph.

• The experiment that follows demonstrates the problem with
using lots of table JOINs to accomplish the effect of a
graph traversal.13

13
Though not presented in this lecture, similar results were seen with JSON document databases.

Neo4j vs. MySQL – Generating a Large Graph

• Generated a 1 million vertex/4 million edge graph with “natural statistics.”14
• Loaded the graph into both Neo4j and MySQL in order to empirically evaluate the
effect of index-free adjacency.
14
What is diagrammed is a small subset of this 1 million vertex graph.

Neo4j vs. MySQL – The Experiment

• For each run of the experiment, a traverser (gremlin) is
placed on a single vertex.

• For each step, the traverser moves to its adjacent
vertices.
Neo4j (graph database): the adjacent vertices are provided by the
current vertex.15
MySQL (relational database): the adjacent vertices are provided by a
table JOIN.

• For the experiment, this process goes up to 5 steps.
15
You can think of a graph traversal, in a graph database, as a local neighborhood JOIN.
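
The two traversal strategies in the experiment can be sketched side by side on a toy graph. This is not the benchmark code and says nothing about timings; it only shows the shape of the work. The function names are assumptions, SQLite stands in for MySQL, and the edges are the small A, B, C, D example from earlier in the deck.

```python
import sqlite3

EDGES = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")]

# Graph-style traversal: each step reads the neighbors held by the current vertices.
adjacency = {}
for out_v, in_v in EDGES:
    adjacency.setdefault(out_v, []).append(in_v)


def traverse(root, steps):
    frontier = [root]
    for _ in range(steps):
        frontier = [n for v in frontier for n in adjacency.get(v, [])]
    return sorted(frontier)


# Relational-style traversal: every extra step is one more self-JOIN.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (outV TEXT, inV TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", EDGES)


def traverse_sql(root, steps):
    sql = "SELECT e{}.inV FROM edges e1".format(steps)
    for i in range(2, steps + 1):
        sql += " JOIN edges e{0} ON e{0}.outV = e{1}.inV".format(i, i - 1)
    sql += " WHERE e1.outV = ?"
    return sorted(row[0] for row in conn.execute(sql, (root,)))


for n in range(1, 6):
    print(n, traverse("A", n), traverse_sql("A", n))
```

Both functions return the same vertices at every length; the relational version simply needs a longer chain of JOINs as the traversal gets deeper.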

Neo4j vs. MySQL – The Experiment (Zoom-In Subset)

Neo4j vs. MySQL – The Experiment (Step 1)

Neo4j vs. MySQL – The Results
[Figure: total running time (ms) for traversals of length n, plotted for MySQL and Neo4j over traversal lengths 1 to 4 and averaged over the 250 most dense vertices as root of the traversal. Annotations on the plot read 2.3x faster, 2.6x faster, 4.5x faster, and 1.9x faster.]

• At step 5, Neo4j completed the traversal in 14 minutes.
• At step 5, MySQL was still running after 2 hours (process stopped).

Neo4j vs. MySQL – More Information
For more information on this experiment, please visit
http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_Large-Scale_Graph_Traversal.html

Why Use a Graph Database? – Data Locality

• If the solution to your problem can be represented as a local
process within a larger global data structure, then a graph
database may be the optimal solution for your problem.

• If the solution to your problem can be represented as being
with respect to a set of root elements, then a graph
database may be the optimal solution to your problem.

• If the solution to your problem does not require a global
analysis of your data, then a graph database may be the
optimal solution to your problem.

Some Graph Traversal Use Cases
• Local searches — “What is in the neighborhood around
A?”16

• Local recommendations — “Given A, what should A
include in their neighborhood?”17

• Local ranks — “Given A, how would you rank B relative to
A?”18
16
A can be an individual vertex or a set of vertices. This set is known as the root vertex set.
17
Recommendation can be seen as trying to increase the connectivity of the graph by recommending
vertices (e.g. items) for another vertex (e.g. person) to extend an edge to (e.g. purchased).
18
In this presentation, there will be no examples provided for this use case. However, note that searching,
ranking, and recommendation are presented in the WindyCityDB OpenLab Graph Database Tutorial. Other
terms for local rank are “rank with priors” or “relative rank.”

Graph Traversals with Gremlin Programming Language

Gremlin G = (V, E)

http://gremlin.tinkerpop.com

The examples to follow are as basic as possible to get the idea across. Note that
numerous variations to the themes presented can be created. Such variations are driven
by the richness of the underlying graph data set and the desired speed of evaluation.


[Figure: a small property graph with vertices 1, 3, and 4 joined by created and knows edges; vertex 4 has the properties name = peter and age = 37. Callouts point out vertex 1's out edges, vertex 3's in edges, an edge's label, out vertex, and in vertex, and vertex 4's id and properties.]

Graph Traversal in Property Graphs
[Figure: a property graph of people (tobias, alex, emil, johan, peter) and webpages (A, B, C, D, E) connected by follows and created edges.]

Red vertices are people and blue vertices are webpages.

Local Search: “Whom does Emil Eifrem follow?”
[Figure: the traversal starts at emil (step 1) and crosses his outgoing follows edges to reach alex and johan (step 2).]

./outE[@label='follows']/inV

Local Search: “What webpages did the people Emil follows create?”
[Figure: from emil (step 1) the traversal reaches the people he follows (step 2) and then the webpages they created (step 3).]

./outE[@label='follows']/inV/outE[@label='created']/inV

Local Search: “What webpages were created by the people followed by those Emil follows?”
[Figure: from emil (step 1) the traversal crosses follows edges twice (steps 2 and 3) and then created edges to reach webpages (step 4).]

./outE[@label='follows']/inV/outE[@label='follows']/inV/outE[@label='created']/inV
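
The local searches in this part of the deck can be mirrored in plain Python. This is a sketch only: the edge list is an assumption loosely modeled on the figure (the slides do not spell out every edge), and the helper name `out` is hypothetical rather than part of Gremlin.

```python
# (out vertex, label, in vertex) triples; an assumed subset of the example graph
EDGES = [
    ("emil", "follows", "alex"), ("emil", "follows", "johan"),
    ("alex", "follows", "tobias"), ("johan", "follows", "peter"),
    ("alex", "created", "B"), ("alex", "created", "C"),
    ("johan", "created", "A"), ("johan", "created", "D"),
    ("tobias", "created", "C"), ("peter", "created", "E"),
]


def out(vertices, label):
    """Mirror of outE[@label=label]/inV applied to every vertex in the list."""
    return [in_v for v in vertices
            for (out_v, edge_label, in_v) in EDGES
            if out_v == v and edge_label == label]


followed = out(["emil"], "follows")
print(followed)

pages = out(out(["emil"], "follows"), "created")
print(pages)

pages_two_hops = out(out(out(["emil"], "follows"), "follows"), "created")
print(pages_two_hops)
```

Each additional path segment is one more call to `out`, so a longer path expression is just a longer chain of local steps from the same root.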

Local Recommendation: “If you like webpage E, you may also like...”
[Figure: from webpage E (step 1) the traversal moves back along incoming created edges to E's creators (step 2) and then out along their created edges to the other webpages they created (step 3).]

./inE[@label='created']/outV/outE[@label='created']/inV[g:except($_)]

Assumption: if you like a webpage by X , you will like others that they have created.
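
The recommendation on this slide walks back to a page's creators and then forward to their other pages, leaving out the starting page. A minimal Python sketch follows; the edge list and the function names are assumptions for illustration, and removing duplicate recommendations is an addition of this sketch.

```python
# Assumed created edges: (person, label, webpage)
EDGES = [
    ("alex", "created", "E"), ("alex", "created", "C"),
    ("johan", "created", "E"), ("johan", "created", "D"),
    ("peter", "created", "A"),
]


def creators_of(page):
    """Walk incoming created edges back to the people at their tails."""
    return [person for (person, label, target) in EDGES
            if label == "created" and target == page]


def pages_by(person):
    """Walk a person's outgoing created edges to the pages at their heads."""
    return [target for (source, label, target) in EDGES
            if label == "created" and source == person]


def also_like(page):
    recommendations = []
    for person in creators_of(page):
        for other in pages_by(person):
            # Skip the starting page (the role of g:except) and duplicates.
            if other != page and other not in recommendations:
                recommendations.append(other)
    return recommendations


print(also_like("E"))
```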

Local Recommendation: “If you like Johan, you may also like...”
[Figure: from johan (step 1) the traversal moves back along incoming follows edges to emil (step 2) and then out along emil's follows edges to alex (step 3).]

./inE[@label='follows']/outV/outE[@label='follows']/inV

Assumption: if many people follow the same two people, then those two may be similar.

Assortment of Other Specific Graph Traversal Use Cases
• Missing friends: Find all the friends of person A. Then find all the
friends of the friends of person A that are not also person A’s friends.19
./outE[@label='friend']/inV[g:assign('$x')]/
outE[@label='friend']/inV[g:except($x)]

• Collaborative filtering: Find all the items that the person A likes. Then
find all the people that like those same items. Then find which items
those people like that are not already the items that are liked by person
A.20
./outE[@label='likes']/inV[g:assign('$x')]/
inE[@label='likes']/outV/outE[@label='likes']/inV[g:except($x)]
19
This algorithm is based on the notion of trying to close “open triangles” in the friendship graph. If
many of person A's friends are friends with person B, then it's likely that A and B know each other.
20
This is the most concise representation of collaborative filtering. There are numerous modifications to
this general theme that can be taken advantage of to alter the recommendations.
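
Both traversals on this slide follow the same shape: remember a set, walk further, then drop anything in the remembered set. The Python below is a hedged sketch of that shape. The edge data, the helper names `out` and `inn`, and the use of `Counter` to rank results by number of paths are assumptions of this example, not part of the original expressions.

```python
from collections import Counter

# Hypothetical data: (out vertex, label, in vertex) triples.
EDGES = [
    ("A", "friend", "B"), ("A", "friend", "C"),
    ("B", "friend", "D"), ("B", "friend", "E"),
    ("C", "friend", "D"), ("C", "friend", "B"),
    ("A", "likes", "i1"), ("A", "likes", "i2"),
    ("P", "likes", "i1"), ("P", "likes", "i3"),
    ("Q", "likes", "i2"), ("Q", "likes", "i3"), ("Q", "likes", "i4"),
]


def out(vertices, label):
    """outE[@label=label]/inV for every vertex in the list (duplicates kept)."""
    return [i for v in vertices for (o, lab, i) in EDGES if o == v and lab == label]


def inn(vertices, label):
    """inE[@label=label]/outV for every vertex in the list (duplicates kept)."""
    return [o for v in vertices for (o, lab, i) in EDGES if i == v and lab == label]


def missing_friends(person):
    x = out([person], "friend")                  # the assigned set $x
    candidates = out(x, "friend")
    return Counter(c for c in candidates if c not in x)


def collaborative_filtering(person):
    x = out([person], "likes")                   # the assigned set $x
    like_minded = inn(x, "likes")
    candidates = out(like_minded, "likes")
    return Counter(c for c in candidates if c not in x)


print(missing_friends("A").most_common())
print(collaborative_filtering("A").most_common())
```

Keeping duplicates until the final count means a candidate reached by more paths ranks higher, which is one simple way to order the results.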

Assortment of Other Specific Graph Traversal Use Cases

• Question expert identification: Find all the tags associated with
question A. For all those tags, find all answers (for any questions) that
are tagged by those tags. For those answers, find who created those
answers.21
./outE[@label='tag']/inV/inE[@label='tag']/outV[@type='answer']/inE[@label='created']/outV

• Similar tags: Find all the things that tag A has been used as a tag for.
For all those things, determine what else they have been tagged with.22
./inE[@label='tag']/outV/outE[@label='tag']/inV[g:except($_)]

21
If two resources share a “bundle” of resources in common, then they are similar.
22
This is the notion of “co-association” and can be generalized to find the similarity of two resources
based upon their co-association through a third resource (e.g. co-author, co-usage, co-download, etc.). The
third resource and the edge labels traversed determine the meaning of the association.

Some Tips on Graph Traversals

• Ranking, scoring, recommendation, searching, etc. are all
variations on the basic theme of defining abstract paths
through a graph and realizing instances of those paths
through traversal.

• The type of path taken determines the meaning
(i.e. semantics) of the rank, score, recommendation, search,
etc.

• Given the data locality aspect of graph databases, many of
these traversals run in real-time (< 100ms).
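
The idea of an abstract path that is realized by traversal can be made concrete by treating the path as data. In the sketch below, a path is a list of (direction, label) steps and `evaluate` realizes it from a start vertex. The representation, the function name, and the edge list are assumptions chosen for illustration.

```python
# Assumed edges: (out vertex, label, in vertex)
EDGES = [
    ("emil", "follows", "alex"),
    ("alex", "created", "B"),
    ("alex", "created", "C"),
]


def evaluate(path, start):
    """Realize an abstract path, a list of (direction, label) steps, from `start`."""
    frontier = [start]
    for direction, label in path:
        if direction == "out":
            frontier = [i for v in frontier
                        for (o, lab, i) in EDGES if o == v and lab == label]
        elif direction == "in":
            frontier = [o for v in frontier
                        for (o, lab, i) in EDGES if i == v and lab == label]
        else:
            raise ValueError("unknown direction: " + direction)
    return frontier


# Two abstract paths over the same graph, with two different meanings.
CREATED_BY_FOLLOWED = [("out", "follows"), ("out", "created")]
SAME_CREATOR = [("in", "created"), ("out", "created")]

print(evaluate(CREATED_BY_FOLLOWED, "emil"))
print(evaluate(SAME_CREATOR, "B"))
```

The graph and the evaluator stay the same; only the path definition changes, and with it the meaning of the result.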

Property Graph Algorithms in General
• There is a general framework for mapping all the standard single-relational
graph analysis algorithms over to the property graph domain.23
Geodesics: shortest path, eccentricity, radius, diameter, closeness,
betweenness, etc.24
Spectral: random walks, page rank, spreading activation, priors, etc.25
Assortativity: scalar or categorical.
... any graph algorithm in general.

• All of these can be represented in Gremlin.
23
Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis
Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274]
24
Rodriguez, M.A., Watkins, J.H., “Grammar-Based Geodesics in Semantic Networks,” Knowledge-Based
Systems, in press, 2010.
25
Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems,
21(7), pp. 727–739, 2008. [http://arxiv.org/abs/0803.4355]
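
As a very simplified illustration of running a single-relational algorithm over a property graph, the sketch below computes a breadth-first shortest path that only crosses edges whose labels are in a chosen set. This is a stand-in for the idea, not the grammar-based framework of the cited papers; the edge list and the function name are assumptions.

```python
from collections import deque

# Assumed edges: (out vertex, label, in vertex)
EDGES = [
    ("emil", "follows", "alex"), ("alex", "follows", "tobias"),
    ("emil", "created", "A"), ("tobias", "created", "C"),
    ("A", "cites", "C"),
]


def shortest_path(start, goal, labels):
    """Breadth-first shortest path that only uses edges with the given labels."""
    adjacency = {}
    for out_v, label, in_v in EDGES:
        if label in labels:
            adjacency.setdefault(out_v, []).append(in_v)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path("emil", "C", {"follows", "created"}))
print(shortest_path("emil", "C", {"follows", "created", "cites"}))
```

Restricting the labels changes which single-relational graph the algorithm effectively sees, and therefore which path counts as shortest.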

Conclusion

• Graph databases are efficient with respect to local data
analysis.

• Locality is defined by direct referent structures.

• Frame all solutions to problems as a traversal over local
regions of the graph.
This is the Graph Traversal Pattern.

Acknowledgements

• Pavel Yaskevich for advancing Gremlin. Pavel is currently writing a
new compiler that will make Gremlin faster and more memory efficient.

• Peter Neubauer for his collaboration on many of the ideas discussed in
this presentation.

• The rest of the Neo4j team (Emil, Johan, Mattias, Alex, Tobias, David,
Anders (1 and 2)) for their comments.

• WindyCityDB organizers for their support.

• AT&T Interactive (Aaron, Rand, Charlie, and the rest of the Buzz
team) for their support.

References to Related Work
• Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin
of the American Society of Information Science and Technology, June 2010.
[http://arxiv.org/abs/1006.2361]

• Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” AT&Ti and
NeoTechnology Technical Report, April 2010. [http://arxiv.org/abs/1004.1001]

• Neo4j: A Graph Database [http://neo4j.org]

• TinkerPop [http://tinkerpop.com]
Blueprints: Data Models and their Implementations [http://blueprints.tinkerpop.com]
Pipes: A Data Flow Framework using Process Graphs [http://pipes.tinkerpop.com]
Gremlin: A Graph-Based Programming Language [http://gremlin.tinkerpop.com]
Rexster: A Graph-Based Ranking Engine [http://rexster.tinkerpop.com]
∗ Wreckster: A Ruby API for Rexster [http://github.com/tenderlove/wreckster]
