37. Open Graph
"Today's web is made up of unstructured, random links between pages. Open Graph structures the relationships between people."
Zuckerberg
In 2010, Facebook introduced the first version of Open Graph, an extension of the social graph. Through the Open Graph protocol, websites and pages across the web that people like can be brought into that graph.
Open Graph
https://developers.facebook.com/docs/opengraph/
39. Google's Schmidt: 'I Screwed Up' on Social Networking
"On social networking, I screwed up."
June 1, 2011
Interview with Wired
http://www.wired.com/business/2011/06/googles-schmidt-social/
40. Our new search index:
Caffeine
Moving search indexing to near real time
June 8, 2010
Google Webmaster Central Blog
http://googlewebmastercentral.blogspot.jp/2010/06/our-new-search-index-caffeine.html
41. Pregel: A System for Large-Scale Graph Processing
Google's engine for large-scale graph data processing
June 8, 2010
SIGMOD 2010
http://kowshik.github.io/JPregel/pregel_paper.pdf
45. Google Knowledge Graph
Three features of the new search:
Find the right thing.
Get the best summary.
Go deeper and broader.
For Google's Knowledge Graph, see Maruyama's material "On Google's New Search Technology, Knowledge Graph" from the Cloud Study Group meeting of July 12, 2012.
https://drive.google.com/file/d/0B04ol8GVySUuUFUtbWcxNFlHY3c/edit?usp=sharing
56. Microsoft
Trinity: A Distributed Graph Engine on a Memory Cloud
Microsoft's graph engine built on a new architecture
June 26, 2013
ACM SIGMOD 2013
http://research.microsoft.com/pubs/183710/Trinity.pdf
http://research.microsoft.com/apps/pubs/d
57. Facebook: Scaling Apache Giraph to a trillion edges
Facebook's effort to scale Giraph
August 15, 2013
Avery Ching
Northwestern University
https://www.facebook.com/avery.ching
63. Facebook
next 10-year plans
"Graph Search barely works."
January 30, 2014
Interview with Business Week
http://www.businessweek.com/articles/2014-01-30/facebook-turns-10-the-mark-zuckerberg-interview#p4
65. Part II: Summary
Major events around search and graphs
2010/04/21 Facebook Open Graph released
2010/06/08 Google Caffeine rolled out
2010/06/08 Google Pregel paper, SIGMOD 2010
2011/06/01 Schmidt's self-criticism
2011/09/21 Google+ launched
2012/02/06 Apache Giraph 0.1 incubating
2012/05/16 Google Knowledge Graph
68. Pregel: A System for Large-Scale Graph Processing
June 8, 2010
SIGMOD 2010
http://kowshik.github.io/JPregel/pregel_paper.pdf
69. Pregel: A System for Large-Scale Graph Processing
SIGMOD 2010
Grzegorz Malewicz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski (Google, Inc.)
http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/Talks/pregel.pptx
148. Experiments
log-normal random graphs, mean out-degree 127.1 (thus over
127 billion edges in the largest case): varying graph sizes on
800 worker tasks
150. References
[1] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan W. Berry, Challenges in Parallel Graph Processing. Parallel Processing Letters 17, 2007, 5-20.
[2] Kameshwar Munagala and Abhiram Ranade, I/O-complexity of graph algorithms. In Proc. 10th Annual ACM-SIAM Symp. on Discrete Algorithms, 1999, 687-694.
[3] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, Pregel: a system for large-scale graph processing. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD), 2010.
[4] Leslie G. Valiant, A Bridging Model for Parallel Computation. Comm. ACM 33(8), 1990, 103-111.
155. Hadoop Summit 2011 (August)
Giraph: Large-scale graph processing infrastructure on Hadoop
Avery Ching, Facebook
Christian Kunz, Jybe
10/14/2011
@Hortonworks, Sunnyvale, CA
http://www.slideshare.net/averyching/20110628giraph-hadoop-summit
http://www.youtube.com/watch?v=l4nQjAG6fac
156. Facebook: Scaling Apache Giraph to a trillion edges
August 15, 2013  Avery Ching
Northwestern University
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
157. September 10, 2013
Scaling Apache Giraph
Nitay Joffe, Data Infrastructure Engineer
@nitayj
http://www.slideshare.net/nitayj/20130910-giraph-at-london-hadoop-users-group
163. Giraph data flow
[Diagram: three phases. (1) Load the graph: the master hands input splits (Split 0-4, read via the InputFormat) to Worker 0 and Worker 1, which load them and send vertices to the workers that own the corresponding partitions (Part 0-3), building the in-memory graph. (2) Compute and iterate: each worker runs Compute / Send Messages over its partitions, sends stats to the master, and iterates into the next superstep. (3) Store the graph: each worker writes its partitions out through the OutputFormat.]
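Read as a control flow, the diagram boils down to: load, run supersteps until no vertex is active and no messages are in flight, then store. A minimal plain-Java sketch of that loop follows; Worker and Master here are hypothetical stand-ins, not Giraph classes, and real workers run in parallel rather than in this sequential loop.

```java
import java.util.List;

// Hypothetical skeleton of the load / compute-iterate / store flow sketched above.
interface Worker {
    void loadSplitAndSendVertices();          // phase 1: load the graph
    long computePartitionsAndSendMessages();  // phase 2: returns number of still-active vertices
    boolean hasPendingMessages();
    void storePartitions();                   // phase 3: store the graph
}

final class Master {
    void runJob(List<Worker> workers) {
        workers.forEach(Worker::loadSplitAndSendVertices);           // 1. load
        boolean done = false;
        while (!done) {                                              // 2. compute and iterate
            long active = 0;
            boolean messagesInFlight = false;
            for (Worker w : workers) {
                active += w.computePartitionsAndSendMessages();
                messagesInFlight |= w.hasPendingMessages();
            }
            // Halt when no vertices are active and no messages are in transit.
            done = (active == 0) && !messagesInFlight;
        }
        workers.forEach(Worker::storePartitions);                    // 3. store
    }
}
```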
171. Problem: master crash.
Solution: ZooKeeper master queue.
[Diagram: before the failure of active master 0, "Active" Master 0 holds the active-master state in ZooKeeper while "Spare" Masters 1 and 2 wait. After Master 0 fails, Master 1 becomes "Active" and picks up the active-master state from ZooKeeper; Master 2 remains a "Spare".]
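One common way to realize such a master queue is ZooKeeper-based leader election, for example via Apache Curator's LeaderLatch, sketched below. This is illustrative only (the connect string and latch path are made up) and is not Giraph's actual failover code.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class MasterCandidate {
    public static void main(String[] args) throws Exception {
        String id = args.length > 0 ? args[0] : "master-0";

        // Connect string and latch path are hypothetical placeholders.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Every master candidate registers under the same latch path;
        // ZooKeeper orders them, so spares form a queue behind the active master.
        LeaderLatch latch = new LeaderLatch(zk, "/giraph/masterElection", id);
        latch.start();

        latch.await();  // blocks until this candidate becomes the active master
        System.out.println(id + " is now the active master");
        // ... run master duties; when it crashes, the next candidate in the queue takes over.
    }
}
```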
175. Problem: serialization to byte[]
• DataInput? Kryo? Custom?
Solution: Unsafe
• Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only).
• AWESOME. As fast as it gets.
• True native. Essentially C: *(long*)(data+offset);
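A minimal sketch of what the *(long*)(data+offset) idiom looks like on the JVM with sun.misc.Unsafe; this is illustrative only, not Giraph's actual serialization code, and the reflection trick for obtaining the Unsafe instance is an assumption of the sketch.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeReadExample {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton via reflection (there is no supported public accessor).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        byte[] data = new byte[16];
        long offset = 8;  // byte offset into the array, hypothetical

        // Equivalent of the C cast *(long*)(data + offset): read 8 raw bytes as a long,
        // skipping bounds checks and any per-field deserialization logic.
        long value = unsafe.getLong(data, (long) Unsafe.ARRAY_BYTE_BASE_OFFSET + offset);
        System.out.println(value);
    }
}
```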
176. K-Means Clustering
Problem: large aggregations (e.g. clustering similar emails).
[Diagram: with plain aggregators, every worker sends its aggregator values directly to the single master.]
Solution: sharded aggregators.
[Diagram: each aggregator is owned by one of the workers; workers send their partial values to the owning worker, only the aggregator owners communicate with the master, and the owners then distribute the aggregated values back to the workers.]
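As a conceptual sketch, ownership of each aggregator can be assigned by hashing its name across the workers; the assignment rule and class below are hypothetical, not Giraph's implementation, and only stand in for the routing idea in the diagram.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch: each aggregator name is owned by exactly one worker, chosen by
// hashing the name. Workers send partial values to the owner; only owners talk to
// the master and redistribute the final value.
public class ShardedAggregators {
    static int ownerOf(String aggregatorName, int numWorkers) {
        return Math.floorMod(aggregatorName.hashCode(), numWorkers);
    }

    public static void main(String[] args) {
        int numWorkers = 4;

        // Partial values produced by one worker during a superstep (made-up numbers).
        Map<String, Double> partials = new HashMap<>();
        partials.put("kmeans.centroid.0", 1.5);
        partials.put("kmeans.centroid.1", 0.75);
        partials.put("error.sum", 0.25);

        // Instead of sending everything to the master, the worker routes each
        // partial value to the worker that owns that aggregator.
        partials.forEach((name, value) ->
                System.out.printf("send %s=%.2f to owner worker %d%n",
                        name, value, ownerOf(name, numWorkers)));
    }
}
```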
177. Problem: network wait at the barrier
• RPC doesn't fit the model
• Synchronous calls are no good
[Diagram: with synchronous messaging, a superstep runs compute, then the network transfer, then a long wait at the barrier before the superstep ends.]
Solution: Netty
Tune queue sizes and thread counts.
[Diagram: with asynchronous, tuned messaging, the time to first message comes early in the superstep, network transfer overlaps compute, and the wait at the barrier shrinks.]
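The tuning advice (start sending early, tune buffer sizes) can be pictured as per-destination message buffers that are flushed on a background thread as soon as they reach a size threshold, so transfer overlaps compute. The sketch below is a rough illustration with made-up types, not Giraph's Netty-based client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: buffer outgoing messages per destination worker and flush a
// buffer asynchronously once it fills, instead of piling all traffic up at the barrier.
public class BufferedMessageSender {
    private final int flushThreshold;                       // the knob to tune
    private final Map<Integer, List<String>> buffers = new HashMap<>();
    private final ExecutorService networkThreads = Executors.newFixedThreadPool(4);

    public BufferedMessageSender(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public synchronized void send(int destWorker, String message) {
        List<String> buf = buffers.computeIfAbsent(destWorker, k -> new ArrayList<>());
        buf.add(message);
        if (buf.size() >= flushThreshold) {
            List<String> batch = new ArrayList<>(buf);
            buf.clear();
            // Hand the batch to a network thread; compute continues immediately.
            networkThreads.submit(() -> deliver(destWorker, batch));
        }
    }

    private void deliver(int destWorker, List<String> batch) {
        System.out.println("sending " + batch.size() + " messages to worker " + destWorker);
    }

    public static void main(String[] args) {
        BufferedMessageSender sender = new BufferedMessageSender(2);
        for (int i = 0; i < 5; i++) {
            sender.send(i % 2, "msg-" + i);  // messages bound for workers 0 and 1
        }
        sender.networkThreads.shutdown();
    }
}
```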
214. Trinity References
Bin Shao, Haixun Wang, and Yatao Li, Trinity:
A Distributed Graph Engine on a Memory
Cloud, SIGMOD 2013.
Wanyun Cui, Yanghua Xiao, Haixun Wang, Ji
Hong, and Wei Wang, Local Search of
Communities in Large Graphs, SIGMOD 2013
Kai Zeng, Jiacheng Yang, Haixun Wang, Bin
Shao, and Zhongyuan Wang, A Distributed
Graph Engine for Web Scale RDF Data, VLDB
2013.
Zhao Sun, Hongzhi Wang, Bin Shao, Haixun
Wang, and Jianzhong Li, Efficient Subgraph
Matching on Billion Node Graphs, VLDB 2012.
215. Trinity References
Bin Shao, Haixun Wang, and Yanghua Xiao,
Managing and Mining Large Graphs: Systems
and implementations (tutorial), SIGMOD 2012.
Lijun Chang, Jeffrey Yu, Lu Qin, Yuanyuan Zhu,
and Haixun Wang, Finding Information Nebula
over Large Networks, in ACM CIKM, October
2011.
Ruoming Jin, Lin Liu, Bolin Ding, and Haixun
Wang, Reachability Computation in Uncertain
Graphs, in VLDB, September 2011.
Ye Yuan, Guoren Wang, Haixun Wang, and Lei
Chen, Efficient Subgraph Search over Large
Uncertain Graphs, in VLDB, September 2011.
216. Trinity References
Ruoming Jin, Yang Xiang, Ruan Ning, and Haixun Wang, Path-Tree: An Efficient Reachability Indexing Scheme for Large Directed Graphs, in ACM Transactions on Database Systems (TODS), 2011.
233. Shards
https://github.com/hibernate/hibernate-shards
You can't always put all your relational data in
a single relational database. Sometimes you
simply have too much data. Sometimes you
have a distributed deployment architecture.
Hibernate Shards is a framework that is
designed to encapsulate and reduce this
complexity by adding support for horizontal
partitioning on top of Hibernate Core. Simply
put, we aim to provide a unified view of
multiple databases via Hibernate.
234. Gizzard
https://github.com/twitter/gizzard
Twitter has built several custom distributed
data-stores. Many of these solutions have a lot
in common, prompting us to extract the
commonalities so that they would be more
easily maintainable and reusable. Thus, we
have extracted Gizzard, a Scala framework
that makes it easy to create custom fault-tolerant, distributed databases.
235. Thrift
https://github.com/apache/thrift
Thrift is a lightweight, language-independent
software stack with an associated code
generation mechanism for RPC.
Thrift provides clean abstractions for data
transport, data serialization, and application
level processing.
The code generation system takes a simple
definition language as its input and generates
code across programming languages that uses
the abstracted stack to build interoperable RPC
clients and servers.
Social networks are graphs that describe relationships among people. Transportation routes create a graph of physical connections among geographical locations. Paths of disease outbreaks form a graph, as do games among soccer teams and computer network topologies. Perhaps the most pervasive graph is the web itself, where documents are vertices and links are edges. The scale of these graphs, in some cases billions of vertices and trillions of edges, poses challenges to their efficient processing.
Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can be extracted from the graph and made useful, provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it. In order to achieve that, we need a scalable infrastructure to mine a wide range of graphs.
The name Pregel honors Leonhard Euler. The bridges of Königsberg, which inspired his famous theorem, spanned the Pregel river.
A BSP computer consists of processors connected by a communication network. Each processor has a fast local memory and may follow different threads of computation. A BSP computation proceeds in a series of global supersteps. A superstep consists of three components:
• Concurrent computation: several computations take place on every participating processor. Each process only uses values stored in the local memory of the processor. The computations are independent in the sense that they occur asynchronously of all the others.
• Communication: the processes exchange data among themselves. This exchange takes the form of one-sided put and get calls rather than two-sided send and receive calls.
• Barrier synchronization: when a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.
The computation and communication actions do not have to be ordered in time. The barrier synchronization concludes the superstep.
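A minimal sketch of this superstep structure in plain Java, using a CyclicBarrier for the barrier synchronization; it illustrates BSP in general, not any particular engine, and the toy "put" into a neighbor's inbox stands in for one-sided communication.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative BSP skeleton: each worker thread alternates local computation,
// communication (a one-sided "put" into a neighbor's inbox), and a barrier.
public class BspSketch {
    static final int WORKERS = 4;
    static final int SUPERSTEPS = 3;
    // Double-buffered inboxes: values written in superstep s become visible in s+1.
    static final long[] inboxCurrent = new long[WORKERS];
    static final long[] inboxNext = new long[WORKERS];

    public static void main(String[] args) throws InterruptedException {
        // The barrier action runs once per superstep, after all workers arrive:
        // it publishes this superstep's messages for the next superstep.
        CyclicBarrier barrier = new CyclicBarrier(WORKERS,
                () -> System.arraycopy(inboxNext, 0, inboxCurrent, 0, WORKERS));

        Thread[] threads = new Thread[WORKERS];
        for (int w = 0; w < WORKERS; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                try {
                    long local = id;  // state in the worker's "fast local memory"
                    for (int step = 0; step < SUPERSTEPS; step++) {
                        local += inboxCurrent[id];              // concurrent computation on local data
                        inboxNext[(id + 1) % WORKERS] = local;  // communication: one-sided "put"
                        barrier.await();                        // barrier synchronization ends the superstep
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
    }
}
```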
Only active vertices
Google cluster architecture consists of thousands of commodity PCs organized into racks with high intra-rack bandwidth. Clusters are interconnected but distributed geographically.
Messages are sent asynchronously, to enable overlapping of computation, communication, and batching. This step is repeated as long as any vertices are active or any messages are in transit. After the computation halts, the master may instruct each worker to save its portion of the graph.
Included in a later version of Pregel.
Example: clusters can be reduced to a single node. In a minimum spanning tree algorithm, edges can be removed. Requests for adding vertices and edges can also be made.
Software failure: preemption by high-priority jobs.
Once the outgoing messages are available, the system can recompute the state of the recovering partitions; there is no need to re-execute the other partitions.
Sum of all PageRanks = 1 (when each vertex adds a damping term of 0.15/N). Sum of all PageRanks = number of pages (when each vertex adds a damping term of 0.15).
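A tiny worked check of the two normalizations, iterating both damping-term variants on a three-page directed cycle; the graph and numbers are made up purely for illustration.

```java
// Check of the two PageRank normalizations on a 3-page directed cycle
// (each page links to exactly one other page), run for a few iterations.
public class PageRankSums {
    public static void main(String[] args) {
        int n = 3;
        double d = 0.85;

        double[] prUnit = {1.0 / n, 1.0 / n, 1.0 / n};  // variant whose ranks sum to 1
        double[] prPages = {1.0, 1.0, 1.0};             // variant whose ranks sum to n

        for (int iter = 0; iter < 30; iter++) {
            prUnit = step(prUnit, d, 0.15 / n);   // damping term 0.15/N per vertex
            prPages = step(prPages, d, 0.15);     // damping term 0.15 per vertex
        }
        System.out.println("sum (0.15/N variant) = " + sum(prUnit));   // ~1.0
        System.out.println("sum (0.15 variant)   = " + sum(prPages));  // ~3.0
    }

    // In a cycle, page i receives the full rank of page (i-1) mod n.
    static double[] step(double[] pr, double d, double dampingTerm) {
        double[] next = new double[pr.length];
        for (int i = 0; i < pr.length; i++) {
            next[i] = dampingTerm + d * pr[Math.floorMod(i - 1, pr.length)];
        }
        return next;
    }

    static double sum(double[] a) {
        double s = 0;
        for (double x : a) s += x;
        return s;
    }
}
```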
No internal FB repo; everyone is a committer. A global epoch is followed by a global barrier, where components do concurrent computation and send messages. Graphs are sparse.
Giraph is a map-only job
Code is real, checked into Giraph. All vertices find the maximum value in a strongly connected graph.
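A sketch of such a maximum-value computation written against Giraph 1.x's BasicComputation API from memory; package names and signatures should be treated as approximate, and this is not the exact example checked into Giraph.

```java
import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Every vertex repeatedly adopts the largest value it has seen and forwards it;
// in a strongly connected graph all vertices converge to the global maximum.
public class MaxValueComputation extends
        BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {

    @Override
    public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
                        Iterable<LongWritable> messages) throws IOException {
        boolean changed = getSuperstep() == 0;  // always propagate the initial value
        for (LongWritable message : messages) {
            if (message.get() > vertex.getValue().get()) {
                vertex.setValue(new LongWritable(message.get()));
                changed = true;
            }
        }
        if (changed) {
            sendMessageToAllEdges(vertex, vertex.getValue());
        }
        vertex.voteToHalt();  // reactivated automatically if a new message arrives
    }
}
```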
One active master, with spare masters taking over in the event of an active-master failure. All active-master state is stored in ZooKeeper so that a spare master can immediately step in when the active master fails. The "active" master is implemented as a queue in ZooKeeper. A single worker failure causes the superstep to fail, and the application reverts to the last committed superstep automatically. The master detects worker failure during any superstep via a ZooKeeper "health" znode, chooses the last committed superstep, and sends a command through ZooKeeper for all workers to restart from that superstep.
Primitive collections are primitive: lots of boxing/unboxing of types, and an object plus a reference for each instance. There are also other implementations, like Map<I,E> for edges, which use more space but are better for lots of mutations. Realistically, FB-sized graphs need even bigger. Edges are not uniform in reality; some vertices are much larger.
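A small illustration of the trade-off these notes describe, contrasting a boxed Map-based edge store with flat primitive arrays; the types are illustrative only, not Giraph's OutEdges implementations.

```java
import java.util.HashMap;
import java.util.Map;

// Two ways to hold a vertex's out-edges (target id -> edge weight).
public class EdgeStorage {

    // Boxed storage: every put creates Long/Double wrapper objects, and the map keeps
    // an entry object plus references per edge. Flexible and mutation-friendly, but
    // heavy for billions of edges.
    static Map<Long, Double> boxedEdges(long[] targets, double[] weights) {
        Map<Long, Double> edges = new HashMap<>();
        for (int i = 0; i < targets.length; i++) {
            edges.put(targets[i], weights[i]);   // boxing on every insert
        }
        return edges;
    }

    // Primitive storage: two parallel arrays, no per-edge objects and no boxing.
    // Compact and cache-friendly, but mutations mean copying arrays.
    static final class FlatEdges {
        final long[] targets;
        final double[] weights;
        FlatEdges(long[] targets, double[] weights) {
            this.targets = targets;
            this.weights = weights;
        }
    }

    public static void main(String[] args) {
        long[] targets = {2L, 5L, 7L};
        double[] weights = {1.0, 0.5, 2.0};
        System.out.println("boxed: " + boxedEdges(targets, weights));
        System.out.println("flat:  " + new FlatEdges(targets, weights).targets.length + " edges");
    }
}
```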
Dangerous, non-portable, volatile. Oracle JVM only. No formal API. Allocate non-GC memory. Inherit from String (a final class). Direct memory access (C-style pointer casts).
Cluster open-source projects. Histograms. Job metrics.
Start sending messages early and send while computing. Tune message buffer sizes to reduce wait time.