What is good about graph databases?

Isn't an RDBMS good enough?

Read at least one book on the subject and judge for yourself:
https://neo4j.com/book-graph-databases/

There are many factors, but I think the key decision points are roughly these:

http://www.allthingsdistributed.com/2015/08/titan-graphdb-integration-in-dynamodb.html

In this way, graphs can scale to billions of vertices and edges, while allowing efficient queries and traversal of any subset of the graph with consistent low latency that doesn’t grow proportionally to the overall graph size. This is an important benefit for many use cases that involve accessing and traversing small subsets of a large graph. A concrete example is generating a product recommendation based on purchase interests of a user’s friends, where the relevant social connections are a small subset of the total network. Another example is for tracking inventory in a vast logistics system, where only a subset of its locations is relevant for a specific item.
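As a rough illustration of the recommendation example in the quote above, here is a minimal plain-Ruby sketch. The data shapes, names, and the `recommend` helper are my own assumptions for illustration, not anything from Titan or DynamoDB. The point is only that the work done depends on the user's immediate neighbourhood, not on how many unrelated users exist.

```ruby
# Recommend items bought by a user's direct friends that the user does not
# own yet, most frequently bought first (ties broken alphabetically).
# friends:   user => array of friend names    (hypothetical toy shape)
# purchases: user => array of purchased items (hypothetical toy shape)
def recommend(user, friends, purchases)
  owned = purchases.fetch(user, [])
  counts = Hash.new(0)
  # Only the user's own friends are visited; the rest of the graph is untouched.
  friends.fetch(user, []).each do |friend|
    purchases.fetch(friend, []).each do |item|
      counts[item] += 1 unless owned.include?(item)
    end
  end
  counts.sort_by { |item, n| [-n, item] }.map(&:first)
end

friends = {
  "alice" => ["bob", "carol"],
  "bob"   => ["alice"],
  "carol" => ["alice"],
  "dave"  => []               # unrelated to alice, never visited
}
purchases = {
  "alice" => ["book"],
  "bob"   => ["book", "lamp"],
  "carol" => ["lamp", "tent"],
  "dave"  => ["sofa"]
}

p recommend("alice", friends, purchases)  # => ["lamp", "tent"]
```

Adding a million more users like "dave" would not change the number of lookups performed for "alice".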

https://www.sitepoint.com/why-you-should-use-neo4j-in-your-next-ruby-app/#comment-2689399402

Why is this great? Imagine a world with no foreign keys. Each entity in your database can have many relationships referring directly to other entities. If you want to explore the relationships there are no table or index scans, just a few connections to follow. This matches up well with the typical object model. It is more powerful, though, because Neo4j, while providing a lot of the database functionality that we expect, gives us tools to query for complex patterns in our data.

My typical answer is that something like a database where you're doing logging of lots of repetitive data is usually a better fit for on RDMS (or even Mongo since you're wouldn't generally have foreign keys there). For a graph database obviously things that are already graphy are on the other side of the spectrum (e.g. social networks or hierarchical structures)

Another typical answer I've seen is that you'd want Neo4j for when your data has a lot of relationships. I've found, though, that relationships start coming from places that you don't expect when you have the ability to create them easily ;)
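To make the "world with no foreign keys" idea from the comment above a little more concrete, here is a small plain-Ruby sketch. It is only an analogy under my own assumptions (the `PersonNode` class and both helper methods are hypothetical), not how Neo4j or any RDBMS is actually implemented.

```ruby
# Relational style: relationships live in a join table, so every hop has to
# look rows up in that table (by scan here, by index in a real database).
def friends_via_join_table(person_id, people, friendships)
  friend_ids = friendships.select { |row| row[:person_id] == person_id }
                          .map { |row| row[:friend_id] }
  people.select { |person| friend_ids.include?(person[:id]) }
        .map { |person| person[:name] }
end

# Graph style: each node holds direct references to its neighbours, so a hop
# only follows references that are already attached to the node.
class PersonNode
  attr_reader :name, :friends

  def initialize(name)
    @name = name
    @friends = []
  end
end

def friends_via_references(node)
  node.friends.map(&:name)
end

people = [{ id: 1, name: "Ann" }, { id: 2, name: "Ben" }, { id: 3, name: "Cy" }]
friendships = [{ person_id: 1, friend_id: 2 }, { person_id: 1, friend_id: 3 }]

ann, ben, cy = %w[Ann Ben Cy].map { |name| PersonNode.new(name) }
ann.friends.push(ben, cy)

p friends_via_join_table(1, people, friendships)  # => ["Ben", "Cy"]
p friends_via_references(ann)                     # => ["Ben", "Cy"]
```

Both calls return the same answer; the difference is that the first one has to consult a shared table on every hop, while the second only touches the node it started from.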

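There are many graph databases, but I concluded that Neo4j is the best choice

I took a quick look at both Neo4j and another graph database called Titan, and concluded that Neo4j is the better of the two. (Of course, the right choice depends on the use case.)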

Neo4j

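Leaving the hard parts aside, just install it first. It ships with an extremely well-made tutorial, so try it out on your own.
https://neo4j.com/download/
After that, reading the following gives you a good overall picture.
https://neo4j.com/developer/get-started/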

Titan

http://titan.thinkaurelius.com/
What attracted me most is that it is (apparently?) officially supported on AWS.
For details, see:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.html

This one may also work well as an introduction to graph databases in general, not just Titan:
https://blogs.aws.amazon.com/bigdata/post/Tx12NN92B1F5K0C/Building-a-Graph-Database-on-AWS-Using-Amazon-DynamoDB-and-Titan

Even so, I still concluded that Neo4j is better

The main reason is the data structure

Titan's data structure

http://s3.thinkaurelius.com/docs/titan/current/data-model.html
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.BestPractices.html

Neo4j's data structure

https://neo4j.com/book-graph-databases/
This is covered in detail in Chapter 6, "Native Graph Storage".

Read both, and (the rest is omitted).

Miscellaneous

How capable is Enterprise Clustering?
https://neo4j.com/customers/
Companies such as LinkedIn use it too, after all.

Neo4j hosting services
https://neo4j.com/developer/guide-cloud-deployment/

Trying Neo4j with Rails

http://neo4jrb.readthedocs.io/en/7.0.x/
If you have built a Rails app before, you should pick this up easily, but here are some things that caught my attention.

Once you have created the Rails app, I recommend enabling pretty_logged_cypher_queries first, since reading the logged queries also helps you learn Cypher.
http://neo4jrb.readthedocs.io/en/7.0.x/Configuration.html
In a graph database, unlike an RDBMS, relationships are first-class objects too, so instead of a single ActiveRecord-style base there are both ActiveNode and ActiveRel (for relationships).
http://neo4jrb.readthedocs.io/en/7.0.x/ActiveNode.html
I see: to represent inheritance in Neo4j, you attach multiple labels to a node (for labels, see http://neo4jrb.readthedocs.io/en/7.0.x/Introduction.html#terminology).
http://neo4jrb.readthedocs.io/en/7.0.x/ActiveNode.html#eager-loading
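with_associations is the counterpart of eager_load in ActiveRecord.
Without it, associations are loaded in a preload-like way automatically. (This seems nice, but it also feels a bit like unwanted help.)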
http://neo4jrb.readthedocs.io/en/7.0.x/Querying.html
See the "Chaining associations" section.
You can chain has_many associations, as in student.lessons.teachers. (Naturally, this runs as a single query.)
proxy_as
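ActiveNode generates a UUID as the id, but you will notice that the Cypher queries actually executed often use a different ID.
This is the ID that Neo4j generates internally, which you can obtain with Neo4j's ID function, and it directly represents the record's position in the underlying data structure.
In other words, internally there is no need even for the B-tree search that is familiar from RDBMSs.
(For details, see Chapter 6, "Native Graph Storage", of https://neo4j.com/book-graph-databases/)
Note that the value of this ID can change, for example across version upgrades, so it apparently should be used only for transient query generation. https://github.com/neo4jrb/neo4j/blob/master/docs/UniqueIDs.rst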
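The idea that the internal ID "directly represents the position" can be sketched as follows. This is only my reading of the book's description of fixed-size store records; the 9-byte record size is the figure I recall from that chapter and should be treated as an assumption, and both method names are hypothetical.

```ruby
# With fixed-size records, the internal ID maps straight to a byte offset.
# record_size is an assumed, illustrative value that depends on the store format.
def node_record_offset(internal_id, record_size = 9)
  internal_id * record_size
end

# For contrast: finding a key in a sorted index by binary search,
# returning how many comparisons were needed.
def index_lookup_steps(sorted_keys, key)
  lo = 0
  hi = sorted_keys.length - 1
  steps = 0
  while lo <= hi
    steps += 1
    mid = (lo + hi) / 2
    return steps if sorted_keys[mid] == key
    if sorted_keys[mid] < key
      lo = mid + 1
    else
      hi = mid - 1
    end
  end
  steps
end

puts node_record_offset(100)                       # => 900
puts index_lookup_steps((0...1024).to_a, 1023) > 1 # => true
```

The first method is a single multiplication regardless of store size, while the second needs more comparisons as the index grows. This also hints at why the internal ID is not a stable identifier: it is tied to where the record happens to live.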

Cypher query

Cypher is Neo4j's counterpart to SQL.
After finishing the tutorial, browse https://neo4j.com/graphgists/ with https://neo4j.com/docs/cypher-refcard/current/ at hand,
and you will start to feel as though you can use it.

Label = table (a node can have several; to represent inheritance, attach the parent's label as well)
OPTIONAL MATCH (similar to a LEFT JOIN)
WITH (used for subquery-like purposes)
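The "OPTIONAL MATCH is like a LEFT JOIN" analogy from the list above can be sketched in plain Ruby. This only simulates the row semantics on toy data; the method names and data shape are my own assumptions, and no Cypher is executed.

```ruby
# MATCH-like: only people with at least one matching relationship produce rows.
# acted_in: person name => array of movie titles (hypothetical toy shape)
def match_rows(people, acted_in)
  people.flat_map do |person|
    acted_in.fetch(person, []).map { |movie| [person, movie] }
  end
end

# OPTIONAL MATCH-like: every person produces at least one row; the movie is
# nil when nothing matches, much like the NULL-padded rows of a SQL LEFT JOIN.
def optional_match_rows(people, acted_in)
  people.flat_map do |person|
    movies = acted_in.fetch(person, [])
    movies.empty? ? [[person, nil]] : movies.map { |movie| [person, movie] }
  end
end

people = ["Tom Hanks", "Nobody"]
acted_in = { "Tom Hanks" => ["Cast Away"] }

p match_rows(people, acted_in)           # => [["Tom Hanks", "Cast Away"]]
p optional_match_rows(people, acted_in)  # => [["Tom Hanks", "Cast Away"], ["Nobody", nil]]
```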

This post is about post-processing UNION results, but the same technique also works for SQL-subquery-like purposes: https://neo4j.com/blog/cypher-union-query-using-collect-clause/

Tuning Cypher queries

https://neo4j.com/blog/tuning-cypher-queries/

The following explanation matters less as tuning advice and more as a key to understanding graph databases:

With the first create index on, we are setting an index on the title of a :movie; with the second, we are setting an index on the name of a :person, both of which allow us to create unique indexes. In graph databases, indexes are only used to find the starting point for queries, while in relational databases indexes are used to return a set of rows that are then used for JOINS.

Then look at how the results differ. I think it is a good worked example.

Again, an index is used to find the starting point. We have now directed our query to zoom in on Tom Hanks and Meg Ryan and find the connections between them. This gives the query plan a very different shape:

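Indeed, once you consider how a graph database stores its data, it is clear that the plan on the right is far more efficient.
This right-hand pattern, if anything, looks just like the pattern you often see in MySQL when monitoring slow queries:
joining temporary tables to each other means no index can be used, so the query suddenly becomes slow once the data grows beyond a certain size.
In this respect too, it is important for application engineers to properly understand how both kinds of database work internally.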
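The "index only finds the starting point, then you follow connections" idea can be sketched in plain Ruby. Here a Hash lookup by name stands in for the index, and a breadth-first search stands in for the traversal. The `shortest_path` helper and the data shape are my own assumptions, not the query plan Neo4j actually produces.

```ruby
# adjacency: node name => array of neighbouring node names (undirected toy graph).
# Returns the list of node names on a shortest path, or nil if none exists.
def shortest_path(adjacency, from, to)
  # "Index" step: locate the two anchor nodes by name.
  return nil unless adjacency.key?(from) && adjacency.key?(to)

  # Traversal step: expand outwards from the start node only.
  previous = { from => nil }
  queue = [from]
  until queue.empty?
    current = queue.shift
    break if current == to
    adjacency.fetch(current, []).each do |neighbour|
      next if previous.key?(neighbour)
      previous[neighbour] = current
      queue << neighbour
    end
  end
  return nil unless previous.key?(to)

  # Rebuild the path by walking the predecessor links backwards.
  path = []
  node = to
  while node
    path.unshift(node)
    node = previous[node]
  end
  path
end

adjacency = {
  "Tom Hanks"            => ["Sleepless in Seattle", "Cast Away"],
  "Meg Ryan"             => ["Sleepless in Seattle"],
  "Sleepless in Seattle" => ["Tom Hanks", "Meg Ryan"],
  "Cast Away"            => ["Tom Hanks"]
}

p shortest_path(adjacency, "Tom Hanks", "Meg Ryan")
# => ["Tom Hanks", "Sleepless in Seattle", "Meg Ryan"]
```

Nodes that are not reachable within the explored neighbourhood are never examined, which is the contrast with joining two large intermediate row sets.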

With > Split MATCH Clauses to Reduce Cardinality
Using Size with Relationships
Judging from the gists and similar examples, I get the impression that people do not pay much attention to this one.
Using SCAN

GraphGist (something like Neo4j's official collection of examples)

https://neo4j.com/graphgist/ac0b2c27-2a5f-4943-8b4b-100273cb285e
Entertaining as a novelty.
Cypher examples
http://portal.graphgist.org/graph_gists/zombie-apocalypse
Entertaining as a novelty.
http://portal.graphgist.org/graph_gists/bank-fraud-detection
I do not think this model is usable for ordinary purposes. It was probably built specifically for detection.
http://portal.graphgist.org/graph_gists/project-management
The critical path can also be obtained with a single simple query.
Setting Earliest Start/Finish in a query brought back memories.
http://portal.graphgist.org/graph_gists/network-dependency-graph
http://portal.graphgist.org/graph_gists/credit-card-fraud-detection http://linkurio.us/
http://portal.graphgist.org/graph_gists/competency-management-a-matter-of-filtering-and-recommendation-engines
I have not fully understood it, but it was educational. It looks like something that could be used as-is at work.

However, it may be critical to only select those candidates who have skills, required by activities, within a certain competency area. Therefore, not to filter through node properties, but through their links or data relationships. Furthermore, it is also essential to expand the search to (and eventually beyond) 3rd degree connections between skills, activities and areas. In other words, we are looking for how potential candidates are connected to competency areas, within a depth of 3. A SQL database will need to execute more JOIN operations to provide the answer – a task that is difficult to code and creates a time-consuming query. As the depth of connections queried expands, this search will become increasingly difficult with an RDBMS and will result in incredibly poor performance.

After 1 year of operations, these parameters result in a graph of approximately 1M nodes. For a graph of this size, the query traversing paths of depth 3 (see above) requires over 30 seconds for a RDBMS to perform, but will only take less than 0.2 seconds with Neo4j [23]. The difference can be critical, whenever querying the database is part of an online tool.

http://portal.graphgist.org/graph_gists/finding-influencers-in-a-social-network
Detects Twitter influencers. So this is how retweets get modeled.
http://portal.graphgist.org/graph_gists/piping-water
http://portal.graphgist.org/graph_gists/organization-learning
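
For the project-management gist listed above, the Earliest Start/Finish and critical-path calculation can be sketched in plain Ruby. This is not the gist's Cypher; it is a generic forward pass over a dependency graph under my own assumptions (hypothetical data shape and method names, and the input is assumed to contain no cycles).

```ruby
# tasks: name => { duration: Integer, depends_on: [task names] } (hypothetical shape).
# Returns name => { es: earliest start, ef: earliest finish }.
def earliest_times(tasks)
  times = {}
  visit = lambda do |name|
    return times[name] if times.key?(name)
    task = tasks.fetch(name)
    # A task can start once its slowest prerequisite has finished.
    es = task[:depends_on].map { |dep| visit.call(dep)[:ef] }.max || 0
    times[name] = { es: es, ef: es + task[:duration] }
  end
  tasks.each_key { |name| visit.call(name) }
  times
end

# Walks back from the task that finishes last, always following the
# prerequisite with the latest earliest finish.
def critical_path(tasks)
  times = earliest_times(tasks)
  current = times.max_by { |_, t| t[:ef] }&.first
  path = []
  while current
    path.unshift(current)
    current = tasks[current][:depends_on].max_by { |dep| times[dep][:ef] }
  end
  path
end

tasks = {
  "design"  => { duration: 3, depends_on: [] },
  "build"   => { duration: 5, depends_on: ["design"] },
  "docs"    => { duration: 2, depends_on: ["design"] },
  "release" => { duration: 1, depends_on: ["build", "docs"] }
}

p earliest_times(tasks)["release"]  # => {:es=>8, :ef=>9} (Ruby 3.4+ prints {es: 8, ef: 9})
p critical_path(tasks)              # => ["design", "build", "release"]
```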

kissrobber/graph_database_memo.md