Talk:Apache Spark

Computing: Software / Free and open-source software

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
	This article has not yet received a rating on the project's importance scale.
	This article is supported by WikiProject Software.
	This article is supported by Free and open-source software (assessed as Low-importance).

University of California

	This article is within the scope of WikiProject University of California, a collaborative effort to improve the coverage of articles relating to University of California, its history, accomplishments and other topics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
	This article has not yet received a rating on the project's importance scale.
	This page is within the scope of the UC Berkeley task force. New members are always welcome!

Tip: Anchors are case-sensitive in most browsers.

This article links to one or more target anchors that no longer exist.

[[Amazon Web Services#Database|Kinesis]] This link targets the anchor #Database on the page Amazon Web Services. The anchor is no longer available because a user previously deleted it.
[[Graph database#Distributed processing|Pregel]] This link targets the anchor #Distributed processing on the page Graph database. The anchor has been deleted.

Please help fix the broken anchors. You can remove this template after fixing the problems.

NPOV?

Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database.

Sounds like someone has an axe to grind here. Isn't everything in Spark read-only? That is one of the intentional aspects of the design ("it's not a bug, it's a feature"), so harping on how Spark isn't a database sounds a lot like somebody doesn't like it, or has something else they want people to use or buy. --Preceding unsigned comment added by 75.73.1.89 (talk) 15:55, 28 September 2016 (UTC)[reply]

I wrote that line in this Wikipedia article. I'm also the author of the book Spark GraphX in Action. I attempted to present a balanced view, and chose to highlight the immutability of graphs because the question sometimes comes up on the Apache mailing lists. See [1] and [2]. Also, until recently, GraphX was listed in the Graph database article! See [3]. The lack of mutability was even acknowledged as a weakness by Ankur Dave, one of the primary authors of GraphX, and he attempted to address it via the external package IndexedRDD. Michaelmalak (talk) 17:48, 28 September 2016 (UTC)[reply]

RDD Versus Dataset.

This article states that Spark is built around RDDs, but the official documentation at https://spark.apache.org/docs/latest/quick-start.html says that RDD is deprecated and Datasets are the new paradigm. It's beyond my knowledge and experience in Spark to fix the article, but it would be great if someone expert in the change could update this. I find wiki articles to be a better intro than most software documentation, so I'd love to see a good, updated intro to Spark here. --Preceding unsigned comment added by 138.32.32.166 (talk) 17:31, 19 October 2017 (UTC)[reply]

Done Michaelmalak (talk) 00:16, 20 October 2017 (UTC)[reply]

PySpark

PySpark redirects here but isn't actually mentioned in the article. The article should explain what PySpark is. --Jameboy (talk) 11:14, 1 November 2022 (UTC)[reply]