Talk:Apache Spark
This article is rated Start-class on Wikipedia's content assessment scale.
This article links to one or more target anchors that no longer exist.
Please help fix the broken anchors.
NPOV?
Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database.
Sounds like someone has an axe to grind here. Isn't everything in Spark read-only? That is one of the intentional aspects of the design ("it's not a bug, it's a feature"), so harping on how Spark isn't a database sounds a lot like somebody doesn't like it, or has something else they want people to use or buy. — Preceding unsigned comment added by 75.73.1.89 (talk) 15:55, 28 September 2016 (UTC)
- I wrote that line in this Wikipedia article. I'm also the author of the book Spark GraphX in Action. I attempted to present a balanced view, and chose to highlight the immutability of graphs because the question sometimes comes up on the Apache mailing lists. See [1] and [2]. Also, until recently, GraphX was listed in the Graph database article! See [3]. The lack of mutability was even acknowledged as a weakness by Ankur Dave, one of the primary authors of GraphX, and he attempted to address it via the external package IndexedRDD. Michaelmalak (talk) 17:48, 28 September 2016 (UTC)
Links to potential references
- http://www.pcworld.com/article/2336380/apache-lights-a-fire-under-hadoop-with-spark.html
- http://www.toptechnews.com/article/index.php?story_id=0010002ZTG58
- http://gigaom.com/2013/10/28/spark-is-a-really-big-deal-for-big-data-and-buttera-gets-it/
- http://strata.oreilly.com/2013/02/the-future-of-big-data-with-bdas-the-berkeley-data-analytics-stack.html
- http://blog.mikiobraun.de/2014/01/apache-spark.html
RDD Versus Dataset.
This article states that Spark is built around RDDs, but the official documentation at https://spark.apache.org/docs/latest/quick-start.html says that, since Spark 2.0, RDDs have been replaced by Datasets as the main programming interface. It's beyond my knowledge and experience in Spark to fix the article, but it would be great if someone expert on the change could update this. I find wiki articles to be a better intro than most software documentation, so I'd love to see a good, updated intro to Spark here. — Preceding unsigned comment added by 138.32.32.166 (talk) 17:31, 19 October 2017 (UTC)
PySpark
PySpark redirects here but isn't actually mentioned in the article. The article should explain what PySpark is. --Jameboy (talk) 11:14, 1 November 2022 (UTC)