Posts about NoSQL databases and Polyglot persistence from Monday, 14 March 2011
Hadoop and NoSQL Databases at Twitter
Three presentations covering the various ways Twitter uses NoSQL:
- Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ
- Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ
- Dmitriy Ryaboy’s Hadoop presentation from Devoxx 2010
Judging by the powered by NoSQL page and my records, Twitter seems to be the largest adopter of NoSQL solutions. Here is an updated list of who is using Cassandra and HBase:
- Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
- Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
- Netflix: Amazon SimpleDB, Cassandra
- Digg: Cassandra
- SimpleGeo: Cassandra
- StumbleUpon: HBase, OpenTSDB
- Yahoo!: Hadoop, HBase, PNUTS
- Rackspace: Cassandra
There are probably many more missing from the list, but that could change if you leave a comment.
Original title and link: Hadoop and NoSQL Databases at Twitter (NoSQL databases © myNoSQL)
RethinkDB and SSD Write Performance
I didn’t know much about RethinkDB until watching Tim Anglade’s interview with Slava Akhmechet and Mike Glukhovsky. Three things in particular caught my attention:
- RethinkDB is first building a persistent, memcached-compatible solution designed to work with SSDs. The reason for starting with a memcached-compatible system is that building it is much simpler than implementing a MySQL storage engine. On the other hand, I think that having a persistent memcached might bring RethinkDB some customers to validate the technology. Even though it was announced as being 8-10 weeks away at the time of the interview, I don’t think this implementation has launched yet. Update: according to Tim, RethinkDB technology has been available to private beta users for a while now, but I still couldn’t find any reference to it on either the website or the blog.
- Next will come a MySQL storage engine optimized for SSDs.
- Replacing rotational disks with SSDs shows an immediate bump in performance, but shortly after (months) performance degrades seriously.
It is this last point that I hadn’t heard before, and I’d really be interested to understand:
- whether it applies to all scenarios or is specific to databases in general
- whether there are specific database scenarios (access patterns, read/write ratios) that lead to this behavior, or whether it will manifest in more general cases too
My current assumption is that this behavior occurs only for write-intensive databases. But I’d really like to hear some better-documented answers.
Update: the first answer I got to the above questions comes from Travis Truman: The SSD Anthology: Understanding SSDs and New Drives from OCZ.
Update: the RethinkDB guys have published a follow-up: On TRIM, NCQ, and write amplification.
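Since write amplification is at the core of that follow-up, a back-of-the-envelope illustration may help. The page and block sizes below are generic assumptions chosen for illustration, not figures taken from RethinkDB's post or from any specific drive:

```python
# Rough illustration of SSD write amplification (all sizes are illustrative
# assumptions, not vendor specs). NAND flash is written in small pages but
# erased in much larger blocks, so once the pool of pre-erased pages runs
# out, a small logical update can force the drive to relocate and rewrite a
# whole block during garbage collection.

PAGE_SIZE = 4 * 1024      # bytes the host writes per random update
BLOCK_SIZE = 512 * 1024   # bytes the drive erases/rewrites at once

def write_amplification(host_bytes, flash_bytes):
    """Bytes physically written to flash divided by bytes the host asked for."""
    return flash_bytes / host_bytes

# Fresh drive: free pages are available, a 4 KB update costs ~4 KB of flash writes.
fresh = write_amplification(PAGE_SIZE, PAGE_SIZE)

# Aged drive, worst case: the same 4 KB update triggers a read-modify-write
# of a full block because no erased pages are left.
aged = write_amplification(PAGE_SIZE, BLOCK_SIZE)

print(f"fresh drive: ~{fresh:.0f}x")  # ~1x
print(f"aged drive:  ~{aged:.0f}x")   # ~128x in this worst case
```

The gap between the fresh and the aged state is the kind of degradation described in the last point above, and it is why TRIM support and the drive's garbage collection behavior matter so much for sustained database writes.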
Original title and link: RethinkDB and SSD Write Performance (NoSQL databases © myNoSQL)
Comparing Dryad and Hadoop
Madhu Reddy[1] compares the commercial, not-yet-released Dryad with the open source, widely used Hadoop:
- While Hadoop has chosen to build these capabilities from scratch [management and administration of large clusters], Dryad has chosen to leverage the proven and tested cluster management capabilities already present in Windows HPC Server.
- Hadoop […] has focused on performance and scale. Dryad, building on the performance and scale of Windows HPC Server, has in addition focused on making big data easier to use for mainstream application developers.
- Dryad and DSC are based on the widely used and mature NTFS (New Technology File System), the file system that comes standard with Windows Server.
- Hadoop uses the MapReduce computational model, which provides support for expressing the application logic in two simple steps — map and reduce. However, to develop more complex applications, developers will have to manually string together a sequence of MapReduce steps. DryadLINQ offers a higher-level computational model where complex sequence of MapReduce steps can be easily expressed in a query language similar to SQL.
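To make that last point concrete, here is what an explicit chain of two MapReduce steps looks like. The job itself (count terms, then keep the most frequent one) is a made-up example, and it is sketched with Yelp's mrjob Python library purely to keep it short; with the raw Hadoop Java API each step would be a separate Job wired together by hand, while DryadLINQ would express the whole thing as a single SQL-like query.

```python
# Hypothetical two-step MapReduce chain sketched with the mrjob library.
# The point is structural: the developer has to decompose the logic into
# explicit map/reduce stages and string them together.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MostFrequentTerm(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_emit_terms, reducer=self.reducer_count),
            MRStep(reducer=self.reducer_pick_max),
        ]

    def mapper_emit_terms(self, _, line):
        for term in line.split():
            yield term, 1

    def reducer_count(self, term, ones):
        # Re-key everything under a single key so the next step sees all
        # (count, term) pairs together.
        yield None, (sum(ones), term)

    def reducer_pick_max(self, _, count_term_pairs):
        count, term = max(count_term_pairs)
        yield term, count

if __name__ == "__main__":
    MostFrequentTerm.run()
```

Run locally against a text file, or pass -r hadoop to submit the same chain to a Hadoop cluster.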
A couple of aspects that were left out:
- licensing costs for Windows HPC Server, Microsoft Visual Studio, and the future Dryad
- Dryad’s commercial, closed source model versus Hadoop’s open source model (nb: for example, how soon could you get a bug fix or an improvement?)
- the Hadoop tools ecosystem
- other Hadoop tools like Karmasphere Studio, a graphical environment to develop, debug, deploy, and monitor MapReduce jobs
That’s not to say that Dryad and DryadLINQ are not interesting projects.
Madhu Reddy is a senior product manager for Technical Computing marketing at Microsoft. ↩
Original title and link: Comparing Dryad and Hadoop (NoSQL databases © myNoSQL)
via: http://gigaom.com/cloud/with-dryad-microsoft-is-trying-to-democratize-big-data/
Clustrix: Distribution, Fault Tolerance, and Availability Models
Using a comparison with MongoDB as a pretext (why MongoDB?), Sergei Tsarev provides some details about Clustrix’s data distribution, fault tolerance, and availability models.
At Clustrix, we think that Consistency, Availability, and Performance are much more important than Partition tolerance. Within a cluster, Clustrix keeps availability in the face of node loss while keeping strong consistency guarantees. But we do require that more than half of the nodes in the cluster group membership are online before accepting any user requests. So a cluster provides fully ACID compliant transactional semantics while keeping a high level of performance, but you need majority of the nodes online.
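In other words, a cluster of N nodes keeps serving requests only while more than N/2 of them are part of the online group membership. A trivial sketch of that rule, as an illustration of the statement above rather than anything resembling Clustrix code:

```python
# Majority rule as quoted above: accept user requests only while more than
# half of the cluster's group membership is online.

def accepts_requests(online_nodes: int, total_nodes: int) -> bool:
    return online_nodes > total_nodes // 2

print(accepts_requests(3, 5))  # True:  3 of 5 is a majority
print(accepts_requests(3, 6))  # False: 3 of 6 is only half, not a majority
```

So a 5-node cluster keeps accepting requests after losing 2 nodes, while a 6-node cluster stops once it has lost 3.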
Original title and link: Clustrix: Distribution, Fault Tolerance, and Availability Models (NoSQL databases © myNoSQL)
via: http://sergeitsar.blogspot.com/2011/02/mongodb-vs-clustrix-comparison-part-2.html
Hadoop, Hive and Redis for Foursquare Analytics
Foursquare’s move from querying the production databases to a data analytics system using Hadoop and Hive, with Redis playing the role of a cache (a sketch of the caching step follows the list of goals):
- Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
- Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
- Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
- Add new data in a simple way (just put it in Amazon S3!).
- Analyse data from several data sources (mongodb, postgres, log-files).
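The second and third goals (caching query results in a database so the Hadoop cluster can be shut down when not needed) boil down to a simple pattern. The sketch below is not Foursquare's code: it assumes the Hive CLI is on the PATH and uses the redis-py client, and the table, query, key name, and TTL are all made up for illustration.

```python
# Hypothetical sketch of the "query with Hive, cache in Redis" pattern:
# run a batch query on the Hadoop/Hive side, then store the result in Redis
# so reports are served from the cache even while the cluster is offline.
import json
import subprocess

import redis

HIVE_QUERY = """
SELECT venue_category, COUNT(*) AS checkins
FROM checkins
WHERE dt = '2011-03-13'
GROUP BY venue_category
"""

def run_hive_query(query):
    # Shells out to the Hive CLI in silent mode; a Thrift/JDBC client would
    # work just as well.
    output = subprocess.check_output(["hive", "-S", "-e", query])
    rows = [line.split("\t") for line in output.decode().splitlines() if line]
    return [{"venue_category": cat, "checkins": int(cnt)} for cat, cnt in rows]

def cache_report(client, key, rows, ttl_seconds=24 * 3600):
    # Store the result as JSON with an expiry; the next scheduled Hadoop run
    # simply overwrites it.
    client.setex(key, ttl_seconds, json.dumps(rows))

if __name__ == "__main__":
    r = redis.Redis(host="localhost", port=6379)
    report = run_hive_query(HIVE_QUERY)
    cache_report(r, "report:checkins-by-category:2011-03-13", report)
```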
One of the most often heard complaints about NoSQL databases is their reduced querying capabilities. Running reports and analysis against the production servers is only going to work when you have little data and the set of queries is limited and stable over time. Otherwise, you’ll want to run these against a copy of your data to avoid bringing down the production databases and to avoid corrupting data.
Original title and link: Hadoop, Hive and Redis for Foursquare Analytics (NoSQL databases © myNoSQL)