Putting your NoSQL data to work
Storing your data in a NoSQL solution doesn’t mean that you are done with it. You’ll still have to put it to work, transform and move it, or do some data warehousing[1]. And the lack of SQL should not stop you from doing any of these.
One solution available in many NoSQL stores is MapReduce. As an example, you can see how to translate SQL to MongoDB MapReduce.
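To give a feel for what such a translation involves, here is a minimal sketch in plain Java (no MongoDB or Hadoop involved) of a query like SELECT customer, SUM(amount) FROM orders GROUP BY customer expressed as a map step and a reduce step. The Order class, its field names, and the in-memory run driver are all made up for illustration; a real store would run and distribute these phases for you.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByAsMapReduce {

    // Hypothetical record standing in for a stored document or row.
    static class Order {
        final String customer;
        final int amount;

        Order(String customer, int amount) {
            this.customer = customer;
            this.amount = amount;
        }
    }

    // Map phase: emit one (key, value) pair per input record.
    // The key plays the role of the GROUP BY column.
    static Map.Entry<String, Integer> map(Order order) {
        return new AbstractMap.SimpleEntry<>(order.customer, order.amount);
    }

    // Reduce phase: fold all values emitted for one key into a single result.
    // Summing here plays the role of SUM(amount).
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Tiny in-memory driver: map every record, group the pairs by key, then reduce.
    static Map<String, Integer> run(List<Order> orders) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Order o : orders) {
            Map.Entry<String, Integer> kv = map(o);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order("alice", 10),
            new Order("bob", 5),
            new Order("alice", 7));
        System.out.println(run(orders)); // prints {alice=17, bob=5}
    }
}
```

The point of the sketch is only the division of labour: the grouping column becomes the emitted key, and the aggregate function becomes the body of the reduce step.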
But MapReduce is not the only option available and I’d like to quickly introduce you to a couple of alternative solutions.
HBase-dsl ☞
Working with HBase can be quite verbose at times, and while Java is not very good at creating DSLs, sometimes even a more fluent API is useful. This is exactly what HBase-dsl brings you:
“However I found myself writing tons of code to perform some fairly simple tasks. So I set out to simplify my HBase code and ended up writing a Java HBase DSL. It’s still fairly rough around the edges but it does allow the use of standard Java types and it’s extensible.”
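// Save a value into table "test", row "abcd", column famA:col1.
hBase.save("test")
    .row("abcd")
    .family("famA")
    .col("col1", "hello world!");

// Fetch the same cell back as a String.
String value = hBase.fetch("test")
    .row("abcd")
    .family("famA")
    .value("col1", String.class);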
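If you are wondering how a chain like this can work in plain Java, here is a minimal sketch of the fluent pattern backed by an in-memory map. It is not HBase-dsl's actual implementation and it does not talk to HBase at all; the class FluentStoreSketch, its nested Op class, and the key layout are assumptions made up for illustration, and only the method names mirror the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

public class FluentStoreSketch {

    // In-memory stand-in for a table: "table/row/family/column" -> value.
    private final Map<String, Object> cells = new HashMap<>();

    // Entry points: each returns a builder that collects the cell coordinates.
    public Op save(String table) {
        return new Op(table);
    }

    public Op fetch(String table) {
        return new Op(table);
    }

    // Hypothetical builder; every setter returns "this" so calls can be chained.
    public class Op {
        private final String table;
        private String row;
        private String family;

        Op(String table) {
            this.table = table;
        }

        public Op row(String row) {
            this.row = row;
            return this;
        }

        public Op family(String family) {
            this.family = family;
            return this;
        }

        // Terminal call for a save: writes the value into the map.
        public void col(String column, Object value) {
            cells.put(key(column), value);
        }

        // Terminal call for a fetch: reads the value and casts it to the requested type.
        public <T> T value(String column, Class<T> type) {
            return type.cast(cells.get(key(column)));
        }

        private String key(String column) {
            return table + "/" + row + "/" + family + "/" + column;
        }
    }

    public static void main(String[] args) {
        FluentStoreSketch store = new FluentStoreSketch();
        store.save("test").row("abcd").family("famA").col("col1", "hello world!");
        String value = store.fetch("test").row("abcd").family("famA").value("col1", String.class);
        System.out.println(value); // prints hello world!
    }
}
```

Passing the Class token to the terminal call is what lets the caller get back a standard Java type without a cast at the call site.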
HBql ☞
HBql’s goal is to bring a more SQL-like interface to HBase for those who miss SQL. You can take a look at ☞ HBql statements to get a better feel for what it looks like.
Hive ☞
Hive is a data warehouse infrastructure for Hadoop that provides a SQL-like query language to make data ETL easy.
Pig ☞
Pig is a platform for analyzing large data sets built on Hadoop. I have found a great article ☞ comparing Pig Latin over Hadoop to SQL over a relational database:
- Pig Latin is procedural, where SQL is declarative.
- Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
- Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
- Pig Latin supports splits in the pipeline.
- Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.
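To make the "procedural" and "splits" points a bit more tangible, here is a rough analogy in plain Java. It is not Pig Latin and does not use Pig or Hadoop; the class name, the sample numbers, and the helper methods are all hypothetical. What it tries to show is the shape of the work: each step is named and written out in order, and one intermediate result is split into two branches that are processed separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

public class ProceduralPipelineSketch {

    // Hypothetical "load" step: a real pipeline would read from storage.
    static List<Integer> load() {
        return Arrays.asList(3, 12, 7, 25, 1);
    }

    // One explicit pipeline step: keep only the values matching the predicate.
    static List<Integer> filter(List<Integer> input, IntPredicate keep) {
        List<Integer> out = new ArrayList<>();
        for (int v : input) {
            if (keep.test(v)) {
                out.add(v);
            }
        }
        return out;
    }

    // Another explicit step: aggregate one branch into a single number.
    static int sum(List<Integer> input) {
        int total = 0;
        for (int v : input) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Each intermediate result has a name, so the author decides the order
        // of the steps and could persist any of them as a checkpoint.
        List<Integer> raw = load();

        // Split: the same input feeds two independent branches.
        List<Integer> small = filter(raw, v -> v < 10);
        List<Integer> large = filter(raw, v -> v >= 10);

        System.out.println("small total: " + sum(small));
        System.out.println("large total: " + sum(large));
    }
}
```

In a declarative SQL query you would describe the two results and leave the order of evaluation to the optimizer; here the order of the steps is whatever the author wrote down.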
But don’t think that HBase and Hadoop are the only ones getting such tools. In the graph database world, there is Gremlin ☞: a graph-based programming language meant to ease graph query, analysis, and manipulation.
I think that sooner rather than later we will see more such solutions appearing in the NoSQL environment.
References
- [1] ☞ Reporting in NoSQL