MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services

Shoji Nishimura (NEC Service Platforms Labs.), Sudipto Das, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara). Work done as a visiting researcher at UCSB. Appeared in MDM 2011, Luleå, Sweden.

Overview: A Motivating Story; Existing Technologies; Our Proposal; Evaluation; Conclusion.

Motivating Scenario: Mobile Coupon Distribution. Subscribers report their current locations to a mobile coupon distributor, which issues coupons according to a distribution policy (area, number of coupons).

Motivating Scenario: Mobile Coupon Distribution at scale. There are 125,000,000 subscribers in Japan, each reporting a current location. Large amounts of data and high throughput call for system scalability. Multi-dimensional queries and nearest neighbors queries call for efficient complex queries.

Existing Technologies. Key-value stores. Relational DBs and spatial DBs: commercial products, but expensive. What we want: open source products, scalability, and multi-dimensional queries at a reasonable price.

Ordered Key-Value Stores. Data are sorted by key: an index (key00, key11, …, keynn) points to buckets of key-value pairs. Good at 1-D range queries. Examples: BigTable, HBase. But our target is multi-dimensional (longitude, latitude, time)…

Naïve Solution: Linearization. Projects the n-D space onto a 1-D space, for example by applying a Z-ordering curve, so that the points can be stored in an ordered key-value store. Simple, but problematic… Z-order values of the 4x4 example space (x increases left to right, y increases bottom to top):
     5   7  13  15
     4   6  12  14
     1   3   9  11
     0   2   8  10

Problem: False-Positive Scans. An MD query on the linearized space takes three steps. (1) Translate the MD query into a linearized range query, e.g. a query from 2 to 9. (2) Scan the queried linearized range. (3) Filter out the points outside the queried area, e.g. the blue-hatched area (4 to 7). The filter step requires the boundary information of the original space.
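
To make the problem concrete, the following Python sketch replays the slide's example on the 4x4 grid. The helper names (interleave, linearized_range_query) and the in-memory sorted list standing in for the ordered key-value store are illustrative assumptions, not part of MD-HBase.

```python
def interleave(x, y, bits=2):
    """Z-order key: interleave the bits of x and y, x bit first."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def linearized_range_query(points, x_range, y_range, bits=2):
    """Scan the 1-D key range that covers the rectangle, then report
    which scanned keys lie outside the rectangle (false positives)."""
    z_lo = interleave(x_range[0], y_range[0], bits)  # lower-left corner
    z_hi = interleave(x_range[1], y_range[1], bits)  # upper-right corner
    keyed = sorted((interleave(x, y, bits), (x, y)) for x, y in points)
    scanned = [(z, p) for z, p in keyed if z_lo <= z <= z_hi]
    false_positives = [
        z for z, (x, y) in scanned
        if not (x_range[0] <= x <= x_range[1]
                and y_range[0] <= y <= y_range[1])
    ]
    return [z for z, _ in scanned], false_positives


# One point per cell of the 4x4 space; query x in [01, 10], y in [00, 01].
grid = [(x, y) for x in range(4) for y in range(4)]
scanned, false_positives = linearized_range_query(grid, (1, 2), (0, 1))
print("scanned keys:   ", scanned)
print("false positives:", false_positives)
```

In this example the scan covers keys 2 to 9, and half of the scanned keys (4 to 7) lie outside the queried rectangle.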

Our Approach: MD-HBase. Build a multi-dimensional index layer on top of an ordered key-value store. The ordered key-value store (e.g. BigTable, HBase, …) provides a single-dimensional index; MD-HBase adds a multi-dimensional index on top of it.

Introduce a Multi-dimensional Index. A multi-dimensional index (e.g. the K-d tree, the Quad tree) divides a space into subspaces containing almost the same number of points and organizes the subspaces as a tree. This allows efficient subspace pruning, which avoids false-positive scans.

Space Partition by the K-d Tree. Start from the binary Z-ordering space, whose keys are formed by bitwise interleaving, e.g. x = 00, y = 11 -> 0101:
    y=11 |  0101  0111  1101  1111
    y=10 |  0100  0110  1100  1110
    y=01 |  0001  0011  1001  1011
    y=00 |  0000  0010  1000  1010
            x=00  x=01  x=10  x=11
The K-d tree partitions this space into subspaces. How do we represent these subspaces?
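
The bitwise interleaving on this slide can be sketched in a few lines of Python. The function names and the fixed bit width are assumptions made for illustration; the bit order (x bit first, most significant bit first) is chosen so that it reproduces the slide's example x = 00, y = 11 -> 0101.

```python
def interleave(x, y, bits=2):
    """Build a Z-order key by interleaving the bits of x and y (x first)."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def deinterleave(z, bits=2):
    """Recover (x, y) from a Z-order key produced by interleave()."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


key = interleave(0b00, 0b11)
print(format(key, "04b"))   # key for x = 00, y = 11
print(deinterleave(0b1000)) # coordinates behind key 1000
```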

Key Idea: The Longest Common Prefix Naming Scheme. Subspaces are represented as the longest common prefix of their keys, e.g. 000* and 1***. Remarkable property: the name preserves the boundary information of the original space. For 1***, replacing every * with 0 gives 1000, the left-bottom corner (10, 00); replacing every * with 1 gives 1111, the right-top corner (11, 11).
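
The boundary reconstruction described on this slide can be sketched as follows. The function name prefix_bounds and the string representation of prefixes are hypothetical choices for this example; the logic is only the "* -> 0" and "* -> 1" substitution from the slide followed by de-interleaving.

```python
def deinterleave(z, bits=2):
    """Split a Z-order key back into (x, y); x occupies the odd bit positions."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix, bits=2):
    """Return ((x_min, y_min), (x_max, y_max)) for a name such as '1***'."""
    known = prefix.rstrip("*")
    total = 2 * bits
    low = int(known.ljust(total, "0"), 2)   # every * replaced by 0
    high = int(known.ljust(total, "1"), 2)  # every * replaced by 1
    return deinterleave(low, bits), deinterleave(high, bits)


left_bottom, right_top = prefix_bounds("1***")
print(left_bottom, right_top)
```

For 1*** this yields the corners (10, 00) and (11, 11) in binary, i.e. the right half of the space, without consulting any data beyond the name itself.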

Build an Index with the Longest Common Prefix of Keys. The index holds one entry per subspace (000*, 001*, 01**, 1***), and a bucket is allocated per subspace.
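
One way such an index could be searched in an ordered store is a floor lookup on the smallest key of each subspace. The sketch below assumes the prefixes cover the whole space without overlap, uses a sorted Python list in place of the index table, and introduces the names build_index and find_bucket purely for illustration; it is not taken from the MD-HBase code.

```python
import bisect


def build_index(prefixes, total_bits=4):
    """Sort index entries by the smallest key each subspace can contain."""
    return sorted(
        (int(p.rstrip("*").ljust(total_bits, "0"), 2), p) for p in prefixes
    )


def find_bucket(index, z):
    """Return the subspace name whose key range contains z (floor lookup)."""
    lows = [low for low, _ in index]
    i = bisect.bisect_right(lows, z) - 1
    return index[i][1]


index = build_index(["000*", "001*", "01**", "1***"])
print(find_bucket(index, 0b0110))  # key of the point x = 01, y = 10
print(find_bucket(index, 0b1111))
```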

Multi-dimensional Range Query. The index holds the entries 000*, 001*, 01**, 10** and 11**. (1) Scan 0010-1001 on the index. (2) Filter: reconstruct the boundary information of each returned subspace and check whether it intersects the queried area (subspace pruning). (3) Scan only the surviving buckets, here 001* and 10**.
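
The three steps can be traced with the sketch below, which reuses the example index from this slide. All function names are hypothetical, and the index is a plain Python list rather than an HBase table, so this is an illustration of the pruning logic under those assumptions and not the authors' implementation.

```python
def interleave(x, y, bits=2):
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def deinterleave(z, bits=2):
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def key_range(prefix, bits=2):
    """Smallest and largest key covered by a subspace name."""
    known = prefix.rstrip("*")
    return (int(known.ljust(2 * bits, "0"), 2),
            int(known.ljust(2 * bits, "1"), 2))


def buckets_to_scan(prefixes, x_range, y_range, bits=2):
    z_lo = interleave(x_range[0], y_range[0], bits)
    z_hi = interleave(x_range[1], y_range[1], bits)
    candidates, survivors = [], []
    for p in sorted(prefixes, key=lambda name: key_range(name, bits)[0]):
        low, high = key_range(p, bits)
        if high < z_lo or low > z_hi:
            continue                      # step 1: outside the index scan
        candidates.append(p)
        (x0, y0), (x1, y1) = deinterleave(low, bits), deinterleave(high, bits)
        if (x0 <= x_range[1] and x1 >= x_range[0]
                and y0 <= y_range[1] and y1 >= y_range[0]):
            survivors.append(p)           # step 2: intersects the query
    return candidates, survivors          # step 3: scan only the survivors


candidates, survivors = buckets_to_scan(
    ["000*", "001*", "01**", "10**", "11**"], (1, 2), (0, 1))
print(candidates, survivors)
```

In this run the index scan returns three candidates, and the boundary check prunes 01**, which is exactly the block of keys 4 to 7 that caused the false positives in the linearized approach.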

K Nearest Neighbors Query. The best-first algorithm can be applied; it is the most efficient technique in practical cases. See our paper for details.
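
The slides defer the details to the paper, so the following is only a generic best-first search over buckets, not necessarily the procedure used in MD-HBase. It assumes each bucket is given as a bounding rectangle plus its points, visits buckets in order of their minimum possible distance to the query point, and stops once no remaining bucket can improve the current k-th best distance. The names min_dist and knn are hypothetical.

```python
import heapq
import math


def min_dist(q, rect):
    """Smallest possible distance from point q to any point in rect."""
    (x0, y0), (x1, y1) = rect
    dx = max(x0 - q[0], 0, q[0] - x1)
    dy = max(y0 - q[1], 0, q[1] - y1)
    return math.hypot(dx, dy)


def knn(buckets, q, k):
    """buckets: list of (rect, points). Returns (distance, point) pairs."""
    frontier = [(min_dist(q, rect), i) for i, (rect, _) in enumerate(buckets)]
    heapq.heapify(frontier)
    best = []  # max-heap of the k best so far, stored as (-distance, point)
    while frontier:
        d, i = heapq.heappop(frontier)
        if len(best) == k and d > -best[0][0]:
            break  # no unvisited bucket can contain a closer point
        for p in buckets[i][1]:
            dp = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-dp, p))
            elif dp < -best[0][0]:
                heapq.heapreplace(best, (-dp, p))
    return sorted((-nd, p) for nd, p in best)


buckets = [
    (((0, 0), (1, 1)), [(0, 0), (1, 1)]),
    (((2, 0), (3, 1)), [(2, 0), (3, 1)]),
    (((0, 2), (1, 3)), [(0, 3)]),
    (((2, 2), (3, 3)), [(3, 3)]),
]
result = knn(buckets, (3, 0), 2)
print(result)
```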

Variations of the Storage Layer.
Table Share Model: uses a single table and maintains bucket boundaries. Most space-efficient, but bucket co-location may cause disk access congestion.
Table per Bucket Model: allocates a table per bucket. Most flexible mapping (one-to-one, one-to-many, many-to-one), but a bucket split is expensive because all points are copied to the new buckets.
Region per Bucket Model: allocates a region per bucket. Most efficient bucket split (asynchronous), but requires modification of HBase.

Experimental Results: Multi-dimensional Range Query. Dataset: 400,000,000 points. Queries: select objects within MD ranges, varying the selectivity. Cluster size: 4 nodes. MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to selectivity.

Experimental Results: k Nearest Neighbors Query. Dataset: 400,000,000 points. Queries: choose a point and vary the number of neighbors. Cluster size: 4 nodes. MD-HBase responds in 1.5 sec when k ≤ 100, and in 11 sec even when k = 10,000.

Experimental Results: Insert. Dataset: spatially skewed data generated by a Zipfian distribution. MD-HBase shows good scalability without significant overhead.

Conclusions. Designed a scalable multi-dimensional data store that offers both scalability and efficient multi-dimensional queries. Key idea: indexing the longest common prefix of keys, which easily extends general ordered key-value stores. Demonstrated scalable insert throughput and excellent query performance. Range query: 10-100 times faster than existing technologies. kNN query: 1.5 s when k ≤ 100. Insert: 220K inserts/sec on a 16-node cluster without overhead. Thank you. Any questions?
