MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
  Shoji Nishimura (NEC Service Platforms Labs.)
  Sudipto Das, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara)
  Work done as a visiting researcher at UCSB
  Appeared in MDM 2011, Luleå, Sweden
Overview
  • A Motivating Story
  • Existing Technologies
  • Our Proposal
  • Evaluation
  • Conclusion
Motivating Scenario: Mobile Coupon Distribution
  • [Figure: subscribers' current locations flow to a mobile coupon distributor, which issues coupons under a distribution policy (area, # of coupons).]
Motivating Scenario: Mobile Coupon Distribution
  • 125,000,000 subscribers in Japan, each with a current location
  • Distribution policy: area, # of coupons
  • System scalability: large amounts of data, high throughput
  • Efficient complex queries: multi-dimensional queries, nearest neighbors queries
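Existing Technologies
  • [Figure: existing technologies compared by scalability and multi-dimensional query support: key-value stores; relational DBs and spatial DBs; open source products; commercial products, but expensive.]
  • What we want: scalability and multi-dimensional queries at a reasonable price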
Ordered Key-Value Stores
  • Sorted by key
  • Good at 1-D range queries
  • Examples: BigTable, HBase
  • [Figure: an index over keys (key00, key11, ..., keynn) pointing to buckets of sorted key-value pairs (key00..key0X, key11..key1Y, ..., keynn).]
  • But our target is multi-dimensional: longitude, latitude, time
Naïve Solution: Linearization
  • Projects the n-D space onto a 1-D space, e.g. by applying a Z-ordering curve
  • Simple, but problematic…
  • Z-order values of the 4 x 4 example space:
      y=11 |  5   7  13  15
      y=10 |  4   6  12  14
      y=01 |  1   3   9  11
      y=00 |  0   2   8  10
           +---------------
        x=   00  01  10  11
Problem: False Positive Scans
  • An MD query on the linearized space:
    1. Translate the MD query into a linearized range query, e.g. a query from 2 to 9.
    2. Scan the queried linearized range.
    3. Filter out the points outside the queried area, e.g. the blue-hatched area (4 to 7).
  • Filtering requires the boundary information of the original space.
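
The false positives in the example above can be reproduced with a small sketch. It assumes the query rectangle is x in [01, 10], y in [00, 01] (the rectangle whose corner keys are 2 and 9) and the x-bit-first interleaving shown in the deck's binary example; the function names are illustrative, not taken from MD-HBase.

```python
def interleave(x, y, bits=2):
    """Z-order key: interleave bits of x and y, x bit first (assumed order)."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def linearized_scan(x_range, y_range, bits=2):
    """Scan the 1-D key range of a 2-D query and separate real hits
    from false positives (keys in the range but outside the rectangle)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    z_lo = interleave(x_lo, y_lo, bits)
    z_hi = interleave(x_hi, y_hi, bits)
    # Inverse mapping key -> (x, y) for the whole (tiny) space.
    side = 1 << bits
    inverse = {interleave(x, y, bits): (x, y)
               for x in range(side) for y in range(side)}
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = inverse[z]
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)
    return hits, false_positives


# Assumed query rectangle: x in [1, 2], y in [0, 1] -> key range 2..9.
hits, fps = linearized_scan((1, 2), (0, 1))
print(hits)  # [2, 3, 8, 9]
print(fps)   # [4, 5, 6, 7]
```

Half of the scanned keys are discarded here, which is the cost the multi-dimensional index is meant to avoid.
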
Our Approach: MD-HBase
  • Build a multi-dimensional index layer on top of an ordered key-value store
  • [Figure: MD-HBase places a multi-dimensional index above the single-dimensional index of an ordered key-value store (e.g. BigTable, HBase, …).]
Introduce Multi-dimensional Index
  • Multi-dimensional index (e.g. the K-d tree, the Quad tree)
    - Divides a space into subspaces containing almost the same number of points
    - Organizes the subspaces as a tree
  • Efficient subspace pruning avoids false positive scans
Space Partition by the K-d Tree
  • Binary Z-ordering space: keys are built by bitwise interleaving, e.g. x = 00, y = 11 -> 0101
  • [Figure: the 4 x 4 binary Z-ordering space (keys 0000 to 1111, both axes labeled 00 to 11) next to the same space partitioned by the K-d tree.]
  • How do we represent these subspaces?
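
A minimal sketch of the bitwise interleaving, assuming the x-bit-first order implied by the example (x = 00, y = 11 -> 0101). The names `interleave` and `deinterleave` are illustrative.

```python
def interleave(x, y, bits=2):
    """Build a Z-order key by interleaving the bits of x and y.
    Assumption: the x bit comes first in each pair, most significant first."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def deinterleave(key, bits=2):
    """Recover (x, y) from a Z-order key built by interleave()."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


print(format(interleave(0b00, 0b11), "04b"))  # 0101
print(deinterleave(0b1000))                   # (2, 0)
```
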
Key Idea: The Longest Common Prefix Naming Scheme
  • Subspaces are represented as the longest common prefix of their keys, e.g. 000*, 1***
  • Remarkable property: the name preserves the boundary information of the original space
  • Example for 1***:
    - Left-bottom corner: * -> 0 gives 1000, i.e. (10, 00)
    - Right-top corner: * -> 1 gives 1111, i.e. (11, 11)
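
The boundary reconstruction above can be sketched as follows: pad the prefix with 0s for the left-bottom corner and with 1s for the right-top corner, then undo the interleaving. The helper names and the string representation of prefixes are assumptions made for illustration.

```python
def deinterleave(key, bits=2):
    """Recover (x, y) from a Z-order key (x bit first in each pair)."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix, bits=2):
    """Return ((x_lo, y_lo), (x_hi, y_hi)) for a subspace name such as '1***'."""
    width = 2 * bits
    stem = prefix.rstrip("*")
    low = int(stem.ljust(width, "0"), 2)   # every * replaced by 0
    high = int(stem.ljust(width, "1"), 2)  # every * replaced by 1
    return deinterleave(low, bits), deinterleave(high, bits)


print(prefix_bounds("1***"))  # ((2, 0), (3, 3)), i.e. (10, 00) and (11, 11)
print(prefix_bounds("01**"))  # ((0, 2), (1, 3))
```
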
Build an Index with the Longest Common Prefix of Keys
  • Index entries: 000*, 001*, 01**, 1***
  • One bucket is allocated per subspace
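
One possible way to hold such an index in sorted order, sketched under assumptions: each entry is stored under the lowest key of its subspace, so a point's bucket is the last entry whose lowest key is not greater than the point's key. The class `PrefixIndex` and its `split` method are hypothetical and not the MD-HBase implementation.

```python
import bisect


class PrefixIndex:
    """Hypothetical sorted index over subspace names such as '01**'."""

    def __init__(self, prefixes, bits=2):
        self.width = 2 * bits
        entries = sorted((self._low(p), p) for p in prefixes)
        self.lows = [low for low, _ in entries]
        self.prefixes = [p for _, p in entries]

    def _low(self, prefix):
        # Lowest key of the subspace: every * replaced by 0.
        return int(prefix.rstrip("*").ljust(self.width, "0"), 2)

    def bucket_for(self, key):
        """Find the subspace containing a Z-order key."""
        i = bisect.bisect_right(self.lows, key) - 1
        return self.prefixes[i]

    def split(self, prefix):
        """Split a subspace in two by extending its prefix by one bit."""
        i = self.prefixes.index(prefix)
        stem = prefix.rstrip("*")
        if len(stem) >= self.width:
            raise ValueError("subspace is a single cell and cannot be split")
        pad = "*" * (self.width - len(stem) - 1)
        children = [stem + "0" + pad, stem + "1" + pad]
        self.prefixes[i:i + 1] = children
        self.lows[i:i + 1] = [self._low(c) for c in children]


index = PrefixIndex(["000*", "001*", "01**", "1***"])
print(index.bucket_for(0b0101))  # 01**
index.split("1***")
print(index.prefixes)            # ['000*', '001*', '01**', '10**', '11**']
```

Keeping the entries sorted by lowest key is what lets an ordered key-value store serve the index lookup with an ordinary 1-D scan.
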
Multi-dimensional Range Query
  • Scan 0010 to 1001 on the index (entries 000*, 001*, 01**, 10**, 11**)
  • Filter: reconstruct the boundary information of each subspace and check whether it intersects the queried area (subspace pruning)
  • Scan only the intersecting buckets: 001* and 10**
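
The index scan and the pruning step can be sketched together. This is an illustration under assumptions: the query rectangle is x in [01, 10], y in [00, 01] (key range 0010 to 1001), prefixes are plain strings, and all function names are hypothetical.

```python
def interleave(x, y, bits=2):
    """Z-order key, x bit first in each pair (assumed order)."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def deinterleave(key, bits=2):
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def range_query(prefixes, x_range, y_range, bits=2):
    """Return (buckets to scan, buckets pruned) for a rectangular query."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    width = 2 * bits
    z_lo, z_hi = interleave(x_lo, y_lo, bits), interleave(x_hi, y_hi, bits)
    scanned, pruned = [], []
    for prefix in prefixes:
        stem = prefix.rstrip("*")
        low = int(stem.ljust(width, "0"), 2)
        high = int(stem.ljust(width, "1"), 2)
        if high < z_lo or low > z_hi:
            continue  # outside the scanned index range
        # Reconstruct the subspace boundary from its name.
        bx_lo, by_lo = deinterleave(low, bits)
        bx_hi, by_hi = deinterleave(high, bits)
        if bx_hi < x_lo or bx_lo > x_hi or by_hi < y_lo or by_lo > y_hi:
            pruned.append(prefix)   # no overlap with the queried area
        else:
            scanned.append(prefix)
    return scanned, pruned


scanned, pruned = range_query(
    ["000*", "001*", "01**", "10**", "11**"], (1, 2), (0, 1))
print(scanned)  # ['001*', '10**']
print(pruned)   # ['01**']
```
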
K Nearest Neighbors Query
  • The best-first algorithm can be applied; it is the most efficient technique in practical cases
  • See our paper for the details
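
A generic best-first search over prefix-named buckets, as a sketch only: the deck defers the actual algorithm to the paper, so the bucket layout, the sample points, and the function names below are all assumptions. Buckets and points share one priority queue ordered by (squared) distance to the query point; a bucket's distance is the minimum distance to its reconstructed boundary.

```python
import heapq


def deinterleave(key, bits=2):
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix, bits=2):
    stem = prefix.rstrip("*")
    low = int(stem.ljust(2 * bits, "0"), 2)
    high = int(stem.ljust(2 * bits, "1"), 2)
    return deinterleave(low, bits), deinterleave(high, bits)


def min_dist2(q, lo, hi):
    """Squared distance from q to the nearest point of the box [lo, hi]."""
    dx = max(lo[0] - q[0], 0, q[0] - hi[0])
    dy = max(lo[1] - q[1], 0, q[1] - hi[1])
    return dx * dx + dy * dy


def knn(buckets, q, k, bits=2):
    """Best-first kNN: always expand the closest bucket or report the closest point."""
    heap = []
    for prefix in buckets:
        lo, hi = prefix_bounds(prefix, bits)
        heapq.heappush(heap, (min_dist2(q, lo, hi), 0, prefix))  # 0 = bucket
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 0:
            for p in buckets[item]:
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d2, 1, p))                 # 1 = point
        else:
            result.append(item)
    return result


# Hypothetical sample data: bucket name -> points stored in that subspace.
buckets = {
    "000*": [(0, 0), (0, 1)],
    "001*": [(1, 1)],
    "01**": [(0, 3), (1, 2)],
    "1***": [(2, 0), (3, 3)],
}
print(knn(buckets, (1, 1), 3))  # [(1, 1), (0, 1), (1, 2)]
```

Because a bucket's boundary distance never exceeds the distance to any point inside it, a point popped from the queue is guaranteed to be at least as close as everything still queued.
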
Variations of Storage Layer
  • Table Share Model
    - Uses a single table and maintains the bucket boundaries
    - Most space efficient
    - Bucket co-location may cause disk access congestion
  • Table per Bucket Model
    - Allocates a table per bucket
    - Most flexible mapping: one-to-one, one-to-many, many-to-one
    - Bucket split is expensive: all points are copied to the new buckets
  • Region per Bucket Model
    - Allocates a region per bucket
    - Most efficient bucket split: asynchronous bucket split
    - Requires modification of HBase
Experimental Results: Multi-dimensional Range Query
  • Dataset: 400,000,000 points
  • Queries: select objects within MD ranges, varying the selectivity
  • Cluster size: 4 nodes
  • MD-HBase responds 10 to 100 times faster than the others, with response time proportional to selectivity
Experimental Results: k Nearest Neighbors Query
  • Dataset: 400,000,000 points
  • Queries: choose a point and vary the number of neighbors
  • Cluster size: 4 nodes
  • MD-HBase responds in 1.5 sec where k ≦ 100, and in 11 sec even if k = 10,000
Experimental Results: Insert
  • Dataset: spatially skewed data generated by a Zipfian distribution
  • MD-HBase shows good scalability without significant overhead
Conclusions
  • Designed a scalable multi-dimensional data store
    - Scalability and efficient multi-dimensional queries
  • Key idea: indexing the longest common prefix of keys
    - Easily extends general ordered key-value stores
  • Demonstrated scalable insert throughput and excellent query performance
    - Range query: 10-100 times faster than existing technologies
    - kNN query: 1.5 s when k ≦ 100
    - Insert: 220K inserts/sec on a 16-node cluster without overhead
  Thank you. Any questions?


Editor's Notes

  • #4 Animate this slide
  • #5 Scalability for data size (# of users). Continuously generated data needs high insertion throughput (# of users, data collection frequency). Efficient complex query performance: multi-dimensional range queries, k nearest neighbor queries. Near real-time: data goes stale easily.
  • #9 Synchronize text and figures
  • #12 Put an example