A quick introduction to the EventQL architecture.
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014. Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through techniques like delta encoding, dictionary encoding, and run-length encoding.
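As a rough illustration of one of these techniques, here is a minimal run-length encoding sketch in Python. This is illustrative only; Parquet's actual encoding is a more elaborate RLE/bit-packing hybrid, and the function names are made up for the example.

```python
# Minimal run-length encoding sketch (illustrative; not Parquet's
# actual RLE/bit-packing hybrid, which also bit-packs short runs).

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back to the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

column = [0, 0, 0, 1, 1, 0, 0, 0, 0]
encoded = rle_encode(column)        # [(0, 3), (1, 2), (0, 4)]
assert rle_decode(encoded) == column
```

Sorted or low-cardinality columns, which columnar layouts naturally produce, are exactly where runs like these become long and the encoding pays off.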
We are thrilled to announce the general availability of the Cloudera AI Inference service, powered by NVIDIA NIM microservices, part of the NVIDIA AI Enterprise platform, to accelerate generative AI deployments for enterprises. This service supports a range of optimized AI models, enabling seamless and scalable AI inference. Background: The generative AI landscape is evolving […]
CitusDB, which develops a column-oriented database, has open-sourced cstore_fdw, a foreign data wrapper that adds columnar storage support to PostgreSQL, so I gave installing it a try. Features of cstore_fdw: the cstore_fdw page on GitHub summarizes them. http://citusdata.github.io/cstore_fdw/ In bullet form: Faster Analytics – Reduce analytics query disk and memory use by 10x; Lower Storage – Compress data by 3x; Easy Setup – Deploy as standard PostgreSQL extension; Flexibility – Mix row- and column-based tables
There has been a lot of talk recently about hybrid column-store/row-store database systems. This is likely due to many announcements along these lines in the past month, such as Vertica's recent 3.5 release, which contained FlexStore; Oracle's recent revelation that Oracle Database 11g Release 2 uses column-oriented storage for the purposes of superior compression; and VectorWise's recent decloaking.
There are three forms of columnar orientation currently deployed by database systems today. Each builds upon the previous one. The simplest form uses column orientation to provide better data compression. The next level of maturity stores columnar data in separate structures to support columnar projection. The most mature implementations support a columnar database engine that performs relational algebra directly on the columnar representation.
Parquet is a columnar storage format for Hadoop data. It was developed by Twitter and Cloudera to optimize storage and querying of large datasets. By storing data by column, Parquet provides more efficient compression and I/O than traditional row-based formats. Early results show a 28% reduction in storage size and up to a 114% improvement in query performance versus the original Thrift format.
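A small sketch of what "storing data by column" buys you, using the pyarrow library (an assumption; the blurb does not name a client library, and the file name is made up):

```python
# Columnar projection with Parquet via pyarrow (assumes `pyarrow`
# is installed; the data and file name are illustrative).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "JP"],
    "clicks":  [10, 3, 7, 12],
})
pq.write_table(table, "events.parquet")

# Only the "clicks" column is read from disk -- the I/O saving that
# a row-based format like the original Thrift layout cannot offer.
clicks = pq.read_table("events.parquet", columns=["clicks"])
print(clicks["clicks"].to_pylist())  # [10, 3, 7, 12]
```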
Ville Tuulos, Principal Engineer @ AdRoll, ville.tuulos@adroll.com. We faced the key technical challenge of modern Business Intelligence: how do you query tens of billions of events interactively? Our solution, DeliRoll, is implemented in Python. Everyone knows that Python is SLOW. You can't handle big data with low latency in Python! Small benchmark data: 1.5 billion rows, 400 columns, 660 GB. […]
For many companies, understanding what is going on in the business involves lots of data. But how do you query tens of billions of data points? How can a company begin to make sense of so much information? Ville Tuulos, Principal Engineer at AdRoll, a company producing tons of big data, demonstrates how AdRoll uses Python to squeeze every bit of performance out of a single high-end server. […]
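The talk itself is not excerpted here, but a guess at the general trick: replacing per-row Python loops with vectorized scans over whole columns, which pushes the inner loop into C. NumPy is an assumption for the sketch, and the data is made up:

```python
# Vectorized columnar scan vs. a per-row Python loop (illustrative;
# requires `numpy`, data is synthetic).
import numpy as np

clicks = np.random.randint(0, 100, size=1_000_000)

# Pure-Python row loop: a million interpreter iterations.
slow = sum(1 for c in clicks if c > 90)

# Columnar, vectorized: one pass in C over a contiguous array.
fast = int((clicks > 90).sum())
assert slow == fast
```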
ORC File Format: File Structure; Stripe Structure; HiveQL Syntax; Serialization and Compression; Integer Column Serialization; String Column Serialization; Compression. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
In byte-dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block.
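A minimal sketch of that behavior in Python, illustrative only and not Amazon Redshift's actual on-disk code; note how values beyond the 256-entry dictionary fall back to raw storage:

```python
# Illustrative byte-dictionary encoding for one block of values.
def byte_dict_encode(block_values, max_entries=256):
    dictionary = []            # up to 256 distinct values
    encoded = []               # one-byte indexes, or ("raw", value)
    index = {}
    for v in block_values:
        if v in index:
            encoded.append(index[v])
        elif len(dictionary) < max_entries:
            index[v] = len(dictionary)
            dictionary.append(v)
            encoded.append(index[v])
        else:
            encoded.append(("raw", v))   # dictionary full: store raw
    return dictionary, encoded

dictionary, encoded = byte_dict_encode(["NY", "CA", "NY", "NY", "CA"])
# dictionary == ["NY", "CA"]; encoded == [0, 1, 0, 0, 1]
```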
Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often. A separate dictionary of unique words is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the first 245 unique words in the column. Those words are replaced on disk by a one-byte index value representing one of the 245 words, and any words not represented in the dictionary are stored uncompressed.
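The word-level twist is what distinguishes this from the byte-dictionary scheme above; a short illustrative sketch (again, not Redshift's implementation):

```python
# Sketch of the text255 idea: the first 245 distinct words per block
# get one-byte indexes; later words are left raw.
def text255_encode(rows, max_words=245):
    words = []                 # per-block word dictionary
    index = {}
    encoded_rows = []
    for row in rows:
        out = []
        for w in row.split():
            if w not in index and len(words) < max_words:
                index[w] = len(words)
                words.append(w)
            out.append(index.get(w, w))   # index if known, else raw word
        encoded_rows.append(out)
    return words, encoded_rows

words, enc = text255_encode(["free shipping today", "free returns today"])
# words == ['free', 'shipping', 'today', 'returns']
# enc   == [[0, 1, 2], [0, 3, 2]]
```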
Mostly encodings are useful when the data type for a column is larger than most of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to 8-bit storage.
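An illustrative sketch of the split between "fits in the smaller width" and "kept raw" (made-up names, not the Redshift implementation):

```python
# "Mostly" encoding sketch: values that fit in one signed byte are
# stored small; outliers are kept at full width.
def mostly8_encode(values):
    small, raw = [], []
    for i, v in enumerate(values):
        if -128 <= v <= 127:          # fits in 8 bits
            small.append((i, v))
        else:
            raw.append((i, v))        # stored uncompressed
    return small, raw

small, raw = mostly8_encode([3, 7, 100000, 12])
# small == [(0, 3), (1, 7), (3, 12)]; raw == [(2, 100000)]
```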
Introduction: Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access. Wait, what? Random, realtime read/write access? How is that possible? Isn't Hadoop just a sequential read/write, batch processing system? Yes, we're talking about the same thing, and in the next few paragraphs, I'm going to explain to you how HBase achieves it.
Delta encodings are very useful for datetime columns. Delta encoding compresses data by recording the difference between values that follow each other in the column. This difference is recorded in a separate dictionary for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) For example, suppose that the column contains 10 integers in sequence from 1 to 10. The first value is stored as-is, and each of the nine values that follow is stored as the difference 1 from the value before it.
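A minimal sketch matching the 1-to-10 example above (illustrative Python, not Redshift's on-disk layout):

```python
# Delta encoding: keep the first value, then store differences.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

column = list(range(1, 11))           # 1..10, as in the example
assert delta_encode(column) == [1] + [1] * 9
assert delta_decode(delta_encode(column)) == column
```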
Google has published a document explaining the mechanism behind BigQuery's speed: a columnar data store and a tree structure. BigQuery handles SQL queries against more than 300 million records and returns results within 10 seconds using a full scan, with no indexes. "An Inside Look at Google BigQuery" (PDF), a document explaining its internals, has been released. Internally, Google had built a large-scale query service codenamed Dremel, and in 2010 it published the paper "Dremel: Interactive Analysis of Web-Scale Datasets" describing it. BigQuery is an implementation of Dremel made available externally. […]
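A toy sketch of the tree idea described in those documents: leaf servers scan column partitions in parallel and emit partial aggregates, and mixer nodes combine them on the way up. Names, shapes, and the aggregate are all illustrative, not Dremel's actual interfaces:

```python
# Tree-style aggregation sketch (illustrative).
def leaf_scan(partition):
    """Full scan of one column partition; no index involved."""
    return sum(partition), len(partition)

def mixer(partials):
    """Combine partial (sum, count) aggregates from children."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

partitions = [[3, 5], [2, 8, 1], [7]]   # column data spread over leaves
total, count = mixer([leaf_scan(p) for p in partitions])
print(total / count)                    # average over all rows
```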
Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMSs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as in several commercial analytical databases. The goal is to keep I/O to a minimum by reading from disk only the data required for the query. Using Parquet at Twitter, […]
The ongoing progress in Artificial Intelligence is constantly expanding the realms of possibility, revolutionizing industries and societies on a global scale. The release of LLMs surged by 136% in 2023 compared to 2022, and this upward trend is projected to continue in 2024. Today, 44% of organizations are experimenting with generative AI, with 10% having […]