An Overview of the Parquet Format
Parquet is a convenient file format, and it is no exaggeration to call it one of the de facto standards among column-oriented formats.
However, unlike JSON or CSV, you cannot tell how a Parquet file is structured just by looking at it.
This article describes the concrete structure of a Parquet file.
Introduction
This post examines the structure of Parquet by looking at the actual bytes of a file.
Because it focuses on the overall layout, it does not cover individual encoding and compression schemes such as delta encoding or run-length encoding in general.
Note: the Parquet files were created with https://github.com/parquet-go/parquet-go, but no knowledge of Go is required.
TL;DR
Parquet has the following structure.
- A file is divided into RowGroups and metadata
- A RowGroup contains Columns
- A Column contains Pages
- A Page contains the actual data
- Parquet flattens all nesting and treats every leaf field as a separate column (nested objects and arrays are not stored in nested form); the definition level and the repetition level make this possible
For example, writing the data {col1: "val1", col2: {col3: "val2"}} to Parquet yields a file structured as follows.
{
  RowGroups: [
    { Column(for "col1"): [{ Pages: [{ Header (Thrift format), Values: [ "val1", ... ] }] }] },
    { Column(for "col2.col3"): [{ Pages: [{ Header, Values: [ "val2", ... ] }] }] }
  ],
  MetaData (Thrift)
}
Features of Parquet
Parquet is a popular format, and anyone interested in its internals probably knows its features already.
Only the features needed for this post are introduced here.
- It is column-oriented
- It is based on the file format of Dremel (an internal Google tool)
Hexdumping a Parquet file
First, we create a file whose structure we can inspect.
Expressed as JSON, its content is as follows. (The text column uses PLAIN encoding so that the values are easy to spot in the dump.)
[ {"text": "text1"}, {"text": "text2"}, {"text": "text3"}, {"text": "text4"}, {"text": "text5"}, ]
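Go code for creating the file
The file is created with parquet-go, using code like the following.
package main

import (
	"github.com/parquet-go/parquet-go"
)

// MyTypeSimple has a single string column named "text" stored with PLAIN encoding.
type MyTypeSimple struct {
	Text string `parquet:"text,plain"`
}

// write1 writes five rows to ./simple.parquet.
func write1() {
	v := []MyTypeSimple{
		{Text: "text1"},
		{Text: "text2"},
		{Text: "text3"},
		{Text: "text4"},
		{Text: "text5"},
	}
	err := parquet.WriteFile("./simple.parquet", v)
	if err != nil {
		panic(err)
	}
}

func main() {
	write1()
}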
The resulting file (307 bytes) looks like this.
$ hexdump -C simple.parquet
00000000 50 41 52 31 15 06 15 5a 15 5a 15 ab a3 84 b4 03 |PAR1...Z.Z......|
00000010 4c 15 0a 15 00 15 0a 15 00 15 00 15 00 12 00 00 |L...............|
00000020 05 00 00 00 74 65 78 74 31 05 00 00 00 74 65 78 |....text1....tex|
00000030 74 32 05 00 00 00 74 65 78 74 33 05 00 00 00 74 |t2....text3....t|
00000040 65 78 74 34 05 00 00 00 74 65 78 74 35 19 12 00 |ext4....text5...|
00000050 19 18 05 74 65 78 74 31 19 18 05 74 65 78 74 35 |...text1...text5|
00000060 15 00 19 16 00 00 19 1c 16 08 15 92 01 16 00 00 |................|
00000070 00 15 04 19 2c 48 0c 4d 79 54 79 70 65 53 69 6d |....,H.MyTypeSim|
00000080 70 6c 65 15 02 00 15 0c 25 00 18 04 74 65 78 74 |ple.....%...text|
00000090 25 00 4c 1c 00 00 00 16 0a 19 1c 19 1c 26 00 1c |%.L..........&..|
000000a0 15 0c 19 15 00 19 18 04 74 65 78 74 15 00 16 0a |........text....|
000000b0 16 92 01 16 92 01 26 08 3c 58 05 74 65 78 74 35 |......&.<X.text5|
000000c0 18 05 74 65 78 74 31 00 19 1c 15 06 15 00 15 02 |..text1.........|
000000d0 00 00 16 cc 01 15 16 16 9a 01 15 32 00 16 92 01 |...........2....|
000000e0 16 0a 19 0c 16 08 16 92 01 00 19 0c 18 37 67 69 |.............7gi|
000000f0 74 68 75 62 2e 63 6f 6d 2f 70 61 72 71 75 65 74 |thub.com/parquet|
00000100 2d 67 6f 2f 70 61 72 71 75 65 74 2d 67 6f 20 76 |-go/parquet-go v|
00000110 65 72 73 69 6f 6e 20 30 2e 32 33 2e 30 28 62 75 |ersion 0.23.0(bu|
00000120 69 6c 64 20 29 19 1c 1c 00 00 00 ba 00 00 00 50 |ild )..........P|
00000130 41 52 31 |AR1|
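In the ASCII column you can see the data text1, text2, text3, ... starting from the third row.
We will use this file to unravel the contents of Parquet.
Note: for readability, this article gives each byte position as a 1-based ordinal followed by the 0-based hexadecimal offset in parentheses. For example, the 16th byte (03) is written as 16 (0x0f). (In the dump, each row runs from column 0 to column f.)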
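Metadata and file structure
As the diagram in the official documentation shows, a Parquet file consists of Magic Number, RowGroup, Footer, Footer length, and Magic Number, in that order.
In this chapter we look at the data structure of Parquet based on the file created in the previous chapter.
The first thing to notice is the Magic Number at the beginning and at the end of the file.
In the hexdump from the previous chapter, the string "PAR1" (50 41 52 31) appears at both the beginning and the end.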
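Footer length
Reading a Parquet file starts with the metadata for the whole file, which is stored in the Footer.
The length of the Footer is stored in the 8th through 5th bytes from the end of the file (the 4 bytes just before the trailing Magic Number) as a little-endian integer, and the position of the Footer is calculated from it.
In the hexdump from the previous chapter, the Footer length is 186 (ba 00 00 00), found at bytes 300 to 303 (0x12b to 0x12e).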
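The lookup described above can be sketched in Go using only the standard library. The function name footerLength is made up for this illustration, and the input is simply the last 8 bytes copied from the hexdump; this is a minimal sketch, not how parquet-go itself reads the file.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// footerLength (hypothetical helper) reads the 4-byte little-endian footer
// length that sits immediately before the trailing "PAR1" magic number.
// tail must hold at least the last 8 bytes of the file.
func footerLength(tail []byte) (uint32, error) {
	if len(tail) < 8 {
		return 0, errors.New("need at least the last 8 bytes of the file")
	}
	last8 := tail[len(tail)-8:]
	if !bytes.Equal(last8[4:], []byte("PAR1")) {
		return 0, errors.New("trailing magic number is not PAR1")
	}
	return binary.LittleEndian.Uint32(last8[:4]), nil
}

func main() {
	// Last 8 bytes of simple.parquet, copied from the hexdump.
	tail := []byte{0xba, 0x00, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31}
	n, err := footerLength(tail)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 186
}
```

Checking the magic number first is a cheap way to reject files that are not Parquet before trusting the length field.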
Getting the Footer (FileMetaData)
Now that the Footer length is known, we can extract the file metadata from the Footer.
That said, the Footer contains nothing but the metadata, so extracting the Footer as-is yields the metadata.
In the hexdump from the previous chapter, the length is 186 bytes, so bytes 114 (0x71) to 299 (0x12a) are the metadata (the part from 15 04 19 2c to 1c 00 00 00).
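The offset arithmetic can be written out as a small Go sketch. footerRange is a name invented here; it returns 0-based offsets, which correspond to the hexadecimal values in parentheses in this article.

```go
package main

import "fmt"

// footerRange (hypothetical helper) returns the 0-based offsets of the first
// and last footer byte. The end of the file is laid out as:
// footer | 4-byte footer length | 4-byte "PAR1".
func footerRange(fileSize, footerLen int64) (first, last int64) {
	last = fileSize - 8 - 1
	first = last - footerLen + 1
	return first, last
}

func main() {
	first, last := footerRange(307, 186)
	fmt.Printf("0x%x 0x%x\n", first, last) // 0x71 0x12a
}
```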
Parquet metadata is serialized with Thrift's compact protocol (TCompactProtocol).
Reading the metadata according to the definitions in the parquet-format repository shows that it contains the following information.
{
  "version": 2,
  "schema": [
    { "name": "MyTypeSimple", "num_children": 1 },
    {
      "type": "BYTE_ARRAY",
      "repetition_type": "REQUIRED",
      "name": "text",
      "converted_type": "UTF8",
      "logicalType": { "STRING": {} }
    }
  ],
  "num_rows": 5,
  "row_groups": [
    {
      "columns": [
        {
          "file_offset": 0,
          "meta_data": {
            "type": "BYTE_ARRAY",
            "encodings": ["PLAIN"],
            "path_in_schema": ["text"],
            "codec": "UNCOMPRESSED",
            "num_values": 5,
            "total_uncompressed_size": 73,
            "total_compressed_size": 73,
            "data_page_offset": 4,
            "statistics": { "max_value": "text5", "min_value": "text1" },
            "encoding_stats": [
              { "page_type": "DATA_PAGE_V2", "encoding": "PLAIN", "count": 1 }
            ]
          },
          "offset_index_offset": 102,
          "offset_index_length": 11,
          "column_index_offset": 77,
          "column_index_length": 25
        }
      ],
      "total_byte_size": 73,
      "num_rows": 5,
      "file_offset": 4,
      "total_compressed_size": 73
    }
  ],
  "created_by": "github.com/parquet-go/parquet-go version 0.23.0(build )",
  "column_orders": [ { "TYPE_ORDER": {} } ]
}
The metadata holds a variety of information, but for this article the information under columns is what matters.
Getting the contents of RowGroup, Column, and Page
In the previous section we obtained the file metadata.
Next we get the information about the column.
The metadata has a row_groups field, which in turn contains a columns field.
The columns field is an array, but since this example has only the text column, it holds a single entry.
This column entry has fields giving the start of the first Page (data_page_offset=4) and the total number of bytes of the column (total_compressed_size=73).
Together they indicate where the data of the text column is stored.
In this example, that is bytes 5 (0x4) to 77 (0x4c) of the file.
Here is the hexdump of the file again.
$ hexdump -C simple.parquet
00000000 50 41 52 31 15 06 15 5a 15 5a 15 ab a3 84 b4 03 |PAR1...Z.Z......|
00000010 4c 15 0a 15 00 15 0a 15 00 15 00 15 00 12 00 00 |L...............|
00000020 05 00 00 00 74 65 78 74 31 05 00 00 00 74 65 78 |....text1....tex|
00000030 74 32 05 00 00 00 74 65 78 74 33 05 00 00 00 74 |t2....text3....t|
00000040 65 78 74 34 05 00 00 00 74 65 78 74 35 19 12 00 |ext4....text5...|
00000050 19 18 05 74 65 78 74 31 19 18 05 74 65 78 74 35 |...text1...text5|
00000060 15 00 19 16 00 00 19 1c 16 08 15 92 01 16 00 00 |................|
00000070 00 15 04 19 2c 48 0c 4d 79 54 79 70 65 53 69 6d |....,H.MyTypeSim|
00000080 70 6c 65 15 02 00 15 0c 25 00 18 04 74 65 78 74 |ple.....%...text|
00000090 25 00 4c 1c 00 00 00 16 0a 19 1c 19 1c 26 00 1c |%.L..........&..|
000000a0 15 0c 19 15 00 19 18 04 74 65 78 74 15 00 16 0a |........text....|
000000b0 16 92 01 16 92 01 26 08 3c 58 05 74 65 78 74 35 |......&.<X.text5|
000000c0 18 05 74 65 78 74 31 00 19 1c 15 06 15 00 15 02 |..text1.........|
000000d0 00 00 16 cc 01 15 16 16 9a 01 15 32 00 16 92 01 |...........2....|
000000e0 16 0a 19 0c 16 08 16 92 01 00 19 0c 18 37 67 69 |.............7gi|
000000f0 74 68 75 62 2e 63 6f 6d 2f 70 61 72 71 75 65 74 |thub.com/parquet|
00000100 2d 67 6f 2f 70 61 72 71 75 65 74 2d 67 6f 20 76 |-go/parquet-go v|
00000110 65 72 73 69 6f 6e 20 30 2e 32 33 2e 30 28 62 75 |ersion 0.23.0(bu|
00000120 69 6c 64 20 29 19 1c 1c 00 00 00 ba 00 00 00 50 |ild )..........P|
00000130 41 52 31 |AR1|
That is, the part from 15 06 15 5a to 65 78 74 35.
Furthermore, the column begins with the metadata for its first page (the Page Header).
This metadata is also a Thrift structure defined in parquet-format, so deserializing it in the same way as the Footer yields the following data.
{
  "type": "DATA_PAGE_V2",
  "uncompressed_page_size": 45,
  "compressed_page_size": 45,
  "crc": -457214166,
  "data_page_header_v2": {
    "num_values": 5,
    "num_nulls": 0,
    "num_rows": 5,
    "encoding": "PLAIN",
    "definition_levels_byte_length": 0,
    "repetition_levels_byte_length": 0,
    "is_compressed": false
  }
}
This Thrift structure ends at byte 32 (0x1f), so taking compressed_page_size (45) into account, bytes 33 (0x20) to 77 (0x4c) are the data itself (05 00 00 00 to 65 78 74 35).
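To make the Thrift step less opaque, the first four fields of this Page Header can be decoded by hand. The sketch below is a deliberately partial reader of the compact protocol: it only understands short-form field headers (high nibble = field id delta, low nibble = type) and the i32 type (type 5, a zigzag-encoded varint). The helper names are invented for this illustration; a real reader would use a Thrift library and the full parquet-format definitions.

```go
package main

import "fmt"

// readVarint decodes an unsigned LEB128 varint from the start of b and
// returns the value together with the number of bytes consumed.
func readVarint(b []byte) (uint64, int) {
	var v uint64
	for i, c := range b {
		v |= uint64(c&0x7f) << (7 * uint(i))
		if c&0x80 == 0 {
			return v, i + 1
		}
	}
	return v, len(b)
}

// zigzag undoes the zigzag mapping that the compact protocol applies to
// signed integers.
func zigzag(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

// decodeLeadingI32Fields (hypothetical helper) reads consecutive i32 fields
// from the start of a compact-protocol struct and stops at the first field
// of any other kind. The result maps Thrift field id to value.
func decodeLeadingI32Fields(b []byte) map[int]int64 {
	const typeI32 = 0x05
	fields := map[int]int64{}
	id, pos := 0, 0
	for pos < len(b) && b[pos]&0x0f == typeI32 && b[pos]>>4 != 0 {
		id += int(b[pos] >> 4) // high nibble: delta from the previous field id
		pos++
		u, n := readVarint(b[pos:])
		pos += n
		fields[id] = zigzag(u)
	}
	return fields
}

func main() {
	// First 12 bytes of the Page Header (offsets 0x04 to 0x0f in the dump).
	header := []byte{0x15, 0x06, 0x15, 0x5a, 0x15, 0x5a, 0x15, 0xab, 0xa3, 0x84, 0xb4, 0x03}
	f := decodeLeadingI32Fields(header)
	fmt.Println(f[1], f[2], f[3], f[4]) // 3 45 45 -457214166
}
```

Field 1 is the page type (3 is DATA_PAGE_V2 in the PageType enum of parquet-format), fields 2 and 3 are the uncompressed and compressed page sizes, and field 4 is the crc, which agrees with the JSON shown above.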
Getting the data
In the previous section we found that bytes 33 (0x20) to 77 (0x4c) hold the data.
The data bytes are 05 00 00 00 74 65 78 74 31 05 00 00 00 74 65 78 74 32...
Because this file uses PLAIN encoding, each value starts with a 4-byte little-endian length, followed by the characters themselves.
In other words, the first value is a 5-byte text (74 65 78 74 31 = text1), the next is another 5-byte text, and so on.
So the data contains [text1, text2, text3, text4, text5].
The following data is what we put into the file, so the data was retrieved correctly.
[ {"text": "text1"}, {"text": "text2"}, {"text": "text3"}, {"text": "text4"}, {"text": "text5"}, ]
We can access the data
We have now extracted data from Parquet.
Along the way we saw that in Parquet "the values of a column are stored contiguously" and that "the data can be reached by following the metadata".
That gives us the rough structure.
But there is more.
Parquet also supports nesting and arrays.
Extracting nested or array data requires digging one level deeper.
To go further, the discussion continues in a separate chapter.
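Handling complex data structures
In the previous chapter we looked at the structure of a Parquet file holding simple data.
However, Parquet can also handle nesting and arrays.
Thanks to mechanisms called the definition level and the repetition level, such fields are represented as single columns, just like ordinary fields.
That is, even when the data has nesting and arrays, as below,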
{ col1: "val1", col2: {col3: "val3"}, col4: {col5: {col6: [1,2,3]}} }
it is stored at a single flat level, as below.
{ "col1": ["val1"], "col2.col3": ["val3"], "col4.col5.col6": [1,2,3] }
In this chapter we learn how the definition level and the repetition level work, and then read a Parquet file with a complex data structure.
Incidentally, I could not find an explanation of these two levels in the Parquet documentation, but the Dremel paper explains them:
https://research.google/pubs/dremel-interactive-analysis-of-web-scale-datasets-2/
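definition level and repetition level
First, let us explain how the definition level and the repetition level work.
Roughly speaking,
the definition level is the number of fields along the path (the field itself and its ancestors) that are nullable (optional) and yet are not null, and
the repetition level indicates the level of the field at which a repetition occurred.
That probably does not make sense yet, so examples follow.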
definition level
First, the definition level.
Suppose the data is the following, expressed as JSON. Every field is optional (nullable).
{
  nest1: {
    nest2: {
      value: "hello" // D: 3
    }
  },
},
{
  nest1: {
    nest2: {
      value: null // D: 2
    }
  },
},
{
  nest1: {
    nest2: null // D: 1
  }
},
{
  nest1: null // D: 0
},
{} // D: 0
Then the definition level of each record is the value in the comment beside it (3, 2, 1, 0, 0).
The definition levels are derived as follows.
For the first value (hello), the three fields nest1, nest2, and value are optional and yet not null, so the definition level is 3.
Likewise, for the second, nest1 and nest2 are optional and not null, so it is 2 (value is null and therefore not counted).
For the third, only nest1 is not null, so it is 1.
For the last two, nest1 itself is already null or absent, so it is 0.
In this way, the definition level expresses how many optional fields along the path are actually defined.
With this mechanism, even when scanning only the target column, the definition level tells you at which ancestor the path became null.
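The counting rule can be expressed as a short Go sketch. The types Record, Nest1, and Nest2 are invented here to mirror the JSON example, with nil pointers standing in for null or missing optional fields.

```go
package main

import "fmt"

// Illustrative types mirroring the example; every field is optional,
// so each one is modelled as a pointer that may be nil.
type Nest2 struct{ Value *string }
type Nest1 struct{ Nest2 *Nest2 }
type Record struct{ Nest1 *Nest1 }

// definitionLevel counts how many optional fields on the path
// nest1.nest2.value are defined, stopping at the first nil.
func definitionLevel(r Record) int {
	if r.Nest1 == nil {
		return 0
	}
	if r.Nest1.Nest2 == nil {
		return 1
	}
	if r.Nest1.Nest2.Value == nil {
		return 2
	}
	return 3
}

func main() {
	hello := "hello"
	records := []Record{
		{Nest1: &Nest1{Nest2: &Nest2{Value: &hello}}},
		{Nest1: &Nest1{Nest2: &Nest2{}}},
		{Nest1: &Nest1{}},
		{},
		{},
	}
	var levels []int
	for _, r := range records {
		levels = append(levels, definitionLevel(r))
	}
	fmt.Println(levels) // [3 2 1 0 0]
}
```

Note that a null nest1 and a missing nest1 are indistinguishable in this model, which matches the example: both get definition level 0.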
repetition level
Next, the repetition level.
Suppose the data is the following, expressed as JSON.
{
  repeated1: [
    {
      repeated2: [
        "value1-1", // R: 0
        "value1-2", // R: 2
        "value1-3"  // R: 2
      ],
      normalField2: "v1" // R: 0
    },
    {
      repeated2: [
        "value2-1", // R: 1
        "value2-2"  // R: 2
      ],
      normalField2: "v2" // R: 1
    },
    {
      repeated2: [
        "value3-1", // R: 1
      ],
      normalField2: "v3" // R: 1
    }
  ],
  normalField1: "v4" // R: 0
},
Then the repetition level is the R value in each comment.
The repetition levels are derived as follows.
The first value, value1-1, has repetition level 0, because no repeated (array) field has repeated before it.
For the next value, value1-2: repeated1, the parent of repeated2, is also an array, so the level of repeated2 is 2. Because value1-1 precedes it, a repetition has occurred in repeated2. The repetition level of value1-2 is therefore 2, the level of repeated2.
The next value, value1-3, also follows value1-2 within repeated2, so it is 2 as well.
By contrast, for value2-1 there is no repetition in repeated2 yet, but there is one in repeated1 (earlier elements such as value1-1 exist). It therefore gets 1, the level of repeated1.
value2-2 follows value2-1 within repeated2, so it is 2.
value3-1, like value2-1, has no repetition in repeated2 but one in repeated1, so it gets 1, the level of repeated1.
A field that is not itself an array is still affected when one of its ancestors is an array.
v1 is 0, because no repeated field has repeated before it.
v2 and v3 are 1, the level of repeated1, because repeated1 is acting as an array.
v4 is 0, because none of its ancestors is an array.
In this way, the repetition level expresses the level of the field that acted as an array, that is, the one at which the repetition occurred.
With this mechanism, even when scanning only the target column, the repetition level tells you whether a value is the first element of an array, whether the parent has changed, and so on.
hexdump
Having covered the definition level and the repetition level, we can now read files that contain nesting and arrays.
Let us look at actual data in binary.
We use the following data.
[
  {
    "nest": {
      "nest": "nest1",
      "repeated": [
        {"nest": {"repeated": ["nestRep1", "nestRep2", "nestRep3"]}},
        {"nest": {"repeated": ["nestRep4", "nestRep5"]}},
      ]
    },
  },
  {
    "nest": {
      "nest": "nest2",
      "repeated": [
        {"nest": {"repeated": ["nestRep6"]}},
      ]
    },
  },
  {
    "nest": {
      "nest": null,
      "repeated": []
    },
  },
]
Inside Parquet, this data is laid out roughly as follows, and each value is assigned a definition level and a repetition level.
[
  {"nest.nest": [
    "nest1", // D: 2, R: 0
    "nest2", // D: 2, R: 0
    null,    // D: 1, R: 0
  ]},
  {"nest.repeated.nest.repeated": [
    "nestRep1", // D: 4, R: 0
    "nestRep2", // D: 4, R: 2
    "nestRep3", // D: 4, R: 2
    "nestRep4", // D: 4, R: 1
    "nestRep5", // D: 4, R: 2
    "nestRep6", // D: 4, R: 0
    null        // D: 1, R: 0
  ]}
]
We now confirm this information from the Parquet file.
Converting the data to Parquet and viewing it with hexdump gives the following.
00000000 50 41 52 31 15 06 15 2c 15 2c 15 df cf fe db 0d |PAR1...,.,......|
00000010 4c 15 06 15 02 15 06 15 00 15 08 15 00 12 00 00 |L...............|
00000020 04 02 02 01 05 00 00 00 6e 65 73 74 31 05 00 00 |........nest1...|
00000030 00 6e 65 73 74 32 15 06 15 ac 01 15 ac 01 15 bc |.nest2..........|
00000040 81 ac e6 09 4c 15 0e 15 02 15 06 15 00 15 08 15 |....L...........|
00000050 14 12 00 00 02 00 04 02 02 01 02 02 04 00 0c 04 |................|
00000060 02 01 08 00 00 00 6e 65 73 74 52 65 70 31 08 00 |......nestRep1..|
00000070 00 00 6e 65 73 74 52 65 70 32 08 00 00 00 6e 65 |..nestRep2....ne|
00000080 73 74 52 65 70 33 08 00 00 00 6e 65 73 74 52 65 |stRep3....nestRe|
00000090 70 34 08 00 00 00 6e 65 73 74 52 65 70 35 08 00 |p4....nestRep5..|
000000a0 00 00 6e 65 73 74 52 65 70 36 19 12 00 19 18 05 |..nestRep6......|
000000b0 6e 65 73 74 31 19 18 05 6e 65 73 74 32 15 00 19 |nest1...nest2...|
000000c0 16 02 00 19 12 00 19 18 08 6e 65 73 74 52 65 70 |.........nestRep|
000000d0 31 19 18 08 6e 65 73 74 52 65 70 36 15 00 19 16 |1...nestRep6....|
000000e0 02 00 19 1c 16 08 15 64 16 00 00 00 19 1c 16 6c |.......d.......l|
000000f0 15 e8 01 16 00 00 00 15 04 19 6c 48 06 4d 79 44 |..........lH.MyD|
00000100 65 65 70 15 02 00 35 02 18 04 6e 65 73 74 15 04 |eep...5...nest..|
00000110 00 15 0c 25 02 18 04 6e 65 73 74 25 00 4c 1c 00 |...%...nest%.L..|
00000120 00 00 35 04 18 08 72 65 70 65 61 74 65 64 15 02 |..5...repeated..|
00000130 00 35 02 18 04 6e 65 73 74 15 02 00 15 0c 25 04 |.5...nest.....%.|
00000140 18 08 72 65 70 65 61 74 65 64 25 00 4c 1c 00 00 |..repeated%.L...|
00000150 00 16 06 19 1c 19 2c 26 00 1c 15 0c 19 25 00 06 |......,&.....%..|
00000160 19 28 04 6e 65 73 74 04 6e 65 73 74 15 00 16 06 |.(.nest.nest....|
00000170 16 64 16 64 26 08 3c 36 02 28 05 6e 65 73 74 32 |.d.d&.<6.(.nest2|
00000180 18 05 6e 65 73 74 31 00 19 1c 15 06 15 00 15 02 |..nest1.........|
00000190 00 00 16 c4 03 15 14 16 d4 02 15 32 00 26 00 1c |...........2.&..|
000001a0 15 0c 19 25 00 06 19 48 04 6e 65 73 74 08 72 65 |...%...H.nest.re|
000001b0 70 65 61 74 65 64 04 6e 65 73 74 08 72 65 70 65 |peated.nest.repe|
000001c0 61 74 65 64 15 00 16 0e 16 e8 01 16 e8 01 26 6c |ated..........&l|
000001d0 3c 36 02 28 08 6e 65 73 74 52 65 70 36 18 08 6e |<6.(.nestRep6..n|
000001e0 65 73 74 52 65 70 31 00 19 1c 15 06 15 00 15 02 |estRep1.........|
000001f0 00 00 16 d8 03 15 16 16 86 03 15 3e 00 16 cc 02 |...........>....|
00000200 16 06 19 0c 16 08 16 cc 02 00 19 0c 18 37 67 69 |.............7gi|
00000210 74 68 75 62 2e 63 6f 6d 2f 70 61 72 71 75 65 74 |thub.com/parquet|
00000220 2d 67 6f 2f 70 61 72 71 75 65 74 2d 67 6f 20 76 |-go/parquet-go v|
00000230 65 72 73 69 6f 6e 20 30 2e 32 33 2e 30 28 62 75 |ersion 0.23.0(bu|
00000240 69 6c 64 20 29 19 2c 1c 00 00 1c 00 00 00 57 01 |ild ).,.......W.|
00000250 00 00 50 41 52 31 |..PAR1|
00000256
First, the footer metadata is 343 (0x0157) bytes long, so it starts at byte 248 (0xf7; 598-343-8+1 = 248), that is, from 15 04 19 6c.
Deserializing this metadata from Thrift and focusing on the column information gives the following.
{
  "version": 2,
  "row_groups": [
    {
      "columns": [
        {
          "meta_data": {
            "path_in_schema": ["nest", "nest"],
            "total_compressed_size": 50,
            "data_page_offset": 4
          }
        },
        {
          "meta_data": {
            "path_in_schema": ["nest", "repeated", "nest", "repeated"],
            "total_compressed_size": 116,
            "data_page_offset": 54
          }
        }
      ]
    }
  ]
}
Going further, the Page Header at the start of the "nest.nest" column (which begins at byte 5) contains the following.
{ "data_page_header_v2": { "num_values": 3, "num_nulls": 1, "num_rows": 3, "encoding": "PLAIN", "definition_levels_byte_length": 4, "repetition_levels_byte_length": 0, "is_compressed": false } }
This shows that the definition levels occupy 4 bytes and that the repetition levels are not written at all (0 bytes).
The repetition levels are omitted because they would all be zero.
The definition levels are the 4 bytes 04 02 02 01 at bytes 33 to 36 (0x20 to 0x23).
In binary, 04 02 02 01 is 00000100 00000010 00000010 00000001.
Definition levels and repetition levels are encoded with the run-length / bit-packing hybrid encoding.
According to https://parquet.apache.org/docs/file-format/data-pages/encodings/, the data is a sequence of runs, each consisting of a header followed by a value (one byte each in this example). If the least significant bit of the header is 0, the run is run-length encoded; if it is 1, the run is bit-packed.
In this example the headers in the 1st and 3rd bytes both end in 0, so both runs are run-length encoded. Dropping the last bit of each header, 0000010(0) 00000010 0000001(0) 00000001 reads as "2 (0b10) copies of 2 (0b10), followed by 1 (0b1) copy of 1 (0b1)", so the definition levels are 2,2,1.
Next, the Page Header of the "nest.repeated.nest.repeated" column contains the following.
{ "data_page_header_v2": { "num_values": 7, "num_nulls": 1, "num_rows": 3, "encoding": "PLAIN", "definition_levels_byte_length": 4, "repetition_levels_byte_length": 10, "is_compressed": false } }
This shows that the definition levels occupy 4 bytes and the repetition levels 10 bytes.
The repetition levels are written first, so we start with them.
The data starts at byte 85 (0x54), so the repetition levels are bytes 85 (0x54) to 94 (0x5d): 0000001(0) 00000000 0000010(0) 00000010 0000001(0) 00000001 0000001(0) 00000010 0000010(0) 00000000 (02 00 04 02 02 01 02 02 04 00).
Expanded, they are 0,2,2,1,2,0,0.
The definition levels are the next 4 bytes, bytes 95 (0x5e) to 98 (0x61): 0000110(0) 00000100 0000001(0) 00000001 (0c 04 02 01). Expanded, they are 4,4,4,4,4,4,1.
To summarize:
- the "nest.nest" column has definition levels 2,2,1 and repetition levels 0,0,0
- the "nest.repeated.nest.repeated" column has definition levels 4,4,4,4,4,4,1 and repetition levels 0,2,2,1,2,0,0
These values match the D and R values shown at the beginning.
[
  {"nest.nest": [
    "nest1", // D: 2, R: 0
    "nest2", // D: 2, R: 0
    null,    // D: 1, R: 0
  ]},
  {"nest.repeated.nest.repeated": [
    "nestRep1", // D: 4, R: 0
    "nestRep2", // D: 4, R: 2
    "nestRep3", // D: 4, R: 2
    "nestRep4", // D: 4, R: 1
    "nestRep5", // D: 4, R: 2
    "nestRep6", // D: 4, R: 0
    null        // D: 1, R: 0
  ]}
]
With this, we can read Parquet files even when they contain nesting and arrays.
Conclusion
We have seen that a Parquet file contains its own metadata and that every field is represented as a column.
For more details, see the following pages.
https://parquet.apache.org/docs/file-format/
https://github.com/apache/parquet-format
Also, because Parquet is modeled on the file format used inside Dremel, the official documentation sometimes explains things in terms of Dremel ("using Dremel's xxx ..."), especially the definition level and the repetition level. In those cases, reading the Dremel paper clears things up.
Dremel: Interactive Analysis of Web-Scale Datasets
https://research.google/pubs/dremel-a-decade-of-interactive-sql-analysis-at-web-scale/
That concludes this introduction to the structure of Parquet.