Update (2024/6/4): I presented a reorganized version of this article, with additional information, at a study group (OTFSG #2). Please also refer to the slides from that talk.
This article is the Day 2 entry of the Distributed computing (Apache Spark, Hadoop, Kafka, ...) Advent Calendar 2023.
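In Apache Iceberg, the Catalog is a core component. Iceberg readers and writers use the Catalog to discover tables and to operate on them while maintaining consistency. At the same time, there are many ways to set up a Catalog, and you need to choose one according to your requirements. This article summarizes the main options for an Iceberg Catalog and their characteristics.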
If you are wondering what Iceberg is in the first place, please see the related articles below.
- What is Apache Iceberg?
- A new shape for the data lake: an introduction to Open Table Formats
- [Translation] How Bilibili built a Data Lakehouse with Apache Iceberg
Iceberg Catalog requirements
The main jobs of an Iceberg Catalog are as follows.
- Manage table operations such as creating and dropping tables (see the Catalog interface)
- Persist table operations such as creating and dropping tables
- For each table, point to the location of the metadata file that holds the current, latest state of the table's metadata, and update that pointer (atomically) as needed
Regarding the third point in particular: an Iceberg table is read by walking a hierarchy of metadata files down to the actual data. Something therefore has to indicate the metadata file that serves as the "entrance" to that hierarchy, and that is the Iceberg Catalog.
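The traversal described above can be sketched in plain Java. This is only an illustration, not Iceberg's reader code: the class, the table name `db.events`, and every file path are hypothetical, and in-memory maps stand in for the metadata file, manifest list, and manifest files that a real reader would open in turn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the metadata hierarchy (not Iceberg code).
class MetadataTreeSketch {
    // Catalog: table name -> current metadata file (the "entrance").
    static final Map<String, String> CATALOG =
        Map.of("db.events", "metadata/v2.metadata.json");

    // Metadata file -> manifest list of its current snapshot.
    static final Map<String, String> MANIFEST_LIST_OF =
        Map.of("metadata/v2.metadata.json", "metadata/snap-2.avro");

    // Manifest list -> manifest files.
    static final Map<String, List<String>> MANIFESTS_OF =
        Map.of("metadata/snap-2.avro", List.of("metadata/m1.avro", "metadata/m2.avro"));

    // Manifest file -> data files.
    static final Map<String, List<String>> DATA_FILES_OF = Map.of(
        "metadata/m1.avro", List.of("data/a.parquet"),
        "metadata/m2.avro", List.of("data/b.parquet", "data/c.parquet"));

    // Walks from the catalog entry down to the data files of one table.
    static List<String> planDataFiles(String table) {
        String metadataFile = CATALOG.get(table);
        String manifestList = MANIFEST_LIST_OF.get(metadataFile);
        List<String> dataFiles = new ArrayList<>();
        for (String manifest : MANIFESTS_OF.get(manifestList)) {
            dataFiles.addAll(DATA_FILES_OF.get(manifest));
        }
        return dataFiles;
    }

    public static void main(String[] args) {
        System.out.println(planDataFiles("db.events"));
    }
}
```

The only lookup that does not start from a file is the first one, which is why the catalog entry is the single place that decides which version of the table a reader sees.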
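In addition, because Iceberg implements transactions with optimistic concurrency control, the latest metadata to follow changes every time the table is modified (s0, s1, s2 in the figure above). The Iceberg Catalog is therefore responsible for indicating which metadata is the latest at any given moment.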
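For multiple readers and writers to operate on a table while keeping the data consistent, the Catalog must be able to atomically update the pointer to the metadata. This is not a strictly mandatory requirement, and there are implementation options that do not guarantee consistency, but it is a point you should always keep in mind when choosing a technology.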
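The atomic pointer update described above can be illustrated with a small sketch. This is not Iceberg's implementation: `InMemoryCatalog`, `commit`, and the metadata file names are hypothetical, and an in-memory `AtomicReference` stands in for whatever atomic primitive a real catalog backend provides. A writer's commit succeeds only if the pointer still references the metadata it started from; otherwise it has to retry on top of the newer state.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory stand-in for a catalog that tracks one table's
// current metadata file location. Not part of the Iceberg API.
class InMemoryCatalog {
    private final AtomicReference<String> current;

    InMemoryCatalog(String initialMetadataLocation) {
        this.current = new AtomicReference<>(initialMetadataLocation);
    }

    String currentMetadataLocation() {
        return current.get();
    }

    // Optimistic commit: swap the pointer only if nobody has moved it since
    // the writer read `expected`. compareAndSet compares references, which is
    // fine here because writers pass back the exact object they read.
    boolean commit(String expected, String updated) {
        return current.compareAndSet(expected, updated);
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog("s0.metadata.json");

        // Two writers both start from s0.
        String baseA = catalog.currentMetadataLocation();
        String baseB = catalog.currentMetadataLocation();

        // Writer A wins; writer B's commit is rejected because s0 is stale.
        System.out.println("writer A committed: " + catalog.commit(baseA, "s1.metadata.json"));
        System.out.println("writer B committed: " + catalog.commit(baseB, "s1-b.metadata.json"));

        // Writer B re-reads the pointer and retries on top of s1.
        String rebased = catalog.currentMetadataLocation();
        System.out.println("writer B retry committed: " + catalog.commit(rebased, "s2.metadata.json"));
        System.out.println("current: " + catalog.currentMetadataLocation());
    }
}
```

A backend that cannot offer this kind of compare-and-swap leaves a window in which two writers both believe they committed, which is the consistency risk mentioned above.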
Note that, depending on how the Catalog is set up, part of the Iceberg Catalog's work is done either by the engine or tool (through a library packaged with it, such as org.apache.iceberg.hadoop.HadoopCatalog) or by an external service (such as a REST Catalog).
Catalog options
As a premise, the Catalog is a core component of an Iceberg-based data lake, so whichever option you choose, it must be designed to provide the data durability and service availability your requirements call for. In addition, when choosing a Catalog you must consider whether each engine or tool you plan to use (Spark, Trino, and so on) supports it. The choice therefore has to be made within the overall architecture, clients included, and not from the Catalog's perspective alone.
With that in mind, the following sections outline the options for setting up an Iceberg Catalog.
Hadoop Catalog
Overview
This is the simplest Catalog and requires no dedicated process. Its mechanism is nothing more than pointing to the most recently written file in a given metadata directory, judged by timestamp, as the current metadata. So despite the name "Hadoop", it can in fact be used without a Hadoop cluster. Usage looks like this:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.hadoop.HadoopCatalog;

    // Hadoop configuration object required by the catalog
    Configuration conf = new Configuration();
    // Root location under which the catalog keeps its tables
    String warehousePath = "hdfs://host:8020/warehouse_path";
    HadoopCatalog catalog = new HadoopCatalog(conf, warehousePath);
The metadata directory (warehousePath) can live in a wide range of places: any suitable file system, or object storage such as S3 or GCS. However, some storage systems do not support atomic changes to files or objects, which means data consistency may not be guaranteed. Also, with object storage such as S3, performance problems can appear when there are many tables or many files per table, and API rate limits need to be taken into account.
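Why atomic file operations matter for a file-system-based catalog can be sketched on a local filesystem. This is an illustration under assumptions, not HadoopCatalog's actual code: the `v<N>.metadata.json` naming and the `tryCommit` and `newMetadataDir` helpers are hypothetical here. `Files.createFile` performs the existence check and the creation as one atomic operation, so only one of two competing writers can claim a given version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of committing metadata versions on a filesystem.
class FileCommitSketch {
    // Creates a scratch directory that plays the role of a table's
    // metadata directory.
    static Path newMetadataDir() {
        try {
            return Files.createTempDirectory("metadata");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Tries to claim the given version. Returns true if this writer created
    // the file, false if another writer had already claimed that version.
    static boolean tryCommit(Path metadataDir, int version, String content) {
        Path target = metadataDir.resolve("v" + version + ".metadata.json");
        try {
            // The existence check and the creation happen atomically.
            Files.createFile(target);
            // Simplification: a real implementation would avoid exposing a
            // half-written file, for example by writing elsewhere and renaming.
            Files.write(target, content.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newMetadataDir();
        System.out.println("writer A: " + tryCommit(dir, 1, "{\"writer\":\"A\"}"));
        System.out.println("writer B: " + tryCommit(dir, 1, "{\"writer\":\"B\"}"));
    }
}
```

On a store that cannot make "create only if absent" atomic, both writers could appear to succeed and one commit would be silently overwritten.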
When to use
- When you want to build a simple environment for evaluating Iceberg
While the Hadoop Catalog is simple, it has many drawbacks in terms of scalability and operability. For production use, it is better to consider the other options.
All of the options introduced from here on support atomic updates of the metadata location.
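Hive Catalog (Hive Metastore Catalog)
Overview
This approach stores the path to the metadata file in a property of the table's entry in Hive Metastore (in the Iceberg implementation, the metadata_location table parameter).
Hive Metastore is supported by a wide variety of engines and tools and has a very large number of production deployments. Adopting Iceberg is also often considered as a way to resolve problems with an existing Hive Metastore and Hive tables. On the other hand, if you are not already operating a Hive Metastore, standing one up just for Iceberg may cost more in setup and operations than it is worth.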
When to use
- When you are already using Hive Metastore and Hive tables
References
- High load exceeding 90% CPU usage occurred on LINE's Hive Metastore; the problems of the Hive table format were resolved with Apache Iceberg
- How LINE tackles four problems in its log pipeline: Iceberg as an option for a simple and extensible architecture
JDBC Catalog
This approach uses org.apache.iceberg.jdbc.JdbcCatalog to point to Iceberg metadata files from a database such as MySQL or PostgreSQL. The database behind a JDBC Catalog can be chosen from many options, and the appeal is that you can rely on the database's own features, such as its mechanisms for ensuring reliability. However, every tool and engine that uses Iceberg has to package a JDBC driver, which complicates deployment. It also follows that you can only choose tools and engines that allow this, which needs to be taken into account.
As a variation, you can wrap the JDBC Catalog with a REST API, which mitigates the concerns above.
    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.1 \
        --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
        --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
        --conf spark.sql.catalog.my_catalog.uri=jdbc:mysql://test.1234567890.us-west-2.rds.amazonaws.com:3306/default \
        --conf spark.sql.catalog.my_catalog.jdbc.verifyServerCertificate=true \
        --conf spark.sql.catalog.my_catalog.jdbc.useSSL=true \
        --conf spark.sql.catalog.my_catalog.jdbc.user=admin \
        --conf spark.sql.catalog.my_catalog.jdbc.password=pass
When to use
- When you do not want to build a Hive Metastore just for Iceberg (especially on premises)
- When managing the catalog in a database is preferable, for example given your team's capabilities
References
- A Spark & Trino on K8s data platform on bare metal
  - A case study from MicroAd. Reportedly, a JDBC catalog is used as the backend of a REST catalog
Nessie Catalog
Overview
This approach points to the metadata file path through table properties in Nessie, which manages data lake transactions in a Git-like way. An appeal unique to Nessie is that, whereas an Iceberg table normally supports ACID only at the level of a single table (query), Iceberg on Nessie supports multi-table, multi-statement transactions. One concern is that support for Nessie across engines and tools is limited compared with Hive Metastore. Also, as of 2023/12/02, when this entry was written, information available online appears to be limited (a qualitative impression). I am curious about its internals and performance characteristics, but I have no hands-on knowledge and this needs further investigation.
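The difference between per-table and multi-table atomicity can be sketched as follows. This is not Nessie's API or implementation: `BranchSketch` and its methods are hypothetical, and a branch is modeled as a single immutable map from table name to metadata location that is swapped atomically. Because the whole map is replaced in one compare-and-set, readers see either all of a commit's table changes or none of them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of a branch whose head covers many tables at once.
class BranchSketch {
    private final AtomicReference<Map<String, String>> head =
        new AtomicReference<Map<String, String>>(Map.of());

    // Immutable view of every table's metadata location at this moment.
    Map<String, String> snapshot() {
        return head.get();
    }

    // Applies changes to several tables in one atomic step. Fails if the
    // branch head moved after `base` was read.
    boolean commit(Map<String, String> base, Map<String, String> changes) {
        Map<String, String> next = new HashMap<>(base);
        next.putAll(changes);
        return head.compareAndSet(base, Map.copyOf(next));
    }

    public static void main(String[] args) {
        BranchSketch branch = new BranchSketch();
        Map<String, String> base = branch.snapshot();

        // One commit updates two tables together.
        boolean ok = branch.commit(base, Map.of(
            "orders", "orders/v2.metadata.json",
            "payments", "payments/v2.metadata.json"));

        // A commit built on the stale base is rejected as a whole.
        boolean stale = branch.commit(base, Map.of("orders", "orders/v3.metadata.json"));

        System.out.println("multi-table commit: " + ok);
        System.out.println("stale commit: " + stale);
        System.out.println("tables on branch: " + branch.snapshot().size());
    }
}
```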
When to use
- When Nessie's data lake management model appeals to you
- When you need multi-table or multi-statement transactions
REST Catalog
Overview
This approach exposes the interface of any catalog implementation over REST, following the Iceberg REST Open API specification. (It does not prescribe a particular backend implementation.) The key point of the REST Catalog is that the logic of the Iceberg Catalog implementation can be moved to a Catalog server behind the REST endpoint. As a result, tools and engines do not need to embed a catalog library, which simplifies deployment. The flexibility to use any implementation as the backend is another attraction.
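From the client side, the simplification can be sketched by comparing the catalog properties an engine has to carry. The helper below only assembles Spark-style `spark.sql.catalog.<name>.*` keys in plain Java and does not call Iceberg; `sparkConf`, the catalog name `my_catalog`, and both example hosts are hypothetical, and authentication settings for the REST endpoint are left out. With a JDBC catalog each engine needs the implementation class, the database URI, and credentials (plus a JDBC driver on its classpath); with a REST catalog, in this sketch it only needs the catalog type and the server URI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles Spark-style catalog settings.
class CatalogConfSketch {
    // Prefixes catalog properties with spark.sql.catalog.<name>.
    static Map<String, String> sparkConf(String name, Map<String, String> props) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.catalog." + name, "org.apache.iceberg.spark.SparkCatalog");
        for (Map.Entry<String, String> e : props.entrySet()) {
            conf.put("spark.sql.catalog." + name + "." + e.getKey(), e.getValue());
        }
        return conf;
    }

    // What every engine has to know when it talks to the database directly.
    static Map<String, String> jdbcProps() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog");
        p.put("uri", "jdbc:mysql://db.example.com:3306/default"); // placeholder host
        p.put("warehouse", "s3://my-bucket/my/key/prefix");
        p.put("jdbc.user", "admin");
        p.put("jdbc.password", "pass");
        return p;
    }

    // What an engine has to know when a REST catalog server sits in front.
    static Map<String, String> restProps() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("type", "rest");
        p.put("uri", "https://catalog.example.com"); // placeholder endpoint
        return p;
    }

    public static void main(String[] args) {
        sparkConf("my_catalog", jdbcProps()).forEach((k, v) -> System.out.println(k + "=" + v));
        System.out.println();
        sparkConf("my_catalog", restProps()).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The database details still exist in the REST setup, but they live only in the Catalog server's configuration instead of in every engine.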
When to use
- When you want to decouple the Iceberg Catalog from engines and tools and hide the implementation behind an interface (for example, when you are using a JDBC Catalog)
References
- Iceberg's REST Catalog: A Spark Demo
- Tabular provides a sample implementation of the REST Catalog
- REST Catalog Explained
- Apache Iceberg's REST catalog
Managed solutions
So far I have listed solutions that are self-managed to a greater or lesser degree, but there are also products that provide an Iceberg Catalog as a managed service. For most people, a managed solution is probably the easiest option for both setup and operations. Not only can you hand the setup and operation of the Iceberg Catalog over to the provider, you also avoid managing the infrastructure that hosts the Catalog, and you can leave scaling with load, availability, robustness, and security to the provider as well. My personal opinion, therefore, is that if you have no special requirements and no particular constraints on using SaaS or cloud services, a managed solution is the safe choice.
Some managed Iceberg Catalogs also provide mechanisms that automate table maintenance such as compaction. Access to conveniences like this is another attraction of managed solutions.
Examples of managed solutions that support Iceberg
- AWS Glue Data Catalog
  - AWS's serverless, Hive Metastore compatible Data Catalog
- Dremio Arctic
  - Dremio's lakehouse management service built on Project Nessie and Apache Iceberg. Because it is based on Nessie, it can ensure consistency for multi-statement, multi-table transactions
- Tabular
  - Tabular Catalog and Iceberg Tables, provided by Tabular's data platform
- Snowflake
  - You can choose between reading external Iceberg tables and using native Iceberg tables managed on Snowflake
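Closing
There is a lot I do not know in detail about the options for an Apache Iceberg Catalog, so if there is information that you think should be added, please let me know.
Tomorrow's Day 3 entry of the Distributed computing Advent Calendar 2023 is, remarkably, the third Iceberg article in a row. Looking forward to it!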