These are my notes on the points to keep in mind when using COPY INTO to load a large number of very large files from Amazon S3 / Google Cloud Storage into Snowflake.
Data loads assumed here
I have the following COPY INTO use cases in mind.
- Loading historical data into a table that is otherwise fed by Snowpipe
- Bulk-loading a huge table that is fully replaced (truncate and reload) as a single table
Source: https://docs.snowflake.com/ja/user-guide/data-load-s3
CSV is the fastest file format to load, followed by Parquet and ORC
This community article is a very useful reference.
CSV loads fastest, but semi-structured formats such as JSON are useful for storing data as files, and converting them to CSV costs compute, so it is easier to ingest the JSON as-is than to convert it. If you will need to keep the files around as a data lake, consider Parquet or the Iceberg format, for which support has been added.
Resizing the Warehouse to match the number and size of files can speed up the load
Snowflake data loading runs on a Warehouse, its compute resource, with one thread per file. The recommended file size for loading is 100 to 250 MB compressed, so this blog post advises combining files when you have a large number of files of 10 MB or less. https://select.dev/posts/snowflake-batch-loading
An XS Warehouse has 8 threads, and the thread count doubles with each size step up (S, M, ...). If you load a single 100 GB file, even an XS can use only 1 of its 8 threads, so compute is wasted and the load takes a long time. https://select.dev/posts/snowflake-warehouse-sizing
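If you split the 100 GB into 1,000 files of 100 MB each, all 8 threads of the XS are used and the load runs about 8 times faster. Run it on an XL Warehouse with 128 threads and it can load up to 128 times faster than the single-file case (16 times faster than the XS). As a rule, pick the Warehouse size that the file count can keep busy without waste.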
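As a rough sketch of how this could look in practice: check how many files are staged and how large they are, then resize the Warehouse around the load. The warehouse name load_wh is hypothetical, and the stage path is borrowed from the examples later in this post.

```sql
-- Hypothetical warehouse name (load_wh); stage path reused from this post's examples.
-- 1. Check how many files are staged and how big they are.
list @s3.log_archives.access_log/dt=20240601/;

select count(*)                    as file_count,
       round(avg("size") / 1e6, 1) as avg_size_mb
from table(result_scan(last_query_id()));

-- 2. Size the warehouse so the file count can keep its threads busy
--    (XS = 8 threads, doubling with each size step).
alter warehouse load_wh set warehouse_size = 'XLARGE';

-- 3. Run the COPY INTO statements here, then scale back down.
alter warehouse load_wh set warehouse_size = 'XSMALL';
```

The second statement reads the output of LIST through RESULT_SCAN, so it has to run immediately after the LIST in the same session.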
Data loading is billed for the compute it uses
You are billed for the time the Warehouse spends running COPY INTO. As an example, take the case of loading Parquet (Snappy comp/Structured) equivalent to 4.7 TB uncompressed in 3095 sec on a 2-XL Warehouse (figures taken from this benchmark article).
Assuming the files are split finely enough and the load runs on a 2-XL Warehouse:
- 3095 sec / 3600 sec/h * 32 credit/h (2XL) / 4.7 TB = 5.85 credit/TB
At the Enterprise + AWS (Tokyo) unit price:
- 5.85 credit/TB * $4.3/credit = about $25.2/TB
So it costs a fair amount. Estimate the cost in advance so that loading all your historical data at once does not come as a shock.
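The same estimate can be written as a query so the inputs are easy to swap for your own load. This is only a sketch: the values are the benchmark figures and unit price quoted above, and the column names are made up for illustration.

```sql
-- Inputs: benchmark figures quoted above and the Enterprise + AWS (Tokyo) price used in this post.
-- Column names are hypothetical.
with params as (
    select 3095::float as load_seconds,
           32::float   as credits_per_hour,  -- 2XL
           4.7::float  as uncompressed_tb,
           4.3::float  as usd_per_credit
)
select load_seconds / 3600 * credits_per_hour                   as credits_used,
       load_seconds / 3600 * credits_per_hour / uncompressed_tb as credits_per_tb,
       load_seconds / 3600 * credits_per_hour / uncompressed_tb
           * usd_per_credit                                     as usd_per_tb
from params;
```

Replacing load_seconds and credits_per_hour with numbers from a small trial load on your own files should give a more realistic figure than the benchmark does.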
Loading huge data with a single COPY INTO does not produce well-organized micro-partitions
The secret behind Snowflake's fast queries is its own mechanism called micro-partitioning. At load time it packs the data into many small files, stored column by column, and keeps metadata about each one internally; when data is read, unneeded data is skipped (pruning), which makes queries fast. This pairs well with Snowpipe, which ingests files incrementally in micro-batches: when a large volume of log data is ingested continuously with Snowpipe or similar, rows that are close in time automatically end up in the same micro-partitions, so queries become more efficient.
https://select.dev/posts/snowflake-micro-partitions
However, if you bulk-load, say, a year of historical data with a single COPY INTO, the whole year is processed as one unit and micro-partitions are created in whatever order the files happen to be processed. As a result, even a query whose WHERE clause picks out a specific date may end up scanning the full year of data. There are three clustering approaches for getting the micro-partitions of a COPY INTO target table organized by a date column.
- Natural Clustering
- Auto Clustering
- Manual Sorting
Natural Clustering
Instead of a single COPY INTO, split the load into the ranges of data you want clustered together and run COPY INTO once per range, that is, once per unit you want the micro-partitions organized by.
Example of loading files one date path at a time
copy into log_table
  from @s3.log_archives.access_log/dt=20240601/
  file_format = (type = 'json')
  match_by_column_name = case_insensitive;
copy into log_table
  from @s3.log_archives.access_log/dt=20240602/
  file_format = (type = 'json')
  match_by_column_name = case_insensitive;
...
Auto Clustering
Specify a clustering key on the table and Snowflake automatically reorganizes the micro-partitions for you. The clustering key can be set when the table is created or afterwards.
alter table log_table cluster by (date_col);
https://docs.snowflake.com/ja/user-guide/tables-auto-reclustering
This method is very easy, but automatic clustering runs compute in the background, so charges can easily become high. Use it with care.
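One way to keep an eye on that cost, sketched here under the assumption that the table is the log_table from the example above and that a 7-day window is what you want to inspect:

```sql
-- Credits consumed by automatic clustering on log_table over the last 7 days.
-- The 7-day window is an arbitrary choice for illustration.
select start_time,
       end_time,
       credits_used,
       num_rows_reclustered
from table(information_schema.automatic_clustering_history(
       date_range_start => dateadd('day', -7, current_timestamp()),
       table_name       => 'LOG_TABLE'))
order by start_time;

-- Pause reclustering if the cost turns out higher than expected,
-- and resume it later.
alter table log_table suspend recluster;
alter table log_table resume recluster;
```

Checking this right after setting a clustering key on a large, badly ordered table is probably when it matters most, since the initial reclustering has the most work to do.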
Manual Sorting
Recreate the loaded table with CTAS and specify ORDER BY. The column given in ORDER BY acts as a hint for how the micro-partitions are laid out.
In the following case, the micro-partitions are organized by the date_col column.
create or replace table log_table as select * from log_table order by date_col;
This is an easy fix when data has been loaded in a state where micro-partition pruning does not work at all, so you will likely rely on it in day-to-day operations.
Micro-partitioning may not work on fields inside a VARIANT column
Micro-partition pruning may not work on values inside a VARIANT column, so the entire table can end up being scanned. If you plan to load log data as VARIANT for now and transform and structure it in Snowflake later, you are in for a painful surprise.
Example of a bad data load
create or replace table log_table (
  raw_data variant
);
copy into log_table
  from @s3.log_archives.access_log/dt=20240601/
  file_format = (type = 'json');
copy into log_table
  from @s3.log_archives.access_log/dt=20240602/
  file_format = (type = 'json');
...
-- Pruning does not work even though the timestamp field inside raw_data is used in the WHERE clause
select *
from log_table
where raw_data:timestamp::timestamp > to_timestamp('2024-06-01 00:00:00')
  and raw_data:timestamp::timestamp < to_timestamp('2024-06-01 01:00:00')
;
To avoid this, even if you load as VARIANT, it is better to extract the columns you want the micro-partitions organized by into structured columns up front.
The documentation says the same:
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes:
- Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
- Numbers within strings
- Arrays
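A minimal sketch of that advice, keeping the raw JSON in a VARIANT column while pulling the timestamp out into its own column at load time. The column name event_ts is hypothetical, and this assumes the timestamp field in the JSON casts cleanly to a timestamp.

```sql
-- event_ts is a hypothetical column name; it holds the timestamp extracted at load time.
create or replace table log_table (
  event_ts timestamp_ntz,
  raw_data variant
);

-- COPY with a transformation: $1 is the JSON document of each staged record.
copy into log_table (event_ts, raw_data)
from (
  select $1:timestamp::timestamp_ntz, $1
  from @s3.log_archives.access_log/dt=20240601/
)
file_format = (type = 'json');

-- Filtering on the structured column gives Snowflake a chance to prune micro-partitions.
select *
from log_table
where event_ts >= to_timestamp('2024-06-01 00:00:00')
  and event_ts <  to_timestamp('2024-06-01 01:00:00');
```

Because raw_data is still loaded in full, the rest of the structuring can be deferred; only the columns used for filtering need to be decided before the load.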
There is also a function for checking whether micro-partitioning is working, but it cannot be used on VARIANT columns.
select SYSTEM$CLUSTERING_INFORMATION('log_table', '(timestamp)');
https://docs.snowflake.com/ja/sql-reference/functions/system_clustering_information
Other notes
I will keep adding Tips here as they come up while developing and operating a data platform on Snowflake.
References
- https://docs.snowflake.com/ja/user-guide/data-load-s3
- https://select.dev/posts/snowflake-batch-loading
- https://select.dev/posts/snowflake-warehouse-sizing
- https://select.dev/posts/snowflake-micro-partitions
- https://docs.snowflake.com/ja/user-guide/tables-auto-reclustering
- https://docs.snowflake.com/en/user-guide/semistructured-considerations#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure
- https://docs.snowflake.com/ja/sql-reference/functions/system_clustering_information