Now that we are in the heyday of big data, exchanging datasets of several gigabytes is no longer unusual, and many different formats are in use for data exchange files. Here, let's take a quick look at the file formats commonly used with Python.

CSV: CSV is used for the simplest tabular data. It is supported by a wide range of tools, Microsoft Excel among them, and can be used in almost any environment.

Creating the data: As an example, let's create a dataset of 100,000 rows by 100 columns and save it in CSV format, specifying datetime values as the index.

%pip install pandas pyarrow numpy tqdm dask graphviz
import sys
import numpy as np
import pandas as pd
pd.
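The excerpt cuts off before the data is built; a minimal sketch of that creation step might look like this (the start date, frequency, and column names are assumptions, not the article's exact code):

# Build a 100,000-row x 100-column DataFrame with a datetime index
# and save it as CSV. Values and column names are placeholders.
import numpy as np
import pandas as pd

rows, cols = 100_000, 100
index = pd.date_range("2020-01-01", periods=rows, freq="s")
df = pd.DataFrame(
    np.random.rand(rows, cols),
    index=index,
    columns=[f"col{i}" for i in range(cols)],
)
df.to_csv("sample.csv")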
Hello, I'm @kanga333, an engineer doing product development on UZOU. At UZOU we use Amazon Athena for part of our ad data aggregation. In this article I'd like to introduce the design of UZOU's Athena-based data processing platform.

Overall architecture: The overall architecture of the data processing platform is shown below. In the rest of the article I'll walk through each component in turn.

Aggregation into S3 with Fluentd: UZOU does not run any relay servers such as Fluentd aggregators; the Fluentd process resident on each ad delivery server puts logs directly into S3. The following is an excerpt from the S3 output section of the Fluentd configuration:

<buffer time>
  @type file
  timekey 60m
</buffer>
path example_table/dt=%Y%m%d/h
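For context, a fuller fluent-plugin-s3 block along these lines might look as follows; only the timekey and the dt= partitioned path come from the excerpt above, while the match tag, bucket, region, and buffer path are hypothetical:

<match ads.**>
  @type s3
  # hypothetical bucket and region
  s3_bucket example-log-bucket
  s3_region ap-northeast-1
  # Hive-style dt= partitions that Athena can read directly
  path example_table/dt=%Y%m%d/
  <buffer time>
    @type file
    # hypothetical on-disk buffer location
    path /var/log/td-agent/buffer/s3
    # flush a chunk every hour, matching the excerpt
    timekey 60m
  </buffer>
</match>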
What I did: I compared read/write speed and on-disk size for each of the approaches commonly used to stash 2-D array data temporarily in Python:

1. pickle.dump
2. joblib.dump
3. converting to pyarrow and saving as Parquet
4. DataFrame.to_csv

Conclusion: For compression ratio and speed, pickle with protocol=4 wins. If your workload repeatedly reads or writes only parts of the data, saving as Parquet via pyarrow may also be a good choice.

Test environment: CPU: Xeon E5-2630 x 2 chips; RAM: 128GB; Windows 8 64-bit; Python 3.6.

Data used for the comparison: I tested with feature data for machine learning: a pandas.DataFrame of 536 rows x 178,886 columns (0.77GB) and a pandas.DataFrame of 4,803 rows x 178,886 columns (6.87GB).

Results: For the 0.77GB
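A minimal benchmark sketch in the spirit of that comparison (the DataFrame shape and file names here are placeholders, not the article's feature data):

import pickle
import time
import joblib
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(np.random.rand(1_000, 500))
df.columns = df.columns.astype(str)  # Parquet field names must be strings

def timed(label, fn):
    # crude wall-clock timing of a single write
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

def save_pickle():
    with open("df.pkl", "wb") as f:
        pickle.dump(df, f, protocol=4)

timed("pickle p4", save_pickle)
timed("joblib", lambda: joblib.dump(df, "df.joblib"))
timed("parquet", lambda: pq.write_table(pa.Table.from_pandas(df), "df.parquet"))
timed("csv", lambda: df.to_csv("df.csv"))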
A month ago I wrote something muddled along the lines of "I can't figure out how to handle factors when reading and writing Parquet with Apache Arrow." I've now fixed that. To explain a little: Apache Parquet actually has nothing that directly corresponds to a categorical value. You can, however, represent categorical values by repurposing dictionary encoding, a technique meant for compression. "Dictionary encoding is a compression strategy in Parquet, and there is no formal 'dictionary' or 'categorical' type." (Faster C++ Apache Parquet performance on dictionary-encoded string data coming in Apache A
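The same idea is easy to see from Python, where pandas categoricals play the role of R's factors; a small round-trip sketch under that assumption (the file name is hypothetical):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A categorical column becomes an Arrow dictionary-encoded column
df = pd.DataFrame({"fruit": pd.Categorical(["apple", "banana", "apple"])})
table = pa.Table.from_pandas(df)
print(table.schema.field("fruit").type)  # dictionary<values=string, ...>

pq.write_table(table, "fruit.parquet")
# Ask the reader to keep the column dictionary-encoded; to_pandas then
# restores it as a Categorical
restored = pq.read_table("fruit.parquet", read_dictionary=["fruit"]).to_pandas()
print(restored["fruit"].dtype)  # category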
As the title says: I convert a pandas DataFrame to Parquet with pyarrow and upload it straight to GCS. Script: It can be run in a form like the following; rather than going through a file on disk, it uploads directly from a buffer.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import datetime
from google.cloud import storage as gcs

# Create a DataFrame from dummy data
row_num = 100000
string_values = ['Python', 'Ruby', 'Java', 'JavaScript', 'PHP', 'Golang']
df = pd.DataFrame({
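The excerpt cuts off mid-script; a self-contained sketch of the buffer-based upload it describes might look like this (the bucket and object names are hypothetical):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import storage as gcs

row_num = 100000
string_values = ['Python', 'Ruby', 'Java', 'JavaScript', 'PHP', 'Golang']
df = pd.DataFrame({
    "lang": np.random.choice(string_values, row_num),
    "value": np.random.rand(row_num),
})

# Serialize to Parquet in memory instead of writing a local file
buf = pa.BufferOutputStream()
pq.write_table(pa.Table.from_pandas(df), buf)

# Upload the in-memory bytes; assumes application default credentials
client = gcs.Client()
bucket = client.bucket("my-example-bucket")
bucket.blob("exports/df.parquet").upload_from_string(
    buf.getvalue().to_pybytes(), content_type="application/octet-stream"
)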
The other day I spoke at the Apache Arrow Tokyo Meetup 2019 under the title "R and Apache Arrow," and I also gave a lightning talk about Apache Arrow at Japan.R. What I covered:

1. the arrow package lets you read and write Parquet files (more on these below)
2. the sparklyr package now uses Apache Arrow internally, which has made data exchange between R and Spark faster
3. once Arrow Flight becomes more widespread, you will be able to fetch data from databases without going through JDBC or ODBC

Of these, the point I personally want to stress right now is the first. Reading and writing Parquet files is, I think, the most tangible benefit for R users. I hope this draws everyone deep into Apache Arrow, and that more of the world's systems Apac
Using Glue for ETL on small files felt like overkill, so I implemented it with pandas and pyarrow instead. Contents: adding pandas and pyarrow to a Lambda Layer; building the package to register in the Layer; uploading the package; the Lambda code; error handling; references.

Adding pandas and pyarrow to a Lambda Layer. Building the package to register in the Layer: This time I needed pandas, pyarrow, and s3fs, which took a little ingenuity. Packing all three into a single ZIP runs into the Lambda Layer's 50MB limit, while splitting them into three ZIPs runs into the limit on attaching layers to a Lambda function:

Layers consume more than the available size of 262144000 bytes

Sharing large packages such as numpy
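For illustration, a minimal handler of the kind this layer setup enables might look like this; the event shape and S3 paths are hypothetical, and pandas resolves the s3:// URLs through the s3fs package provided by the layer:

import pandas as pd

def handler(event, context):
    # e.g. {"src": "s3://my-src-bucket/input.csv",
    #       "dst": "s3://my-dst-bucket/output.parquet"}
    df = pd.read_csv(event["src"])
    df.to_parquet(event["dst"], engine="pyarrow")
    return {"rows": len(df)}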