ã¯ããã«
ãã®è¨äºã¯Enigmo Advent Calendar 2018ã®11æ¥ç®ã§ãã
Enigmoã§ã¯ããã¼ã¿ã¦ã§ã¢ãã¦ã¹ï¼DWHï¼ã¨ãã¦BigQueryã使ã£ã¦ãã¦ããµã¼ãã¹ã®ã¢ã¯ã»ã¹ãã°ããµã¤ãå ã®è¡åãã°ããã¼ã¿ãã¼ã¹ã®ãã¼ã¿ãBigQueryã¸éç´ããã¦ãã¾ãã
ãã¼ã¿ãã¼ã¹ããBigQueryã¸ã®ãã¼ã¿åæã«ã¯Apache Airflowã使ã£ã¦ãã¦ãä»æ¥ã¯ãã®ä»çµã¿ã«ã¤ãã¦ç´¹ä»ãã¾ãã
Apache Airflowã¨ã¯
Airflowã¯ãpythonã§ã¯ã¼ã¯ããã¼ï¼DAGï¼ãå®ç¾©ããã¨ããã®ã¨ããã«ã¿ã¹ã¯(ãªãã¬ã¼ã¿ã¼)ããã¹ã±ã¸ã¥ã¼ãªã³ã°ãã¦èµ·åãã¦ããããã¼ã«ã§ããGCPã§ãGKEä¸ã§AirflowãåããCloud Composerã¨ãããµã¼ãã¹ãæä¾ããã¦ãã¦ãåç¥ã®æ¹ãå¤ãã¨æãã¾ãã
ãã¼ã¿ã®å¦çã®åä½ããªãã¬ã¼ã¿ã§å®ç¾©ãããã®å¦çã®ä¾åé¢ä¿ãåæ ããã¯ã¼ã¯ããã¼ãDAGã§å®ç¾©ãã¦ããã°ãã¼ã¿å¦çã®ãã¤ãã©ã¤ã³ãå®ç¾ãããã¨ãå¯è½ã¨ãªãã¾ãã
DBããBigQueryã¸ã®ãã¼ã¿ãã¤ãã©ã¤ã³
ãã¼ã¿ã®æµã
ãã¼ã¿ã®æµãã¨ãã¦ã¯ãä¸ã®å³ã®éã大ããï¼ãã§ã¼ãºã«åããã¦ãã¦ãã¾ãã¯DBï¼SQL Serverï¼ããGoogle Cloud Storageï¼GCSï¼ã¸ãã¼ã¿ãã¢ãããã¼ããã¦ãã¾ãããã®æ¬¡ã«GCSããBigQueryã¸ãã®ãã¼ã¿ããã¼ããã¦ãã¾ãã
ããããã®ãã§ã¼ãºãAirflowã®ã¿ã¹ã¯ã®åä½ã§ãããªãã¬ã¼ã¿ã¼ã§å®ç¾ãã¦ãã¦ãããã«ï¼ã¤ã®ãªãã¬ã¼ã¿ã¼ã¯ããããåæãããã¼ãã«ãã¨å¥ã®ã¿ã¹ã¯ã¨ãã¦åå¨ããããããDAGã¨ããï¼ã¤ã®ã¯ã¼ã¯ããã¼ã®åä½ã§ã¾ã¨ãã¦ãã¾ãã
SQL ServerããGCSã¸
JdbcToGoogleCloudStorageOperator
SQL ServerããGCSã¸ã®ãã¼ã¿ã®ç§»åã¯ãJdbcToGoogleCloudStorageOperator ã¨ããAirflowã®ãªãã¬ã¼ã¿ã¼ãæ å½ãã¾ãã
DBãMySQLã®å ´åã¯MySqlToGoogleCloudStorageOperatorã¨ããAirflowã«çµã¿è¾¼ã¿ã®ãªãã¬ã¼ã¿ã¼ããããã§ããããã¤ãã®ãã¼ã¿ãã¼ã¹ã¯SQL Serverãªã®ã§ãJDBCã®ã¯ã©ã¤ã¢ã³ãã§åæ§ã®åãããããªãã¬ã¼ã¿ã¼ãèªåã§ä½ã£ããã®ã JdbcToGoogleCloudStorageOperator ã§ããAirflowã®ãã©ã°ã¤ã³ã¨ãã¦å ¬éãã¦ãã¾ãã
ãã®ãªãã¬ã¼ã¿ã§ã®å¦çã¯ãã¾ãDBããSQLã§ãã¼ã¿ãæ½åºããä¸åº¦JSONLå½¢å¼ã®ãã¡ã¤ã«ã¨ãã¦ã®ãªãã¬ã¼ã¿ã¼ãåããµã¼ãã¼ã®ãã¼ã«ã«ã«ä¿åããããããGCSã¸ã¢ãããã¼ããããã¨ããæµãã§ããBigQueryã¸ãã¼ãããã¨ãã«ã¹ãã¼ãå®ç¾©ãå¿ è¦ãªã®ã§ããã¼ã¿ãã¡ã¤ã«ã¨ã¯å¥ã«ã¹ãã¼ãå®ç¾©ã®ãã¡ã¤ã«ãJSONå½¢å¼ã§GCSã¸ã¢ãããã¼ãããã¾ãã
ã¹ã±ã¸ã¥ã¼ãªã³ã°ã¨æ´æ°å·®åæ½åºã®ä»çµã¿
DAGã®ã¹ã±ã¸ã¥ã¼ãªã³ã°ééã¯ï¼æéã«è¨å®ãã¦ãã¾ããããã¨Airflowã¯æéãï¼æéãã¨ã«æéãåãã¦DAGã«ãã®æéã®éå§æå»ï¼execution_date
ï¼ãçµäºæå»ï¼next_execution_date
)ããã³ãã¬ã¼ãã®ãã©ã¡ã¼ã¿ã¼ã¨ãã¦æ¸¡ãã¦ããã¾ãããããã
ãã¼ã¿æ½åºSQLã®WHEREå¥ã®ã¨ããã§ã¬ã³ã¼ãã®æ´æ°æ¥æãè¨é²ããã«ã©ã ï¼ä¸ã®ä¾ã§ã¯updated_at
ï¼ãåºæºã«æéæå®ããã¨ããã®æéã«æ´æ°ããã£ãã¬ã³ã¼ãã ããæ½åºãããBigQueryå´ã¸éãããä»çµã¿ã§ãã
SELECT * FROM table1 WHERE "{{execution_date.strftime('%Y-%m-%d %H:%M:%S')" <= updated_at AND updated_at < "{{next_execution_date).strftime('%Y-%m-%d %H:%M:%S')}}"
ããééãå¤ãã¦ãDAGãç·¨éãããã¨ãªãSQLããã®æéã«åããã¦å¤ãã£ã¦ãããã®ã§ä¾¿å©ã§ãã
GCSããBigQueryã¸
GoogleCloudStorageToBigQueryOperator
GCSããBigQueryã¸ã¯ãã®åã®éãAirflowçµã¿è¾¼ã¿ã®GoogleCloudStorageToBigQueryOperatorã¨ãããªãã¬ã¼ã¿ã¼ããã£ã¦ããã¾ãã
BigQueryå´ã®ãã¼ã¿ã»ããã¯åæå DBã®ãã¼ã¿ãã¼ã¹åä½ããã¼ãã«ã¯åæå DBã®ãã¼ãã«åä½ã«åãã¦ãã¾ããBigQueryå´ã®ãã¼ãã«ã¯DBå´ã®ã¬ã³ã¼ãã®æ´æ°æ¥ãã¨ã«æ¥ä»åå²ãã¦ãã¾ãã
BigQueryã®æ´æ°ã¯DMLã¯ä½¿ããã«ããã¡ã¤ã«ãèªã¿è¾¼ã¿ã¸ã§ãã§æ´æ°ããã¾ããããããã¨DBå´ã®ã¬ã³ã¼ããæ´æ°ãããã¨BigQueryå´ã«ã¯éè¤ãã¦ã¬ã³ã¼ããæºã¾ã£ã¦ããã®ã§ãããããã¯å¾è¿°ã®éè¤é¤å¤ãã¥ã¼ã§è§£æ±ºãã¦ãã¾ãã
BigQueryå´ã§ã¬ã³ã¼ãã®éè¤ãé¤å¤
BigQueryå´ã®ãã¼ãã«ã§ã¯ã次ã®ãããªSQLã§ãã¥ã¼ãã¼ãã«ãä½ããã¨ã§ãåæå ã®DBã§ã¬ã³ã¼ããä½åº¦ãæ´æ°ããã¦ã常ã«ææ°ã®ã¬ã³ã¼ãããç¾ããªãä»çµã¿ã«ãªã£ã¦ãã¾ãã
ãã®ä¾ã¯ã主ãã¼ãid
ã§æ´æ°æ¥æã®ã«ã©ã ããupdated_at
ã®å ´åã®SQLã§ããåä¸id
ã«å¯¾ãã¦å¸¸ã«ææ°ã®updated_at
ããã¤ã¬ã³ã¼ããããã®ãã¥ã¼ã«ã¯åºã¦ãã¾ããã
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY updated_at DESC) etl_row_num FROM `db1.table1_*`) WHERE etl_row_num = 1
Airflowã§ä¾¿å©ã ã£ãæ©è½
Airflowã®æ©è½ã§ãã®ä»çµã¿ãã¤ããã®ã«å©ããããæ©è½ãããã¤ããã£ãã®ã§ç´¹ä»ãã¾ããæ¸ãããã¦ãªãã§ãããã»ãã«ãããããããã¾ãã
Catchup
DAGã®ã¹ã±ã¸ã¥ã¼ã«ãéå»ã®æéã«ããã®ã¼ã£ã¦å®è¡ãã¦ãããæ©è½ãªãã§ãããé常ã«ãããããã£ãã§ãã éå»ã®ãã¼ã¿ã®ç§»è¡ã§ãå·®ååæã®ä»çµã¿ããã®ã¾ã¾ä½¿ãã¾ããããä¸åº¦ã«åæããã«ãæéãåºåã£ã¦å°ããã¤ãã¼ã¿ãæã£ã¦ãããã®ã§ãåæå ã®DBã«ãè² è·ããããã«ãã¿ã¾ããã
ConnectionãVariable
Connectionã¯æ¥ç¶å ã¨ãªãDBãGCPã¸ã®èªè¨¼æ å ±ãä¸å 管çãã¦ãããä¸åº¦è¨å®ããã°ã©ã®DAGããã¢ã¯ã»ã¹ã§ãã¦ä¾¿å©ã§ããã次ã®Poolãåããªãã§ãããè¨å®ã¯GUIã§ãCLIã§ãè¨å®ã§ããã®ã§ãansibleãªã©ã®ãããã¸ã§ãã³ã°ãã¼ã«ã§ãè¨å®ã§ããã®ããããããã£ãã§ãã
Variableãåãªããã¼ã¨å¤ãè¨å®ã§ããã ããªãã§ãããDAGãæ±ããã¨ãªãdevãproductionãªã©ãªãªã¼ã¹ã¹ãã¼ã¸ãã¨ã«å¤ãåãæ¿ãããã¦ä¾¿å©ã§ããã
Pool
ã¿ã¹ã¯ã®åæå®è¡æ°ãå¶éããæ©è½ã§ããPoolã¯ã¦ã¼ã¶ã¼ãå®ç¾©ã§ãããã®Poolã«ãªãã¬ã¼ã¿ã¼ãç´ä»ããã¨ãã®ãªãã¬ã¼ã¿ã¼ã¯ãã®Poolã®slotæ°ãè¶ ãã¦åæå®è¡ããã¾ããããã¼ã¿æ½åºã®ã¿ã¹ã¯ãï¼ã¤ã®DBã«å¯¾ãã¦å¤æ°åæå®è¡ããã¦ãã¾ãã¨ãã®DBã®ã³ãã¯ã·ã§ã³ãåæã«æ¶è²»ãããæ¯æ¸ãããã¾ãããããã®Poolã§ä¸éæ°ãè¨å®ã§ããã®ã§å®å¿ã§ããã
ã¾ã¨ã
æåã¯æã£åãæ©ãcronã¨ã¹ã¯ãªããã§ä½ã£ã¦ãã¾ããã¨æã£ãã®ã§ãããããããªããã¾ã§æéã¯ããã£ããã®ã®Airflowã§ä½ã£ã¦è¯ãã£ãã§ããéçºãé²ãã«ã¤ããç¹ã«ãããã¯ã·ã§ã³ç°å¢ã§åããã«ããã£ã¦ããããèæ ®ãã¹ããã¨ãåºã¦ããã¨æãã®ã§ãããä½ããªããã»ããã¨æã£ãæ©è½ãå åãããã¦ãããã®ããã«ç¨æããã¦ãã¦ã¨ã¦ãå©ããã¾ãããå ¨ã¦ä½¿ãããã¦ãªãã§ãããã¯ã¼ã¯ããã¼éç¨ã®ãã¦ãã¦ãããããè©°ã¾ã£ãè¯ããããã¯ãã ã¨æãã¾ããã