This is Harashima from the R&D department. Today, as the title says, I'll talk about building a rather unglamorous batch job.
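Morphological analysis here, morphological analysis there
Everyone, are you doing morphological analysis? Of course you are. At Cookpad, too, all sorts of projects run morphological analysis.
In fact, if anything, we do it too much. Project A analyzes recipes, project B analyzes recipes, project C analyzes recipes too, and so on. For reference, the projects that need (the results of) morphological analysis include recipe classification and recommendation, and training distributed representations (e.g., word2vec) and BERT.
Of course, there is no problem if the analysis results each project ultimately wants are different. As far as I have seen, though, in most cases they are the same (or can be made the same). In that case, it is inefficient for every developer to go through the following sequence separately:
1. Install the analyzer (→ trial and error with a Dockerfile)
2. Fetch the text to analyze (→ trial and error with SQL)
3. Run the analyzer (→ at Cookpad, trial and error with ECS and IAM settings)
4. Save the analysis results (→ likewise, trial and error with S3 and RDS settings)
I wanted us to use the same analyzer, collect the same input (basically, all recipes as of analysis time), run the analysis periodically, and make the results easy to reuse. That wish grew stronger as the number of projects needing morphological analysis increased.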
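Unifying this is a chore, but it is not hard
Against that background, I finally got off my backside and set about unifying all this. This kind of work always tends to get put off. Each developer wants to move their own project forward, and looking after other projects well enough to unify things is quite a chore.
On the other hand, none of this is a big deal technically. It is simply a matter of "building a batch job that just does morphological analysis". Carry out steps 1 to 4 above carefully, leave the final analysis results in a form that each project can easily use, and the mission is complete. It is not hard.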
A batch job that just does morphological analysis
So I went ahead and built the batch job. The figure gives an overview of it. Below, I describe each step of the batch job in detail.
1. Installing the analyzer
First, for the analyzer, I used konoha, which was built by my colleague @himkt. konoha is a wrapper around various morphological analyzers (e.g., MeCab, Sudachi, KyTea). We already had an internal server that runs analysis with konoha, so I left installing and configuring the analyzer to that server (which is why step 1 does not appear in the figure). Note that the analyzer we use at Cookpad is MeCab.
2. Fetching the text to analyze
The input is recipes (their titles, descriptions, steps, and so on). These are fetched from Redshift. At Cookpad, most data is consolidated in Redshift. We also have an internal system called Queuery, which uses UNLOAD so that SELECT statements can be run without putting load on Redshift.
This time, I built an internal Python package called corter (pronounced like "coater") that wraps Queuery further. corter stands for COllect Recipe-related TExts from Redshift and, as the name says, it collects recipe-related text from Redshift.
The following code uses corter to fetch the titles of all Cookpad recipes (the recipe ID is a dummy).
```python
from corter.agent import RecipeTitleAgent

agent = RecipeTitleAgent()
recipe_ids, titles = agent.collect()
print(recipe_ids[0], titles[0])
# => 12345, 'ナスの肉味噌炒め'
```
Packed inside corter is the SQL that every machine learning engineer at Cookpad has probably fumbled through at some point. Things like the WHERE condition that narrows the results to published recipes are easy to forget every single time.
Incidentally, there was one more reason for building this Python package. I also wanted to use it in projects that work with raw sentences (sentences that have not been morphologically analyzed). One example of such a project is training SentencePiece.
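As a rough illustration of the kind of wrapper described above, the sketch below hides a "published recipes only" query behind a small agent class. corter and Queuery are internal and their real interfaces are not shown in this post, so everything here is an assumption: the `recipes` table, its `published` column, and the `TitleAgentSketch` class are hypothetical, and an in-memory SQLite database stands in for Redshift.

```python
import sqlite3


class TitleAgentSketch:
    """Hypothetical stand-in for a corter-style agent (not the real corter API)."""

    # The easy-to-forget filter lives in one place instead of in every project.
    QUERY = "SELECT id, title FROM recipes WHERE published = 1 ORDER BY id"

    def __init__(self, connection):
        self.connection = connection

    def collect(self):
        """Return parallel lists of recipe IDs and titles."""
        rows = self.connection.execute(self.QUERY).fetchall()
        recipe_ids = [row[0] for row in rows]
        titles = [row[1] for row in rows]
        return recipe_ids, titles


def build_demo_database():
    """Create a tiny in-memory database that stands in for Redshift."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE recipes (id INTEGER PRIMARY KEY, title TEXT, published INTEGER)"
    )
    connection.executemany(
        "INSERT INTO recipes VALUES (?, ?, ?)",
        [(12345, "ナスの肉味噌炒め", 1), (12346, "下書きのレシピ", 0)],
    )
    return connection


if __name__ == "__main__":
    agent = TitleAgentSketch(build_demo_database())
    recipe_ids, titles = agent.collect()
    print(recipe_ids, titles)
```

The point of the design is only that callers never write the WHERE clause themselves; the unpublished draft row in the demo data is silently excluded.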
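3. Running the analyzer
This part is easy. We just send the recipes collected in step 2 to the analysis server mentioned in step 1. However, Cookpad currently has about 3.5 million recipes in total. However fast MeCab is, analyzing all recipes with a single worker takes time.
So we use parallel_fork from the job management system kuroko2 and run the analysis with 10 workers in parallel (to be precise, the input is already fetched in 10 partitions at step 2). With this, even with 3.5 million recipes as input, the analysis finishes in about 5 minutes for short text such as titles and in about 1 hour for long text such as steps.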
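The 10-way split can be sketched in plain Python as follows. This is not how kuroko2's `parallel_fork` works internally (that runs at the job scheduler level); it only illustrates the idea of partitioning the input and analyzing the partitions concurrently. The modulo-based sharding rule and the `analyze_shard` stub, which splits on spaces instead of calling an analysis server, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 10


def shard_of(recipe_id, num_shards=NUM_SHARDS):
    """Assign a recipe to a shard; modulo on the ID is an assumed scheme."""
    return recipe_id % num_shards


def split_into_shards(recipes, num_shards=NUM_SHARDS):
    """Group (recipe_id, text) pairs into num_shards lists."""
    shards = [[] for _ in range(num_shards)]
    for recipe_id, text in recipes:
        shards[shard_of(recipe_id, num_shards)].append((recipe_id, text))
    return shards


def analyze_shard(shard):
    """Placeholder for sending one shard to the analysis server.

    A real job would call the tokenizer here; this stub only splits on spaces.
    """
    return [(recipe_id, text.split()) for recipe_id, text in shard]


def analyze_in_parallel(recipes, num_shards=NUM_SHARDS):
    """Analyze all shards concurrently and flatten the results."""
    shards = split_into_shards(recipes, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as executor:
        results = list(executor.map(analyze_shard, shards))
    return [item for shard_result in results for item in shard_result]


if __name__ == "__main__":
    demo = [(i, f"recipe title {i}") for i in range(25)]
    analyzed = analyze_in_parallel(demo)
    print(len(analyzed))
```

Because each shard is independent, the results can be written out as 10 separate files and merged later, which matches the shape of the next step.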
4. Saving the analysis results
I decided to save the analysis results in Redshift. S3 and RDS were also options, but as mentioned in step 2, most data at Cookpad is consolidated in Redshift. Saving the analysis results there as well makes them easy to use together with other data.
To save the analysis results in Redshift, we first upload the results of step 3 to S3 (4-1 in the figure). Next, we merge the 10 uploaded result files into a single TSV (4-2). Then we use the SQL batch framework bricolage to COPY the contents of the TSV into Redshift (4-3). With this, even a load of hundreds of millions of rows finishes in just under 10 minutes.
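The merge in step 4-2 amounts to concatenating the partial files. The sketch below does this for local files; in the pipeline described above the parts live on S3 and the load is a Redshift COPY driven by bricolage, neither of which is reproduced here. The file names and the three-column rows are made up for the demo.

```python
import csv
import tempfile
from pathlib import Path


def merge_tsv_parts(part_paths, merged_path):
    """Concatenate partial TSV files into one TSV and return the row count."""
    row_count = 0
    with open(merged_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
        for part_path in sorted(part_paths):
            with open(part_path, newline="", encoding="utf-8") as in_file:
                for row in csv.reader(in_file, delimiter="\t"):
                    writer.writerow(row)
                    row_count += 1
    return row_count


def write_demo_parts(directory, num_parts=10):
    """Write small partial results (hypothetical file names) for the demo."""
    paths = []
    for part in range(num_parts):
        path = Path(directory) / f"part-{part:02d}.tsv"
        with open(path, "w", newline="", encoding="utf-8") as part_file:
            writer = csv.writer(part_file, delimiter="\t", lineterminator="\n")
            writer.writerow([part, 0, f"word{part}"])
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        parts = write_demo_parts(directory)
        merged = Path(directory) / "merged.tsv"
        print(merge_tsv_parts(parts, merged))
```

Producing one file keeps the load step to a single COPY, at the cost of an extra pass over the data.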
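In the end, the analysis results are stored in a table like the one below. This is the table that holds the analysis results for titles (recipe_title_words in the figure). One row is one word. The main columns are position (where the word occurs), surface (the surface form), pos (part of speech; the values are in Japanese, e.g. 名詞 for noun, 助詞 for particle, 動詞 for verb), and base (the base form). analyzed_at is, as the name says, the time of analysis.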
recipe_id | position | surface | pos | base | analyzed_at |
---|---|---|---|---|---|
12345 | 0 | ナス | 名詞 | ナス | 2021-03-08 00:00:00 |
12345 | 1 | の | 助詞 | の | 2021-03-08 00:00:00 |
12345 | 2 | 肉 | 名詞 | 肉 | 2021-03-08 00:00:00 |
12345 | 3 | 味噌 | 名詞 | 味噌 | 2021-03-08 00:00:00 |
12345 | 4 | 炒め | 動詞 | 炒める | 2021-03-08 00:00:00 |
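To make the "one row per word" layout concrete, here is a minimal sketch that turns a tokenized title into rows with the columns shown in the table. The `WordRow` type and `to_rows` function are hypothetical names for this example, and the tokens are written by hand rather than produced by an analyzer.

```python
from datetime import datetime
from typing import NamedTuple


class WordRow(NamedTuple):
    """One row of a recipe_title_words-style table (column names from the table)."""

    recipe_id: int
    position: int
    surface: str
    pos: str
    base: str
    analyzed_at: str


def to_rows(recipe_id, tokens, analyzed_at):
    """Turn (surface, pos, base) tokens into one row per word."""
    timestamp = analyzed_at.strftime("%Y-%m-%d %H:%M:%S")
    return [
        WordRow(recipe_id, position, surface, pos, base, timestamp)
        for position, (surface, pos, base) in enumerate(tokens)
    ]


# Hand-written tokens matching the table; a real job would get these
# from the analyzer.
DEMO_TOKENS = [
    ("ナス", "名詞", "ナス"),
    ("の", "助詞", "の"),
    ("肉", "名詞", "肉"),
    ("味噌", "名詞", "味噌"),
    ("炒め", "動詞", "炒める"),
]

if __name__ == "__main__":
    for row in to_rows(12345, DEMO_TOKENS, datetime(2021, 3, 8)):
        print("\t".join(str(value) for value in row))
```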
Running steps 2 to 4 daily
Steps 2 to 4 above (step 1 required no work this time) are run daily using the deployment tool hako. Specifically, we start a container on ECS and run steps 2 to 4 in it. As a result, every day Redshift holds the analysis results for all recipes that are published as of that day.
As an aside, I also considered incremental updates. That is, we could have added the analysis results for recipes posted (more precisely, published) the previous day, updated the results for recipes that were edited, and deleted the results for recipes that were deleted (more precisely, unpublished). However, analyzing all recipes does not take much time anyway, and the added complexity of incremental updates outweighed the benefit, so I dropped the idea.
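For comparison, the bookkeeping an incremental update would need looks roughly like the sketch below: every day, two snapshots of published recipes have to be compared to find what was added, modified, or removed, and each group then needs its own insert, update, or delete path. The snapshot format (a dict from recipe ID to text) and the function name are assumptions made for illustration; the daily full refresh described above avoids all of this.

```python
def diff_snapshots(previous, current):
    """Compare two {recipe_id: text} snapshots of published recipes.

    Returns sorted lists of added, modified, and deleted recipe IDs.
    """
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    modified = sorted(
        recipe_id
        for recipe_id in current.keys() & previous.keys()
        if current[recipe_id] != previous[recipe_id]
    )
    return added, modified, deleted


if __name__ == "__main__":
    yesterday = {1: "a", 2: "b", 3: "c"}
    today = {2: "b", 3: "C", 4: "d"}
    print(diff_snapshots(yesterday, today))
```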
Written up like this, each step may look like a lot of work. In practice, though, I only combined existing handy tools (konoha, Queuery, kuroko2, bricolage, hako), so it was not that hard. About the only thing I put real effort into was building corter, and even that was made much easier by Queuery.
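Using the analysis results in each project
Now, the analysis results are pointless unless they can be used in a variety of projects. So I made it possible to fetch not only raw sentences but also analysis results with corter. As with raw sentences, it uses Queuery behind the scenes and runs a SELECT statement wrapped in UNLOAD on Redshift.
The following code fetches the analysis results for the titles of all Cookpad recipes. The results come back space-delimited. By changing options, you can also filter by part of speech or drop stop words.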
```python
from corter.agent import SegmentedRecipeTitleAgent

agent = SegmentedRecipeTitleAgent()
recipe_ids, segmented_titles = agent.collect()
print(recipe_ids[0], segmented_titles[0])
# => 12345, 'ナス の 肉 味噌 炒め'
```
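The sketch below shows one way space-delimited titles could be assembled from per-word rows, including the two kinds of filtering mentioned above. corter's actual options are not shown in this post, so the parameter names `allowed_pos` and `stop_words`, and the function itself, are hypothetical.

```python
from collections import defaultdict


def segment_titles(rows, allowed_pos=None, stop_words=frozenset()):
    """Join surfaces per recipe in position order, with optional filters.

    rows are (recipe_id, position, surface, pos) tuples. A recipe whose
    words are all filtered out does not appear in the result.
    """
    words_by_recipe = defaultdict(list)
    # Sorting tuples orders them by recipe_id first, then by position.
    for recipe_id, _position, surface, pos in sorted(rows):
        if allowed_pos is not None and pos not in allowed_pos:
            continue
        if surface in stop_words:
            continue
        words_by_recipe[recipe_id].append(surface)
    recipe_ids = sorted(words_by_recipe)
    titles = [" ".join(words_by_recipe[recipe_id]) for recipe_id in recipe_ids]
    return recipe_ids, titles


# Rows copied from the example table (recipe ID is a dummy).
DEMO_ROWS = [
    (12345, 0, "ナス", "名詞"),
    (12345, 1, "の", "助詞"),
    (12345, 2, "肉", "名詞"),
    (12345, 3, "味噌", "名詞"),
    (12345, 4, "炒め", "動詞"),
]

if __name__ == "__main__":
    print(segment_titles(DEMO_ROWS))
    print(segment_titles(DEMO_ROWS, allowed_pos={"名詞"}))
    print(segment_titles(DEMO_ROWS, stop_words={"の"}))
```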
The projects that need analysis results are basically Python-based. After the code above, you can use scikit-learn, gensim, transformers, or whatever you like and immediately start on recipe classification or recommendation, or on training distributed representations or BERT.
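Because the titles arrive space-delimited, downstream code only needs a whitespace split to get words. The standard-library sketch below builds simple count vectors to show this; in a real project a library such as scikit-learn would normally do this work. The second title in the demo data is made up for the example.

```python
from collections import Counter


def build_vocabulary(segmented_titles):
    """Map each distinct word to a column index, in order of first appearance."""
    vocabulary = {}
    for title in segmented_titles:
        for word in title.split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def to_count_vectors(segmented_titles, vocabulary):
    """Turn each title into a list of word counts aligned with the vocabulary."""
    vectors = []
    for title in segmented_titles:
        counts = Counter(title.split())
        # Dicts preserve insertion order, so columns follow the index order.
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vectors


if __name__ == "__main__":
    titles = ["ナス の 肉 味噌 炒め", "ナス の 煮物"]
    vocabulary = build_vocabulary(titles)
    print(vocabulary)
    print(to_count_vectors(titles, vocabulary))
```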
In this way, as originally intended, we can now avoid the situation where each developer runs morphological analysis separately in each project.
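What's next?
To close this entry, here are three things I want to work on next.
The first is to eliminate the remaining projects that are not yet using the analysis results saved in Redshift. For the most part, this is just a matter of replacing existing code with corter. However, there is one project where that is not enough: indexing for recipe search. Morphological analysis is exactly what matters there, but the code around recipe search is a den of legacy, so replacing it will take considerable time and resolve.
The second is to use a retrained analyzer. Last year, Cookpad built an annotated corpus of 500 recipes. This corpus contains gold-standard data for morphological analysis (as well as named entity recognition and syntactic parsing). By improving the analyzer with this corpus and reducing analysis errors, I hope to lift every project that needs morphological analysis. For details on the annotated corpus and retraining, please see @himkt's presentation at next week's annual meeting of the Association for Natural Language Processing.
The third is to use Redshift ML. Redshift ML showed up rather suddenly, didn't it? As mentioned above, the analysis results are stored in Redshift. If we build a flow that trains a model using them as features and runs inference inside Redshift, we might be able to hand most of a project such as recipe classification over to Redshift. I still do not know enough about Redshift ML, so I will start by studying it.
Working on these alone is plenty, but there are many other things we want to tackle at Cookpad. If you thought "I want to do that!" or "I'd like to try that!", please take a look at our careers page. We look forward to your application.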