With SQLite + Python user-defined functions, my progress went from "bad" to "not bad"
Overview
Until now my workflow was: pull data from Hive with some light processing, then process and analyze it in Python.
Switching to Hive → SQLite → Python improved my progress,
so here are some notes on basic SQLite usage and on how to register SQL user-defined functions written in Python.
In particular, I realized that being able to add user-defined functions freely
makes analysis a lot easier.
What exactly improves by putting SQLite in the middle?
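Hive is very convenient when you need to haul in big data,
since a little SQL is all it takes.
On the other hand, repeatedly fetching small bits of data from it gets stressful.
So until now I would download a reasonably large chunk of data from Hive in one go
and process it in Python before analyzing it.
But writing similar code over and over just to reshape data is tedious,
and unless the code is written carefully the processing takes quite a while, which is also tedious.
All in all it was a drag and my progress was bad.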
So instead, I load the data fetched from Hive into SQLite first. That way:
1. extraction and processing can be done in SQL alone, and
2. it is also pretty fast*1
so my progress is now "not bad".
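You might say MySQL would do just as well, but that misses the point.
SQLite is lightweight and easy, which makes it cute. Cute is justice.
To use SQLite you only need to download a single binary from the link below.
There is no tedious environment setup, so it is especially good when you want an RDBMS right away.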
http://www.sqlite.org/download.html
Basic SQLite usage
# Create a DB file named test.db. SQLite keeps all table data in this one file,
# so handing over the file hands over the whole dataset, which is convenient.
sqlite3 test.db
-- Set the field separator to ","
.separator ,
-- Import test.csv into test_table
.import test.csv test_table
-- Believe it or not, those three lines are all the setup needed
-- Switch the output format to CSV
.mode csv
-- Send query results to output.csv
.output output.csv
-- The result of this query is written to output.csv
select * from test_table;
-- Switch the output back to standard output
.output stdout
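The same import and export steps can also be scripted from Python when the shell is not handy. This is only a rough sketch using the standard `csv` and `sqlite3` modules: the helper names `import_csv` and `export_csv` are made up for illustration, the CSV is assumed to have no header row, and the columns are simply named `c0`, `c1`, ... and stored as TEXT (the `.import` command behaves differently when the table does not exist yet).

```python
import csv
import itertools
import sqlite3


def import_csv(con, csv_path, table):
    """Load a headerless CSV file into `table` (hypothetical helper).

    Assumption: `table` is a trusted name; columns are created as c0, c1, ... TEXT.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        first = next(reader, None)
        if first is None:
            return  # empty file, nothing to import
        cols = ", ".join(f"c{i} TEXT" for i in range(len(first)))
        marks = ", ".join("?" for _ in first)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        # Stream the rows instead of loading the whole file into memory.
        con.executemany(
            f'INSERT INTO "{table}" VALUES ({marks})',
            itertools.chain([first], reader),
        )
    con.commit()


def export_csv(con, query, csv_path):
    """Write the result rows of `query` to a CSV file (hypothetical helper)."""
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(con.execute(query))
```

With these in place, `import_csv(con, "test.csv", "test_table")` followed by `export_csv(con, "select * from test_table", "output.csv")` roughly mirrors the shell session above.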
Registering a user-defined function for SQLite from Python
# You don't actually need to read this article; the documentation below covers it
# https://python-doc-ja.readthedocs.org/en/latest/library/sqlite3.html
# Here we register median, which SQLite does not ship with
import sqlite3
import numpy  # scientific computing library; it has many statistical functions, so wiring them into SQL is very handy

class Median:
    def __init__(self):
        # Collect every value passed to the aggregate
        self.values = []

    def step(self, value):
        # Called once per row
        self.values.append(value)

    def finalize(self):
        # Called once at the end; returns the aggregate result
        return numpy.median(self.values)

con = sqlite3.connect("test.db", isolation_level=None)
# Register Median as a one-argument SQL aggregate named "median"
con.create_aggregate("median", 1, Median)
c = con.cursor()
query = "select median(hoge) from test_table"
c.execute(query)
for row in c:
    print(row)
Thoughts
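Very easy.
Or so I thought, but getting median working actually took a fair amount of struggle.
At first I tried to do it in SQL alone,
and then wrote my own median code without numpy,
but once the data reached about 8GB it became painfully slow or crashed partway through, which was a real problem.
I decided there was no need to be so attached to my own implementation*2,
and the moment I switched to numpy.median it handled even 10GB+ of data without any trouble.*3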
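For servers where numpy is not installed, the same aggregate can be sketched with only the standard library. This is not what I actually ran: `StdlibMedian` and `connect_with_median` are names made up for this example, it uses `statistics.median`, it still keeps every value in memory, and I have not compared its speed with numpy.median.

```python
import sqlite3
import statistics


class StdlibMedian:
    """Median aggregate built on the standard library only (illustrative)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # SQLite passes NULL as None; skip it, as built-in aggregates do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Return NULL for a group with no non-NULL values instead of raising.
        return statistics.median(self.values) if self.values else None


def connect_with_median(path):
    """Open a connection that has median() registered (hypothetical helper)."""
    con = sqlite3.connect(path)
    con.create_aggregate("median", 1, StdlibMedian)
    return con
```

After `con = connect_with_median("test.db")`, the query `select median(hoge) from test_table` works the same way as in the numpy version.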
So what is so good about being able to add user-defined functions freely?
Python has a rich set of statistics and machine learning libraries,
so once a library function is registered in SQL,
you can do analysis just by running queries.
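One way to make that wiring less repetitive is a small helper that turns any "list of values in, one value out" function into a SQL aggregate. This is a sketch under my own assumptions: `register_aggregate` is a made-up name, it handles one-argument aggregates only, and it ignores NULLs.

```python
import sqlite3
import statistics


def register_aggregate(con, name, func):
    """Expose func(list_of_values) as a one-argument SQL aggregate `name`.

    Hypothetical helper: NULLs are skipped, and a group with no values yields NULL.
    """

    class Collector:
        def __init__(self):
            self.values = []

        def step(self, value):
            if value is not None:
                self.values.append(value)

        def finalize(self):
            return func(self.values) if self.values else None

    con.create_aggregate(name, 1, Collector)
```

For example, `register_aggregate(con, "pstdev", statistics.pstdev)` makes `select pstdev(hoge) from test_table` available. Functions from other libraries can be passed in the same way, though their return values may need converting to a plain int, float, or str before SQLite accepts them.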
On top of that, at my company we provide a query interface
so that ordinary internal users (that is, non-engineers) can pull data freely.
If this gets built into that query system,
nobody has to set up their own analysis environment,
and anyone can apply statistical methods just by running a query, which is nice.
Thank you, SQLite and numpy!
Thanks to you, my progress is "not bad"!
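*1: Personal impression only. I did not run any benchmarks, so this may be wishful thinking.
*2: Whether that is really true is debatable. Work servers usually do not have numpy installed, so part of me wanted to finish with SQL alone or with only Python's standard functions wherever possible.
*3: Personal impression only. I did not run any benchmarks, so this may be wishful thinking.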