Processing XML files with Hadoop Streaming
This post explains how to handle XML files with Hadoop Streaming.
The goal is to extract the <title>~</title> elements from the RSS feed of しろかい！ (shirokai.hatenablog.com).
The code is written in Python.
The implementation follows the approach in the article below (in English):
http://davidvhill.com/article/processing-xml-with-hadoop-streaming
Fetch the RSS feed and transfer it to HDFS
$ wget http://shirokai.hatenablog.com/feed -O feed.xml
$ hadoop fs -put feed.xml
mapper.py
The mapper collects the lines between <entry> and </entry>, parses the collected text as XML, and prints the contents of <title>~</title>.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script (uses cStringIO and the print statement).
import sys
import cStringIO
import xml.etree.ElementTree as xml

# Buffer that holds the text of one <entry>~</entry> block.
buff = None
# True while an <entry>~</entry> block is being processed.
intext = False

for line in sys.stdin:
    line = line.strip()
    # Start of <entry>: create the buffer and start writing to it.
    if '<entry>' in line:
        buff = cStringIO.StringIO()
        intext = True
    # Between <entry> and </entry>: append the line to the buffer.
    if intext:
        buff.write(line)
        # </entry>: parse the XML, print <title>~</title>, and release the buffer.
        if '</entry>' in line:
            root = xml.fromstring(buff.getvalue())
            print root.find('title').text.encode('utf-8')
            buff.close()
            buff = None
            intext = False
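To check the buffering logic without a cluster, the same idea can be sketched as a plain function and run on a few sample lines. This is a minimal Python 3 sketch, not the script used in this post: the function name extract_titles and the sample data are hypothetical, and it skips entries that have no <title> instead of raising an error.

```python
import io
import xml.etree.ElementTree as ET


def extract_titles(lines):
    """Yield the <title> text of each <entry>...</entry> block found in lines."""
    buff = None  # holds the text of the entry currently being collected
    for line in lines:
        line = line.strip()
        # An opening tag starts a new buffer.
        if buff is None and '<entry>' in line:
            buff = io.StringIO()
        if buff is not None:
            buff.write(line)
            # A closing tag means the entry is complete and can be parsed.
            if '</entry>' in line:
                root = ET.fromstring(buff.getvalue())
                title = root.find('title')
                if title is not None and title.text:
                    yield title.text
                buff.close()
                buff = None


# Hypothetical sample input, shaped like a tiny feed.
sample = [
    '<feed>',
    '  <entry>',
    '    <title>First post</title>',
    '  </entry>',
    '  <entry><title>Second post</title></entry>',
    '</feed>',
]
titles = list(extract_titles(sample))
print(titles)
```

Keeping the logic in a function makes it easy to feed it sys.stdin in a real mapper and a plain list in a quick local check.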
reducer.py
The reducer simply passes the mapper's output through unchanged.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    print line.strip()
Running Hadoop Streaming
$ hadoop jar hadoop-streaming-***.jar -mapper mapper.py -reducer reducer.py -file mapper.py reducer.py -input feed.xml -inputreader "StreamXmlRecordReader,begin=<entry>,end=</entry>" -output feed.out
Note: replace *** with the version you are using.
The -inputreader option specifies the input format passed to the mapper.
To treat the input as XML, specify StreamXmlRecordReader,begin=<entry>,end=</entry>.
With this setting, everything between <entry> and </entry> is handled as a single record and processed by a single mapper*1.
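The effect of the begin and end markers can be illustrated with a few lines of Python 3. This is only a conceptual sketch of cutting text into records between two markers, under the assumption that each record includes both markers. It is not Hadoop's implementation, which also has to deal with input splits, and the function name split_records and the sample feed are hypothetical.

```python
def split_records(text, begin, end):
    """Return every substring that starts at begin and runs through the next end."""
    records = []
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start == -1:
            break  # no further opening marker
        stop = text.find(end, start + len(begin))
        if stop == -1:
            break  # unterminated record: drop it
        stop += len(end)
        records.append(text[start:stop])
        pos = stop
    return records


# Hypothetical sample feed with two entries.
feed = (
    '<feed><updated>2013</updated>'
    '<entry><title>A</title></entry>'
    '<entry><title>B</title></entry>'
    '</feed>'
)
records = split_records(feed, '<entry>', '</entry>')
for record in records:
    print(record)
```

Anything outside the markers, such as the <feed> wrapper and <updated> here, never reaches the mapper in this model, which is why the mapper can parse each record as a standalone XML fragment.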
Check the results
$ hadoop fs -cat "feed.out/*"
【Mac】「Get Plain Text」でEvernoteへのコピペが捗る！
Rails Tutorial 全部読んだので感想とかまとめとか
LIBLINEARのパラメータをグリッドサーチするスクリプト書いた
... (rest omitted) ...
The contents of <title>~</title> were extracted correctly.
Summary
This post explained how to handle XML files with Hadoop Streaming.
As shown above, XML input takes a little extra work.
Once that is taken care of, Hadoop can process large XML files quickly.
This should be useful, for example, when processing Wikipedia dump files.
Give it a try if you get the chance.
Wikipedia:データベースダウンロード - Wikipedia