lxml seems to be the best Python library for scraping, at least in terms of speed
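Over the past few days I learned several ways to strip tags from HTML. It was very educational. Thank you to everyone who taught me.
Specifically, I found that tag stripping is possible with each of three libraries: BeautifulSoup, HTMLParser, and lxml. All of them behave acceptably in practice. But I couldn't decide which one to use, so I did a rough measurement of their execution speed.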
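I'll say it up front: this code is seriously embarrassing. What I really wanted was to put the three functions in a list and loop over them with for, but they were evaluated at the moment I put them in the list, so that was no good. I also couldn't figure out how to pass two functions (the function being measured and the function doing the measuring) to map, so I ended up writing the same thing three times. Really embarrassing... but I couldn't think of anything better.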
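The loop described above does work if the list holds the function objects themselves (names written without parentheses) rather than the results of calling them. A minimal sketch, assuming two stand-in functions `strip_a` and `strip_b` in place of the real tag strippers; these names and the `average_time` helper are hypothetical and not part of the original script. It avoids print so that it runs on both Python 2 and Python 3:

```python
from time import time


def strip_a(html):
    # Hypothetical stand-in for one tag-stripping function.
    return html.replace('<p>', '').replace('</p>', '')


def strip_b(html):
    # Hypothetical stand-in for another tag-stripping function.
    return ''.join(html.split('<p>')).replace('</p>', '')


def average_time(func, arg, repeat):
    """Call func(arg) `repeat` times and return the mean wall-clock time."""
    deltas = []
    for _ in range(repeat):
        start = time()
        func(arg)
        deltas.append(time() - start)
    return sum(deltas) / repeat


html = '<p>hello</p>' * 1000

# Names without parentheses: the functions are stored here, not called.
candidates = [('A', strip_a), ('B', strip_b)]

results = {}
for label, func in candidates:
    results[label] = average_time(func, html, 30)
```

Writing `[strip_a(html), strip_b(html)]` instead would call each function once while the list is being built, which matches the "evaluated when put in the list" symptom.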
For the test HTML I used the Hatena top page, because it contains a fair amount each of comments, style, script, and HTML.
Measurement environment
# coding:utf-8
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from lxml.html import fromstring
from HTMLParser import HTMLParser
from timeit import Timer
from time import time


class TagStrip(HTMLParser):
    # provided by id:aodag
    def __init__(self):
        HTMLParser.__init__(self)
        self.datum = []
        self.instyle = False  # True while inside <style> or <script>

    def handle_data(self, data):
        if data.strip() and not self.instyle:
            self.datum.append(data)

    def getString(self):
        return "".join(self.datum)

    def handle_starttag(self, tag, attrs):
        if tag == 'style' or tag == 'script':
            self.instyle = True

    def handle_endtag(self, tag):
        if tag == 'style' or tag == 'script':
            self.instyle = False


def getHtml(url):
    return urlopen(url).read()


def useBS(html):
    # provided by id:y_yanbe
    # http://python.g.hatena.ne.jp/y_yanbe/20081025/1224910392
    soup = BeautifulSoup(html)
    text = '\n'.join([e.string for e in soup.findAll()
                      if e.string != None and e.name not in ('script', 'style')])
    return text


def useLXML(html):
    # provided by id:Alexandre
    # http://d.hatena.ne.jp/a2c/20081025/1224924646#c1225076104
    et = fromstring(html)
    xpath = r'//text()[name(..)!="script"][name(..)!="style"]'
    text = ''.join([text for text in et.xpath(xpath) if text.strip()])
    return text


def useHP(html):
    p = TagStrip()
    p.feed(html)
    return p.getString()


if __name__ == '__main__':
    url = 'http://www.hatena.ne.jp/'
    repeatCnt = 30
    htmlSource = getHtml(url)
    tmpDelta, timeList = [], {}

    print 'BS start!'
    for i in range(repeatCnt):
        start = time()
        useBS(htmlSource)
        tmpDelta.append(time() - start)
    timeList['BS'] = sum(tmpDelta) / repeatCnt

    print 'LXML start!'
    tmpDelta = []
    for i in range(repeatCnt):
        start = time()
        useLXML(htmlSource)
        tmpDelta.append(time() - start)
    timeList['LXML'] = sum(tmpDelta) / repeatCnt

    print 'HP start!'
    tmpDelta = []
    for i in range(repeatCnt):
        start = time()
        useHP(htmlSource)
        tmpDelta.append(time() - start)
    timeList['HP'] = sum(tmpDelta) / repeatCnt

    print timeList
For now I took the average over 30 runs. I ran it several times and got much the same results without much variation, so I decided to trust them. The results are below (line breaks added).
{
 'BF': 0.58448076248168945,
 'LXML': 0.01511224110921224,
 'HP': 0.03491028149922689
}
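When I was building something at PySpa I used BS and had a vague feeling it was slow; maybe BS was the bottleneck. Stripping tags from around 100 HTML files would mean a difference of about 1 minute (BS) versus about 1 second (lxml), so lxml looks like the only real choice. HTMLParser is also reasonably fast, but it is a hassle that you have to subclass it and override its handlers to deal with script and style.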
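That estimate follows directly from the measured averages: multiply each per-page time by the number of pages. A quick check, assuming the 'BF' entry in the printed results is the BeautifulSoup timing (the script stores it under 'BS') and using the round figure of 100 pages from the text:

```python
# Average seconds per page, copied from the measured results.
# Assumption: the printed 'BF' entry is the BeautifulSoup ('BS') timing.
avg_seconds = {
    'BS': 0.58448076248168945,
    'LXML': 0.01511224110921224,
    'HP': 0.03491028149922689,
}
pages = 100

# Projected total seconds for stripping tags from `pages` documents.
totals = dict((name, t * pages) for name, t in avg_seconds.items())
# BS comes to roughly 58 seconds, LXML to roughly 1.5 seconds.

# How many times slower BS is than LXML per page (a bit under 39x).
ratio = avg_seconds['BS'] / avg_seconds['LXML']
```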
Of course each tool has its place, but for stripping tags from a large number of HTML files, I think lxml is the better fit.
id:aodag, id:y_yanbe, id:Alexandre: thank you for teaching me. I learned a great deal.
As homework, I've added "learn to use Timer gracefully" to my list.
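One possible way to approach that homework, sketched under assumptions: `timeit.Timer` accepts a zero-argument callable (Python 2.6 and later), so each function can be bound to its argument with `functools.partial` and timed in a loop. `strip_tags_naive` and `bench` are hypothetical names used only for illustration; in the script above the callables would be `useBS`, `useLXML`, and `useHP`.

```python
import re
from functools import partial
from timeit import Timer


def strip_tags_naive(html):
    # Hypothetical stand-in for useBS / useLXML / useHP. A regex is not a
    # robust way to strip tags; it only keeps this sketch self-contained.
    return re.sub(r'<[^>]+>', '', html)


def bench(funcs, html, number=30):
    """Return the average seconds per call for each named function."""
    averages = {}
    for name, func in funcs.items():
        # partial() binds the argument without calling the function yet.
        timer = Timer(partial(func, html))
        # timeit() returns the total time for `number` calls.
        averages[name] = timer.timeit(number=number) / number
    return averages


sample = '<div><p>hello</p><p>world</p></div>' * 500
averages = bench({'naive': strip_tags_naive}, sample)
```

With this shape, adding a library to the comparison means adding one entry to the dict instead of copying the timing loop.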
Addendum
I noticed that the tag-stripped text files also differ somewhat from library to library. Posting the text as-is would amount to quoting the whole page, so I hesitated to do that. Here are the wc results instead.
% cat lxml_log.txt|wc
      35     168    7586
% cat BeautifulSoup_log.txt|wc
      24     121    6464
% cat HTMLParce_log.txt|wc
      34     166    7582
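Shorter or longer is not automatically better. But assuming the tags are completely removed, the output with fewer mistaken removals is the longer one, so lxml is probably the better performer here too. With BS, things like &nbsp come through as-is. And if the output is still this short despite that, something else must be getting dropped.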
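To see what is actually missing, the outputs can be compared line by line rather than only by size. A minimal sketch using the standard `difflib` module; the two sample strings are made up for illustration and stand in for the contents of the log files:

```python
import difflib

# Made-up sample outputs standing in for lxml_log.txt and
# BeautifulSoup_log.txt; real data would be read from those files.
lxml_text = "Hatena\nLogin\nHelp\nPopular entries\n"
bs_text = "Hatena\nLogin\nPopular entries\n"

# ndiff prefixes lines found only in the second input with '+ '.
delta = difflib.ndiff(bs_text.splitlines(), lxml_text.splitlines())

# Text that appears in the lxml output but not in the BS output.
only_in_lxml = [line[2:] for line in delta if line.startswith('+ ')]
```

Swapping the two arguments to `ndiff` gives the opposite view, i.e. lines that appear only in the BS output.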