Tips for Crawling Web Pages with Python
Here are some tips I have collected for crawling web pages with Python.
Changing the default user agent of urllib2.urlopen
To open a URL in Python, use urllib2.urlopen.
By default, urllib2.urlopen sends the user agent "Python-urllib/(Python version)", and some sites, Wikipedia among them, respond to this user agent with 403 Forbidden. Changing the default user agent with the following code avoids the 403 error.
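    import urllib2

    # Build an opener whose default headers carry a custom User-Agent.
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'your user agent string')]
    # Install it globally so that later urllib2.urlopen calls use it.
    urllib2.install_opener(opener)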
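On Python 3, urllib2 no longer exists and its functionality lives in urllib.request. As a rough sketch of the same idea there, the header can be set either globally, as above, or on a single request. The URL, the user agent string "my-crawler/0.1", and the two helper names are placeholders chosen for illustration; they do not come from the original code.

```python
import urllib.request

# Placeholder values for illustration only.
URL = "http://example.com/"
USER_AGENT = "my-crawler/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def install_default_user_agent(user_agent=USER_AGENT):
    """Install a global opener, mirroring the urllib2 approach."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-agent", user_agent)]
    urllib.request.install_opener(opener)
    return opener


request = build_request(URL)
# Passing the request to urllib.request.urlopen(request) would perform
# the actual fetch; it is left out here so the sketch runs offline.
```

Setting the header per request keeps the change local, which may be preferable when only some sites reject the default user agent.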
Removing tags from HTML
BeautifulSoup is the best-known library for parsing HTML in Python, but for a task as simple as stripping HTML tags, the standard library's sgmllib does the job fairly easily.
    import sgmllib
    import StringIO


    class TagRemover(sgmllib.SGMLParser):
        def remove_tag(self, some_html):
            # Collect only the text nodes in a buffer while parsing.
            self.stringio = StringIO.StringIO()
            self.feed(some_html)
            self.close()
            result = self.stringio.getvalue()
            self.stringio.close()
            return result

        def handle_data(self, data):
            # Called for text between tags; the tags themselves are dropped.
            self.stringio.write(data)


    def clean_html(some_html):
        return TagRemover().remove_tag(some_html)
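sgmllib was removed in Python 3, so the code above only runs on Python 2. A minimal sketch of the same approach on Python 3, assuming html.parser.HTMLParser as a stand-in for SGMLParser, could look like this. The class and function names are reused from the code above for readability; the sample HTML string is made up.

```python
from html.parser import HTMLParser
from io import StringIO


class TagRemover(HTMLParser):
    """Collect text nodes and drop every tag."""

    def __init__(self):
        super().__init__()
        self._buffer = StringIO()

    def handle_data(self, data):
        # Called for text between tags; tags themselves never reach here.
        self._buffer.write(data)

    def get_text(self):
        return self._buffer.getvalue()


def clean_html(some_html):
    parser = TagRemover()
    parser.feed(some_html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(clean_html("<p>Hello, <b>world</b>!</p>"))  # prints: Hello, world!
```

Note that this keeps the text inside script and style elements as well, since the parser reports it as ordinary data.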
Parsing XML with the ElementTree module
Python's standard library offers several ways to parse XML, but I find the ElementTree module, available since Python 2.5, the easiest to use. I did get tripped up a little by how this module handles XML namespaces, though, so I am noting it here.
Methods of ElementTree.ElementTree such as find and findall expect the namespace of a tag to be written as {namespace} in front of the tag name. When extracting data from RSS, Atom, and similar feeds, the tag's namespace must be specified properly.
ElementTree sample program
    import urllib2
    from xml.etree import ElementTree

    url = 'http://d.hatena.ne.jp/saitodevel01/rss'
    etree = ElementTree.fromstring(urllib2.urlopen(url).read())
    # Without the namespace, the tag is not found.
    print etree.find("item")
    # With the namespace in {braces}, the tag is found.
    print etree.find("{http://purl.org/rss/1.0/}item")
Output
    None
    <Element {http://purl.org/rss/1.0/}item at 10101d248>
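To see the same behavior without a network connection, here is a small sketch that runs against an inline RSS 1.0 fragment on Python 3. The feed content and the example.com URLs are made up for illustration; only the namespace URI is taken from the output above. The sketch also uses the optional namespaces argument that find and findall accept in Python 3, which lets a short prefix of your choosing stand in for the full URI.

```python
from xml.etree import ElementTree

RSS_NS = "http://purl.org/rss/1.0/"

# Made-up RSS 1.0 fragment; a real feed would be fetched from the network.
SAMPLE_FEED = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.com/1"><title>First</title></item>
  <item rdf:about="http://example.com/2"><title>Second</title></item>
</rdf:RDF>"""

root = ElementTree.fromstring(SAMPLE_FEED)

# A bare tag name does not match a namespaced element.
missing = root.find("item")

# Fully qualified form: {namespace}tag.
items = root.findall("{%s}item" % RSS_NS)

# Prefix form: map a prefix of our choosing to the namespace URI.
namespaces = {"rss": RSS_NS}
titles = [item.find("rss:title", namespaces).text
          for item in root.findall("rss:item", namespaces)]

print(missing, len(items), titles)
```

The prefix "rss" is arbitrary; it only has to match the key used in the namespaces dictionary, not whatever prefix the feed itself declares.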
Parsing timestamps
In Hatena's RSS feeds and elsewhere, timestamps appear in the ISO 8601 format. The time and datetime modules of the standard library have no function that parses this format directly, so I use a library called iso8601.
Installing iso8601
    sudo easy_install.py iso8601
or
    sudo pip install iso8601
Using iso8601
    >>> import iso8601
    >>> iso8601.parse_date("2007-01-25T12:00:00Z")
    datetime.datetime(2007, 1, 25, 12, 0, tzinfo=<iso8601.iso8601.Utc ...>)
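For comparison, the single fixed form used in the example above (UTC with a trailing "Z") can be handled with datetime.strptime alone. The following is only a sketch for Python 3 under that assumption: the helper name parse_utc_timestamp is hypothetical, and the function does not cope with other ISO 8601 variants such as "+09:00" offsets or fractional seconds, which is where a dedicated library like iso8601 earns its keep.

```python
from datetime import datetime, timezone


def parse_utc_timestamp(text):
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into a timezone-aware datetime (UTC)."""
    # The literal 'Z' in the format only matches timestamps given in UTC.
    naive = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)


parsed = parse_utc_timestamp("2007-01-25T12:00:00Z")
print(parsed.isoformat())  # prints: 2007-01-25T12:00:00+00:00
```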
Tips for converting strings to Unicode
Web pages use a variety of character encodings, but the usual practice is to keep internal data in Unicode and serialize it as utf-8. Converting a string to Unicode requires specifying the string's encoding; chardet, a library that guesses character encodings, makes the conversion easy.
Installing chardet
    sudo easy_install.py chardet
or
    sudo pip install chardet
Passing a string to chardet.detect returns a dictionary with two keys, encoding and confidence, and the value of the encoding key can be passed directly as the second argument of the unicode function. In addition, setting the third argument of unicode to 'ignore' skips any characters that would otherwise raise UnicodeDecodeError.
    # -*- coding: utf-8 -*-
    import chardet


    def unicode2(raw_string):
        # Guess the encoding of the byte string; None means detection failed.
        encoding = chardet.detect(raw_string)['encoding']
        if encoding is None:
            raise ValueError('Cannot detect encoding')
        # 'ignore' drops bytes that cannot be decoded.
        return unicode(raw_string, encoding, 'ignore')


    if __name__ == '__main__':
        text = 'あいうえお'
        print repr(text)
        print repr(unicode2(text))
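If chardet is not available, a cruder fallback is to try a fixed list of likely encodings in order. This is a different technique from chardet's statistical detection, not a replacement for it, and the sketch below (Python 3) rests on assumptions: the function name decode_with_candidates and the default candidate list are mine, chosen for Japanese pages.

```python
def decode_with_candidates(raw_bytes,
                           candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError("Cannot decode with any candidate encoding")


text, encoding = decode_with_candidates("あいうえお".encode("utf-8"))
print(repr(text), encoding)
```

The order of the candidates matters, because some byte sequences decode without error under more than one encoding; putting the strictest encoding (utf-8) first reduces the chance of a wrong match.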