Scraping with Python + Selenium + Phantom.js + Beautifulsoup
* [Added 2018/04/17]
Maintenance of Phantom.js appears to have ended. Going forward, Google Chrome is the tool to use for handling JavaScript. I cover that in the following article, so please read it as well: zipsan.hatenablog.jp
[End of addition]
I have been playing around with scraping scripts lately, so here are some notes on it.
There seem to be many ways to scrape with Python, but this combination is the one I personally found easiest to use (or maybe I am just used to it?).
Previously I wrote scripts that used Python's urllib.request plus Beautifulsoup to parse the response HTML and follow links one after another. That approach could not pick up elements added by JavaScript, and redirect handling was a real pain, among other annoyances. Using Selenium and Phantomjs takes care of all of that in one go.
Briefly, the flow is: Python drives Selenium, Selenium runs the page's JS in Phantom.js, and the resulting HTML is parsed and analyzed with BeautifulSoup.
Selenium
Selenium is a browser automation tool. It can run web tests across multiple browsers and can also test on Android/iOS, which makes it quite handy (apparently you can also set up a Selenium server and manage everything centrally). Here I use Phantom.js in place of Firefox or Chrome.
Selenium - Web Browser Automation
Phantom.js
Phantom.js runs JavaScript that would normally need a browser, without opening any browser window. Impressive stuff. It looks like it can be driven through an API as well. PhantomJS | PhantomJS
Beautiful soup
Beautiful soup is an HTML/XML parser that turns HTML into something easy to work with. It lets you analyze, search, and select parts of the HTML according to its DOM structure.
Beautiful Soup: We called him Tortoise because he taught us.
Environment setup
Not that there is much to do...
The language is Python 3.4. I confirmed that this works on Linux, Mac, and Windows.
Assuming Python 3 is already installed, start by installing the selenium module for Python:
    pip3 install selenium
次ã«Phantom.jsãå
¥ãã¾ããããã¯ç¹ã«èª¬æããªãã®ã§é©å½ã«å
¥ãã¦ãã ããã
http://phantomjs.org/
Make sure its executable is on your PATH.
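One quick way to check the PATH setup from Python is shutil.which from the standard library. This is only a sketch: find_on_path is a hypothetical helper name, and it assumes the executable is called phantomjs.

```python
import shutil


def find_on_path(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


# Assumption: the binary is named "phantomjs" on this system.
path = find_on_path("phantomjs")
if path is None:
    print("phantomjs was not found on PATH")
else:
    print("phantomjs found at", path)
```

If it reports that nothing was found, add the directory that contains the executable to PATH and run the check again.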
Finally, install BeautifulSoup. For Python 3, bs needs to be converted for Python 3 with the 2to3 command (the official site says so).
Download Beautiful Soup 4 from http://www.crummy.com/software/BeautifulSoup/#Download and convert it with 2to3.
    wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/beautifulsoup4-4.3.2.tar.gz  # latest version as of 2015/04/12
    tar zxf beautifulsoup4-4.3.2.tar.gz
    cd ./beautifulsoup4-4.3.2
    2to3 -w bs4
    python3 setup.py install
If that does not work, you can also drop the converted bs4 directory straight into your library directory after running 2to3.
使ã£ã¦ã¿ã
from selenium import webdriver from bs4 import BeautifulSoup driver = webdriver.PhantomJS() driver.get("http://sukumizu.moe/") data = driver.page_source.encode('utf-8') print(data) driver.save_screenshot("ss.png") driver.quit()
Result
    b'<!DOCTYPE html><html><head>\n\t<title>sukumizu.moe</title>\n\t<link rel="st (snip)
That is how you work with it.
You can take screenshots too.
To specify the UA, do something like this. The example below uses a Chrome UA string.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Copy the default PhantomJS capabilities and override the user agent
    des_cap = dict(DesiredCapabilities.PHANTOMJS)
    des_cap["phantomjs.page.settings.userAgent"] = (
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/28.0.1500.52 Safari/537.36'
    )
    driver = webdriver.PhantomJS(desired_capabilities=des_cap)
    driver.get("http://sukumizu.moe/")
    data = driver.page_source.encode('utf-8')
Parsing the retrieved HTML (Beautiful soup)
    from bs4 import BeautifulSoup

    # ----- snip -----

    html = BeautifulSoup(data)
    print(html)  # show the HTML source
    print(html.title)  # the title tag
    print(html.title.string)  # the text inside the title tag
    print(html.find('h1'))  # the h1 tag
    print(html.find_all('link'))  # list of all link tags
    print(html.find_all('link', attrs={'href': 'style.css'}))  # list of link tags whose href is style.css
Result
    <!DOCTYPE html>
    <html><head>
    <title>sukumizu.moe</title>
    --- (snip) ---
    </body></html>
    <title>sukumizu.moe</title>
    sukumizu.moe
    <h1>What is your favorite "sukumizu"...? </h1>
    [<link ...., <link ...., <link ....]
    [<link href="style.css" rel="stylesheet"></link>]
Knowing this much is probably enough to get by. There is plenty more, so see the bs documentation.
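As one example of those other features, the sketch below uses select() with a CSS selector and reads attribute values with get() and ['href']. The SAMPLE markup is made up for illustration, and passing "html.parser" explicitly is an assumption here rather than something taken from the script above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate the calls.
SAMPLE = """
<html><head>
<title>sample</title>
<link href="style.css" rel="stylesheet">
<link href="print.css" rel="stylesheet">
</head><body>
<h1>Heading</h1>
<ul class="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>
</body></html>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# get() returns None instead of raising when the attribute is missing.
hrefs = [link.get("href") for link in soup.find_all("link")]

# select() takes a CSS selector: every <a> inside <ul class="menu">.
menu_links = [a["href"] for a in soup.select("ul.menu a")]

# get_text() returns the text content of a tag.
heading = soup.h1.get_text(strip=True)

print(hrefs)
print(menu_links)
print(heading)
```

The same calls should work on the html object built from driver.page_source above.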
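For picking out and checking tag elements, the browser's built-in "Inspect Element" and developer tools are handy. In Chrome use the magnifying glass at the top left, and in Firefox the arrow at the top right, to select an element.
Python is nice and easy.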