Scraping with Python+Selenium+Chrome(+BeautifulSoup)
I previously wrote an article titled Python+Selenium+Phantom.js+Beautifulsoupでスクレイピングする - ひよこになりたい ("Scraping with Python+Selenium+Phantom.js+Beautifulsoup").
Back then Phantom.js was very convenient and I used it a lot, because there was no need to install Chrome and it worked just by dropping in a binary. Last year, however, news broke that maintenance of Phantom.js was ending, and I kept thinking I should rewrite that article. But for a while I had only been scraping pages that did not need JavaScript to be executed in real time, so I ended up leaving it as it was (sorry).
Over the past year, headless mode (a mode that runs without opening a browser window) became available by default in stable Chrome from version 59 onward, so a virtual screen such as xvfb is no longer needed as it was before. On top of that, for the first time in a while I needed to scrape a site that requires JavaScript, so I am taking this opportunity to summarize how to use it.
The author of Phantom.js also seems to be saying that people should use Chrome.
Environment setup
Requirements
- Python + pip (I used 3.6.5 here)
- Selenium
- Google Chrome (version 59 or later is fine)
- ChromeDriver
Linux and macOS both work, and Windows probably does too. I used Arch Linux (without a GUI).
Installation
- Python+pip
- Install these however you like.
- Selenium
pip install selenium
This command installs it.
- Google Chrome
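- You can install it with your package manager, or directly from an exe, dmg, rpm, or deb package.
- On CentOS, CentOS7にChromeをインストール - Qiita ("Installing Chrome on CentOS 7") should be a useful reference.
- On Arch Linux, it can be installed with the following command: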
yaourt -S google-chrome-stable
(In my case the library dependencies were broken, so I watched the errors at startup and used downgrade to roll back three or four libraries until Chrome would start.)
- Chrome Driver
- It can be downloaded from Downloads - ChromeDriver - WebDriver for Chrome.
- Download and extract the archive. It should contain a binary named chromedriver. Move it to a directory of your choice and add that directory to your PATH.
echo 'export PATH="$PATH:/path/to/chromedriver_directory"' >> ~/.bashrc
Something like the line above should probably work (I have not verified it).
Trying it out
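Once the directory has been added to PATH, a quick check from Python can confirm that the binary is actually discoverable before Selenium is started. This is a minimal sketch using only the standard library; the helper name `find_chromedriver` is made up for illustration.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver binary if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on PATH; check the export line in ~/.bashrc")
else:
    print("chromedriver found at", path)
```

If the message says the binary was not found, opening a new shell (so that ~/.bashrc is re-read) is the first thing to try.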
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # remove this line to show the browser window
# Current Selenium releases take the keyword `options`; the older `chrome_options` keyword is deprecated.
driver = webdriver.Chrome(options=options)
driver.get("https://sukumizu.moe")

# Title
print(driver.title)
# URL
print(driver.current_url)
# Cookies
print(driver.get_cookies())
# Take a screenshot (prints True on success)
print(driver.save_screenshot("screenshot.png"))
# Get the page source
print(driver.page_source)

# Release the browser process when finished
driver.quit()
As with Phantom.js, all the features you would typically want are available.
Also, if you feed driver.page_source to BeautifulSoup, you can use BeautifulSoup's features just as before. Selenium also has its own element selection functions, so you can scrape with those as well.
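from selenium.webdriver.common.by import By

# The find_element_by_* helpers of older Selenium releases were removed in Selenium 4.3;
# find_element(By.<strategy>, value) works on old and new releases alike.
# Look up by id
driver.find_element(By.ID, 'id')
# Look up by class name
driver.find_element(By.CLASS_NAME, 'classname')
# Get nested elements
driver.find_elements(By.XPATH, ".//div")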
7. WebDriver API - Selenium Python Bindings 2 documentation
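To illustrate the BeautifulSoup route: the value of `driver.page_source` is an ordinary HTML string, so parsing it needs no browser at all. The sketch below assumes the `beautifulsoup4` package is installed (`pip install beautifulsoup4`) and uses a hard-coded HTML string in place of `driver.page_source`. The helper name `extract_links` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


# Stand-in for driver.page_source so the example runs without a browser.
sample_html = """
<html><head><title>Example</title></head>
<body>
  <a href="/first">First</a>
  <a href="/second"> Second </a>
  <a>no href here</a>
</body></html>
"""

for text, href in extract_links(sample_html):
    print(text, href)
```

With a live session, `extract_links(driver.page_source)` would be called in the same way.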
You can also click and type text, so you can do pretty much anything you like.
UI operations require somewhat more involved code, so I will skip them here. If you are interested, please look into them.
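For readers who want a starting point for clicking and typing, here is a rough sketch rather than a complete recipe. `find_element`, `clear`, `send_keys` and `click` are real Selenium calls, but the helper name `fill_and_click`, the URL, and the locators are hypothetical placeholders that must be replaced with values from the actual site.

```python
def fill_and_click(driver, field_locator, text, button_locator):
    """Type text into one element, then click another.

    Each locator is a (strategy, value) tuple such as (By.NAME, "q").
    """
    field = driver.find_element(*field_locator)
    field.clear()          # remove any pre-filled value
    field.send_keys(text)  # simulate typing
    driver.find_element(*button_locator).click()


def main():
    # Imported here so the helper above can be read and tested without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        # Hypothetical page and locators: replace with the real site's values.
        driver.get("https://example.com/search")
        fill_and_click(driver, (By.NAME, "q"), "selenium", (By.ID, "submit"))
        print(driver.current_url)
    finally:
        driver.quit()  # always release the browser process


# Call main() to try this against a real page once the placeholders are filled in.
```

Passing the driver and locators in as arguments keeps the helper independent of any particular page, which also makes it easy to exercise with a stand-in driver object.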