- Introduction
- Web Scraping Basics
- Putting Web Scraping into Practice
- How to Get Started with Web Scraping
- Web Scraping in Python
- Conclusion
Introduction
Hello everyone, this is Fujisaku.
Have you ever had to collect and analyze information from the internet? I suspect most of you have, at least once or twice.
If the website you are researching is only a few pages, you can get by with opening a browser and repeating copy and paste. But once the target grows to dozens or hundreds of pages, gathering the information by hand quickly becomes impractical.
The technique used in such situations is called "web scraping".
I recently had a task that involved collecting and analyzing information from a certain website. Doing it all by hand pushed me to my mental limit, so I started wondering whether web scraping could gather the information for me.
That said, simple web scraping is not especially difficult from a technical standpoint. However, once I looked into it, I found that web scraping comes with a surprising number of non-technical pitfalls, such as legal requirements, that you have to watch out for.
So in this article, I will cover the basics, such as "What is web scraping?" and "What do you need to be careful about?", and then explain how to actually practice web scraping using Python.
Engineers in particular may well be scraping the web without giving it much thought. Whether you are learning about web scraping for the first time or already doing it, I hope this article helps you organize your understanding of it!
Web Scraping Basics
What Is Web Scraping?
Web scraping refers to the technique (and the act) of extracting the information you need from the contents of a website.
Note: this article consistently uses the full term "web scraping", but when people say simply "scraping", they usually mean web scraping as well.
A term you often hear alongside web scraping is "web crawling". Crawling refers to the mechanism of starting from a given URL and collecting information while following the links contained in each web page.
The distinction can be confusing, but a sentence like "you scrape the information that a crawler has collected" captures the difference rather well.
Use Cases for Web Scraping
The basic goal of web scraping is to make work more efficient by automating the collection of large amounts of information instead of doing it by hand.
Typical use cases include:
- Data analysis in marketing and research
- Training data for building machine learning models
How Web Scraping Works
The basic flow of web scraping is as follows.
1. Fetch the HTML of a web page
2. Break down and analyze (parse) the HTML according to its tag structure
3. Pick out the target information and save it to a file or database
4. Repeat steps 1.–3. for multiple pages
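To make the flow concrete before we reach the BeautifulSoup4 examples, here is a sketch of steps 2 and 3 using only the standard library's html.parser, with an inline HTML string standing in for the page fetched in step 1 (the tag structure and data here are invented for illustration):

```python
from html.parser import HTMLParser

# Step 1 would normally be an HTTP fetch; this inline string stands in
# for the downloaded page.
html = '<ul class="items"><li>alpha</li><li>beta</li></ul>'

# Step 2: parse the HTML according to its tag structure.
class ItemExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []  # step 3: the information we keep

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items.append(data)

parser = ItemExtractor()
parser.feed(html)
print(parser.items)  # ['alpha', 'beta']
```

Step 4 would simply wrap this in a loop over multiple page URLs. The BeautifulSoup4 examples later in the article do the same job with far less code.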
Things to Watch Out For
Web scraping is a very powerful means of collecting data, but there are a few points you must be careful about. Your scraping could be treated as an attack on the source site, or could put you in breach of terms of service or copyright law.
- Attacks on the source site
Because web scraping runs as an automated program, it can send the data source a volume of requests that would be impossible for a human. Sending a flood of requests in a short time, however, can delay the source server's processing and, in the worst case, inflict real damage such as taking the server down. There has even been a past case in which someone was arrested over this, so it deserves careful consideration.
So what request frequency is acceptable? Unfortunately, there is no clear standard.
If the website's operator has placed a robots.txt file, it may state which URLs may and may not be crawled and what access frequency is tolerated, so be sure to check it.
robots.txt … a file that a website's operator places for the crawlers of search engines such as Google. It is normally placed at the root of the target domain (e.g., https://xxxx.example.com/robots.txt).
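The standard library's urllib.robotparser module can evaluate these rules programmatically. A small sketch, feeding in a made-up robots.txt directly; against a real site you would instead call set_url() with the robots.txt URL followed by read():

```python
from urllib import robotparser

# A hypothetical robots.txt, fed in directly for illustration.
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Ask whether a given user agent may fetch a given URL,
# and what crawl delay the site requests.
print(rp.can_fetch("*", "https://xxxx.example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://xxxx.example.com/private/page.html"))  # False
print(rp.crawl_delay("*"))  # 10
```

Checking can_fetch() before each request, and honoring crawl_delay when it is present, is an easy way to stay on the polite side of the line discussed above.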
- Violating the terms of service
Some websites explicitly prohibit scraping.
Before scraping, make sure you thoroughly check the terms of service of the source website.
- Violating copyright law
If the information you scrape is protected by copyright, using it in the wrong way constitutes copyright infringement, so caution is required.
Note that Japan's Copyright Act, in Article 30-4, permits copyrighted works to be used without the rights holder's consent when they are "used for information analysis".
Using scraped information for the use cases listed earlier, namely data analysis in marketing and research and training data for building machine learning models, is generally said to be unproblematic.
Note: the author is not a legal expert, so I recommend verifying these points thoroughly yourself.
You are affected by changes on the source site
The basic mechanism of scraping is extracting HTML elements that match a specific pattern, so when the data source changes its specifications or design, the HTML structure changes and your extraction may stop working.
If your scraper stops working correctly, check whether the source's HTML structure has changed, and adjust accordingly.
In addition, even if your scraping was not placing excessive load on the other site, you may end up on its block list and become unable to scrape it at all.
If the source site offers an API, use that instead
As described above, web scraping comes bundled with sensitive issues and hassles, so if the data source publishes the information you need via an API, I think using the API is the safer choice.
Think carefully about whether you really need to scrape at all.
Of course, when using an API, check its terms of use carefully too.
Putting Web Scraping into Practice
How to Get Started with Web Scraping
To practice web scraping, you will choose one of the following approaches.
1. Use a vendor's service or tool
Vendor-provided services are far too numerous to cover exhaustively, so here I will introduce a few Chrome extensions that are easy to adopt.
- Instant Data Scraper
- An extension specialized in scraping listing pages that span multiple pages. On launch it automatically detects the list elements on the screen and extracts them; after that, you only need to point out which button advances to the next page, and it scrapes the rest automatically.
- Scraper
- A very simple scraping tool. It can be run from the browser's right-click menu, so it is handy when you want a quick, lightweight extraction of information from the page you are currently viewing.
- Web Scraper
- A flexible, feature-rich scraping tool that can be configured to crawl multiple pages and handle dynamic pages, though it takes a little effort to learn.
2. Write the program yourself
This is the approach of building a scraper with a programming language such as Python, Java, PHP, or Ruby.
It costs nothing and offers great freedom; for a simple scraping job, an engineer who is used to it may even find this easier than a vendor's service or tool.
Even if you do use a vendor's service or tool, the basic ideas of the program working inside it are the same, so it is worth understanding how a scraper is implemented in order to understand the tool's features.
So let's actually write a Python program that performs web scraping.
Note that this article uses the Python 3 series for its explanations; readers still on Python 2, please bear that in mind.
Why Python?
You can accomplish the same thing in other languages, but Python has an especially rich ecosystem of libraries, books, and other resources for web scraping.
Web Scraping in Python
There are many ways to do web scraping in Python.
This article walks through a relatively basic approach using the Python library BeautifulSoup4.
Other options include Scrapy, a framework specialized for web scraping; libraries such as Selenium that drive an actual web browser; and, though it strays a little from mainstream scraping, data analysis with Pandas.
Preparation
Installing BeautifulSoup4
For parsing HTML, we will use BeautifulSoup4, widely described as the de facto standard library for the job.
BeautifulSoup4 can be installed as follows.
This sample also needs a library for accessing the target website; we use Requests, likewise the de facto standard in Python, so it is installed at the same time.
Note: setting up a Python runtime and installing pip (Python's package management tool) are beyond the scope of this article.
$ pip install beautifulsoup4 requests
Building a mock website
To give our scraper something to work against, let's build a mock website to act as the data source. Place the following files in the same directory where you will run Python.
- index.html
<html>
<head><meta charset="utf-8"/><title>Fruit Shop</title></head>
<body>
<h1>Products</h1>
<ul class="fruits_list">
<li><a href="http://localhost:8000/apple.html">Apple</a></li>
<li><a href="http://localhost:8000/grape.html" class="new_item">Grape</a><span style="color:red">New!</span></li>
<li><a href="http://localhost:8000/banana.html">Banana</a></li>
</ul>
</body>
</html>
- apple.html
<html>
<head><meta charset="utf-8"/><title>Product</title></head>
<body>
<p id="item_name">Apple</p>
<p id="item_price">100 yen</p>
</body>
</html>
- grape.html
<html>
<head><meta charset="utf-8"/><title>Product</title></head>
<body>
<p id="item_name">Grape</p>
<p id="item_price">200 yen</p>
</body>
</html>
- banana.html
<html>
<head><meta charset="utf-8"/><title>Product</title></head>
<body>
<p id="item_name">Banana</p>
<p id="item_price">79800 yen</p>
</body>
</html>
Starting the web server
In the directory where you placed the files above, run the following command to start a web server using Python's built-in server feature.
$ python -m http.server 8000
Launch a browser and access http://localhost:8000; if the product list page is displayed, everything is working.
From the next section onward we will run the Python scraping programs; the assumption is that you open a second terminal, separate from the one running the command above, and run them there.
Beginner: extracting a single, specific element
Let's start by fetching, from the fruit shop site we just built, the name of the product marked "New!".
Save the following Python code in a file named scraping_beginner.py.
- Example Python code
import requests
from bs4 import BeautifulSoup

target_url = "http://localhost:8000/"
r = requests.get(target_url)  # access target_url and fetch the HTML
r.encoding = 'utf-8'  # specify the character encoding explicitly to avoid garbled text
bs = BeautifulSoup(r.text, 'html.parser')
new_fruit = bs.find(class_='new_item')  # search the HTML for the target element
print(new_fruit.text)
- Execution result
$ python scraping_beginner.py
Grape
- Code walkthrough
bs = BeautifulSoup(r.text, 'html.parser')
This line parses the HTML string and converts it into an object that is easy to handle from a Python program.
The first argument, r.text, holds the HTML string from the response received when accessing target_url.
Here we access a real website, but it is also possible to pass a string directly, as shown below, or to pass a file object. This comes in handy during development and debugging.
bs = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
The second argument, html.parser, specifies which library should perform the HTML parsing.
The standard Python parsers, including the one above, are:
・html.parser … the parser built into Python's standard library. Easy to use with no extra setup; on the other hand, it is written in Python and is slow.
・lxml … a fast parser written in C. Recommended if you are going to process large amounts of HTML, but note that lxml must be installed on the OS separately.
・html5lib … interprets HTML flexibly even when it is somewhat malformed, such as HTML that is not written correctly. On the other hand, it is very slow, and since it is not part of Python's standard library, it must be installed separately with pip.
new_fruit = bs.find(class_='new_item')
This line searches the parsed HTML elements for the one we are after.
In this example, it returns the first element found among those whose class is set to new_item.
Note: the keyword is spelled class_ because the string class is a reserved word in Python.
BeautifulSoup provides the functionality needed to access HTML elements. To make the picture concrete, here are a few examples.
・Get the value of the href attribute of the <a> element whose ID is name:
bs.find('a', id='name')['href']
・Get the <p> child elements of a <div> whose class is parent:
bs.select('div.parent > p')
Intermediate: extracting multiple repeated elements from the top page
Next, let's fetch all of the product names, together with the URLs of their detail pages, from the fruit shop site.
Save the following Python code in a file named scraping_intermediate.py.
- Example Python code
import requests
from bs4 import BeautifulSoup

target_url = "http://localhost:8000/"
r = requests.get(target_url)
r.encoding = 'utf-8'
bs = BeautifulSoup(r.text, 'html.parser')
fruits = bs.find('ul', class_='fruits_list').find_all('li')
for fruit in fruits:
    fruit_name = fruit.a.text
    fruit_detail_url = fruit.a['href']
    print(fruit_name)
    print(fruit_detail_url)
- Execution result
$ python scraping_intermediate.py
Apple
http://localhost:8000/apple.html
Grape
http://localhost:8000/grape.html
Banana
http://localhost:8000/banana.html
- Code walkthrough
fruits = bs.find('ul', class_='fruits_list').find_all('li')
This line searches for elements in two stages.
The first stage, find, locates the <ul> element whose class is fruits_list; the second stage fetches all of the <li> elements among its children.
As explained in the beginner section, find returns only the first element found, whereas find_all returns every element found, as a list.
for fruit in fruits:
    fruit_name = fruit.a.text
    fruit_detail_url = fruit.a['href']
This loop walks over the list obtained with find_all and pulls out each item's text and its href attribute.
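One practical note: on our mock site the href values are absolute URLs, but on many real pages they are relative (e.g. href="apple.html"). Before requesting them, you would resolve each href against the URL of the page it came from using the standard library's urllib.parse.urljoin:

```python
from urllib.parse import urljoin

page_url = "http://localhost:8000/"

# Relative hrefs as they might appear in a real page's <a> tags.
hrefs = ["apple.html", "/grape.html", "http://localhost:8000/banana.html"]

# urljoin resolves each href against the URL of the page it came from;
# hrefs that are already absolute pass through unchanged.
urls = [urljoin(page_url, href) for href in hrefs]
print(urls)
# ['http://localhost:8000/apple.html', 'http://localhost:8000/grape.html',
#  'http://localhost:8000/banana.html']
```

Resolving links this way keeps the crawler working regardless of how the site chooses to write its URLs.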
Advanced: extracting multiple elements from multiple pages and writing them to CSV
So far we have only extracted information from a single web page.
Finally, although it is extremely rudimentary, let's build a program that behaves like a web crawler.
We will collect the price information contained in the detail pages of every product listed on the fruit shop's top page, and output it to CSV.
Save the following Python code in a file named scraping_advanced.py.
- Example Python code
import csv
import requests
import time
from bs4 import BeautifulSoup

target_url = "http://localhost:8000/"
r = requests.get(target_url)
r.encoding = 'utf-8'
bs = BeautifulSoup(r.text, 'html.parser')
fruits = bs.find('ul', class_='fruits_list').find_all('li')
fruit_prices = []
for fruit in fruits:
    time.sleep(5)  # wait a fixed interval between requests
    fruit_name = fruit.a.text
    fruit_detail_url = fruit.a['href']
    r_detail = requests.get(fruit_detail_url)
    r_detail.encoding = 'utf-8'
    bs_detail = BeautifulSoup(r_detail.text, 'html.parser')
    fruit_price = bs_detail.find('p', id='item_price').text
    fruit_prices.append([fruit_name, fruit_price])

with open('fruit_prices.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    for fruit_price in fruit_prices:
        csvwriter.writerow([fruit_price[0], fruit_price[1]])
- Execution result
$ python scraping_advanced.py
Now let's look at the CSV that was written.
$ cat fruit_prices.csv
Apple,100 yen
Grape,200 yen
Banana,79800 yen
Just as planned, the program visited each page and collected the information we needed.
- Code walkthrough
r = requests.get(target_url)                                   … [A]
  :
fruits = bs.find('ul', class_='fruits_list').find_all('li')    … [B]
  :
for fruit in fruits:
  :
    fruit_detail_url = fruit.a['href']
    r_detail = requests.get(fruit_detail_url)                  … [C]
Simple as it is, this sequence of steps is exactly what makes the program "behave like a web crawler": [A] accesses the top page, [B] fetches the list of fruits displayed there, and [C] accesses each fruit's detail page.
time.sleep(5)  # wait a fixed interval between requests
This waits five seconds before accessing the next page.
This step is extremely important: as already discussed in this article, if you omit a fixed wait between requests, you will place heavy load on the data source's server, so make absolutely sure to sleep.
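If you want to guarantee a minimum gap between requests no matter where in the loop they happen, one common refinement is a small throttle helper that sleeps only as long as actually necessary. A minimal sketch; the Throttle class is my own illustration, not part of the sample above:

```python
import time

class Throttle:
    """Ensure at least `interval` seconds between successive calls to wait()."""
    def __init__(self, interval):
        self.interval = interval
        self.last = None  # monotonic timestamp of the previous request

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)  # only sleep off the remaining gap
        self.last = time.monotonic()

throttle = Throttle(5)  # use a small value like Throttle(0.1) when experimenting
# In the crawler loop, replace time.sleep(5) with:
# for fruit in fruits:
#     throttle.wait()
#     r_detail = requests.get(fruit.a['href'])
```

Unlike a bare time.sleep(5), this does not add a pointless pause when the work between requests already took longer than the interval.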
with open('fruit_prices.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    for fruit_price in fruit_prices:
        csvwriter.writerow([fruit_price[0], fruit_price[1]])
This writes the scraped information to a CSV file named fruit_prices.csv.
This example writes CSV, but you can swap this step for whatever suits your use case, such as inserting into a database or writing an Excel file.
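As one example of swapping out the output step, here is how the same fruit_prices list could be stored in SQLite with the standard library's sqlite3 module; the table and database file names are invented for illustration:

```python
import sqlite3

# The same structure scraping_advanced.py builds up before its output step.
fruit_prices = [['Apple', '100 yen'], ['Grape', '200 yen'], ['Banana', '79800 yen']]

conn = sqlite3.connect('fruit_prices.db')  # ':memory:' also works for experiments
conn.execute('CREATE TABLE IF NOT EXISTS fruit_prices (name TEXT, price TEXT)')
conn.executemany('INSERT INTO fruit_prices VALUES (?, ?)', fruit_prices)
conn.commit()

# Read the rows back to confirm the insert.
for row in conn.execute('SELECT name, price FROM fruit_prices'):
    print(row)
conn.close()
```

Because only this final block changes, the crawling and parsing logic stays untouched whichever storage backend you choose.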
It is a very simple sample, but this covers the basic mechanics of web scraping and of a web crawler.
Conclusion
So, what did you think?
This article introduced example Python implementations alongside the background knowledge and precautions you need when making use of web scraping.
Why not put web scraping to work streamlining your own tasks?
Whatever you do, please be careful to:
・avoid putting excessive load on the data source's servers
・follow the terms of service
・follow copyright law
With that, happy web scraping, everyone!