Happy New Year! I look forward to another good year with you all.
This article introduces Goutte, a web scraping library written in PHP.
What is Goutte?
Goutte is a web scraping library with just the features you need. Web scraping means fetching the data you want from external web pages, so you can think of Goutte as a tool that makes web scraping easy.
More concretely, Goutte is something like a web crawler combined with an HTML parser. It covers the basics of a web browser, such as handling cookies and forms, and it lets you select elements with CSS-style selectors, so in terms of features I think it holds up well against other libraries.
What I personally expect most from Goutte is stability and long-term support. Goutte implements its main features with Symfony2 and Zend Framework components, and Goutte itself only glues them together. In fact, the Goutte code itself is only about 300 lines long, so it should be quite easy to hack on.
Goutte has not had an official release yet; all it has is a project page on GitHub. That said, pull requests were still being merged until recently, so its author, Fabien, does not seem to have lost interest. I think it is already usable in practice, and I hope an official release follows at a suitable moment.
Incidentally, "goutte" appears to be French for "drop" or "water droplet".
Features of Goutte
I find the following three points distinctive about Goutte.
CSS selectors
Goutte uses a class (CssSelector) that converts CSS selectors into XPath. This lets you select elements with CSS selectors when scraping, which is reassuring for people who are not comfortable with or not familiar with XPath.
<?php
/* omitted */
$timestampStr = $crawler->filter('div.paragraph:first-child span#timestamp')->text();
This class is provided by Symfony2 and is a PHP port of the Python library lxml (source: "Parsing XML documents with CSS selectors - Fabien Potencier"). As long as you write straightforward CSS selectors it should work without trouble, but if you need the detailed specification you will have to consult the lxml manual.
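To give a feel for what "converting a CSS selector into XPath" means, here is a minimal sketch using only PHP's built-in DOM extension. The sample markup and the XPath expression are my own assumptions: the expression is a hand-written approximation of the selector shown above, not the actual output of CssSelector.

```php
<?php
// Hypothetical sample markup; element names mirror the selector used above.
$html = <<<HTML
<html><body>
<div class="paragraph first"><span id="timestamp">2011-01-01 00:00</span></div>
<div class="paragraph"><span>other</span></div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hand-written XPath for 'div.paragraph:first-child span#timestamp'.
// An approximation only, not what CssSelector itself would generate.
$expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' paragraph ')]"
      . "[not(preceding-sibling::*)]"
      . "//span[@id='timestamp']";

$nodes = $xpath->query($expr);
$timestampStr = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $timestampStr, "\n";
```

The class test alone takes a fairly long XPath predicate, which is the kind of detail a CSS selector hides.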
Rich browser features
Goutte uses the following classes to implement its browser features.
- Zend_Http_Client
- DomCrawler
- BrowserKit
By combining these, Goutte can do everything a browser normally does: automatically send received cookies with the next request, click a specific link, or fill a form with suitable values and POST it.
Of the classes above, everything except the HTTP communication part has a proven track record in Symfony2's functional tests. A functional test of a web application pretends to be a browser, accesses the application, moves through several pages, and checks whether the HTML output is as expected. Given that origin, it is only natural that these classes behave like a browser.
A short code base
As mentioned earlier, Goutte itself is a very small library of about 300 lines, comments included. For most of its features it ties together components from Symfony2 and Zend Framework.
Given how many users Symfony2 and Zend Framework have, these external components can be expected to be of high quality and to be supported for a long time. And because the part Goutte implements itself is very small, we can also expect few bugs and easy maintenance there. In short, Goutte is a good choice when stability and a long maintenance period matter.
Seen from another angle, Goutte shows that a mid-sized tool can be built easily by combining well-componentized libraries. I doubt there is a programmer who cannot write 300 lines of code, so it is also a wonderful project in the sense that it gives the rest of us something to dream about.
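Installing Goutte
Once you have PHP 5.3 or later, all you need to do is place a single phar file in a directory of your choice. Save the file from the download URL and name it goutte.phar.
Put it in the same directory as your PHP script and the installation is complete. If you want something more elaborate, or prefer to use the git command, adapt these steps as you see fit.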
A Goutte example
Let's extract a few pieces of text from the Hatena Keyword page for 紺野あさ美 (Asami Konno); Hatena Keyword is a well-known target for web scraping samples.
<?php
require __DIR__.'/goutte.phar';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE');
// extract() yields one array per matched node; the nested list() takes the first one.
list(list($title, $url)) = $crawler->filter('div.keyword-container a.title')->extract(array('_text', 'href'));
$furigana = $crawler->filter('div.keyword-container span.furigana')->text();
var_dump($title, $url, $furigana);
Running the code above produces the following result.
string(15) "紺野あさ美"
string(39) "/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE"
string(18) "こんのあさみ"
Goutte quick reference
The information needed to use Goutte felt too scattered to me, so I have summarized the main features below.
Settings for HTTP requests
The first argument of the Goutte\Client constructor holds the configuration parameters for Zend_Http_Client. So, for example, you can change the user agent with the following code.
<?php
require __DIR__.'/goutte.phar';
use Goutte\Client;
// Configuration is passed through to Zend_Http_Client.
$config = array('useragent' => 'MyRobot/1.1');
$client = new Client($config);
$crawler = $client->request('GET', 'http://example.com/');
Of the default settings, the following four differ from the defaults of Zend_Http_Client.
- Maximum number of redirects to follow (maxredirects): 0 (do not follow)
- Connection timeout in seconds (timeout): 30
- Whether to URL-encode cookie values (encodecookies): false (do not encode)
- User agent (useragent): "Symfony2 BrowserKit"
For the other configuration parameters, see the official Zend_Http_Client documentation. I have not tried them, but using an HTTP proxy, using client certificates, and changing the Zend_Http_Client connection adapter should all be possible.
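The sketch below illustrates how caller-supplied settings would combine with the four defaults listed above. It assumes that values passed to the constructor take precedence over those defaults; I have not verified how Goutte merges them internally, and the variable names and the overriding values are made up for illustration.

```php
<?php
// The four defaults that, per the list above, differ from Zend_Http_Client's own.
$goutteDefaults = array(
    'maxredirects'  => 0,
    'timeout'       => 30,
    'encodecookies' => false,
    'useragent'     => 'Symfony2 BrowserKit',
);

// Hypothetical caller-supplied settings.
$userConfig = array('useragent' => 'MyRobot/1.1', 'timeout' => 10);

// With string keys, array_merge lets values from later arrays win.
$effective = array_merge($goutteDefaults, $userConfig);
var_dump($effective);
```

Under that assumption only the keys you pass change, and the remaining defaults, such as maxredirects, stay as they are.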
Methods for HTTP requests
To set a custom request header, use the setHeader method.
<?php
/* omitted */
$client = new Client();
$client->setHeader('X-Nantoka-Id', 'abcd0123');
$crawler = $client->request('GET', 'http://example.com/');
To use Basic authentication, use the setAuth method.
<?php
/* omitted */
$client = new Client();
$client->setAuth('id', 'password');
$crawler = $client->request('GET', 'http://example.com/');
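For reference, HTTP Basic authentication comes down to a single request header. The sketch below builds that header by hand in plain PHP, independently of Goutte, using the same placeholder credentials as the example above. It shows what travels over the wire, not how setAuth is implemented.

```php
<?php
// Placeholder credentials, matching the example above.
$id = 'id';
$password = 'password';

// Basic authentication sends base64("id:password") in the Authorization header.
$headerValue = 'Basic ' . base64_encode($id . ':' . $password);
echo 'Authorization: ', $headerValue, "\n";
```

Since base64 is an encoding and not encryption, the credentials are only as safe as the connection they travel over.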
Methods for page navigation
Goutte\Client extends Symfony2's BrowserKit\Client, so the following methods are available.
followRedirects($followRedirect) | Sets whether to follow redirects automatically (they are followed automatically by default) |
click($link) | Clicks a link |
submit($form, $values) | Submits a form |
request($method, $uri, $parameters, $files, $server, $content, $changeHistory) | Performs an HTTP request |
back() | Goes back to the previous page using the browser history |
forward() | Goes forward to the next page using the browser history |
reload() | Reloads the current page |
followRedirect() | Moves to the redirect target |
Here is an example of the request method.
<?php
/* omitted */
$crawler = $client->request('GET', 'http://www.symfony-project.org/');
For details, see the explanation of functional tests in the Symfony2 documentation.
Methods for DOM manipulation
The request, click, and submit methods of Client each return an object of Symfony2's DomCrawler\Crawler class. With the methods of this class you can pull information out of the HTML and obtain objects for links and forms.
The following methods can be used to narrow down nodes.
filter('h1') | Nodes matching the CSS selector |
filterXpath('h1') | Nodes matching the XPath expression |
eq(1) | The node at the given index |
first() | The first node |
last() | The last node |
siblings() | Sibling nodes |
nextAll() | Following sibling nodes |
previousAll() | Preceding sibling nodes |
parents() | Parent nodes |
children() | Child nodes |
reduce($lambda) | Nodes for which the callable does not return false |
selectLink($value) | Selects all links containing the given text |
selectButton($value) | Selects all buttons containing the given text |
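The following methods can be used to extract information.
attr($attribute) | Returns the value of the given attribute of the first node |
text() | Returns the value of the first text node |
extract($attributes) | Extracts, from all nodes, the values of the attributes given as an array (_text means the value of the text node) |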
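The shape of the value returned by extract() is easier to see in code. The sketch below imitates the behaviour described above with PHP's built-in DOM extension, assuming one row per matched node. The extractAttributes helper and the sample markup are hypothetical and are not part of Goutte or DomCrawler.

```php
<?php
// Hypothetical helper imitating the described behaviour of extract();
// it is not part of Goutte or DomCrawler.
function extractAttributes(DOMNodeList $nodes, array $attributes)
{
    $rows = array();
    foreach ($nodes as $node) {
        $row = array();
        foreach ($attributes as $name) {
            // '_text' stands for the node's text; anything else is an attribute name.
            $row[] = ($name === '_text') ? $node->textContent : $node->getAttribute($name);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Made-up markup with two matching links.
$doc = new DOMDocument();
$doc->loadHTML('<div><a class="title" href="/keyword/foo">Foo</a><a class="title" href="/keyword/bar">Bar</a></div>');
$xpath = new DOMXPath($doc);
$rows = extractAttributes($xpath->query("//a[@class='title']"), array('_text', 'href'));

// A nested list() takes the first row apart, as in the Hatena Keyword example.
list(list($title, $url)) = $rows;
var_dump($title, $url);
```

Each matched node contributes one inner array, which is why the earlier example needs a nested list() to reach the first title and URL.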
The following methods return objects corresponding to links and forms. These objects can be passed as arguments to the click and submit methods of Client.
form() | Returns the Form object for the form that contains the first node |
link() | Returns the Link object for the first node |
Here is an example of clicking a link.
<?php
/* omitted */
$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);
Here is an example of submitting a form.
<?php
/* omitted */
$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));
The Symfony2 documentation also has an explanation of the Crawler, so have a look at that as well.
References
- fabpot/Goutte - GitHub
- Zend Framework: Documentation: Introduction - Zend Framework Manual
- Testing | Symfony2 Japanese documentation
- Parsing XML documents with CSS selectors - Fabien Potencier
- lxml.cssselect
- Introducing four new PHP 5.3 components and Goutte, a simple web scraper « php|architect – The site for PHP professionals
- How Symfony2 is used, as seen through Goutte : Asial Blog
Summary
This article introduced Goutte, a web scraping library that combines a small code base with enough practicality for real use. Partly because a pull request I sent earlier has been merged, at this point it is a tool I have no complaints about. I hope that gathering the scattered information in one place helps it gain more users.
On the other hand, I do not do much web scraping myself, so I may be too partial to Goutte. If you find yourself thinking "it's a pity this feature is missing", please let me know. For example, Diggin apparently interprets robots.txt, but Goutte has no such feature.