This article continues from the previous one.
So far we have used xpath + curl + cookies.
xpath is very handy, so here are the basics again (taken from a summary I wrote earlier).
xpath | Meaning |
---|---|
//* | All nodes |
//a | All <a> nodes |
(//a)[1] | Collect all <a> nodes, then take the first one |
(//a)[2] | Collect all <a> nodes, then take the second one (array-style access) |
(//a[1]) | Every <a> that is the first <a> under its parent node |
//a/span | All span nodes whose parent is an <a> |
//a/@href | The href attributes of all a nodes |
//a/text() | The text() of all a nodes |
//a[@href="/index.html"] | All a nodes whose href attribute equals "/index.html" |
//a[contains(@href,"index.html")] | All a nodes whose href attribute contains index.html |
//title \| //meta | Both the title and the meta tags |
//img[ contains(@src,'jpg') or contains(@src,'png')] | All img nodes whose src contains jpg or png |
//div[ contains(@class,'link') and contains(@class,'book')] | Roughly div.link.book (substring match on class) |
//form[ ./input[@name="username"] ] | form nodes that have an input[@name="username"] as a child node |
//div[@id="main"]//form | All form nodes that are descendants of div[@id="main"] |
//div/* | All child elements of div |
//div//* | All descendant elements of div |
//table//td[2] | Inside a table, the second td of each row (the whole second column) |
//*[@id] | All nodes that have an id attribute |
id("tid_123") | The node whose id attribute is tid_123 (id="tid_123") |
xpath practice
List the links (the href of every a element) on this page.
curl -s http://takuya-1st.hatenablog.jp/ | xpath "//a/@href"
Extract the title and meta elements from this page.
curl -s http://takuya-1st.hatenablog.jp/ | xpath "//title | //meta"
Extract the second form on this page.
curl -s http://takuya-1st.hatenablog.jp/ | xpath "(//form)[2]"
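The output of `xpath "//a/@href"` still carries the attribute name, so a small filter helps when only the URLs are wanted. The following is a minimal sketch, assuming the xpath command prints each attribute as `href="..."` with double quotes; the helper name `attr_values` is made up for illustration and is not part of xpath or curl.

```shell
# attr_values NAME
# Read xpath output on stdin and print one attribute value per line.
# Assumption: attributes appear as NAME="value" with double quotes.
attr_values() {
  name=$1
  # grep -o prints every NAME="..." match on its own line,
  # then sed strips the leading NAME=" and the trailing quote.
  grep -o "${name}=\"[^\"]*\"" | sed -e "s/^${name}=\"//" -e 's/"$//'
}

# Possible usage (needs network access and the xpath command):
#   curl -s http://takuya-1st.hatenablog.jp/ | xpath "//a/@href" 2>/dev/null | attr_values href
```

Because the pattern is a plain substring match, an attribute such as `data-href` would also be picked up, so treat the result as a first pass.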
Update checks with curl + xpath + md5sum
Treat a page update as a change in the inner HTML of one element, and detect updates by tracking changes to that element.
    #!/bin/bash
    url="http://localhost/"
    xpath_exp="(//div[contains(@class, 'main')])[2]"

    # Digest of the watched element at startup.
    # The expression contains spaces, so it must be quoted.
    digest=$(curl -s "$url" | xpath "$xpath_exp" 2>/dev/null | md5sum)

    while true; do
      current=$(curl -s "$url" | xpath "$xpath_exp" 2>/dev/null | md5sum)
      if [[ "$digest" != "$current" ]]; then
        echo 'changed!!'
        # Send a notification here, e.g. with sendmail.
        digest=$current
      fi
      sleep 1
    done
In this way, curl and xpath let you reliably track updates to a page.
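The comparison step can be pulled out into a function so that it can be tried without touching the network. This is only a sketch of the same idea: the names `fingerprint` and `check_once` are hypothetical, and the fetch command is passed in as arguments, so anything that prints the fragment to stdout will do.

```shell
# fingerprint: read a fragment on stdin and print only the hex digest.
fingerprint() {
  md5sum | cut -d ' ' -f 1
}

# check_once PREVIOUS_DIGEST COMMAND [ARGS...]
# Run COMMAND, print the digest of its output, and
# return 0 if it differs from PREVIOUS_DIGEST, 1 if it is the same.
check_once() {
  previous=$1
  shift
  current=$("$@" | fingerprint)
  printf '%s\n' "$current"
  [ "$current" != "$previous" ]
}

# Possible usage with the variables from the script above:
#   fetch() { curl -s "$url" | xpath "$xpath_exp" 2>/dev/null; }
#   if new=$(check_once "$digest" fetch); then
#     echo 'changed!!'
#     digest=$new
#   fi
```

Passing the fetch command as an argument keeps the digest logic independent of curl and xpath, which makes it easy to check with a plain `printf` first.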
Fetching consecutive pages with curl
curl can also fetch a sequence of pages in a single run. That is what the --next option is for.
Fetching data after a form submission with --next
With --next, curl can behave much like an ordinary crawler.
The following example logs in to pitapa.com and then moves on to the top page.
    curl -v -k -c pitapa.cookie.yml -F id=takuyaXXX -F password=XXXXX \
        https://www2.pitapa.com/member/login.do \
        --next -k -c pitapa.cookie.yml https://www2.pitapa.com/member/top.do
--next can be chained as many times as you like. Handy!
Fetching even more data in one go
You can list URLs and have curl access and fetch them one after another. That is the --config option.
The page.conf file
List the URLs to access, and curl goes and fetches each page.
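To see what a longer chain looks like before running it, the argument list can be generated from a list of URLs. The sketch below is a dry run only: `next_chain` is a hypothetical helper, the example.com URLs are placeholders, and it assumes that neither the cookie jar name nor the URLs contain whitespace or quotes.

```shell
# next_chain JAR URL...
# Print one curl argument per line: a "-c JAR URL" group for each URL,
# with --next between the groups (the same shape as the command above).
next_chain() {
  jar=$1
  shift
  sep=""
  for url in "$@"; do
    # Emit the separator before every group except the first.
    if [ -n "$sep" ]; then
      printf '%s\n' "$sep"
    fi
    printf '%s\n' -c "$jar" "$url"
    sep="--next"
  done
}

# Dry run: just print the arguments for review.
#   next_chain cookie.txt https://example.com/login https://example.com/top
# To actually run it (under the no-whitespace assumption above):
#   next_chain cookie.txt https://example.com/login https://example.com/top | xargs curl -s
```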
    url="http://www.yahoo.co.jp/"
    output="yahoo.html"
    url="https://qiita.com"
    output="qiita.com.html"
    url="http://b.hatena.ne.jp/"
    output="hatebu.html"
The list of pages can be built with xpath.
Consecutive fetching
Give curl the --config/-K option and it fetches the listed URLs one after another.
curl -s -K page.conf
This makes it possible to access, one after another, the list of URLs built with xpath. Handy.
You can also specify user-agent and so on
As the name config suggests, you can write the command-line options to pass to curl in this file.
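Writing the config by hand gets tedious once the URL list comes from a program. As a rough sketch, a list with one URL per line can be turned into url/output pairs like this. The function name `urls_to_config`, the file name urls.txt, and the numbered output names are all assumptions for illustration, and URLs containing a double quote are not handled.

```shell
# urls_to_config
# Read URLs on stdin (one per line) and print a curl config that saves
# each page to a numbered file: 1.html, 2.html, ...
urls_to_config() {
  n=0
  while IFS= read -r url; do
    # Skip blank lines.
    [ -n "$url" ] || continue
    n=$((n + 1))
    printf 'url="%s"\noutput="%s.html"\n' "$url" "$n"
  done
}

# Possible usage, with a hypothetical urls.txt holding one URL per line:
#   urls_to_config < urls.txt > page.conf
#   curl -s -K page.conf
```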
    user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36"
    url="http://www.yahoo.co.jp/"
    output="1.html"
    url="http://www.yahoo.co.jp/"
    output="2.html"
    url="http://www.yahoo.co.jp/"
    output="3.html"
    url="http://www.yahoo.co.jp/"
    output="4.html"
    url="http://www.yahoo.co.jp/"
    output="5.html"
    url="http://www.yahoo.co.jp/"
    output="6.html"
    url="http://www.yahoo.co.jp/"
    output="7.html"
The --libcurl option
When the curl command alone is no longer enough, curl can emit C source code for the same request.
curl --libcurl get_urls.c -s -k -K curl.conf
Doing so generates get_urls.c:
    /********* Sample code generated by the curl command line tool **********
     * All curl_easy_setopt() options are documented at:
     * http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
     ************************************************************************/
    #include <curl/curl.h>

    int main(int argc, char *argv[])
    {
      CURLcode ret;
      CURL *hnd;

      hnd = curl_easy_init();
      curl_easy_setopt(hnd, CURLOPT_URL, "http://www.yahoo.co.jp/");
      curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
      curl_easy_setopt(hnd, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36");
      curl_easy_setopt(hnd, CURLOPT_MAXREDIRS, 50L);
      curl_easy_setopt(hnd, CURLOPT_SSL_VERIFYPEER, 0L);
      curl_easy_setopt(hnd, CURLOPT_SSL_VERIFYHOST, 0L);
      curl_easy_setopt(hnd, CURLOPT_TCP_KEEPALIVE, 1L);
(rest omitted)
This lets you take a frequently used access pattern, compile the generated C, and turn it into a command of its own. Fun.
Summary
- curl is handy
- curl handles cookies well
- For batch processing with curl, use --next or a config file
- xpath is fun
- A command built with curl can be reused as libcurl C source.
curl is so handy that it is invaluable when building scrapers.
xpath references
http://yakinikunotare.boo.jp/orebase/index.php?XML%2FXPath%2FXPath%A4%CE%BD%F1%A4%AD%CA%FD