This article was written as part of the [Crawler / Web Scraping Advent Calendar 2015].
Sorry for publishing it late.
Goal of this article
We will learn how to use the curl command while doing some scraping along the way.
What this article covers
- curl
- curl + grep
- curl -s
- curl + md5sum
- curl + md5sum + mail
- curl + cookie
- curl + cookie + xpath
- curl + xpath + xpath
- Running it as a shell script
What to prepare
Useful things to know
- css2
- css3
- xpath
- jq
Dealing with JS
The basic policy is "do not deal with JS".
After all, if you look closely at the requests, you can work out what is going on.
Fetching web pages with the curl command
Before we start scraping, let me introduce the curl command.
The curl command is a command-line front end to libcurl. Depending on its compile options, curl can handle almost any kind of file transfer, including http2 / scp / ssh / ftp. It is mainly used to access HTTP servers.
Learning the curl command
curl basics
curl http://qiita.com
Viewing the HTTP response headers (HEAD request) with curl
curl -I http://qiita.com
Following HTTP 302/301 redirects with curl
curl -L http://qiita.com
Saving specific data with curl
curl -L http://qiita.com > out.html
Saving a file
curl http://cdn.qiita.com/assets/siteid-reverse-9b38e297bbd020380feed99b444c6202.png > out.png
Saving under the file name in the URL
curl -O https://i.gyazo.com/f609d81c30b580c9015a890643ecc604.png
Showing the exchange with the server in detail
curl -v -L http://qiita.com > out.html
Showing a progress bar instead of the progress percentage
curl -# -O URL
Hiding the progress display entirely
curl -s URL
If you remember this much, you can handle most situations.
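The options above can be combined into one small helper. The following is a minimal sketch rather than anything from the original article: the function name `fetch_to_file` is hypothetical, and `-S` (show errors even in silent mode), `-f` (fail on HTTP errors) and `-o` (name the output file) are additional curl options brought in here on the assumption that they are useful in scripts.

```shell
# fetch_to_file URL OUTPUT_FILE   (hypothetical helper name)
#   -s  hide the progress meter
#   -S  still print an error message if the transfer fails
#   -f  return a non-zero exit status on HTTP errors such as 404
#   -L  follow 301/302 redirects
#   -o  write the body to OUTPUT_FILE instead of stdout
fetch_to_file() {
  curl -sSfL -o "$2" "$1"
}

# Usage (not executed here):
#   fetch_to_file http://qiita.com out.html && echo "saved"
```

Because of `-f`, the `&& echo "saved"` part only runs when the server answered without an HTTP error.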
Why curl?
Why am I talking about curl in an article about scrapers? Because curl is an indispensable tool when building a scraper.
If the content you want is the main dish, then the browser is the restaurant, a programming language is the convenience store, and curl is the "chopsticks and fork". It is an indispensable tool for enjoying the meal. Incidentally, xpath is the serving plate / tableware.
Scraping with curl
Extracting information from a site with curl + grep.
This is the most basic of basics.
curl https://qiita.com | grep  title
However, the result is rather messy... not pretty.
takuya@~/Desktop$ curl -s https://qiita.com | grep -E '<title>(.+)</title>'
<!DOCTYPE html><html xmlns:og="http://ogp.me/ns#"><head><meta charset="UTF-8" /><title>Qiita - プログラマの技術情報共有サービス</title><meta content="width=device-width,initial-scale=1" name="viewport" /><meta content="Qiitaは、プログラマのための技術情報共有サービスです。 プログラミングに関するTips、ノウハウ、メモを簡単に記録 &amp; 公開することができます。" name="description" /><meta content="summary" name="twitter:card" /><meta content="@Qiita" name="t (rest omitted)
Extracting information from a site cleanly with curl + grep
Use the grep -o option (with -E so that the "(.+)" group is treated as a regular expression)
curl https://qiita.com | grep -oE '<title>(.+)</title>'
See, the result is clean.
$ curl -s https://qiita.com | grep -oE '<title>(.+)</title>'
<title>Qiita - プログラマの技術情報共有サービス</title>
curl + grep -o has made things a lot more convenient.
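To go one step further and drop the surrounding tags as well, the grep -o result can be passed through sed. This is a minimal sketch, assuming the title sits on a single line and contains no nested tags; the function name `extract_title` is hypothetical.

```shell
# extract_title: read HTML on stdin and print the text inside <title>...</title>.
# grep -o prints only the matched part; -E enables extended regex so "+" works.
# [^<]+ stops at the next tag instead of matching greedily to the end of line.
extract_title() {
  grep -oE '<title>[^<]+</title>' | sed -E 's/<\/?title>//g'
}

# Usage (not executed here):
#   curl -s https://qiita.com | extract_title
```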
curl + md5sum
Pipe the fetched content through md5sum
curl -s  http://takuya-1st.hatenablog.jp/entries/2015/12/11 | md5sum
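This makes it possible to detect whether the fetched content is the same as before.
Footnote: you really ought to use Last-Modified or ETag, but there are a lot of buggy implementations out there (like a certain site's), and many sites never return 304 Not Modified. A prime example of that kind of oddity is CA.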
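For comparing two fetches it is convenient to keep only the hash itself. A minimal sketch, assuming GNU coreutils `md5sum` (which prints the digest followed by a file name column); the function names `content_digest` and `has_changed` are hypothetical.

```shell
# content_digest: read content on stdin and print only the MD5 hex digest.
# md5sum prints "<digest>  -" for stdin, so cut keeps the first field.
content_digest() {
  md5sum | cut -d ' ' -f 1
}

# has_changed OLD_DIGEST NEW_DIGEST: succeeds (exit 0) when the digests differ.
has_changed() {
  [ "$1" != "$2" ]
}

# Usage (not executed here):
#   before=$(curl -s "$url" | content_digest)
#   after=$(curl -s "$url" | content_digest)
#   has_changed "$before" "$after" && echo "changed"
```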
curl + md5sum + mail
A shell script that notifies you when a site has been updated.
Update monitoring with curl + md5sum can be written in one minute. You can finish it while waiting for your cup noodles.
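url="http://localhost/"
digest=$(curl -s "$url" | md5sum)
while true; do
  current=$(curl -s "$url" | md5sum)
  # The digest differs whenever the page content has changed
  if [[ "$digest" != "$current" ]]; then
    echo 'changed!!'
    sendmail hogehoge   # placeholder: fill in your own sendmail arguments
    digest=$current
  fi
  sleep 1
done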
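When the site changes, the MD5 result changes, so the script detects that and sends a notification.
As written, this checks every second. When I once wrote a blog post about fetching JR train service information, I was told that "once every 30 seconds is an abnormal access frequency". If you dislike crawlers, then as an administrator please handle caching correctly. Crawling is neither illegal nor an attack. Please respond correctly to "If-Modified-Since / If-None-Match". SIers who cannot even handle HTTP caching should stay away from web projects. Yes, I am talking about Kansai Electric Power, who emailed me calling it an attack just because I ran a crawler once a minute.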
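One way to be gentler on the server is to poll less often and send a conditional request. The following is a minimal sketch, assuming the server honours If-Modified-Since: curl's `-z` option sends that header based on a local file's modification time, `-R` keeps the server's timestamp on the saved file, and `-w '%{http_code}'` prints the status code. The function names, the cache file and the 60-second interval are hypothetical. On the first run, when the cache file does not exist yet, the request is expected to go out without a condition.

```shell
# describe_status HTTP_CODE: map a status code to what the monitor should do.
describe_status() {
  case "$1" in
    304) echo "unchanged" ;;   # server confirmed our copy is still current
    200) echo "changed" ;;     # a new body was sent
    *)   echo "error" ;;       # anything else: treat as a failed check
  esac
}

# check_once URL CACHE_FILE: send If-Modified-Since based on CACHE_FILE (-z),
# save any new body to CACHE_FILE (-o) with the server's timestamp (-R),
# and print "unchanged", "changed" or "error".
check_once() {
  code=$(curl -s -R -z "$2" -o "$2" -w '%{http_code}' "$1")
  describe_status "$code"
}

# Usage (not executed here): poll once a minute instead of once a second.
#   while true; do check_once http://localhost/ page.html; sleep 60; done
```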
Changing the user agent
Some sites identify clients by their user agent. Since curl is standing in for my browser, I set the user agent to match the browser.
curl --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36" http://www.yahoo.co.jp/
Making browser requests with curl
curl has all sorts of handy features. Basically, any HTTP request can be built with curl.
For example, to reproduce exactly the same request as the browser, it is very easy to use "Copy as cURL" in the Chrome developer tools.
Copy from Chrome and paste into the shell
If you copy the request from Chrome as a cURL command and paste it into the shell, you can reproduce the request easily.
The cookies are included as they are, so when building a scraper it is convenient to start from the command copied from Chrome.
Handling cookies with curl
Making cookies persist with curl.
When you access web pages, you are identified by cookies in most cases.
To save cookies, use the -c option
curl -c ${path_to_save_cookies} http://www.yahoo.co.jp
To reuse cookies, use the -b option
curl -b ${path_to_saved_cookies} http://auctions.yahoo.co.jp/
Reuse the cookies from the previous session while saving them for next time
curl -b path -c path  http://auctions.yahoo.co.jp/
Use the -b and -c options together. The file name can be the same for both.
Note: even though you write the same path twice, you cannot abbreviate it.
curl -bc path  http://auctions.yahoo.co.jp/ ## This does not work. The cookies are not used.
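Since the path has to be written twice every time, it can help to wrap the pair of options in a function. A minimal sketch: the function name `curl_with_cookies`, the variable `COOKIE_JAR` and its default file name are all hypothetical.

```shell
# COOKIE_JAR (hypothetical variable): the cookie file shared by -b and -c.
COOKIE_JAR="${COOKIE_JAR:-./cookies.txt}"

# curl_with_cookies [curl options...] URL
# Reads cookies from the jar (-b) and writes the updated cookies back (-c).
curl_with_cookies() {
  curl -s -L -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$@"
}

# Usage (not executed here):
#   curl_with_cookies http://auctions.yahoo.co.jp/ > out.html
```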
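With this, you can work without worrying about cookie problems.
Session cookies with no expiry are saved too.
If you use the -c option, session cookies are saved without any problem.
If you do not want them kept, add -j.
Note: "no expiry" means a cookie for which no expiry has been set. People tend to mistake "no expiry" for a long-lived cookie (commonly called a persistent cookie), but no expiry set = session cookie = valid until the page is closed (until the window is closed, mind you! Closing just the tab does not delete it).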
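To check which cookies in a saved jar are session cookies, you can look at the file that -c writes. A minimal sketch, assuming the Netscape cookie file format used by curl (seven tab-separated fields, with the expiry time in the fifth field and 0 meaning that no expiry was set); the function name `list_session_cookies` is hypothetical.

```shell
# list_session_cookies JAR_FILE: print the names of cookies whose expiry
# field is 0, i.e. cookies that were set without an expiry time.
# Lines starting with "#HttpOnly_" are real cookie entries, so only the
# other "#" lines (comments) and blank lines are skipped.
list_session_cookies() {
  awk -F '\t' '
    /^#HttpOnly_/ || !/^#/ {
      if (NF == 7 && $5 == 0) print $6
    }
  ' "$1"
}

# Usage (not executed here):
#   curl -s -c cookies.txt http://www.yahoo.co.jp > /dev/null
#   list_session_cookies cookies.txt
```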
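Where to get the cookies from.
Still, submitting a form just to create cookies is a hassle.
Besides taking them from the right-click menu in the Chrome developer tools mentioned earlier, you can also extract cookies from the Chrome user profile.
I usually extract them from Chrome with a command.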
https://github.com/takuya/chrome-storage
chrome-cookie .yahoo.co.jp | jq .
On to the main topic: scraping.
Right then, now that the preparations are done, let's get started with scraping.
So far,
we have gathered one full set of weapons.
There is one more indispensable weapon for scraping. That weapon is libxml.
The last weapon: libxml
The ultimate weapon you cannot do without when scraping is libxml.
Getting page elements with the xpath command that ships with libxml
grep will definitely stop being enough at some point, so you need libxml to parse the XML.
libxml interprets HTML as XML for you, which makes it indispensable for scraping.
Ruby's nokogiri and Python's lxml are built on it as well.
libxml can also be used from the command line.
Running xpath with libxml's xmllint command
The xmllint command can evaluate xpath expressions. Handy!
xmllint --xpath "//nodename" sample.xml
Installation
sudo apt install libxml2-utils
Handling HTML with the xmllint command
To handle HTML with xmllint, add the --html option
xmllint --html --xpath "//head/title" sample.html
That is a lot of typing, so I alias it as xpath.
alias xpath="xmllint --html --xpath 2>/dev/null"
Sending the error messages to /dev/null is purely to keep the explanation simple. I will cover the basic xpath syntax in detail later; for now, let's just keep pulling pages apart with xpath.
We fight by combining xpath with the curl command.
curl -s 'http://www.yahoo.co.jp/' | xpath "//head/title" -
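The same wrapper can be written as a shell function, which is worth knowing because bash does not expand aliases in non-interactive scripts by default. A minimal sketch, assuming `xmllint` from libxml2-utils is installed; the function name `xpath_html` is hypothetical, chosen so that it does not clash with the `xpath` alias.

```shell
# xpath_html XPATH [FILE]: evaluate XPATH against HTML read from FILE,
# or from stdin when FILE is omitted ("-"). Parser warnings about sloppy
# real-world HTML are discarded.
xpath_html() {
  xmllint --html --xpath "$1" "${2:--}" 2>/dev/null
}

# Usage (not executed here). string() returns the text without the tags:
#   curl -s 'http://www.yahoo.co.jp/' | xpath_html 'string(//head/title)'
```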
Two weapons in hand
With curl + libxml, we now have the two sacred treasures that scraping cannot do without.
Task | Command
---|---
Fetch a file | curl
Handle cookies | curl -b path -c path
Parse HTML | xmllint --html --xpath
Now let's do the actual scraping.
The preamble was so long that I'm worn out.
Let's fetch a series of pages in a row.
For example, to pull out every link from a Yahoo Auctions search results page:
curl -s -L  http://j.mp/1YC5mSM | xpath  "//h3/a/@href" -
Next, for each item in this result, fetch the detail page and save it.
curl -s -L  http://j.mp/1YC5mSM | xpath  "//h3/a/@href" -
Expanding further with xargs
Take the list of href values we got, expand it with xargs, and access each detail page.
curl -s -L http://j.mp/1YC5mSM | xpath "//h3/a/@href" - \
  | sed 's/href=//g' \
  | sed 's/"//g' \
  | xargs -P0 -d ' ' -I@ curl -v -O -L @
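The two sed calls can also be folded into one small filter that prints one URL per line, which works whether the attributes arrive space-separated on one line or one per line. A minimal sketch: the function name `hrefs_to_urls` and the variable `search_url` are hypothetical, and decoding `&amp;` is an addition made on the assumption that xmllint prints attribute values in escaped form.

```shell
# hrefs_to_urls: read xmllint attribute output such as
#   href="http://a/1" href="http://a/2"
# on stdin and print one bare URL per line.
hrefs_to_urls() {
  grep -oE 'href="[^"]*"' \
    | sed -E 's/^href="//; s/"$//; s/&amp;/\&/g'
}

# Usage (not executed here): one URL per curl call, at most 2 in parallel.
#   curl -s -L "$search_url" \
#     | xmllint --html --xpath '//h3/a/@href' - 2>/dev/null \
#     | hrefs_to_urls | xargs -n 1 -P 2 curl -s -L -O
```

Limiting the parallelism with `-P 2` instead of `-P0` is a deliberate choice here, to keep the load on the target server low.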
The sed calls sandwiched in the middle get in the way, so
I replace the alias with my own xpath command.
git clone [email protected]:894c5aeabc620344bcea.git
cd 894c5aeabc620344bcea
cp xpath /usr/local/bin/
chmod +x /usr/local/bin/xpath
Shortening it even further
curl -s -L http://j.mp/1YC5mSM | xpath "//h3/a/@href" | xpath "//h1/text()"
The combination of xpath and curl lets you put up quite a fight.
Grabbing data quickly with curl and xpath
I know, people will say "just use anemone".
I know, I could just use the selenium driver.
But parsing pages in the shell takes less typing and is convenient!!!
I'd like to continue a little longer.
Continued → curl+xpath から始めるお手軽スクレイピング(2) - それマグで!
Related material
Extracting only the matched part with grep http://takuya-1st.hatenablog.jp/entry/20121112/1352750670
The xpath command http://takuya-1st.hatenablog.jp/entry/2014/08/24/031832
2021-11-28
Added apt install libxml2-utils