Whipping up a crawler with a Python framework: "Python Framework Scrapy"
Right then, let's build a crawler.
While hunting for a research topic, I've lately been thinking that networks are interesting. The networks I deal with in research aren't the ones from communication technology, though, but the generalized kind made of nodes and links.
Parts of it are widely known, like the "small world phenomenon" and "six degrees of separation", and it's said to apply to theories of economic phenomena and the spread of infectious disease, so the related fields sprawl in every direction.
That said, if I want real data quickly and for free, parsing documents on the web looked like the easiest route, so I tried building a crawler.
I still haven't decided how to use the collected data (!!), but for now I'm thinking of starting with analysis of things like inbound link counts.
Scrapy (pronounced "scrape-y")
There seem to be plenty of Python frameworks for this, but going by Stack Overflow and the like, I decided to try Scrapy.
Reading through the tutorial, first install it.
$ sudo pip install Scrapy
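If the install went through, the scrapy command should now be on the PATH; a quick sanity check is to print the version:

$ scrapy version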
That's the install done. Now, the Scrapy site says to create files under a directory like dmoz/spiders, but that part is slightly off, and for things like dumping JSON, this site seems better.
So, create a first_proj project with the following commands, move into the first_proj directory, and edit first_proj/items.py.
$ scrapy startproject first_proj
$ cd first_proj
$ vim first_proj/items.py
first_proj/items.py
from scrapy.item import Item, Field


class FirstProjItem(Item):
    title = Field()
    link = Field()
    content = Field()
Like this, we define fields for the data we'll pull out of the site.
Next, create first_proj/spiders/FooSpider.py.
first_proj/spiders/FooSpider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from first_proj.items import FirstProjItem


class FooSpider(BaseSpider):
    name = "foo"
    allowed_domains = ["foo.org"]
    start_urls = ["http://blog.scrapy.org/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Follow the "next page" link, if there is one.
        next_page = hxs.select("//div[@class='pagination']/a[@class='next_page']/@href").extract()
        if next_page:
            yield Request(next_page[0], self.parse)

        # Pull the title, link, and body text out of each post on the page.
        posts = hxs.select("//div[@class='post']")
        items = []
        for post in posts:
            item = FirstProjItem()
            item["title"] = post.select("div[@class='bodytext']/h2/a/text()").extract()
            item["link"] = post.select("div[@class='bodytext']/h2/a/@href").extract()
            item["content"] = post.select("div[@class='bodytext']/p/text()").extract()
            items.append(item)

        for item in items:
            yield item
We specify the URLs in start_urls, and the data parsed out of the HTML gets added to the items defined earlier.
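One thing to keep in mind: extract() always returns a list, so every field of an item ends up as a list of strings. A single exported item therefore looks roughly like this (the values are made up, just to show the shape):

{"title": ["Some post title"],
 "link": ["http://blog.scrapy.org/some-post/"],
 "content": ["First paragraph...", "Second paragraph..."]}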
Since I want this data as JSON, the command to run it is the following.
$ scrapy crawl foo -o bar.json -t json
And it happily chugs along doing its work.
The project I used this time is here.
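As a first stab at the inbound-link counting mentioned at the top, the exported bar.json can simply be tallied with collections.Counter. A minimal sketch, assuming the bar.json produced by the command above (it just counts how often each URL appears across the scraped "link" fields):

import json
from collections import Counter

# Load the items exported by `scrapy crawl foo -o bar.json -t json`.
with open("bar.json") as f:
    items = json.load(f)

# Each "link" field is a list (extract() returns lists), so flatten before counting.
counts = Counter(url for item in items for url in item.get("link", []))

# Print the ten most frequently linked URLs.
for url, n in counts.most_common(10):
    print("%5d  %s" % (n, url))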
Update (9/19/2013): The page the URL pointed to can no longer be found, so I changed the code, using this as a reference.
first_proj/spiders/FooSpider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from first_proj.items import FirstProjItem


class FooSpider(BaseSpider):
    name = "foo"
    allowed_domains = ["foo.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Each directory entry is an <li>; grab its title, URL, and description text.
        sites = hxs.select("//ul/li")
        items = []
        for site in sites:
            item = FirstProjItem()
            item["title"] = site.select('a/text()').extract()
            item["link"] = site.select('a/@href').extract()
            item["content"] = site.select('text()').extract()
            items.append(item)

        for item in items:
            yield item