Introduction
Hello, this is Kubo (@beatinaniwa) from the Data Analysis Department.
Today I want to talk about web crawling and scraping, a skill useful enough that I think it could be taught in compulsory education.
In my day-to-day work I mostly use SQL and Python to fetch and analyze the news article data and behavior logs that accumulate inside the company, but there are still plenty of occasions where external data from the Web would help an analysis.
When you just want to brute-force scrape a single page, the kind of approach I wrote about in our company Advent Calendar about a year ago is good enough. But to efficiently crawl and scrape something like a news portal or gourmet portal whose content spans several levels of pages, it is more convenient to use a tool suited to the job.
Enter Scrapy, a scraping framework written in Python.

Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
What is Scrapy
With Scrapy you can concisely describe a web spider, that is, a process that crawls and scrapes a site. Furthermore, as described later, by deploying it to Scrapy Cloud you can have the site information you want collected automatically (and, for minimal usage, for free!).
Version 1.1.0, released this May (the latest version as of this writing is 1.1.1), finally supports Python 3 (it is still beta support, so on Windows only Python 2 is available...), so you can now use it with peace of mind.
How to use Scrapy
Installation
First, install it the usual way with pip:
$ pip install scrapy
Preparation
As the news portal to actually crawl and scrape, I will use the web version of Gunosy.
Gunosy's top page looks like this, divided into major categories such as Entertainment, Sports, ... and Gourmet.

This time I will use Scrapy to collect the articles in every category from Entertainment through Gourmet. The extraction targets are each article's title, URL, and subcategory name.

Creating a project
First, create a Scrapy project.
$ scrapy startproject gunosynews
$ cd gunosynews
This creates the following files under the directory:
.
├── gunosynews
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
Edit gunosynews/items.py as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GunosynewsItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    subcategory = scrapy.Field()
Next, create the web spider, the heart of the project. Running a command provided by Scrapy generates a template for you:
$ scrapy genspider gunosy gunosy.com
This means "create a spider (crawler) named gunosy that crawls the gunosy.com domain." Now let's edit the generated template.
gunosynews/spiders/gunosy.py
# -*- coding: utf-8 -*-
import scrapy

from gunosynews.items import GunosynewsItem


class GunosynewsSpider(scrapy.Spider):
    name = "gunosy"
    allowed_domains = ["gunosy.com"]
    start_urls = (
        'https://gunosy.com/categories/1',  # Entertainment
        'https://gunosy.com/categories/2',  # Sports
        'https://gunosy.com/categories/3',  # Funny
        'https://gunosy.com/categories/4',  # Domestic
        'https://gunosy.com/categories/5',  # International
        'https://gunosy.com/categories/6',  # Columns
        'https://gunosy.com/categories/7',  # IT & Science
        'https://gunosy.com/categories/8',  # Gourmet
    )

    def parse(self, response):
        for sel in response.css("div.list_content"):
            article = GunosynewsItem()
            article['title'] = sel.css("div.list_title > a::text").extract_first()
            article['url'] = sel.css("div.list_title > a::attr('href')").extract_first()
            article['subcategory'] = sel.css("div.list_text > a::text").extract_first()
            yield article

        next_page = response.css("div.page-link-option > a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, callback=self.parse)
Let me walk through it in order.

start_urls lists the URLs the crawl starts from. Since I wanted to collect the articles of every category from Entertainment through Gourmet, the top page of each category serves as a starting point.

The parse() method is the heart of this crawler. When one of the start_urls is downloaded, the response becomes the argument passed to this method. Inside parse(), CSS selectors pick out the elements we want from the given response, and the for loop repeats the extraction once per article.

Pay attention to the part from next_page onward. There, a CSS selector selects the "next page" style link at the bottom of the article list page. If such a link exists, its URL is passed to a scrapy.Request object, and by specifying the parse() method as the callback argument the spider can follow the next page recursively. This way Scrapy automatically walks each category's article list until there are no more pages.
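Note that response.urljoin() resolves a (possibly relative) href against the URL of the current page, following the standard URL-joining rules, so the spider does not care whether the "next page" link is absolute or relative. A quick standard-library sketch of the same behavior (the ?page=2 link forms below are made up for illustration, not necessarily what gunosy.com emits):

```python
from urllib.parse import urljoin

# response.urljoin(href) behaves like urljoin(response.url, href).
page = 'https://gunosy.com/categories/8'      # the current category page

# A query-only relative href keeps the current path:
print(urljoin(page, '?page=2'))               # https://gunosy.com/categories/8?page=2

# An absolute-path href is resolved against the scheme and host:
print(urljoin(page, '/categories/8?page=2'))  # https://gunosy.com/categories/8?page=2
```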
That completes the crawl-and-scrape logic, and the next step would be to run it, but first let's look at the configuration file. The settings live in gunosynews/settings.py. Note the line DOWNLOAD_DELAY = 3: it specifies the interval, in seconds, that the crawler waits between page downloads. It is commented out by default; uncomment it so that our crawl does not burden the target site.
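After uncommenting, the relevant part of gunosynews/settings.py looks like this (the rest of the generated file is left as generated):

```python
# gunosynews/settings.py (excerpt)

# Wait 3 seconds between consecutive page downloads so the crawl
# stays polite to the target site. Scrapy generates this line
# commented out; uncomment it to enable the delay.
DOWNLOAD_DELAY = 3
```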
For other, finer-grained settings, see the official documentation.
Settings — Scrapy 1.1.1 documentation
Running it & results
Now let's finally run what we have so far:
$ scrapy crawl gunosy
The crawl and scrape then begins, and execution results like the following should come streaming out:
2016-08-17 11:46:55 [scrapy] DEBUG: Scraped from <200 https://gunosy.com/categories/8>
{'subcategory': 'お店',
 'title': '…',
 'url': 'https://gunosy.com/articles/R0MAi'}
2016-08-17 11:46:55 [scrapy] DEBUG: Scraped from <200 https://gunosy.com/categories/8>
{'subcategory': 'グルメ総合',
 'title': '…',
 'url': 'https://gunosy.com/articles/RWeO2'}
2016-08-17 11:46:55 [scrapy] DEBUG: Scraped from <200 https://gunosy.com/categories/8>
{'subcategory': 'お店',
 'title': '…',
 'url': 'https://gunosy.com/articles/Ri3TP'}
We can see that the target elements are being extracted correctly.
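Incidentally, Scrapy's built-in feed exports can save the yielded items straight to a file, e.g. scrapy crawl gunosy -o articles.jl for JSON Lines. A rough sketch of tallying such a local dump (the articles.jl file name is my own choice; the keys are the Item fields defined above):

```python
import json
from collections import Counter

def count_by_subcategory(path):
    """Count scraped articles per subcategory in a Scrapy JSON-Lines feed."""
    counts = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            item = json.loads(line)  # one item per line: title, url, subcategory
            counts[item['subcategory']] += 1
    return counts
```

A quick sanity check like this is handy before handing the crawl over to a scheduled environment.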
Easy crawler management with Scrapy Cloud
Getting started
So far we can fetch the information we want, but what matters next is how to store that data and how to manage the crawler's execution. Enter Scrapy Cloud.
On Scrapy Cloud you can deploy the crawler we just built and manage it easily from a browser screen.
If you only create and run a single crawler, it can be used for free, with no credit card registration required (naturally there are free-tier restrictions, such as the data retention period).
Sign up and set up an Organization appropriately, and you should see a screen like the one below.

From the menu bar at the top, select Scrapy Cloud > Create Project to create a project.

Created successfully.

In the left sidebar there is an item called Code & Deploys; click it and make a note of the API key and the project ID.
Deploying
Finally, the deploy. A library is provided for deploying, so we use it:
$ pip install shub
Scrapy Cloud — Scrapinghub documentation
Log in:
$ shub login
At login you are asked for the API key, so enter the one you noted earlier. It is saved to ~/.scrapinghub.yml, and from the next login onward it is used automatically.
Now the actual deploy. Here you are asked for the target project ID; likewise, enter the project ID you noted earlier:
$ shub deploy
When the deploy completes, you can see that various directories and files have been added:
.
├── build
│   ├── bdist.macosx-10.10-x86_64
│   └── lib
│       └── gunosynews
│           ├── __init__.py
│           ├── items.py
│           ├── pipelines.py
│           ├── settings.py
│           └── spiders
│               ├── __init__.py
│               └── gunosy.py
├── gunosynews
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   ├── items.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-35.pyc
│       │   └── gunosy.cpython-35.pyc
│       └── gunosy.py
├── project.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── entry_points.txt
│   └── top_level.txt
├── scrapinghub.yml
├── scrapy.cfg
└── setup.py
Running it & results
From Spiders > Dashboard in Scrapy Cloud's left sidebar you can see the list of deployed crawlers, and clicking a crawler's name takes you to its detail screen.

Press the Run Spider button at the top right, then press the Run Spider button again in the window that opens, and the crawler actually starts running. Let's press it!

Items then pile up rapidly under Running Jobs.

By clicking the item count you can inspect the actually scraped fields, and you can even download them as CSV and other formats.
Super convenient!!
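That downloaded CSV can go straight into further analysis. For example, with the standard library's csv module (the file path is up to you; the columns follow the Item fields defined earlier):

```python
import csv

def load_articles(path):
    """Read a CSV exported from Scrapy Cloud into a list of dicts."""
    # Each row carries the GunosynewsItem fields: title, url, subcategory.
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))
```

From here the rows can be fed into pandas or whatever analysis stack you normally use.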
In closing
How was it? If you write a crawler-plus-scraper from scratch yourself, many things become bottlenecks: maintaining the code, monitoring the running crawler, storing the data, and so on. With Scrapy + Scrapy Cloud, I think you can see that the burden becomes considerably lighter. Scrapy also has many convenient features I could not cover this time, and the documentation is solid, so if you want to know more, do give it a read.
Scrapy 1.1 documentation — Scrapy 1.1.1 documentation
And with that, happy scraping!