Web scraping means using a program to download information (data) from a website and then parsing and analyzing that data.
Since this is my first post on the topic, I'll cover the precautions you should take when scraping, along with the basics.
You can't go overboard with it, but scraping can be very handy.
For example, it could help with running a blog: automatically checking your blog for broken links, automatically generating a list of your articles, or building a tool that shows the article title when you enter a URL (useful if you've ended up using the post date as the URL).
This time, I'm going to build a tool that retrieves a list of the URLs and titles of my own blog's articles.
Let's get started!
What is crawling?
Collecting data that is published on the web is called crawling.
Parsing that collected data to extract the information you need is called scraping.
Checking whether crawling is prohibited
I'll take it as a given that you respect copyright and avoid putting excessive load on a site with too many requests.
You can tell whether a site prohibits crawling by looking at its robots.txt file and at each page's robots meta tag.
Looking at robots.txt
Let's look at my blog's robots.txt.
robots.txt lives at the root of the site.
For example, for this blog you take the top page URL (https://daisuke20240310.hatenablog.com/), append robots.txt, and access https://daisuke20240310.hatenablog.com/robots.txt, which shows something like the following.
User-agent: *
Sitemap: https://daisuke20240310.hatenablog.com/sitemap_index.xml
Disallow: /api/
Disallow: /draft/
Disallow: /preview
User-agent: Mediapartners-Google
Disallow: /draft/
Disallow: /preview
Anywhere marked Disallow means crawling (collecting data) there is prohibited.
For now, the User-agent: * section is the one to look at (I wasn't sure about the User-agent: Mediapartners-Google section, but that user agent is Google's AdSense crawler, so it doesn't affect us here).
In other words, crawling anything under /api/, /draft/, or /preview is prohibited.
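By the way, you don't have to read robots.txt by eye. Python's standard library includes urllib.robotparser, which can evaluate the rules for you. Here is a minimal sketch using the same blog; the True/False results it should print simply reflect the Disallow rules above (the /draft/x path is just an example).

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://daisuke20240310.hatenablog.com/robots.txt")
parser.read()  # download and parse robots.txt

# can_fetch(user_agent, url) tells you whether the rules allow fetching that URL
print(parser.can_fetch("*", "https://daisuke20240310.hatenablog.com/"))         # should be True
print(parser.can_fetch("*", "https://daisuke20240310.hatenablog.com/draft/x"))  # should be False: /draft/ is disallowed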
Next, let's look at the robots meta tag on my blog's top page.
Open your blog's top page.
Right-click and choose "View page source" (in Chrome).
Press Ctrl+F and search for "robots".
In my blog's case, I found the following.
<meta name="robots" content="max-image-preview:large" />
The robots meta tag can contain directives like the following.
- index: allow indexing
- noindex: do not allow indexing
- follow: OK to follow links
- nofollow: do not follow links
- noarchive: do not cache the page
- nosnippet: do not show a text snippet (meta description) in search results
- max-snippet: limit the text snippet to the specified number of characters
- max-image-preview: the size of the image preview shown in search results (none: no image, standard: default size, large: up to the full width of the screen)
- max-video-preview: for videos, the maximum preview length in seconds
- unavailable_after: the last date and time the page should appear in search results (it is not shown after that)
- noimageindex: do not index the page's images
My page doesn't seem to prohibit anything in particular.
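You can also check the robots meta tag from a program. Jumping ahead a little to the requests and beautifulsoup4 libraries we install in the next section, a minimal sketch looks like this (checking for noindex/nofollow is just one example of what you might look for).

import requests
from bs4 import BeautifulSoup

url = "https://daisuke20240310.hatenablog.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# look for <meta name="robots" content="...">
robots_meta = soup.find("meta", attrs={"name": "robots"})
if robots_meta is None:
    print("no robots meta tag found")
else:
    content = robots_meta.get("content", "")
    print("robots meta:", content)
    if "noindex" in content or "nofollow" in content:
        print("this page declares indexing/crawling restrictions")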
Trying out crawling
Preparation
I'll assume Python is already installed.
We'll install two Python libraries.
One is requests, which makes it easy to fetch web pages.
The other is beautifulsoup4, which parses HTML and extracts data from it.
Run the following at a command prompt (or similar).
pip install requests beautifulsoup4
Now that we're set up, let's get right to it!
Crawling the top page
This time I'll use IDLE on Windows.
First, let's fetch my blog's top page.
In IDLE, go to File→New File; an editor titled "untitled" opens, so save it somewhere convenient under a suitable name (e.g. cloning.py).
Enter the following and save it (File→Save, or Ctrl+S).
import requests

url = "https://daisuke20240310.hatenablog.com"
response = requests.get(url)
# use the encoding detected from the page content itself, to avoid garbled text
response.encoding = response.apparent_encoding
print(response.text)
You can run it with Run→Run Module.
Double-click the yellow "Squeezed text (3243 lines)." box; IDLE asks whether you really want to display such long output, and if you press OK you can see all of the retrieved data.
We've now fetched the data for my blog's top page.
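If the Squeezed text box is awkward to deal with, one alternative is to write the fetched HTML to a file and open it in an editor instead. A small sketch (the file name top_page.html is just an example):

import requests

url = "https://daisuke20240310.hatenablog.com"
response = requests.get(url)
response.encoding = response.apparent_encoding

# write the HTML to a file instead of printing it in IDLE
with open("top_page.html", "w", encoding="utf-8") as f:
    f.write(response.text)

print("saved", len(response.text), "characters")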
Let's try scraping
Now let's parse the data we just fetched (that is, scraping)!
Getting one a tag and one title tag
First, let's try getting a link.
Here are the source code and the result.
import requests
from bs4 import BeautifulSoup

url = "https://daisuke20240310.hatenablog.com"
response = requests.get(url)
response.encoding = response.apparent_encoding
parse_html = BeautifulSoup(response.text, "html.parser")
print(parse_html.find("a"))      # the first a tag on the page
print(parse_html.find("title"))  # the title tag
BeautifulSoup parses the data we fetched. The first argument is the data to parse, and the second argument specifies the HTML parser; other parsers can be used, but we'll stick with this one for now.
parse_html is a BeautifulSoup object, and we use it to pull results out of the parsed page. The find("a") method returns the first a tag in the document.
Looking at the contents of that a tag, the URL points to the top page and the link wraps an image URL. I had expected the blog title to come first, but apparently not.
Since I wasn't sure why, I checked with Chrome's developer tools.
Searching for <a near the start of the body tag, I see: it was the profile image. That makes sense.
This time we used find("a"), but find_all("a") returns all of the a tags as a list (see the sketch just below).
As for the title tag, the title came back just as expected.
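Here is a minimal sketch of that find_all("a") usage: it lists the href and link text of every a tag on the top page, using the same setup as above.

import requests
from bs4 import BeautifulSoup

url = "https://daisuke20240310.hatenablog.com"
response = requests.get(url)
response.encoding = response.apparent_encoding
parse_html = BeautifulSoup(response.text, "html.parser")

# find_all("a") returns a list of every a tag in the document
for a_tag in parse_html.find_all("a"):
    href = a_tag.get("href")            # None if the tag has no href attribute
    text = a_tag.get_text(strip=True)   # the visible link text, if any
    print(href, "-", text)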
Getting every URL on the site
Now that we can extract links, we should be able to follow them recursively and collect every URL on the site.
One caveat: unless you exclude external URLs, the crawler will keep fetching pages forever, so be careful.
I was going to implement this myself, but this time I'll have ChatGPT write it!
daisuke20240310.hatenablog.com
ChatGPT is seriously impressive!
It wasn't quite what I had in mind, but it produced perfectly usable code on the first try (lol).
Here are the code ChatGPT generated and the result of running it.
As written, it hits the site with back-to-back requests, which could put a load on the server, so you should add a sleep of a few seconds or so where appropriate.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl_page(url, visited_pages=set()):
    if url in visited_pages:
        return
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.title.text.strip()
        print(f"URL: {url} - Title: {title}")
        visited_pages.add(url)
        for link in soup.find_all('a', href=True):
            absolute_link = urljoin(url, link['href'])
            if absolute_link.startswith("http://example.com"):
                crawl_page(absolute_link, visited_pages)
    except Exception as e:
        print(f"An error occurred: {e}")

start_url = "http://example.com"
crawl_page(start_url)
It also ends up including anchor links (links to locations within a page), so I had to add code to exclude those.
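For reference, here is one way both fixes could look, reworked by hand rather than by ChatGPT: a short sleep between requests, URL fragments stripped so in-page anchors don't count as new pages, and a check that keeps the crawl inside the blog. The 2-second interval and the startswith check on the start URL are my own choices.

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag

START_URL = "https://daisuke20240310.hatenablog.com/"

def crawl_page(url, visited_pages):
    url, _ = urldefrag(url)            # drop "#..." so anchor links map to the same page
    if url in visited_pages:
        return
    visited_pages.add(url)
    time.sleep(2)                      # be polite: wait a couple of seconds per request
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error: {e}")
        return
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.text.strip() if soup.title else "(no title)"
    print(f"URL: {url} - Title: {title}")
    for link in soup.find_all("a", href=True):
        absolute_link = urljoin(url, link["href"])
        if absolute_link.startswith(START_URL):   # stay inside the blog
            crawl_page(absolute_link, visited_pages)

crawl_page(START_URL, set())

For a large site you would want an explicit queue (and probably a timeout on requests.get) instead of plain recursion, but for a small blog this is enough to get a URL and title list.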
It looks like I need to be more precise about what I ask ChatGPT for (lol).
That's all for this time!
In closing
Even so, the accuracy of ChatGPT's code generation blew me away.
There's no longer much need to memorize Python's finer syntax details, and I plan to lean on it more and more going forward.
Thank you for reading all the way to the end.