Scraping 5ch (formerly 2ch) to See What Became of Once-Popular Net Slang
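Over the past few years people have drifted from 5ch (formerly 2ch) to Twitter, but 5ch was where many internet memes originated, and I think it has produced a wide range of slang and shaped a wide range of culture.
I used to read 2ch summary sites as a student, so this is an internet culture that influenced me, and my gut feeling is that its slang goes in and out of fashion.
This post covers how to fetch roughly the past 18 years of 5ch archived logs. By counting, as a time series, how often nostalgic net slang appears in those documents, we can observe how usage changed over time.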
"I put 'orz' at the end of a sentence, a young person asked me 'What does orz mean?', and now I feel orz in both body and mind."
ばんくし (@vaaaaanquish), October 19, 2018
For example, in the roughly 500 GB of 5ch post logs aggregated for this post, the frequency of the net slang "orz" peaks in 2005 and declines steadily afterwards.
So "orz" seems to be the kind of slang that young people can hardly be blamed for not knowing.
How to scrape the 5ch archived logs
I had wanted a 5ch corpus for a long time but did not know where or how to get one. I eventually settled on a method using Python3 + requests + BeautifulSoup, which I describe here.
Scraping the archived logs with breadth-first search
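The links between URLs form a network structure.
The scraping strategy comes down to how to traverse that network, and I used breadth-first search.
The reason is that the logs reachable from the 2ch archive are laid out flat: starting from a large number of links, following 2 to 3 of them reaches the target data, a structure that suits breadth-first search.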
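The breadth-first traversal described above can be sketched roughly as follows. This is a minimal illustration, not the author's crawler: `bfs_crawl`, `extract_links`, and the `fetch` callable are hypothetical names, and the standard-library `HTMLParser` stands in for BeautifulSoup only to keep the sketch dependency-free.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def bfs_crawl(start_url, fetch, max_depth=3):
    """Visit pages breadth-first and return {url: depth} for every URL visited.

    fetch is a hypothetical callable url -> HTML text,
    for example a thin wrapper around requests.get.
    """
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if depths[url] >= max_depth:
            continue  # fetch leaf pages but do not follow their links
        for link in extract_links(url, html):
            if link not in depths:  # skip URLs that are already queued or visited
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```

Because the queue is first-in, first-out, every page at depth 1 is fetched before any page at depth 2, which matches a flat structure where the data sits only 2 to 3 links away from the start.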
Choosing a single starting point
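In the past I thought of fetching the complete 2ch logs as a dream and considered various approaches, but I gave up because there seemed to be no list of the URLs where the logs are stored. I have finally found one.
It is accessible at the URL below, which references the archived-log servers for many threads.
Starting from there therefore makes it possible to scrape the 2ch archived logs.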
http://lavender.5ch.net/kakolog_servers.html
Handling the legacy HTML format
This section assumes a Python HTML parser. The HTML of the old 2ch does not appear to be valid HTML.
Designs that relied heavily on table tags seem to have been the norm until around the middle of fiscal 2017, and those tags are left without matching closing tags, so parsing fails with lxml, html.parser, and similar parsers.
The html5lib parser, by contrast, can parse even partially broken HTML [1]. Setting BeautifulSoup's parser to html5lib as follows solves the problem.
import bs4  # the 'html5lib' parser also requires the html5lib package to be installed
soup = bs4.BeautifulSoup(html, 'html5lib')
Parallelizing access
In the 2ch archive, each server has its own name, and each server has a different subdomain.
They are therefore presumably separate physical servers, so access can be sped up by parallelizing per server. In addition, fetching and parsing HTML with Python's requests and BeautifulSoup is heavy work to begin with, so it is worth making full use of multiple cores for parallel access.
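One way to parallelize per server is to bucket URLs by hostname and run one sequential job per host. The sketch below is an assumption about how that could look, not the author's code: `group_by_host`, `fetch_host`, `fetch_all`, and the `fetch` callable are hypothetical names. It uses `ThreadPoolExecutor` for simplicity; since the parsing is CPU-heavy, `ProcessPoolExecutor` from the same module would be the natural swap to use multiple cores, provided `fetch` is a picklable top-level function.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def group_by_host(urls):
    """Bucket URLs by hostname so that each server gets its own job."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)


def fetch_host(urls, fetch):
    """Fetch one server's URLs sequentially to avoid hammering that server."""
    return [(url, fetch(url)) for url in urls]


def fetch_all(urls, fetch, max_workers=8):
    """Run one sequential job per host, with the jobs themselves in parallel."""
    groups = group_by_host(urls)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_host, host_urls, fetch)
                   for host_urls in groups.values()]
        for future in futures:
            results.update(future.result())  # merge (url, body) pairs
    return results
```

Keeping requests to a single host sequential while running hosts side by side gains throughput without multiplying the load on any one server.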
Choosing the net slang to count
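My gut feeling was that net slang in general is affected by the passage of time.
More specifically, I expected the frequency of a word on a given day to rise while it is popular and to fall as its popularity fades, so that a time series would show its whole life from the onset of popularity to falling out of use. That is why I ran this aggregation.
There are various roundups [2] of the net slang that existed over the past 10 years, and reading them brings back a lot of memories.
The words below are ones that stuck in my memory and that I chose by asking "it was around for a while, but is it still used? I haven't seen it lately, so how much has it declined?"
(You can re-run the aggregation with any keywords by modifying the aggregation code on GitHub mentioned below. Give it a try if you like.)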
- orz
- 尊師 (sonshi, "guru")
- 香具師 (yashi, slang for "guy")
- 笑 (sentence-final)
- 初音ミク (Hatsune Miku)
- 結月ゆかり (Yuzuki Yukari)
- コードギアス (Code Geass)
- hshs
- iphone
- ãï½
- 自宅警備員 ("home security guard", a joke term for a shut-in)
- ワンチャン ("one chance", a slim possibility)
- スマホ (smartphone)
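- 情弱 (someone poorly informed)
- チラ裏 ("back of a flyer", i.e. keep it to yourself)
- 今北産業 ("just got here, summarize in three lines")
- 禿同 (strongly agree)
- w (sentence-final)
- メシウマ (schadenfreude)
- まどマギ (Madoka Magica)
- ソシャゲ (social games)
- ジワる (gradually becoming funny)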
- ããã
- (ry
- ggrks
- オワコン ("finished content", something past its prime)
This calculation can be run with examples/time_term_freq.py in the GitHub repository mentioned below. Editing the program changes which words are collected, so the aggregation can be re-run.
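To make the idea of a per-period frequency concrete, here is a minimal sketch. It is not the contents of examples/time_term_freq.py: the function name, the record keys `year` and `text`, the regular expressions, and the choice of "share of posts containing the term" as the frequency measure are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical patterns: plain terms match anywhere in a post,
# "sentence-final" terms only at the end of a line.
PATTERNS = {
    "orz": re.compile(r"orz"),
    "w (sentence-final)": re.compile(r"[wｗ]+\s*$", re.MULTILINE),
}


def yearly_term_freq(posts, patterns=PATTERNS):
    """Return {year: {term: share of that year's posts containing the term}}.

    Each post is a dict with the hypothetical keys "year" (int) and "text" (str).
    """
    totals = Counter()            # number of posts per year
    hits = defaultdict(Counter)   # number of matching posts per year and term
    for post in posts:
        year = post["year"]
        totals[year] += 1
        for term, pattern in patterns.items():
            if pattern.search(post["text"]):
                hits[year][term] += 1
    return {
        year: {term: hits[year][term] / totals[year] for term in patterns}
        for year in sorted(totals)
    }
```

Dividing by the number of posts in each year keeps years with very different posting volumes comparable, which matters when the logs span well over a decade.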
Converting HTML to jsonl
Extracting the thread contents from the scraped HTML and storing them as jsonl (one JSON object per line) makes all kinds of aggregation easier.
The program scan_items.py does this parsing, so use it as a reference.
$ python3 scan_items.py
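The jsonl layout mentioned above (one JSON object per line) can be written and read back with the standard library alone. This sketch is not scan_items.py: the helper names and the record fields `date` and `text` are hypothetical, and it covers only the serialization step, not the HTML parsing.

```python
import io
import json


def write_jsonl(records, fp):
    """Write one JSON object per line; ensure_ascii=False keeps Japanese text readable."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fp):
    """Yield one record per non-empty line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


# Round trip through an in-memory buffer; the field names are hypothetical.
buffer = io.StringIO()
write_jsonl(
    [{"date": "2005-04-01", "text": "orz"}, {"date": "2018-10-19", "text": "草"}],
    buffer,
)
buffer.seek(0)
records = list(read_jsonl(buffer))
```

Because every line is an independent object, a file of this kind can be streamed record by record, which helps when the full log runs to hundreds of gigabytes.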
Results
The results are obtained by running examples/time_term_freq.py.
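As hypothesized, net slang goes in and out of fashion, and for terms that are hardly used any more the visualization shows when they started to disappear.
Meanwhile, "growing grass" expressions such as ending a sentence with w are still gaining strength, so you can probably use them freely without being treated as an out-of-touch old-timer (what a relief).
The aggregation also exposed the gaps between gut feeling and what the data shows, and it sorted out terms whose origin or era had been unclear, so it turned out to be a revealing exercise.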