Predicting corpus vocabulary size with Heaps' law, and an evaluation experiment
The other day, something happened to remind me of Heaps' law. I had originally remembered it as "Heap's law", but the apostrophe comes after "Heaps", so the name is Heaps and "Heaps' law" must be the correct form. That is what I will call it here.
Heaps' law states that, for a corpus consisting of N words, the total vocabulary size D can be expressed by the following equation *1

$$D = k N^{\beta}$$

Here k and β are constants determined by the corpus. For English corpora, β is apparently around 0.4-0.6 *2
What the law implies is that the vocabulary keeps growing as the corpus grows. That said, the curve is only a straight line on a log scale, so the growth certainly does level off gradually.
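To make the "keeps growing, but more and more slowly" point concrete, here is a minimal sketch of the formula. The constants `K` and `BETA` are hypothetical values chosen only for illustration; they are not estimates from any corpus in this post.

```python
def heaps_vocab(n_tokens, k, beta):
    """Vocabulary size predicted by Heaps' law, D = k * N**beta."""
    return k * n_tokens ** beta


# Hypothetical constants, for illustration only.
K, BETA = 10.0, 0.5

for n in (10_000, 20_000, 40_000):
    print(n, round(heaps_vocab(n, K, BETA)))

# Doubling N multiplies D by 2**beta, whatever the value of k.
growth = heaps_vocab(20_000, K, BETA) / heaps_vocab(10_000, K, BETA)
```

With β below 1, each doubling of the corpus multiplies the vocabulary by less than 2, which is the gradual levelling-off described above.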
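Looking up the source of Heaps' law on Wikipedia, it seems to have been proposed in Heaps' book "Information Retrieval" [1], published in 1978. Since then there has been a paper showing its relation to Zipf's law [2] and a paper deriving Heaps' law from the Mandelbrot distribution, a generalization of Zipf's law [3]. I do not understand them at all and they would take us off on a tangent, so I leave them out of this post.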
For a search engine, predicting the total vocabulary size of the corpus being searched is an important problem, which is probably why the law went on to be introduced in almost every information retrieval textbook *3
Someone who goes around claiming that information retrieval is his specialty can hardly get away with not having read the Heaps book, so I rushed to the library and got hold of a copy.
It was published in 1978. It is exciting to think that the foundations of modern search engines are written in a book that predates my own birth. It is also a reminder that search engines came back into the spotlight with the arrival of the web.
However, when I read the part about Heaps' law in that excited mood, it turned out that the book covers it in a mere three pages, and that is that. The author surely never imagined it would end up being cited in almost every information retrieval text.
Since this is something a simple script can compute quickly, I decided to check whether Heaps' law holds on real data.
From a textbook point of view it may be enough to say "it explains the phenomenon, well done, the end", but I think an engineering method is only worth something if it is useful. So I wanted to build a prediction model based on Heaps' law from part of a dataset, and evaluate how accurately it predicts the total vocabulary size at a given total word count.
So I will check the following two things.
- Check 1: Plot the word count vs. vocabulary size curve on real data and look at it
- Check 2: Build a prediction model based on Heaps' law from part of the corpus, then plot the predicted curve and look at it
For these checks I used the following two public datasets.
- Reuters-21578 (Reuters)
  - http://www.daviddlewis.com/resources/testcollections/reuters21578/
  - All SGML tags are removed and words are extracted by splitting on spaces
  - No stopword removal or stemming
- AOL query log (AOL)
  - http://www.gregsadetsky.com/aol-data/
  - Words are extracted by splitting the keywords in each query on spaces
  - No stopword removal or stemming
  - Because the total word count is large, the vocabulary size is recorded at every word up to 100000 words, and at every 100th word after that
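The counting procedure described in the list above can be sketched as follows. This is an illustrative Python version, not the script actually used (the Perl code is in Appendix A). The names `vocab_growth`, `whitespace_tokens`, `dense_limit` and `stride` are made up for this sketch, and `str.split()` is assumed as the tokenizer, which skips runs of whitespace.

```python
def vocab_growth(tokens, dense_limit=100_000, stride=100):
    """Yield (token_count, vocab_size) pairs while scanning tokens in order.

    Every token is reported while token_count < dense_limit; after that,
    only every stride-th token is reported.
    """
    seen = set()
    for count, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if count < dense_limit or count % stride == 0:
            yield count, len(seen)


def whitespace_tokens(lines):
    """Split each line on whitespace, with no stopword removal or stemming."""
    for line in lines:
        yield from line.split()


# Tiny made-up query log, just to show the output shape.
sample = ["new york weather", "weather in new york", "cheap flights"]
points = list(vocab_growth(whitespace_tokens(sample)))
```

Keeping only a set of seen words and a running counter means the whole corpus is processed in one pass, in the order it appears in the files.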
Reuters is a news article dataset used for text classification, and AOL is a dataset of query logs submitted to a search engine. Reuters is also divided into a number of topics, but query logs have a more skewed and more diverse mix of topics than news articles, so I picked them as a test case expecting that the result might deviate from Heaps' law. Note that neither corpus was shuffled, so words are counted in ascending order of ID or date.
Check 1: Plot the word count vs. vocabulary size curve on real data and look at it
First I plotted the word count vs. vocabulary size curve for each of the two corpora, on both linear and log-log axes.
Reuters (linear scale)
The curve is smooth, but around x=2.5e+6 the vocabulary seems to start growing faster for a while. I do not know the specific cause. Looking through the Reuters corpus, it is split into 21 files, of which the first 16 consist of articles from March-April 1987 while the remaining 5 consist of articles from April-October 1987, so the number of articles sampled per period is uneven. I thought this might be the cause and hoped that the 2.5e+6 word mark would fall right around May, but when I checked, unfortunately it does not. Hmm, perhaps new topics just emerged intermittently.
Reuters (log-log scale)
On the log-log plot you can make out several jagged sections. It gives a slightly different impression from the first curve. In the end it is almost a straight line.
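AOL (linear scale)
Compared with Reuters, the curve seems to bend more gently (in Heaps' law terms, β is larger). The AOL data is a query log covering about two months. Perhaps the growth in vocabulary is wavy because new topics come up day to day and users tend to follow similar topics from one day to the next.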
AOL (log-log scale)
On log-log axes there are some slightly bumpy sections, but the overall shape looks like it can be approximated by a straight line. So my guess that query logs might deviate from Heaps' law turned out to be wrong. Heaps' law is impressive.
Check 2: Build a prediction model based on Heaps' law from part of the corpus, then plot the predicted curve and look at it
Now that Heaps' law appears to hold on both datasets, the next step is to build a prediction model.
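At first glance Heaps' law looks hard to estimate because β sits in the exponent of N, but taking the log of both sides gives

$$\log D = \beta \log N + \log k$$

so it can be read as a linear regression model y = a x + b, with y = log D, x = log N, a = β and b = log k. Students who take their university lectures seriously will recognize this as a simple log-log regression model (I did not).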
I do not intend to do anything elaborate here, so I will fit the model by least squares. With least squares the solution can be obtained in closed form from the data, so all you need is matrix multiplication and a matrix inverse.
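For a single explanatory variable, the closed-form solution reduces to a few running sums, so no matrix library is needed. The following is a minimal sketch under that assumption. `fit_heaps` is a name made up for this example, and the data is synthetic (generated from invented constants), not taken from either corpus.

```python
import math


def fit_heaps(points):
    """Least-squares fit of log D = log k + beta * log N; returns (k, beta)."""
    n = 0
    sx = sy = sxx = sxy = 0.0
    for tokens, vocab in points:
        x, y = math.log(tokens), math.log(vocab)
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    # Closed-form slope and intercept of simple linear regression.
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    log_k = (sy - beta * sx) / n
    return math.exp(log_k), beta


# Synthetic points that follow D = 5 * N**0.6 exactly (made-up constants).
synthetic = [(n, 5.0 * n ** 0.6) for n in range(1, 2001)]
k_hat, beta_hat = fit_heaps(synthetic)
```

Because it only keeps five running totals, this version reads the data in one pass. With millions of points, plain floating-point sums may lose some precision, which is one reason to prefer a well-tested routine such as R's `lm`.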
I could have written such a simple program myself, but this time there is a lot of data and I expected a naive implementation to struggle, so I did the least-squares fitting with R's lm function.
The log-log linear regression model is estimated with lm as follows.
    # Load the data
    > x <- read.table("reuters-21578_per_token.dat")
    # Estimate the linear regression model by least squares (do not forget the logs)
    > lm(log(x$V2) ~ log(x$V1))

    Call:
    lm(formula = log(x$V2) ~ log(x$V1))

    Coefficients:
    (Intercept)    log(x$V1)
         1.8359       0.6784

    # The intercept is log k, so take exp to get k
    > exp(1.8359)
    [1] 6.270775
In the example above, the least-squares curve is $D = 6.27 N^{0.678}$.
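As a worked example of turning the regression output into a prediction, the sketch below plugs the two coefficients reported by lm into Heaps' law. The corpus size of 1,000,000 words is an arbitrary value chosen for illustration and does not come from the experiment.

```python
import math

# Coefficients copied from the lm() output above.
INTERCEPT = 1.8359   # estimate of log k
SLOPE = 0.6784       # estimate of beta

k = math.exp(INTERCEPT)


def predict_vocab(n_tokens):
    """Vocabulary size predicted by the fitted curve D = k * N**beta."""
    return k * n_tokens ** SLOPE


# 1,000,000 words is an arbitrary example size, not a figure from the post.
estimate = predict_vocab(1_000_000)
print(round(estimate))
```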
For each of the two datasets, I plotted graphs with the following two kinds of curve added.
- The least-squares curve fitted to all of the data
- The least-squares curve fitted to part of the data
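One way to put a number on "how far off is a curve fitted to part of the data" is to fit on a prefix and compare the prediction at the final point with the observed value. The sketch below assumes that setup. `fit_heaps` and `relative_error` are names made up for this example, and both datasets are toy curves generated from invented formulas, not the Reuters or AOL measurements.

```python
import math


def fit_heaps(points):
    """Least-squares fit of log D = log k + beta * log N; returns (k, beta)."""
    n = 0
    sx = sy = sxx = sxy = 0.0
    for tokens, vocab in points:
        x, y = math.log(tokens), math.log(vocab)
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    log_k = (sy - beta * sx) / n
    return math.exp(log_k), beta


def relative_error(points, fraction):
    """Fit on the first `fraction` of points, then predict the last point."""
    cut = max(2, int(len(points) * fraction))
    k, beta = fit_heaps(points[:cut])
    n_last, d_last = points[-1]
    predicted = k * n_last ** beta
    return (predicted - d_last) / d_last


# Toy curve 1: an exact power law, so any prefix extrapolates correctly.
power_law = [(n, 5.0 * n ** 0.6) for n in range(1, 5001)]

# Toy curve 2: a vocabulary that saturates, so the growth slows down faster
# than a power law fitted to the early part would suggest.
saturating = [(n, 1000.0 * (1.0 - math.exp(-n / 1000.0))) for n in range(1, 5001)]

for name, data in (("power law", power_law), ("saturating", saturating)):
    print(name, [round(relative_error(data, f), 3) for f in (0.1, 0.5, 1.0)])
```

A positive value means the prefix fit overestimates the final vocabulary size. On the toy saturating curve the 10% fit overshoots, while the exact power law is recovered from any prefix.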
Reuters
The graph below shows the least-squares curves fitted to all words, to the first 10,000 words (0.2%), and to the first 370,000 words (10%).
The prediction from 10,000 words ends up far off, and the one from 370,000 words is still off. With all of the data, the power-law model $D = k N^{\beta}$ fits the curve cleanly.
The log-scale version is as follows.
AOL
As with Reuters, the graph below shows the least-squares curves fitted to all words, to the first 10,000 words (0.5%), and to the first 250,000 words (10%).
The curve from 10,000 words is far off, but the parameters estimated from 250,000 words, 10% of the total, give a curve that is quite close. It is interesting that AOL gets a close curve from a smaller share of the data than Reuters does.
The log-scale version is as follows.
Summary
This post introduced Heaps' law and used public datasets to check whether the law holds and how well a model fitted by least squares predicts.
If a corpus is a sample drawn from some data source, it should be possible to predict how the total vocabulary size changes as the corpus is simply made larger. However, these results show that the share of the data needed for an accurate prediction varies from corpus to corpus.
- Future work
  - Checks on Japanese corpora, Twitter in particular
  - More accurate prediction models
    - Consider fitting methods other than least squares
    - Introduce new heuristics
Somehow this turned into rather stiff writing as I went along. The experiment was done in about 30 minutes, but I got oddly fired up about the blog post and it took 2-3 hours to write. Next time I want to write a more laid-back experiment post.
References
- [1] H. S. Heaps, "Information Retrieval: Computational and Theoretical Aspects", Academic Press, 1978.
- [2] L. Lu, Z.-K. Zhang and T. Zhou, "Zipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems", arXiv:1002.3861v2, 2010.
- [3] D.C. van Leijenhorst, Th.P. van der Weide, "A formal derivation of Heaps' Law", Information Sciences, Vol.170, pp.263--272, 2005.
Appendix A
The Perl source code used for word counting is as follows.
- reuters
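    # reuter-count.pl
    # Prints "<word count>\t<vocabulary size>" after every word read.
    use strict;
    use warnings;

    my $doc_count = 0;
    my $token_count = 0;
    my %count_of;

    while (my $line = <>) {
        chomp $line;
        # Splitting on SGML tags leaves only the text between the tags.
        foreach my $contents (split /<[^>]+>/, $line) {
            next unless ($contents =~ /\S+/);
            # Note: split /\s/ returns empty strings for consecutive
            # whitespace, and those are counted as words too.
            foreach my $word (split /\s/, $contents) {
                $count_of{ $word }++;
                $token_count++;
                print $token_count, "\t", scalar keys %count_of, "\n";
            }
        }
    }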
- AOL
    use strict;
    use warnings;

    my $token_count = 0;
    my %count_of;

    while (my $line = <>) {
        chomp $line;
        foreach my $word (split /\s/, $line) {
            $count_of{ $word }++;
            $token_count++;
            # Print every word below 100000, then only every 100th word.
            if ($token_count < 100000 || $token_count % 100 == 0) {
                print $token_count, "\t", scalar keys %count_of, "\n";
            }
        }
    }
*1: Here "word count" means the number of tokens, and "vocabulary" means the number of unique words.
*2: This range comes from Wikipedia, though. Heaps' own account [1] does not give it. Instead it derives approximate values from data on library catalog titles and on titles in a chemistry collection, and reports β = 0.5 and 0.6.
*3: Confirmed at least in IIR and the Croft book. For commentary on each textbook, see 情報検索ことはじめ〜教科書編その2 (2011年決定版) 〜.