ä»æ¥ã®èªç¶è¨èªå¦çåå¼·ä¼ï¼ æ±äº¬ã«ã¦ãZipfå(ããåå¸)ã®çè«çè£ä»ãã¨ãã¦ãã©ã³ãã ã«ã¢ã«ãã¡ãããï¼ç©ºç½ãçæãããã空ç½åºåãã®ãåèªãã¨èããã¨ããããã®åèªã®é »åº¦åå¸ã Zipf ã«å¾ããã ããçãªã話ã FSNLP ã«æ¸ãã¦ãã£ãã
ã¸ã¼ã§çµãããããããããªãã¦ããããªã®ç°¡åãªã¹ã¯ãªããã§ç¢ºèªã§ãããããã£ã¦ã¿ãã
ã¾ãæåãä¸æ§çæã«ããã¨ããã1æååèªã¨2æååèªã¨3æååèªã®çæ確çãããããçããããã®å¢çã§é »åº¦ã®ã¸ã£ã³ããçºçãããããªãæããã«ä¸èªç¶ãªéåçåå¸ã«ãªã£ã¦ãã¾ã£ã¦ãf*r ãä¹±é«ä¸ããã
ããã§ã実際の英文でのアルファベットの使用頻度ãã¢ã«ãã¡ãããã®çºççã«ç¨ããããã«ãã¦ã¿ãã
#!/usr/bin/ruby list = "abcdefghijklmnopqrstuvwxyz ".split(//) prob = [ 0.0651738,0.0124248,0.0217339,0.0349835,0.1041442,0.0197881,0.0158610,0.0492888,0.0558094, 0.0009033,0.0050529,0.0331490,0.0202124,0.0564513,0.0596302,0.0137645,0.0008606,0.0497563, 0.0515760,0.0729357,0.0225134,0.0082903,0.0171272,0.0013692,0.0145984,0.0007836,0.1918182 ] prob_sum = 0 cum_prob = [0] prob.each do |x| prob_sum += x cum_prob << prob_sum end map = Hash.new(0) word = "" while true r = rand l = 0 h = prob.size while h>l+1 m = (h+l)/2 if r < cum_prob[m] h = m else l = m end end x = list[l] if x == " " if word.length > 0 map[word] += 1 if map.size % 100000 == 0 open("zipf#{map.size/1000}k.txt", "w") do |f| f.puts "rank,word,freq,rank*freq,freq_freq" freq = rank = 0 map.to_a.sort_by{|x| -x[1]}.each_with_index do |x, r| if freq != x[1] f.puts "#{rank},#{x[0]},#{x[1]},#{rank*x[1]},#{r-rank+1}" if rank>0 freq = x[1] rank = r + 1 end end end end end word = "" else word += x end end
500ä¸èªçæããçµæã®æç²ã以ä¸ã®éãã
ã追è¨ãåæ²æã¯ã³ã¼ãã«ãã¹ããã£ãã®ã§ä¿®æ£ãã¦çµæãåãç´ããããï¼è¿½è¨ã
å·¦ããã©ã³ã¯ããåèªã(ã®ä»£è¡¨)ãé »åº¦ãé »åº¦Ãã©ã³ã¯(Zipfåã¯ãããä¸å®ã«ãªããã¨ã主張ãã¦ãã)ããã®é »åº¦ã®åèªãããã¤ãããã
rank,word,freq,rank*freq,freq_freq 1,t,207933,207933,1 2,a,185903,371806,1 3,o,170137,510411,1 4,n,160637,642548,1 5,i,158915,794575,1 6,s,147372,884232,1 7,r,141705,991935,1 8,h,140770,1126160,1 9,d,99879,898911,1 10,l,94509,945090,1 11,u,64120,705320,1 12,c,61551,738612,1 13,m,57745,750685,1 14,f,56634,792876,1 15,w,48466,726990,1 16,g,45284,724544,1 17,y,41680,708560,1 18,p,39152,704736,1 19,b,35149,667831,1 20,ee,30894,617880,1 21,v,23547,494487,1 22,te,21780,479160,1 23,et,21739,499997,1 24,ea,19281,462744,1 : 100,dt,7413,741300,1 101,hs,7400,747400,1 102,td,7256,740112,1 103,rs,7255,747265,1 104,sh,7219,750776,1 105,rr,7046,739830,1 : 501,ka,1004,503004,1 502,aer,993,498486,1 503,eas,992,498976,1 504,ioe,988,497952,2 506,neo,983,497398,1 : 701,ks,729,511029,1 702,iat,727,510354,1 703,hhe,726,510378,4 707,tht,725,512575,1 708,gg,724,512592,1 : 1001,red,535,535535,5 1006,uv,534,537204,1 1007,ran,533,536731,3 1010,iar,531,536310,3 1013,ari,530,536890,5 : 1500,rhi,361,541500,5 1505,lts,360,541800,2 1507,nec,359,541013,5 1512,dia,358,541296,2 1514,nce,357,540498,4 : 2010,idr,265,532650,7 2017,tep,264,532488,3 2020,lsh,263,531260,12 2032,rdn,262,532384,9 2041,aca,261,532701,4 : 3003,raf,175,525525,18 3021,ifn,174,525654,10 3031,amr,173,524363,28 3059,coh,172,526148,17 3076,abt,171,525996,17 : 4012,icd,128,513536,30 4042,nnp,127,513334,30 4072,mad,126,513072,20 4092,opn,125,511500,25 4117,hgo,124,510508,27 : 6044,aaeo,75,453300,58 6102,neti,74,451548,66 6168,iese,73,450264,69 6237,ova,72,449064,69 6306,seot,71,447726,82 : 38665,nefm,10,386650,3591 42256,lery,9,380304,4255 46511,feou,8,372088,5290 51801,nsbl,7,362607,6647 58448,erhv,6,350688,8491 66939,maif,5,334695,11641 78580,cyho,4,314320,17417 95997,leats,3,287991,29028 125025,slfc,2,250050,56358 181383,neiaitnxsat,1,181383,159155
æåã¨æå¾ã¯ãããããã¦ããã(ããã¯æ¬ç©ã®åèªã®åå¸ã§ããããªã)ãéä¸ã¯ 50ä¸ã¡ãã£ã¨ã§å®å®ãã¦ããã
ãªãã»ã©ãããã¯ã¨ã¦ãç´å¾ã
ã追è¨ãã¤ãã§ã«ãR ã§åºåçµæãèªãã§ä¸¡å¯¾æ°ã°ã©ããæãã¦ã¿ãã
x <- read.table("zipf5000k.txt", sep=",", header=T) plot(x[c(1,3)], log="xy", type="l")
ã¯ããã®æ¹ã®ãã¬ã¯ããè¦ãããåå¸ã®ã°ã©ãããã¡ãã£ã¨å¤§ããã ãããã¨ã¯ããªããããã
ãï¼è¿½è¨ã