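I wrote an earlier article along these lines, but the approach there did not work well on large amounts of text, so I am writing a new one.
In that article, to extract \(k\) lines at random from a text file, I shuffled the lines with the "shuf -n k" command and took \(k\) of them.
However, when I run that command on a very large text file, memory consumption climbs at an alarming rate, presumably because it starts reading all of the data into memory at once. (The same happens with sort -R.)
So I looked into how to extract \(k\) lines at random without using much memory.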
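First, for the basic algorithms for sampling without replacement, I think the "advanced methods" section and the addendum of the following article are easy to follow.
The approach in that article also reads all of the elements in first, so it cannot be used for this problem.
(I suppose it could be done by reading the text once to count the elements, sampling indices, and then reading the text a second time to pull out those lines, but...)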
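The two-pass idea in the parenthetical above could be sketched roughly as follows. This is only an illustration under my own assumptions: the name `two_pass_sample` is made up here, the file is assumed to be UTF-8, and it must not change between the two passes.

```python
import random


def two_pass_sample(path, k):
    """Pick up to k lines uniformly at random by reading the file twice."""
    # Pass 1: count the lines without storing them.
    with open(path, encoding="utf-8") as f:
        total = sum(1 for _ in f)
    # Choose which line numbers to keep; only k integers are held in memory.
    wanted = set(random.sample(range(total), min(k, total)))
    # Pass 2: collect the chosen lines, in file order.
    with open(path, encoding="utf-8") as f:
        return [line for i, line in enumerate(f) if i in wanted]
```

It keeps memory small, but the file has to be read twice, which is exactly the cost the rest of this post tries to avoid.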
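So I looked for a method that can draw a random sample with only a single pass over the file (without keeping everything in memory).
(In practice, simply picking each line whenever a random number falls below some fixed probability might be good enough, though...)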
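For comparison, the "fixed probability per line" idea from the parenthetical could look like the sketch below. The names `bernoulli_sample` and `p` are hypothetical. The catch is that the number of lines you get back is itself random (about `p` times the number of lines on average), so it does not give exactly \(k\) lines.

```python
import random


def bernoulli_sample(lines, p):
    """Yield each line independently with probability p."""
    for line in lines:
        # random.random() is uniform on [0, 1), so this test passes with probability p.
        if random.random() < p:
            yield line


if __name__ == "__main__":
    picked = list(bernoulli_sample((str(i) for i in range(100000)), 0.001))
    # About 100 items on average, but the exact count differs from run to run.
    print(len(picked))
```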
Reservoir Sampling
There is apparently a method called Reservoir Sampling.
I have not managed to read the original paper, so this summary is based on the following articles.
- Reservoir sampling - Wikipedia, the free encyclopedia
- algorithm - Reservoir sampling - Stack Overflow
- Gregable: Reservoir Sampling - Sampling from a stream of elements
- Reservoir Sampling - Taming Uncertainty - Site Home - MSDN Blogs
- Sampling very large sequences - greg.beech
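Reservoir sampling is an algorithm that only needs to look at each element \(1\) time (the total number of elements does not have to be known in advance).
The only storage it needs is the number of elements to be sampled, which is \(k\) in this case.
Note that the order of the output is not fully random.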
Algorithm and code
Here is a brief description of the algorithm.
- First, read elements into an array until it holds \(k\) elements
- For the \(n\)-th element with \(n \ge k+1\), generate a random integer from \(0\) to \(n-1\), and if it is smaller than \(k\), replace the element at that index with the current element (the \(n\)-th element)
The Python code below may be easier to follow.
If iterable is shorter than k, a StopIteration exception is raised.
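    import random


    def reservoir_sampling(iterable, k):
        it = iter(iterable)
        # Fill the reservoir with the first k items
        # (next() raises StopIteration if there are fewer than k).
        reservoir = [next(it) for _ in range(k)]
        n = k
        for item in it:
            n += 1
            # Uniform integer in [0, n - 1]; it is below k with probability k / n.
            r = random.randint(0, n - 1)
            if r < k:
                reservoir[r] = item
        return reservoir


    if __name__ == '__main__':
        print(reservoir_sampling(range(1000000), 10))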
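To connect this back to the original goal of pulling \(k\) lines out of a big text file, here is a small variation. It is a sketch, not the code above: the names `sample_lines` and `sample_file` are made up for this example, the file is assumed to be UTF-8, and I use `itertools.islice` so that an input shorter than \(k\) simply returns everything instead of raising StopIteration.

```python
import random
from itertools import islice


def sample_lines(lines, k):
    """Reservoir-sample up to k items from any iterable in a single pass."""
    it = iter(lines)
    # islice stops quietly at the end of input, so short inputs give fewer than k items.
    reservoir = list(islice(it, k))
    for n, line in enumerate(it, start=k + 1):
        # n is the 1-based position of this item in the input.
        r = random.randrange(n)  # uniform integer in [0, n - 1]
        if r < k:
            reservoir[r] = line
    return reservoir


def sample_file(path, k):
    """Sample up to k lines from the text file at path, reading it only once."""
    with open(path, encoding="utf-8") as f:
        # A file object yields one line at a time, so only k lines stay in memory.
        return sample_lines(f, k)
```

Since a file object is iterated lazily, memory use stays proportional to \(k\) rather than to the size of the file.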
Why does this work?
Consider the elements from \(n=k+1\) onward.
The probability that the \(n\)-th element replaces an element of the array is \(\frac{k}{n}\)
The probability that one particular element in the array is replaced at step \(n+1\) is the product of the probability \(\frac{k}{n+1}\) that a replacement happens and the probability \(\frac{1}{k}\) that this element is the one chosen among the \(k\), so it is \(\frac{k}{n+1}\times\frac{1}{k}=\frac{1}{n+1}\)
Hence the probability that a particular element is not replaced at step \(n+1\) is \(1-\frac{1}{n+1}=\frac{n}{n+1}\)
Therefore the probability that an element that is in the array at step \(n\) is still there at step \(N\) is \(\frac{n}{n+1} \times \frac{n+1}{n+2}\times\dots\times\frac{N-2}{N-1}\times\frac{N-1}{N}=\frac{n}{N}\)
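From the above, the probability that the \(n\)-th element replaces an element of the array and then survives until step \(N\) is the product of the probability of entering the array and the probability of surviving until step \(N\), so it is \(\frac{k}{n}\times\frac{n}{N}=\frac{k}{N}\)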
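If you want to convince yourself that the product really collapses to \(\frac{k}{N}\), it can be checked with exact rational arithmetic. The function name `inclusion_probability` and the values of `k` and `N` below are just examples I picked.

```python
from fractions import Fraction


def inclusion_probability(n, k, N):
    """Probability that the n-th element (n > k) is in the array after N elements."""
    enter = Fraction(k, n)  # the n-th element replaces some slot
    survive = Fraction(1)
    for m in range(n + 1, N + 1):
        # At step m a given slot is overwritten with probability 1 / m.
        survive *= 1 - Fraction(1, m)
    return enter * survive


if __name__ == "__main__":
    k, N = 3, 20  # arbitrary example values
    for n in range(k + 1, N + 1):
        assert inclusion_probability(n, k, N) == Fraction(k, N)
    print("every n gives", Fraction(k, N))
```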
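For the elements that were in the array from the start, we can treat this as the probability that an element in the array at step \(k\) is still there at step \(N\), which by the calculation above is \(\frac{k}{N}\)
This shows that every element ends up in the array with equal probability.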
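The same conclusion can also be checked empirically. The sketch below repeats the sampling many times and counts how often each element ends up in the array. The helper names `reservoir` and `inclusion_frequencies` and the values of `N`, `k`, and the trial count are my own choices for illustration.

```python
import random
from collections import Counter


def reservoir(iterable, k):
    """The procedure described above, repeated here so the block is self-contained."""
    result = []
    for n, item in enumerate(iterable, start=1):
        if n <= k:
            result.append(item)
        else:
            r = random.randrange(n)  # uniform integer in [0, n - 1]
            if r < k:
                result[r] = item
    return result


def inclusion_frequencies(N, k, trials):
    """Fraction of trials in which each of 0..N-1 ends up in the sample."""
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir(range(N), k))
    return [counts[i] / trials for i in range(N)]


if __name__ == "__main__":
    N, k = 10, 3  # arbitrary small example
    for i, freq in enumerate(inclusion_frequencies(N, k, 20000)):
        # Each frequency should come out close to k / N = 0.3.
        print(i, round(freq, 3))
```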
Related URLs
On sampling algorithms for the weighted case (for when all of the elements are known in advance)
- ランダム抽出アルゴリズムについて考える - Shogo's Blog
- algorithm - Select random k elements from a list whose elements have weights - Stack Overflow
On a few things to watch out for when generating uniform random integers within a range (though this does not matter if your library can return random integers in a given range directly, as in the code above)
Fisher–Yates shuffle is an algorithm for randomly shuffling an array.
Like Reservoir Sampling, it can be used even when the total number of elements is unknown, so, as described in the Wikipedia article on Reservoir Sampling, if you run a Fisher–Yates shuffle while keeping only the first k elements, you can shuffle and sample the elements at the same time.
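Here is how I understand that idea, as a rough sketch. It runs the "inside-out" form of the Fisher–Yates shuffle, which builds the shuffled array one incoming element at a time, and simply never stores positions \(k\) and beyond. The name `shuffled_sample` is made up for this example.

```python
import random


def shuffled_sample(iterable, k):
    """Return up to k items, uniformly sampled and in random order, in one pass."""
    result = []
    for i, item in enumerate(iterable):
        j = random.randint(0, i)  # uniform integer in [0, i], both ends included
        if i < k:
            if j == i:
                # The new item lands in its own (new) position.
                result.append(item)
            else:
                # Move the old occupant of j to the new position i, put the item at j.
                result.append(result[j])
                result[j] = item
        elif j < k:
            # The displaced element would go to position i >= k, which is not kept.
            result[j] = item
    return result
```

Unlike the earlier code, the order of the returned elements is itself random here, so no separate shuffle of the result should be needed.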