Counting words in large-scale data
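Notes on methods for counting the frequencies of items (n-grams and the like) in a single pass over large-scale data. For the past several years, methods for using the statistics of very large n-gram collections in a space- and time-efficient way have been proposed almost every year. A recent example: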
- Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval (EMNLP 2010)
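In this paper, fine-grained techniques such as minimal perfect hash functions and frequency encodings that exploit the power law are carefully assembled. Seeing the tricks become this detailed, my impression is that research on compressing n-gram frequency information, which started around the log-frequency Bloom filter (ACL 2007), is finally converging. (Just before reading the paper I had been thinking about something like the variable-length fingerprints in its Section 7; I am glad I had not started working on it.)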
These studies propose data structures for compactly storing static n-gram frequency information that has already been counted, but the job is only half done unless the counting itself is also time- and space-efficient. So I read a recently published survey paper.
The paper itself is a revised version of one that appeared at VLDB 2008, so its content is not especially new, but it is very well balanced: concise explanations of the basic algorithms, a comparison through comprehensive experiments, and future directions. Since I read it anyway, here are brief notes on the basic algorithms (below, the things being counted are called items).
To efficiently count items with more distinct types than fit in memory, you have to cut corners somewhere, unless you resort to distributed computation or brute force. Broadly, there are counter-based (deterministic) methods, which keep explicit counters for only as many distinct items as memory allows, and sketch-based (probabilistic) methods, which keep approximate counters for all items. (The survey also covers Quantile algorithms for somewhat more general problems and extensions of sketch-based methods, but they are omitted here.) Every algorithm is simple and beautiful enough to be understood at a glance, at around 10 lines of pseudocode. Below I introduce the five algorithms for which pseudocode is given.
Counter-based Algorithm
For a total of N item occurrences, counter-based methods maintain the frequencies of (high-frequency) items with an error of at most Nε (ε < 1). There are three basic counter-based algorithms (only the last two are practical in real use):
- Frequent (Frequency estimation of internet packet streams with limited space; ESA 2002)
- Lossy counting (Approximate Frequency Counts over Data Streams, VLDB 2002)
- Space saving (Efficient computation of frequent and top-k elements in data streams, ICDT 2005)
In all of these, the guarantee on the approximation error Nε rests on the idea of the classic Majority algorithm: to obtain the high-frequency items that occur more than Nε times (and their approximate frequencies), it suffices to keep counters for at most 1/ε - 1 items. Frequent uses this directly: it keeps k = 1/ε - 1 counters and obtains the frequencies of all items with frequency at least Nε, with approximation error Nε (the reported frequencies are always underestimates; the case ε = 0.5 (k = 1) is the Majority algorithm).
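The Frequent scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's pseudocode verbatim; the function name `frequent` and the sample stream are made up for the example.

```python
def frequent(stream, k):
    """Frequent (Misra-Gries style) with at most k counters.

    Estimates never exceed the true counts, and the shortfall is at most
    N / (k + 1), i.e. N * epsilon when k = 1/epsilon - 1.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = list("abacabadaaeb")      # N = 12, true counts: a=6, b=3, c=d=e=1
est = frequent(stream, k=3)        # epsilon = 1/4, so the error is at most 3
print(est)                         # {'a': 5, 'b': 2, 'e': 1}
```

Note how `e` survives with count 1 although it is no more frequent than `c` or `d`, which is the kind of sloppiness in the output that the next paragraph complains about.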
The frequencies obtained by Frequent come with a theoretical error guarantee, but in practice they turn out to be rather sloppy values and are not very useful (it is hard to extract from the output the top-k' (k' ≤ k) items whose frequency is really at least Nε). With lossy counting and space saving, on the other hand, the approximation error of high-frequency items can empirically be kept (far) below Nε on real data that follows a power law. For example, lossy counting splits the data into buckets of 1/ε items and processes them in order. After processing the i-th (i ≥ 0) bucket, among the n = i/ε items processed so far, it deletes the counters of items whose frequency (more precisely, its upper bound once the error is taken into account) is less than Δ = i (= nε = i/ε * ε). This Δ is also used as the error estimate for items newly added in the (i+1)-th bucket, that is, the maximum frequency they could have had in the buckets up to the i-th. Space saving, in turn, is a simple algorithm that always keeps exactly k (= 1/ε) counters and just swaps a newly observed item with the lowest-frequency item. In these algorithms, the error at the time an item is first added is retained as long as the item is not deleted from the counters, so the more frequently and evenly an item appears, the more accurately its frequency can be estimated.
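As a rough sketch of the lossy counting procedure just described, the following Python keeps, for each item, an upper bound on its frequency together with the error Δ that was in force when the item was inserted. It follows the "delete if the upper bound is less than Δ" rule from the text; other presentations of lossy counting use slightly different indexing. The names are illustrative.

```python
import math


def lossy_counting(stream, epsilon):
    """Return {item: [upper bound on frequency, error fixed at insertion]}."""
    width = math.ceil(1 / epsilon)   # bucket size 1/epsilon
    delta = 0                        # number of buckets completed so far
    table = {}
    n = 0
    for item in stream:
        n += 1
        if item in table:
            table[item][0] += 1
        else:
            # A new item may have occurred up to delta times before being tracked.
            table[item] = [1 + delta, delta]
        if n % width == 0:           # end of a bucket
            delta = n // width
            for key in [k for k, (upper, _) in table.items() if upper < delta]:
                del table[key]
    return table


result = lossy_counting(list("aabcadaeafag"), epsilon=0.25)
print(result)   # {'a': [6, 0], 'f': [3, 2], 'g': [3, 2]}
```

The item `a`, inserted in the first bucket, keeps error 0 for its whole lifetime, while the late arrivals `f` and `g` carry error 2: their true count is somewhere between 3 - 2 = 1 and 3.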
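The survey concludes that space saving is the best counter-based method overall. In terms of space complexity, space saving with its fixed number of counters is certainly easier to use*1, but in terms of update speed my impression, from implementing and trying them myself, is that there is no big difference*2. The essential difference between lossy counting and space saving is perhaps best thought of as the quality of the extracted top-k' (the difference in the number of false positives).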
Incidentally, lossy counting is by now a textbook-level basic algorithm and is used in a number of other works.
If you want output in frequency order, space saving poses no problem because the data structure used to implement its counters (stream summary) supports frequency-ordered output, whereas lossy counting needs some care: depending on how the counters are implemented, a separate sort may be required at output time. (In addition, when the fraction of items deleted after each bucket is extremely small relative to the whole, a data structure like stream summary may be a better choice than a space- and time-efficient hash-based counter.)
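For comparison, here is a minimal sketch of space saving. To stay short it uses a plain dict with a linear scan for the minimum and sorts at output time; it does not implement the stream summary structure mentioned above, which finds the minimum and yields frequency order without that extra work. All names are illustrative.

```python
def space_saving(stream, k):
    """Return [(item, estimated count, max overestimation)] in frequency order."""
    counts = {}   # item -> estimated frequency (never an underestimate)
    errors = {}   # item -> overestimation bound, fixed when the item was inserted
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
            errors[item] = 0
        else:
            # Replace the lowest-frequency item and inherit its count.
            victim = min(counts, key=counts.get)   # O(k) scan in this sketch
            floor = counts.pop(victim)
            del errors[victim]
            counts[item] = floor + 1
            errors[item] = floor
    return sorted(((x, c, errors[x]) for x, c in counts.items()),
                  key=lambda t: -t[1])


top = space_saving(list("aabcaca"), k=2)
print(top)   # [('a', 4, 0), ('c', 3, 1)]
```

Here the true count of `c` is 2; the reported 3 minus its error bound 1 gives a guaranteed lower bound, which is what one would use to filter false positives out of the top-k'.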
Sketch-based Algorithm
Sketch-based algorithms, on the other hand, do not store keys directly; they keep d counter arrays of size w that are shared by all items, with collisions allowed. (Which slot of each array an item uses is determined by a pairwise independent hash function prepared for each array*3; to output the top-k', keys are stored separately while the approximate counters are used.) The difference from counter-based methods is that the frequency error falls outside the range Nε with probability δ. (The size and number of arrays are optimized for the given (ε, δ); CountMin Sketch needs d = ln 1/δ arrays of size w = e/ε.) There are two basic algorithms (CGT: Combinatorial Group Testing is omitted here because I have not implemented it):
- CountSketch (Finding Frequent Items in Data Streams; ICALP 2002)
- CountMin sketch (An improved data stream summary: the count-min sketch and its applications; Journal of Algorithms 55, 2005)
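To make the (w, d) sizing concrete, here is a small CountMin sketch in Python. It is only a sketch under assumptions: the hash family `((a*x + b) mod p) mod w` with a Mersenne prime p, and the use of `hashlib.blake2b` to turn arbitrary keys into integers, are choices made for this example and are not taken from the papers or from the C++ implementation mentioned later.

```python
import hashlib
import math
import random

P = (1 << 61) - 1   # prime modulus for the (a*x + b) mod p hash family (assumption)


def key_to_int(item):
    """Map any key to an integer below P (illustrative helper)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % P


class CountMinSketch:
    def __init__(self, epsilon, delta, seed=0):
        self.w = math.ceil(math.e / epsilon)      # slots per array: e / epsilon
        self.d = math.ceil(math.log(1 / delta))   # number of arrays: ln(1 / delta)
        rng = random.Random(seed)
        self.hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.rows = [[0] * self.w for _ in range(self.d)]

    def _slots(self, item):
        x = key_to_int(item)
        return [((a * x + b) % P) % self.w for a, b in self.hashes]

    def update(self, item, count=1):
        for row, slot in zip(self.rows, self._slots(item)):
            row[slot] += count

    def estimate(self, item):
        # Collisions only add to a slot, so the minimum is the tightest overestimate.
        return min(row[slot] for row, slot in zip(self.rows, self._slots(item)))


cm = CountMinSketch(epsilon=0.01, delta=0.001)
print(cm.w, cm.d)   # 272 7
for word in ["the"] * 50 + ["of"] * 20 + ["sketch"] * 5:
    cm.update(word)
```

With these formulas, tightening δ from 0.01 to 0.0001 takes d from 5 to 10 arrays, which is the factor of two discussed in footnote *4.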
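CountSketch decides the sign of each update with a separate hash function, so its updates are about twice as slow as those of CountMin sketch. (Also, since it uses the median of the values stored in the arrays as the approximate frequency, the implementation gets a little tedious if you care about time complexity, though this is not a problem in practice.) CountMin Sketch is also theoretically better than CountSketch in space complexity, and it is pleasant to use: for instance, when it misses the error range it only ever overestimates. With either method the probability δ of missing the error range can be made quite small*4, so I think both are highly practical.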
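The two differences from CountMin described above, the extra sign hash and the median, can be seen in the following minimal CountSketch. As before, the hash family and the `blake2b` key mapping are assumptions made for illustration, and the low bit of a second hash is used as the sign without any claim about its exact independence properties.

```python
import hashlib
import random
import statistics

P = (1 << 61) - 1   # prime modulus for the illustrative hash family


def key_to_int(item):
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % P


class CountSketch:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.rows = [[0] * w for _ in range(d)]
        # Per array: (a, b) pick the slot, (c, e) pick the sign.
        self.params = [tuple(rng.randrange(1, P) for _ in range(4)) for _ in range(d)]

    def _slot_and_sign(self, x, params):
        a, b, c, e = params
        slot = ((a * x + b) % P) % self.w
        sign = 1 if ((c * x + e) % P) & 1 else -1
        return slot, sign

    def update(self, item, count=1):
        x = key_to_int(item)
        for row, params in zip(self.rows, self.params):
            slot, sign = self._slot_and_sign(x, params)
            row[slot] += sign * count      # the second hash decides +count or -count

    def estimate(self, item):
        x = key_to_int(item)
        votes = []
        for row, params in zip(self.rows, self.params):
            slot, sign = self._slot_and_sign(x, params)
            votes.append(sign * row[slot])
        return statistics.median(votes)    # colliding items cancel out on average


cs = CountSketch(w=64, d=5)    # odd d keeps the median an integer
for _ in range(7):
    cs.update("the")
print(cs.estimate("the"))      # 7
```

Unlike CountMin, the estimate can err in either direction once other items collide with the queried one.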
If you need an error smaller than Nε for high-frequency items, or need to store keys explicitly, counter-based methods are probably the way to go; otherwise, sketch-based methods.
Several methods that combine the advantages of both have also been reported, for example:
- Finding top-k elements in data streams (Information Sciences 180 (24), 2010)
- Probabilistic Lossy Counting: An efficient algorithm for finding heavy hitters (SIGCOMM 2008)
The former puts a sketch-based filter in front of space saving. In space saving, low-frequency items that will not end up among the final counters keep entering and leaving them. So, when updating an item's frequency, the cumulative number of times the item was not in the counters is recorded approximately in a sketch-based counter, and updates are skipped for low-frequency items that would need a fresh registration every time (I only skimmed this paper, so details are omitted). The latter is a variant of lossy counting that takes the skew of the data (power law) into account to keep the estimate of the error Δ smaller (at the cost of allowing the error to exceed the estimate with a given probability δ (<< 1)). As a result, more items are deleted after each bucket and space usage improves. It assumes that the skewness of the data (in short, the power-law parameter) is given, which makes it somewhat less convenient to use, but theoretically it is neatly put together.
In the algorithms described so far, the approximation error is not bounded relative to each item's own frequency, so the relative error becomes large for low-frequency items. As methods that count with a per-item relative error guarantee, combinations of the log-frequency Bloom filter mentioned above (or a variant of it) with Morris's classic Approximate Counting have been proposed (I only skimmed the former, but its content is almost the same as the latter):
- Probabilistic Counting with Randomized Storage (IJCAI 2009)
- Succinct Approximate Counting of Skewed Data (IJCAI 2009)
As both studies point out, the appropriate size of the Bloom filter (counter), which depends on the number of distinct items, the average log frequency, and so on, cannot be determined in advance (nor can it be changed dynamically, since keys are not stored), so truly one-pass counting does not seem possible. Also, the log-frequency Bloom filter stores the frequency f quantized as log_(1+ε)(f) (ε is the relative error), so the higher the frequency of the item being updated or retrieved, and the smaller the relative error you aim for, the more counter lookups are needed and the slower processing becomes.
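The two building blocks named above can be illustrated in isolation. The sketch below shows log_(1+ε) quantization of a frequency and a Morris-style probabilistic increment with base 1+ε; it is not the data structure of either IJCAI paper, only the arithmetic behind them, and the function names are made up for the example.

```python
import math
import random


def quantize(f, epsilon):
    """floor(log_{1+epsilon} f) for f >= 1 (floating point, so exact powers may round down)."""
    return int(math.log(f) / math.log(1 + epsilon))


def dequantize(q, epsilon):
    """Smallest frequency that maps to the quantized value q."""
    return (1 + epsilon) ** q


def morris_increment(c, base, rng):
    """Advance the stored exponent c with probability base**(-c)."""
    return c + 1 if rng.random() < base ** -c else c


def morris_estimate(c, base):
    """Estimate of the number of increments seen, given exponent c."""
    return (base ** c - 1) / (base - 1)


q = quantize(100, 0.5)
print(q)   # 11, since 1.5**11 <= 100 < 1.5**12

rng = random.Random(0)
c = 0
for _ in range(1000):
    c = morris_increment(c, 1.5, rng)
print(c, morris_estimate(c, 1.5))   # a small exponent standing in for roughly 1000
```

The quantized value grows with log f and with 1/log(1+ε), which matches the observation above that high frequencies and small ε both mean more work per update or lookup.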
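Heuristics for reducing the relative error of CountMin Sketch (conservative update) have also been tried. When the slots corresponding to an observed item are updated in each array, slots whose value is already larger than the item's current approximate frequency are left untouched. In my own implementation, updates became about 20% slower, but the heuristic seems quite effective at reducing the error of the approximate frequencies. It takes only about 10 extra lines to implement, so I recommend it.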
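A minimal sketch of that heuristic, next to the plain update for comparison, might look as follows. Keys are assumed to be non-negative integers here, and the tiny 4 x 3 sketch is chosen only to force collisions; the helper names are hypothetical.

```python
import random

P = (1 << 61) - 1   # prime modulus for the illustrative hash family


def make_sketch(w, d, seed=0):
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    return {"w": w, "hashes": hashes, "rows": [[0] * w for _ in range(d)]}


def slots(sketch, x):
    return [((a * x + b) % P) % sketch["w"] for a, b in sketch["hashes"]]


def estimate(sketch, x):
    return min(row[s] for row, s in zip(sketch["rows"], slots(sketch, x)))


def update_plain(sketch, x):
    for row, s in zip(sketch["rows"], slots(sketch, x)):
        row[s] += 1


def update_conservative(sketch, x):
    idx = slots(sketch, x)
    target = min(row[s] for row, s in zip(sketch["rows"], idx)) + 1
    for row, s in zip(sketch["rows"], idx):
        # Slots already above the current estimate are left untouched.
        if row[s] < target:
            row[s] = target


plain, cons = make_sketch(4, 3), make_sketch(4, 3)   # same seed, same hash functions
stream = [i % 7 for i in range(200)] + [0] * 100
for x in stream:
    update_plain(plain, x)
    update_conservative(cons, x)
```

For every key, the conservative estimate lies between the true count and the plain CountMin estimate, so it can only reduce the overestimation.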
Having looked at all these methods, the algorithms are fundamentally very similar, and the paper titles are all alike too, which is confusing. Some methods are similar even though their names are completely different, such as CountMin Sketch and the spectral Bloom filter (SIGMOD 2003) (the former uses an independent counter array per hash function, while the latter uses a single shared counter array), so care is needed.
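Some papers I read carefully and others I only skimmed, so the granularity of the explanations is uneven and this post got needlessly long. I expect algorithms that are well suited to distributed computation to become more common, but I am personally not very interested in that area, so I have left it out.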
[Addendum] I am gradually fixing the parts that were written carelessly.
[Addendum; 11/19] I implemented space saving and CountMin sketch (+ conservative update), which look like the most promising counter-based and sketch-based counters respectively, in C++.
Throughput on an Intel Core 2 at 3.2 GHz is about 58-65 MiB/sec (10-11M words/sec) for CountMin sketch (δ=0.001, ε=0.01-0.0001), which does not store keys explicitly, and 37-52 MiB/sec (6-9M words/sec) for space saving (ε=0.01-0.00001), which does. The full text of Japanese Wikipedia (1.5 GiB, text only) used in 動的ダブル配列を使って Wikipedia のテキスト処理を高速化 - ny23の日記 can be processed in about 30 seconds. It may be useful if you want to try out what kind of output you get on your own data.
*1: In lossy counting the size of the counters is O(1/ε log Nε) in the worst case, versus O(1/ε) for space saving. However, considering that the space usage of lossy counting is strongly affected by the skew of the frequency distribution, and that lossy counting can use data structures with better per-item space usage than space saving, there does not seem to be a big difference on real datasets.
*2: The experimental results on update speed in this survey paper are probably best taken only as a rough guide. For example, space saving shows considerably higher throughput than lossy counting there, but another survey paper (Frequent items in streaming data: An experimental evaluation of the state-of-the-art (Data & Knowledge Engineering 68, 2009)) reports (or appears to report) exactly the opposite, even though, apart from space saving, it seems to use almost the same implementations as the experiments in the VLDB Journal survey. Also, looking at the authors' implementation of lossy counting, there still seems to be room for optimization. Note too that the item type is uint32_t and that the entire data set is loaded into memory in advance for the experiments. In my own implementation and experiments, when dynamic memory allocation is needed, for example when the item type is a variable-length string (char *), preparing the input is the bottleneck even when just reading data with one item per line, and all of the methods above stayed within a factor of two of the speed of the extravagant counter I implemented for 動的ダブル配列を使って Wikipedia のテキスト処理を高速化 - ny23の日記.
*3: For example, hash function families based on irreducible polynomials (cf. Recursive n-gram hashing is pairwise independent, at best (Computer Speech & Language 24 (4), 2010)). That paper proposes a method that, when substrings (character n-grams) are extracted from a string as keys, computes their hash values faster (by up to a factor of n) by using a hash function family based on cyclic polynomials.
*4: Because the number of counter arrays is O(ln 1/δ), changing δ from 0.01 to 0.0001 in CountMin Sketch, for example, costs only about 2x in space and about a 1/2x slowdown in frequency update and lookup speed.
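As an illustration of the cyclic-polynomial idea mentioned in footnote *3, the following sketch hashes every byte n-gram of a string with a rolling update that costs O(1) per position instead of O(n). The 32-bit word size, the random byte table, and the function names are assumptions for this example; it demonstrates only the rolling mechanics and makes no claim about the independence properties analyzed in the cited paper.

```python
import random

BITS = 32
MASK = (1 << BITS) - 1


def rotl(x, r):
    """Rotate a BITS-bit value left by r positions."""
    r %= BITS
    return ((x << r) | (x >> (BITS - r))) & MASK


def make_table(seed=0):
    """One random BITS-bit value per possible byte (illustrative)."""
    rng = random.Random(seed)
    return [rng.getrandbits(BITS) for _ in range(256)]


def direct_hash(window, table):
    """Hash one n-gram from scratch: O(n) work."""
    h = 0
    for byte in window:
        h = rotl(h, 1) ^ table[byte]
    return h


def rolling_hashes(data, n, table):
    """Hashes of all n-grams of data, updating in O(1) per step."""
    if len(data) < n:
        return []
    h = direct_hash(data[:n], table)
    out = [h]
    for i in range(n, len(data)):
        # Rotate, cancel the byte leaving the window, mix in the byte entering it.
        h = rotl(h, 1) ^ rotl(table[data[i - n]], n) ^ table[data[i]]
        out.append(h)
    return out


table = make_table()
data = b"abracadabra"
hashes = rolling_hashes(data, 3, table)
```

Each rolled value equals the hash computed from scratch for the same window, and identical trigrams such as the two occurrences of `abr` receive identical hashes.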