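I'm Nakamura (po3rin), a software engineer on the AI & Machine Learning team in the M3 Engineering Group. I like search and Go.
In this post I describe how I sped up our existing string pattern-matching logic by calling Daachorse from Python. Daachorse is a fast pattern-matching machine written in Rust that has been making waves in the string-processing community.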
- What is Daachorse?
- Why call it from Python?
- Benchmark: pattern matching only
- Benchmark: python-daachorse including automaton construction
- Summary
What is Daachorse?
Daachorse is a Rust library for string pattern matching, developed and operated by LegalForce.
For the technical details, please refer to LegalForce's article, which explains everything.
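To make the task concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton, the algorithm family Daachorse belongs to. It is an illustration only: Daachorse itself uses a compact double-array representation in Rust, and the function names `build_automaton` and `find_overlapping` below are chosen for this sketch, not taken from any library.

```python
from collections import deque


def build_automaton(patterns):
    """Build a simple Aho-Corasick automaton as (goto, fail, output) tables."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    output = [[]]  # indexes of the patterns that end at each state

    # 1. Insert every pattern into a trie.
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(idx)

    # 2. Compute failure links breadth-first (depth-1 states fail to the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches reachable through the failure link.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_overlapping(automaton, patterns, text):
    """Return every match as (start, end, pattern_index), overlaps included."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for end, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in output[state]:
            matches.append((end - len(patterns[idx]), end, idx))
    return matches


patterns = ["bcd", "ab", "a"]
automaton = build_automaton(patterns)
print(find_overlapping(automaton, patterns, "abcd"))
```

The text is scanned once regardless of how many patterns there are, which is why this family of algorithms suits dictionary matching against many documents.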
Why call it from Python?
For a certain use case we had string pattern-matching logic written in Python, but its performance was underwhelming and I was struggling to speed it up. Then Daachorse was released, and I very much wanted to try it.
However, the AI & Machine Learning team uses the Python module gokart throughout its data-processing and model-training pipelines, so we generally develop in Python whenever we implement something. Because the current data-processing logic runs on gokart, rewriting everything in Rust would be a major undertaking. So I wondered whether I could rewrite only the core logic in Rust to speed it up.
When I looked into it, I found that Python bindings for Daachorse had already been published, so I decided to use them.
python-daachorse appears to use PyO3, so if you ever feel the urge to rewrite Python logic in Rust, give PyO3 a try.
I also wanted to learn how to call Rust code from Python myself, so I tried writing a simple sample implementation that calls Daachorse through PyO3. I hope it serves as a reference for getting started with PyO3.
This time I used python-daachorse, ran actual benchmarks, and investigated whether it is ready for production use.
Benchmark: pattern matching only
I targeted a process at our company that actually uses an Aho-Corasick (AC) machine, and checked whether python-daachorse also performs well under our conditions. Benchmarks for Daachorse have already been published for both the Word100K and UniDic datasets, but I ran my own for the following reasons.
- I wanted to know the performance including the overhead of calling through the Python wrapper
- I wanted to know whether it performs well on our actual pattern set
- I wanted to compare it with pyahocorasick, which we currently use
The baselines are pyahocorasick, the Python module used in our current logic; ahocorapy, a pure-Python implementation; and ahocorasick_rs, which, like python-daachorse, is a Python binding for a Rust crate.
pyahocorasick
ahocorapy
ahocorasick_rs
This benchmark does not include automaton construction; it measures pure pattern matching only.
Benchmark environment
platform darwin -- Python 3.9.0, pytest-7.1.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
The module versions are as follows.
pyahocorasick = "^1.4.4"
daachorse = "^0.1.3"
ahocorapy = "^1.6.1"
ahocorasick-rs = "^0.12.2"
The benchmark dataset is as follows.
An in-house Japanese medical dataset
Number of patterns: 22948 (one of our in-house medical dictionaries)
Number of texts: 5094 (some of our in-house document data)
The patterns and the texts each have the string-length distributions shown below.
And here is the benchmark code.
from typing import Any

import ahocorasick
import ahocorasick_rs
import daachorse
from ahocorapy.keywordtree import KeywordTree


def get_data() -> tuple[list[str], list[str]]:
    # Return (patterns, haystacks) from whatever nice dataset you have.
    # The tiny values below are placeholders so that the file runs as-is;
    # the post itself used an in-house dataset here.
    patterns = ["bcd", "ab", "a"]
    haystacks = ["abcd"]
    return patterns, haystacks


def substr_match_ahocorasick(automaton: Any, haystacks: list[str]):
    # Collect the matches for every haystack, not only the last one.
    return [list(automaton.iter(haystack)) for haystack in haystacks]


def substr_match_ahocorapy(automaton: Any, haystacks: list[str]):
    # Note: search() reports only the first match in each haystack.
    return [automaton.search(t) for t in haystacks]


def substr_match_with_ahocorasick_rs(automaton: Any, haystacks: list[str]):
    return [automaton.find_matches_as_indexes(t) for t in haystacks]


def substr_match_with_daachorse(automaton: Any, haystacks: list[str]):
    return [automaton.find_overlapping(t) for t in haystacks]


def test_match_ahocorasick_benchmark(benchmark):
    patterns, haystacks = get_data()
    automaton = ahocorasick.Automaton()
    for idx, key in enumerate(patterns):
        automaton.add_word(key, (idx, key))
    automaton.make_automaton()
    # Only the matching step is timed; the automaton is built beforehand.
    ret = benchmark(substr_match_ahocorasick, automaton=automaton, haystacks=haystacks)
    assert len(ret) != 0


def test_match_ahocorapy_benchmark(benchmark):
    patterns, haystacks = get_data()
    automaton = KeywordTree(case_insensitive=True)
    for key in patterns:
        automaton.add(key)
    automaton.finalize()
    ret = benchmark(substr_match_ahocorapy, automaton=automaton, haystacks=haystacks)
    assert len(ret) != 0


def test_match_ahocorasick_rs_benchmark(benchmark):
    patterns, haystacks = get_data()
    automaton = ahocorasick_rs.AhoCorasick(patterns)
    ret = benchmark(substr_match_with_ahocorasick_rs, automaton=automaton, haystacks=haystacks)
    assert len(ret) != 0


def test_match_daachorse_benchmark(benchmark):
    patterns, haystacks = get_data()
    automaton = daachorse.Automaton(patterns)
    ret = benchmark(substr_match_with_daachorse, automaton=automaton, haystacks=haystacks)
    assert len(ret) != 0
The results are as follows.
------------------------------------------------------------------------------------ benchmark: 4 tests ------------------------------------------------------------------------------------
Name (time in ms)                      Min                 Max                 Mean                StdDev            Median              IQR               Outliers  OPS             Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_match_daachorse_benchmark         80.8431 (1.0)       147.8423 (1.0)      93.7325 (1.0)       19.8576 (1.0)     86.4954 (1.0)       11.9653 (1.12)    1;1       10.6687 (1.0)   10      1
test_match_ahocorasick_rs_benchmark    123.6172 (1.53)     412.8799 (2.79)     179.8610 (1.92)     114.3097 (5.76)   136.1129 (1.57)     10.6961 (1.0)     1;1       5.5598 (0.52)   6       1
test_match_ahocorasick_benchmark       736.3777 (9.11)     901.6807 (6.10)     776.5008 (8.28)     70.4611 (3.55)    745.9376 (8.62)     54.7538 (5.12)    1;1       1.2878 (0.12)   5       1
test_match_ahocorapy_benchmark         1,339.0980 (16.56)  3,124.5482 (21.13)  1,908.8495 (20.36)  744.7254 (37.50)  1,565.3466 (18.10)  979.0138 (91.53)  1;0       0.5239 (0.05)   5       1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
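Daachorse is great. Comparing by Mean, it is about 8 times faster than pyahocorasick, which we currently use. It also came out faster than ahocorasick_rs, the other Python binding of a Rust implementation. From these results I judged that it can more than hold its own in production.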
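As a quick check of the "about 8 times" figure, the ratios can be recomputed from the Mean column of the table above. This is only a small arithmetic sketch; the dictionary keys and the helper name `slowdown_vs` are made up for illustration.

```python
# Mean times in milliseconds, copied from the Mean column of the benchmark table.
mean_ms = {
    "daachorse": 93.7325,
    "ahocorasick_rs": 179.8610,
    "pyahocorasick": 776.5008,
    "ahocorapy": 1908.8495,
}


def slowdown_vs(baseline, means):
    """Return how many times slower each entry is than the baseline."""
    base = means[baseline]
    return {name: round(value / base, 2) for name, value in means.items()}


ratios = slowdown_vs("daachorse", mean_ms)
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The computed ratios agree with the parenthesized values that pytest-benchmark prints next to each Mean.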
Benchmark: python-daachorse including automaton construction
We build our pipelines with gokart, so ideally an automaton that has been built once would be saved as a gokart cache (pickle). Surprisingly, however, builtins.Automaton cannot be saved with pickle. As the developer also commented, this is because python-daachorse does not wrap serialize/deserialize.
serialize/deserialize are unsafe, so I don't want to write a wrapper for them. daachorse uses get_unchecked internally, and there is no telling what will happen when untrusted data is deserialized.
(@vbkaisetsu, September 25, 2022)
On unsafe Rust, the following guide from the Japanese edition of The Rust Programming Language is very instructive.
To isolate unsafe code as much as possible, it's best to enclose unsafe code within a safe abstraction and provide a safe API... Parts of the standard library are implemented as safe abstractions over unsafe code that has been audited. Wrapping unsafe code in a safe abstraction prevents uses of unsafe from leaking out into all the places that you or your users might want to use the functionality implemented with unsafe code.
In line with the unsafe Rust guide, the intent in python-daachorse seems to be to keep unsafe wrapped in a safe abstraction so that it can provide a safe API.
As a result, with python-daachorse the automaton has to be rebuilt every time a process runs. So before putting it into production, I also needed a benchmark that includes automaton construction.
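One way to live with an unpicklable automaton in a pickle-based cache is to cache only the pattern list and rebuild the automaton when the object is restored. The sketch below illustrates that idea with the standard `__getstate__` / `__setstate__` hooks. `UnpicklableAutomaton` is a hypothetical stand-in (a naive matcher holding a lock so that pickle rejects it), not the real `daachorse.Automaton`, and `RebuildingMatcher` is not something the post or gokart provides.

```python
import pickle
import threading


class UnpicklableAutomaton:
    """Hypothetical stand-in for a native automaton that pickle cannot save."""

    def __init__(self, patterns):
        self._patterns = list(patterns)
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def find_overlapping(self, text):
        # Naive scan for illustration only: (start, end, pattern_index).
        return [
            (start, start + len(p), idx)
            for idx, p in enumerate(self._patterns)
            for start in range(len(text) - len(p) + 1)
            if text.startswith(p, start)
        ]


class RebuildingMatcher:
    """Pickles only the patterns and rebuilds the automaton on restore."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self._automaton = UnpicklableAutomaton(self.patterns)

    def __getstate__(self):
        # Drop the automaton; keep only the plain pattern list.
        return {"patterns": self.patterns}

    def __setstate__(self, state):
        self.patterns = state["patterns"]
        # Rebuild here, paying the construction cost once per restore.
        self._automaton = UnpicklableAutomaton(self.patterns)

    def find_overlapping(self, text):
        return self._automaton.find_overlapping(text)


def pickle_roundtrip(obj):
    """Serialize and restore an object the way a pickle-based cache would."""
    return pickle.loads(pickle.dumps(obj))
```

Calling `pickle_roundtrip(RebuildingMatcher(patterns))` returns a working matcher, at the price of one automaton construction per restore, which is exactly the cost the next benchmark measures.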
Let's add one more benchmark to the earlier code. This one includes automaton construction.
def substr_match_with_daachorse_build_automaton(patterns: list[str], haystacks: list[str]):
    # Build the automaton inside the timed function so that construction is measured too.
    automaton = daachorse.Automaton(patterns)
    result = [automaton.find_overlapping(t) for t in haystacks]
    return result


def test_match_daachorse_with_build_automaton_benchmark(benchmark):
    patterns, haystacks = get_data()
    ret = benchmark(substr_match_with_daachorse_build_automaton, patterns=patterns, haystacks=haystacks)
    assert len(ret) != 0
Here are the results.
------------------------------------------------------------------------------------------- benchmark: 5 tests -------------------------------------------------------------------------------------------
Name (time in ms)                                    Min               Max              Mean              StdDev           Median            IQR              Outliers  OPS             Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_match_daachorse_benchmark                       49.9756 (1.0)     66.1039 (1.0)    54.0174 (1.0)     5.1499 (7.58)    51.5605 (1.0)     5.7571 (15.38)   3;1       18.5126 (1.0)   16      1
test_match_ahocorasick_rs_benchmark                  66.9054 (1.34)    68.9367 (1.04)   67.4963 (1.25)    0.6794 (1.0)     67.3037 (1.31)    0.3743 (1.0)     2;2       14.8156 (0.80)  12      1
test_match_daachorse_with_build_automaton_benchmark  112.0056 (2.24)   144.7642 (2.19)  127.7381 (2.36)   13.8674 (20.41)  133.4831 (2.59)   25.1129 (67.09)  3;0       7.8285 (0.42)   7       1
test_match_ahocorasick_benchmark                     497.1756 (9.95)   501.2648 (7.58)  499.6143 (9.25)   1.8371 (2.70)    500.4676 (9.71)   3.1697 (8.47)    1;0       2.0015 (0.11)   5       1
test_match_ahocorapy_benchmark                       650.2134 (13.01)  652.2055 (9.87)  651.1910 (12.06)  0.8694 (1.28)    651.2392 (12.63)  1.5757 (4.21)    2;0       1.5356 (0.08)   5       1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
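Even with automaton construction included, it is more than 3 times faster than our current pyahocorasick setup with construction skipped! Excellent! Given this result, rebuilding the automaton every time a process starts still more than pays for itself, so we decided to adopt Daachorse.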
Summary
In this post I described how we used Daachorse to speed up pattern matching that previously relied on pyahocorasick. Depending on the data, other modules may be faster, so before adopting it you should run a benchmark as I did here. Personally, I learned a great deal from this investigation about calling Rust from Python with PyO3 and about unsafe Rust.
We are Hiring!
M3 is looking for members who want to contribute to healthcare through string processing and natural language processing. If you are thinking "I might like to hear a bit more," get in touch from here!
Other
The cover image is from the British Library on Unsplash. Thank you.