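Improving search functionality with Whoosh
Introduction
I previously built a similar-book search system. One part of it lets the user find the book they want among tens of thousands of candidates. At the time I implemented the simplest possible approach: matching the input word against every registered book. I wondered whether there was a better way, read a few resources, and made some improvements. This article records that process.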
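References
- 図書館情報学オタクと学ぶ 検索エンジニア入門 (an introduction to search engineering, learned with a library and information science enthusiast)
- Materials from the 検索技術勉強会 (Search Technology Study Group)
- Whoosh official documentation
- Sudachi official documentation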
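Problems with the current implementation
- Multiple words cannot be entered, so OR and NOT searches are not possible.
- Every registered book is scanned on each search, so search time grows linearly with the number of books.
- Matching results are returned as a list with no prioritization.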
Solution
I use Whoosh, a full-text search package for Python. Commercial systems apparently tend to use products such as Elasticsearch, which also provide server functionality, but I wanted to deploy on Heroku, so I chose Whoosh. Incidentally, Elasticsearch seemed to have the richer documentation.
The simplest way to search is to check the query string against every search target in turn, but the time this takes grows with the number of targets. One way to avoid that is to keep an index keyed by word. For example, for the string 「犬も棒に当たる」, we record in advance which search targets contain each of the words 「犬」, 「棒」, and 「当たる」. The number of entries that must be scanned is then bounded by the number of unique words across all targets. Comparing the results for several words also makes AND and OR searches possible. To do this, the text must first be split into words by morphological analysis.
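The word-level index described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Whoosh stores its index: the document ids, the pre-tokenized word lists, and the function names are all assumptions made for the example, and the words are hard-coded where the article would use morphological analysis.

```python
from collections import defaultdict

# Hypothetical pre-tokenized documents: doc id -> list of words.
# In the article these words would come from morphological analysis.
docs = {
    0: ["犬", "棒", "当たる"],
    1: ["犬", "散歩"],
    2: ["猫", "棒"],
}


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def search_and(index, words):
    """Documents that contain every one of the words."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()


def search_or(index, words):
    """Documents that contain at least one of the words."""
    postings = [index.get(w, set()) for w in words]
    return set.union(*postings) if postings else set()


def search_not(index, include, exclude):
    """Documents that contain `include` but not `exclude`."""
    return index.get(include, set()) - index.get(exclude, set())


index = build_inverted_index(docs)
```

With this structure, `search_and(index, ["犬", "棒"])` only touches the two posting sets for those words instead of scanning every document, and AND, OR, and NOT reduce to set intersection, union, and difference.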
In fact, the dataset used in this experiment contains just under 10,000 documents, which is small enough that a full scan is not noticeably slow. The practical benefits of using Whoosh here are therefore support for AND and OR queries and the ability to take ranking into account.
Implementation notes
The search part ended up as shown here. For detailed explanations of each feature, the official documentation is the most reliable source.
```python
import os

from sudachipy import dictionary, tokenizer
from whoosh.analysis import StandardAnalyzer
from whoosh.fields import NUMERIC, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Defined elsewhere in the application:
#   bookdict_all_sort_upper10: sequence whose entries hold (title, count, ...)
#   remove_words: set of tokens to exclude from the index
#   searchword_parsed: the query string handed to the parser
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

# Define the index schema
schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True, analyzer=StandardAnalyzer(stoplist=None)),
    count=NUMERIC(stored=True, sortable=True),
)
if not os.path.exists("heroku_index"):
    os.mkdir("heroku_index")
ix = create_in("heroku_index", schema)

# Build the index
writer = ix.writer()
for entry in bookdict_all_sort_upper10:
    title, count = entry[0], entry[1]
    titlewords = set()
    for m in tokenizer_obj.tokenize(title, mode):
        titlewords.add(m.surface())          # surface form
        titlewords.add(m.normalized_form())  # normalized form
        titlewords.add(m.dictionary_form())  # dictionary form
    writer.add_document(title=title, content=" ".join(titlewords - remove_words), count=count)
writer.commit()

# Search
# Assumption: qp is not shown in the original; a parser over "content" is assumed.
qp = QueryParser("content", ix.schema)
outarr = []
q = qp.parse(searchword_parsed)
with ix.searcher() as s:  # hits must be read inside the with block
    # Sort by how often the book appears, highest first
    results = s.search(q, limit=999, sortedby="count", reverse=True)
    for hit in results:
        outarr.append(hit["title"])  # store the matched title
```
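Implementation notes
Morphological analysis
- Sudachi is used.
- Sudachi offers three split modes for morphological analysis; SplitMode.C is used because it can extract proper nouns as single tokens.
- To cope with orthographic variants and similar words, the results of three forms are combined: the surface form, the dictionary form, and the normalized form.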
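To make the effect of combining the three forms concrete, here is a small sketch that does not call Sudachi at all. The `Morpheme` class is a hypothetical stand-in (the real Sudachi morpheme exposes `surface()`, `dictionary_form()`, and `normalized_form()` as methods), `index_terms` is an illustrative helper name, and the morpheme values are written by hand for the example rather than taken from actual tokenizer output.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    """Hypothetical stand-in for a Sudachi morpheme."""
    surface: str
    dictionary_form: str
    normalized_form: str


def index_terms(morphemes, remove_words=frozenset()):
    """Union of the three forms of every morpheme, minus excluded words."""
    terms = set()
    for m in morphemes:
        terms.update((m.surface, m.dictionary_form, m.normalized_form))
    return terms - set(remove_words)


# Hand-written values for illustration, not actual tokenizer output.
title_morphemes = [
    Morpheme("附属", "附属", "付属"),
    Morpheme("の", "の", "の"),
    Morpheme("図書館", "図書館", "図書館"),
]
terms = index_terms(title_morphemes, remove_words={"の"})
```

Because both 「附属」 and its normalized spelling 「付属」 end up in the indexed terms, a query written with either spelling can match the same title.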
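Full-text search engine
- Whoosh is used.
- To retrieve a stored value later, the field must be declared with stored=True when the schema is created.
- Whoosh was originally designed around English, and by default single-character tokens left after morphological analysis are dropped as stop words. To enable single-character searches, the content field must be declared with analyzer=StandardAnalyzer(stoplist=None) when the schema is created.
- Values must be read from the search results inside the with block.
- By default Whoosh ranks results with BM25 (more precisely BM25F), an algorithm similar to TF-IDF. Here I wanted to order results by how often a book appears rather than by similarity to the search words, so the sortedby argument of the search is set to the appearance count.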
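The difference between relevance ranking and sorting by count can be illustrated with a textbook BM25 scorer in plain Python. This is a sketch under stated assumptions, not Whoosh's BM25F implementation: the `bm25_scores` function, the parameter values `k1=1.2` and `b=0.75`, and the sample books with their counts are all made up for the example.

```python
import math


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc.count(term)  # term frequency in this document
            if tf == 0:
                continue
            df = sum(1 for d in docs if term in d)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical books: (tokenized title, appearance count)
books = [
    (["python", "search", "engine", "basics"], 50),
    (["python"], 2),
    (["cooking"], 10),
]
scores = bm25_scores(["python"], [title for title, _ in books])
matching = [i for i, s in enumerate(scores) if s > 0]

# Relevance order favors the short title that consists only of the query word.
by_relevance = sorted(matching, key=lambda i: scores[i], reverse=True)
# Count order favors the book that appears most often, as in the article.
by_count = sorted(matching, key=lambda i: books[i][1], reverse=True)
```

In this toy data the two orderings disagree: BM25 puts the one-word title first because of length normalization, while sorting by count puts the frequently appearing book first, which is the behavior chosen in this article.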
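Results
- Demo site: https://bookrecommendst.herokuapp.com/
Example of an OR search
Example of a NOT search
With this update, space-separated multi-word input, OR searches, and NOT searches are now supported. Results are also sorted by the number of times a book has been registered, so the book you are looking for should be more likely to appear near the top.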
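Summary
I used Whoosh to improve the site's search functionality.
Looking at the search suggestions on various sites, I noticed mechanisms that return results close to what the user is looking for even though they are not directly related to the words entered. I would like to look into that as well when I have time.
The code used is available here (code only).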