This is the Day 1 article of the M3 (エムスリー) Advent Calendar 2024.
I'm Nakamura (po3rin), a software engineer on the AI and Machine Learning Team in M3's Engineering Group.
In this post I'll briefly introduce BM42, a new scoring algorithm developed by Qdrant, and talk about how to build it on top of Elasticsearch and how that turned out. I'll also cover an approach I tried for correcting BM42's accuracy by using the morphological analyzer Sudachi for synonym expansion and token correction.
Qdrant's own article is the most detailed introduction to BM42, but this post covers the basics as an introduction as well.
Weaknesses of BM25
BM25 is used in search to score results relevant to a query. Even now that semantic search is widespread, hybrid search combining it with BM25 remains a mainstay.
Let's review the BM25 formula. Given a query Q containing terms q_1, ..., q_N, the BM25 score is as follows.
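In the standard formulation, with $k_1$ and $b$ free parameters and $\mathrm{avgdl}$ the average document length in the collection:

$$
\mathrm{score}(D, Q) = \sum_{i=1}^{N} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$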
As of 2024/12/01, Hatena Blog's TeX formulas seem to render incorrectly in some environments. The article below explains how to fix this.
Here, IDF(q_i) is the inverse document frequency of the term q_i, and f(q_i, D) is that term's frequency in document D. For a thorough introduction, I recommend the book 「情報検索:検索エンジンの実装と評価」 (the Japanese edition of "Information Retrieval: Implementing and Evaluating Search Engines"), which covers this in detail.
Because BM25 has term frequency in its formula, it works well when documents are long enough for important words to appear multiple times. However, when searching over chunked documents, as in RAG, or over short article titles, even important terms often have a term frequency of just 1, and BM25 doesn't work well. In chunked search in particular, chunk lengths barely differ, so in practice the IDF part is often the only component of the BM25 formula doing any work.
So for searching short documents or chunks, we need to reconsider how to measure a term's importance within a document. BM42 was developed to address this.
What is BM42?
BM42 combines IDF with Transformer attention.
Transformer models are trained to predict masked tokens in a document, so the attention matrix expresses how much each token contributed to predicting those masked tokens (the attention weight).
Among these tokens, a Transformer model's [CLS] token is trained to serve as a summary or classification signal for the entire input text, so it represents the document's overall context. Therefore, looking at the [CLS] row of the attention matrix gives us each token's importance with respect to the document's overall context. BM42 uses this token importance in place of BM25's term frequency.
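Concretely, as presented in Qdrant's article, BM42 swaps the term-frequency fraction in BM25 for the [CLS]-row attention weight, so the score takes (roughly) the form:

$$
\mathrm{score}(D, Q) = \sum_{i=1}^{N} \mathrm{IDF}(q_i) \times \mathrm{Attention}(\text{[CLS]}, q_i)
$$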
As an example, let's check token importance against the overall context using Ruri, a language model published by Tsukagoshi-san of Nagoya University (I borrowed heavily from this article for the retokenization part; thank you!).
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cl-nagoya/ruri-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_token_attentions(text) -> dict[str, float]:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    attentions = outputs.attentions[-1][0, :, 0].mean(dim=0)
    #                              ▲   ▲     ▲
    #                              │   │     └─── [CLS] token is the first one
    #                              │   └───────── First item of the batch
    #                              └───────────── Last transformer layer

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    token_attentions = {}
    current_word = ""
    current_weight = 0.0

    # retokenization: merge "##" subwords back into whole words,
    # summing their attention weights
    for token, weight in zip(tokens[1:-1], attentions[1:-1]):
        if token.startswith("##"):
            current_word += token[2:]
            current_weight += weight.item()
            continue
        if current_word:
            token_attentions[current_word] = current_weight
        current_word = token
        current_weight = weight.item()

    if current_word:
        token_attentions[current_word] = current_weight

    return token_attentions

result = get_token_attentions("qdrantが開発した新しいランキングアルゴリズムであるBM42を試します。")
for k, v in result.items():
    print(f"{k}: {v:.4f}")
```
In the retokenization phase, just as in Qdrant's article, subwords that were split apart are merged back into whole words. The merged word's attention weight is the sum of its subwords' attention weights. Since attention weights are normalized to sum to 1 over all of a document's tokens, adding them up is valid as a measure of the word's importance within that document.
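As a quick sanity check on that normalization claim, here is a minimal sketch reusing the tokenizer and model loaded above. The [CLS] attention row is a mean of per-head softmax distributions over all key positions, so it should sum to 1 before the special tokens are dropped.

```python
# The averaged [CLS] attention row is a mean of per-head softmax
# distributions, so it sums to (almost exactly) 1.0.
inputs = tokenizer("BM42を試します。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)
cls_row = outputs.attentions[-1][0, :, 0].mean(dim=0)
print(cls_row.sum().item())  # ~1.0; the [CLS]/[SEP] shares are dropped at retokenization
```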
The result comes out like this:
```
ランキング: 0.2108
qdrant: 0.1261
試し: 0.1148
アルゴリズム: 0.1063
42: 0.0750
ます: 0.0523
。: 0.0492
が: 0.0436
BM: 0.0301
ある: 0.0272
新しい: 0.0171
開発: 0.0149
で: 0.0058
し: 0.0056
た: 0.0054
を: 0.0051
```
We compute BM42 using these values as the term importance.
Qdrant's article also explains a comparison with SPLADE, along with the pros and cons, so refer to it if you're interested.
Running BM42 on Elasticsearch
Naturally, BM42 is not implemented in Elasticsearch, so let's consider how to wire it in. Elasticsearch supports sparse vector search, so we'll use that. However, as explained later, unlike Qdrant, Elasticsearch has no direct way to "take the product of the sparse vector search score and the IDF", so some workarounds are needed.
First, we prepare a mapping for sparse vector search.
{ "mappings": { "properties": { "title": { "type": "text", "analyzer": "whitespace" }, "joined_tokens": { "type": "text", "analyzer": "whitespace" }, "tokens": { "type": "sparse_vector" } } } }
We prepare three fields: title, which this time simply stores the raw title string; joined_tokens, which stores the tokenized title joined with spaces; and tokens, the sparse vector. After creating the Elasticsearch index from this JSON, we load the actual data.
```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200/")

texts = [
    "qdrantが開発した新しいランキングアルゴリズムであるBM42を試します。",
    "検索ランキングで使われるBM25とは？",
]

for i, t in enumerate(texts):
    tokens = get_token_attentions(t)
    joined_text = " ".join(tokens.keys())
    doc = {
        "title": t,
        "joined_tokens": joined_text,
        "tokens": tokens,
    }
    resp = client.index(index="test-index", id=i + 1, document=doc)
```
This time I've prepared just two documents so that it's easy to verify the score calculation later. Now we're ready to search. For a single term, the following query performs a BM42 search.
```json
GET /test-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "match": {
              "joined_tokens": "BM"
            }
          },
          "should": [
            {
              "term": {
                "tokens": {
                  "value": "BM"
                }
              }
            }
          ]
        }
      },
      "script": {
        "source": "return _score / _termStats.docFreq().getSum()"
      }
    }
  }
}
```
It's not very pretty, but the query ends up like this because "taking the product of the sparse vector score and the IDF per query term" is a bit difficult in Elasticsearch.

In script_score, you can only access aggregate statistics over the terms used in the query (the average IDF, the total TF, and so on), so we insert a filter to narrow things down to a single term. In fact, if you remove the filter query, the _termStats.docFreq().getSum() part becomes 0.

The script return _score / _termStats.docFreq().getSum() computes the per-term score. _score is the sparse vector search score (for BM42, the attention weight), and _termStats.docFreq().getSum() is the summed document frequency of the terms used in the filter query. Since the filter narrows things down to a single term, _termStats.docFreq().getSum() is exactly that term's document frequency.

The reason the total document count needed by the real IDF doesn't appear in this calculation is that it can't be referenced inside script_score. This deviates from the definition of IDF, but if all we need is to rank scores for search, the total document count is a constant shared by all items and can be ignored.
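For reference, the classic BM25 IDF that we are approximating here is:

$$
\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
$$

where N is the total number of documents and n(q_i) is the document frequency of q_i. The script substitutes the simpler 1 / n(q_i), which keeps the "rarer terms weigh more" behavior without needing N.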
To check the score details, let's send the same body to the _explain endpoint.
```
GET test-index/_explain/1
```
Doing this shows that the score is the product of the sparse vector score and the reciprocal of the document frequency. Both test documents contain the token BM, so the summed document frequency is 2, and 0.030090332 / 2 = 0.015045166.
{ "_index": "test-index", "_id": "1", "matched": true, "explanation": { "value": 0.015045166, "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return _score / _termStats.docFreq().getSum()', options={}, params={}}\"", "details": [ { "value": 0.030090332, "description": "_score: ", "details": [ { "value": 0.030090332, "description": "sum of:", "details": [ { "value": 0.030090332, "description": "Linear function on the tokens field for the BM feature, computed as w * S from:", "details": [ { "value": 1, "description": "w, weight of this function", "details": [] }, { "value": 0.030090332, "description": "S, feature value", "details": [] } ] }, { "value": 0, "description": "match on required clause, product of:", "details": [ { "value": 0, "description": "# clause", "details": [] }, { "value": 1, "description": "joined_tokens:BM", "details": [] } ] } ] } ] } ] } }
For multiple terms, we chain the clauses with should and take the sum, like this:
```json
GET /test-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "script_score": {
            "query": {
              "bool": {
                "filter": {
                  "match": {
                    "joined_tokens": "BM"
                  }
                },
                "should": [
                  {
                    "term": {
                      "tokens": {
                        "value": "BM"
                      }
                    }
                  }
                ]
              }
            },
            "script": {
              "source": "return _score / _termStats.docFreq().getSum()"
            }
          }
        },
        {
          "script_score": {
            "query": {
              "bool": {
                "filter": {
                  "match": {
                    "joined_tokens": "検索"
                  }
                },
                "should": [
                  {
                    "term": {
                      "tokens": {
                        "value": "検索"
                      }
                    }
                  }
                ]
              }
            },
            "script": {
              "source": "return _score / _termStats.docFreq().getSum()"
            }
          }
        }
      ]
    }
  }
}
```
A should query takes the sum of each clause's score, so this gives us the BM42 score calculation.
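Putting it together, what this query computes per document is the following, with df the document frequency:

$$
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \frac{\mathrm{Attention}(\text{[CLS]}, q_i)}{\mathrm{df}(q_i)}
$$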
The problem is that, at present, BM42 can't be reproduced on Elasticsearch without this somewhat roundabout query. Moreover, top-k query optimization doesn't apply, so performance is worse than a regular score calculation. The following article covers top-k query optimization in detail.
I think the optimal solution is to implement a custom plugin, but this time I introduced the simplest implementation. (If I find the time, I'd like to take on that implementation.)
Correction with Sudachi
So far I've shown how to run BM42 on Elasticsearch, but a couple of concerns remain:
- Depending on the model, meaningless tokens are generated
- Spelling variants and synonyms are not absorbed
Depending on the model, meaningless tokens are generated
With Japanese embedding models, medical terms (especially those in M3's domain) get split into fine-grained pieces, so meaningless tokens can be generated. This is particularly noticeable in medical search with the names of Kampo (traditional herbal) medicines. For example, in the document 「半夏厚朴湯と柴胡加竜骨牡蛎湯の併用」 (combined use of hangekobokuto and saikokaryukotsuboreito), the Kampo names get decomposed into tokens.
```python
example_text = "半夏厚朴湯と柴胡加竜骨牡蛎湯の併用"
result = get_token_attentions(example_text)
for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{k}: {v:.4f}")
```
The result is as follows:
```
併用: 0.1649
半夏: 0.1020
湯: 0.0594
厚朴: 0.0523
柴胡: 0.0436
牡蛎: 0.0323
竜骨: 0.0310
加: 0.0213
の: 0.0156
と: 0.0149
```
This makes us want to merge such nouns back together using a dictionary. Qdrant merges tokens with the ## prefix, but when we want to merge tokens without a ## prefix, it seems sensible to keep a dictionary of words we never want split. At M3, search based on Sudachi morphological analysis is already running, so we consider merging tokens using the analysis results from that dictionary. Our use of Sudachi at M3 is covered in a past article.
Below is an example of merging tokens based on Sudachi's morphological analysis results.
```python
from sudachipy import dictionary, tokenizer

# Assumes a Sudachi config at ./sudachi.json; SplitMode.C favors the longest units
mode = tokenizer.Tokenizer.SplitMode.C

def retokenize_with_sudachi(tokens, text):
    """
    Merge the model's tokens based on Sudachi's morphological analysis results
    """
    tokenizer_obj = dictionary.Dictionary(config_path="./sudachi.json", dict_type="core").create()
    sudachi_tokens = [m.surface() for m in tokenizer_obj.tokenize(text, mode)]

    result = {}
    for token in sudachi_tokens:
        result[token] = 0
        for t in tokens:
            if t in token:
                result[token] += float(tokens[t])
    return result
```
Now we can merge tokens using Sudachi's morphological analysis results.
```python
example_text = "半夏厚朴湯と柴胡加竜骨牡蛎湯の併用"
result = get_token_attentions(example_text)
for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{k}: {v:.4f}")

print("----------")

result = retokenize_with_sudachi(result, example_text)
for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{k}: {v:.4f}")
```
The result is as follows:
```
併用: 0.1649
半夏: 0.1020
湯: 0.0594
厚朴: 0.0523
柴胡: 0.0436
牡蛎: 0.0323
竜骨: 0.0310
加: 0.0213
の: 0.0156
と: 0.0149
----------
半夏厚朴湯: 0.2138
柴胡加竜骨牡蛎湯: 0.1876
併用: 0.1649
の: 0.0156
と: 0.0149
```
There's still room for improvement in this implementation, for example it doesn't account for the same token appearing more than once, but we can now merge tokens that must never be split apart. One possible fix for the duplicate-token issue is sketched below.
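As one idea for the duplicate-token issue, here is a minimal sketch (not the implementation used above) that consumes the model's tokens in order instead of matching by substring. It assumes get_token_attentions is changed to return an ordered list of (token, weight) pairs, and that the concatenated model tokens line up with Sudachi's surfaces, which won't hold for unknown tokens. It reuses dictionary and mode from the block above.

```python
def retokenize_with_sudachi_ordered(
    token_weights: list[tuple[str, float]], text: str
) -> dict[str, float]:
    """Merge tokens positionally so a repeated token is credited once per occurrence."""
    tokenizer_obj = dictionary.Dictionary(config_path="./sudachi.json", dict_type="core").create()
    sudachi_tokens = [m.surface() for m in tokenizer_obj.tokenize(text, mode)]

    result: dict[str, float] = {}
    i = 0
    for s_token in sudachi_tokens:
        consumed, weight = "", 0.0
        # Greedily consume model tokens until they cover this Sudachi surface
        while i < len(token_weights) and len(consumed) < len(s_token):
            t, w = token_weights[i]
            consumed += t
            weight += w
            i += 1
        result[s_token] = result.get(s_token, 0.0) + weight
    return result
```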
Spelling variants and synonyms are not absorbed
As for the second point, I suspect many teams manage synonyms by defining them in a synonym dictionary. However, since BM42 uses the model's tokenizer, synonyms are ignored. SPLADE can include contextually related words in the vector, but it may also add meaningless tokens that become noise at search time.
For example, looking at the output of the Japanese SPLADE model published by Yuichi Tateno for 「Qdrantが開発した新しいランキングアルゴリズムであるBM42を試します」, we get the following (I'll include a screenshot of the demo screen).
It appears to work well in places, for instance 「試す」 is expanded to include the token 「挑戦」, but useless tokens such as 「製品」 and 「秋山」 creep in as well.
So we consider expanding tokens with the same score using an existing synonym dictionary.
We store the synonyms in a double-array trie, and when a token gets a hit in the lookup, we add its synonym as a token with the same score. This time I used Pydatrie, a Python implementation of a double-array trie that can be used as simply as a key-value store.
```python
from pydatrie import DoubleArrayTrie

synonyms = DoubleArrayTrie(
    {
        "ばね指": "弾発指",
        "弾発指": "ばね指",
    }
)

def token_expantion(tokens) -> dict[str, float]:
    """
    Add synonyms of tokens
    """
    result = {}
    for k, v in tokens.items():
        result[k] = v
        syn = synonyms.get(k)
        if syn is not None:
            result[syn] = v
    return result

example_text = "ばね指の症状について"
result = get_token_attentions(example_text)
for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{k}: {v:.4f}")

print("----------")

result = retokenize_with_sudachi(result, example_text)
for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{k}: {v:.4f}")

print("----------")

result = token_expantion(result)
for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{k}: {v:.4f}")
```
The result is as follows:
```
ばね指: 0.5203
症状: 0.1462
の: 0.0684
に: 0.0675
つい: 0.0506
----------
ばね指: 0.5203
症状: 0.1462
の: 0.0684
に: 0.0675
つい: 0.0506
て: 0.0000
----------
ばね指: 0.5203
弾発指: 0.5203
症状: 0.1462
の: 0.0684
に: 0.0675
つい: 0.0506
て: 0.0000
```
We've now achieved token expansion using the synonym dictionary we prepared. There seem to be other possible improvements as well, such as removing stopwords and normalizing tokens; a small sketch of stopword removal follows.
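As an illustration of the stopword idea, here is a minimal sketch; the stopword list is a hypothetical example, not one we actually use.

```python
# Hypothetical stopword list for illustration: Japanese particles that
# carry little search value in the examples above.
STOPWORDS = {"の", "に", "と", "て", "を", "が", "で"}

def remove_stopwords(tokens: dict[str, float]) -> dict[str, float]:
    """Drop stopword entries from the token-weight map before indexing."""
    return {k: v for k, v in tokens.items() if k not in STOPWORDS}
```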
Summary
This time, in addition to a way to build BM42 on Elasticsearch, I introduced how to merge tokens with Sudachi and how to expand tokens with synonyms. We got as far as confirming that BM42 can be used in a reasonably plausible form. On the other hand, when accepting a user query we now need morphological analysis and synonym lookups on top of Transformer inference, so there seem to be remaining performance challenges. And since top-k query optimization isn't available for this kind of query, Elasticsearch's own performance is also affected; for real production use there are still plenty of open issues.
Of course, BM42 is ideally used as part of hybrid search, so for production use it would be good to blend the sparse vector results with dense vector search and BM25 results using RRF.
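For illustration, here is a minimal sketch of RRF (Reciprocal Rank Fusion) over ranked ID lists; the document IDs are placeholders.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings r of 1 / (k + rank_r(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))

# e.g. fuse BM42 (sparse), dense vector, and BM25 result lists
fused = rrf([["doc1", "doc2"], ["doc2", "doc3"], ["doc1", "doc3"]])
```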
We are hiring !!
At M3, we're hiring engineers who love search and recommendation! If you're at all interested, please apply for a casual chat via the URL below!