I'm Nakamura (po3rin), a software engineer on the AI and Machine Learning team in the M3 Engineering Group. I like search and Go.
This time we adopted PyTerrier in-house to implement a document-search batch job in Python, so I'll introduce PyTerrier and show how to implement Japanese search with it (this is probably the first article covering PyTerrier in Japanese?).
- What is PyTerrier
- How we use PyTerrier at our company
- Japanese search with PyTerrier
- Caveats with phrase queries
- Summary
What is PyTerrier
PyTerrier is a platform for information retrieval experiments in Python. It uses the Java-based Terrier internally to handle indexing and retrieval operations. Basic query rewriting and various scoring functions such as BM25 are available out of the box, and trained models can be plugged in and evaluated easily, so development and evaluation can be done end to end in one place.
For ECIR 2021, a tutorial was published showing how to run learning-to-rank experiments and more with PyTerrier.
A distinctive feature is that pipelines are built with operators. For example, a pipeline that fetches the top 100 documents with TF-IDF and re-ranks them with BM25 can be implemented declaratively like this:
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
pipeline = (tfidf % 100) >> bm25
Pipelines can also be evaluated right away. For example, the following compares TF-IDF and BM25 on the MAP (Mean Average Precision) metric.
pt.Experiment([tfidf, bm25], topics, qrels, eval_metrics=["map"])
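For reference, topics is a DataFrame of queries and qrels a DataFrame of relevance judgments. A minimal sketch of the shapes PyTerrier expects; the column names follow PyTerrier's conventions, while the concrete qid, docno, and label values below are made up for illustration:

import pandas as pd

# One row per query
topics = pd.DataFrame([
    ["q1", "information retrieval"],
], columns=["qid", "query"])

# One row per relevance judgment for a (query, document) pair
qrels = pd.DataFrame([
    ["q1", "d1", 1],  # label = graded relevance judgment
], columns=["qid", "docno", "label"])

pt.Experiment([tfidf, bm25], topics, qrels, eval_metrics=["map"])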
As you can see, PyTerrier provides an excellent interface as an experimentation environment for information retrieval.
How we use PyTerrier at our company
In-house, we had to implement a task that checks offline whether each term in a list of several hundred thousand terms appears in our articles, and we decided to try PyTerrier for it.
Incidentally, using Elasticsearch, which we rely on daily, was also a candidate, but then tests of the core logic would depend on middleware, and we would have to spin ES up and tear it down for every verification run, so we passed on it this time.
We also speed things up with a different technique when terms are long, but I'd like to cover that in detail in a separate article.
Japanese search with PyTerrier
Japanese search with PyTerrier takes a little extra work. None of the tokenizers PyTerrier ships with do Japanese morphological analysis, so you have to provide your own.
Examples of non-English search with PyTerrier have been published, so please use them as a reference.
Here I'll show how to run morphological analysis with Sudachi and search with PyTerrier. Our company has published an introduction to Sudachi and an article on adding Sudachi to Elasticsearch, so if you're interested in Sudachi, please have a look at those as well.
Let's get started with Japanese search in PyTerrier. We use the modules below. Also, since PyTerrier's core is implemented in Java, make sure a Java environment is available.
import os
import pyterrier as pt
import pandas as pd
from sudachipy import dictionary, tokenizer
Initialize PyTerrier.
if not pt.started():
    pt.init()
Next, prepare the documents to search.
df = pd.DataFrame([
    ["d1", "検索方法の検討"]
], columns=["docno", "text"])
Conveniently, PyTerrier provides an interface for indexing a pandas DataFrame directly. Since both documents and queries need morphological analysis, we prepare a tokenizer for each. For documents, we narrow down the terms to index by part of speech.
class DocTokenizer():
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C

    def tokenize(self, txt: str) -> list[str]:
        # Keep only content words (nouns, verbs, adjectives, adverbs,
        # adjectival nouns) and index their dictionary forms
        return [
            m.dictionary_form()
            for m in self.tokenizer_obj.tokenize(txt, self.mode)
            if len(set(['名詞', '動詞', '形容詞', '副詞', '形状詞']) & set(m.part_of_speech())) != 0
        ]

class TokenizeDoc():
    tokenizer = DocTokenizer()

    def tokenize(self, df: pd.DataFrame):
        # Join the extracted terms with spaces so Terrier can split them later
        df['tokens'] = df['text'].apply(lambda x: ' '.join(self.tokenizer.tokenize(x)))
        return df
Now we can split documents into terms ahead of indexing. Let's tokenize the document DataFrame.
doc_tokenizer = TokenizeDoc()
df = doc_tokenizer.tokenize(df=df)
df
# docno  text            tokens
# d1     検索方法の検討  検索 方法 検討
Now that the documents are ready, let's run the actual indexing. For Japanese we use UTFTokeniser, which works on the space-separated input. Since we already converted the documents into space-separated terms, we can pass them in as-is and indexing is complete.
# blocks=True stores term positions, which phrase queries need
indexer = pt.DFIndexer('./askd-terrier', overwrite=True, blocks=True)
indexer.setProperty('tokeniser', 'UTFTokeniser')
# Disable Terrier's default term pipeline (English stemming and stopword removal)
indexer.setProperty('termpipelines', '')
index_ref = indexer.index(df['tokens'], docno=df['docno'])
index = pt.IndexFactory.of(index_ref)
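To confirm the terms went in as-is, you can peek at the index through the Terrier Java objects that PyTerrier exposes. A quick sanity-check sketch:

# Collection statistics: number of documents, unique terms, postings, etc.
print(index.getCollectionStatistics().toString())

# Enumerate the lexicon; for our single document we expect 検索, 方法, 検討
for entry in index.getLexicon():
    print(entry.getKey(), entry.getValue().getFrequency())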
What remains is query processing. PyTerrier supports a query language that enables And search and phrase search: an And search is written like +term1 +term2, and a phrase search like "term1 term2". For other syntax, see the documentation.
http://terrier.org/docs/v5.1/querylanguage.html
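For instance, both query types can be issued directly against the index we just built. A small sketch (these queries operate on the space-separated terms indexed above):

br = pt.BatchRetrieve(index, wmodel='BM25')

# And search: every '+' term must appear in the document
br.search('+検索 +方法')

# Phrase search: the terms must appear adjacently
# (this needs positional information, i.e. blocks=True at indexing time)
br.search('"検索 方法"')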
This time we'll use phrase search. The implementation below morphologically analyzes a query and expands it into the phrase query language.
class QueryTokenizer():
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C

    def tokenize(self, txt: str) -> list[str]:
        # Unlike the document side, keep every token's surface form
        return [m.surface() for m in self.tokenizer_obj.tokenize(txt, self.mode)]

class PhraseQueryConverter():
    query_tokenizer = QueryTokenizer()

    def convert(self, text: str) -> str:
        tokens = self.query_tokenizer.tokenize(text)
        if len(tokens) <= 1:
            return text
        joined = ' '.join(tokens)
        return f'"{joined}"'
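A quick check of the conversion behavior (a sketch):

converter = PhraseQueryConverter()
print(converter.convert('検索方法'))  # => "検索 方法"  (two tokens become a phrase query)
print(converter.convert('検索'))      # => 検索        (a single token is returned as-is)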
Now that queries can be processed, let's implement the actual retrieval pipeline. Here we build a pipeline that converts the query into a phrase query, scores with BM25, and retrieves the top 100 results.
phrase_query_converter = PhraseQueryConverter()
pipe = (
    pt.apply.query(lambda row: phrase_query_converter.convert(row.query))
    >> (pt.BatchRetrieve(index, wmodel='BM25') % 100).compile()
)
compile() rewrites and optimizes the retrieval pipeline's DAG. For example, without compile, every document matching the query is fetched and scored with BM25 before the top 100 are taken. With compile(), the pipeline is rewritten to use dynamic pruning techniques such as Block Max WAND (also adopted in Lucene), making retrieval considerably faster. This paper describes compile-based optimization in detail.
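As a sketch, the difference is only where compile() is applied; the DAG is the same, but the compiled variant lets PyTerrier rewrite it before execution (pipe_naive and pipe_fast are hypothetical names for illustration):

# Unoptimized: retrieve and score every matching document, then cut to 100
pipe_naive = (
    pt.apply.query(lambda row: phrase_query_converter.convert(row.query))
    >> (pt.BatchRetrieve(index, wmodel='BM25') % 100)
)

# Optimized: compile() rewrites the DAG, e.g. pushing the top-100 cutoff
# into retrieval so that dynamic pruning can kick in
pipe_fast = pipe_naive.compile()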
With that, we're ready to run the search pipeline.
res = pipe.search('検索方法')
res
#    qid  docid docno  rank      score   query_0        query
# 0    1      0    d1     0  -1.584963  検索方法  "検索 方法"
The document ID comes back along with a rank and score. The query_0 column records the original query, and the query column records the query that was actually executed. Since the query is properly rewritten into a phrase query, queries such as 検索検討 do not match:
res = pipe.search('検索検討')
res
# empty...
Caveats with phrase queries
I've raised this in an issue, but I discovered that if a term in a phrase search is not in the index, the search behaves as if that term were ignored.
Concretely, in our example, even a query like the following matches as a phrase query:
res = pipe.search('検索専門')
res
#    qid  docid docno  rank      score   query_0        query
# 0    1      0    d1     0  -1.584963  検索専門  "検索 専門"
As an immediate workaround, we can check the terms in the index's lexicon and, if any term does not exist, submit the original query unchanged, which prevents the spurious hit.
# Updated PhraseQueryConverter: only emit a phrase query when every token is indexed
class PhraseQueryConverter():
    query_tokenizer = QueryTokenizer()

    def convert(self, text: str, lexicon) -> str:
        tokens = self.query_tokenizer.tokenize(text)
        if len(tokens) <= 1:
            return text
        # indexed tokens include query term (bug?: phrase query ignores non-indexed terms)
        for t in tokens:
            if lexicon.getLexiconEntry(t) is None:
                return text
        joined = ' '.join(tokens)
        return f'"{joined}"'

phrase_query_converter = PhraseQueryConverter()
lex = index.getLexicon()
pipe = (
    pt.apply.query(lambda row: phrase_query_converter.convert(row.query, lex))
    >> pt.BatchRetrieve(index, wmodel='BM25').compile()
)
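If the workaround behaves as intended, the problematic query from earlier no longer matches, since 専門 is missing from the lexicon and the query is passed through unconverted:

res = pipe.search('検索専門')
res
# empty...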
Since we needed phrase queries at our company, we're going with this workaround for now. The root cause is still under investigation.
Summary
I've briefly introduced PyTerrier and how to do Japanese search with it. It's handy when you want to get search working quickly in Python. And PyTerrier goes well beyond the lexical search shown here: it also shines for applying information retrieval models and evaluating experiments, so if you're interested, please give it a try. Personally, the ECIR 2021 tutorial was an excellent introduction.
https://github.com/terrier-org/ecir2021tutorial
We're hiring !!!
At M3, we're hiring engineers who want to advance healthcare by developing and improving our search and recommendation infrastructure! Internally, lively discussions about search and recommendation happen every day.
If you'd like to have a casual chat with us, reach out here! jobs.m3.com