I'm Nakamura (po3rin), a software engineer on the AI and Machine Learning team in the M3 Engineering Group. I like search and Go.
In this post I read through the code of Lucene's More like this (MLT) feature to understand how MLT is implemented, and describe how that helped us solve an MLT performance problem we had at M3.
- What's MLT
- MLT use case and performance problem
- Speedup point 1: document-specified or ID-specified
- Speedup point 2: number of fields and text length
- Speedup point 3: the max_query_terms setting
- Results
- Summary
What's MLT
Put simply, MLT is a feature of Elasticsearch and Lucene that runs morphological analysis on an input document and then searches for documents using the terms with the highest TF-IDF scores.
Lucene provides this as More like this (MLT), and it can be used for tasks such as finding similar documents.
MoreLikeThis (Lucene 9.1.0 queries API)
Lucene's MLT is also accessible from Elasticsearch, where it is available as the More like this query.
More like this query | Elasticsearch Guide [8.1] | Elastic
At M3 we use Elasticsearch as our search platform, and we adopted MLT as a first step toward similar-document search because its development cost is low.
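To make the term-selection idea concrete, here is a minimal sketch in plain Java. It is not Lucene's code: the class `TfIdfSketch`, its method names, the whitespace tokenizer standing in for an Analyzer, and the classic `log((N + 1) / (df + 1)) + 1` form of IDF are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of MLT term selection; hypothetical names, not Lucene's implementation. */
final class TfIdfSketch {
    /** Classic IDF form (assumed here): rarer terms get a larger weight. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((numDocs + 1) / (double) (docFreq + 1)) + 1.0;
    }

    /** Counts each whitespace-separated token; a stand-in for a real Analyzer. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Returns the terms of the input text ordered by descending tf * idf. */
    static List<String> rankTerms(String text, Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(text).entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            scores.put(e.getKey(), e.getValue() * idf(df, numDocs));
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b); // ties broken alphabetically
        });
        return terms;
    }
}
```

In this toy model a very common word ends up below rarer words even when it occurs more often in the input, which is why MLT tends to search with the distinctive terms of a document.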
Here is a brief look at how to use Lucene's MLT. The Lucene documentation includes a short usage example, excerpted below. (The excerpt comes from the Javadoc and uses older API names such as `Hits`.)

    IndexReader ir = ...
    IndexSearcher is = ...

    MoreLikeThis mlt = new MoreLikeThis(ir);
    Reader target = ... // orig source of doc you want to find similarities to
    Query query = mlt.like(target);

    Hits hits = is.search(query);
    // now the usual iteration thru 'hits' - the only thing to watch for is to make sure
    // you ignore the doc if it matches your 'target' document, as it should be similar to itself
The `MoreLikeThis` class is responsible mainly for generating the search query. The search itself is decoupled from MLT and goes through `IndexSearcher`, the search interface. An `IndexReader` is passed to `MoreLikeThis` so that it can fetch the TF and IDF of terms from the index.
Roughly speaking, `IndexSearcher` is the class for searching and `IndexReader` is the class for reading the index. See an article I wrote earlier for details.
MLT use case and performance problem
One of our products generates content-based recommendation newsletters with MLT. It performed better than simple collaborative-filtering algorithms.
However, MLT was slow enough that the MLT processing for all target users did not finish before the newsletter delivery time. So we read the MLT code to find out where the time goes, and then made it faster.
This post covers the following three points for improving MLT performance that we learned from reading the code.
- Document-specified or ID-specified
- Number of fields and text length
- max_query_terms
Speedup point 1: document-specified or ID-specified
Elasticsearch's MLT can be used in two ways: specify the ID of a document that is already indexed, or pass the document directly as a string. Lucene likewise has two `like` methods.
    public Query like(String fieldName, Reader... readers) throws IOException {
        // ...
    }

    public Query like(int docNum) throws IOException {
        // ...
    }
The document-specified variant looks like the code below. `addTermFrequencies` runs the configured Analyzer over the document (morphological analysis, for Japanese text) and collects the terms and their frequencies. `createQueue` then scores the terms by TF-IDF and stores them, highest score first, up to the maximum number of terms used for the search. Finally, `createQuery` joins the queued terms as SHOULD clauses to build the search query.
    public Query like(String fieldName, Reader... readers) throws IOException {
        Map<String, Map<String, Int>> perFieldTermFrequencies = new HashMap<>();
        for (Reader r : readers) {
            addTermFrequencies(r, perFieldTermFrequencies, fieldName);
        }
        return createQuery(createQueue(perFieldTermFrequencies));
    }
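As a rough mental model of what "joined with SHOULD" means for the generated query, here is a small plain-Java sketch. The class `ShouldQuerySketch` and its method are hypothetical and only mimic the matching semantics; the real query is built from Lucene query objects and scored by the search engine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of SHOULD semantics; hypothetical names, not Lucene's API. */
final class ShouldQuerySketch {
    /**
     * Returns how many of the query terms occur in the document.
     * With SHOULD clauses a document matches if at least one term occurs,
     * and documents matching more terms tend to rank higher.
     * Query terms are assumed to be lower-case already.
     */
    static int matchedClauses(List<String> queryTerms, String document) {
        Set<String> docTerms =
                new HashSet<>(Arrays.asList(document.toLowerCase().split("\\s+")));
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return matched;
    }
}
```

Every extra SHOULD clause is one more term the engine has to look up, which is the intuition behind the term limit discussed later in this post.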
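In the ID-specified `like` method below, on the other hand, the thing to note is the member variable `ir`, which is an `IndexReader`. In other words, MLT runs on term information that has already been indexed and is fetched through the `IndexReader`. The ID-specified approach can therefore skip morphological analysis, and the term frequencies do not have to be recounted from raw text either. (In Lucene this applies when term vectors are stored for the field; without them, `retrieveTerms` falls back to re-analyzing the stored field value.)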
    public Query like(int docNum) throws IOException {
        if (fieldNames == null) {
            Collection<String> fields = FieldInfos.getIndexedFields(ir);
            fieldNames = fields.toArray(new String[fields.size()]);
        }
        return createQuery(retrieveTerms(docNum));
    }
In short, the document-specified approach has to run morphological analysis, so the ID-specified approach is far faster.
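The difference between the two paths can be sketched with a toy index in plain Java. Everything here is hypothetical (`LikePathsSketch`, its methods, the whitespace "analyzer", and the `analyzeCalls` counter); it only illustrates that the ID path reuses work done at indexing time while the document path repeats it on every request.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy index contrasting the two lookup paths; hypothetical names, not Lucene's API. */
final class LikePathsSketch {
    // Term frequencies computed once at indexing time, keyed by document ID.
    private final Map<Integer, Map<String, Integer>> storedTermFreqs = new HashMap<>();

    // Counts how often the (expensive) analysis step runs.
    int analyzeCalls = 0;

    /** Stand-in for an Analyzer: lower-cases and splits on whitespace. */
    Map<String, Integer> analyze(String text) {
        analyzeCalls++;
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Indexing pays the analysis cost once and stores the result. */
    void index(int docId, String text) {
        storedTermFreqs.put(docId, analyze(text));
    }

    /** Document-specified path: the raw text is analyzed on every request. */
    Map<String, Integer> termsForText(String text) {
        return analyze(text);
    }

    /** ID-specified path: stored term frequencies are reused, with no analysis. */
    Map<String, Integer> termsForId(int docId) {
        return storedTermFreqs.getOrDefault(docId, new HashMap<>());
    }
}
```

Both paths return the same term frequencies for the same text; only the document-specified path increments `analyzeCalls` at query time.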
We used to take the document-specified approach. Elasticsearch is synchronized with our master database in batches, so with the ID-specified approach we cannot use browsing logs for the newest documents that are not indexed yet. However, failing to generate newsletters for all users was the bigger drawback, so we accepted losing the browsing logs for the newest documents and switched to ID-specified MLT.
If you want to dig further into the code, read lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java in the repository.
Speedup point 2: number of fields and text length
MLT works field by field (title, body, and so on), so naturally the fewer fields MLT uses, the faster it runs. Looking inside `retrieveTerms`, which the ID-specified `like` calls, you can see that it stores term information per field.
    private PriorityQueue<ScoreTerm> retrieveTerms(int docNum) throws IOException {
        Map<String, Map<String, Int>> field2termFreqMap = new HashMap<>();
        for (String fieldName : fieldNames) {
            // store the term information for each field in field2termFreqMap
        }
        return createQueue(field2termFreqMap);
    }
In the end, terms are chosen from this map in descending TF-IDF order and joined with SHOULD, so the longer a field's text is, the more terms the loop has to process and the longer it takes.
We previously included an overly long text field among the MLT fields, so we sped things up by removing that field from MLT.
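A back-of-the-envelope way to see this effect is to count how many (field, term) candidates have to be scored. The sketch below is plain Java with hypothetical names (`FieldCostSketch`, `termFreqsPerField`, `candidateTerms`) and a whitespace tokenizer standing in for the Analyzer; it is a cost model under those assumptions, not a measurement of Lucene.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy cost model; hypothetical names, not Lucene's API. */
final class FieldCostSketch {
    /** Builds field -> (term -> frequency), splitting on whitespace as a stand-in Analyzer. */
    static Map<String, Map<String, Integer>> termFreqsPerField(Map<String, String> fieldTexts) {
        Map<String, Map<String, Integer>> perField = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fieldTexts.entrySet()) {
            Map<String, Integer> termFreqs = new HashMap<>();
            for (String token : field.getValue().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    termFreqs.merge(token, 1, Integer::sum);
                }
            }
            perField.put(field.getKey(), termFreqs);
        }
        return perField;
    }

    /** Number of (field, term) pairs that each need a score before the top terms are picked. */
    static int candidateTerms(Map<String, Map<String, Integer>> perField) {
        int total = 0;
        for (Map<String, Integer> termFreqs : perField.values()) {
            total += termFreqs.size();
        }
        return total;
    }
}
```

For example, a two-word title plus a body with five distinct words gives seven candidates, and dropping the body field leaves two. A long free-text field dominates the count in the same way.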
Speedup point 3: the max_query_terms setting
MLT has a setting called `max_query_terms` (`maxQueryTerms` in the Lucene code). It is the maximum number of terms joined with SHOULD, and it should be set to a small value for search performance.
The setting takes effect at the queue-building stage: the queue is created with a length bounded by `max_query_terms`.
    private PriorityQueue<ScoreTerm> createQueue(
            Map<String, Map<String, Int>> perFieldTermFrequencies) throws IOException {
        // have collected all words in doc and their freqs
        final int limit = Math.min(maxQueryTerms, this.getTermsCount(perFieldTermFrequencies));
        FreqQ queue = new FreqQ(limit); // will order words by score
        for (Map.Entry<String, Map<String, Int>> entry : perFieldTermFrequencies.entrySet()) {
            // ...
        }
        return queue;
    }
The queued terms are joined with SHOULD to build the query, so the fewer there are, the lighter the query.
We had been using the default `max_query_terms` of 25, but after confirming that `max_query_terms=15` still gave acceptable accuracy, we changed the value and the speed improved substantially.
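The bounded queue can be sketched with `java.util.PriorityQueue`. This is a simplified stand-in for the queue in the excerpt above, under my own assumptions: the class `TopTermsSketch` and the method `topTerms` are hypothetical, and the scores are passed in precomputed rather than derived from an index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Toy bounded top-N selection; hypothetical names, not Lucene's FreqQ. */
final class TopTermsSketch {
    /** Keeps only the maxQueryTerms highest-scoring terms and returns them best first. */
    static List<String> topTerms(Map<String, Double> scores, int maxQueryTerms) {
        int limit = Math.min(maxQueryTerms, scores.size());
        List<String> result = new ArrayList<>();
        if (limit <= 0) {
            return result;
        }
        // Min-heap on score: the weakest kept term sits at the head and is evicted first.
        PriorityQueue<Map.Entry<String, Double>> queue =
                new PriorityQueue<>(limit, (a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (queue.size() < limit) {
                queue.add(entry);
            } else if (entry.getValue() > queue.peek().getValue()) {
                queue.poll();
                queue.add(entry);
            }
        }
        while (!queue.isEmpty()) {
            result.add(queue.poll().getKey());
        }
        Collections.reverse(result); // polled weakest first, so reverse to best first
        return result;
    }
}
```

Lowering the limit from 25 to 15 in this model simply means ten fewer terms survive into the generated query, at the cost of dropping the lowest-scoring ones.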
Results
To summarize, the points for improving speed that we found through code reading are as follows.
- The document-specified approach runs morphological analysis, so the ID-specified approach is far faster.
- Analysis and search are done per field, so fewer fields is naturally better.
- As many terms as `max_query_terms` are joined with SHOULD, so it should be set to a small value for search performance.
As a result, the time needed to generate newsletters with MLT was cut in half, and we also reduced CPU usage.
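Summary
This time we looked at the internal implementation of More like this. Reading the code really does make a feature easier to understand. MLT is easy to try, so I recommend it as a first implementation. However, we are already seeing its limits in both accuracy and speed, so we are considering moving to a different recommendation algorithm. Actually, we want to switch, so if you want to work on recommendation, please come join us (lol).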
We're hiring !!!
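M3 is hiring engineers who want to advance healthcare by building and improving our search and recommendation platform! Search and recommendation are actively discussed in-house every day, and a reading group for information and recommendation papers meets every other week.
If you're thinking "I'd like to hear a bit more," start here! jobs.m3.com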