Only 7 days remain until M3 Advent Calendar 2020! Ahead of the main Advent Calendar, new-graduate members in their first and second years are writing posts.
I'm Maruo, a first-year new graduate on the AI and Machine Learning team in M3's Engineering Group. I also interned at M3 in the summer of the year before last and wrote an article about that experience, so please take a look at that as well!
Introduction
I mainly work on search projects built on Elasticsearch. Feeling that my understanding of Elasticsearch was still far from sufficient, this month I got the other project members involved and started a study group for reading the Elasticsearch/Lucene source code. The goal is for each of us to send a Pull Request to Elasticsearch or Lucene.
This article summarizes what I have presented at the study group so far, organized around the following two points.
- Trying out the Lucene index API through demo code
- Looking inside the Query class, which represents queries, to see how the Lucene index API is used
Creating and reading an index
First, we create an index with the Lucene library and then run a search against it. Going one step further, we then work with a lower-level API to check whether a particular word exists and to list the positions where a word occurs.
Creating an index
First things first: I tried building a Lucene index. Using the IndexWriter class, let's read in some arbitrary text and create an index as in the following code.
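    import java.io.File;
    import java.io.FileReader;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory directory = FSDirectory.open(Paths.get("./data/index"));
    StandardAnalyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    var writer = new IndexWriter(directory, config);
    File[] files = new File("./data/doc").listFiles();
    for (File file : files) {
        if (!file.isDirectory() && file.exists() && file.canRead()) {
            // One document per file; the file content goes into the "body" field.
            Document document = new Document();
            TextField contentField = new TextField("body", new FileReader(file));
            document.add(contentField);
            writer.addDocument(document);
        }
    }
    writer.numRamDocs(); // documents still buffered in RAM (return value unused here)
    writer.close();      // flushes buffered documents and commits the index to disk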
New files are created in the specified directory (./data/index in the example above). These files are the actual on-disk form of a Lucene index.
Searching the index
Next, let's run a simple query against the index we created. DirectoryReader, the class for reading an index stored in a directory, loads the index, and passing a query to the IndexSearcher class executes the search.
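    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Term term = new Term("body", "you"); // lowercase: TermQuery is not analyzed, StandardAnalyzer lowercases at index time
    TermQuery q = new TermQuery(term);
    Directory indexDirectory = FSDirectory.open(Paths.get("./data/index"));
    DirectoryReader reader = DirectoryReader.open(indexDirectory);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(q, 10); // top 10 hits
    for (ScoreDoc scoreDoc : docs.scoreDocs) {
        System.out.println(scoreDoc);
    }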
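The example above searches the body field of the index for the token given in the Term. Note that TermQuery does not analyze its input, while StandardAnalyzer lowercases tokens at index time, so the term has to be lowercase (you rather than You) to match. This is nearly all you need to use Lucene as a library, but let's dig deeper into its internals.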
Checking whether a word exists
First, let's check whether a particular token exists in a particular field of the index. The demo code is as follows.
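    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class Demo {
        public static void main(String[] args) throws IOException {
            var d = new Demo();
            d.scanTerm("people");
            d.scanTerm("unknown");
        }

        public void scanTerm(String text) throws IOException {
            System.out.println("-------");
            System.out.println(text);
            Term term = new Term("body", text);
            try (Directory indexDirectory = FSDirectory.open(Paths.get("./data/index"));
                 DirectoryReader reader = DirectoryReader.open(indexDirectory)) {
                for (LeafReaderContext leaf : reader.leaves()) {
                    LeafReader leafReader = leaf.reader(); // SegmentReader
                    Terms terms = leafReader.terms(term.field()); // null if the segment lacks the field
                    TermsEnum termsEnum = (terms == null) ? null : terms.iterator();
                    if (termsEnum != null && termsEnum.seekExact(term.bytes())) {
                        System.out.println("found");
                        System.out.println(termsEnum.termState()); // BlockTermState
                    } else {
                        System.out.println("not found");
                    }
                }
            }
        }
    }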
Standard output:
    -------
    people
    found
    docFreq=1 totalTermFreq=5 termBlockOrd=8 blockFP=0 docStartFP=62 posStartFP=1474 payStartFP=0 lastPosBlockOffset=-1 singletonDocID=0
    -------
    unknown
    not found
An index is divided into units called segments, and reader.leaves() returns the list of segments. For a single segment, terms(<field name>).iterator() returns an iterator over the tokens of the given field. The seekExact method then tells you whether a particular keyword exists in that field of the segment.
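To make the relationship between segments and per-field term lookups concrete, here is a minimal toy model in plain Java. It is a sketch, not Lucene's implementation: it assumes each segment simply keeps its terms in a sorted set, and the names ToySegmentedIndex, ToySegment, and segmentsContaining are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of an index split into segments, each with its own sorted term dictionary. */
class ToySegmentedIndex {

    /** One segment: the sorted set of terms that occur in a single field. */
    static final class ToySegment {
        private final TreeSet<String> terms = new TreeSet<>();

        ToySegment(String... tokens) {
            for (String token : tokens) {
                terms.add(token);
            }
        }

        /** Rough analogue of TermsEnum.seekExact: true if the exact term exists. */
        boolean seekExact(String term) {
            return terms.contains(term);
        }
    }

    private final List<ToySegment> segments = new ArrayList<>();

    void addSegment(ToySegment segment) {
        segments.add(segment);
    }

    /** Checks every segment in turn, much like looping over reader.leaves(). */
    List<Integer> segmentsContaining(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).seekExact(term)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex();
        index.addSegment(new ToySegment("hello", "people", "world"));
        index.addSegment(new ToySegment("another", "world"));
        System.out.println(index.segmentsContaining("people"));  // [0]
        System.out.println(index.segmentsContaining("world"));   // [0, 1]
        System.out.println(index.segmentsContaining("unknown")); // []
    }
}
```

The point of the sketch is only that a term lookup is answered segment by segment, which is why the demo prints one result per leaf.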
Listing where a word occurs
Next, let's list the positions of a word within a document. The demo code is as follows.
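    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class Demo {
        public static void main(String[] args) throws IOException {
            var demo = new Demo();
            demo.postings("world");
        }

        IndexReader reader;
        IndexSearcher searcher;

        Demo() throws IOException {
            reader = DirectoryReader.open(FSDirectory.open(Paths.get("./data/index")));
            searcher = new IndexSearcher(reader);
        }

        void postings(String text) throws IOException {
            var t = new Term("body", text);
            for (LeafReaderContext leaf : reader.leaves()) {
                var it = leaf.reader().terms("body").iterator();
                System.out.println(t.text());
                boolean found = it.seekExact(t.bytes());
                System.out.println(found);
                if (!found) {
                    continue; // the enum is not positioned on a term, so skip postings()
                }
                var p = it.postings(null, PostingsEnum.ALL);
                p.nextDoc(); // inspect only the first matching document
                for (int i = 0; i < p.freq(); i++) {
                    System.out.println("pos: " + p.nextPosition());
                }
            }
        }
    }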
Standard output:
    world
    true
    pos: 16
    pos: 214
    pos: 240
    pos: 326
    pos: 846
    pos: 904
    pos: 964
    pos: 1386
As before, terms(<field name>).iterator() returns an iterator over the tokens of the given field, and the postings method then returns an iterator over the position information.
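The idea of a postings list with positions can be sketched without Lucene. The toy below assumes a single document tokenized on whitespace and lowercased, which is far simpler than a real analyzer; ToyPostings, index, and positions are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional index for a single document: term -> positions of that term. */
class ToyPostings {

    /** Splits on whitespace, lowercases, and records the position of every token. */
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            postings.computeIfAbsent(tokens[position], k -> new ArrayList<>()).add(position);
        }
        return postings;
    }

    /** Returns the positions of a term, or an empty list if it never occurs. */
    static List<Integer> positions(Map<String, List<Integer>> postings, String term) {
        return postings.getOrDefault(term, List.of());
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = index("hello world and goodbye world");
        List<Integer> hits = positions(postings, "world");
        System.out.println("freq: " + hits.size()); // freq: 2
        for (int position : hits) {
            System.out.println("pos: " + position); // pos: 1, then pos: 4
        }
    }
}
```

As in the demo above, the term frequency is just the number of recorded positions, which is why the loop there runs p.freq() times.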
Reading the Query class
Next, I read through the implementation of the Query class. Internally it uses the index APIs described above.
A Query first produces a Weight when it is given an IndexSearcher; at this point it receives information about the index as a whole. Given the information for a segment, the Weight can then produce a Scorer for that segment. The Scorer is what decides whether a document is a hit and computes its score.
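The hand-off from query to index-wide state to per-segment scoring can be modeled in a few lines of plain Java. This is a toy under stated assumptions, not Lucene's classes: ToyTermQuery, ToyWeight, ToySegment, and scoreSegment are hypothetical names, the per-segment scorer is collapsed into a single method, and the scoring factor is made up for illustration rather than taken from Lucene's similarity.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the Query -> Weight -> Scorer hand-off; not Lucene's real classes. */
class ToyQueryPipeline {

    /** One segment: docId -> tokens of that document. */
    static final class ToySegment {
        final Map<Integer, List<String>> docs;

        ToySegment(Map<Integer, List<String>> docs) {
            this.docs = docs;
        }
    }

    /** Stage 1: the query only describes what to look for. */
    static final class ToyTermQuery {
        final String term;

        ToyTermQuery(String term) {
            this.term = term;
        }

        /** Stage 2: combine the query with statistics from the whole index. */
        ToyWeight createWeight(List<ToySegment> allSegments) {
            int totalDocs = 0;
            int docsWithTerm = 0;
            for (ToySegment segment : allSegments) {
                for (List<String> tokens : segment.docs.values()) {
                    totalDocs++;
                    if (tokens.contains(term)) {
                        docsWithTerm++;
                    }
                }
            }
            // Made-up idf-like factor for illustration; not Lucene's BM25.
            double idf = Math.log(1.0 + (double) totalDocs / (1 + docsWithTerm));
            return new ToyWeight(term, idf);
        }
    }

    /** Holds the index-wide state and scores one segment at a time. */
    static final class ToyWeight {
        final String term;
        final double idf;

        ToyWeight(String term, double idf) {
            this.term = term;
            this.idf = idf;
        }

        /** Stage 3: match and score the documents of a single segment. */
        Map<Integer, Double> scoreSegment(ToySegment segment) {
            Map<Integer, Double> scores = new TreeMap<>();
            for (Map.Entry<Integer, List<String>> doc : segment.docs.entrySet()) {
                int termFreq = Collections.frequency(doc.getValue(), term);
                if (termFreq > 0) {
                    scores.put(doc.getKey(), termFreq * idf);
                }
            }
            return scores;
        }
    }

    public static void main(String[] args) {
        ToySegment seg0 = new ToySegment(Map.of(0, List.of("hello", "world"), 1, List.of("hello")));
        ToySegment seg1 = new ToySegment(Map.of(2, List.of("world", "world")));
        ToyWeight weight = new ToyTermQuery("world").createWeight(List.of(seg0, seg1));
        System.out.println(weight.scoreSegment(seg0).keySet()); // [0]
        System.out.println(weight.scoreSegment(seg1).keySet()); // [2]
    }
}
```

The split mirrors the description above: statistics that need the whole index are computed once, and only the matching and scoring work is repeated for each segment.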
Query itself is an abstract class, so let's look at some concrete subclasses.
TermQuery
TermQuery is the most basic query: it checks whether a particular field contains a particular keyword.
1. createWeight calls TermStates.build.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/lucene/core/src/java/org/apache/lucene/search/TermQuery.java#L194-L206
2. TermStates.build calls loadTermsEnum.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/lucene/core/src/java/org/apache/lucene/index/TermStates.java#L102-L118
3. loadTermsEnum checks whether the keyword exists with termsEnum.seekExact.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/lucene/core/src/java/org/apache/lucene/index/TermStates.java#L120-L129
    private static TermsEnum loadTermsEnum(LeafReaderContext ctx, Term term) throws IOException {
      final Terms terms = ctx.reader().terms(term.field());
      if (terms != null) {
        final TermsEnum termsEnum = terms.iterator();
        if (termsEnum.seekExact(term.bytes())) {
          return termsEnum;
        }
      }
      return null;
    }
As described above, the seekExact method is used to check whether the keyword exists.
PhraseQuery
PhraseQuery takes multiple keywords, such as "the world war", and matches when the keywords "the", "world", and "war" appear in that order.
Following the source code shows that the hit decision is made in the nextMatch method of the PhraseMatcher class.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/lucene/core/src/java/org/apache/lucene/search/ExactPhraseMatcher.java#L117
As the figure illustrates, the postings method lists the positions where each keyword occurs, and the phrase search is carried out by checking whether the keywords appear one after another in order.
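The position check described above can be sketched in plain Java. This is a simplified illustration, not the logic of Lucene's matcher: it assumes the position lists of one document are already materialized as lists, whereas a real implementation would more likely advance postings iterators. ToyPhraseMatcher and matches are hypothetical names.

```java
import java.util.List;

/** Toy exact phrase matching over the per-term position lists of one document. */
class ToyPhraseMatcher {

    /**
     * Returns true if the terms occur at consecutive positions.
     * positionsPerTerm.get(i) holds the positions of the i-th phrase term.
     */
    static boolean matches(List<List<Integer>> positionsPerTerm) {
        if (positionsPerTerm.isEmpty()) {
            return false;
        }
        // Try each occurrence of the first term as a candidate phrase start.
        for (int start : positionsPerTerm.get(0)) {
            boolean allFollow = true;
            for (int i = 1; i < positionsPerTerm.size(); i++) {
                // The i-th term must sit exactly i positions after the start.
                if (!positionsPerTerm.get(i).contains(start + i)) {
                    allFollow = false;
                    break;
                }
            }
            if (allFollow) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "the world war" with the@[0, 7], world@[1, 5], war@[2]
        System.out.println(matches(List.of(List.of(0, 7), List.of(1, 5), List.of(2)))); // true
        // Same terms, but "war" never directly follows "the world".
        System.out.println(matches(List.of(List.of(0, 7), List.of(1, 5), List.of(9)))); // false
    }
}
```

The linear contains call keeps the sketch short; with sorted position lists, a merge-style scan would avoid rescanning each list for every candidate start.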
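Summary
This article pulled together the material from the Elasticsearch study group that our team holds every week. In fact, the functions that check whether a keyword exists and return the list of positions are implemented in a class called Codec, and that class is where efficient search is realized. Digging into that internal implementation next could be interesting.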
We are hiring!
M3 works with a broad technology stack that includes Elasticsearch, AWS, GCP, and Go, and has many engineers with deep expertise in each area. When I joined I knew almost nothing about Elasticsearch, but I was able to catch up quickly over the past six months. If you want to keep absorbing new technologies, and especially if you are looking to join as a new-graduate engineer, please apply!