I'm Nakamura (@po3rin), a software engineer on the AI & Machine Learning team in the M3 Engineering Group.
At M3 we hold an Elasticsearch and Lucene code reading session every Wednesday. Recently we have been reading Lucene's FST and KD-Tree code, as well as the implementation around the soon-to-be-released NSW.
The other day it was my turn to present, and I talked about the data structure of Lucene's in-memory inverted index, so this post walks through that material. I hope it makes Lucene feel a little more approachable.
- What is Lucene
- Background on inverted indexes
- Background on Lucene
- Inside Lucene's in-memory inverted index implementation
- Summary
What is Lucene
Lucene is the open-source search engine library used inside Elasticsearch. It provides features such as finding specified keywords in a large body of previously stored data, packaged as a Java class library.
If this is your first contact with Lucene, I recommend my earlier blog post.
This post looks at the code of Apache Lucene 8.8.1.
Background on inverted indexes
An inverted index is an index data structure for quickly retrieving data such as document IDs, occurrence positions, and occurrence frequencies for the terms of a query. A term is a unit that makes up a document or a query; in Japanese it is usually a token produced by morphological analysis.
An inverted index has two main components: the posting lists and the dictionary. A posting list stores where a term actually occurred, that is, in which documents and at which positions. The dictionary is the data structure used to look up, from a term, the posting list that records that term's occurrences.
For example, with the dictionary implemented as a hash table and the posting lists as unrolled linked lists, the index looks like the figure below.
I have also published a blog post with a more detailed explanation and a simple implementation in Go, so please take a look if you are interested.
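To make the two components concrete, here is a minimal sketch in Java. It is not Lucene code: the class name TinyInvertedIndex, its methods, and the whitespace tokenizer are all hypothetical, and it uses a plain ArrayList for each posting list instead of an unrolled linked list to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy inverted index (hypothetical, not Lucene): a hash-table dictionary that maps terms to posting lists. */
final class TinyInvertedIndex {
    /** One posting: a document ID plus the positions at which the term occurs in that document. */
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();
        Posting(int docId) { this.docId = docId; }
    }

    // Dictionary: term -> posting list. Postings stay grouped per document
    // as long as each document is added in a single call.
    private final Map<String, List<Posting>> dictionary = new HashMap<>();

    /** Splits on whitespace (a stand-in for a real analyzer) and records every occurrence. */
    void addDocument(int docId, String text) {
        if (text.trim().isEmpty()) {
            return; // nothing to index
        }
        String[] tokens = text.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> postings = dictionary.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last == null || last.docId != docId) {
                last = new Posting(docId); // first occurrence of this term in this document
                postings.add(last);
            }
            last.positions.add(pos);
        }
    }

    /** Returns the IDs of the documents that contain the term. */
    List<Integer> docIds(String term) {
        List<Integer> result = new ArrayList<>();
        for (Posting p : dictionary.getOrDefault(term, Collections.emptyList())) {
            result.add(p.docId);
        }
        return result;
    }

    /** Returns how many times the term occurs in the given document. */
    int termFrequency(String term, int docId) {
        for (Posting p : dictionary.getOrDefault(term, Collections.emptyList())) {
            if (p.docId == docId) {
                return p.positions.size();
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(0, "one field text");
        index.addDocument(1, "field field");
        System.out.println(index.docIds("field"));           // [0, 1]
        System.out.println(index.termFrequency("field", 1)); // 2
    }
}
```

The dictionary lookup is a single hash probe, and everything stored per term lives in the posting list. That is the same split of responsibilities the rest of this post traces through Lucene.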
Background on Lucene
The figure below shows the overall flow of indexing in Lucene.
Lucene first builds an inverted index in memory. Then, periodically or when RAM usage exceeds a threshold, a persistence process called segmentation runs. This post follows the structure of the inverted index in the Memory Buffer.
In Lucene, the IndexWriter class is responsible for writing to the index, and you index a document by calling addDocument.
The final destination of IndexWriter is the in-memory inverted index; when it gets persisted is configured separately through IndexWriterConfig. So to see the structure of the in-memory inverted index, we read through the IndexWriter code.
Reading on from IndexWriter, you come across PerField.invert, the method of the per-field class PerField that stores a document's contents field by field.
    private final class PerField implements Comparable<PerField> {
        // ...
        FieldInvertState invertState;
        TermsHashPerField termsHashPerField;
        // ...
        public void invert(int docID, IndexableField field, boolean first) throws IOException {
            // ...
            try (TokenStream stream = tokenStream = field.tokenStream(analyzer, tokenStream)) {
                stream.reset();
                invertState.setAttributeSource(stream);
                termsHashPerField.start(field, first);
                while (stream.incrementToken()) {
                    try {
                        // Store the token's information in termsHashPerField
                        termsHashPerField.add(invertState.termAttribute.getBytesRef(), docID);
                    } catch (MaxBytesLengthExceededException e) {
                        // ...
                    }
                    // ...
The Analyzer splits the document into terms and produces a TokenStream, whose terms are stored into TermsHashPerField one at a time. The question for this post is therefore how TermsHashPerField builds the inverted index in memory.
Inside Lucene's in-memory inverted index implementation
First, here is a diagram giving an overview of TermsHashPerField.
BytesRefHash is what the inverted-index vocabulary calls the dictionary, and it is implemented as a hash table. The terms coming in from the TokenStream are stored in this hash table. bytesStart holds the position of each posting list. The posting lists are represented by ByteBlockPool, which stores each term's document IDs and occurrence positions.
Let's start with the dictionary. Looking at the implementation, we can see that bytesHash.add stores the term in the dictionary.
    abstract class TermsHashPerField implements Comparable<TermsHashPerField> {
        // ...
        void add(BytesRef termBytes, final int docID) throws IOException {
            assert assertDocId(docID);
            int termID = bytesHash.add(termBytes);
            if (termID >= 0) { // New posting
                initStreamSlices(termID, docID);
            } else {
                termID = positionStreamSlice(termID, docID);
            }
            if (doNextCall) {
                nextPerField.add(postingsArray.textStarts[termID], docID);
            }
        }
        // ...
Next, let's look at add in BytesRefHash.
    public int add(BytesRef bytes) {
        assert bytesStart != null : "Bytesstart is null - not initialized";
        final int length = bytes.length;
        // final position
        final int hashPos = findHash(bytes);
        int e = ids[hashPos];
        if (e == -1) {
            // new entry
            // ...
            e = count++;
            // ...
            return e;
        }
        return -(e + 1);
    }
As the name findHash suggests, the term is hashed and stored in the hash table.
For a new term, the method returns a term ID that is incremented from 0. If the term already exists, it adds 1 to that ID, negates it, and returns -(e + 1). A negative return value is how the caller tells that the term was already present.
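The return convention can be sketched in isolation as follows. This is a simplified stand-in rather than the BytesRefHash implementation: the class TermIdTable, the fixed-capacity linear probing, the use of String.hashCode, and the decode helper are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical dictionary that mimics the return convention of BytesRefHash.add (not Lucene code). */
final class TermIdTable {
    private final String[] terms; // termID -> term
    private final int[] ids;      // hash slot -> termID, or -1 when the slot is empty
    private int count = 0;        // next term ID to hand out

    /** The capacity must be a power of two so that "hash & (capacity - 1)" is a valid slot. */
    TermIdTable(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        terms = new String[capacity];
        ids = new int[capacity];
        Arrays.fill(ids, -1);
    }

    /** Returns a new ID (>= 0) for an unseen term, or -(id + 1) if the term already exists. */
    int add(String term) {
        int slot = findSlot(term);
        int e = ids[slot];
        if (e == -1) {
            if (count >= ids.length / 2) {
                // A real table would grow and rehash here; this sketch simply refuses.
                throw new IllegalStateException("table is full");
            }
            e = count++;
            terms[e] = term;
            ids[slot] = e;
            return e;
        }
        return -(e + 1);
    }

    /** Linear probing: stop at an empty slot or at the slot that already holds this term. */
    private int findSlot(String term) {
        int mask = ids.length - 1;
        int slot = term.hashCode() & mask;
        while (ids[slot] != -1 && !terms[ids[slot]].equals(term)) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    /** Recovers the term ID from either kind of return value. */
    static int decode(int returned) {
        return returned >= 0 ? returned : -returned - 1;
    }

    public static void main(String[] args) {
        TermIdTable table = new TermIdTable(16);
        System.out.println(table.add("field")); // 0
        System.out.println(table.add("field")); // -1
        System.out.println(table.add("text"));  // 1
        System.out.println(table.add("text"));  // -2
    }
}
```

Adding 1 before negating is what lets term ID 0 be reported as "already present": negating 0 alone would still be 0 and would look like a new entry.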
Let's insert a debug print here and see what happens when the stream "field field field" is indexed into an empty Field.
    void add(BytesRef termBytes, final int docID) throws IOException {
        // ...
        int termID = bytesHash.add(termBytes);
        // ...
        System.out.println("add term=" + termBytes.utf8ToString() + " doc=" + docID + "termID=" + termID);
    }
    // ...
With this in place, rewriting the test document and running the test gives the following output.
    add term=field doc=0 termID=0
    add term=field doc=0 termID=-1
    add term=field doc=0 termID=-1
The first "field" is not yet in the dictionary, so termID=0 is returned. The following "field" tokens already exist with ID 0, so termID=-1, that is -(0 + 1), is returned.
At this point we can confirm that the code has the term's bytes, the document ID, and a term ID that also carries the duplicate information.
These are then stored in the posting list. I would like to walk through initStreamSlices, which TermsHashPerField.add calls for a new term (when termID >= 0), but the processing and the code are a little hard to follow. Instead, we will add a debug print and look directly at the inverted index that gets built. If you are interested in the implementation details, see the initStreamSlices method in TermsHashPerField.java.
We insert the following print into initStreamSlices.
    private void initStreamSlices(int termID, int docID) throws IOException {
        // (omitted)
        System.out.println("INFO: " + Arrays.toString(Arrays.copyOfRange(bytePool.buffers[0], 0, 50)));
    }
Storing the document "one field text" provided in DocHelper.java produces the following output.
    INFO: add term=one doc=0termID=0
    INFO: [3, 111, 110, 101, 0, 0, 0, 0, 16, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    INFO: add term=field doc=0termID=1
    INFO: [3, 111, 110, 101, 0, 0, 0, 0, 16, 0, 0, 0, 0, 16, 5, 102, 105, 101, 108, 100, 0, 0, 0, 0, 16, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    INFO: add term=text doc=0termID=2
    INFO: [3, 111, 110, 101, 0, 0, 0, 0, 16, 0, 0, 0, 0, 16, 5, 102, 105, 101, 108, 100, 0, 0, 0, 0, 16, 2, 0, 0, 0, 16, 4, 116, 101, 120, 116, 0, 0, 0, 0, 16, 4, 0, 0, 0, 16, 0, 0, 0, 0, 0]
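The information is stored in the byte array one term at a time. To make this easier to follow, here is the overview diagram of the inverted index again.
The range from one black cell to the next black cell represents one term. Black is the byte length of the term string, and red is the byte representation of the term string. Light blue stores the doc_id and the term's occurrence frequency, and pink stores the term's occurrence positions as a list of deltas (to save memory).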
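The layout can be checked by decoding the last dump printed above by hand. The sketch below is hypothetical helper code, not part of Lucene, and it rests on several assumptions read off the dump: term lengths fit in a single byte, each term is followed by two initial slices of 5 bytes (the first entry of LEVEL_SIZE_ARRAY), and the first byte of the second slice holds the position shifted left by one bit. That last assumption is consistent with the values 0, 2, and 4 seen for positions 0, 1, and 2.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical reader for the 50-byte dump printed above (not Lucene code). */
final class PoolDumpReader {
    // Assumption: every stream starts with a slice of this size (LEVEL_SIZE_ARRAY[0]).
    static final int FIRST_SLICE_SIZE = 5;

    // The final dump from the output above, copied verbatim.
    static final byte[] DUMP = {
        3, 111, 110, 101, 0, 0, 0, 0, 16, 0, 0, 0, 0, 16,
        5, 102, 105, 101, 108, 100, 0, 0, 0, 0, 16, 2, 0, 0, 0, 16,
        4, 116, 101, 120, 116, 0, 0, 0, 0, 16, 4, 0, 0, 0, 16,
        0, 0, 0, 0, 0
    };

    /** Reads the length-prefixed term that starts at the given offset (assumes length < 128). */
    static String termAt(byte[] pool, int start) {
        int length = pool[start];
        return new String(pool, start + 1, length, StandardCharsets.UTF_8);
    }

    /** Reads the first position from the second slice; assumes it was stored as position << 1. */
    static int firstPositionAt(byte[] pool, int start) {
        int length = pool[start];
        int positionSlice = start + 1 + length + FIRST_SLICE_SIZE;
        return pool[positionSlice] >>> 1;
    }

    /** Skips the length byte, the term bytes, and the two initial slices. */
    static int nextTermStart(byte[] pool, int start) {
        return start + 1 + pool[start] + 2 * FIRST_SLICE_SIZE;
    }

    public static void main(String[] args) {
        // A zero length byte marks the unused tail of the dump.
        for (int start = 0; start < DUMP.length && DUMP[start] != 0; start = nextTermStart(DUMP, start)) {
            System.out.println(termAt(DUMP, start) + " pos=" + firstPositionAt(DUMP, start));
        }
        // Prints:
        // one pos=0
        // field pos=1
        // text pos=2
    }
}
```

The terms start at offsets 0, 14, and 30, and the byte 16 that closes each 5-byte slice appears to mark the end of the slice. The walk only works while every stream still fits in its first slice, which is true for this three-term document.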
To reach the posting list for a word, hash the term and then follow ids and bytesStart in BytesRefHash; that yields the position of the corresponding posting list.
When the allocated memory overflows
I will skip the implementation details, but while we are here, let me also explain what happens with "one two one one one one one one".
As the figure below shows, when the memory allocated for the positions of "one" fills up, further positions are written into free space in the next memory buffer. The location that previously held positions now stores the location where the next positions are kept. From this shape we can see that the data structure of Lucene's in-memory inverted index is an unrolled linked list.
How the size of the byte block in use grows is defined in ByteBlockPool.java.
    // in ByteBlockPool.java
    public static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
For the detailed implementation, see the positionStreamSlice method in TermsHashPerField.java and the code beneath it.
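The growth pattern can be illustrated with a simplified unrolled linked list whose nodes follow LEVEL_SIZE_ARRAY. This is only a sketch under stated assumptions, not how ByteBlockPool lays out bytes: the class GrowingSliceList is hypothetical, each slice is a separate int array linked by an object reference instead of a forwarding address written into a shared byte buffer, sizes count elements, not bytes, and the last level is assumed to repeat once it is reached.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical unrolled linked list whose node sizes grow like ByteBlockPool's slice levels. */
final class GrowingSliceList {
    // Slice sizes as quoted from ByteBlockPool.java above.
    static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    /** One node of the list: a fixed-size array plus a link to the next node. */
    private static final class Slice {
        final int[] values;
        int used;
        Slice next;
        Slice(int size) { values = new int[size]; }
    }

    private final Slice head = new Slice(LEVEL_SIZE_ARRAY[0]);
    private Slice tail = head;
    private int level = 0;
    private int sliceCount = 1;

    /** Appends a value, allocating a larger slice and linking it when the current one is full. */
    void add(int value) {
        if (tail.used == tail.values.length) {
            // Assumption: after the last level, the same size is used again.
            level = Math.min(level + 1, LEVEL_SIZE_ARRAY.length - 1);
            Slice next = new Slice(LEVEL_SIZE_ARRAY[level]);
            tail.next = next; // the full slice now points at its successor
            tail = next;
            sliceCount++;
        }
        tail.values[tail.used++] = value;
    }

    /** Follows the links from the head and collects every stored value in insertion order. */
    List<Integer> toList() {
        List<Integer> result = new ArrayList<>();
        for (Slice s = head; s != null; s = s.next) {
            for (int i = 0; i < s.used; i++) {
                result.add(s.values[i]);
            }
        }
        return result;
    }

    int sliceCount() { return sliceCount; }

    public static void main(String[] args) {
        // Positions of "one" in "one two one one one one one one" (stored here as absolute values).
        GrowingSliceList positions = new GrowingSliceList();
        for (int p : new int[] {0, 2, 3, 4, 5, 6, 7}) {
            positions.add(p);
        }
        System.out.println(positions.toList());     // [0, 2, 3, 4, 5, 6, 7]
        System.out.println(positions.sliceCount()); // 2
    }
}
```

Starting small and growing keeps rare terms cheap while frequent terms soon get larger slices, so following a long posting list touches few links.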
That gives a rough picture of the structure of Lucene's in-memory inverted index. If a buffer in ByteBlockPool fills up, the next buffer is allocated and data continues to be packed into it.
Summary
This post showed a small part of what we cover in the Elasticsearch/Lucene code reading sessions at M3. By reviewing the structure of an inverted index and then looking at how Lucene implements it in memory, I hope search engines now feel a little more familiar. If you want to dig deeper into how inverted indexes work, I recommend the book below.
Next time I plan to follow the part that persists the index from memory to disk.
The Elasticsearch/Lucene code reading session takes place every Wednesday, so I hope to keep sharing more about the internals of Elasticsearch and Lucene.
We're hiring !!!
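M3 is looking for engineers who want to move healthcare forward by building and improving our search and recommendation platform! Discussions about search and recommendation happen every day inside the company.
If you are thinking "I might like to hear a bit more," get in touch here: jobs.m3.com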