How to Use PatternReplaceCharFilterFactory
This post covers how to use PatternReplaceCharFilterFactory. Like MappingCharFilterFactory from the previous post, this class is one of Solr's CharFilters and processes the text before the tokenizer analyzes it. As the name suggests, it is a component that replaces text using regular-expression pattern matching, so replacement rules can be defined very flexibly. However, as described in more detail below, its memory efficiency and performance can be poor, so it is best reserved for cases that only pattern matching can handle.
Possible use cases include the following.
Removing strings that match a specific pattern
・Remove HTML or XML comments: <!--.+--> → empty string
  Example: abcd<!-- comment -->efg → abcdefg
・Remove ASCII letters: [a-zA-Z]+ → empty string
  Example: hoge12hogeあああ → 12あああ
・Remove everything except ASCII letters: [^a-zA-Z]+ → empty string
  Example: hoge12hogeあああ → hogehoge
Replacing strings that match a specific pattern
・Collapse a run of whitespace characters into a single space: \s+ → " " (one ASCII space)
  Example: test1    test2 → test1 test2
・Replace the compressed syntax for a run of ends: en+d → rubykaigi2011
  Example: the ennnnnnnnnd!! → the rubykaigi2011!!
  (from http://redmine.ruby-lang.org/issues/5054)
Replacing strings that match a specific pattern using backreferences
・Collapse a run of the same character into one character: (.)\1+ → $1
  Example: 1223334444あああ → 1234あ
・Hide part of (something that looks like) an email address: (.+)@.+\..+ → $1@xxxx
  Example: hoge@hoge.com → hoge@xxxx
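The rules above can be tried outside Solr with plain java.util.regex, the regex engine the filter is built on. The following is a minimal sketch, not the filter itself: the class name PatternExamples and the helper apply are made up for illustration, and block splitting and offset correction are ignored.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the rules listed above; not part of Solr.
final class PatternExamples {

    // Replaces every match of regex in input with replacement ($n refers to group n).
    static String apply(String regex, String replacement, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // \u3042 is the hiragana character used in the examples above.
        String mixed = "hoge12hoge\u3042\u3042\u3042";

        // Removal rules
        System.out.println(apply("<!--.+-->", "", "abcd<!-- comment -->efg")); // abcdefg
        System.out.println(apply("[a-zA-Z]+", "", mixed));                     // 12 followed by the three hiragana
        System.out.println(apply("[^a-zA-Z]+", "", mixed));                    // hogehoge

        // Plain replacement rules
        System.out.println(apply("\\s+", " ", "test1    test2"));              // test1 test2
        System.out.println(apply("en+d", "rubykaigi2011", "the ennnnnnnnnd!!")); // the rubykaigi2011!!

        // Backreference rules
        System.out.println(apply("(.)\\1+", "$1", "1223334444"));              // 1234
        System.out.println(apply("(.+)@.+\\..+", "$1@xxxx", "hoge@hoge.com")); // hoge@xxxx
    }
}
```

Note that .+ is greedy, so an input containing two comments loses everything between the first <!-- and the last -->; a reluctant .+? is one way around that.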
Usage
屿§å | 説æ | å¿ é |
pattern | æ£è¦è¡¨ç¾ï¼Javaã®æ£è¦è¡¨ç¾ãæå®å¯è½â»1ï¼ | â |
replacement | ç½®ææååï¼â»2ï¼ æªå®ç¾©ã®å ´åã¯ããããããã¿ã¼ã³ã®é¤å»ã¨ãªã | - |
maxBlockChars | ãã¿ã¼ã³ããããè¡ãæååã®æå¤§æåæ°ï¼ãããè¶ ããæååã®å ´åã¯ããã®æåæ°ãã¨ã«ãããã¯åããã¦ãã¿ã¼ã³ããããè¡ããæªæå®ã®å ´åã¯10000æåãï¼ | - |
blockDelimiters | maxBlockChars以å ã§ãã®æååãåºã¦ããå ´åããããã¯ã®åºåãããã®æååã®ä½ç½®ã«ãªã | - |
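*1 Any regular expression supported by java.util.regex.Pattern can be used.
   When using backreferences, be careful to define the groups correctly; otherwise a runtime error occurs during analysis.
*2 Groups can be referenced with $n (see the Matcher javadoc).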
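To make the warning about group references concrete, here is a small sketch in plain Java. It assumes the filter behaves like java.util.regex.Matcher here, which throws IndexOutOfBoundsException when a replacement refers to a group that the pattern does not define. The class GroupReferenceCheck and the method tryReplace are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only exercises java.util.regex, not Solr.
final class GroupReferenceCheck {

    // Returns the replaced text, or "error: ..." if the replacement
    // refers to a capturing group that the pattern does not define.
    static String tryReplace(String regex, String replacement, String input) {
        try {
            return Pattern.compile(regex).matcher(input).replaceAll(replacement);
        } catch (IndexOutOfBoundsException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Two groups defined, two groups referenced: fine.
        System.out.println(tryReplace("(aa)\\d+(bb)", "$1 $2", "aa12bb")); // aa bb

        // Only one group defined, but $2 is referenced: fails once a match is found.
        System.out.println(tryReplace("(aa)\\d+", "$1 $2", "aa12bb"));

        // Same broken rule, but the input never matches, so nothing fails.
        System.out.println(tryReplace("(aa)\\d+", "$1 $2", "zzz")); // zzz
    }
}
```

The last call suggests why such a mistake shows up as a runtime error during analysis rather than at configuration time: in plain Java the replacement string is only interpreted when a match actually occurs.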
Each definition can hold only one pattern, so to apply several patterns, declare one PatternReplaceCharFilterFactory charFilter per pattern, as shown below.
<fieldType name="prcf_test" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(aa)\d+(bb)" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(.)\1+" replacement="$1"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
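As a rough way to see what this chain does to a piece of text, the sketch below mimics it in plain Java. It assumes the charFilters run in the order they are declared and approximates WhitespaceTokenizerFactory with a split on whitespace. CharFilterChainSketch and analyze are hypothetical names, and no Solr classes are involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java approximation of the prcf_test analyzer above; not Solr code.
final class CharFilterChainSketch {

    static List<String> analyze(String input) {
        // First charFilter: drop the digits between "aa" and "bb", keep both groups.
        String s = Pattern.compile("(aa)\\d+(bb)").matcher(input).replaceAll("$1 $2");
        // Second charFilter: collapse runs of the same character.
        s = Pattern.compile("(.)\\1+").matcher(s).replaceAll("$1");
        // Stand-in for the whitespace tokenizer.
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // "aa123bb" becomes "aa bb" after the first rule and "a b" after the second.
        System.out.println(analyze("aa123bb")); // [a, b]
    }
}
```

The second rule also rewrites the output of the first one ("aa" becomes "a"), so the order of the charFilter elements matters.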
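maxBlockChars and blockDelimiters define the unit (block) on which pattern matching runs, by character count and by delimiter characters. The input is first split into these blocks, and the pattern is then applied to each block separately. Note that a pattern never matches across a block boundary. If the maximum input length is known in advance, setting maxBlockChars to a sufficiently large value avoids the problem. When the size cannot be predicted, as with crawled content, some matches can be missed, so in that case it is better to treat this filter as a supplementary measure only.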
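The effect of block boundaries can be reproduced with a small string-based sketch. This is a simplified model written for this explanation, not the Solr implementation: BlockSplitSketch and replaceByBlocks are hypothetical names, and the splitting rule (cut after maxBlockChars characters or right after a delimiter character) follows the description above.

```java
import java.util.regex.Pattern;

// Simplified model of block-wise replacement; not the actual Solr code.
final class BlockSplitSketch {

    static String replaceByBlocks(String input, Pattern pattern, String replacement,
                                  int maxBlockChars, String blockDelimiters) {
        StringBuilder out = new StringBuilder();
        int start = 0;
        while (start < input.length()) {
            int end = start;
            // Grow the block until it is full or a delimiter character has been consumed.
            while (end < input.length()) {
                char c = input.charAt(end++);
                boolean delimiter = blockDelimiters != null && blockDelimiters.indexOf(c) >= 0;
                if (delimiter || end - start >= maxBlockChars) {
                    break;
                }
            }
            // The pattern only ever sees one block at a time.
            out.append(pattern.matcher(input.substring(start, end)).replaceAll(replacement));
            start = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern comment = Pattern.compile("<!--.+-->");
        String text = "abcd<!-- x -->efg";
        // One large block: the comment is removed.
        System.out.println(replaceByBlocks(text, comment, "", 10000, null)); // abcdefg
        // Blocks of 8 characters split the comment, so nothing matches.
        System.out.println(replaceByBlocks(text, comment, "", 8, null));     // abcd<!-- x -->efg
        // A delimiter character splits a run of dots into separate blocks.
        System.out.println(replaceByBlocks("a...b", Pattern.compile("\\.+"), ".", 10000, ".")); // a...b
    }
}
```

In this model a match is lost as soon as a boundary falls inside it, which is the case to keep in mind when choosing maxBlockChars and blockDelimiters.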
注æç¹
ç½®æåã¨å¾ã§æåæ°ãå¤ããå ´åã¯highlightãããã¦ãã¾ãããã§ããï¼javadocããï¼ã©ãããå ´åã«çºçããã®ãã¾ã å®éã«ç¢ºèªåºæ¥ã¦ãã¾ããã®ã§ãæ©ä¼ãããã°è©¦ãã¦ã¿ã¾ãã
ããã§ã¯ä¸èº«ãè¦ã¦ããã¾ããããååã®MappingCharFilterFactoryã¨åããããªæ§æã§ãä½ããããã·ã³ãã«ã§ãã
Class details
*This is the Solr 3.3 version (on the current trunk, PatternReplaceCharFilter has moved to Lucene).
Almost all of the processing is done in PatternReplaceCharFilter; the Factory only creates instances.
Key points of this Filter
・read method
@Override
public int read() throws IOException {
  while( prepareReplaceBlock() ){
    return replaceBlockBuffer.charAt( replaceBlockBufferOffset++ );
  }
  return -1;
}
Leaving aside why this is a while rather than an if, the method prepares the replaced text as needed, returns one character if any is available, and returns -1 otherwise. Next, let's see what prepareReplaceBlock does.
・prepareReplaceBlock method
private boolean prepareReplaceBlock() throws IOException {
  while( true ){
    if( replaceBlockBuffer != null && replaceBlockBuffer.length() > replaceBlockBufferOffset )
      return true;
    // prepare block buffer
    blockBufferLength = 0;
    while( true ){
      int c = nextChar();
      if( c == -1 ) break;
      blockBuffer[blockBufferLength++] = (char)c;
      // end of block?
      boolean foundDelimiter =
        ( blockDelimiters != null ) &&
        ( blockDelimiters.length() > 0 ) &&
        blockDelimiters.indexOf( c ) >= 0;
      if( foundDelimiter ||
          blockBufferLength >= maxBlockChars ) break;
    }
    // block buffer available?
    if( blockBufferLength == 0 ) return false;
    replaceBlockBuffer = getReplaceBlock( blockBuffer, 0, blockBufferLength );
    replaceBlockBufferOffset = 0;
  }
}
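Here a block of the original (pre-replacement) text is read until maxBlockChars characters have been read, a character in blockDelimiters is found, or the input returns -1, and then pattern matching and replacement are applied to that block. If replaced text that has not yet been read still remains, the method returns true without reading or replacing anything. Next, let's look at the pattern matching and replacement.
・getReplaceBlock method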
String getReplaceBlock( char block[], int offset, int length ){
  StringBuffer replaceBlock = new StringBuffer();
  String sourceBlock = new String( block, offset, length );
  Matcher m = pattern.matcher( sourceBlock );
  int lastMatchOffset = 0, lastDiff = 0;
  while( m.find() ){
    m.appendReplacement( replaceBlock, replacement );
    // record cumulative diff for the offset correction
    int diff = replaceBlock.length() - lastMatchOffset - lastDiff - ( m.end( 0 ) - lastMatchOffset );
    if (diff != 0) {
      int prevCumulativeDiff = getLastCumulativeDiff();
      if (diff > 0) {
        for(int i = 0; i < diff; i++){
          addOffCorrectMap(nextCharCounter - length + m.end( 0 ) + i - prevCumulativeDiff,
              prevCumulativeDiff - 1 - i);
        }
      } else {
        addOffCorrectMap(nextCharCounter - length + m.end( 0 ) + diff - prevCumulativeDiff,
            prevCumulativeDiff - diff);
      }
    }
    // save last offsets
    lastMatchOffset = m.end( 0 );
    lastDiff = diff;
  }
  // copy remaining of the part of source block
  m.appendTail( replaceBlock );
  return replaceBlock.toString();
}
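This method performs the pattern matching and replacement. When the replacement changes the number of characters, it also stores the information needed to map positions back to the original text (MappingCharFilter did the same).
The rest is just repeating the above until the input runs out. Simple, isn't it?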
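To illustrate the idea of recording position corrections while replacing, here is a deliberately simplified sketch. It is not the bookkeeping used by addOffCorrectMap above: the class OffsetCorrectionSketch, its methods, and the TreeMap layout are assumptions made for this example. After each match it stores how far the input is ahead of the output, so that an offset in the replaced text can be mapped back to the original text.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of offset correction; not the Solr implementation.
final class OffsetCorrectionSketch {

    // Replaces all matches and records, at the output offset just after each
    // replacement, the difference (input offset - output offset).
    static String replace(Pattern pattern, String replacement, String input,
                          TreeMap<Integer, Integer> corrections) {
        StringBuffer out = new StringBuffer(); // this appendReplacement overload takes a StringBuffer
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            m.appendReplacement(out, replacement);
            corrections.put(out.length(), m.end() - out.length());
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    // Maps an offset in the replaced text back to an offset in the original text.
    // Offsets that fall inside a replacement are only approximated.
    static int correct(TreeMap<Integer, Integer> corrections, int outputOffset) {
        Map.Entry<Integer, Integer> e = corrections.floorEntry(outputOffset);
        return e == null ? outputOffset : outputOffset + e.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> corrections = new TreeMap<>();
        String out = replace(Pattern.compile("(.)\\1+"), "$1", "1223334444", corrections);
        System.out.println(out);                     // 1234
        // The '3' at output offset 2 came from input offset 3.
        System.out.println(correct(corrections, 2)); // 3
        // The '4' at output offset 3 came from input offset 6.
        System.out.println(correct(corrections, 3)); // 6
    }
}
```

A consumer such as a highlighter needs this kind of mapping because token offsets are computed on the replaced text but must point into the original text.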
Thoughts
A method from Java's standard regex library that I had never used (Matcher#appendReplacement) is used here, so I learned how it is meant to be used.
I wondered why StringBuffer was used instead of StringBuilder, and it turned out that the method above takes a StringBuffer argument (StringBuilder dates from 1.5, Matcher from 1.4).
Once again I was painfully reminded of how weak my regex skills are...
References
An article from overseas on PatternReplaceCharFilterFactory: java.dzone.com
The last CharFilter to cover is HTMLStripCharFilterFactory.