トライ木による大量ワードに対するマッチング処理

Java

Last updated at 2015-02-12Posted at 2013-08-15

以前どっかの仕事で必要になったので休日に書いたコードです。

どうしても期待するパフォーマンスが出なくて試行錯誤の結果こうなりました。トライ木というデータ構造の名前がついているのは後から知ったもの。（http://ja.wikipedia.org/wiki/トライ木）

マッチするかどうかのインターフェイスを以下のように定義するとして、

Macher.java

package search;

/**
 * 文字列に対するマッチ処理を行うためのインターフェイス。
 */
public interface Matcher {
	boolean match(String target);
}

こんな感じのコードを用意。

TreeMacher.java


package search;

import java.util.HashMap;
import java.util.Map;

/**
 * マッチ対象の文字列集合を前処理し、ツリー構造で保持する。
 * NGワードチェックなど、マッチ対象の文字列集合がほぼ静的に決定されるケースで高速に動作する。
 */
public class TreeMatcher implements Matcher{

	private CharTreeNode rootNode = null;
	
	public TreeMatcher(String[] words){
		this.rootNode = new CharTreeNode(null);
		this.rootNode.addWords(words);
	}

	/**
	 * マッチ処理を行う。
	 */
	public boolean match(String target){
		CharTreeNode currentNode = null;
		for(int i=0;i<target.length();i++){
			currentNode = this.rootNode;
			for(int k=0;i+k<target.length();k++){
				Character c = target.charAt(i+k);
				currentNode = currentNode.child.get(c);
				
				if(currentNode == null){
					break;
				}
				
				if(currentNode.end){
					return true;
				}
			}
		}
		return false;
		
	}
	
	/**
	 * ツリー構造を保持する内部クラス。
	 *
	 */
	static class CharTreeNode {
		boolean end = false;
		Character value;
		Map<Character,CharTreeNode> child = new HashMap<Character, CharTreeNode>();
		CharTreeNode(Character value){
			this.value = value;
		}
		
		void addWords(String[] words){
			for(String word : words){
				this.addWord(word);
			}
		}
		
		void addWord(String word){
			addWord(word,this);
		}
		
		static void addWord(String value, CharTreeNode parent){
			if(value == null || value.length() < 1){
				parent.end = true;
				return;
			}
			
			if(parent == null || parent.end){
				return;
			}
			
			Character firstLetter = value.charAt(0);
			CharTreeNode current = null;
			if(parent.child.containsKey(firstLetter)){
				current = parent.child.get(firstLetter);
			}else{
				current = new CharTreeNode(firstLetter);
				parent.child.put(firstLetter,current);
			}
			
			current.addWord(value.substring(1,value.length()));
		}
	}
}

マッチさせたい文字列のパターンが数万とかあったりする時、ベタな String#contains によるチェック処理の100倍くらいのパフォーマンスが出ます。

ただ前処理が重いのとメモリ食います。

ご利用は計画的に。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up