Search::Fulltext 㧠N-gram æ¤ç´¢ã§ããããã« Search::Fulltext::Tokenizer::Ngram ãæ¸ãã
2014-01-01: Addendum — it has now been uploaded to CPAN.
Summary
Search::Fulltext, a very simple full-text search module, was recently released, so I wrote Search::Fulltext::Tokenizer::Ngram, which provides N-gram tokenizers for it.
With it you can run full-text searches over Japanese and other text where words are not delimited.
Motivation
It is such a simple module that I wanted to put it to simple use. At the moment the only Japanese tokenizer available for it is Search::Fulltext::Tokenizer::MeCab. Installing Text::MeCab is a hassle, though, and for searching things like Web documents with irregular Japanese mixed in, N-grams are sometimes the better fit, so I gave it a try.
Just in case: what's an N-gram?
A text split into every N-character sequence. E.g., taking the 2-grams of "色彩を持たない多崎つくると、彼の巡礼の年" gives "色彩", "彩を", "を持", ..., "の年". The point is that this enumerates every N-character substring occurring in a document, so an index built over N-grams never misses a document matching a query of N characters or longer. The drawbacks: the tokenizer knows nothing about morpheme boundaries, so a query for "京都" (Kyoto) also hits documents containing the word "東京都" (Tokyo Metropolis); queries shorter than N characters hit nothing at all; and the smaller N is, the larger the index grows; and so on.
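The splitting step itself is tiny. As a rough illustrative sketch in plain Perl (my own helper for explanation, not the module's actual internals):

```perl
use strict;
use warnings;
use utf8;

# Split a string into its N-grams: every substring of length $n,
# starting at character offsets 0, 1, 2, ...
sub ngrams {
    my ($text, $n) = @_;
    return map { substr $text, $_, $n } 0 .. length($text) - $n;
}

# With `use utf8` in effect, length() and substr() count characters
# rather than bytes, so this works for Japanese text as well.
my @bigrams = ngrams('Humpty', 2);
print join(' ', @bigrams), "\n";  # Hu um mp pt ty
```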
Usage
Installation
You can install it with cpanm or similar:
cpanm Search::Fulltext::Tokenizer::Ngram
Or, to install from the Git repository (uses Dist::Zilla):
git clone [email protected]:sekia/Search-Fulltext-Tokenizer-Ngram.git
cd Search-Fulltext-Tokenizer-Ngram
dzil test
dzil install
Example
1-gram, 2-gram, and 3-gram tokenizers are available out of the box.
use strict;
use warnings;
use utf8;
use Search::Fulltext;
use Search::Fulltext::Tokenizer::Unigram; # 1-gram tokenizer
use Search::Fulltext::Tokenizer::Bigram; # 2-gram tokenizer
use Search::Fulltext::Tokenizer::Trigram; # 3-gram tokenizer
my $search_engine = Search::Fulltext->new(
docs => [
'ハンプティ・ダンプティ 塀の上',
'ハンプティ・ダンプティ 落っこちた',
'王様の馬みんなと 王様の家来みんなでも',
'ハンプティを元に 戻せなかった',
],
# Use the 3-gram tokenizer.
tokenizer => q/perl 'Search::Fulltext::Tokenizer::Trigram::create_token_iterator_generator'/,
);
# search() returns the indices of the documents that matched; here [0, 1, 3].
my $hit_documents1 = $search_engine->search('ハンプティ');
# No hits: the index was built from 3-grams, so the
# 2-character word "王様" never made it into the index.
my $hit_documents2 = $search_engine->search('王様');
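For a 2-character query to hit, the index has to be built from 2-grams instead. A minimal sketch, assuming Search::Fulltext and this distribution are installed, mirroring the constructor call above but with the bundled Bigram tokenizer:

```perl
use strict;
use warnings;
use utf8;
use Search::Fulltext;
use Search::Fulltext::Tokenizer::Bigram;

my $bigram_engine = Search::Fulltext->new(
    docs => [
        '王様の馬みんなと 王様の家来みんなでも',
    ],
    # Index with 2-grams, so 2-character queries can match.
    tokenizer => q/perl 'Search::Fulltext::Tokenizer::Bigram::create_token_iterator_generator'/,
);

# "王様" is now itself a token in the index, so this query hits document 0.
my $hits = $bigram_engine->search('王様');
```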
If you need N-grams of four characters or longer, you can create your own by subclassing Search::Fulltext::Tokenizer::Ngram:
package MyTokenizer::42gram {
use parent qw/Search::Fulltext::Tokenizer::Ngram/;
sub create_token_iterator_generator {
    # Returns a closure that builds a 42-gram token iterator for each document.
    sub { __PACKAGE__->new(42)->create_token_iterator(@_) };
}
}
my $search_engine = Search::Fulltext->new(
docs => [ ... ],
tokenizer => q/perl 'MyTokenizer::42gram::create_token_iterator_generator'/,
);
TODO
Documentation. Upload to CPAN.
Wrap-up
A very simple module can now be put to very simple use. Enjoy!
I'm the author of Search::Fulltext.
I had been hoping someone would create a Search::Fulltext::Tokenizer::*, so I am very pleased.
If you like, would you consider uploading Search::Fulltext::Tokenizer::Ngram to CPAN?
For the rough procedure, this site is a useful reference:
http://blog.livedoor.jp/sasata299/archives/51284970.html
For how to write the README (Pod) expected of a Search::Fulltext::Tokenizer::*, please have a look here:
https://github.com/laysakura/Search-Fulltext-Tokenizer-MeCab
I won't claim it's an easy task, but please do give it some thought.