Hello, I'm Kanji Takahashi from the DSOC R&D Group.
For tasks like morphological analysis and feature extraction for sequence labeling, I tend to write much the same code every time.
This time, as a memo to cut down on that work, I'd like to introduce a few small preprocessing helpers.
Morphological analysis
In natural language processing for Japanese, morphological analysis means word segmentation plus part-of-speech tagging.
Because Japanese text is not written with spaces between words, it is an extremely important step that precedes most tasks.
For Japanese morphological analysis in Python, MeCab and the pure-Python Janome are probably the most commonly used options.
I usually use mecab-python3, a Python 3 binding for MeCab.
It makes MeCab's morphological analysis available in Python through a simple interface.
First, we'll run morphological analysis with mecab-python3 and shape the resulting morphemes so they are easy to work with in Python.
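For reference, using mecab-python3 directly looks roughly like the following (a minimal sketch; the exact output format depends on the dictionary you have installed):

import MeCab

# Create a tagger; the argument string is passed through to MeCab itself
tagger = MeCab.Tagger("")

# parse() returns plain text, one morpheme per line:
# the surface form, a tab, then comma-separated feature fields
print(tagger.parse("言語処理で形態素解析は重要なプロセスだ。"))

Picking the surface forms and features out of this line-oriented text by hand every time is exactly the boilerplate the classes below are meant to hide.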
How I want to use it
As shown below, I'd like to be able to casually run morphological analysis, filter words by part of speech, extract katakana words, and so on. With that wish list stated, let's get implementing.
# Import the morphological analyzer (the Tokenizer class in tokenizer.py)
from tokenizer import Tokenizer

# Create an analyzer instance
tok = Tokenizer()

# Analyze a sentence
morphemes = tok.tokenize("言語処理で形態素解析は重要なプロセスだ。")

# Extract only the nouns
nouns = [m for m in morphemes if m.pos == "名詞"]
print(nouns)
>>> [言語, 処理, 形態素, 解析, 重要, プロセス]

# Extract only the katakana words
katakanas = [m for m in morphemes if m.is_katakana]
print(katakanas)
>>> [プロセス]
From this sketch, it looks like we'll need an analyzer class and a morpheme class.
Creating a morpheme class
We create a morpheme class so that the surface form, part of speech, and so on can be accessed simply, as in morpheme.surface.
The example below assumes IPADic (if you switch dictionaries, define the instance variables to match the contents of the dictionary's attribute information, self.features).
class Morpheme:
    def __init__(self, node):
        # Surface form and the comma-separated feature fields of the MeCab node
        self.surface = node.surface
        self.features = node.feature.split(",")
        self.pos = self.features[0]
        self.pos_s1 = self.features[1]
        self.pos_s2 = self.features[2]
        self.pos_s3 = self.features[3]
        self.conj = self.features[4]
        self.form = self.features[5]
        self.orig = self.features[6]
        # IPADic gives 9 fields for known words, 7 for unknown ones,
        # so readings may be missing
        if len(self.features) < 8:
            self.reading = None
            self.reading2 = None
        else:
            self.reading = self.features[7]
            self.reading2 = self.features[8]

    def __str__(self):
        return self.surface

    def __repr__(self):
        return self.__str__()
Defining __str__ and __repr__ means that when you print(morpheme), the return values of these methods are what gets displayed.
If you don't plan to add methods to the Morpheme class, using a namedtuple like the following is also a fine option.
from collections import namedtuple

IPAMorpheme = namedtuple("IPAMorpheme", "surface pos pos_s1 pos_s2 pos_s3 conj form base reading1 reading2")
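As a minimal sketch, such a namedtuple could be filled from a MeCab node like this (the helper name is hypothetical, and it assumes an IPADic entry where all nine feature fields are present):

# Hypothetical helper; assumes node.feature has all 9 IPADic fields
def to_ipa_morpheme(node):
    return IPAMorpheme(node.surface, *node.feature.split(","))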
Making it a Morpheme class, on the other hand, lets you add attributes like the one below, which can then be used directly in downstream feature extraction.
Note, though, that every single instance will carry these methods, so it may be lighter to run such checks at feature-extraction time instead.
@property
def is_katakana(self):
    return regex.fullmatch(r"^\p{Katakana}+$", self.surface) is not None
The regex module used in this method is very handy: it supports Unicode character properties (such as \p{Katakana}) through an interface equivalent to the standard re module.
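If you prefer the lighter approach mentioned above, a standalone check along these lines could be called at feature-extraction time instead (a sketch; the function name is my own):

import regex

# Standalone katakana check, used at feature-extraction time
# instead of attaching a property to every Morpheme instance
def is_katakana(surface):
    return regex.fullmatch(r"\p{Katakana}+", surface) is not None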
A class that performs morphological analysis
Next we create the logic that fills the prepared morpheme class with morphological analysis results: a Tokenizer class with a tokenize method.
import MeCab

class Tokenizer:
    def __init__(self, mecab_args=""):
        self.__tagger = MeCab.Tagger(mecab_args)
        # Parse once up front; a common workaround for a surface-string
        # issue in older mecab-python3 releases
        self.__tagger.parse("Initialize")

    def tokenize(self, sentence):
        return [morpheme for morpheme in self.__parse_to_tag(sentence)]

    def __parse_to_tag(self, sentence):
        node = self.__tagger.parseToNode(sentence)
        # Skip the BOS node, then yield every node up to (not including) EOS
        node = node.next
        while node.next:
            yield Morpheme(node)
            node = node.next
By passing through the arguments used when creating the MeCab instance, you can take advantage of features such as user dictionaries.
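For example, a compiled user dictionary could be passed via MeCab's -u option (the path below is a placeholder):

# MeCab command-line style arguments work here; -u points at a user dictionary
tok = Tokenizer("-u /path/to/user.dic")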
TF-IDF with scikit-learn using the morphological analysis class
Using the morphological analysis class prepared above, let's weight nouns with TF-IDF.
scikit-learn's interface and TF-IDF
In scikit-learn, TF-IDF is provided by TfidfVectorizer.
scikit-learn has a unified interface: fit trains a model on the given data, transform converts input using the model obtained by fit, and predict makes predictions using the model obtained by fit.
In the case of TF-IDF, fit builds the vocabulary for the TF-IDF computation and calculates the IDF values, while transform uses the weights created by fit to compute and return TF-IDF values for the given data.
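This means the two steps can also be run separately, which matters when the vocabulary and IDF weights should come from training data only. A small sketch, assuming train_docs and new_docs are lists of strings:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# fit: build the vocabulary and IDF weights from the training data only
vectorizer.fit(train_docs)

# transform: reuse those weights to compute TF-IDF vectors for unseen documents
new_vectors = vectorizer.transform(new_docs)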
Obtaining TF-IDF values with TfidfVectorizer
TfidfVectorizer assumes its input text is already segmented into words, so we plug our own segmentation in through the tokenizer argument.
Since tokenizer accepts any callable, we write a small function that wraps our Tokenizer.
The example below segments the text and returns only the nouns.
def tokenize_with_filter(text):
    morphemes = tok.tokenize(text)
    return [m.surface for m in morphemes if m.pos == "名詞"]
Using this function, we vectorize the documents with TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

# Example sentences quoted from Aozora Bunko
docs = [
    '吾輩は猫である。名前はまだ無い。',
    '親譲りの無鉄砲で小供の時から損ばかりしている。',
    '公然と名前を争わないような男だから、弱虫に極まっている。'
]

# Compute the TF-IDF values
vectorizer = TfidfVectorizer(tokenizer=tokenize_with_filter)
vectors = vectorizer.fit_transform(docs)

# Show the vocabulary targeted by the TF-IDF computation over docs
print(vectorizer.vocabulary_)
>>> {'吾輩': 3, '猫': 8, '名前': 2, '譲り': 10, '無鉄砲': 7, '供': 1, '時': 6, '損': 5, '争': 0, '男': 9, '弱虫': 4}

# Show the TF-IDF values for docs
print(vectors.toarray())
>>>
By setting a tokenizer on TfidfVectorizer like this, you can shape the input token sequence with a great deal of freedom.
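For instance, switching to the base forms of nouns and verbs only requires swapping the wrapper function. A sketch using the orig attribute defined on Morpheme above:

# Variant: return base forms (orig) of nouns and verbs
# instead of the surface forms of nouns
def tokenize_nouns_and_verbs(text):
    morphemes = tok.tokenize(text)
    return [m.orig for m in morphemes if m.pos in ("名詞", "動詞")]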
Feature extraction for sequence labeling
A common approach is to treat the morpheme sequence as a series and attach a label to each morpheme; it is used, for example, in named entity recognition and part-of-speech tagging.
The features typically include the surface forms and parts of speech of the several words before and after the morpheme being predicted, but writing this with if statements is a pain: adding or removing a feature means editing a lot of code.
Templates make such adjustments easy.
Template-based feature extraction
The idea is to extract features from templates of the form (feature label name, feature extraction function, position relative to the target word).
For example, a template that takes the surface form of the word two positions before the target word is defined as ("word-2", lambda x: x.surface, -2).
Let me explain with named entity recognition as the example.
Take the sentence 「午前8時に東京駅で集合しよう」 ("Let's meet at Tokyo Station at 8 a.m."), morphologically analyzed and annotated with named entity labels in the IOB2 (Inside-Outside-Beginning) tag format.
Here we extract features for the word 「東京」.
As features, we use the surface forms of the target word and the two words on either side, their parts of speech, and the already-predicted IOB2 tags of the two preceding words.
Expressed as a Python dictionary, the features used when training on the gold labels look like this:
{
    "word-2": "時",
    "word-1": "に",
    "word": "東京",
    "word+1": "駅",
    "word+2": "で",
    "pos-2": "名詞",
    "pos-1": "助詞",
    "pos": "名詞",
    "pos+1": "名詞",
    "pos+2": "助詞",
    "iob2-2": "I-TIME",
    "iob2-1": "O"
}
We define the templates and the feature extraction functions.
# Feature extraction functions (iob2 is assumed to be set on each token)
word_feature = lambda x: x.surface
pos_feature = lambda x: x.pos
iob2_feature = lambda x: x.iob2

# Templates: (feature label, extraction function, position relative to the target)
templates = [
    ("word-2", word_feature, -2), ("word-1", word_feature, -1), ("word", word_feature, 0), ("word+1", word_feature, 1), ("word+2", word_feature, 2),
    ("pos-2", pos_feature, -2), ("pos-1", pos_feature, -1), ("pos", pos_feature, 0), ("pos+1", pos_feature, 1), ("pos+2", pos_feature, 2),
    ("iob2-2", iob2_feature, -2), ("iob2-1", iob2_feature, -1),
]
Next we write a function that takes the defined templates and a sequence as input and extracts features according to those templates.
def iter_feature(tokens, templates):
    tokens_len = len(tokens)
    for i in range(tokens_len):
        # Include a bias term so that frequent features are not unduly favored
        feature = {"bias": 1.0}
        # Apply the templates
        for label, f, target in templates:
            current = i + target
            if current < 0 or current >= tokens_len:
                continue
            # Apply the extraction function when the position is within range
            feature[label] = f(tokens[current])
        # Add BOS and EOS to the features
        if i == 0:
            feature["BOS"] = True
        elif i == tokens_len - 1:
            feature["EOS"] = True
        # Yield the features one token at a time
        yield feature
Finally, we actually extract the features and convert them into a form scikit-learn can train on.
from sklearn.feature_extraction import DictVectorizer

# Extract the features from every sequence in the corpus
features = []
for tokens in corpus:
    features.extend(iter_feature(tokens, templates))

# Convert the dict-format features into numeric vectors scikit-learn can accept
feature_vectorizer = DictVectorizer()
vector = feature_vectorizer.fit_transform(features)
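The training step itself is not shown in this article, but as a rough sketch it could look like the following, assuming every token in corpus carries its gold iob2 tag and using LogisticRegression as a stand-in for whichever classifier you prefer:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Collect the gold labels in the same order as the extracted features
labels = [token.iob2 for tokens in corpus for token in tokens]

# Encode the string labels as integers for scikit-learn
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Train a classifier on the vectorized features
model = LogisticRegression()
model.fit(vector, y)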
Predicting with the templates
When predicting with the trained model, you need to apply the same templates and extract features in exactly the same way.
Consider a model that uses the labels it has already predicted when predicting the next tag.
for token, feature in zip(tokens, iter_feature(tokens, templates)):
    # Extract and vectorize the features
    vec = feature_vectorizer.transform(feature)
    # Store the predicted value in the token's iob2 attribute
    token.iob2 = label_encoder.inverse_transform(model.predict(vec))[0]
label_encoder is the object that encodes the labels to be predicted as numbers so that scikit-learn can handle them.
It has to be saved at training time; it is what converts the numeric labels predicted by the trained model back into the original labels.
In the example above, the predicted IOB2 tag ends up stored in each token's iob2 instance variable.
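Saving those fitted objects at training time could be done with joblib, for example (the file names are placeholders):

import joblib

# Persist the fitted objects so that prediction can be run later as-is
joblib.dump(model, "model.joblib")
joblib.dump(label_encoder, "label_encoder.joblib")
joblib.dump(feature_vectorizer, "feature_vectorizer.joblib")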
Organize what you use often, and get implementing quickly
Organizing these recurring bits of processing makes it much easier to get started on new work.
If you modularize them properly and manage them on GitHub, you may be able to set up a preprocessing environment in no time with pip install git+https://github.com/hogehoge_user/mymodule.
From the next installment, I plan to introduce algorithms used in language processing.
※ The rest of this series can be read on "Sansan Builders Box".