I'm Saurabh (bira) from the Research & Development department.

This article describes one of the machine-learning features we introduced to reduce the effort users spend creating recipes. With this feature, when a user enters a recipe title, we can predict the ingredients they are likely to use.
Summary

- We built a model that predicts ingredients from a recipe title.
- Together with the Posting Development team, we added an ingredient suggestion feature to the recipe editor.

This feature is available in the latest Cookpad app (v19.6.0.0) on the App Store.
[Screenshots: the recipe editor before and after the ingredient suggestion feature]
How does the model work?
1. Embed

- Training: train the Word Embedding and Sentence Embedding models and upload them to S3 (explained in the next section).
- Preprocessing: remove special characters.
Many Cookpad users put special characters in their text. For example, a title like "✧♡タンドリーチキン♡^-^✧" contains the special characters ♡, ✧, and ^-^. Special characters carry no information about ingredients, so we remove them with the following Python function:
```python
import re

def remove_special_characters(text):
    non_CJK_patterns = re.compile("[^"
        u"\U00003040-\U0000309F"  # Hiragana
        u"\U000030A0-\U000030FF"  # Katakana
        u"\U0000FF65-\U0000FF9F"  # Half width Katakana
        u"\U0000FF10-\U0000FF19"  # Full width digits
        u"\U0000FF21-\U0000FF3A"  # Full width Upper case English Alphabets
        u"\U0000FF41-\U0000FF5A"  # Full width Lower case English Alphabets
        u"\U00000030-\U00000039"  # Half width digits
        u"\U00000041-\U0000005A"  # Half width Upper case English Alphabets
        u"\U00000061-\U0000007A"  # Half width Lower case English Alphabets
        u"\U00003190-\U0000319F"  # Kanbun
        u"\U00004E00-\U00009FFF"  # CJK unified ideographs. kanjis
        "]+", flags=re.UNICODE)
    return non_CJK_patterns.sub(r"", text)
```
- Tokenize: tokenize the text with MeCab.
- Embedding: use the Word Embedding and Sentence Embedding models to convert the title of every recipe in the Cookpad database into a vector.
- Indexing: index the vectors with Faiss (method = IndexFlatIP, Exact Search for Inner Product) and upload the index to S3. Faiss (Facebook AI Similarity Search) is a library developed by Facebook AI for efficient similarity search over vectors; it supports nearest-neighbor search over billion-scale vector sets.
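Under the hood, IndexFlatIP is an exact brute-force inner-product search. What it computes can be sketched in plain NumPy (the dimension and random vectors below are purely illustrative, not the real title embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                                    # embedding dimension (illustrative)
titles = rng.standard_normal((1000, d)).astype("float32")  # indexed recipe-title vectors
query = rng.standard_normal((1, d)).astype("float32")      # embedded input title

# IndexFlatIP scores the query against every indexed vector by inner product
scores = titles @ query.T                  # shape (1000, 1)
top_k = np.argsort(-scores.ravel())[:5]    # ids of the 5 most similar recipes
```

Because the search is exact (no quantization or clustering), it scans every vector; Faiss implements the same computation with optimized SIMD kernels.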
2. Search&Suggest (API Server)
- Download the Word Embedding model, the Sentence Embedding model, and the Faiss index from S3.
- Load the Word Embedding model, the Sentence Embedding model, and the Faiss index into memory.
- Convert the input title into a vector with the Embedding model.
- Search for the k most similar recipes with Faiss.
- Suggest the most common ingredients among those similar recipes.
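The final step can be sketched with `collections.Counter` (the recipe ids and ingredient lists below are hypothetical stand-ins for the real Faiss search result and database):

```python
from collections import Counter

def suggest_ingredients(similar_recipe_ids, recipe_ingredients, n_suggest=5):
    """Suggest the most common ingredients among the k similar recipes."""
    counts = Counter()
    for recipe_id in similar_recipe_ids:
        counts.update(recipe_ingredients[recipe_id])
    return [ingredient for ingredient, _ in counts.most_common(n_suggest)]

# Hypothetical search result: ids of the k most similar recipes
similar_ids = [1, 2, 3]
ingredients_by_id = {
    1: ["鶏肉", "ヨーグルト", "カレー粉"],
    2: ["鶏肉", "ヨーグルト", "にんにく"],
    3: ["鶏肉", "カレー粉", "生姜"],
}
print(suggest_ingredients(similar_ids, ingredients_by_id, n_suggest=3))
# → ['鶏肉', 'ヨーグルト', 'カレー粉']
```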
Training the Embeddings:

We train a Word Embedding model (fastText) on the recipe title data. We use fastText through gensim, which is very easy to use.
```python
from gensim.models import FastText

# recipe_titles : [....., 牛乳で簡単！本格まろやか坦々麺, ...]
# tokenize recipe titles using MeCab and then train the fasttext model
# recipe_title_list (tokenized) : [..., ['牛乳', 'で', '簡単', '！', '本格', 'まろやか', '坦々', '麺'], ...]
ft_model = FastText(size=100, min_count=5, window=5, iter=100, sg=1)
ft_model.build_vocab(recipe_title_list)
ft_model.train(recipe_title_list, total_examples=ft_model.corpus_count, epochs=ft_model.iter)
```
Why did we choose fastText?

fastText (essentially an extension of the word2vec model) treats each word as being composed of character n-grams, so a word vector is the sum of the vectors of its character n-grams. For example, the word vector of "中華丼" is the sum of the vectors of n-grams such as "<中", "中華", "<中華", "華丼", "中華丼>", and "華丼>" (where "<" and ">" mark word boundaries). Enriching word vectors with subword information in this way:

- Produces better Word Embeddings even for rare words. Even if a word is rare, its character n-grams still appear inside other words, so its embedding remains usable. For example, even if "中華風" is rare, it shares character n-grams with common words such as "中華丼" and "中華サラダ", so fastText can still learn an appropriate embedding for it.
- Handles out-of-vocabulary words. Even if a word never appeared in the training corpus, a word vector can still be constructed from its character n-grams.
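The n-gram decomposition can be sketched in a few lines of Python (the boundary markers follow fastText's convention; the n-gram range and the helper itself are illustrative, not fastText's actual implementation):

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of a word, with '<' and '>' marking word boundaries."""
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("中華丼"))
# → ['<中', '中華', '華丼', '丼>', '<中華', '中華丼', '華丼>']
```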
Training the Sentence Embedding model:

We tried two Sentence Embedding models:

Average of Word Embeddings
: Since a sentence is essentially composed of words, one might argue that simply taking the sum or average of the word vectors yields a sentence vector. This approach resembles a Bag-of-Words representation: it completely ignores word order and sentence meaning (does order matter for this problem? 🤔).
```python
import MeCab
import numpy as np

VECTOR_DIMENSION = 200
mecab_tokenizer_pos = MeCab.Tagger("-Ochasen")

def sentence_embedding_avg(title, model=ft_model):  # ft_model: the fastText model trained above
    relevant_words = [ws.split('\t') for ws in mecab_tokenizer_pos.parse(title).split('\n')[:-2]]
    # keep only nouns (名詞), verbs (動詞) and adjectives (形容詞)
    relevant_words = [w[0] for w in relevant_words if w[3].split('-')[0] in ['名詞', '動詞', '形容詞']]
    sentence_embedding = np.zeros(VECTOR_DIMENSION)
    cnt = 0
    for word in relevant_words:
        if word in model.wv:
            sentence_embedding += model.wv[word]
            cnt += 1
    if cnt > 0:
        sentence_embedding /= cnt
    return sentence_embedding
```
- Tokenize: morphologically analyze the sentence with MeCab.
- Filter: keep only nouns, adjectives, and verbs; discard all other words.
- Average: take the Word Embeddings of the filtered words and average them to obtain the title vector.
Bi-LSTM Sentence Embeddings
: We train a Sentence Embedding by supervised learning on Cookpad recipe data. The labels are derived from the Jaccard similarity between two recipes: treating each recipe as a set of ingredients, the Jaccard similarity of recipes A and B is J(A, B) = |A ∩ B| / |A ∪ B|. The idea is that the title vectors of recipes with a high Jaccard similarity between them should end up close to each other in the Sentence Embedding space.
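Treating recipes as ingredient sets, the label computation can be sketched as follows (the two sample recipes are made up):

```python
def jaccard_similarity(ingredients_a, ingredients_b):
    """|A ∩ B| / |A ∪ B| over two recipes' ingredient sets."""
    a, b = set(ingredients_a), set(ingredients_b)
    return len(a & b) / len(a | b)

# Two hypothetical recipes sharing 2 of 4 distinct ingredients
recipe_a = ["鶏肉", "ヨーグルト", "カレー粉"]
recipe_b = ["鶏肉", "カレー粉", "生姜"]
print(jaccard_similarity(recipe_a, recipe_b))  # → 0.5
```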
- Build the dataset: each sample row contains two recipe titles and the Jaccard index representing the similarity of those two recipes: {title_1, title_2, Jaccard_index}.
- Train the network below:
The network above can be trained in two settings:

- Regression: g(·): sigmoid, with y = Jaccard index.
- Classification: g(·): dense + dense(softmax), with y = a class label derived from the Jaccard index. The F(·) obtained by training the network in the 5-class classification setting seemed to work best; for the network, the classification problem can be easier to solve than the regression problem.
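One way to derive 5 class labels from the Jaccard index is equal-width binning over [0, 1]; the bin edges below are an assumption for illustration, since the post does not state the exact thresholds used:

```python
import numpy as np

# Hypothetical equal-width bins over the [0, 1] Jaccard range
bin_edges = np.array([0.2, 0.4, 0.6, 0.8])

def jaccard_to_class(jaccard_index):
    """Map a Jaccard index in [0, 1] to one of 5 class labels (0..4)."""
    return int(np.digitize(jaccard_index, bin_edges))

print([jaccard_to_class(j) for j in [0.05, 0.25, 0.5, 0.75, 0.95]])
# → [0, 1, 2, 3, 4]
```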
Implementing the network in Keras:
```python
from keras import backend as K
from keras import optimizers
from keras.models import Model
from keras.layers import Embedding, LSTM, Input, Reshape, Lambda, Dense
from keras.layers import Bidirectional
import numpy as np

def cosine_distance(vects):
    x, y = vects
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    return K.sum(x * y, axis=-1, keepdims=True)

title_1 = Input(shape=(MAX_SEQUENCE_LENGTH,))
title_2 = Input(shape=(MAX_SEQUENCE_LENGTH,))
word_vec_sequence_1 = embedding_layer(title_1)  # Word embedding layer (fastText)
word_vec_sequence_2 = embedding_layer(title_2)  # Word embedding layer (fastText)
F = Bidirectional(LSTM(100))
sentence_embedding_1 = F(word_vec_sequence_1)
sentence_embedding_2 = F(word_vec_sequence_2)
similarity = Lambda(cosine_distance)([sentence_embedding_1, sentence_embedding_2])
similarity = Dense(5)(similarity)
y_dash = Dense(5, activation='softmax')(similarity)
model = Model(inputs=[title_1, title_2], outputs=y_dash)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit([train_title_1, train_title_2], y)  # [train_title_1, train_title_2], y are respectively input titles and class labels
np.save('bilstm_weights.npy', F.get_weights())
```
- Use the F(·) trained in the previous step as the sentence embedding:
```python
from keras import backend as K
from keras.models import Model
from keras.layers import Embedding, LSTM, Input, Reshape, Lambda, Dense
from keras.layers import Bidirectional
from keras.preprocessing import sequence
import numpy as np

title = Input(shape=(MAX_SEQUENCE_LENGTH,))
word_embedding = embedding_layer(title)
F = Bidirectional(LSTM(100))
sentence_embedding = F(word_embedding)
sentence_embedding_model = Model(inputs=title, outputs=sentence_embedding)
# freeze the Bi-LSTM layer and load the weights trained above
sentence_embedding_model.layers[2].trainable = False
sentence_embedding_model.layers[2].set_weights(np.load('bilstm_weights.npy'))

def sentence_embedding_bilstm_5c(text):
    txt_to_seq = keras_tokenizer.texts_to_sequences([mecab_tokenizer.parse(text)])
    padded_sequence = sequence.pad_sequences(txt_to_seq, maxlen=MAX_SEQUENCE_LENGTH)
    return K.get_value(sentence_embedding_model(K.cast(padded_sequence, 'float32')))[0]
```
Results
Below are the adoption rates in the service. For example, "3 out of 5 suggested ingredients matches actual" is the percentage of cases in which 3 of the 5 suggested ingredients were actually used.
| | 3 out of 5 suggested ingredients matches actual (%) | 2 out of 5 suggested ingredients matches actual (%) |
|---|---|---|
| Average of word embeddings | 53% | 80% |
| Bi-LSTM Sentence Embeddings | 50% | 76% |
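The match criterion behind these percentages can be sketched as follows (the sample suggestion and recipe are made up; the reported rates are the fraction of cases where this check passes):

```python
def at_least_n_match(suggested, actual, n):
    """True if at least n of the suggested ingredients appear in the actual recipe."""
    return len(set(suggested) & set(actual)) >= n

suggested = ["鶏肉", "ヨーグルト", "カレー粉", "にんにく", "生姜"]
actual = ["鶏肉", "カレー粉", "生姜", "玉ねぎ"]
print(at_least_n_match(suggested, actual, 3))  # → True  (3 of 5 suggestions used)
print(at_least_n_match(suggested, actual, 4))  # → False
```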
Average of word embeddings (which resembles Bag-of-Words) suits this problem better than the Bi-LSTM Sentence Embeddings. This is likely because recipe titles are short texts and, moreover, word-order information does not help much in predicting ingredients.
Summary

- We built a model that predicts ingredients from a recipe title.
- Together with the Posting Development team, we added an ingredient suggestion feature to the recipe editor.

Cookpad is looking for people who can create new services with machine learning. If you are interested, please come by and talk to us.