Overfitting in Logistic Regression on BoW Features
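Over the last few posts I worked through Chapter 8 of 『言語処理 100 本ノック』 (NLP 100 Exercises), the polarity analysis problems based on logistic regression. One result was that cross-validation accuracy barely changed even without regularization*1. In this post I look into that in more detail.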
Preparing the data
The previous posts used my own PHP implementation of logistic regression. This time I use the implementation provided by scikit-learn, because a naive PHP implementation takes too long when the experiment has to be repeated with many different parameters.
As general setup, import NumPy and pyplot. I also create a FontProperties object so that Japanese text can be used in plot labels.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# Font used to render Japanese text in plot labels (Windows path).
fp = FontProperties(fname=r'C:\Windows\Fonts\YuGothic.ttf', size=11)
Next, load the text data for polarity analysis*2. Part of the data contains non-ASCII characters, so I specified latin-1 as the encoding.
# One review per line; latin-1 because the files contain some non-ASCII bytes.
posdata = [line.rstrip('\n') for line in open('rt-polarity.pos', 'r', encoding='latin-1')]
negdata = [line.rstrip('\n') for line in open('rt-polarity.neg', 'r', encoding='latin-1')]
Let's take a quick look at the data by displaying the first 5 lines of the positive examples.
posdata[0:5]
The result is as follows.
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . ',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . ',
 'effective but too-tepid biopic',
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start . ',
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . "]
After concatenating the positive and negative examples and preparing the ground-truth labels, I split the data into a training set and a test set with scikit-learn's train_test_split function, holding out 20% of the data for testing. random_state fixes the random seed for reproducibility; the particular number has no special meaning.
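# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

X = posdata + negdata
y = [1] * len(posdata) + [0] * len(negdata)  # 1 = positive, 0 = negative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)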
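To make the role of the seed concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-cut split. The helper name split_train_test and the toy documents are made up for illustration. It will not reproduce the exact partition that scikit-learn produces; it only shows that a fixed seed yields the same split on every run.

```python
import math
import random

def split_train_test(X, y, test_size=0.2, seed=1):
    """Shuffle indices with a fixed seed, then cut off the first part as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)          # same seed -> same permutation
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

# Hypothetical toy data: 10 documents, the first 5 positive and the rest negative.
docs = ['doc%d' % i for i in range(10)]
labels = [1] * 5 + [0] * 5
X_tr, X_te, y_tr, y_te = split_train_test(docs, labels)
print(len(X_tr), len(X_te))
```

Each document stays paired with its label because the same index lists are applied to both sequences.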
I use the CountVectorizer class to extract BoW features. The manual says that stop words can be specified, so I tried setting them to english.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(encoding='latin-1', stop_words='english')
X_train_cv = cv.fit_transform(X_train)  # learn the vocabulary from the training set only
X_test_cv = cv.transform(X_test)        # reuse that vocabulary for the test set
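For readers who have not used BoW features before, the following is a rough pure-Python sketch of what this step computes. It is not scikit-learn's implementation: the stop-word list is a tiny made-up one, the helper names are hypothetical, and the only thing borrowed from CountVectorizer is its default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")        # tokens of 2+ word characters
STOP_WORDS = {"the", "is", "a", "of", "and", "to"}  # hypothetical tiny stop-word list

def tokenize(text):
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]

def fit_vocabulary(docs):
    """Map each distinct term in the training documents to a column index."""
    terms = sorted({t for d in docs for t in tokenize(d)})
    return {term: i for i, term in enumerate(terms)}

def transform(docs, vocab):
    """Turn documents into count vectors; terms outside the vocabulary are ignored."""
    rows = []
    for d in docs:
        counts = Counter(t for t in tokenize(d) if t in vocab)
        rows.append([counts.get(term, 0) for term in vocab])
    return rows

train_docs = ["the movie is fun , fun and warm", "a dull movie"]
vocab = fit_vocabulary(train_docs)
bow_train = transform(train_docs, vocab)
bow_test = transform(["a warm but unseen story"], vocab)
print(list(vocab))   # ['dull', 'fun', 'movie', 'warm']
print(bow_train)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(bow_test)      # [[0, 0, 0, 1]]
```

Words that appear only in the test document ("unseen", "story") get no column at all, which is why the vocabulary is fitted on the training set and then reused for the test set.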
As a sanity check, let's plot the distribution of word frequencies. The histogram below is built from the training set.
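# Total occurrences of each word in the training set (column sums of the BoW matrix).
plt.hist(np.squeeze(np.asarray(np.sum(X_train_cv, axis=0))), bins=50)
plt.title(u'学習データセットの単語出現頻度の分布', fontproperties=fp)  # "Word frequency distribution in the training set"
plt.xlabel(u'出現頻度', fontproperties=fp)  # "frequency"
plt.ylabel(u'単語数', fontproperties=fp)    # "number of words"
plt.yscale('log')
plt.ylim(8e-1, 1e5)
plt.show()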
Polarity analysis with a logistic regression model
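Now I apply a logistic regression model to the prepared data and run the polarity analysis experiment, using the LogisticRegression class. First I wrote a function, calc_accuracies, that takes an optimization solver and a cost and computes the accuracies. The value of max_iter was chosen arbitrarily so that training converges over the range of costs described below.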
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def calc_accuracies(solver, cost):
    # Train with the given solver and cost C; return [training accuracy, test accuracy].
    lr = LogisticRegression(solver=solver, max_iter=10000, C=cost)
    lr.fit(X_train_cv, y_train)
    y_train_pred = lr.predict(X_train_cv)
    y_test_pred = lr.predict(X_test_cv)
    return [accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)]
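I compute the accuracies with this function. The solver is set to sag (Stochastic Average Gradient descent)*3, and the cost is varied from 10^-6 to 10^6 by stepping the exponent in increments of 0.2. The cost corresponds to the inverse of the coefficient of the regularization term, so the larger the cost, the weaker the regularization and the more easily the model overfits.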
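costs = [pow(10, i) for i in np.arange(-6, 6.1, 0.2)]  # 61 values from 1e-6 to 1e+6
# A plain ndarray is used instead of np.matrix, which NumPy no longer recommends.
accuracies = np.array([calc_accuracies('sag', c) for c in costs])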
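The effect of the cost can be seen directly in the objective being minimized. The sketch below writes out an L2-penalized logistic loss in the C parameterization, 0.5 * ||w||^2 + C * (sum of log losses), in pure Python. The intercept is omitted, labels are taken as -1/+1, and the function name and toy data are my own assumptions for illustration, not code from scikit-learn.

```python
import math

def objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)), with y_i in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        data_loss += math.log1p(math.exp(-margin))  # fine for the small margins used here
    return penalty + C * data_loss

# Hypothetical toy problem: two one-word documents with opposite labels.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [+1, -1]
w = [2.0, -2.0]

small_c = objective(w, X, y, C=1e-6)  # dominated by the penalty term (about 4.0)
large_c = objective(w, X, y, C=1e6)   # dominated by the data term
print(small_c, large_c)
```

With a tiny C almost the whole objective is the penalty, so the optimizer is pushed toward small weights. With a huge C the penalty is negligible next to the data term, so large weights that fit individual training examples cost almost nothing.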
Let's plot the results. When the cost parameter is large, the training accuracy approaches 100%, which confirms that the model is overfitting. The test accuracy levels off at around 72%, which shows that the model keeps a certain amount of generalization ability even while overfitting the training data. This agrees closely with what I saw last time with the PHP implementation. I had expected the test accuracy to slope downward as overfitting set in, so this result surprised me.
plt.plot(costs, accuracies[:,0], label='Training')
plt.plot(costs, accuracies[:,1], label='Test')
plt.title(u'コストパラメータによる正解率の変化 (sag)', fontproperties=fp)  # "Accuracy as a function of the cost parameter (sag)"
plt.xlabel(u'コスト', fontproperties=fp)  # "cost"
plt.ylabel(u'正解率', fontproperties=fp)  # "accuracy"
plt.legend()
plt.xscale('log')
plt.xlim(costs[0], costs[-1])
plt.ylim(0.5, 1)
plt.show()
Experiment with LIBLINEAR
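Next, I switch the LogisticRegression solver to liblinear and run the same experiment. liblinear was the default solver of the LogisticRegression class at the time of writing (the default changed to lbfgs in scikit-learn 0.22).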
costs = [pow(10, i) for i in np.arange(-6, 6.1, 0.2)]
# A plain ndarray is used instead of np.matrix, which NumPy no longer recommends.
accuracies = np.array([calc_accuracies('liblinear', c) for c in costs])
The results are plotted in the same way as before. I omit the program, but the outcome is as follows. Unlike SAG, with LIBLINEAR the test accuracy drops as the model overfits. On the other hand, LIBLINEAR seems to keep an accuracy of about 70% even when the cost is too small (that is, when the coefficient of the regularization term is too large)*4.
Comparing the trained models
To see how the parameter settings affect what is learned, I pick a few representative values and train the models again. For each of sag and liblinear, I trained with the cost set to 10^-6 (high bias), 10^-0.6 (good), and 10^6 (high variance).
def train(solver, cost):
    lr = LogisticRegression(solver=solver, max_iter=10000, C=cost)
    lr.fit(X_train_cv, y_train)
    return lr

costs = [pow(10, i) for i in [-6, -0.6, 6]]  # high-bias, good, high-variance
lr_sag = [train('sag', c) for c in costs]
lr_lin = [train('liblinear', c) for c in costs]
From the trained models, let's display the words with the largest weights. The following example is for sag with the cost set to 10^-0.6.
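# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
features = cv.get_feature_names_out()
wc = np.sum(X_train_cv, axis=0)  # 1 x n_features matrix of word counts in the training set

def print_top_n_words(lr, n, negative=False):
    # Print word, weight, and frequency for the n largest (or, if negative, smallest) weights.
    sort_order = 1 if negative else -1
    sorted_idx = np.argsort(lr.coef_)[0,::sort_order]
    for i in sorted_idx[:n]:
        print('%16s\t%f\t%4d' % (features[i], lr.coef_[0,i], wc[0,i]))

print_top_n_words(lr_sag[1], 10)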
The result is as follows. The columns are, from left to right, the word, its weight, and its frequency.
powerful       1.113147    37
enjoyable      1.057301    53
solid          1.046076    50
warm           1.043495    28
entertaining   1.038387    98
touching       0.993614    40
performances   0.993271   149
engrossing     0.981329    24
unexpected     0.964950    19
heart          0.963876   106
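The ranking itself is just a sort of the vocabulary by weight. As a small stand-alone sketch of the same idea without NumPy or a trained model, the following uses made-up words, weights, and counts; the function name top_n_words is hypothetical.

```python
def top_n_words(features, weights, counts, n, negative=False):
    """Return (word, weight, count) for the n largest weights, or the n smallest if negative."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=not negative)
    return [(features[i], weights[i], counts[i]) for i in order[:n]]

# Made-up toy values, not taken from the trained models.
features = ["bad", "dull", "fun", "solid", "warm"]
weights = [-1.2, -0.8, 0.5, 1.0, 0.9]
counts = [120, 40, 131, 50, 28]

for word, weight, count in top_n_words(features, weights, counts, 2):
    print('%16s\t%f\t%4d' % (word, weight, count))
```

Passing negative=True lists the words that pull most strongly toward the negative class instead.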
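Changing the cost changes the learned weights considerably. Below, the left side shows the result for 10^-6 and the right side the result for 10^6*5. With a small cost (strong regularization), high-frequency words receive the highest weights; with a large cost the tendency is reversed.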
film           0.000080  1280    taut           15.092569     7
best           0.000039   203    liberating     13.974338     3
performances   0.000038   149    remarkable     11.096390    27
heart          0.000031   106    tape           10.901090     1
love           0.000029   191    serviceable    10.873075     4
world          0.000028   123    warm           10.780220    28
life           0.000028   212    engrossing     10.483317    24
funny          0.000027   245    despicable     10.477185     2
entertaining   0.000025    98    unexpected     10.360805    19
fun            0.000023   131    heartwarming   10.275091    14
Let's draw a graph to get an overall picture. In the plots, the left one was produced with 1e-0.6, exactly as in the code example, and the right one with 1e+6. The overall trend also confirms that the weights lean toward low-frequency words when the cost is large*6.
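# Flatten the 1 x n count matrix and weight array so scatter receives plain 1-D arrays.
plt.scatter(np.asarray(wc).ravel(), lr_sag[1].coef_.ravel(), 1, marker='.')
plt.title(u'単語の出現頻度と重みの分布 (sag, C=1e-0.6)', fontproperties=fp)  # "Word frequency vs. weight (sag, C=1e-0.6)"
plt.xlabel(u'出現頻度', fontproperties=fp)  # "frequency"
plt.ylabel(u'重み', fontproperties=fp)      # "weight"
plt.xlim(9e-1, 2e+3)
plt.xscale('log')
plt.show()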
The experiment with the liblinear solver gave a different result, so for comparison I check the weight of each word in the same way. Here is the result with the cost set to 10^-0.6. At this cost the weights are similar to those obtained with sag.
powerful       1.113019    37
enjoyable      1.057044    53
solid          1.045898    50
warm           1.043301    28
entertaining   1.038027    98
touching       0.993489    40
performances   0.992962   149
engrossing     0.981152    24
unexpected     0.964856    19
heart          0.963650   106
The results with the cost set to 10^-6 and 10^6 are shown below. With a cost of 10^6, the learned weight vector differs substantially from the one obtained with sag.
film           0.000080  1280    taut           28.132811     7
best           0.000039   203    demand         25.624158     2
performances   0.000038   149    crowdpleaser   25.316732     1
heart          0.000031   106    schaeffer      25.301165     4
love           0.000029   191    liberating     25.298221     3
world          0.000028   123    moist          24.340107     1
life           0.000028   212    skillful       24.266720     2
funny          0.000027   245    apocalypse     23.973220     1
entertaining   0.000025    98    town           23.276383     6
fun            0.000023   131    despicable     23.105512     2
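Let's plot the case where the cost is 10^6. The left side repeats the earlier sag graph, and the right side shows the result of training with liblinear. The vertical scales differ, which makes it a little hard to compare, but you can perhaps see that the liblinear shape narrows more sharply toward the high-frequency words.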
Let's use scatter plots to see how each word's weight differs between sag and liblinear. First, the following code plots the words whose frequency is 1.
idx = np.where(wc == 1)[1]  # column indices of words that occur exactly once
plt.scatter(lr_sag[2].coef_[0,idx], lr_lin[2].coef_[0,idx], 1, marker='.', edgecolors='none', c='k')
plt.title(u'出現頻度が 1 の単語に与えられた重みの比較 (C=1e+6)', fontproperties=fp)  # "Weights given to words with frequency 1 (C=1e+6)"
plt.xlabel('sag')
plt.ylabel('liblinear')
plt.xlim(-10, 10)
plt.ylim(-20, 20)
plt.show()
The result looks like this. A positive correlation is only natural, but there is a spike near the center. It shows that some words whose weight is close to 0 under sag are given large weights under liblinear.
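I plotted the words with frequency 2 and the words with frequency 3 or more in the same way*7. The spike seen in the previous graph becomes much less noticeable. From this it seems fair to say that, under parameter settings that cause overfitting, liblinear assigns weights that lean more heavily toward low-frequency words.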
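Selecting the words for each of these plots only requires a different frequency condition. As a stand-alone illustration with made-up counts, the hypothetical helper below returns the indices of words matching a condition; in the article's NumPy code the corresponding selections would presumably be `np.where(wc == 2)[1]` and `np.where(wc >= 3)[1]`.

```python
def indices_where(counts, predicate):
    """Indices of the words whose training-set frequency satisfies the predicate."""
    return [i for i, c in enumerate(counts) if predicate(c)]

# Made-up word frequencies for seven words.
counts = [1, 7, 2, 1, 3, 28, 2]

idx_once = indices_where(counts, lambda c: c == 1)
idx_twice = indices_where(counts, lambda c: c == 2)
idx_three_plus = indices_where(counts, lambda c: c >= 3)
print(idx_once, idx_twice, idx_three_plus)  # [0, 3] [2, 6] [1, 4, 5]
```

The three index lists are disjoint and together cover every word, so each word appears in exactly one of the three scatter plots.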
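One possible reason why the test accuracy differs between sag and liblinear is that, when the cost is large, liblinear learns low-frequency words more aggressively and, as a result, finishes iterating before it has sufficiently learned the words that carry generalization ability. I could not think of a way to verify this directly, but computing the accuracy after excluding low-frequency words from the training data should provide circumstantial evidence.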
As the first experiment showed, both sag and liblinear reach close to 100% accuracy on the training data when the cost is large enough to overfit. The more a model depends on low-frequency words, the more its accuracy should drop when those words are excluded. A model that depends on low-frequency words in the training data lacks generalization ability, so examining this should help explain the difference in test accuracy.
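I drew the graph with the following program. Instead of manipulating the training data, this implementation rewrites coef_ of the trained models to achieve the equivalent effect, because the training data is a huge sparse matrix and rewriting its contents is computationally expensive.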
accs_sag = []
accs_lin = []
# Weights are zeroed in place, so lr_sag[2] and lr_lin[2] are modified cumulatively.
for i in range(0, 21):
    lr_sag[2].coef_[0, np.where(wc == i)[1]] = 0
    lr_lin[2].coef_[0, np.where(wc == i)[1]] = 0
    accs_sag.append(accuracy_score(y_train, lr_sag[2].predict(X_train_cv)))
    accs_lin.append(accuracy_score(y_train, lr_lin[2].predict(X_train_cv)))

plt.plot(range(0, 21), accs_sag, marker='.', label='sag')
plt.plot(range(0, 21), accs_lin, marker='.', label='liblinear')
plt.title(u'低頻度語の重みを 0 としたときの学習データの正解率 (C=1e+6)', fontproperties=fp)  # "Training accuracy with low-frequency word weights set to 0 (C=1e+6)"
plt.xlabel(u'閾値 (出現頻度が n 以下の単語の重みを 0 とする)', fontproperties=fp)  # "threshold (weights of words with frequency <= n are set to 0)"
plt.ylabel(u'正解率', fontproperties=fp)  # "accuracy"
plt.legend()
plt.ylim(0.5, 1)
plt.show()
The result is shown below. As expected, the model trained with liblinear loses more accuracy when low-frequency words are removed, which shows that it depends more heavily on low-frequency words*8.
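The pruning experiment can be reproduced in miniature without scikit-learn. The sketch below uses a hand-written linear classifier and a made-up seven-document data set (all names and numbers are hypothetical) in which two rare words carry large weights. Unlike the code above, it prunes a copy of the weights for each threshold rather than modifying the model in place.

```python
def predict(weights, bias, row):
    """Linear decision rule: class 1 if the score is positive, otherwise class 0."""
    score = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if score > 0 else 0

def accuracy_after_pruning(weights, bias, X, y, counts, threshold):
    """Zero the weights of words occurring at most `threshold` times, then score on X."""
    pruned = [0.0 if c <= threshold else w for w, c in zip(weights, counts)]
    hits = sum(predict(pruned, bias, row) == label for row, label in zip(X, y))
    return hits / len(y)

# Columns: rare positive word, rare negative word, frequent positive word, frequent negative word.
X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0],
     [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 1, 1]]
y = [1, 0, 1, 1, 0, 0, 1]
counts = [sum(col) for col in zip(*X)]  # word frequencies: [1, 1, 3, 3]
weights = [5.0, -5.0, 1.0, -0.5]        # made-up "overfitted" weights
bias = 0.0

accs = [accuracy_after_pruning(weights, bias, X, y, counts, t) for t in range(4)]
print(accs)
```

Accuracy falls from 7/7 to 6/7 once the rare words are pruned, and to 3/7 when every weight is gone. A model that leaned even more on rare words would lose accuracy faster at low thresholds, which is the signal the experiment above looks for.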
*1: 『言語処理 100 本ノック』に PHP で挑む (問題 78 ~ 79) - y_uti のブログ ("Taking on NLP 100 Exercises in PHP (Problems 78-79)", y_uti's blog)
*2: How to obtain the data is described on the 『言語処理 100 本ノック』 website.
*3: Among the solvers provided, I chose the one that seemed closest to a naive steepest descent method.
*4: I am curious about the cause of this as well, but did not pursue it very far this time.
*5: This was made by placing the individual outputs side by side with the paste command.
*6: With the model trained at a cost of 1e-6, the scatter plot spreads out toward the high-frequency words.
*7: The implementation only changes the condition passed to where on the first line.
*8: Toward the right edge of the graph, the values fall below the test accuracy obtained in the first experiment. This shows that words carrying genuinely needed generalization ability have been dropped as well.