I have jotted down some notes on points to watch out for when building features as input data for machine learning.
Mistakes may well be included.
This is all fairly basic material, so a bit of searching will probably turn up more appropriate approaches.
Categorical data
Categorical data is data that takes one of a limited set of values, where the ordering of those values carries no meaning.
It is also called qualitative data or a nominal scale.
For example, in prefecture data, Hokkaido and Okinawa are different values, but no ordering between them can be defined.
(Of course, Hokkaido and Okinawa can be ordered by area and so on, but assume that is not the information we want.)
When turning categorical data into features, the usual advice is to create one binary feature per category that indicates whether the data point belongs to that category.
An example is shown below; read each line as the feature values of one data point.
Hokkaido: 1  Okinawa: 0  Tokyo: 0
Hokkaido: 0  Okinawa: 1  Tokyo: 0
Hokkaido: 0  Okinawa: 0  Tokyo: 1
Incidentally, I have recently been using scikit-learn, a machine-learning library for Python, and its DictVectorizer apparently does this binarization for you (in the example below, 住所 means "address", and the values are Hokkaido, Okinawa, and Tokyo).
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
print(vec.fit_transform([{'住所': '北海道'}, {'住所': '沖縄'}, {'住所': '東京'}]).toarray())
Running this prints the following:
[[ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
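A side note not in the original post: the column order is the sorted order of the feature names DictVectorizer generates, which you can inspect roughly like this (get_feature_names_out() on recent scikit-learn versions, get_feature_names() on older ones):

print(vec.get_feature_names_out())
# -> ['住所=北海道' '住所=東京' '住所=沖縄'], i.e. the columns are Hokkaido, Tokyo, Okinawa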
Another conceivable way to build the feature is to treat the categories as numbers, but that introduces a meaningless, artificial ordering, which hurts many learning methods.
If Hokkaido is 1, Okinawa is 2, and Tokyo is 3, it looks like this:
Address: 1
Address: 2
Address: 3
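For illustration only, here is a minimal sketch of that numeric encoding (the mapping is made up to mirror the table above); the comment spells out the ordering it silently introduces:

# Hypothetical mapping that turns the category into a single number.
prefecture_to_number = {'北海道': 1, '沖縄': 2, '東京': 3}
samples = ['北海道', '沖縄', '東京']
encoded = [prefecture_to_number[p] for p in samples]
print(encoded)  # [1, 2, 3] -- implies Hokkaido < Okinawa < Tokyo, and that Okinawa
                # lies "between" the other two, which is meaningless here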
When you want to use frequencies
For elements that occur multiple times in the data (words and so on), there are various ways to build features.
Sometimes you use not the raw frequency itself but a proportion or some weighted variant. Typical options include the following (a short scikit-learn sketch of the first two appears after the list):
- A binary value for whether the element appears at all
- The raw frequency
- The frequency weighted, e.g. divided by the overall frequency (tf-idf - Wikipedia, etc.)
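As a minimal sketch (not from the original post), the first two representations can be produced with scikit-learn's CountVectorizer; the two toy documents are made up:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple apple banana", "banana cherry"]             # made-up toy documents
counts = CountVectorizer().fit_transform(docs)             # raw frequencies
binary = CountVectorizer(binary=True).fit_transform(docs)  # contained or not
print(counts.toarray())  # [[2 1 0]   columns: apple, banana, cherry
                         #  [0 1 1]]
print(binary.toarray())  # [[1 1 0]
                         #  [0 1 1]]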
scikit-learn also has a class for the tf-idf transformation, TfidfTransformer.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> transformer
TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<type 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>
>>> tfidf.toarray()
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])

http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
Discretizing continuous features
How to feed in a feature with continuous values, such as a time, involves a lot of ambiguity.
Because the ordering of the values does not necessarily match the tendency you want the feature to capture, a common approach is discretization: split the range into several bins and, as with categorical data, use whether the value falls into each bin.
There are various ways to choose the bins; simple ones include the following (a small numpy sketch appears after the list):
- Equal widths: for example, bins of four hours each (1-4h, 5-8h, 9-12h, 13-16h, 17-20h, 21-24h)
- Equal frequencies: look at the data and place the cut points so that every bin contains the same number of samples
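Here is a minimal numpy sketch of both rules (not from the original post); the array of "hour of day" values is made up:

import numpy as np

hours = np.array([1, 2, 2, 3, 7, 9, 14, 18, 22, 23])  # made-up data

# Equal widths: a cut point every 4 hours.
width_edges = np.arange(4, 24, 4)        # 4, 8, 12, 16, 20
print(np.digitize(hours, width_edges))   # [0 0 0 0 1 2 3 4 5 5]

# Equal frequencies: cut points at quantiles, so each bin gets a similar share of the data.
freq_edges = np.quantile(hours, [0.25, 0.5, 0.75])
print(np.digitize(hours, freq_edges))    # [0 0 0 1 1 2 2 3 3 3]

# The resulting bin indices can then be one-hot encoded like the categorical data above.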
Alternatively, you can encode each cut point as a feature that is 1 whenever the value is at or below it.
For example, splitting at every five minutes into "at most 5", "at most 10" and "at most 15" minutes, a data point of 7 minutes is at most 10 minutes and at most 15 minutes, so two of the features become 1 (see the sketch below).
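A minimal sketch of this cumulative encoding (not from the original post), with the cut points from the example above:

import numpy as np

thresholds = np.array([5, 10, 15])          # "at most 5", "at most 10", "at most 15" minutes
minutes = 7
print((minutes <= thresholds).astype(int))  # [0 1 1]: 7 is not <= 5, but is <= 10 and <= 15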
scikit-learn does not have this functionality.
I have not used it myself, but orange, another Python library, apparently has discretization support.
Scaling
Mixing in features whose value ranges differ greatly in magnitude can make results worse.
In such cases, subtracting the mean and dividing by the standard deviation sometimes helps; depending on the situation it can also make things worse.
scikit-learn has the preprocessing.scale function, which makes this kind of scaling easy.
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
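One practical note not in the original post: preprocessing.scale standardizes the array you pass it in one shot. When the same shift and scale have to be applied to new data later, scikit-learn's StandardScaler can be fit on the training data and reused; a rough sketch (the new sample is made up):

from sklearn import preprocessing
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)  # learns per-column mean and std
X_new = np.array([[-1., 1., 0.]])                     # hypothetical new data
print(scaler.transform(X_new))                        # standardized with the training statistics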
Smoothing
Averaging over neighbouring time steps for time-series data, or over surrounding pixels for image data, can sometimes improve results (a small sketch follows).
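As a rough sketch (not from the original post), a simple moving average over a time series can be written with numpy; the series and the window length of 3 are arbitrary choices:

import numpy as np

x = np.array([1., 5., 2., 8., 3., 7.])      # made-up time series
window = np.ones(3) / 3                     # kernel that averages 3 neighbouring points
print(np.convolve(x, window, mode='same'))  # local averages; the two edge values
                                            # are affected by zero padding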