Feature Selection Using Mutual Information
This is a follow-up to "Evaluating Classification Accuracy on 20 Newsgroups" (2010/6/18). This time I'll try my hand at feature selection. In text classification, a feature basically means a word.
Feature selection
Last time, the output of the Naive Bayes classifier was
documents: 11269, vocabularies: 53852, categories: 20 accuracy: 0.802265156562
Here documents is the total number of training documents, categories is the number of categories in the training data, and vocabularies is the number of distinct words in the training data. In other words, the classifier takes 53852 words into account. Some of these words not only fail to contribute to classification but can act as noise and actually hurt performance. Stop words such as the, in, and to are one example. Words that appear with roughly the same frequency in every category are another. Such words are of no use for classification. The goal this time is to narrow down the 53852-word vocabulary further (that is, to select features).
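Mutual information
A representative approach to feature selection uses a measure called mutual information. I'll implement it following Section 13.5 of IIR (Introduction to Information Retrieval).
Mutual information is a quantity that measures the mutual dependence of two random variables. For feature selection in text classification, the mutual information I(U; C) is defined using a random variable U that represents the occurrence of a word t and a random variable C that represents the occurrence of a category c. U takes the value 1 or 0: U=1 is the event that the word t occurs, and U=0 is the event that it does not. C also takes the value 1 or 0: C=1 is the event that the category is c, and C=0 is the event that it is not. The definition of mutual information is

$$I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U=e_t, C=e_c) \log_2 \frac{P(U=e_t, C=e_c)}{P(U=e_t)\,P(C=e_c)}$$

Here t stands for term and c for category, and a concrete word and category take their place. The formula involves joint and marginal distributions, but they are easy to compute from a contingency table like the one below. For example, to find the mutual information between the word "iPhone" and the category "IT", the contingency table looks like this.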
| | Category is IT | Category is not IT |
|---|---|---|
| Contains the word iPhone | N11 | N10 |
| Does not contain the word iPhone | N01 | N00 |
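Here,
- N11 is the number of training documents whose category is IT (C=1) and that contain the word iPhone (U=1)
- N10 is the number of training documents whose category is not IT (C=0) and that contain the word iPhone (U=1)
- N01 is the number of training documents whose category is IT (C=1) and that do not contain the word iPhone (U=0)
- N00 is the number of training documents whose category is not IT (C=0) and that do not contain the word iPhone (U=0)
Once such a contingency table is available, with N = N00 + N01 + N10 + N11 the total number of training documents,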
- P(U=1, C=1) = N11 / N
- P(U=1) = (N10+N11) / N
- P(C=1) = (N01+N11) / N
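and so on: all the joint and marginal probabilities are easy to compute. Rewriting I(U; C) in terms of the counts N gives

$$I(U; C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.}N_{.0}}$$

Here N1. is shorthand for N10+N11, and likewise N.1 = N01+N11, N0. = N00+N01, and N.0 = N00+N10. One such contingency table is built for every combination of word and category. For example, with 50000 words and 20 categories there are 50000 x 20 contingency tables. Mutual information takes values of 0 or more, and the larger the value, the more the word can be regarded as characteristic of the category. I had to think a little about why, but computing the mutual information for a few extreme contingency tables makes it convincing. If a word and a category are completely independent, the mutual information is 0, so the intuition is presumably that such a word cannot be said to represent the content of the category.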
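The "extreme contingency table" experiment can be sketched roughly as follows. This is only an illustration: the function name `mutual_information_from_counts` is hypothetical, and the sketch assumes the usual convention that an empty cell contributes 0 (0 log 0 = 0) rather than being an error.

```python
import math


def mutual_information_from_counts(n11, n10, n01, n00):
    """I(U; C) in bits from the four cells of a 2x2 contingency table.

    Hypothetical helper for illustration only.
    """
    n = n11 + n10 + n01 + n00
    # (cell count, row total, column total) for each of the four cells
    cells = [
        (n11, n11 + n10, n11 + n01),  # U=1, C=1
        (n10, n11 + n10, n10 + n00),  # U=1, C=0
        (n01, n01 + n00, n11 + n01),  # U=0, C=1
        (n00, n01 + n00, n10 + n00),  # U=0, C=0
    ]
    score = 0.0
    for count, row_total, col_total in cells:
        if count == 0:
            continue  # assumption: an empty cell contributes 0 (0 log 0 = 0)
        score += count / n * math.log2(n * count / (row_total * col_total))
    return score


# Word and category are independent: every cell equals its expected count
print(mutual_information_from_counts(25, 25, 25, 25))  # 0.0
# The word occurs in exactly the documents of the category
print(mutual_information_from_counts(50, 0, 0, 50))    # 1.0
# The word occurs in exactly the documents outside the category
print(mutual_information_from_counts(0, 50, 50, 0))    # 1.0
```

The last case is worth noticing: mutual information measures the strength of the dependence, not its direction, so a word whose absence marks a category scores as high as one whose presence does.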
Extracting words with high mutual information
I'll implement the definition above straightforwardly in Python. The goal is to find, for each category of 20 Newsgroups, the top k words with the highest mutual information.
```python
# coding: utf-8
# feature_selection.py
import math
import sys
from collections import defaultdict


def mutual_information(target, data, k=0):
    """Return the top k words with the highest mutual information for category target"""
    # If k is not specified, return all words
    if k == 0:
        k = sys.maxsize

    V = set()
    N11 = defaultdict(float)  # N11[word] -> number of target documents that contain word
    N10 = defaultdict(float)  # N10[word] -> number of non-target documents that contain word
    N01 = defaultdict(float)  # N01[word] -> number of target documents that do not contain word
    N00 = defaultdict(float)  # N00[word] -> number of non-target documents that do not contain word
    Np = 0.0  # number of target documents
    Nn = 0.0  # number of non-target documents

    # Count N11 and N10
    for d in data:
        cat, words = d[0], d[1:]
        if cat == target:
            Np += 1
            for wc in words:
                word, count = wc.split(":")
                V.add(word)
                N11[word] += 1  # we are counting documents, so +1 is enough
        else:
            Nn += 1
            for wc in words:
                word, count = wc.split(":")
                V.add(word)
                N10[word] += 1

    # N01 and N00 follow directly
    for word in V:
        N01[word] = Np - N11[word]
        N00[word] = Nn - N10[word]

    # Total number of documents
    N = Np + Nn

    # Compute the mutual information of each word
    MI = []
    for word in V:
        n11, n10, n01, n00 = N11[word], N10[word], N01[word], N00[word]
        # If any count is 0.0 we would hit log2(0), so give the word a score of 0
        # (a simplification: words seen only inside or only outside target also get 0)
        if n11 == 0.0 or n10 == 0.0 or n01 == 0.0 or n00 == 0.0:
            MI.append((0.0, word))
            continue
        # Compute each term of the definition of mutual information
        temp1 = n11/N * math.log((N*n11)/((n10+n11)*(n01+n11)), 2)
        temp2 = n01/N * math.log((N*n01)/((n00+n01)*(n01+n11)), 2)
        temp3 = n10/N * math.log((N*n10)/((n10+n11)*(n00+n10)), 2)
        temp4 = n00/N * math.log((N*n00)/((n00+n01)*(n00+n10)), 2)
        score = temp1 + temp2 + temp3 + temp4
        MI.append((score, word))

    # Sort by mutual information in descending order and return the top k
    MI.sort(reverse=True)
    return MI[0:k]


if __name__ == "__main__":
    # Load the training data (one document per line: category word:count word:count ...)
    trainData = []
    with open("news20", "r", encoding="utf-8") as fp:
        for line in fp:
            line = line.rstrip()
            temp = line.split()
            trainData.append(temp)

    # Feature selection using mutual information
    target = "comp.graphics"
    features = mutual_information(target, trainData, k=10)
    print("[%s]" % target)
    for score, word in features:
        print(score, word)
```
Running it prints the 10 words with the highest mutual information for the comp.graphics category, in descending order.
```
[comp.graphics]
0.0396591914417 graphics
0.018190625901 image
0.013256620866 animation
0.0122108176792 gif
0.0109647921474 polygon
0.0102937113306 images
0.00984347501837 files
0.00839170021167 format
0.00832240938732 tiff
0.00799473476391 people
```
These certainly look like comp.graphics words. Let's try a few other categories as well.
```
[comp.os.ms-windows.misc]
0.094183661915 windows
0.0237696712358 dos
0.0184571833467 file
0.0176200422274 cica
0.0161838846441 win
0.014151739062 ms
0.0135110433857 files
0.0131853055706 drivers
0.0126673382015 driver
0.0126148229575 ini

[rec.sport.baseball]
0.0522161980385 baseball
0.031323260157 pitching
0.0241678683103 season
0.0226248072068 games
0.021565415199 mets
0.0204313593006 team
0.0198532689063 braves
0.0195487548191 hitter
0.0192974411302 game
0.0184112914927 phillies

[sci.space]
0.0648087600801 space
0.0414339611904 nasa
0.0410198092698 orbit
0.0307253416988 launch
0.0290759088133 moon
0.0266171725319 shuttle
0.0242961005478 lunar
0.0224255674403 earth
0.0208045078764 spacecraft
0.0197463695696 flight

[talk.politics.guns]
0.064023268956 gun
0.047075696447 guns
0.0372513306097 firearms
0.0321421838979 weapons
0.0213623165189 batf
0.0201845527411 handgun
0.019986956496 fire
0.018426078685 weapon
0.018311826034 fbi
0.0182303153541 waco

[talk.religion.misc]
0.0167788239223 jesus
0.0152625625906 god
0.014476187063 christian
0.013821963122 sandvik
0.0122269234191 kent
0.0121410088867 bible
0.0116866079726 christians
0.0115979054194 newton
0.010151710003 religion
0.00988892270487 christ
```
Looks right, looks right. It's quite interesting that mutual information can pull out the characteristic words of a category.
Naive Bayes with feature selection
Now for the main topic: making the Naive Bayes classifier support feature selection. The news20 and news20.t data stay as they are; the Naive Bayes program itself uses only the top k vocabulary words with the largest mutual information. The value of k is passed to the constructor of the NaiveBayes class.
```python
# coding: utf-8
import math
from collections import defaultdict

from feature_selection import mutual_information


# Naive Bayes classifier with feature selection
class NaiveBayes:
    """Multinomial Naive Bayes"""

    def __init__(self, k):
        # Use the k words with the largest mutual information as the vocabulary
        self.categories = set()    # set of categories
        self.vocabularies = set()  # set of vocabulary words
        self.wordcount = {}        # wordcount[cat][word]: occurrences of word in category
        self.catcount = {}         # catcount[cat]: occurrences of category
        self.denominator = {}      # denominator[cat]: denominator of P(word|cat)
        self.k = k                 # vocabulary size

    def train(self, data):
        """Train the Naive Bayes classifier"""
        # Extract the categories from the documents and initialize the dictionaries
        for d in data:
            cat = d[0]
            self.categories.add(cat)
        for cat in self.categories:
            self.wordcount[cat] = defaultdict(int)
            self.catcount[cat] = 0
        # Narrow down the vocabulary by feature selection
        L = []
        for cat in self.categories:
            features = mutual_information(cat, data)
            L.extend(features)
        L.sort(reverse=True)
        for i in range(len(L)):
            # L[i] = (score, word), so the word is L[i][1]
            self.vocabularies.add(L[i][1])
            # Stop once the vocabulary reaches the requested size
            if len(self.vocabularies) == self.k:
                break
        # Count categories and words over the documents
        for d in data:
            cat, doc = d[0], d[1:]
            self.catcount[cat] += 1
            for wc in doc:
                word, count = wc.split(":")
                count = int(count)
                # Ignore words that are not in the vocabulary
                if word not in self.vocabularies:
                    continue
                self.wordcount[cat][word] += count
        # Precompute the denominators of the word conditional probabilities (for speed)
        for cat in self.categories:
            self.denominator[cat] = sum(self.wordcount[cat].values()) + len(self.vocabularies)

    def classify(self, doc):
        """Return the category with the largest log posterior log(P(cat|doc))"""
        best = None
        max_score = -float("inf")
        for cat in self.catcount.keys():
            p = self.score(doc, cat)
            if p > max_score:
                max_score = p
                best = cat
        return best

    def wordProb(self, word, cat):
        """Compute the conditional probability P(word|cat) of a word"""
        # Apply Laplace smoothing
        # The denominator was precomputed at the end of train()
        return float(self.wordcount[cat][word] + 1) / float(self.denominator[cat])

    def score(self, doc, cat):
        """Compute the log posterior log(P(cat|doc)) of a category given a document"""
        total = sum(self.catcount.values())  # total number of documents
        score = math.log(float(self.catcount[cat]) / total)  # log P(cat)
        for wc in doc:
            word, count = wc.split(":")
            count = int(count)
            # Ignore words that are not in the vocabulary
            if word not in self.vocabularies:
                continue
            # Taking logs turns the product into a sum: count occurrences add count * log P(word|cat)
            score += count * math.log(self.wordProb(word, cat))
        return score

    def __str__(self):
        total = sum(self.catcount.values())  # total number of documents
        return "documents: %d, vocabularies: %d, categories: %d" % (total, len(self.vocabularies), len(self.categories))
```
I plotted a graph with the vocabulary size on a logarithmic horizontal axis and the classification accuracy on the vertical axis.
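The script that measured the accuracy for each vocabulary size is not shown in this post. A minimal sketch of the measurement might look like the following, assuming test rows have the same shape as the training rows (`[category, "word:count", ...]`). The names `accuracy`, `toy_test`, and `toy_classify` are hypothetical and exist only to keep the example self-contained.

```python
def accuracy(classify, test_data):
    """Fraction of test documents whose predicted category matches the true one.

    classify:  callable that takes a document (list of "word:count" strings)
               and returns a category name
    test_data: list of rows of the form [category, "word:count", ...]
    """
    if not test_data:
        return 0.0
    hits = sum(1 for row in test_data if classify(row[1:]) == row[0])
    return hits / len(test_data)


# Toy data and a stand-in classifier (hypothetical, for illustration only)
toy_test = [
    ["sci.space", "orbit:2", "nasa:1"],
    ["rec.sport.baseball", "pitching:1"],
    ["sci.space", "gun:1"],
]


def toy_classify(doc):
    words = {wc.split(":")[0] for wc in doc}
    return "sci.space" if words & {"orbit", "nasa"} else "rec.sport.baseball"


print(accuracy(toy_classify, toy_test))  # 2 of 3 documents are classified correctly
```

With the classifier above, the same function would presumably be called as `accuracy(nb.classify, testData)` after `nb = NaiveBayes(k)` and `nb.train(trainData)`, repeated for each value of k to be plotted.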
Huh? Reducing the vocabulary size lowers the accuracy... Apparently this does happen. The source paper,
- McCallum et al., A Comparison of Event Models for Naive Bayes Text Classification (PDF), Figure 3
shows the same result. However, for some datasets reducing the vocabulary size does seem to improve accuracy, and IIR includes an example run on a dataset called Reuters-RCV1. The original vocabulary size is 132776, but narrowing it down to the 100 words with the largest mutual information raises the accuracy by nearly 20%. Since that can happen too, it's probably worth simply trying it out.