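Introduction
I believe that if you keep at it, something is waiting for you. This is nikkie.
Since the end of December 2019 I have been writing one blog post a week on natural language processing (NLP).
Since the week of 2/3 I have been working through "入門 自然言語処理" (the Japanese edition of "Natural Language Processing with Python") to firm up my NLP fundamentals.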
- Authors: Steven Bird, Ewan Klein, Edward Loper
- Release date: 2010/11/11
- Format: large-format book
This week I worked through chapter 6, "Learning to Classify Text".
- I learned which NLP problem settings machine learning can be applied to
- I learned how to approach classification problems with machine learning in NLTK
Chapter 6 is publicly available online.
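Table of contents
- Introduction
- Table of contents
- Environment
- Natural language processing and machine learning
- Classifying names as male or female
- Trying other classifiers
- What I left untried
- Thoughts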
Environment
I am continuing with the same environment as in previous weeks (the BuildVersion changed after a macOS update).
    $ sw_vers
    ProductName:    Mac OS X
    ProductVersion: 10.14.6
    BuildVersion:   18G3020
    $ python -V  # using a virtual environment created with venv
    Python 3.7.3
    $ pip list  # excerpt, filtered with grep
    beautifulsoup4   4.8.2
    ipython          7.12.0
    matplotlib       3.1.3
    nltk             3.4.5
    wordcloud        1.6.0
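Natural language processing and machine learning
Some NLP problem settings can be treated as classification [1] in the machine learning sense.
For example:
- Given a name, classify it as male or female
- Classify a movie review as positive or negative
- Part-of-speech tagging (given a token, classify its part of speech, such as noun or verb)
- Sentence segmentation: classify whether a punctuation mark ends a sentence
and so on.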
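To make the last item concrete, here is a minimal sketch of how sentence segmentation can be framed as classification: each punctuation token becomes one example described by a feature dict. The feature names and the toy token list are illustrative assumptions, not taken from the post.

```python
def punct_features(tokens, i):
    """Describe the punctuation token at position i with a few simple features."""
    next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
    prev_token = tokens[i - 1] if i > 0 else ""
    return {
        "punct": tokens[i],
        "next_word_capitalized": next_token[:1].isupper(),
        "prev_word": prev_token.lower(),
        "prev_word_is_one_char": len(prev_token) == 1,
    }

# Hypothetical toy input: the first "." belongs to an abbreviation.
tokens = ["Mr", ".", "Smith", "left", ".", "He", "returned", "."]
candidates = [i for i, tok in enumerate(tokens) if tok in {".", "?", "!"}]
featuresets = [punct_features(tokens, i) for i in candidates]

for i, features in zip(candidates, featuresets):
    print(i, features)
```

A classifier would then be trained on such feature dicts paired with a label saying whether the punctuation really ends a sentence.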
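NLTK and machine learning
Implementations for classification problems are provided in NLTK's nltk.classify package.
You create a trained classifier by calling the train method provided on a classifier class.
The following five classifier classes are defined [2]. Chapter 6 covers the decision tree, naive Bayes, and maximum entropy classifiers (sections 6.4 to 6.6).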
- ConditionalExponentialClassifier
- DecisionTreeClassifier
- MaxentClassifier
- NaiveBayesClassifier
- WekaClassifier
The workflow for classification in NLTK
"入門 自然言語処理" translates the word "feature" as 素性.
As a translator's note points out, "feature" is also commonly translated as 特徴量 (p.241).
I am more used to 特徴量, so I read 素性 as 特徴量 throughout.
When tackling classification, the goal is to build a classifier.
1. Apply a feature extractor to the input to obtain features
   - Decide which features are relevant to the problem
   - Decide how to encode those features
2. Build a classifier from the features and labels
Doing steps 1 and 2 just once rarely produces a high-performing classifier.
I learned that it pays to go back to step 1 based on the classifier's mistakes (error analysis).
For error analysis, how you split the data is the key point [3].
- Development set
  - Training set: used to train the classifier
  - Validation set: used for error analysis
- Test set: used to evaluate performance on data the classifier has never seen
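The three-way split above can be sketched as a small helper. This is only an illustration: `split_dataset` is a hypothetical function, and the two fractions and the seed are assumed defaults rather than values prescribed by the book.

```python
import random

def split_dataset(items, test_fraction=0.1, devtest_fraction=0.1, seed=0):
    """Shuffle once, then cut into training, validation (devtest) and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_devtest = int(len(shuffled) * devtest_fraction)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy data: 100 integers stand in for labeled examples.
train, devtest, test = split_dataset(range(100))
print(len(train), len(devtest), len(test))
```

The training and validation sets together form the development set. The test set is set aside and only used for the final evaluation.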
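About splitting the data
There is a trade-off in dividing data between the development set and the test set.
- If the development set is too small, the classifier cannot learn enough
- If the test set is too small, you cannot tell how well the classifier generalizes
Section 6.3 says that when a large amount of labeled data is available, it is common to use 10% of all the data as test data (p.257).
Also, data from a single document should not end up in both the development set and the test set [4].
Split at the document level, with some documents used for the development set and others for the test set (likewise, within the development set, separate the documents used for training from those used for validation).
🙆 OK example (documents are kept separate between the development set and the test set [5])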
    In [3]: from nltk.corpus import brown

    In [4]: file_ids = brown.fileids(categories='news')

    In [5]: len(file_ids)
    Out[5]: 44

    In [6]: type(file_ids)
    Out[6]: list

    In [7]: file_ids[:3]
    Out[7]: ['ca01', 'ca02', 'ca03']

    In [8]: size = int(len(file_ids) * 0.1)

    In [9]: size
    Out[9]: 4

    In [10]: train_set = brown.tagged_sents(file_ids[size:])

    In [11]: test_set = brown.tagged_sents(file_ids[:size])

    In [12]: len(train_set)
    Out[12]: 4227

    In [13]: len(test_set)
    Out[13]: 396
Shuffling file_ids before splitting would probably be fine too.
🙅 NG example (shuffling the output of tagged_sents raises a leakage concern)
    In [1]: import random

    In [2]: random.seed(42)

    In [14]: tagged_sents = list(brown.tagged_sents(categories='news'))  # bad example, do not copy this

    In [15]: random.shuffle(tagged_sents)

    In [16]: size = int(len(tagged_sents) * 0.1)

    In [17]: size
    Out[17]: 462

    In [18]: train_set, test_set = tagged_sents[size:], tagged_sents[:size]

    In [19]: len(train_set)
    Out[19]: 4161

    In [20]: len(test_set)
    Out[20]: 462
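Combining the OK example with the idea of shuffling file_ids, a document-level split could look like the sketch below. It uses a hypothetical in-memory corpus (a dict from file ID to sentences) instead of the Brown corpus so that it runs without any download. `split_by_document` is an illustrative helper, not an NLTK function.

```python
import random

def split_by_document(sentences_by_file, test_fraction=0.1, seed=42):
    """Shuffle file IDs, then assign whole documents to either train or test."""
    file_ids = sorted(sentences_by_file)
    random.Random(seed).shuffle(file_ids)
    size = max(1, int(len(file_ids) * test_fraction))
    test_ids, train_ids = file_ids[:size], file_ids[size:]
    train = [s for fid in train_ids for s in sentences_by_file[fid]]
    test = [s for fid in test_ids for s in sentences_by_file[fid]]
    return train, test, set(train_ids), set(test_ids)

# Hypothetical toy corpus: 10 documents with 3 "sentences" each.
corpus = {f"ca{n:02d}": [f"ca{n:02d}-sent{k}" for k in range(3)] for n in range(1, 11)}
train, test, train_ids, test_ids = split_by_document(corpus)

# No document contributes sentences to both sides, so there is no leakage
# of the kind seen in the NG example.
print(sorted(test_ids), len(train), len(test))
```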
Classifying names as male or female
Now, as an example, let's classify names as male or female.
Given a name written in the Latin alphabet, we classify it as a male name or a female name.
- Use the names corpus
- Use the last letter of the name as the feature
- Use NaiveBayesClassifier (DecisionTreeClassifier and MaxentClassifier are tried later)
- Follow the book in doing error analysis and updating the feature extractor
The names corpus
    In [21]: from nltk.corpus import names

    In [29]: print(names.readme())
    Names Corpus, Version 1.3 (1994-03-29)
    Copyright (C) 1991 Mark Kantrowitz
    Additions by Bill Ross

    This corpus contains 5001 female names and 2943 male names, sorted
    alphabetically, one per line.
    # (omitted)
    Mark Kantrowitz <mkant+@cs.cmu.edu>
    http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/

    In [30]: names = [(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')]  # note: rebinds names, shadowing the corpus reader

    In [31]: len(names)
    Out[31]: 7944
Feature extraction & data split
    In [32]: random.shuffle(names)

    In [33]: def gender_features(word):  # take the last letter
        ...:     return {'last_letter': word[-1]}
        ...:

    In [35]: names[:3]
    Out[35]: [('Raye', 'female'), ('Marita', 'female'), ('Fey', 'female')]

    In [51]: devtest_names = names[500:1500]  # 1000 names

    In [52]: test_names = names[:500]  # 500 names

    In [54]: train_names = names[1500:]

    In [55]: len(train_names)
    Out[55]: 6444

    In [57]: train_set = [(gender_features(n), g) for n, g in train_names]

    In [58]: devtest_set = [(gender_features(n), g) for n, g in devtest_names]

    In [59]: test_set = [(gender_features(n), g) for n, g in test_names]
Building the classifier (NaiveBayesClassifier)
We build a classifier from nltk.classify.naivebayes.NaiveBayesClassifier (how it works is explained in section 6.5).
    In [26]: import nltk

    In [60]: classifier = nltk.NaiveBayesClassifier.train(train_set)

    In [61]: nltk.classify.accuracy(classifier, devtest_set)
    Out[61]: 0.766
Accuracy on the validation set was 76.6%.
You can also inspect the features that are most informative for classification.
    In [62]: classifier.show_most_informative_features(5)
    Most Informative Features
                 last_letter = 'a'            female : male   =     31.4 : 1.0
                 last_letter = 'f'              male : female =     26.9 : 1.0
                 last_letter = 'k'              male : female =     26.8 : 1.0
                 last_letter = 'p'              male : female =     11.3 : 1.0
                 last_letter = 'd'              male : female =      9.8 : 1.0
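The ratios in this output compare how likely a feature value is under one label versus the other. The sketch below computes such likelihood ratios from raw counts. It assumes add-0.5 smoothing and uses made-up toy counts. `likelihood_ratios` is a hypothetical helper for illustration, not NLTK's implementation.

```python
from collections import Counter, defaultdict

def likelihood_ratios(labeled_featuresets, feature_name, smoothing=0.5):
    """For each feature value, return (likelier label, other label, P ratio)."""
    counts = defaultdict(Counter)  # label -> Counter of feature values
    for features, label in labeled_featuresets:
        counts[label][features[feature_name]] += 1
    values = {value for counter in counts.values() for value in counter}
    bins = len(values)
    ratios = {}
    for value in values:
        # Smoothed estimate of P(feature value | label) for every label.
        probs = {
            label: (counter[value] + smoothing) / (sum(counter.values()) + smoothing * bins)
            for label, counter in counts.items()
        }
        best = max(probs, key=probs.get)
        worst = min(probs, key=probs.get)
        ratios[value] = (best, worst, probs[best] / probs[worst])
    return ratios

# Toy counts (assumption): 10 female and 10 male names.
toy = (
    [({"last_letter": "a"}, "female")] * 9 + [({"last_letter": "n"}, "female")] * 1
    + [({"last_letter": "a"}, "male")] * 1 + [({"last_letter": "n"}, "male")] * 3
    + [({"last_letter": "k"}, "male")] * 6
)
ratios = likelihood_ratios(toy, "last_letter")
for value, (best, worst, ratio) in sorted(ratios.items(), key=lambda kv: -kv[1][2]):
    print(f"last_letter = {value!r}  {best} : {worst} = {ratio:.1f} : 1.0")
```

A value that is common under one label and rare under the other gets a large ratio, which is why it is "informative".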
Error analysis
    In [63]: errors = []

    In [65]: for name, tag in devtest_names:
        ...:     guess = classifier.classify(gender_features(name))
        ...:     if guess != tag:
        ...:         errors.append((tag, guess, name))
        ...:

    In [66]: len(errors)
    Out[66]: 234

    In [67]: errors[:5]
    Out[67]:
    [('female', 'male', 'Marlo'),
     ('female', 'male', 'Hildagard'),
     ('female', 'male', 'Melicent'),
     ('female', 'male', 'Moll'),
     ('male', 'female', 'Georgie')]
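So that error analysis can show how the classifier went wrong, names (pairs of a name and its gold label) is what gets split into training, validation, and test portions.
Sure enough, looking at this output gives some ideas for improvement.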
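One way to look for patterns in a long error list is to tally the errors by name ending. The sketch below does this for the five errors shown in the output above. Grouping by the last two letters is an assumption chosen for illustration.

```python
from collections import Counter

# The first five errors from the session above: (gold label, guess, name).
errors = [
    ("female", "male", "Marlo"),
    ("female", "male", "Hildagard"),
    ("female", "male", "Melicent"),
    ("female", "male", "Moll"),
    ("male", "female", "Georgie"),
]

# Count how often each (two-letter suffix, gold label, guess) combination occurs.
by_suffix = Counter((name[-2:], tag, guess) for tag, guess, name in errors)

for (suffix, tag, guess), count in by_suffix.most_common():
    print(f"suffix={suffix!r} correct={tag:<8} guess={guess:<8} count={count}")
```

With all 234 errors, the most frequent combinations would point at name endings that the last-letter feature handles badly.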
One thing the errors show:
The classifier labels names ending in h as female, but names ending in ch should be classified as male
So we also extract the last two letters as a feature.
    In [68]: def gender_features(word):
        ...:     return {'suffix1': word[-1:],
        ...:             'suffix2': word[-2:]}
        ...:

    In [69]: train_set = [(gender_features(n), g) for n, g in train_names]

    In [70]: devtest_set = [(gender_features(n), g) for n, g in devtest_names]

    In [72]: classifier = nltk.NaiveBayesClassifier.train(train_set)

    In [73]: nltk.classify.accuracy(classifier, devtest_set)
    Out[73]: 0.788

    In [75]: classifier.show_most_informative_features(5)
    Most Informative Features
                     suffix2 = 'na'           female : male   =     87.6 : 1.0
                     suffix2 = 'la'           female : male   =     62.8 : 1.0
                     suffix2 = 'ta'           female : male   =     37.4 : 1.0
                     suffix2 = 'rd'             male : female =     35.3 : 1.0
                     suffix2 = 'ia'           female : male   =     33.7 : 1.0
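Accuracy rose to 78.8%.
Error analysis could clearly be repeated further.
Note that each time you repeat error analysis, it is apparently better to re-split the training and validation sets.
The reason is to keep the quirks of one particular validation set from being reflected in the classifier.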
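Re-splitting on each round can be as simple as reshuffling the development data with a different seed while leaving the test set untouched. A minimal sketch, where `resplit` and the use of the round number as the seed are illustrative assumptions:

```python
import random

def resplit(dev_items, devtest_size, round_number):
    """Return a fresh (training, validation) split of the development set."""
    shuffled = list(dev_items)
    random.Random(round_number).shuffle(shuffled)  # a different order each round
    return shuffled[devtest_size:], shuffled[:devtest_size]

# Toy development set: 20 integers stand in for (name, label) pairs.
dev_items = list(range(20))
for round_number in (1, 2, 3):
    train_part, devtest_part = resplit(dev_items, devtest_size=5, round_number=round_number)
    print(round_number, sorted(devtest_part))
```

Because only the development set is reshuffled, the test set stays unseen until the final evaluation.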
Trying other classifiers
1. Decision tree (DecisionTreeClassifier)
Let's try nltk.classify.decisiontree.DecisionTreeClassifier.
    In [76]: classifier = nltk.DecisionTreeClassifier.train(train_set)

    In [77]: nltk.classify.accuracy(classifier, devtest_set)
    Out[77]: 0.782
Decision trees are explained in section 6.4.
A decision tree is a flowchart, and one of its characteristics is apparently that it is often easy to interpret.
The pseudocode method lets you view the flowchart as text.
    In [80]: print(classifier.pseudocode(depth=2))
    if suffix2 == 'Ag': return 'female'
    if suffix2 == 'Al': return 'male'
    if suffix2 == 'Bo': return 'male'
    if suffix2 == 'Cy': return 'male'
    if suffix2 == 'Di': return 'female'
    # (omitted)
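A decision tree is built by choosing decision stumps (a decision stump is a decision tree with a single split based on one feature) [6].
You pick a decision stump for the root and check the accuracy at each leaf. Where the accuracy is not good enough, you replace that leaf with a new decision stump, growing the tree as you go.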
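The root selection step can be sketched as follows. This sketch picks the stump with the highest accuracy on the training data, which is a simplifying assumption. The helper names and the six toy names are hypothetical, and this is not how NLTK's DecisionTreeClassifier is implemented.

```python
from collections import Counter, defaultdict

def train_stump(labeled_featuresets, feature_name):
    """Build a stump: map each value of one feature to its majority label."""
    by_value = defaultdict(Counter)
    for features, label in labeled_featuresets:
        by_value[features[feature_name]][label] += 1
    default = Counter(label for _, label in labeled_featuresets).most_common(1)[0][0]
    table = {value: counter.most_common(1)[0][0] for value, counter in by_value.items()}
    return lambda features: table.get(features.get(feature_name), default)

def accuracy(classify, labeled_featuresets):
    correct = sum(classify(features) == label for features, label in labeled_featuresets)
    return correct / len(labeled_featuresets)

def best_stump(labeled_featuresets, feature_names):
    """Try one stump per feature and keep the most accurate one."""
    stumps = {name: train_stump(labeled_featuresets, name) for name in feature_names}
    best = max(feature_names, key=lambda name: accuracy(stumps[name], labeled_featuresets))
    return best, stumps[best]

def extract(name):
    return {"suffix1": name[-1], "first_letter": name[0]}

# Toy training data (assumption).
toy = [("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
       ("Mark", "male"), ("Alfred", "male"), ("Frank", "male")]
train_set = [(extract(name), label) for name, label in toy]

root_feature, root = best_stump(train_set, ["first_letter", "suffix1"])
print(root_feature, accuracy(root, train_set))
```

Growing a full tree would repeat this selection on the subset of examples that reach each insufficiently accurate leaf.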
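Decision trees have the advantage of being easy to interpret, but they also have drawbacks.
- They force features to be checked in a specific order, even when the features are relatively independent
- They are not good at handling features that contribute only weakly to the label
My understanding is that because a decision tree has to be a flowchart, it ends up checking features in a specific order.
The naive Bayes classifier (NaiveBayesClassifier), by contrast, apparently overcomes this problem of decision trees by letting all features act "in parallel".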
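To illustrate what "in parallel" means, here is a from-scratch naive Bayes sketch. Every feature adds its own log-probability term to each label's score, and no feature is consulted before another. The add-0.5 smoothing and the toy names are assumptions, and this is not NLTK's implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets, smoothing=0.5):
    label_counts = Counter(label for _, label in labeled_featuresets)
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    values_seen = defaultdict(set)       # feature name -> set of observed values
    for features, label in labeled_featuresets:
        for name, value in features.items():
            value_counts[label, name][value] += 1
            values_seen[name].add(value)
    total = len(labeled_featuresets)

    def classify(features):
        scores = {}
        for label, count in label_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in features.items():
                if name not in values_seen:
                    continue  # ignore features never seen in training
                bins = len(values_seen[name]) + 1  # one extra bin for unseen values
                seen = value_counts[label, name]
                prob = (seen[value] + smoothing) / (sum(seen.values()) + smoothing * bins)
                score += math.log(prob)  # each feature contributes independently
            scores[label] = score
        return max(scores, key=scores.get)

    return classify

def extract(name):
    return {"suffix1": name[-1], "first_letter": name[0]}

# Toy training data (assumption).
toy = [("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
       ("Mark", "male"), ("Alfred", "male"), ("Frank", "male")]
classify = train_naive_bayes([(extract(name), label) for name, label in toy])
print(classify(extract("Martha")), classify(extract("Alek")))
```

Since the score is a plain sum, changing the order of the features changes nothing. This is the contrast with the fixed checking order of a decision tree.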
2. Maximum entropy classifier (MaxentClassifier)
Let's try nltk.classify.maxent.MaxentClassifier.
    In [81]: classifier = nltk.MaxentClassifier.train(train_set)
      ==> Training (100 iterations)

          Iteration    Log Likelihood    Accuracy
          ---------------------------------------
                 1          -0.69315        0.367
                 2          -0.34214        0.792
          # (omitted)
                99          -0.30114        0.805
             Final          -0.30113        0.805

    In [82]: nltk.classify.accuracy(classifier, devtest_set)
    Out[82]: 0.788
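The maximum entropy classifier is explained in section 6.6.
- The naive Bayes classifier uses probabilities as its model parameters
- The maximum entropy classifier searches for the set of parameters that maximizes the classifier's performance (roughly, the total likelihood)
Note: I have not managed to read the theory part yet, so that remains homework.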
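As a rough illustration of "searching for parameters that maximize the likelihood", the sketch below trains a tiny maximum entropy model with plain gradient ascent. The optimizer, learning rate, and toy data are assumptions, and NLTK's own training algorithms differ. The starting log likelihood of log 0.5 is consistent with the -0.69315 on the first line of the training log above.

```python
import math

def joint_features(features, label):
    # One indicator per (feature name, feature value, label) combination.
    return [(name, value, label) for name, value in features.items()]

def label_probs(weights, features, labels):
    scores = {label: sum(weights.get(jf, 0.0) for jf in joint_features(features, label))
              for label in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def avg_log_likelihood(weights, data, labels):
    return sum(math.log(label_probs(weights, f, labels)[gold]) for f, gold in data) / len(data)

def train_maxent(data, labels, iterations=50, learning_rate=0.5):
    weights, history = {}, []
    for _ in range(iterations):
        history.append(avg_log_likelihood(weights, data, labels))
        gradient = {}
        for features, gold in data:
            probs = label_probs(weights, features, labels)
            for label in labels:
                observed = 1.0 if label == gold else 0.0
                for jf in joint_features(features, label):
                    # Gradient = observed count minus expected count of the indicator.
                    gradient[jf] = gradient.get(jf, 0.0) + observed - probs[label]
        for jf, g in gradient.items():
            weights[jf] = weights.get(jf, 0.0) + learning_rate * g / len(data)
    history.append(avg_log_likelihood(weights, data, labels))
    return weights, history

def extract(name):
    return {"suffix1": name[-1], "first_letter": name[0]}

# Toy training data (assumption).
toy = [("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
       ("Mark", "male"), ("Alfred", "male"), ("Frank", "male")]
data = [(extract(name), label) for name, label in toy]
labels = ["female", "male"]
weights, history = train_maxent(data, labels)
print(round(history[0], 5), round(history[-1], 5))
```

Unlike naive Bayes, the weights here are not probabilities estimated from counts. They are free parameters adjusted until the likelihood of the training labels stops improving.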
What I left untried
- nltk.classify.util.apply_features, which builds a LazyMap (to keep memory use down)
- Examples other than name classification: movie reviews
- Part-of-speech tagging
- Using context
- Using a joint classifier (sequence classifier)
- For the algorithms, fill in the gaps where I can using "見て試してわかる機械学習アルゴリズムの仕組み 機械学習図鑑"
- I want to apply cross-validation to the training and validation sets (is it in nltk? Or should I use it from sklearn?)
- There are said to be "recommended machine learning packages that are supported by NLTK", but which packages are they? (I could not find them on the NLTK site)
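On the cross-validation item: independent of whether nltk or sklearn provides it, the splitting itself is simple enough to sketch with the standard library. `k_fold_splits` is a hypothetical helper, and the integers stand in for labeled names from the development set.

```python
def k_fold_splits(items, k):
    """Yield (training, validation) pairs, each fold serving once as validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

# Toy development set (assumption): shuffle real data before folding.
dev_items = list(range(10))
splits = list(k_fold_splits(dev_items, k=5))
for training, validation in splits:
    print(validation, len(training))
```

Training one classifier per split and averaging the validation accuracies would give an estimate that depends less on any single split.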
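Thoughts
"NLTK is feature-rich!" That one phrase sums it up.
Until now I had been learning how to handle natural language, such as stemming and tagging, and how to compute statistics from corpora.
On top of that, it even implements classifiers like the ones in scikit-learn!
It is as "batteries included" as Django is in web development.
Can NLTK go as far as the cross-validation introduced in the book, or is that better left to scikit-learn? I am curious about the limits of machine learning in NLTK.
So far I have gone through the basics of NLP with "入門 自然言語処理".
There were quite a few moments of discovering "I could have done that part of my day job better".
That said, the state of NLP today is different from when this book was written (2009).
Machine learning has advanced remarkably over the past 10 years, and document classification with deep learning, which this book does not cover, appears even in the tutorials of the various frameworks.
So next time I plan to catch up on where machine learning is now and finally get my hands on BERT.
"That book", which came out recently and is getting a lot of attention, is also a candidate.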
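- [1] Classification in supervised learning (the problem setting of predicting discrete values)
- [2] From "Classifiers" in the NLTK HOWTOs
- [3] This is the same as the general discussion of how to split data in machine learning. What we want from machine learning is a model that generalizes. So what we ultimately want to know when building a model is how accurately it classifies data that was not used for training
- [4] I understood this as leakage in the NLP context
- [5] The tagged_sents method of NLTK corpora accepts file IDs as its first argument (until now I had been calling it without file IDs)
- [6] Information gain is apparently the common way to choose a stump (from a part of the book I have not read yet)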