Having fun analyzing the screenplay of the movie "The Social Network" with NLTK
Note: this article contains a fair number of spoilers for the movie "The Social Network". If you are planning to watch it, you would be wise to stop reading here.
As I recently declared on this blog, I am studying natural language processing with Python's NLTK (Natural Language ToolKit) while working through 『入門 自然言語処理』 (the Japanese edition of "Natural Language Processing with Python"). I am happy that the book is written so that even someone like me, who had never touched Python before, can read along and actually understand it.
Incidentally, a little while ago I watched the movie "The Social Network" and really enjoyed the characters' dialogue and behavior. I recently learned that its screenplay is published on the movie's official site. A movie screenplay contains plenty of distinctive expressions and a decent amount of text, so I figured it could make an interesting corpus.
So, since the timing could not be better for someone who has just started learning NLTK, I went ahead and analyzed the screenplay with it.
Obtaining the screenplay
You can download the screenplay as a PDF from the "DOWNLOAD THE SCREENPLAY" link on the official site of The Social Network.
Converting the PDF to plain text
I used Adobe Reader to convert it to plain text. However, the generated text had various problems as-is, so I fixed it up by hand as follows:
- Replaced ^M with newline characters (\n) (^M is a control character; in a terminal or in Vim you can type it with Ctrl-V Ctrl-M)
- Removed ^L (^L is also a control character; you can type it with Ctrl-V Ctrl-L)
- Symbols such as single quotes had turned into strange control characters, so I replaced them while cross-checking against the PDF
- Removed lines containing only a page number
- And so on
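For the record, most of these steps could be scripted. Here is a rough Python sketch; the control-character assumptions are my own reading of this particular Adobe Reader export, and the garbled-quote fixes still have to be done by hand against the PDF:

```python
import re

def clean_screenplay(text):
    """Rough sketch of the manual cleanup steps described above."""
    text = text.replace('\r', '\n')   # ^M (carriage return) -> newline
    text = text.replace('\x0c', '')   # drop ^L (form feed) page breaks
    lines = text.split('\n')
    # Drop lines that contain only a page number like "42" or "42."
    lines = [ln for ln in lines if not re.fullmatch(r'\s*\d+\.?\s*', ln)]
    return '\n'.join(lines)
```
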
Since this took quite a while, I uploaded the text version of the screenplay for anyone who wants to try doing something with The Social Network screenplay themselves. Note that converting to plain text loses the distinction between dialogue and scene descriptions; sorry about that.
Playing with NLTK
Now let's analyze the screenplay text with NLTK and have some fun.
$ python
>>> import nltk
From here on, we will run the following analyses on the screenplay:
- Counting the words in the document
- Searching for specific words
- Searching for the sentences in which a specific word appears
- Visualizing the frequency distribution of words
- Visualizing how the characters come and go from scene to scene
Loading the data
Read the text file, split it into tokens, and build an nltk.Text object from the token list.
>>> raw = open('the_social_network.txt').read()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
Let's look at what each variable holds. raw contains the screenplay as a string.
>>> type(raw)
<type 'str'>
>>> raw
'FROM THE BLACK WE HEAR--\nMARK (V.O.)\nDid you know there are more people with\ngenius IQ\'s living in China than there\nare people of any kind living in the\nUnited States?\nERICA (V.O.)\nThat can\'t possibly be true.\nMARK (V.O.)\nIt is.\nERICA (V.O.)\nWhat would account for that?\nMARK (V.O.)\nWell, first, an awful lot of people live\nin China. (snip)
tokens contains the list of tokens produced by the tokenization.
>>> type(tokens)
<type 'list'>
>>> tokens
['FROM', 'THE', 'BLACK', 'WE', 'HEAR--', 'MARK', '(', 'V.O.', ')', 'Did', 'you', 'know', 'there', 'are', 'more', 'people', 'with', 'genius', 'IQ', "'s", 'living', 'in', 'China', 'than', 'there', 'are', 'people', 'of', 'any', 'kind', 'living', 'in', 'the', 'United', 'States', '?', 'ERICA', '(', 'V.O.', ')', 'That', 'ca', "n't", 'possibly', 'be', 'true (snip)
text contains an NLTK Text object.
>>> type(text)
<class 'nltk.text.Text'>
>>> text
<Text: FROM THE BLACK WE HEAR-- MARK ( V.O....>
The rest of this article uses these objects for the analysis.
Counting the words in the document
First, as a warm-up, let's count the number of tokens (words) in the screenplay.
>>> len(tokens)
34821
The screenplay contains 34,821 tokens (words) in total.
Next, let's count the number of distinct words in the screenplay. To keep upper/lowercase variation from affecting the count, we build a token list with every word lowercased, make a set of it, and count the distinct words.
>>> tokens_l = [w.lower() for w in tokens]
>>> len(set(tokens_l))
4275
There were 4,275 distinct words.
Since plurals and past tenses are counted as separate words, the real number of distinct words is even smaller. Also, there are cases where nltk.word_tokenize(raw) fails to split punctuation from letters; for example, 'fuck.' sometimes comes out as the single token ['fuck.'] instead of being split into ['fuck', '.']. So computing the exact number of distinct words is hard, but this still gives a rough idea of the screenplay's vocabulary size.
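One common way to fold plurals and past tenses together is stemming. NLTK ships real stemmers such as the Porter stemmer; the toy suffix-stripping sketch below uses my own crude rules (not Porter's algorithm) purely to illustrate how stemming shrinks the distinct-word count:

```python
def toy_stem(word):
    """Extremely crude suffix stripping, for illustration only."""
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

words = ['hack', 'hacks', 'hacked', 'hacking', 'code', 'codes']
print(len(set(words)))                    # 6 distinct surface forms
print(len({toy_stem(w) for w in words}))  # 3 stems: 'hack', 'code', 'cod'
```

Note the 'codes' -> 'cod' mishap: even real stemmers produce stems that are not dictionary words, which is fine for counting but worth knowing.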
Emacs vs Vim (or Vi)!!!
The thing I was most curious about in The Social Network was, of course, which editor Mark Zuckerberg was using in the movie. Emacs and Vim never showed up in the Japanese subtitles, but maybe they appear in the English screenplay. Let's find out right away.
The count method of a list tells you how many times a given value appears in the list. Here it should be enough to call the count method of the token list with 'emacs', 'vim', and 'vi'. Let's see...
>>> tokens_l.count('emacs')
3
>>> tokens_l.count('vim')
0
>>> tokens_l.count('vi')
0
SON OF A BITCH!!!
Searching for the sentences in which a specific word appears
Pulling myself together, let's look at the sentences in which specific words appear. With NLTK's Text object, it is easy to search for the sentences in which a word occurs and to examine the other words that co-occur with it in the document.
For example, suppose we want to know in what kind of sentences the word 'Facebook' appears. For that, we can use the concordance method of the Text object.
>>> text.concordance('Facebook', lines=5)
Displaying 5 of 47 matches:
on his desktop labeled " Kirkland Facebook " . He clicks and opens it. A menu
's a Tuesday night ? The Kirkland Facebook is open on my desktop and some of
hese people have pretty horrendous Facebook pics . ( MORE ) Billy Olson 's sit
does n't keep a public centralized Facebook so I 'm going to have to get all t
ry to download the entire Kirkland Facebook . Kids ' stuff . On the computer s
Each occurrence of 'Facebook' is shown together with its surrounding context. concordance is case-insensitive.
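Under the hood, a concordance is just a window of tokens around each match. A minimal, case-insensitive version could be sketched without NLTK like this (a toy stand-in, not NLTK's implementation):

```python
def concordance(tokens, word, width=4, lines=5):
    """Print up to `lines` occurrences of `word` with `width` tokens
    of context on each side; return the total number of matches."""
    word = word.lower()
    hits = [i for i, t in enumerate(tokens) if t.lower() == word]
    for i in hits[:lines]:
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        print(' '.join(left + [tokens[i]] + right))
    return len(hits)

tokens = "I 'm CEO , Bitch . You called me a bitch on the internet".split()
concordance(tokens, 'bitch')   # prints 2 matching windows
```
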
Next, let's look at the sentences in which 'bitch' appears.
>>> text.concordance('bitch')
Displaying 2 of 2 matches:
TheFacebook ? ERICA You called me a bitch on the internet , Mark . MARK That '
published that Erica Albright was a bitch right before you made some ignorant
Only two matches. But in fact there are more places in the screenplay where 'bitch' appears. One of them shows up if you search for 'bitch.'.
>>> text.concordance('bitch.')
Displaying 1 of 1 matches:
. MARK ( V.O. ) Erica Albright 's a bitch. Do you think that 's because her fa
This happens because when nltk.word_tokenize(raw) tokenized the sentence "Erica Albright's a bitch.", the tokenizer mistakenly treated 'bitch' and '.' as a single token.
The remaining occurrences can be found by searching for 'CEO...Bitch'.
>>> text.concordance('CEO...Bitch')
Displaying 2 of 2 matches:
a business card that says " I 'm CEO...Bitch " , that 's what I want for you ,
cards out and looks at it . I 'm CEO...Bitch And over this we HEAR a woman 's
So it is good to keep in mind that tokenization with nltk.word_tokenize() can fail in some cases.
By the way, for 'Vim' and 'Vi' there were no tokenization misses; I verified that these words really do not appear anywhere in the document. Fuck.
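If cases like 'bitch.' matter for an analysis, one workaround is to post-process the token list and split off trailing punctuation. A rough sketch of such a heuristic (my own band-aid, not an NLTK API):

```python
import re

def split_trailing_punct(tokens):
    """Split sentence-ending punctuation off tokens like 'bitch.'.
    Tokens with internal periods such as 'V.O.' are left alone."""
    fixed = []
    for tok in tokens:
        m = re.match(r'^([A-Za-z]+)([.!?]+)$', tok)
        if m:
            fixed.extend([m.group(1), m.group(2)])
        else:
            fixed.append(tok)
    return fixed

print(split_trailing_punct(['a', 'bitch.', 'V.O.', 'Do']))
# ['a', 'bitch', '.', 'V.O.', 'Do']
```

The 'CEO...Bitch' case would still need its own rule; heuristics like this are always a trade-off against breaking legitimate tokens.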
Visualizing the frequency distribution of words
Next, let's look at the frequency distribution of the words in the screenplay and see what the most frequent words look like.
To get the frequency distribution of the words in a document, use nltk.FreqDist. The keys method of FreqDist gives you the list of distinct words sorted from most to least frequent.
>>> fdist = nltk.FreqDist(w.lower() for w in text)
>>> fdist.keys()[:50]
['.', 'the', 'to', ',', 'a', 'and', 'mark', 'you', 'i', "'s", '?', 'eduardo', 'of', ')', '(', 'it', 'that', 'in', 'is', 'we', "n't", 'he', 'on', 'sean', 'do', 'with', '-', ':', 'was', 'for', 'at', 'what', 'this', '"', 'int.', 'cut', 'his', "'m", "'re", 'are', 'cameron', 'they', 'have', 'room', 'up', 'be', 'tyler', 'as', "'d", 'not']
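FreqDist behaves much like the standard library's collections.Counter (and in recent NLTK versions it actually subclasses it, where most_common(50) replaces the keys()[:50] idiom above). The same ranking can be sketched without NLTK:

```python
from collections import Counter

# Tiny stand-in corpus; the real code would pass the screenplay tokens.
tokens = "the cat and the dog and the bird".split()
fdist = Counter(w.lower() for w in tokens)
print(fdist.most_common(3))   # [('the', 3), ('and', 2), ('cat', 1)]
```
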
The plot method visualizes the frequency distribution.
>>> fdist.plot(50, cumulative=True)
Looking at the plot, the top of the frequency distribution is dominated by words with little lexical content, such as 'the' and 'to', and by punctuation. This is not a very meaningful frequency distribution; we should rebuild it with those words and symbols excluded.
NLTK provides a corpus of stopwords (words like 'the' and 'to' that appear with high frequency in any document). Using it should give a frequency distribution that better reflects the character of the screenplay.
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', (snip)
Let's build a frequency distribution with these stopwords and some punctuation symbols removed.
>>> symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')']
>>> fdist = nltk.FreqDist(w.lower() for w in text if w.lower() not in stopwords + symbols)
>>> fdist.plot(50, cumulative=True)
This looks much better.
The character names 'mark', 'eduardo', and 'sean' are the most frequent, and words like 'room' and 'night' also rank high. A few short apostrophe tokens such as "'s" and "n't" are still mixed in; these come from words like "Mark's" and "didn't" being split by the tokenizer, and ideally they should be removed as well before examining the frequency distribution.
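That extra filtering step might be sketched like this, using a set for faster membership tests (the stopword list here is a tiny stand-in for NLTK's corpus, and the fragment list is my own guess at the contraction leftovers that show up):

```python
from collections import Counter

stopwords = {'the', 'to', 'a', 'and', 'of'}                  # stand-in list
symbols = set("'\"`.,-!?:;()")
fragments = {"'s", "n't", "'m", "'re", "'d", "'ll", "'ve"}   # contraction leftovers
skip = stopwords | symbols | fragments

tokens = "Mark 's room and the room".split()
fdist = Counter(w.lower() for w in tokens if w.lower() not in skip)
print(fdist.most_common())   # [('room', 2), ('mark', 1)]
```
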
Visualizing how the characters come and go from scene to scene
This time, let's visualize how the characters appearing in each scene of the movie change over time.
In a movie screenplay, each line of dialogue is preceded by the speaker's name in capital letters (e.g. MARK, EDUARDO), so we can simply look at where these strings occur in the screenplay. Here we will look at the protagonist Mark Zuckerberg, his girlfriend Erica Albright, his best friend Eduardo Saverin, the rowing-team brothers Cameron Winklevoss and Tyler Winklevoss, and Sean Parker, the key figure of the second half of the movie.
A dispersion plot visualizes the positions at which each word occurs in a document. You can display one with the dispersion_plot method of the Text object.
>>> persons = ['MARK', 'ERICA', 'EDUARDO', 'CAMERON', 'TYLER', 'SEAN']
>>> text.dispersion_plot(persons)
The horizontal axis of the plot roughly corresponds to the timeline of the movie.
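The data behind a dispersion plot is just the token offset of each occurrence. Without matplotlib, the same information could be collected like this (a toy version of what dispersion_plot draws):

```python
def dispersion_offsets(tokens, targets):
    """Map each target word to the token offsets where it occurs."""
    targets = set(targets)
    offsets = {t: [] for t in targets}
    for i, tok in enumerate(tokens):
        if tok in targets:
            offsets[tok].append(i)
    return offsets

tokens = ['MARK', 'hello', 'ERICA', 'hi', 'MARK', 'bye']
print(dispersion_offsets(tokens, ['MARK', 'ERICA']))
# e.g. {'MARK': [0, 4], 'ERICA': [2]} (key order may vary)
```
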
From this dispersion plot, we can see the following:
- Mark Zuckerberg, being the protagonist, appears in a lot of scenes. Except for the scenes centered on the rowing-team brothers, Mark is almost always present.
- Erica Albright appears only near the beginning and around the middle of the movie. The middle cluster is presumably that scene where Mark talks to Erica again after a long while.
- Eduardo Saverin appears in the most scenes after Mark.
- The rowing-team brothers Cameron Winklevoss and Tyler Winklevoss have lines at key points, but rarely appear together with characters other than Mark.
- Sean Parker enters in the second half of the movie, so naturally his lines in the screenplay also start in the second half.
Remembering all this makes me want to watch the movie again.
Wrapping up
We confirmed that NLTK makes it easy to inspect and visualize a document. In this article, we did the following with The Social Network screenplay:
- Counting the words in the document
- Emacs...
- Searching for the sentences in which a specific word appears
- Visualizing the frequency distribution of words
- Visualizing how the characters come and go from scene to scene
I'm tired now, so I'll leave it at this for today.
If I come up with more interesting ideas, I'll write them up in another blog post.
References

- 入門 自然言語処理 (the Japanese edition of "Natural Language Processing with Python")
- Authors: Steven Bird, Ewan Klein, Edward Loper (translated by 萩原正人, 中山敬広, 水野貴明)
- Publisher: O'Reilly Japan
- Release date: 2010/11/11
- Format: large-format book