word2vec is apparently a technique that uses something like a neural network to build a corpus from a collection of documents, so that you can do vector arithmetic on words.
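I heard about "subtracting 'breasts' from Kaga-san of KanColle" and thought it was amazing!! People have already tried it on Twitter, tried it with Eijiro, with Magic: The Gathering, with wikipedia and so on, so this feels like the umpteenth rehash, but I tried it anyway.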
I installed word2vec following this guide. This time I will work in the terminal rather than in Python.
demo-word.sh uses a dataset called text8, a corpus of about 100 MB that looks like this:
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as
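In other words, the data is nothing but words lined up one after another, separated by single ASCII spaces. So whether the text is English or Japanese, it is enough to dump every word, space-separated, into one file. Commas, periods, uppercase letters, digits, and special characters have all been stripped.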
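English is already delimited by spaces, so no word segmentation is needed, but inflected forms do need to be reduced to their base forms. Lemmatizing with WordNet, following this reference, will probably improve accuracy.
For Japanese, preprocess with the familiar MeCab, set up so that inflected forms are output as base forms.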
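# Python script that strips commas, periods, uppercase letters, digits, and special characters
src = "text.txt"

# Characters to delete; '\x81f' is kept verbatim from the original script
rep = ['[', ']', '#', '&', ',', '.', ';', ':', '(', ')', '%', '<', '>', '!', '?', '\x81f']
rep += [str(d) for d in range(10)]  # digits 0-9

res = []
with open(src, "r") as f:
    for line in f:
        tmp = line.rstrip().lower()  # drop the newline and lowercase everything
        for ch in rep:
            tmp = tmp.replace(ch, "")
        res.append(" " + tmp)

# Write everything as one space-separated line, like text8
with open("text1.txt", "w") as w0:
    w0.write("".join(res))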
Running demo-word.sh takes care of everything from downloading the sample data to running word2vec, but what matters is the command it runs and the resulting vectors.bin. The part below is what does the actual work.
time ./word2vec -train text8 -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
./distance vectors.bin       # distance computation
./word-analogy vectors.bin   # similarity (analogy) computation
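Since vectors.bin is the artifact everything else depends on, it can help to peek inside it from Python. The following is a minimal sketch, assuming the file was written with `-binary 1` on a little-endian machine, in the layout the bundled tools read: an ASCII header line with the vocabulary size and dimension, then for each word the word itself, a space, and the raw 32-bit floats. The function name `read_word2vec_binary` is my own and not part of word2vec.

```python
import struct


def read_word2vec_binary(path):
    """Read a word2vec `-binary 1` file into a {word: [floats]} dict."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dimension>\n" in ASCII.
        vocab_size, dim = map(int, f.readline().split())
        for _ in range(vocab_size):
            # Each entry starts with the word, terminated by one space.
            chars = []
            while True:
                c = f.read(1)
                if c == b" " or c == b"":
                    break
                if c != b"\n":  # skip the newline that ends the previous entry
                    chars.append(c)
            word = b"".join(chars).decode("utf-8", errors="replace")
            # Then `dim` 32-bit floats (assumed little-endian here).
            raw = f.read(4 * dim)
            vectors[word] = list(struct.unpack("<%df" % dim, raw))
    return vectors
```

With the command above, `read_word2vec_binary("vectors.bin")["paris"]` would give a list of 200 floats, provided `paris` is in the vocabulary.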
Distance works just as described on the site. As for similarity (analogy):
Enter three words (EXIT to break): paris france berlin
Word: paris  Position in vocabulary: 1055
Word: france  Position in vocabulary: 303
Word: berlin  Position in vocabulary: 1360

Word                        Distance
------------------------------------------------------------------------
germany                     0.633804
hungary                     0.486985
russia                      0.472652
austria                     0.469492
ussr                        0.465501
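My guess: add paris and france, subtract berlin, and the capitals cancel out so that a country name is left? (Strictly speaking, word-analogy scores candidates against vec(france) - vec(paris) + vec(berlin), so it is the paris-to-france relation that gets transferred onto berlin.)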
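To make the arithmetic concrete, here is a small sketch of what the analogy query does, as far as I understand word-analogy: normalize every vector, form vec(b) - vec(a) + vec(c) for the input "a b c", and rank the remaining words by cosine similarity to that target. The three-dimensional `toy` vectors are invented for illustration and have nothing to do with the real vectors.bin.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def analogy(vectors, a, b, c, topn=3):
    """Rank words by cosine similarity to vec(b) - vec(a) + vec(c)."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = normalize([y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])])
    scores = [
        (sum(p * q for p, q in zip(target, u)), w)
        for w, u in unit.items()
        if w not in (a, b, c)  # the input words themselves are excluded
    ]
    return [w for _, w in sorted(scores, reverse=True)[:topn]]


# Hypothetical toy vectors: axis 0 = capital (+) vs country (-),
# axis 1 = "French", axis 2 = "German".
toy = {
    "paris":   [1.0, 1.0, 0.0],
    "france":  [-1.0, 1.0, 0.0],
    "berlin":  [1.0, 0.0, 1.0],
    "germany": [-1.0, 0.0, 1.0],
    "russia":  [-1.0, 0.0, 0.0],
}

print(analogy(toy, "paris", "france", "berlin"))
```

In this toy setup the "capital" component cancels between paris and berlin, which is why a country ends up closest to the target.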
Now, from here on I want to get a bit more ambitious with the analysis.
I crawled medical texts and obtained roughly 20 MB of data.
For a start I just threw it all in and ran it.
For antibiotics, a topic that ambitious types love to study, antibiotic classes such as macrolides and β-lactams come up.
Enter word or sentence (EXIT to break): antibiotics
Word: antibiotics  Position in vocabulary: 564

Word                        Cosine distance
------------------------------------------------------------------------
macrolide                   0.686868
β-lactam                    0.671718
rifampin                    0.659790
regimens                    0.659161
broad-spectrum              0.649693
antibiotic                  0.646567
antifungal                  0.638228
fluoroquinolone             0.632737
rifapentine                 0.628934
regimen                     0.628189
aminoglycosides             0.623643
fluoroquinolones            0.619080
prophylactic                0.618527
combination                 0.618506
cephalosporins              0.613100
metronidazole               0.612240
quinolones                  0.609158
antituberculous             0.608638
cephalosporin               0.608548
agents                      0.607759
antimicrobial               0.601876
third-generation            0.598869
single-dose                 0.598591
vancomycin                  0.594600
aminoglycoside              0.591262
isoniazid                   0.589512
low-dose                    0.587311
corticosteroids             0.585872
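Although the column is labelled "Cosine distance", the values behave like cosine similarity: larger means closer, and 1.0 would be an identical direction. A rough sketch of that ranking follows; the two-dimensional `toy` vectors are made up for illustration and are not taken from the medical corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, query, topn=5):
    """Return (word, similarity) pairs closest to `query`, best first."""
    q = vectors[query]
    ranked = sorted(
        ((cosine_similarity(q, v), w) for w, v in vectors.items() if w != query),
        reverse=True,
    )
    return [(w, s) for s, w in ranked[:topn]]


# Hypothetical toy vectors, invented for this example.
toy = {
    "antibiotics": [1.0, 1.0],
    "macrolide":   [1.0, 0.8],
    "regimen":     [1.0, 0.0],
    "hobbies":     [-1.0, 0.2],
}

for word, score in nearest(toy, "antibiotics"):
    print("%-12s %.6f" % (word, score))
```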
Cephalosporins are classified by development into "n-th generation" groups, so I tried to see how that plays out, but I did not really dig into the contents. What to do!!
Enter word or sentence (EXIT to break): cephalosporin
Word: cephalosporin  Position in vocabulary: 5052

Word                        Cosine distance
------------------------------------------------------------------------
ampicillin/sulbactam        0.899973
cefotaxime                  0.878304
quinolone                   0.875494
tetracycline                0.871787
ampicillin                  0.870064
imipenem                    0.868366
aztreonam                   0.866771
fluoroquinolone             0.864932
aminoglycoside              0.859710
clindamycin                 0.858559
chloramphenicol             0.851192
tobramycin                  0.851054
third-generation            0.848996
nitrofurantoin              0.848938
metronidazole               0.848060
macrolide                   0.847879
erythromycin                0.847772
ofloxacin                   0.843216
cephalexin                  0.843158
ticarcillin/clavulanate     0.839059
cefazolin                   0.835968
ciprofloxacin               0.835081
vancomycin                  0.833714
sulbactam                   0.833306
levofloxacin                0.831119
ceftazidime                 0.830860
cefotetan                   0.830272
gentamicin                  0.828700
piperacillin                0.827680
tmp-smx                     0.827597
penicillin-allergic         0.827292
amoxicillin                 0.826194
minocycline                 0.824555
antipneumococcal            0.824483
meropenem                   0.819185
ceftriaxone                 0.818070
bismuth                     0.816361
quinolones                  0.815513
oxacillin                   0.815391
amoxicillin/clavulanate     0.814614
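For cancer, I queried cancer hoping to fish out malignant or benign, but mostly body sites came up, which left me wondering whether plain co-occurrence analysis would do just as well... Still, terms like hormone-dependent, which seem to reflect recent molecular-targeted therapy, also showed up, so I thought it was kind of nice.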
Enter word or sentence (EXIT to break): cancer
Word: cancer  Position in vocabulary: 92

Word                        Cosine distance
------------------------------------------------------------------------
cancers                     0.772832
breast                      0.741152
prostate                    0.736099
colorectal                  0.732450
endometrial                 0.700628
melanoma                    0.679323
osteogenic                  0.645691
ovarian                     0.639607
cervix                      0.635429
carcinoma                   0.631158
hormone-dependent           0.623069
non-small                   0.604022
small-cell                  0.597804
polyps                      0.583271
hnpcc                       0.581199
neuroblastoma               0.569610
carcinomas                  0.563273
malignancies                0.554716
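For comparison, the co-occurrence analysis mentioned above can be sketched as simply counting which words appear within a few tokens of the target. This is only a rough baseline under my own assumptions (a symmetric window, raw counts, no weighting), and the `sample` sentence is invented rather than taken from the crawled corpus.

```python
from collections import Counter


def cooccurrence_counts(tokens, target, window=5):
    """Count words appearing within `window` tokens of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the target occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical miniature corpus in the same space-separated format as text8.
sample = "breast cancer screening and prostate cancer screening".split()

print(cooccurrence_counts(sample, "cancer", window=1).most_common())
```

Body sites such as breast and prostate sit right next to cancer in text like this, which fits the impression that a plain count would surface much of the same list.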
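It is never quite clear what "bioinformatics" refers to, so what does it mean in this corpus? In the end it seems to be microarrays. There did not seem to be much written about NGS yet, so that is probably why. HapMap was also caught, so it apparently includes working with databases too.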
Enter word or sentence (EXIT to break): bioinformatics
Word: bioinformatics  Position in vocabulary: 18193

Word                        Cosine distance
------------------------------------------------------------------------
plausible                   0.799823
catalogue                   0.774522
microarrays                 0.774336
nonconventional             0.768620
double-antibody             0.766882
hobbies                     0.766779
hgp                         0.766022
outdoors                    0.764794
stereotype                  0.759451
hapmap                      0.756572
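Finally, I wanted to try the arithmetic-style similarity estimation, but could not think of a good example. Since multi-drug anticancer regimens are the recent trend, I reasoned as follows: leukemia is treated with chemotherapy, so if I subtract the molecular-targeted drug rituximab, maybe the classical DNA-damaging anticancer agents will come out. (Note that for the input below, word-analogy actually scores against vec(chemotherapy) - vec(leukemia) + vec(rituximab).) The result included adjuvant and neoadjuvant, that is, postoperative and preoperative chemotherapy, but anticancer agents such as cyclophosphamide and gemcitabine also appeared, so I will call it good enough. Rituximab is used in multi-drug combinations, so seeing monotherapy in the list also felt kind of right.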
Enter three words (EXIT to break): leukemia chemotherapy rituximab
Word: leukemia  Position in vocabulary: 777
Word: chemotherapy  Position in vocabulary: 422
Word: rituximab  Position in vocabulary: 4732

Word                        Distance
------------------------------------------------------------------------
adjuvant                    0.615775
cyclophosphamide            0.609339
platinum-based              0.607038
high-dose                   0.604657
neoadjuvant                 0.596819
gemcitabine                 0.596638
regimens                    0.593977
monotherapy                 0.592749
infliximab                  0.584584
efficacious                 0.578444
lopinavir/ritonavir         0.574542
adalimumab                  0.568594
thalidomide                 0.567074
three-drug                  0.565481
multidrug                   0.562847
cytarabine                  0.560205
single-agent                0.559553
methotrexate                0.559375
bevacizumab                 0.554945
low-dose                    0.550815
combination                 0.548845
tamoxifen                   0.547920
second-line                 0.547072
doxorubicin                 0.543647
lamivudine                  0.542324
alone                       0.534888
ifn-α                       0.534870
antiviral                   0.534470
mitoxantrone                0.534140
trastuzumab                 0.527470
posaconazole                0.526527
taxane                      0.526007
comparable                  0.523686
voriconazole                0.523049
treatment-experienced       0.521246
leflunomide                 0.521215
combinations                0.519301
m-vac                       0.518551
anti-tnf                    0.517589
cisplatin                   0.514165
I also played around with throwing in oncogenes and tumor suppressor genes, and hereditary diseases whose causative genes are known, but nothing worthwhile came out, so that is about it.
I think automated diagnosis means inferring the disease from the symptoms, so I tried "a woman with abdominal pain", and this is what came out.
Enter three words (EXIT to break): pain abdomen woman
Word: pain  Position in vocabulary: 86
Word: abdomen  Position in vocabulary: 1939
Word: woman  Position in vocabulary: 3529

Word                        Distance
------------------------------------------------------------------------
dre                         0.516829
contrast-enhanced           0.482033
spn                         0.477665
child-bearing               0.463668
childbearing                0.462467
demonstrating               0.457256
flexible                    0.456756
-year-old                   0.456158
pregnant                    0.452602
session                     0.441014
tof                         0.440933
mmse                        0.439102
nonpregnant                 0.438862