I came across an interesting-looking entry that uses Clojure and Kuromoji:
Introduction to Text Mining with Clojure/kuromoji: From Morphological Analysis to Word Count (Clojure/kuromojiでテキストマイニング入門〜形態素解析からワードカウントまで〜)
http://antibayesian.hateblo.jp/entry/2013/09/10/231334
So I tried rewriting it with the Kuromoji that ships with Lucene.
Creating the project.
$ lein new app lucene-kuromoji
project.clj
(defproject lucene-kuromoji "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [incanter "1.5.1"]
                 [org.apache.lucene/lucene-analyzers-kuromoji "4.4.0"]]
  :main lucene-kuromoji.core
  :profiles {:uberjar {:aot :all}})
Unlike in the original entry, Lucene is available on Maven Central, so there is no need to specify :repositories.
Source.
src/lucene_kuromoji/core.clj
(ns lucene-kuromoji.core
  (:gen-class)
  (:import (org.apache.lucene.analysis Analyzer TokenStream)
           (org.apache.lucene.analysis.ja JapaneseAnalyzer)
           (org.apache.lucene.analysis.ja.tokenattributes BaseFormAttribute
                                                          InflectionAttribute
                                                          PartOfSpeechAttribute
                                                          ReadingAttribute)
           (org.apache.lucene.analysis.tokenattributes CharTermAttribute)
           (org.apache.lucene.util Version))
  (:use (incanter core stats charts io))
  (:require [clojure.string :as str]))

(def version Version/LUCENE_44)

;; Analyzes sentence and returns a vector of
;; [surface reading part-of-speech base-form inflection-type inflection-form].
;; Only tokens that satisfy every predicate are kept (all tokens if none are given).
(defn morphological-analysis-sentence [^String sentence & predicates]
  (let [^Analyzer analyzer (JapaneseAnalyzer. version)]
    (with-open [^TokenStream token-stream (.tokenStream analyzer "" sentence)]
      (.reset token-stream)
      (let [^CharTermAttribute char-term (.addAttribute token-stream CharTermAttribute)
            ^BaseFormAttribute base-form (.addAttribute token-stream BaseFormAttribute)
            ^InflectionAttribute inflection (.addAttribute token-stream InflectionAttribute)
            ^PartOfSpeechAttribute part-of-speech (.addAttribute token-stream PartOfSpeechAttribute)
            ^ReadingAttribute reading (.addAttribute token-stream ReadingAttribute)]
        (loop [results []]
          (if (.incrementToken token-stream)
            (let [attrs [(.toString char-term)
                         (.getReading reading)
                         (.getPartOfSpeech part-of-speech)
                         (.getBaseForm base-form)
                         (.getInflectionType inflection)
                         (.getInflectionForm inflection)]]
              (if (or (empty? predicates)
                      (reduce (fn [b p] (and b (p attrs))) true predicates))
                (recur (conj results attrs))
                (recur results)))
            (do (.end token-stream)
                results)))))))

;; Builds a map of word -> number of occurrences.
(defn word-count [words]
  (reduce (fn [words word] (assoc words word (inc (get words word 0))))
          {}
          words))

;; [word count] entries, most frequent first.
(defn wc-result [words]
  (reverse (sort-by second (word-count words))))

(defn top10 [words]
  (take 10 (wc-result words)))

(defn top100 [words]
  (take 100 (wc-result words)))

(defn -main [& args]
  (println "===== Simple Pattern =====")
  (doseq [t (morphological-analysis-sentence "黒い大きな瞳の男の娘")]
    (println t))
  (println "===== Filter Pattern =====")
  ;; Keep only tokens whose part of speech contains 名詞 (noun).
  (doseq [t (morphological-analysis-sentence
             "僕はウナギだ。象は鼻が長い"
             #(not (nil? (re-find #"名詞" (nth % 2)))))]
    (println t))
  (println "===== 坊ちゃん =====")
  (let [tokens (morphological-analysis-sentence
                (slurp "bocchan.txt")
                #(not (nil? (re-find #"名詞" (nth % 2)))))
        words (flatten (map #(first %) tokens))]
    (view (bar-chart (keys (top10 words)) (vals (top10 words))))
    (save (bar-chart (keys (top10 words)) (vals (top10 words))) "natume.png" :width 600)
    (save (bar-chart (keys (top100 words)) (vals (top100 words))) "natume_zip.png" :width 600)))
The JapaneseAnalyzer in Lucene Analyzers Kuromoji comes with filters already applied, so things like particles are dropped as stop words.
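To picture what "dropped as stop words" means, here is a minimal pure-Clojure sketch that removes tokens by part-of-speech tag. It only simulates the idea and is not how Lucene implements it: the tokens are hand-written, and `stop-tags` and `remove-stop-tags` are hypothetical names with an illustrative subset of tags, not Lucene's actual list.

```clojure
;; Hypothetical, hand-written tokens in the [surface reading part-of-speech] shape.
(def sample-tokens
  [["僕" "ボク" "名詞-代名詞-一般"]
   ["は" "ハ" "助詞-係助詞"]
   ["ウナギ" "ウナギ" "名詞-一般"]
   ["だ" "ダ" "助動詞"]])

;; Illustrative subset of stop tags; an assumption, not Lucene's real configuration.
(def stop-tags #{"助詞-係助詞" "助動詞"})

(defn remove-stop-tags
  "Drops every token whose part-of-speech tag is in stop-tags."
  [stop-tags tokens]
  (remove (fn [[_ _ pos]] (contains? stop-tags pos)) tokens))

(println (map first (remove-stop-tags stop-tags sample-tokens)))
;; prints (僕 ウナギ)
```

The particle は and the auxiliary だ disappear, which matches the shape of the Filter Pattern console output, where only content words remain.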
I also changed the code to lean a bit more on collections and function calls than the original source did... but it mostly made me realize how much my Clojure skills are lacking.
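As one possible way to push the word-count part further toward core collection functions, the following sketch uses `frequencies` and `sort-by`. The name `top-n` is hypothetical and this is not the code used for the results in this post; ties may also come out in a different order than with a reverse-sorted version.

```clojure
(defn top-n
  "Returns the n most frequent words as [word count] pairs, most frequent first."
  [n words]
  (->> (frequencies words)   ; map of word -> occurrences
       (sort-by val >)       ; highest count first
       (take n)))

(println (top-n 2 ["猫" "犬" "猫" "鳥" "猫" "犬"]))
;; prints ([猫 3] [犬 2])
```

With this shape, a top-10 and a top-100 helper would just be `(partial top-n 10)` and `(partial top-n 100)`.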
The Botchan text that ends up as the input was pulled from here.
Botchan (坊っちゃん)
http://www.aozora.gr.jp/cards/000148/files/752_14964.html
Not just an excerpt, but the whole thing (lol).
Here are the results of running it.
First, the console output.
===== Simple Pattern =====
[黒い クロイ 形容詞-自立 nil 形容詞・アウオ段 基本形]
[大きな オオキナ 連体詞 nil nil nil]
[瞳 ヒトミ 名詞-一般 nil nil nil]
[男 オトコ 名詞-一般 nil nil nil]
[娘 ムスメ 名詞-一般 nil nil nil]
===== Filter Pattern =====
[僕 ボク 名詞-代名詞-一般 nil nil nil]
[ウナギ ウナギ 名詞-一般 nil nil nil]
[象 ゾウ 名詞-一般 nil nil nil]
[鼻 ハナ 名詞-一般 nil nil nil]
The latter is the pattern that picks out only the nouns (名詞).
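The predicate-based filtering can be tried without Lucene on hand-copied token vectors. The sketch below is an assumption-laden illustration: `filter-tokens`, `noun?` and `not-surface` are hypothetical names, and `every?` stands in for the reduce-based check in the source.

```clojure
;; Tokens copied from the Simple Pattern output, trimmed to
;; [surface reading part-of-speech].
(def simple-pattern-tokens
  [["黒い" "クロイ" "形容詞-自立"]
   ["大きな" "オオキナ" "連体詞"]
   ["瞳" "ヒトミ" "名詞-一般"]
   ["男" "オトコ" "名詞-一般"]
   ["娘" "ムスメ" "名詞-一般"]])

(defn noun?
  "True when the part-of-speech tag contains 名詞."
  [token]
  (boolean (re-find #"名詞" (nth token 2))))

(defn not-surface
  "Hypothetical helper: builds a predicate rejecting one surface form."
  [surface]
  (fn [token] (not= surface (first token))))

(defn filter-tokens
  "Keeps the tokens that satisfy every predicate; with none, keeps all."
  [tokens & predicates]
  (filter (fn [token] (every? #(% token) predicates)) tokens))

(println (map first (filter-tokens simple-pattern-tokens noun?)))
;; prints (瞳 男 娘)
(println (map first (filter-tokens simple-pattern-tokens noun? (not-surface "男"))))
;; prints (瞳 娘)
```

Because `every?` returns true for an empty collection, calling `filter-tokens` with no predicates keeps every token, so a separate `empty?` check is not needed in this variant.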
The results differ subtly from the original article... maybe the amount or range of text used was different?