I'm Nakamura (@po3rin) from the AI and Machine Learning Team in the M3 Engineering Group. My favorite language is Go, and at work I'm mainly responsible for search.
Overview
Recently I worked on a project to implement a related-keywords feature with Elasticsearch, at as low a cost as possible, for a service where users can ask doctors questions. This article presents the results of our technical survey on low-cost ways to implement related keywords, along with the method we actually adopted.
The methods introduced here use no machine learning; they aim for reasonable quality at the lowest possible cost. After reading this article, you should be able to add a related-keywords feature to your own search application quickly.
- Overview
- What is a related-keywords feature in search?
- Preconditions for the implementation
- Implementation patterns
- Comparing the implementation patterns
- Thoughts on removing similar words
- Outcome
- Summary
- Reference
What is a related-keywords feature in search?
It is a feature that recommends related keywords the user can search again with, based on the user's search query. As in the figure below, Google shows related keywords near the bottom of its search results page.
A similar feature is query suggestion, which completes the keyword being typed into the search box or recommends keywords that are often searched together with it. For example, when the user types 「に・n」 into the search box, it recommends the query 「日本」. Query suggestion in the search box is out of scope for this article. For implementing query suggestion with Elasticsearch, the official Elasticsearch blog [1] is a good reference.
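Preconditions for the implementation
We assume that only the following data is available.
- The query string the user typed into search
- Past search log data
- The documents being searched
Of course, hit counts and click-through rates would let us compute better related keywords. However, the application we were building this feature for is a service whose search-log collection platform is still under construction, so the only log data we could use was the query strings users actually entered. For that reason, this article only considers methods that can be implemented with the data listed above.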
Implementation patterns
Here are the four patterns we considered.
- (1) Counting the words that appear in the logs
- (2) Significant terms aggregation over the logs
- (3) Significant terms aggregation over the logs (with morphological analysis)
- (4) Significant terms aggregation over the documents (with morphological analysis)
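(1) Counting the words that appear in the logs
This method splits the search logs with a split tokenizer, uses the aggregations feature to count the words that appear together with the query, and treats the top k words by count as the related keywords.
The split tokenizer (the simple_pattern_split tokenizer in the mapping below) performs no morphological analysis; it simply splits text into words on spaces, symbols, and similar characters. See the Elasticsearch documentation for details.
A mapping that uses the split tokenizer is configured as follows. Set the other filters to suit your own search application.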
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "split_analyzer": {
              "type": "custom",
              "tokenizer": "split_tokenizer",
              "char_filter": ["html_strip", "standard_icu_filter"],
              "filter": ["trim", "ja_stop"]
            }
          },
          "tokenizer": {
            "split_tokenizer": {
              "type": "simple_pattern_split",
              "pattern": ["_", " ", "　", "、", "。", "?", ",", "."]
            }
          },
          "char_filter": {
            "standard_icu_filter": {
              "type": "icu_normalizer"
            }
          },
          "filter": {
            "ja_stop": {
              "type": "ja_stop",
              "stopwords": ["_japanese_"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "query_log": {
            "type": "text",
            "fielddata": true,
            "analyzer": "split_analyzer"
          }
        }
      }
    }
"fielddata": true is required to run aggregations on a text field, so we set it here.
Sending a query like the one below returns related keywords for 「腹痛」 (abdominal pain). The top-level size is set to 0 because the log entries that match the search are not needed for the related-keywords feature.
    {
      "size": 0,
      "query": {
        "match": {
          "query_log": {
            "query": "腹痛",
            "operator": "and"
          }
        }
      },
      "aggs": {
        "keywords": {
          "terms": {
            "field": "query_log",
            "order": { "_count": "desc" },
            "size": 10
          }
        }
      }
    }
This query returns related keywords based on the search query. Below is the result of running it against two years of our log data.
    {
      // ...
      "aggregations": {
        "keywords": {
          // ...
          "buckets": [
            { "key": "腹痛", // ...
            },
            { "key": "下痢", // ...
            },
            { "key": "妊娠初期", // ...
            },
            { "key": "過敏性腸症候群", // ...
            },
            { "key": "子供", // ...
            },
            { "key": "妊娠", // ...
            },
            { "key": "吐き気", // ...
            },
            { "key": "腰痛", // ...
            },
            { "key": "食後", // ...
            },
            { "key": "便秘", // ...
            }
          ]
        }
      }
    }
The advantage of this method is that the related keywords appended to the query are natural Japanese (we compare this with the morphological-analysis methods later). One disadvantage is that popular keywords with spelling variants, such as がん/ガン, may be returned together as separate related keywords. In addition, the split tokenizer cannot filter by part of speech, so keywords that carry little meaning as related keywords, such as numbers and symbols, may be returned.
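The request and response handling for method (1) can be sketched in Python as follows. This is a minimal sketch under a few assumptions: the function names are hypothetical, the HTTP call itself is left to whichever Elasticsearch client you use, and dropping the query's own terms from the buckets (the sample result above includes 「腹痛」 itself) is our illustrative choice, not something the article specifies.

```python
from typing import Any


def build_terms_query(query: str, field: str = "query_log", size: int = 10) -> dict[str, Any]:
    """Build a search body that counts terms co-occurring with `query`."""
    extra = len(query.split())  # room for the query's own terms, dropped later
    return {
        "size": 0,  # only the aggregation is needed, not the matching logs
        "query": {"match": {field: {"query": query, "operator": "and"}}},
        "aggs": {
            "keywords": {
                "terms": {
                    "field": field,
                    "order": {"_count": "desc"},
                    "size": size + extra,
                }
            }
        },
    }


def extract_keywords(response: dict[str, Any], query: str, size: int = 10) -> list[str]:
    """Return bucket keys in count order, skipping the query's own terms."""
    own_terms = set(query.split())
    buckets = response["aggregations"]["keywords"]["buckets"]
    keys = [b["key"] for b in buckets if b["key"] not in own_terms]
    return keys[:size]
```

The body returned by build_terms_query can be passed to a search request against the log index, and extract_keywords then turns the response into an ordered keyword list.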
(2) Significant terms aggregation over the logs
Instead of simply returning the most frequent terms, this method uses Elasticsearch's Significant Terms Aggregation.
Put simply, it takes into account not only how often a term appears with the query but also how often it appears overall. It runs against the index introduced above; only the query changes. Significant terms aggregation lets you choose the scoring method, and here we use Google normalized distance [2].
    {
      "size": 0,
      "query": {
        "match": { "query_log": "腹痛" }
      },
      "aggregations": {
        "significant_crime_types": {
          "significant_terms": {
            "field": "query_log",
            "min_doc_count": 10,
            "size": 10,
            "gnd": { }
          }
        }
      }
    }
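min_doc_count sets the minimum number of occurrences a word needs in order to be returned as a related keyword. size is the maximum number of keywords returned in the response. Below is the result of running the query against two years of our log data.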
    {
      // ...
      "aggregations": {
        "significant_crime_types": {
          // ...
          "buckets": [
            { "key": "腹痛", // ...
            },
            { "key": "下痢", // ...
            },
            { "key": "過敏性腸症候群", // ...
            },
            { "key": "〓移植後", // ...
            },
            { "key": "排便前", // ...
            },
            { "key": "排便後", // ...
            },
            { "key": "〓〓〓ール", // ...
            },
            { "key": "食後", // ...
            },
            { "key": "大腸内視鏡後", // ...
            }
          ]
        }
      }
    }
Compared with the first method, the keywords returned are more closely related to 「腹痛」. On the other hand, keywords such as 「妊娠」 (pregnancy), which also co-occur frequently with keywords other than 「腹痛」, become less likely to appear. As a result, somewhat unusual keywords can be returned, which is a disadvantage for some search applications. This method also keeps the drawbacks of the previous one, such as being unable to filter by part of speech.
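To see why a keyword that co-occurs with many other queries is pushed down, it helps to compute the normalized Google distance by hand. The sketch below implements the standard distance formula; the counts are invented purely for illustration, and how Elasticsearch turns this distance into a bucket score is not covered here.

```python
import math


def normalized_google_distance(fx: int, fy: int, fxy: int, n: int) -> float:
    """Distance between terms x and y; smaller means more closely related.

    fx, fy: number of log entries containing x / containing y
    fxy:    number of log entries containing both
    n:      total number of log entries
    """
    if fxy == 0:
        return math.inf  # never seen together
    log_fx, log_fy = math.log(fx), math.log(fy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        return 0.0  # both terms occur in every entry
    return (max(log_fx, log_fy) - math.log(fxy)) / denominator


# Invented counts: the query term occurs in 10,000 of 1,000,000 log entries.
N, F_QUERY = 1_000_000, 10_000

# A common keyword: many co-occurrences, but it appears everywhere.
common = normalized_google_distance(F_QUERY, 100_000, 2_000, N)
# A rare keyword: fewer co-occurrences, but most of its uses are with the query.
rare = normalized_google_distance(F_QUERY, 1_500, 1_000, N)

print(f"common={common:.3f} rare={rare:.3f}")
```

With these made-up numbers the rare keyword ends up closer to the query than the common one even though it co-occurs half as often, which matches the behaviour described above.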
(3) Significant terms aggregation over the logs (with morphological analysis)
The two methods so far relied on the split tokenizer. The method introduced here combines the aggregation with Japanese morphological analysis, such as kuromoji.
All you need to do is replace split_analyzer with the analyzer of your choice. The advantage of this method is that features such as part-of-speech filters become available.
To make the disadvantage of this method easy to see, here is an example of running the aggregation for 「生理痛」 (menstrual pain) against two years of log data.
    {
      // ...
      "aggregations": {
        "significant_crime_types": {
          // ...
          "buckets": [
            { "key": "生理痛", // ...
            },
            { "key": "ひどい", // ...
            },
            { "key": "下腹部", // ...
            },
            { "key": "酷い", // ...
            },
            { "key": "改善", // ...
            },
            { "key": "〓グ〓〓ール", // ...
            },
            { "key": "リー", // ...
            },
            { "key": "〓ス", // ...
            },
            { "key": "緩和", // ...
            },
            { "key": "〓", // ...
            }
          ]
        }
      }
    }
The result contains unnatural Japanese. 「リー」 and 「〓ス」 stand out in particular. They come from the drug name 「リー〓ス」, which is an unknown word to the analyzer and was therefore split apart by morphological analysis. As this shows, methods based on morphological analysis can break unknown words into unnatural fragments.
(4) Significant terms aggregation over the documents (with morphological analysis)
With the three methods so far, appending a related keyword to the query and searching again may return zero hits. (Using per-log hit counts would avoid this, but our precondition is that only the query strings are available.) If you must never generate a related keyword that leads to zero hits, significant terms aggregation over the documents is worth considering.
This method does not use the log data at all; it uses only the documents being searched. It rests on the assumption that the documents ranking near the top for a search contain keywords related to that search. Because keywords inside documents are generally not separated by spaces, morphological analysis is unavoidable.
Here is the result of searching for 「生理痛」. The query is the same as before except that the target field changes from the logs to the documents.
    {
      // ...
      "aggregations": {
        "significant_crime_types": {
          // ...
          "buckets": [
            { "key": "生理痛", // ...
            },
            { "key": "部", // ...
            },
            { "key": "周期", // ...
            },
            { "key": "〓", // ...
            },
            { "key": "婦人", // ...
            },
            { "key": "腰痛", // ...
            },
            { "key": "排卵", // ...
            },
            { "key": "不正", // ...
            },
            { "key": "筋腫", // ...
            },
            { "key": "子宮", // ...
            },
            { "key": "卵巣", // ...
            }
          ]
        }
      }
    }
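Because these keywords are taken from the documents themselves, a search with them never returns zero results. However, since morphological analysis is involved, unnatural Japanese can still appear. Also, because the related keywords are drawn from the top-k results of the initial query, depending on the value of k, searching with a related keyword may return results that barely differ from those of the initial query.
Comparing the implementation patterns
Let's compare the methods introduced so far. We prepared the following environment for the comparison.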
- Elasticsearch v7.10.1
- kuromoji as the analyzer
- At most 10 related keywords returned per query
- 100 queries sampled at random from past queries for evaluation
- Two years of log data
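For the log data, we evaluated with two years of logs from AskDoctors, which our company develops and operates. AskDoctors is a service where users can ask doctors questions, and it has a feature for searching past consultations, so we used the logs from that search.
Evaluation metrics
Below are the metrics we defined when implementing the related-keywords feature.
1: Total number of related keywords
2: Probability of a related keyword that is unnatural as Japanese
3: Probability of a related keyword that returns zero hits
4: Probability of a related keyword for which at least 8 of the top 10 results are the same content as the top 10 results of the initial query
5: Probability that, for a given related keyword, another related keyword appears whose results overlap with it in at least 8 items
6: Implementation and operation cost (5 levels)
The first three metrics should be self-explanatory.
The fourth metric exists because a related keyword whose results overlap heavily with those of the initial query is not worth recommending.
The fifth metric is based on the hypothesis that searching with similar keywords returns similar results, and it measures how often similar words end up in the related-keywords list. If the recommended list contains both 「痛い」 and 「痛み」, the related keywords look duplicated to the user, so similar words and keywords that differ only in inflection need to be removed from the list.
The sixth metric is a rough, subjective estimate of implementation and operation cost on a five-level scale.
For the second metric, we checked every keyword by eye to decide whether it was unnatural as Japanese.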
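Metrics 4 and 5 can be computed from the top-10 result IDs alone. The following is a minimal sketch of one possible reading of those two definitions; the function names, the use of document IDs, and the exact way the rate is normalized are our assumptions, since the article does not spell out the calculation.

```python
from itertools import combinations


def overlap(a: list[str], b: list[str]) -> int:
    """Number of document IDs shared by two result lists."""
    return len(set(a) & set(b))


def rate_overlapping_initial(
    initial_top: list[str], keyword_tops: dict[str, list[str]], threshold: int = 8
) -> float:
    """Metric 4: share of related keywords whose top results share at least
    `threshold` documents with the initial query's top results."""
    if not keyword_tops:
        return 0.0
    hits = sum(1 for top in keyword_tops.values() if overlap(initial_top, top) >= threshold)
    return hits / len(keyword_tops)


def rate_overlapping_each_other(
    keyword_tops: dict[str, list[str]], threshold: int = 8
) -> float:
    """Metric 5: share of related keywords that have at least one other
    related keyword sharing at least `threshold` top results with them."""
    if not keyword_tops:
        return 0.0
    flagged: set[str] = set()
    for (k1, top1), (k2, top2) in combinations(keyword_tops.items(), 2):
        if overlap(top1, top2) >= threshold:
            flagged.update((k1, k2))
    return len(flagged) / len(keyword_tops)
```

Here keyword_tops maps each related keyword to the IDs of the top 10 results obtained when that keyword is appended to the initial query.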
The table below shows the results of measuring the four implementation patterns against these metrics.
Based on these results, we ruled out (3) and (4) because they include results that are unnatural as Japanese. That leaves (1) and (2). At first glance (2) looks better, but (2) yields clearly fewer related keywords than (1), and, as mentioned in the section introducing the method, (2) only returns keywords that are strongly related to the initial query alone. For these reasons we chose method (1).
Thoughts on removing similar words
We have looked at four implementation patterns, but all of them can produce similar words and keywords that differ only in inflection, for example 「痛み/痛い」 and 「妊娠/妊娠中」. When such keywords are present, the user sees what looks like the same keyword repeated, which not only looks bad but also narrows the range of related keywords we can suggest.
Absorbing all of these with a synonym dictionary is difficult, so some other measure is needed.
Detecting similar words with distributed word representations
One option for solving this problem is to use distributed word representations (word embeddings). For example, chiVe [3], a publicly available set of Japanese word vectors, can find similar words and keywords that differ only in inflection to a reasonable degree.
chiVe is easiest to use through Magnitude, a Python library for working with word embeddings. The linked article is a useful reference.
For example, the following code retrieves the similarity between pairs of words.
    from pymagnitude import Magnitude

    # Load the vectors remotely
    vectors = Magnitude(
        "https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/chive-1.2-mc15.magnitude")

    print(vectors.similarity("痛み", "病気"))
    print(vectors.similarity("痛み", "痛い"))
    print(vectors.similarity("妊娠中", "妊娠"))
    # two distinct forms starting with ぶつ (remaining kana unrecoverable)
    print(vectors.similarity("ぶつ〓〓", "ぶつ〓〓"))
Running this prints a similarity score for each pair of words, here in the range 0 to 1.
    python chive.py
    0.4237097
    0.66561085
    0.8114295
    0.7468979274167429
By setting a threshold on the similarity, similar words can be removed from the related-keywords list.
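The threshold-based removal can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this article: the function name and the 0.65 threshold are hypothetical, and the similarity function is injected so that something like vectors.similarity can be plugged in without this sketch depending on Magnitude.

```python
from typing import Callable


def drop_similar_keywords(
    keywords: list[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.65,  # hypothetical value; tune on your own data
) -> list[str]:
    """Keep keywords in ranked order, skipping any keyword that is too
    similar to one that has already been kept."""
    kept: list[str] = []
    for candidate in keywords:
        if all(similarity(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Stand-in similarity using two of the scores printed above (rounded).
_SCORES = {
    frozenset(("痛み", "病気")): 0.4237,
    frozenset(("痛み", "痛い")): 0.6656,
}


def lookup_similarity(a: str, b: str) -> float:
    return _SCORES.get(frozenset((a, b)), 0.0)


print(drop_similar_keywords(["痛み", "病気", "痛い"], lookup_similarity))
```

Because earlier keywords win, the list should be passed in ranked order so that the higher-ranked spelling is the one that survives.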
The rule-based removal of similar words that we adopted
As an approach that does not use word embeddings, we can simply remove the patterns we can already see occurring, using rules. In this rule-based implementation we remove aggressively, in the spirit of "when in doubt, remove it."
Here is a brief look at the removal rules we prepared.
- If the first character is the same kanji and everything after it is hiragana, treat the words as synonyms that differ in inflection and remove the duplicate
- Remove words that differ only in hiragana versus katakana
- Remove words where one is a substring of the other
The first rule removes duplicates such as 「痛い/痛み」 and 「張〓/張〓」 from the list. It catches words that differ only in inflection or only in part of speech.
The second rule removes duplicates such as 「〓がん/〓ガン」 from the list. Katakana/hiragana variation is especially common in medical terms, so handling it is essential.
The third rule removes duplicates such as 「妊娠/妊娠中」 and 「鬱/鬱病」 from the list. This rule can also remove words with different meanings. For example, it rejects the pair 「癌/抗癌剤」. Still, in the spirit of "when in doubt, remove it," we remove such cases as well.
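The three rules can be sketched in Python as below. This is our illustrative reading of the rules rather than the production code: the function names are hypothetical, kanji are detected with the basic CJK Unified Ideographs block only, and keeping the first-seen (higher-ranked) keyword of each duplicate pair is an assumption.

```python
def is_kanji(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_hiragana(s: str) -> bool:
    """True if every character is hiragana (an empty string also passes)."""
    return all("\u3041" <= ch <= "\u309f" for ch in s)


def to_hiragana(s: str) -> str:
    """Map katakana letters to hiragana so that がん and ガン compare equal."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch for ch in s
    )


def is_duplicate(a: str, b: str) -> bool:
    # Rule 1: same leading kanji, and both remainders are hiragana only.
    if a[0] == b[0] and is_kanji(a[0]) and is_hiragana(a[1:]) and is_hiragana(b[1:]):
        return True
    # Rule 2: identical once katakana is folded into hiragana.
    if to_hiragana(a) == to_hiragana(b):
        return True
    # Rule 3: one keyword is a substring of the other.
    return a in b or b in a


def remove_similar(keywords: list[str]) -> list[str]:
    """Walk the ranked list and drop anything that duplicates a kept keyword."""
    kept: list[str] = []
    for kw in keywords:
        if kw and not any(is_duplicate(kw, k) for k in kept):
            kept.append(kw)
    return kept
```

For example, remove_similar(["痛い", "痛み", "がん", "ガン", "妊娠", "妊娠中"]) keeps only the first member of each pair, and, as noted above, the substring rule would also drop 抗癌剤 whenever 癌 is already in the list.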
Summary of similar-word removal
The table below shows method (1) evaluated in three ways: (1-1) no similar-word removal, (1-2) word embeddings, and (1-3) rule-based removal.
With word embeddings, we were able to bring metric 5 (the probability that, for a given related keyword, another related keyword appears whose results overlap with it in at least 8 items) down from 5.11% to 3.48%. With rule-based removal, it dropped from 5.11% to 2.54%. The rule-based approach is about one percentage point better than the word-embedding approach.
In terms of implementation cost, using word embeddings would require our team to build and run a Python microservice, so for this implementation we adopted rule-based removal.
Although we passed on word embeddings for now, there are patterns that rules cannot catch, so it is entirely possible that we will put them into production later, after improving accuracy through parameter tuning and similar work.
Outcome
In the end, we judged that the quality was good enough for an initial release while keeping implementation cost low, and we adopted a combination of "counting the words that appear in the logs" and "rule-based similar-word removal."
Summary
In this article we investigated how cheaply a related-keywords feature can be implemented with Elasticsearch and introduced the method we actually adopted. Please note that the method we adopted may not suit every search application.
Going forward, we would like to update the algorithm so that it can recommend related keywords using data other than query strings, such as hit counts, conversion counts, and personal data.
We're hiring !!!
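M3 is looking for engineers who want to move healthcare forward by developing and improving our search platform. Within the company, the search team recently started an "Elasticsearch & Lucene code reading group," and there is lively discussion about how search works.
If you think you might like to hear a bit more, get in touch here!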
Reference
[1] "Implementing Japanese suggest with Elasticsearch" (Elasticsearchで日本語のサジェストの機能を実装する), official Elasticsearch blog