I'm Suzuki, a part-time member of the Gunosy Data Analysis Department. In this post I'd like to summarize what I learned about anomaly detection for version releases using density ratios.
- What I want to do
- What I want to do in the very long term
- The idea behind anomaly detection with density ratios
- Implementation example 1 with dummy data
- Implementation example 2 with dummy data
- References
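What I want to do
When NewsPass (one of the products Gunosy provides) is updated to a new version, the goal is to find the cause of any anomaly from the user action logs and send a notification to Slack or a similar channel.
(I want to detect things outside the QA checklist, such as missing logs and unexpected user behavior caused by an update.)
Gunosy currently assigns people to check for anomalies whenever a new version is released. If anomalies could be found automatically and reliably, that would lighten the load on the employees, wouldn't it? That is the motivation for anomaly detection.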
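What I want to do in the very long term
This is out of the blue, but what is the best kind of system? One with no errors, or one that anticipates errors and resolves them on its own? I think it is the latter. According to one employee, the ultimate goal is to build a product that detects anomalies automatically and fixes them automatically. That really is artificial intelligence (lol). Thinking that I'm playing a part in the plan to turn NewsPass into an AI is a little exciting.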
The idea behind anomaly detection with density ratios
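The density ratio $r$ is defined as follows:
$$r(x) = \frac{p(x)}{q(x)}$$
where $p(x)$ and $q(x)$ are the probability densities of the two distributions being compared. When the two distributions match, r = 1.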
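As a quick illustration of the definition, the sketch below evaluates the ratio of two known normal densities. The helper names `normal_pdf` and `density_ratio` and the parameters N(100, 10) and N(100, 9) are assumptions chosen for illustration, not part of any library.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))


def density_ratio(x, mean_p, std_p, mean_q, std_q):
    """r(x) = p(x) / q(x) for two normal densities p and q."""
    return normal_pdf(x, mean_p, std_p) / normal_pdf(x, mean_q, std_q)


x = np.linspace(80.0, 120.0, 5)  # 80, 90, 100, 110, 120

# Identical distributions: the ratio is 1 everywhere.
same = density_ratio(x, 100.0, 10.0, 100.0, 10.0)

# Slightly different spread: the ratio drifts away from 1.
different = density_ratio(x, 100.0, 10.0, 100.0, 9.0)

print(same)
print(different)
```

With identical parameters the ratio is 1 at every point. With a slightly smaller spread in the denominator it is 0.9 at the center and rises above 1 toward the tails.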
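Anomaly detection here uses the distribution of the probability that a user takes a particular action (such as clicking an article). The premise is that this distribution is regular: if a new version of the app is released without bugs, roughly the same distribution should form.
If there is some kind of bug, the shape of the distribution is distorted. So by comparing the distribution from past releases with the distribution from the latest release, we may be able to measure whether that release contains a bug. That is the idea behind anomaly detection with density ratios.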
By the way, some readers may be wondering:
Why the density ratio of the distributions?
Wouldn't the mean or median of users' action counts do?
The short answer is no.
As the example of (500, 9500) and (5000, 5000) both having a mean of 5000 shows, the mean is simply not a good metric for spotting changes. On top of that, anomaly detection on logs has to pick out a small number of abnormal logs from a large number of normal ones, and mixing in a few abnormal logs hardly changes the mean. The same goes for the median. To find anomalies even from tiny changes, I judged the density ratio to be the better choice. (For a concrete case, see implementation example 2 with dummy data.)
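A small numerical check of this argument, as a sketch: the sample sizes and the N(130, 1) contamination are assumptions chosen to mirror the dummy data used later in this post, and `rng` is seeded only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 normal observations, then the same data with 10 anomalous ones added.
normal = rng.normal(100, 10, 10_000)
contaminated = np.append(normal, rng.normal(130, 1, 10))

mean_shift = abs(contaminated.mean() - normal.mean())
median_shift = abs(np.median(contaminated) - np.median(normal))

print(f"mean shift:   {mean_shift:.4f}")
print(f"median shift: {median_shift:.4f}")

# Same mean, very different data: (500, 9500) versus (5000, 5000).
mean_a = np.mean([500, 9500])
mean_b = np.mean([5000, 5000])
print(mean_a, mean_b)
```

Ten stray points among ten thousand move the mean and the median by far less than the data's standard deviation of 10, which is why neither works well as an alarm here.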
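Implementation example 1 with dummy data
The approach I tried this time
Instead of estimating the density ratio directly, I estimated each distribution separately and then divided one by the other to obtain the density ratio.
A density ratio of 1 means the distributions match, so I took
the mean of (1 - density ratio)^2
as the anomaly score.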
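Written out for histograms, this score can be expressed as follows. The symbols are assumptions introduced here: $\hat{p}_b$ and $\hat{q}_b$ are the relative frequencies of the normal and the abnormal sample in bin $b$, and $B$ is the number of bins.

```latex
\hat{r}_b = \frac{\hat{p}_b}{\hat{q}_b}, \qquad
\text{anomaly score} = \frac{1}{B} \sum_{b=1}^{B} \left( 1 - \hat{r}_b \right)^2
```

The score is 0 when the two histograms coincide and grows as the per-bin ratio moves away from 1 in either direction.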

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Two normal distributions with slightly different variances
    normal_data = np.random.normal(100, 10, 10000)
    abnormal_data = np.random.normal(100, 9, 10000)

    plt.hist(abnormal_data, bins=25, alpha=0.3, color='r', range=[80, 130])
    plt.hist(normal_data, bins=25, alpha=0.3, color='b', range=[80, 130])
    plt.show()

    normal_hist = plt.hist(normal_data, bins=25, alpha=0.3, color='b', range=[80, 130])
    abnormal_hist = plt.hist(abnormal_data, bins=25, alpha=0.3, color='b', range=[80, 130])
    x_range_list = list(normal_hist[1])
    true_x_range_list = []
    # In fencepost terms, plt.hist()[1] holds the posts (bin edges) and
    # plt.hist()[0] holds the gaps between them (counts), so their lengths
    # differ by 1. Hence len(x_range_list) - 1.
    for i in range(0, len(x_range_list) - 1):
        # Use the midpoint of each bin as its x value
        true_x_range_list.append((x_range_list[i] + x_range_list[i+1]) / 2)

    normal_data_df = pd.DataFrame()
    abnormal_data_df = pd.DataFrame()
    normal_data_df['action_count'] = true_x_range_list
    normal_data_df['frequency'] = normal_hist[0] / len(normal_data)
    abnormal_data_df['action_count'] = true_x_range_list
    abnormal_data_df['frequency'] = abnormal_hist[0] / len(abnormal_data)

abnormal_data_df
action_count | frequency |
---|---|
81 | 0.0133 |
83 | 0.0170 |
85 | 0.0269 |
87 | 0.0341 |
89 | 0.0441 |
91 | 0.0553 |
93 | 0.0602 |
95 | 0.0693 |
97 | 0.0763 |
99 | 0.0795 |
101 | 0.0787 |

    plt.plot(true_x_range_list, normal_data_df['frequency'] / abnormal_data_df['frequency'])
    plt.xlabel('action_count')
    plt.ylabel('density ratio')
    plt.show()

The closer the ratio is to 1, the more normal; the further it is from 1, the more anomalous.

    abnormality = np.mean((1 - normal_data_df['frequency'] / abnormal_data_df['frequency'])**2)
    abnormality
    # >>> 1.576268672846534

Whether this value counts as large or small is a criterion you have to set yourself. That is the hard part.
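One possible way to pick a threshold, offered as a sketch rather than a recommendation: compute the same score between the reference data and many bug-free samples, then flag anything above a high percentile of those scores. Here the bug-free samples are simulated; in practice they would have to come from past releases. The function name `score`, the 200 repetitions, and the 99th percentile are all assumptions.

```python
import numpy as np


def score(reference, target, bins=25, value_range=(80, 130)):
    """Mean of (1 - density ratio)^2 over histogram bins.

    Bins where the target histogram is empty are skipped, because the
    ratio is undefined there (an assumption of this sketch).
    """
    ref, _ = np.histogram(reference, bins=bins, range=value_range)
    tgt, _ = np.histogram(target, bins=bins, range=value_range)
    ref = ref / len(reference)
    tgt = tgt / len(target)
    valid = tgt > 0
    return float(np.mean((1.0 - ref[valid] / tgt[valid]) ** 2))


rng = np.random.default_rng(0)
reference = rng.normal(100, 10, 10_000)

# Scores between the reference and 200 simulated bug-free releases.
null_scores = [score(reference, rng.normal(100, 10, 10_000)) for _ in range(200)]

# Hypothetical rule: flag a release whose score exceeds the 99th percentile.
threshold = float(np.percentile(null_scores, 99))

candidate = rng.normal(100, 9, 10_000)  # release with a different spread
candidate_score = score(reference, candidate)
print(threshold, candidate_score, candidate_score > threshold)
```

Because the score is dominated by sparsely populated tail bins, a threshold obtained this way can be fairly loose, so whether a given release clears it should not be taken for granted.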
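The approach I plan to try from here on
I use the densratio_py package. This approach estimates the density ratio directly, without estimating the individual distributions. The approach described above, which estimates each distribution and then computes the ratio, has a drawback: dividing one estimated quantity by another can make the error very large. If the density ratio can be estimated directly, this problem does not arise. You'd definitely want to estimate it directly, right? (lol)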
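To see why the quotient of two estimated histograms can be unreliable, this sketch compares the per-bin ratio of counts with the ratio of the true densities. The setup (N(100, 10) versus N(100, 9), 25 bins on [80, 130]) follows the example above; the helper `normal_pdf` and the seeded `rng` are assumptions for illustration.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(0)
edges = np.linspace(80, 130, 26)  # 25 bins
centers = (edges[:-1] + edges[1:]) / 2

x = rng.normal(100, 10, 10_000)
y = rng.normal(100, 9, 10_000)
x_counts, _ = np.histogram(x, bins=edges)
y_counts, _ = np.histogram(y, bins=edges)

# Ratio of the two estimated histograms (empty denominator bins become NaN).
# Both samples have the same size, so the count ratio equals the frequency ratio.
with np.errstate(divide="ignore", invalid="ignore"):
    estimated = np.where(y_counts > 0, x_counts / y_counts, np.nan)

# Ratio of the true densities, evaluated at the bin centers.
true_ratio = normal_pdf(centers, 100, 10) / normal_pdf(centers, 100, 9)

for c, n, est, true in zip(centers, y_counts, estimated, true_ratio):
    print(f"x={c:6.1f}  denominator count={int(n):5d}  "
          f"estimated={est:6.2f}  true={true:6.2f}")
```

In the central bins, where hundreds of samples fall, the estimate stays close to the true ratio. In the tail bins the denominator count is small, so the estimate can swing widely. Direct estimation, as in the densratio example that follows, avoids forming this quotient bin by bin.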

    import numpy as np
    from densratio import densratio

    x = np.random.normal(100, 10, 10000)
    y = np.random.normal(100, 9, 10000)
    result = densratio(x, y)
    print(result)

    result.compute_density_ratio(y)
    # Anomaly score of each point: negative log of the estimated density ratio
    abnormality = -np.log(result.compute_density_ratio(y))
    abnormality
    # >>> array([ 0.00788939, -0.00016233, -0.00328684, ..., -0.00290122, -0.00187464, 0.01139688])
    np.mean(abnormality)
    # >>> 0.0038648076997555504

That is the flow for computing an anomaly score with least-squares density-ratio estimation. For the detailed theory, see the references below.
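As a reading aid, the averaged score can be written out explicitly. This is a sketch under the assumption that `densratio(x, y)` estimates the ratio of the density of `x` (written $p$ here) to the density of `y` (written $q$); the symbols are chosen here for illustration and are not the package's own notation.

```latex
a(y_i) = -\ln \hat{r}(y_i), \qquad \hat{r}(y) \approx \frac{p(y)}{q(y)}

\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a(y_i)
        \approx \mathbb{E}_{y \sim q}\!\left[ \ln \frac{q(y)}{p(y)} \right]
        = \mathrm{KL}(q \,\|\, p) \ge 0
```

Under that assumption the mean score approximates the Kullback-Leibler divergence $\mathrm{KL}(q \,\|\, p)$, which is 0 when the two distributions match and grows as they diverge.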
Implementation example 2 with dummy data
Let's try anomaly detection on dummy data that is closer to a real anomaly. A common bug right after releasing a new version of the app is that odd logs get mixed in (for example, a particular log being sent multiple times). When that happens, the distribution of actions becomes multimodal.
So let's compute the density ratio for a multimodal distribution.
Using the mean squared error of the density ratio

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Build a multimodal distribution: a normal distribution plus a second
    # normal distribution with a different mean
    normal_data = np.random.normal(100, 10, 10000)
    abnormal_data = np.append(np.random.normal(100, 10, 10000), np.random.normal(130, 1, 10))

    plt.hist(abnormal_data, bins=25, alpha=0.3, color='r', range=[80, 150])
    plt.hist(normal_data, bins=25, alpha=0.3, color='b', range=[80, 150])
    plt.show()

I added 10 anomalous data points around x = 130, but judging from the histogram alone, nothing looks wrong with the distribution.

    normal_hist = plt.hist(normal_data, bins=25, alpha=0.3, color='b', range=[80, normal_data.max()])
    abnormal_hist = plt.hist(abnormal_data, bins=25, alpha=0.3, color='b', range=[80, normal_data.max()])
    x_range_list = list(normal_hist[1])
    true_x_range_list = []
    # Use the midpoint of each bin as its x value
    for i in range(0, len(x_range_list) - 1):
        true_x_range_list.append((x_range_list[i] + x_range_list[i+1]) / 2)

    normal_data_df = pd.DataFrame()
    abnormal_data_df = pd.DataFrame()
    normal_data_df['action_count'] = true_x_range_list
    normal_data_df['frequency'] = normal_hist[0] / len(normal_data)
    abnormal_data_df['action_count'] = true_x_range_list
    abnormal_data_df['frequency'] = abnormal_hist[0] / len(abnormal_data)

normal_data_df
action_count | frequency |
---|---|
81.21 | 0.0168 |
83.62 | 0.0232 |
86.03 | 0.0370 |
88.44 | 0.0495 |
90.85 | 0.0598 |
93.26 | 0.0798 |
95.67 | 0.0821 |
98.08 | 0.0960 |
100.49 | 0.0994 |
102.90 | 0.0924 |
105.31 | 0.0858 |
107.72 | 0.0689 |
110.13 | 0.0580 |
112.54 | 0.0407 |
114.96 | 0.0305 |
117.37 | 0.0250 |
119.78 | 0.0148 |
122.19 | 0.0083 |
124.60 | 0.0042 |
127.01 | 0.0022 |
129.42 | 0.0014 |
131.83 | 0.0007 |
134.24 | 0.0004 |
136.65 | 0.0003 |
139.06 | 0.0002 |
abnormal_data_df
action_count | frequency |
---|---|
81.21 | 0.015684 |
83.62 | 0.026773 |
86.03 | 0.032967 |
88.44 | 0.051548 |
90.85 | 0.063237 |
93.26 | 0.072428 |
95.67 | 0.089311 |
98.08 | 0.089910 |
100.49 | 0.098202 |
102.90 | 0.090509 |
105.31 | 0.084615 |
107.72 | 0.069530 |
110.13 | 0.060539 |
112.54 | 0.044356 |
114.96 | 0.034466 |
117.37 | 0.020080 |
119.78 | 0.014286 |
122.19 | 0.008991 |
124.60 | 0.003796 |
127.01 | 0.002897 |
129.42 | 0.001798 |
131.83 | 0.000799 |
134.24 | 0.000200 |
136.65 | 0.000100 |
139.06 | 0.000100 |

    plt.plot(true_x_range_list, normal_data_df['frequency'] / abnormal_data_df['frequency'])
    plt.xlabel('action_count')
    plt.ylabel('density ratio')
    plt.show()

The anomaly that was invisible in the histogram is obvious at a glance in the density-ratio plot!

    abnormality = np.mean((1 - normal_data_df['frequency'] / abnormal_data_df['frequency'])**2)
    abnormality
    # >>> 0.25229624295868402

The anomaly score came out to 0.25. Whether this value is appropriate looks like another hard call (lol).
Estimating the density ratio directly

    import numpy as np
    from densratio import densratio

    x = np.random.normal(100, 10, 10010)
    y = np.append(np.random.normal(100, 10, 10000), np.random.normal(130, 1, 10))
    result = densratio(x, y)
    print(result)

    result.compute_density_ratio(y)
    # Anomaly score of each point: negative log of the estimated density ratio
    abnormality = -np.log(result.compute_density_ratio(y))
    abnormality
    # >>> array([-0.02102919, -0.02234817, 0.05280343, ..., 0.1220773 , 0.12499095, 0.12314303])
    np.mean(abnormality)
    # >>> 0.0015096024596322134

Computed this way, the anomaly score is small in order of magnitude. The score when only the variance was changed was 0.0038648076997555504, so 0.0015096024596322134 is an even less striking value. It is hard to say anything definite (lol). Please set your threshold wisely.