quantumblack is such a cool name.
A while ago I introduced kedro, which quantumblack develops as open source.
A kedro developer then noticed the post and responded to it, which was a great honor.
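This time I am trying causalnex, which the same company also develops as open source.
The reason is simple.
The logo is cool.
I am still at the tutorial stage, but even that much seems practical enough.
For me, at least, it is powerful. If my work OS were Ubuntu, this would have been a complete win.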
The GitHub repository is here:
github.com
I basically follow the official tutorial.
Progress so far: up to building the causal graph.
The tutorial goes on to build a Bayesian network,
but I have not caught up that far, so that part is on hold for now.
Preparation
causalnex appears to use graphviz for visualization.
graphviz is a tool package for drawing network diagrams, and it has to be installed separately.
I am on Ubuntu 20.04, so it looks like the following.
Note that visualization fails without the development package, so I install that as well*1.
sudo apt install graphviz
sudo apt install graphviz-dev
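Since a missing Graphviz install only shows up later as a failed plot, a quick check up front can save some time. The following is a minimal sketch, assuming the `dot` executable is what the install puts on PATH; the helper name `graphviz_available` is made up for illustration.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if graphviz_available():
    print("Graphviz 'dot' found on PATH")
else:
    print("Graphviz 'dot' not found; install graphviz first")
```

This only confirms that the binary is present. It says nothing about the development package, which matters when pygraphviz is built.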
I create a virtual environment with pipenv and write code in VSCode,
so I run the following in the working directory to install into the pipenv environment.
At this point I also install pandas, matplotlib, pygraphviz, and so on.
Visualization also uses Image from IPython.display,
so install that as well.
pipenv install causalnex
# ...then install whatever other libraries you like.
# Once the Pipfile is locked, use shell/run
pipenv shell
I actually got stuck here, so this is the only part I wrote up. For the rest, please read the source on GitHub (hey).
Drawing a causal diagram with causalnex
Following the tutorial, I use the student data provided by UCI.
The variables it contains can be seen by following the link, so see there for details.
The data mixes numeric and categorical variables, and it might be a good fit for SEM.
In the causalnex tutorial, sex, age, school, and similar variables are treated as "sensitive"
and removed from the model*2.
Here is what I have up to this point.
# %% load data
import pandas as pd

student_por = pd.read_csv('../sample_data/student-por.csv', sep=';')
student_por.shape
# >> (649, 33)

# %%
# to drop sensitive features (to avoid statistical discrimination)
drop_col = ['school', 'sex', 'age', 'Mjob', 'Fjob', 'reason', 'guardian']
student_por = student_por.drop(columns=drop_col)
This time I label-encode the categorical variables.
How to handle categorical variables also depends on how you want to turn them into a graph,
so I think it should be decided according to the problem design in actual work.
This is the kind of coding you often see in Kaggle baseline models.
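# %% preprocessing
# label encoding
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# every non-numeric column is treated as categorical
categorical_features = list(
    student_por.select_dtypes(exclude=[np.number]).columns)
categorical_features

# %%
# the same encoder is refit per column, so the individual mappings are not kept
for cat_col in categorical_features:
    student_por[cat_col] = le.fit_transform(student_por[cat_col])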
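To make it concrete what label encoding does to a column, here is a minimal pure-Python sketch. It assumes the same convention as scikit-learn's LabelEncoder, where the distinct values are sorted and numbered from 0; `label_encode` is a hypothetical helper, not part of any library.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


codes, classes = label_encode(["yes", "no", "yes", "no", "yes"])
print(codes)    # [1, 0, 1, 0, 1]
print(classes)  # ['no', 'yes']
```

The codes carry an order that comes only from sorting the labels, which is worth keeping in mind when the encoded column is later treated as a number.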
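Next is building the causal diagram.
causalnex can build a graph using the No-Tears (NOTEARS) algorithm.
I will read up on the algorithm later. I really will, yes...
It does take a few minutes, though. I do not know whether the rows or the columns are responsible,
but since it takes this long at this scale, it is probably not something to run carelessly.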
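# %% NOTEARS algorithm structure
# this may take a few minutes to compute.
from causalnex.structure.notears import from_pandas
from causalnex.plots import plot_structure, NODE_STYLE, EDGE_STYLE
from IPython.display import Image

no_tears_sm = from_pandas(student_por)

# %% visualize
viz = plot_structure(no_tears_sm,
                     graph_attributes={"scale": "0.5"},
                     all_node_attributes=NODE_STYLE.WEAK,
                     all_edge_attributes=EDGE_STYLE.WEAK)
Image(viz.draw(format='png'))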
In my environment this ended with an OSError*3.
The official tutorial shows an image like this at this step.
The cause is apparently not that the graph image is large. By default the result turns into a complete graph. No wonder it falls over on a local PC.
As in the tutorial, let's set a threshold and cut off the weak edges.
The cutoff is done by setting a threshold with remove_edges_below_threshold.
# %%
no_tears_sm.remove_edges_below_threshold(0.8)
viz = plot_structure(no_tears_sm,
                     graph_attributes={"scale": "0.5"},
                     all_node_attributes=NODE_STYLE.WEAK,
                     all_edge_attributes=EDGE_STYLE.WEAK)
Image(viz.draw(format='png'))
Then...
Oh... much cleaner.
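The idea behind the cutoff can be shown on a plain dictionary of edge weights. This is only a sketch of the concept, assuming an edge is dropped when the absolute value of its weight is below the threshold (my reading of the method, not verified against its source); the weights are made up and `remove_edges_below` is a hypothetical helper.

```python
def remove_edges_below(edges, threshold):
    """Keep only the edges whose absolute weight is at least `threshold`."""
    return {pair: w for pair, w in edges.items() if abs(w) >= threshold}


# Hypothetical weights, not values learned from the student data
edges = {
    ("a", "b"): 1.2,
    ("b", "c"): -0.9,
    ("a", "c"): 0.05,
    ("c", "d"): 0.3,
}
pruned = remove_edges_below(edges, 0.8)
print(sorted(pruned))  # [('a', 'b'), ('b', 'c')]
```

A strong negative weight survives the cut just like a strong positive one, since only the magnitude is compared.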
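Now for some further touching up. Looking at the graph, an arrow runs from higher (whether the student wants higher education) to Medu (the mother's education).
A structure in which the student's own wish for higher education determines the mother's education is a bit hard to explain.
Perhaps the real relationship runs in the opposite direction, or perhaps there is none.
This time I impose the constraint that there is no relationship*4.
I also add and remove arrows for a few other variables here in one go.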
# %% cut the relationship between higher and Medu
no_tears_sm = from_pandas(student_por,
                          tabu_edges=[('higher', 'Medu')],
                          w_threshold=0.8)

# %% add or remove edges
# the column in the UCI data is named 'failures' (plural)
no_tears_sm.add_edge('failures', 'G1')
no_tears_sm.remove_edge('Pstatus', 'G1')
no_tears_sm.remove_edge('address', 'G1')
viz = plot_structure(no_tears_sm,
                     graph_attributes={"scale": "0.5"},
                     all_node_attributes=NODE_STYLE.WEAK,
                     all_edge_attributes=EDGE_STYLE.WEAK)
Image(viz.draw(format='png'))
As a result, the relationship between higher and Medu is properly cut,
and a few relationships were added and removed.
Looking closely, there are also points with no arrows at all,
and Dalc and Walc form a graph only between themselves.
I decided not to use these variables for the Bayesian network that (probably) follows,
and extract the largest subgraph.
# %% get the largest subgraph.
no_tears_sm = no_tears_sm.get_largest_subgraph()
viz = plot_structure(no_tears_sm,
                     graph_attributes={"scale": "0.5"},
                     all_node_attributes=NODE_STYLE.WEAK,
                     all_edge_attributes=EDGE_STYLE.WEAK)
Image(viz.draw(format='png'))
Here is the result.
Easier to read now.
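What "largest subgraph" means here can be sketched in plain Python: group nodes that are connected when edge direction is ignored, then keep the biggest group. This is an illustration of the idea under that assumption, not the library's implementation; the edge list is made up and `largest_weak_component` is a hypothetical helper.

```python
from collections import defaultdict, deque


def largest_weak_component(edges):
    """Return the edges that belong to the largest weakly connected component."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)  # ignore direction when grouping nodes

    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # breadth-first search collects every node reachable from `start`
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return [(u, v) for u, v in edges if u in best]


# Illustrative edge list, not the learned structure
edges = [("Dalc", "Walc"), ("higher", "G1"), ("G1", "G2"), ("G2", "G3")]
print(largest_weak_component(edges))
# [('higher', 'G1'), ('G1', 'G2'), ('G2', 'G3')]
```

A small island such as the Dalc and Walc pair drops out, and so does any node with no edges at all, because it never appears in the edge list.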
Eventually I will build a Bayesian network from this causal diagram.
...and that is where I stop for this time.
The model is fairly heavy to run, so I save it with pickle.
# %% save sm
import pickle

filename = '../output/no_tears_sm.pkl'
# the with block closes the file handle after writing
with open(filename, 'wb') as f:
    pickle.dump(no_tears_sm, f)
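Reading the saved file back in a later session is the mirror image of the dump. Below is a minimal round-trip sketch that uses a plain dictionary as a stand-in for the structure model; `save_object` and `load_object` are hypothetical helper names, and the temporary directory is only there to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Write `obj` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Read a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the learned structure model
stand_in = {"edges": [("G1", "G2")], "threshold": 0.8}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "no_tears_sm.pkl"
    save_object(stand_in, path)
    restored = load_object(path)
print(restored == stand_in)  # True
```

For the real model, the class has to be importable when loading, so causalnex needs to be installed in that environment too. Pickle files should only be loaded from sources you trust.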
Thoughts after finishing the run
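Well, I have not actually finished yet...
The APIs are well developed, and I find it quite interesting that you can arrive at a graph structure by repeating two steps on the causal diagram:
- generate it mechanically
- look at it and fix the parts that feel off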
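The generate-then-fix loop can be sketched on a plain edge list. The acyclicity check is my own addition: a Bayesian network needs a DAG, so it seems worth confirming that hand edits did not introduce a cycle. The edges are illustrative, `apply_edits` and `is_dag` are hypothetical helpers, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def apply_edits(edges, add=(), remove=()):
    """Return a new edge list after removing and adding edges by hand."""
    removed = set(remove)
    kept = [edge for edge in edges if edge not in removed]
    for edge in add:
        if edge not in kept:
            kept.append(edge)
    return kept


def is_dag(edges):
    """Return True if the directed edge list contains no cycle."""
    predecessors = {}
    for parent, child in edges:
        predecessors.setdefault(parent, set())
        predecessors.setdefault(child, set()).add(parent)
    try:
        TopologicalSorter(predecessors).prepare()
    except CycleError:
        return False
    return True


# Illustrative edges, not the structure learned in this post
learned = [("higher", "Medu"), ("Medu", "G1"), ("G1", "G2")]
revised = apply_edits(learned,
                      add=[("failures", "G1")],
                      remove=[("higher", "Medu")])
print(revised)          # [('Medu', 'G1'), ('G1', 'G2'), ('failures', 'G1')]
print(is_dag(revised))  # True
print(is_dag(revised + [("G2", "Medu")]))  # False
```

Each pass through the loop is one call to `apply_edits` followed by a look at the plot and a check like `is_dag`.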
Next time I plan to work on the Bayesian network,
but apparently linear and nonlinear regression models on a DAG can also be implemented.
The documentation also covers "causality," which I skipped this time, in more detail than I expected.
It is strictly a "causal discovery" algorithm, but with APIs this complete,
I have a feeling it can be applied in real work.
Caveats
causalnex is a powerful weapon, but in practice causal discovery has to be advanced slowly.
Indeed, drawing the arrows took a lot of computation time and computing resources.
Also, it seems that LiNGAM cannot be implemented with causalnex (?).
As the introduction says,
it is designed so that beginners (or rather, people who can express a causal model as a graph but cannot implement it)
can model easily, so the implementation appears to be kept simple.
Whether LiNGAM belongs in its scope is somewhat debatable, though.
So, for now, I guess I will try building a Bayesian network*5
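*1: It would be nice if this part were properly sorted out, but I do not really understand it.
*2: Presumably this is to avoid statistical discrimination.
*3: This is probably a problem with my machine's specs. I have raised it as an issue on the GitHub repository.
*4: Frankly, I personally think it would also be fine to assume the reverse causal direction. Maybe I will try that later.
*5: When I will implement it is undecided.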