ããã«ã¡ã¯ã
æ¬å½ã«ã¯ã½ãã¾ãããªãã§ããã©ãã¢ã³ãµã³ãã«ææ³ã¨ãåå¼·ãã¦ããã復ç¿ããããªã£ã¦ããã®ã§ãããã§å¾©ç¿ãããã¦ã¾ã¨ãã¦ããã¾ãã
決å®æ¨ã¨ã¯
決å®æ¨ã®æ¦è¦³
決å®æ¨ã¯ããããæ©æ¢°å¦ç¿ã¨ãããã£ããã¨ããã人ãªã確å®ã«ä¸åã¯è¦ãã使ã£ããèããããããã®ã§ã¯ãªããã¨æãã¾ãã決å®æ¨ã¯ãããªã¼æ§é ã®å½¢ã§åé¡ã¾ãã¯å帰ã®ã¢ãã«ãæ§ç¯ãããã®ã§ãã決å®æ¨ã§ã¯ããã®ã¢ã«ã´ãªãºã ã«ãã£ã¦ã¯ãã«ãã´ãªã¨æ°å¤ã®ä¸¡æ¹ã®ãã¼ã¿ãæ±ããã¨ãã§ãã¾ãã
åèï¼Decision tree learning - Wikipedia
決å®æ¨ã®ç®çã¯ãå¤æ°ã¨ãã®å¤ã®çµã«ãã£ã¦è¡¨ç¾ããããã¼ã¿ãããã¤ãã®ãµãã»ããã«åå²ãã¦ãããã¨ã«ããã¾ããããã¦åå²ãã¦ããéç¨ã§ãããæ¨ã®æ·±ããæ·±ããããä¸ã¤æ±åæ§è½ã®é«ãåºåãè¿ãããããªæ¨ãæ§ç¯ãã¦ãããã¨ãã´ã¼ã«ã«ãªãã¾ãã
決å®æ¨ã¯ãã®åã®éã"æ¨"ãªã®ã§ãæ¨ã«ãã£ã¦ä»¥ä¸ã®ããã«åå²ã表ç¾ãã¾ãã
- éçµç«¯ãã¼ãï¼å±æ§ã®ã©ãã«ãä»ãããã
- æï¼éçµç«¯ãã¼ãããåºã¦ããæã«ã¯ããã®å±æ§ã®åãããå¤ã示ããã¦ãã
- èï¼æçµçãªåé¡çµæã®å¤
決å®æ¨ã®ã¡ãªããã»ãã¡ãªãã
çµæã®å¯èªæ§ã¨ãã観ç¹ã§ãã®ææ³ã®å³ã«åºããã®ã¯ãªãããããªããã¨ããããããããããããã§ãã
ãããããæ¨ã§è¡¨ç¾ãã¦ããã¦ãåå²ã®åºæºãã©ããªã£ã¦ãããããåå²ã®åºæºã¨ãªããã®ãä¸ããé çªã«åªå 度ãä»ãã¦ããã®ã§ãããã¾ãããããããã§ãã
ã¾ãããããã®ã¹ã±ã¼ã«ã«ããã»ã©ä¾åãããªãã¨ããç¹ã決å®æ¨ãä½ãã¨ãã®ã¡ãªããã¨è¨ãã¾ãã
ãã ãå¦ç¿ãã¼ã¿ã¸ã®ä¾åãæ¿ããã並ååãªã©ãã§ããªãã¨ãããã¡ãªãããããã¾ãã
åè
åºæ¬ã¨ãªãã¢ã«ã´ãªãºã
ãã¨ã§ç´¹ä»ããããã«ã決å®æ¨ã«ã¯è²ã 種é¡ããããããªãã§ãããæãåºæ¬ã¨ãªãã¢ã«ã´ãªãºã ã§ããID3 (Iterative Dichotomiser 3)ã«ã¤ãã¦ããã§ã¯ç´¹ä»ãã¾ãã
ããã§ç´¹ä»ããã¢ã«ã´ãªãºã ã¯こちらでも簡潔にわかりやすく書かれていますã
ID3ã¯ã1986 å¹´ã«ãRoss Quinlan ã«ãã£ã¦éçºããã¾ãããéæ»ããããªãã¨ããå¶ç´ã®ä¸ã§å®ç¾å¯è½ãªãã©ã³ã空éãã ããã¹ãããã«ããã¦å±æçã«è©ä¾¡ã®é«ããã¿ã¼ã³ãé¸æãããããããã¦ã³ã®è²ªæ¬²ãªæ¢ç´¢ããã¾ãã
æ¢ç´¢ãè¡ã£ã¦ããéç¨ã§ãã©ã®å¤æ°ã®ã©ããªå¤ã使ã£ã¦åå²ãã¦ããããéè¦ã«ãªã£ã¦ããããã§ãããªãã¡èå¥è½åã大äºã«ãªã£ã¦ãã¾ãã
ããã§ã¯ã©ããã£ã¦èå¥è½åãå®ç¾©ãããã大äºãªã®ã ããID3ã¯Entropyã¨Information Gainã使ç¨ãã¦æ±ºå®æ¨ãæ§ç¯ãã¦ããã¾ãã
ã¨ã³ãããã¼
æ å ±ã®ä¹±éãã表ãææ¨ã¨ãã¦ã¨ã³ãããã¼ã¨ãããã®ãããã¾ããã¨ã³ãããã¼ã¯ä»¥ä¸ã®å¼ã§å®ç¾©ããã¾ãã
$$
I(p_1, ..., p_n) = \sum_{i=1}^{n} P_i \log_2 \frac{1}{P_i}
$$
ã¡ãªã¿ã«åºæ°ã¯2ã§ãªãã¦ãè¯ãã¦ãåé¡ããã¯ã©ã¹ã®æ°ã¨çããããã¨å¤ã®æ大å¤ã1ã«ãªã£ã¦ä¾¿å©ãã©ã®ãããªå¤ã®æã«ãã®ã¨ã³ãããã¼ãé«ããªãã®ããã¿ãããã«ããã®å¼ãã³ã¤ã³æãã®ä¾ã«å¾ã£ã¦å¯è¦åããã¾ãã
# coding: utf-8 import numpy as np import matplotlib.pyplot as plt # 表ã®åºã確ç P1 = np.linspace(0, 1, 1000) # è£ã®åºã確ç P2 = 1 - P1 # ã¨ã³ãããã¼ I = lambda p1, p2: -p1 * np.log2(p1) - p2 * np.log2(p2) # ã¨ã³ãããã¼ã®å¤ã®è¨ç® result = np.array([I(p1, p2) for p1, p2 in zip(list(P1), list(P2))]) result[np.isnan(result)] = 0 # å¯è¦å plt.scatter(P1, result) plt.show()
ä¸ãå®è¡ãããã®ã以ä¸ã®å³ã
ãã®å³ãããã¨ã³ãããã¼ã®å¤ã大ãããªãã®ã¯ã表ãè£ãã©ã¡ããã§ããããããªãæã¨ãããã¨ããããã¾ããããªãã¡ãæ å ±ãããã©ã³ãã ã«è¿ãã»ã©ãå¤ã大ãããªããéã«æ å ±ãããã©ã³ãã ã§ç¡ãåã£ã¦ããæãå¤ãå°ãããªãã¾ãã極端ãªè©±ããã¼ãã®ãµã³ãã«ãå ¨ã¦åãã¯ã©ã¹ã«å±ãã¦ããå ´åã¯ãã¨ã³ãããã¼ã¯0ã¨ãªãã¾ãã
ããã使ã£ã¦æ±ºå®æ¨ã®åãã¼ãã«ããã¦èå¥åãæ±ããã©ã®ãããªèå¥ãè¡ããã決ãããããªã®ã§ãããããã¯æ å ±éã²ã¤ã³ã§è©±ãã¾ãã
æ å ±éã²ã¤ã³
決å®æ¨ã®æ§é ã決å®ãã¦ããéç¨ã§ã¯ãèãã¼ãã«ã¦åé¡çµæãè¿ããã¨ãç®æ¨ãªã®ã§ãåºæ¥ãéãèãã¼ãã§ã¯çµæãåã£ã¦ãã¦ã»ããã¨ãªãã¾ãã
ã¤ã¾ããã¨ã³ãããã¼ãæ ¹ãã¼ãããèãã¼ãã«è¡ãã«é£ãã¦å¾ã ã«ä¸ãã£ã¦ãã£ã¦ã»ããããã§ãããã®æããããã¼ããããããã¼ãã«åå²ããæã«ãããã®ã¨ã³ãããã¼ã®å·®ãæ å ±éã²ã¤ã³ã¨ããããã®æ å ±éã²ã¤ã³ã®å¤ã大ãããã°å¤§ããã»ã©ããè¯ãåå²ãã§ãã¦ããã¨ãããã¨ã«ãªãã¾ãã
æ å ±éã²ã¤ã³ã®å¼ã¯ä»¥ä¸ã®ã¢ã«ã´ãªãºã ã§åç §ãã¾ããID3ã®ã¢ã«ã´ãªãºã ã§ã¯ããã®æ å ±éã²ã¤ã³ãæ大ã«ãªãåå²ã®ããã®å¤æ°ã¨ãã®å¤ã貪欲ã«æ¢ãã¦ããã¾ãã
ID3ã®ã¢ã«ã´ãªãºã
ã¢ã«ã´ãªãºã ã«ã¤ãã¦ã¯ãこちらの方の記事をベースにさせていただきました(ãããã¨ããããã¾ã)ã
1. ã«ã¼ããã¼ã $N$ ãä½æãã¦å ¨ã¦ã®è¨ç·´ãã¼ã¿éåã $N$ ã«æå±ãããã
2. ãã $N$ ã«æå±ããã¯ã©ã¹ãå ¨ã¦åãæ±ºå® $X$ ãä¸ãããªã $N$ ã« $X$ ã¨ã©ãã«ä»ããã¦å¦çãçµäºããã
3. è¨ç·´ãã¼ã¿éå $C$ ã«å¯¾ããã¨ã³ãããã¼ãæ±ãããå³ã¡ä»¥ä¸ã®å¼ãè¨ç®ããã
$$
M(C)=- \sum_{x \subset D} p_x(C) \log p_x(C)
$$
4. $C$ ãå¤æ° $a_i$ ã®å¤ã«å¿ãã¦åå²ããã$a_i$ ã $v_1 ⦠v_n$ ã® $n$ éãã®å¤ãæã¤å ´åã¯ä»¥ä¸ã®éãã«åå²ããã
$$
C_{ij} \subset C (a_i = v_j)
$$
5. åå²ãã $C_{ij}$ ã«å¿ãã¦å¹³åæ å ±éãæ±ããã
$$
M(C)=-\sum_{x \subset D} p_x(C_{ij}) log p_x(C_{ij})
$$
6. è¨ç®ããå¹³åæ å ±éããç¬ç«å¤æ° $a_i$ ã®å¹³åæ å ±éã®æå¾ å¤ $M_i$ ã以ä¸ã®å¼ã§æ±ããã
$$
M_i= M(C) - \sum_{j = 1}^n M(C_{ij}) \times \frac{\mid C_{ij} \mid }{ \mid C\mid}
$$
â»ä¸ã®å¼ãæ å ±éã²ã¤ã³ã®å¼ã«ãªãã¾ãããã®å¼ãè¦ãã¨ã親ãã¼ãããåãã¼ãã«åã£ãæã®ã¨ã³ãããã¼ã®å·®ã®æå¾ å¤ã¨ãããæ å ±éã²ã¤ã³ã®ãå¼ããèªã¿åããã¨æãã¾ãã
7. $M_i$ ãæ大ã¨ãªãç¬ç«å¤æ°ã $a_k$ ã¨ããã
8. $N$ ã®ã©ãã«ã $a_k$ ã¨ãã¦ã$N$ ã®åãã¼ã $N_j$ ãä½æããããããã«$C_{kj}$ ãæå±ãããã
9. ããããã®åãã¼ãã«å¯¾ã㦠$N = N_j$, $C = C_{kj}$ ã¨ãã¦ã2 以ä¸ã®æä½ãå帰çã«è¡ãã
ãããã決å®æ¨ã®åºæ¬ã¨ãªãã¢ã«ã´ãªãºã ã§ãããã®ããã«ä½ãããã®åºæºã«å¾ã£ã¦ãã¼ãã®å¤æ°ã¨ãã®åå²åºæºãé¸ãã§ããã決å®æ¨ãæ§ç¯ãã¦ããã¾ãã
決å®æ¨ã®ã¢ã«ã´ãªãºã ã®ç¨®é¡
決å®æ¨ã®ã¢ã«ã´ãªãºã ã®æ¦è¦³
決å®æ¨ã®ã¢ã«ã´ãªãºã ã«ã¯ããã¤ã種é¡ãããã®ã§ããããã®ã¢ã«ã´ãªãºã ã®ç¨®é¡ã¨éããç°¡åã«è¡¨ã§ã¾ã¨ãã¦ããããã¨æãã¾ãã
ID3 | C4.5(C5.0) | CART | CHAID | |
ç®çå¤æ°ã®å | é¢æ£å¤ | é¢æ£å¤ | é¢æ£å¤,é£ç¶å¤ | é¢æ£å¤,é£ç¶å¤ |
説æå¤æ°ã®å | é¢æ£å¤ | é¢æ£å¤,é£ç¶å¤ | é¢æ£å¤,é£ç¶å¤ | é¢æ£å¤,é£ç¶å¤ |
åå²ã®æ° | å¤åå²å¯è½ | å¤åå²å¯è½ | ï¼åå²ã®ã¿ | å¤åå²å¯è½ |
è©ä¾¡åºæº | æ å ±éã²ã¤ã³ | æ å ±éã²ã¤ã³ | Giniä¿æ° | ã«ã¤ï¼ä¹å¤ |
æ¬ æå¤ã¸ã®å¯¾å¿ | ä¸å¯ | å¯ | å¯ | å¯ |
æåãããã | ããªã | ãã | ãã | ãã |
大ããç°ãªãã®ã¯ãç®çå¤æ°ã®åã¨ãã¦ä½¿ç¨å¯è½ãªãã®ã¨ããã¼ãã®åå²æ¡ä»¶ã決å®ããéã®è©ä¾¡åºæºã¨ããã¨ãã§ãããããããããã®ææ³ã®ã¡ãªããããã¡ãªããã«ã¤ãã¦ã¯ä»¥ä¸ã®åã¢ã«ã´ãªãºã ã§è§¦ãããã¨æãã¾ãã
決å®æ¨ã¢ã«ã´ãªãºã ã®é¸ææ¹æ³
ã©ã®æ±ºå®æ¨ã®ã¢ã«ã´ãªãºã ãé¸ã¶ãã¯ã以ä¸ã®è¦³ç¹ã§é¸ã¶ã¨è¯ãããã§ãã
- Will tree be binary or not?
- Which attribute will be node?
- How missing data will be handled?
- If tree becomes too large,how to prune it?
åèï¼What are the differences between ID3, C4.5 and CART? - Quora
ä¸è¨ã®åãã«ãä¸ãããããã¼ã¿ã»ããã課é¡ãåºã«ã©ã®ããã«çãããã§ãã¢ã«ã´ãªãºã ãé¸ã¹ã°ããã¨æãã¾ããããããã®ã¢ã«ã´ãªãºã ãã©ã®ãããªç¹å¾´ãããªãã¦ãããã©ããã¯ä»¥ä¸ã§è©³ç´°ã«è§¦ãã¾ãã
ID3
ã¢ã«ã´ãªãºã
ä¸ã§èª¬æãããããå²æã
ã¡ãªãã
ä¸ä½äºæãããããã¦ããã§ã¯ãã¾ãã¨ãããããã¨ããªãããâ¦æ±
ãã¡ãªãã
- ãã¼ã¿æ°ãå°ãªãã¨ãã¼ã¿ãéå¦ç¿ãããã
- åå²ã決å®ããæã«ä¸ã¤ã®å¤æ°ããèæ ®ãããã¨ãã§ããªã
- é£ç¶å¤ãåãæ±ããã¨ãã§ãããã¾ãæ¬ æå¤ã«å¼±ã
åèæç®
C4.5(C5.0)
C4.5 ã¯ãID3 ãåºã«æ¹è¯ããããã®ã§ãæ°å¤ãã¼ã¿ã®ç¹å¾´éãåçã«é¢æ£åãããã¸ãã¯ãå°å ¥ããå ¨ã¦ã®ç¹å¾´éãã«ãã´ãªã«ã«å¤æ°ã§ãªãã°ãªããªãã¨ããå¶ç´ãåãé¤ãã¾ãããC4.5 ã¯å¦ç¿æ¸ã¿ã®æ¨ã if-then ã§è¡¨ãããã»ããã«å¤æããè©ä¾¡ãæåã (決å®æ¨ã«ãããä¸è¦ãªé¨å(æ)ãåãé¤ããã¨) ã«ç¨ãããã¾ãã
ã¢ã«ã´ãªãºã ï¼C4.5ã¨C5.0ã®éã
ã¾ãããã®ã¢ã«ã´ãªãºã ãããã¨ç¥ã£ãæã«ãæåã«æ°ã«ãªã£ãã®ã¯ãã®ï¼ã¤ããã並åã§ã¨ãããããã¦ãããã¨ã§ãããªã®ã§ããã®ï¼ã¤ã«æããã¦éãã¯ããã®ã ãããã¨ããã®ããµã¤ãã«çåã¨ãã¦ä¸ããã¾ãã
ãã¨ãã¨ãC4.5ã¯ID3ã®æ¹è¯çã§ãC4.5ã¯1997å¹´ã«åç¨ã®C5.0ã«æ¹åããã以ä¸ã®ãããªå¤æ´ç¹ãããã¾ããã
- ãã¼ã¹ãã£ã³ã°ãçµã¿è¾¼ããã¨ã§ç²¾åº¦ãæ ¼æ®µã«ä¸ãã£ã
- ã¹ã±ã¼ã©ããªãã£ãåä¸ãã¦ããã®ã§ãã«ãã³ã¢CPUã§æç¨
詳細はこちらの記事でみることができますããå®è¡æéãæ ¼æ®µã«æ©ããªãããããã¡ã¢ãªæ¶è²»éãæãããã精度ãåä¸ãã¦ãããã¨ããããã¾ãã
ã¡ãªãã
- ID3ã¨éããé£ç¶å¤ã»é¢æ£å¤ã®ä¸¡æ¹ãæ±ããã¨ãã§ããã¾ãæ¬ æå¤ã«ã対å¿ãããã¨ãã§ãã
- éå¦ç¿ã®åé¡ãæåãã§é²ããã¨ãã§ãã
- CARTã¯æ¨ã®åªå®ãã¯ãã¹ããªãã¼ã·ã§ã³ã«ãã£ã¦è¡ãããæéãããããC4.5ã¯äºé ä¿¡é ¼éçã使ãããä¸æ¹è¡ã§ã§ãã
ãã¡ãªãã
- ã¼ãã®å¤ãæã¤ç©ºã®ãã©ã³ããæ§ç¯ãã¦ãã¾ã
- ç°å¸¸å¤ãæã¤ãã¼ã¿ã§ã¢ãã«ãæ§ç¯ããã¨ãéå¦ç¿ãçºçãã¦ãã¾ããããã¯ç¹ã«ãã¼ã¿ã«ãã¤ãºãå«ã¾ãã¦ããéã«ãèµ·ããã
åèæç®
CART
CART (Classification and Regression Trees) ã¯ãC4.5 ã«ããé¡ä¼¼ããææ³ã§ãããæ°å¤ãã¼ã¿ã§è¡¨ç¾ãããç®çå¤æ° (=å帰) ã«ã対å¿ãã¦ããç¹å¾´ãããã¾ããCART ã¯ãã«ã¼ã«ã»ãããè¨ç®ããã®ã§ã¯ãªãããã¤ããªã»ããªã¼ãæ§ç¯ãã¾ãããªããscikit-learn ã¯æé©åãããã¼ã¸ã§ã³ã® CART ãå®è£ ãã¦ãã¾ãã
ã¢ã«ã´ãªãºã
CART 㯠ID3 ã¨ã»ã¼åãã¢ã«ã´ãªãºã ã§ãããææ¨ã¨ãã¦ãGiniä¿æ°ã¨ããä¸ç´åº¦ã表ãææ¨ãç¨ãã¦åå²ãè¡ãã¾ãã
$$
Gini Index(t) = 1 - \sum_{j=1}^{k} p^{2}(j | t)
$$
$p(j | t)$ ã¯ããã¼ã $t$ å ã«ãããã¯ã©ã¹ $j$ ã®å²åã§ãããã®å¤ã大ããã»ã©ãã¯ã©ã¹å ã®ä¸ç´åº¦ãå°ãããªãã¾ãã
ã¤ã¾ããä¸ç´åº¦ãæãä½ããã°ã¸ãä¿æ°ã®å¤ã¯0ï¼ä¸ç´åº¦ãé«ããªãã°ãªãã»ã©ã¸ãä¿æ°ã®å¤ã1ã«æ¼¸è¿ãããã¨ããããã¾ãï¼
ãã®Giniä¿æ°ãè¨ç®ãã¦ã親ãã¼ãã¨åãã¼ãã®å·®ã大ãããªããããªãã®ãåå²ã®æ¡ä»¶ã¨ãã¦é¸ã¶ãã¨ã§ã決å®æ¨ãæ§ç¯ãã¾ãã
ã¡ãªãã
[å·¥äºä¸]
ãã¡ãªãã
- ä¸ã¤ã®å¤æ°ã§ããåå²æ¡ä»¶ãä½æã§ããªã
- æ§ç¯ãããæ¨ãä¸å®å®
åèæç®
CHAID
[å·¥äºä¸]
決å®æ¨ã®å¯è¦å
å æ¥ããããªè¨äºãè¦ã¤ãã¦ãåã£ã¦ãã³ã£ãããã¾ããã
決å®æ¨ã®å¯è¦åãããªã®ããã¾ãèãã¦ãªãã£ãç¬ãããããªããhttps://t.co/RquIrvllJZ
— Hakky@Kaggle(´・âï½¥) (@St_Hakky) 2018å¹´9æ27æ¥
sckit-learnã¨ãã使ã£ã¦ã®å¯è¦åã£ã¦çµæ§å¾®å¦ã ãªãã£ã¦æã£ã¦ãããã§ãããããã¾ã§ããããããããæãã§è¡¨ç¾ãã¦ãããã®ã¯ãã¨ã£ã¦ãé©ãã§ããã以ä¸ã®è¨äºã§ã¾ã¨ãã¦ã¿ã¾ããã
ãã®ä»ã®åèæç®
æ¸ãããæããç¬ãã¾ãæ´æ°ãã¾ãã