Trying Python on Kaggle (2): sklearn
sklearn (scikit-learn), a machine learning library
This continues from the previous post.
To build a prediction model, I wanted to use a machine learning library, so I tried out sklearn.
(Honestly, I had assumed pandas had machine learning built in.)
First, install it.
$ pip install scikit-learn
The machine learning functionality is split across many modules, so import the ones you need, like this.
# Decision trees
from sklearn import tree
# Linear models
from sklearn import linear_model
# Neural networks
from sklearn import neural_network
# Support vector machines
from sklearn import svm
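As a rough illustration of why picking a module is mostly a matter of taste at this stage: the classifiers in these modules share the same fit() / predict() methods, so they can be swapped without changing the surrounding code. The sketch below uses a tiny made-up dataset (not the Titanic data), and the names X_toy, y_toy and models are only for illustration.

```python
from sklearn import linear_model, svm, tree

# Made-up one-feature dataset: small values -> class 0, large values -> class 1
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]

models = {
    'tree': tree.DecisionTreeClassifier(),
    'linear_model': linear_model.LogisticRegression(),
    'svm': svm.SVC(),
}

predictions = {}
for name, model in models.items():
    model.fit(X_toy, y_toy)  # same call for every estimator
    predictions[name] = model.predict([[0.5], [2.5]])
    print(name, predictions[name])
```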
This time I'll use logistic regression.
import pandas as pd
from sklearn import linear_model
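As in the previous post, read train.csv into a DataFrame and fill the missing values in train.Age with the mean.
The explanatory variables are Age and Sex. In R you can just keep the explanatory and response variables together in the same place, but with sklearn you have to pay attention to which objects you pass.
The explanatory variables go in a DataFrame object, and the response variable in a Series object.
Also, a categorical variable stored as strings apparently can't be used as-is, so convert it first, along the lines of [male, female] -> [1, 0].
train = pd.read_csv('train.csv')

# Data preparation: fill missing ages with the mean age
train['Age'] = train['Age'].fillna(train['Age'].mean())
# Encode Sex as male -> 1, anything else -> 0 (vectorized, avoids chained assignment in a loop)
train['Sex'] = (train['Sex'] == 'male').astype(int)

# Logistic regression
logiReg = linear_model.LogisticRegression()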
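The same preparation can be tried on a tiny made-up frame, which is handy for checking what each step does before touching train.csv. This is only a sketch: the toy values are invented, and Series.map is used here as one possible way to do the [male, female] -> [1, 0] conversion.

```python
import pandas as pd

# Toy stand-in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0],
    'Sex': ['male', 'female', 'female'],
})

# Fill the missing age with the mean of the known ages: (22 + 30) / 2 = 26
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Map the string categories to numbers
toy['Sex'] = toy['Sex'].map({'male': 1, 'female': 0})

print(toy)
```

Note that map() leaves NaN for any value missing from the dictionary, so it is worth checking for unexpected categories first.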
The response variable is a Series object.
y = train['Survived']
print(type(y))
# >> <class 'pandas.core.series.Series'>
The explanatory variables are a DataFrame object.
X = train[['Age', 'Sex']]
print(type(X))
# >> <class 'pandas.core.frame.DataFrame'>
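The difference comes down to the brackets: a list of column names gives a two-dimensional DataFrame, a single column name gives a one-dimensional Series. A minimal sketch on a made-up two-row frame (the values are invented) that prints the shapes:

```python
import pandas as pd

# Made-up two-row frame with the same column names as the Titanic data
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': [1, 0], 'Survived': [0, 1]})

X = toy[['Age', 'Sex']]  # list of columns -> DataFrame, shape (rows, columns)
y = toy['Survived']      # single column   -> Series, shape (rows,)

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```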
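Once the variables are ready, use methods such as fit() and score().
logiReg.fit(X, y)
print(logiReg.coef_)        # regression coefficients (Age, Sex)
print(logiReg.intercept_)   # intercept
print(logiReg.score(X, y))  # mean accuracy on the given data (not a coefficient of determination)
# >> [[-0.0042936 -2.41865573]]
# >> [ 1.11913633]
# >> 0.786756453423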
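To make the coefficients a little more concrete: logistic regression turns the linear score intercept + coef_age * Age + coef_sex * Sex into a probability with the sigmoid function. The sketch below plugs in the values printed above; the helper name survival_probability and the example age of 30 are my own choices for illustration.

```python
import math

# Values copied from the printed output above
COEF_AGE, COEF_SEX = -0.0042936, -2.41865573
INTERCEPT = 1.11913633

def survival_probability(age, sex):
    """Estimated P(Survived = 1) for a given Age and Sex (male = 1, female = 0)."""
    z = INTERCEPT + COEF_AGE * age + COEF_SEX * sex  # linear score
    return 1.0 / (1.0 + math.exp(-z))                # sigmoid

p_male = survival_probability(30, 1)
p_female = survival_probability(30, 0)
print(p_male, p_female)
```

predict() returns 1 when this probability is above 0.5. With these coefficients the Sex term dominates, so the 30-year-old man lands below 0.5 and the 30-year-old woman above it.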
Now apply this model to the training data as-is.
py = logiReg.predict(X)     # predictions for the training data
table = pd.crosstab(y, py)  # compare actual vs. predicted
table
The displayed result is as follows.
Survived (actual) \ col_0 (predicted) | 0 | 1 |
---|---|---|
0 | 468 | 81 |
1 | 109 | 233 |
Computing the accuracy from this table gives the same value that score() returned:
(468+233)/(468+233+81+109.0)
# >> 0.7867564534231201
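The same number can be computed from the table itself rather than by typing the counts in by hand, since the correct predictions sit on the diagonal. A small sketch, rebuilding the cross table by hand so that it runs on its own (in the notebook you would use the table variable directly):

```python
import numpy as np
import pandas as pd

# Cross table from above: rows = actual Survived, columns = predicted
table = pd.DataFrame([[468, 81], [109, 233]], index=[0, 1], columns=[0, 1])

correct = np.trace(table.values)  # diagonal: 468 + 233
total = table.values.sum()        # all passengers in the table
accuracy = correct / total
print(accuracy)
```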
The code I wrote this time is here.
-> http://nbviewer.ipython.org/6204124
The GitHub repository is here.
-> https://github.com/akiniwa/kaggle_titanic