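"What even is a pipeline?"
Machine learning projects have been picking up a little at work, and I also wanted to reach the point where I can at least submit a Kaggle baseline automatically, so I decided to build a pipeline.
That said, I don't come from an engineering background, so I had no confidence that I could build one from scratch.
Just as I was wondering what to do, I came across this Qiita article.
It turns out there are quite a few options out there.
From among them I picked Kedro this time, or rather I gave it a try, so here is my report.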
[Added February 9, 2020]
After I wrote this post, Minami-san, who wrote the Qiita article above, had this to say:
@yetudada It would be nice to add this as a Japanese resource. https://t.co/edAoijWfZl
Yusuke Minami 🇸🇬 (@Minyus86), February 8, 2020
I was just about to reply "you're too kind" when Yetunde, Kedro's Product Manager, said this:
I really, really like @0_u0's blog post! I'm going to add it to the resource guide for Kedro in our FAQs.
Yetunde Dada (@yetudada), February 8, 2020
It says the post will be added to Kedro's FAQs as a resource guide...!!!!
I'm not a strong engineer and I treated this post as a personal memo, so I'm honestly honored that the people involved in development have welcomed it like this.
For my part, I'd like to do a bit more trial and error with Kedro.
So I'm reposting the article below as a README.
I'll probably add folders and revise the README over time.
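What is a pipeline?
Roughly speaking, it's the thing that bundles up your machine learning tasks and automates them.
Data preprocessing, parameter tuning, and so on: the machine learning process throws up all sorts of tedious chores, and it gets depressing.
A pipeline is an attempt to organize all of that so that results are reproducible and the code stays maintainable.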
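Here is a rough illustration of the idea in plain Python. It is not Kedro's API; the step names and the toy data are made up for this sketch. A pipeline is treated as an ordered list of steps that always runs the same way:

```python
from typing import Callable, List

# Hypothetical steps for illustration; each takes data and returns new data.
def drop_missing(rows: List[dict]) -> List[dict]:
    """Preprocessing: discard rows that contain a missing value."""
    return [r for r in rows if None not in r.values()]


def add_ratio(rows: List[dict]) -> List[dict]:
    """Feature engineering: add a derived column a / b."""
    return [{**r, "ratio": r["a"] / r["b"]} for r in rows]


def run_pipeline(data, steps: List[Callable]):
    """Apply every step in order, feeding each output into the next step."""
    for step in steps:
        data = step(data)
    return data


raw = [{"a": 1, "b": 2}, {"a": None, "b": 3}, {"a": 3, "b": 4}]
result = run_pipeline(raw, [drop_missing, add_ratio])
print(result)
```

Because the order of the steps lives in one place, rerunning the whole thing on new data is a single call, which is where the reproducibility comes from.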
What is Kedro?
It's quicker to just head over to GitHub: github.com
It is an open-source pipeline library developed by QuantumBlack.
Why Kedro?
It's not that the other libraries are bad. I have three main reasons:
The documentation is thorough
The people involved in development are kind
The logo and banner look cool (important)
The documentation is thorough
Every library has a "get started" guide, but Kedro's walks you through step by step, and it made the file structure, and how the YAML files I rarely get to touch relate to everything else, very easy to understand.
For someone like me who, never mind pipelines, mostly uses Python interactively, this mattered a lot.
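The developer community is kind to newbies
When I tweeted "I'll look up how to do this later," one of the maintainers gave it a like, and people who develop Kedro replied and even showed me an example implementation for the coding problem I was stuck on. I was surprised, but also really happy.
With other libraries too, someone would probably tell you something, but just knowing that I can work things out this way, and that people will help, was quite reassuring.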
The logo and banner look cool
This tweet says it all*1.
twitter.com
Q. Why Kedro?
A. I went to GitHub and there was a seriously cool logo (I hit Star) pic.twitter.com/JHj2WBNVfe
Kien Y. Knot (@0_u0), February 5, 2020
It's just so cool that I ended up wanting to use it. I have a case of chuunibyou, the type who wraps a handle name in †, so it makes me want to put † on either side of my own name.
Trying it out
The documentation exists, so frankly everything from here on is superfluous, but here is my report on trying it out.
Implementation
The GitHub repository is below.
github.com
Kedro can be installed with pip.
pip install kedro
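Data used
I'm using a modified version of the data I used in a previous blog post.
It's generated from random numbers, so there is no real problem with publishing the trained model, and equally it is of no use for anything.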
Starting a project: kedro new
Kedro has the concept of a project, which will feel fairly natural if you use R*2.
Typing kedro new at the command line lets you enter the project name and other settings interactively.
Once you've done that, a Kedro project folder is created under your working directory.
Amazing.
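Getting ready to run: kedro install
After starting the project, typing kedro install kicks off a bunch of build steps (it installs the project's dependencies).
When it finishes, you have places to put your data, do preprocessing, build models, and so on: pretty much everything is set up.
Amazing.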
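Folder structure
In the Git repository it looks like this.
I've left out things like __init__ and __pycache__.
I like this sort of thing already at the level of the folder structure.
There are a lot of folders, though, so please refer to the official documentation for an explanation of each one.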
kedro_classification
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   │   ├── catalog.yml
│   │   ├── credentials.yml
│   │   ├── logging.yml
│   │   └── parameters.yml
│   └── local
├── data
│   ├── 01_raw
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_features
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── docs
│   └── source
│       ├── conf.py
│       └── index.rst
├── kedro_cli.py
├── logs
│   ├── errors.log
│   ├── info.log
│   └── journals
├── notebooks
├── references
├── results
├── setup.cfg
└── src
    ├── kedro_classification
    │   ├── nodes
    │   ├── pipeline.py
    │   ├── pipelines
    │   │   ├── data_engineering
    │   │   │   ├── nodes.py
    │   │   │   └── pipeline.py
    │   │   └── data_science
    │   │       ├── nodes.py
    │   │       └── pipeline.py
    │   └── run.py
    ├── requirements.txt
    ├── setup.py
    └── tests
        └── test_run.py
Building the pipeline
Given the folder structure above, the first thing to do is put the data you want to use into data/01_raw.
This time I've dumped the raw data in there.
Beyond that, you edit the YAML files in conf, plus the Python scripts (nodes.py and pipeline.py) in src/[project name]/pipelines/data_engineering and likewise in data_science.
Once those are all edited, there is another pipeline.py one level up, so you edit that too.
As long as the references are kept consistent, I think you can name these however you like.
Among the YAML files in conf, catalog.yml holds outputs such as intermediate tables and models, and parameters.yml holds things like model parameters.
With that in place, if you declare them as arguments of the functions in nodes.py, Kedro takes care of pulling them in for you.
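The name-based lookup can be pictured roughly like this. It is a plain-Python sketch, not Kedro's actual implementation: the `catalog` and `params` dictionaries stand in for what catalog.yml and parameters.yml describe, and `resolve_inputs`, `call_node`, and `count_above` are hypothetical names made up for the example.

```python
from typing import Any, Callable, Dict, List

# Stand-ins for catalog.yml and parameters.yml (assumed values).
catalog: Dict[str, Any] = {"usedata": [0, 1, 1, 0]}
params: Dict[str, Any] = {"threshold": 0.5, "seed": 42}


def resolve_inputs(names: List[str]) -> List[Any]:
    """Look up each input name; 'parameters' maps to the whole parameter dict."""
    resolved = []
    for name in names:
        if name == "parameters":
            resolved.append(params)
        elif name in catalog:
            resolved.append(catalog[name])
        else:
            raise KeyError(f"'{name}' is not registered in the catalog")
    return resolved


def count_above(data: List[int], parameters: Dict[str, Any]) -> int:
    """Example node function: it only declares arguments, it never loads files."""
    return sum(1 for x in data if x > parameters["threshold"])


def call_node(func: Callable, input_names: List[str]) -> Any:
    """Resolve the named inputs and pass them to the function positionally."""
    return func(*resolve_inputs(input_names))


n_above = call_node(count_above, ["usedata", "parameters"])
print(n_above)
```

The point is that the function itself never knows where the data lives; only the names connect it to the configuration.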
nodes.py contains the concrete data preprocessing and the model function definitions, and pipeline.py defines their inputs and outputs.
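To make that split concrete, here is a toy version in plain Python. `ToyNode` and `run_nodes` are hypothetical stand-ins written for this sketch, not Kedro classes. The functions play the role of nodes.py, and the wiring plays the role of pipeline.py:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a function plus named inputs and output."""
    func: Callable
    inputs: List[str]
    outputs: str


def run_nodes(nodes: List[ToyNode], store: Dict[str, Any]) -> Dict[str, Any]:
    """Run each node once all of its inputs exist, regardless of list order."""
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(i in store for i in n.inputs)]
        if not ready:
            raise ValueError("some inputs can never be produced")
        for n in ready:
            store[n.outputs] = n.func(*[store[i] for i in n.inputs])
            pending.remove(n)
    return store


# "nodes.py" side: plain functions with no knowledge of the wiring.
def double(xs: List[int]) -> List[int]:
    return [2 * x for x in xs]


def total(xs: List[int]) -> int:
    return sum(xs)


# "pipeline.py" side: only the wiring, deliberately listed out of order.
nodes = [
    ToyNode(func=total, inputs=["doubled"], outputs="total"),
    ToyNode(func=double, inputs=["raw"], outputs="doubled"),
]
store = run_nodes(nodes, {"raw": [1, 2, 3]})
print(store["total"])
```

Since the execution order follows from the input and output names, the functions can be rearranged or reused without touching their bodies.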
data_engineering
This folder is mainly for preprocessing.
This time my trial was mostly about the modeling part, so the preprocessing ended up being rather arbitrary.
import pandas as pd
import numpy as np


def preprocessing(usedata: pd.DataFrame) -> pd.DataFrame:
    # Flip the columns Var.1 ... Var.39 (x -> 1 - x); usedata is modified in place.
    for i in range(1, 40):
        var_name = 'Var.' + str(i)
        usedata[var_name] = 1 - usedata[var_name]
    return usedata
Of course, you can also define several functions and apply them as needed.
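For instance, several small preprocessing functions can be chained into one. This is a minimal sketch on plain dictionaries rather than a DataFrame so that it runs without extra libraries; `add_var_sum` and `compose` are hypothetical helpers, and only the "flip the Var.* columns" step mirrors the function above.

```python
from functools import reduce
from typing import Callable, Dict, List

Row = Dict[str, float]


def flip_binary(rows: List[Row]) -> List[Row]:
    """Replace each 'Var.*' value by 1 - value, without modifying the input."""
    return [
        {k: (1 - v if k.startswith("Var.") else v) for k, v in r.items()}
        for r in rows
    ]


def add_var_sum(rows: List[Row]) -> List[Row]:
    """Hypothetical second step: add the sum of all 'Var.*' columns."""
    return [
        {**r, "var_sum": sum(v for k, v in r.items() if k.startswith("Var."))}
        for r in rows
    ]


def compose(*funcs: Callable[[List[Row]], List[Row]]) -> Callable[[List[Row]], List[Row]]:
    """Build one function that applies the given functions from left to right."""
    return lambda rows: reduce(lambda acc, f: f(acc), funcs, rows)


preprocess_all = compose(flip_binary, add_var_sum)
out = preprocess_all([{"ID": 1, "Var.1": 0, "Var.2": 1}])
print(out)
```

In a real project each function could just as well become its own node, with the intermediate result given a name in the catalog.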
pipeline.py is the script that applies the preprocessing function to the stored raw data and writes the result out as an intermediate table.
from kedro.pipeline import node, Pipeline

from kedro_classification.pipelines.data_engineering.nodes import preprocessing


def create_pipeline(**kwargs):
    print('loading create_pipeline in pipeline.py....')
    return Pipeline(
        [
            # Read 'usedata', run preprocessing, and store 'preprocessed_Data'.
            node(
                func=preprocessing,
                inputs='usedata',
                outputs='preprocessed_Data',
                name='preprocessed_Data',
            ),
        ]
    )
The implementation is simple: it takes the function from nodes.py and emits its output as preprocessed_Data*3.
pipeline.py in data_science is written with the same notation, so I'll skip the explanation.
data_science
The official documentation had a linear regression implementation, so I stretched myself a bit and tried implementing LightGBM.
The code (in part) is below.
import logging
from typing import Dict

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold


def LightGBM_model(data: pd.DataFrame, parameters: Dict) -> lgb.Booster:
    y = data['y']
    X = data.drop(['y', 'ID'], axis=1)

    # Hyperparameters from parameters.yml
    lgb_params = {
        'num_iterations': parameters['n_estimators'],
        'boosting_type': parameters['boosting_type'],
        'objective': parameters['objective'],
        'metric': parameters['metric'],
        'num_leaves': parameters['num_leaves'],
        'learning_rate': parameters['learning_rate'],
        'max_depth': parameters['max_depth'],
        'verbosity': parameters['verbose'],
        'early_stopping_round': parameters['early_stopping_rounds'],
        'seed': parameters['seed'],
    }

    # shuffle=True is needed for random_state to have any effect
    # (recent scikit-learn raises an error without it).
    fold = KFold(n_splits=parameters['folds'], shuffle=True,
                 random_state=parameters['random_state'])

    # Train one booster per fold; the booster from the last fold is returned.
    for k, (train_index, valid_index) in enumerate(fold.split(X, y)):
        X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_valid = lgb.Dataset(X_valid, y_valid)
        # verbose_eval is accepted by older LightGBM releases; LightGBM 4.0 dropped it.
        regressor = lgb.train(lgb_params, lgb_train, valid_sets=lgb_valid,
                              verbose_eval=False)
        y_train_pred = regressor.predict(X_train, num_iteration=regressor.best_iteration)
        y_valid_pred = regressor.predict(X_valid, num_iteration=regressor.best_iteration)
        auc_train = roc_auc_score(y_train, y_train_pred)
        auc_valid = roc_auc_score(y_valid, y_valid_pred)
        print('Early stopping round is: {iter}'.format(iter=regressor.current_iteration()))
        print('Fold {n_folds}: train AUC is {train: .3f} valid AUC is {valid: .3f}'.format(
            n_folds=k + 1, train=auc_train, valid=auc_valid))
    return regressor


def evaluate_LightGBM_model(regressor: lgb.Booster, X_test: np.ndarray, y_test: np.ndarray):
    y_pred = regressor.predict(X_test, num_iteration=regressor.best_iteration)
    print('y predicted!')
    print(type(y_pred))
    # Log the AUC on the held-out test data.
    score = roc_auc_score(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info('AUC is %.3f.', score)
It's a messy implementation and I'm a little embarrassed, but writing it along these lines works fine.
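The number being printed and logged there is the AUC. To show what it measures, here is a small pure-Python version based on comparing every positive/negative pair. `auc_by_pairs` is a hypothetical helper written for this note, not what roc_auc_score does internally, and it is only practical for small inputs.

```python
from typing import Sequence


def auc_by_pairs(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 means the scores rank positives no better than chance, which is roughly what to expect from a model trained on random data like mine.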
If you list parameters as an input, Kedro takes care of fetching the corresponding hyperparameters for you.
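One thing that tripped me up is a missing or misspelled key in parameters.yml. A small helper along these lines could catch that early. This is only a sketch: `PARAM_KEYS` copies the key names from the lgb_params dictionary above, `build_lgb_params` is a hypothetical function, and the example values are made up.

```python
from typing import Any, Dict

# LightGBM parameter name -> key expected in parameters.yml
# (same pairs as in the lgb_params dictionary of LightGBM_model).
PARAM_KEYS = {
    "num_iterations": "n_estimators",
    "boosting_type": "boosting_type",
    "objective": "objective",
    "metric": "metric",
    "num_leaves": "num_leaves",
    "learning_rate": "learning_rate",
    "max_depth": "max_depth",
    "verbosity": "verbose",
    "early_stopping_round": "early_stopping_rounds",
    "seed": "seed",
}


def build_lgb_params(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the parameter dict into LightGBM names, failing fast on gaps."""
    missing = sorted(k for k in PARAM_KEYS.values() if k not in parameters)
    if missing:
        raise KeyError(f"missing keys in parameters: {missing}")
    return {lgb_name: parameters[yml_key] for lgb_name, yml_key in PARAM_KEYS.items()}


# Example values are assumptions for this sketch, not the ones I actually used.
example_parameters = {
    "n_estimators": 100, "boosting_type": "gbdt", "objective": "binary",
    "metric": "auc", "num_leaves": 31, "learning_rate": 0.1, "max_depth": -1,
    "verbose": -1, "early_stopping_rounds": 10, "seed": 0,
}
lgb_params = build_lgb_params(example_parameters)
print(lgb_params["num_iterations"], lgb_params["verbosity"])
```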
In pipeline.py, all you need to do is specify these functions along with their inputs and outputs.
# coding: utf-8
from kedro.pipeline import node, Pipeline
from typing import Dict, Any, List

from kedro_classification.pipelines.data_science.nodes import (
    split_data,
    Linear_Regression_model,
    LightGBM_model,
    evaluate_LightGBM_model,
)


def create_pipeline(**kwargs):
    return Pipeline(
        [
            # Split the preprocessed data into train and test sets.
            node(
                func=split_data,
                inputs=['preprocessed_Data', 'parameters'],
                outputs=['X_train', 'X_test', 'y_train', 'y_test'],
            ),
            # Train the LightGBM model with the hyperparameters from parameters.yml.
            node(
                func=LightGBM_model,
                inputs=['preprocessed_Data', 'parameters'],
                outputs='regressor',
                name='regressor',
            ),
            # Evaluate the trained model on the test set; nothing is stored.
            node(
                func=evaluate_LightGBM_model,
                inputs=['regressor', 'X_test', 'y_test'],
                outputs=None,
            ),
        ]
    )
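Finally, edit the pipeline.py directly under src (src/kedro_classification/pipeline.py in the tree above).
This is where you specify the flow from preprocessing through to running the model and producing output.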
from typing import Dict

from kedro.pipeline import Pipeline

from kedro_classification.pipelines.data_engineering import pipeline as de
from kedro_classification.pipelines.data_science import pipeline as ds


def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
    de_pipeline = de.create_pipeline()
    ds_pipeline = ds.create_pipeline()
    return {
        'de': de_pipeline,
        # The default run is preprocessing followed by modeling.
        '__default__': de_pipeline + ds_pipeline,
    }
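The shape of that dictionary can be mimicked in plain Python to see what it buys you. In this toy sketch a pipeline is just a list of node names and `select_pipeline` is a hypothetical helper, so it says nothing about how Kedro implements `+` or pipeline selection:

```python
from typing import Dict, List, Optional

# Toy pipelines: here a pipeline is only a list of node names.
de_pipeline = ["preprocessed_Data"]
ds_pipeline = ["split_data", "regressor", "evaluate"]


def create_pipelines() -> Dict[str, List[str]]:
    """Register the preprocessing pipeline alone and the full run as the default."""
    return {
        "de": de_pipeline,
        "__default__": de_pipeline + ds_pipeline,
    }


def select_pipeline(name: Optional[str] = None) -> List[str]:
    """Return the named pipeline, or the default one when no name is given."""
    pipelines = create_pipelines()
    return pipelines[name or "__default__"]


print(select_pipeline())      # full run: preprocessing, then modeling
print(select_pipeline("de"))  # preprocessing only
```

Registering 'de' separately means the preprocessing can be rerun on its own. I believe the CLI accepts a pipeline name for this (something like kedro run --pipeline de), but please check the documentation for your version.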
kedro test
Once you've got this far (or at any point along the way), you can run the tests.
Just running kedro test at the command line tells you where things get stuck and where they don't.
Amazing.
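As far as I can tell, kedro test runs the tests under src/tests with pytest, so adding a test for a node function is just a matter of writing a test_* function. Here is a minimal sketch. `flip_columns` is a hypothetical dictionary-based stand-in for preprocessing() so that the example runs without pandas:

```python
from typing import Dict


def flip_columns(row: Dict[str, float], n_vars: int) -> Dict[str, float]:
    """Stand-in for preprocessing(): return a copy with Var.1..Var.n flipped."""
    out = dict(row)
    for i in range(1, n_vars + 1):
        name = "Var." + str(i)
        out[name] = 1 - out[name]
    return out


def test_flip_columns_flips_only_var_columns():
    row = {"ID": 7, "Var.1": 0, "Var.2": 1}
    assert flip_columns(row, n_vars=2) == {"ID": 7, "Var.1": 1, "Var.2": 0}


def test_flip_columns_is_its_own_inverse():
    row = {"ID": 7, "Var.1": 0, "Var.2": 1}
    assert flip_columns(flip_columns(row, 2), 2) == row


def test_flip_columns_missing_column_raises():
    try:
        flip_columns({"ID": 7}, n_vars=1)
    except KeyError:
        pass
    else:
        raise AssertionError("expected KeyError for a missing Var column")
```

Because node functions are ordinary Python functions, they can be tested like this without running the whole pipeline.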
kedro run
If the tests go well, run it. kedro run does the job.
Amazing.
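Closing thoughts
This got long.
Working through the documentation with some trial and error, I got this far in about three days without strain.
This was a local implementation, but of course Kedro also runs on AWS and GCP, and it can pull data from a variety of databases.
My work is mostly data analysis, and I have little engineering experience, let alone experience building pipelines, yet it guided me to the point where I could get something into shape, so I found it a very good library.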
I do have a baseline of sorts now, but a lot is still missing.
There are plenty of post-tutorial tasks left, such as improving the part where the CV logic is embedded inside the LightGBM model function, and building a pipeline that chains multiple preprocessing functions.
That said, this is less a problem with Kedro than a question of how I implement things in nodes.py, so I trust that Kedro will accommodate it fairly flexibly (?).
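As a first step toward untangling the CV from the model function, the fold splitting could live in a function of its own. Here is a pure-Python sketch of that idea. `kfold_indices` is a hypothetical helper that imitates a shuffled K-fold split; it is not scikit-learn's KFold and will not reproduce its exact folds.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(
    n_samples: int, n_splits: int, seed: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, valid_indices) for each fold of a shuffled K-fold."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the folds are reproducible
    base, extra = divmod(n_samples, n_splits)  # fold sizes differ by at most one
    start = 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3, seed=0))
for k, (train, valid) in enumerate(folds, start=1):
    print(f"Fold {k}: {len(train)} train rows, {len(valid)} valid rows")
```

With the split separated out like this, the training function would only need to accept one (train, valid) pair, and the folds themselves could become an output of their own node.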