A Case for Hyperparameter Management: Managing Hyperparameters with Hydra + MLflow
This is a scene that anyone doing machine learning has probably been through at some point.
(The screenshot is from PyTorch's Language Model example.)
The pattern of receiving arguments from the shell with Python's argparse and assigning them to parameters inside the script has some problems: the code tends to get long, it is hard to tell which parameters belong to the model, the preprocessing, or the optimizer, and the overall picture becomes difficult to survey.
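For reference, here is a minimal sketch of that argparse pattern (the parameter names are hypothetical, chosen to mirror the YAML example below):

```python
import argparse

# Every setting ends up in one flat list, with no hint of what it configures
parser = argparse.ArgumentParser(description='training script')
parser.add_argument('--min_df', type=int, default=3)               # preprocessing
parser.add_argument('--hidden_size', type=int, default=256)        # model
parser.add_argument('--dropout', type=float, default=0.1)          # model
parser.add_argument('--learning_rate', type=float, default=0.01)   # optimizer
args = parser.parse_args()

print(args.hidden_size)
```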
I manage all of my experiment parameters by writing them in YAML. Writing them in YAML lets you organize parameters hierarchically and structurally, which makes them much easier to survey.
```yaml
preprocess:
  min_df: 3
  max_df: 1
  replace_pattern: \d+

model:
  hidden_size: 256
  dropout: 0.1

optimizer:
  algorithm: Adam
  learning_rate: 0.01
  norm: 0.001
```
When tuning parameters, my workflow used to be a shell script that rewrote the YAML with the yq command and then fed it to the Python script (roughly like the sketch below), but after rewriting values here and there with yq I kept losing track of what the default values were.
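A sketch of that kind of workflow, for illustration only; the file names are made up, and the exact syntax depends on the yq implementation and version (mikefarah's yq v4 is assumed here):

```sh
# Rewrite one parameter in place, then launch the experiment with the modified YAML
yq -i '.model.hidden_size = 512' params.yaml
python train.py

# ...repeat with other values, and eventually lose track of the defaults
```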
While I was searching for a better practice for managing parameters in YAML, a tool called Hydra appeared, so I took the opportunity to reorganize my own experiment-management setup around it.
What is Hydra?
Hydra is a tool from Facebook Research that makes configuration files easier to manage. Its focus is on letting you write all kinds of settings in YAML and then feed those YAML settings into a Python script with very little effort. The examples include things like database settings, so it is intended for uses beyond machine learning as well.
Loading YAML with Hydra
Set up the parameters in a YAML file as shown below and decorate a function in your Python script with Hydra's decorator, and the parameters become available as a dict-like object.
config.yaml
```yaml
db:
  driver: postgresql
  pass: drowssap
  timeout: 20
  user: postgre_user
```
my_app.py
```python
@hydra.main(config_path='config.yaml')
def my_app(cfg):
    print(cfg.pretty())
```
```
$ python my_app.py
db:
  driver: postgresql
  pass: drowssap
  timeout: 20
  user: postgre_user
```
If you pass YAML parameters on the command line in key=value form, Hydra overrides those values before handing the config to your Python script. The original YAML file is, of course, left untouched.
```
$ python my_app.py db.user=ymym db.pass=3412
db:
  driver: postgresql
  pass: 3412
  timeout: 20
  user: ymym
```
Managing multiple YAML files
Hydra is also designed for splitting the configuration across multiple YAML files. For example, you can describe the NN and LightGBM parameters in separate YAML files, make the model choice itself a hyperparameter, and have the YAML for the corresponding model loaded accordingly.
nn.yaml
```yaml
model:
  layers: 3
  dropout: 0.5
```
lightgbm.yaml
```yaml
model:
  max_depth: 10
  learning_rate: 0.01
```
Lay the YAML files out in a directory structure like the following, and control which configuration file gets loaded in config.yaml.
```
├── conf
│   ├── config.yaml
│   └── model
│       ├── lightgbm.yaml
│       └── nn.yaml
└── my_app.py
```
config.yaml
```yaml
defaults:
  - model: nn
```
```
$ python my_app.py
model:
  layers: 3
  dropout: 0.5
```
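The config group can also be switched from the command line without editing config.yaml, for example to run with the LightGBM settings instead of the NN defaults (output shown as it would be printed by cfg.pretty()):

```
$ python my_app.py model=lightgbm
model:
  max_depth: 10
  learning_rate: 0.01
```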
Hydra's output directory
Hydra also creates an output directory (outputs/ by default) and keeps a record there of the final YAML contents the Python script was actually run with.
```
├── .hydra
│   ├── config.yaml
│   ├── hydra.yaml
│   └── overrides.yaml
└── my_app.log
```
One thing to watch out for with this output directory: inside a function decorated with Hydra's decorator, the current working directory is changed to this output directory. File access such as pd.read_csv('data/train.csv') therefore tends to break because the cwd is not what you expect, so it is better to use the helper functions Hydra provides to get the path of the original project root.
```python
import os

from omegaconf import DictConfig
import hydra


@hydra.main()
def my_app(cfg: DictConfig) -> None:
    print(f'Current working directory: {os.getcwd()}')
    print(f'Orig working directory   : {hydra.utils.get_original_cwd()}')
    print(f'to_absolute_path("foo")  : {hydra.utils.to_absolute_path("foo")}')
    print(f'to_absolute_path("/foo") : {hydra.utils.to_absolute_path("/foo")}')


if __name__ == '__main__':
    my_app()
```

```
>>> Current working directory: /home/user/workspace/hydra-exp/outputs/2020-02-09/02-29-26
>>> Orig working directory   : /home/user/workspace/hydra-exp
>>> to_absolute_path("foo")  : /home/user/workspace/hydra-exp/foo
>>> to_absolute_path("/foo") : /foo
```
Managing parameters and experiments with Hydra + MLflow
Now let's put this to work on a machine learning experiment: Hydra handles loading the hyperparameters written in YAML and running the grid search, while MLflow records which parameters each experiment ran with and what results came out. The task, as usual, is text classification on the Livedoor news corpus. First, we define the functions for loading and preprocessing the data.
```python
# Build an AllenNLP Instance from a tokenized document
def text_to_instance(word_list, label):
    tokens = [Token(word) for word in word_list]
    word_sentence_field = TextField(tokens, {"tokens": SingleIdTokenIndexer()})
    fields = {"tokens": word_sentence_field}
    if label is not None:
        label_field = LabelField(label, skip_indexing=True)
        fields["label"] = label_field
    return Instance(fields)


def load_dataset(path, dataset):
    if dataset not in ['train', 'val', 'test']:
        raise ValueError('"dataset" parameter must be train/val/test')
    data = pd.read_csv(f'{path}/{dataset}.csv')
    labels = pd.read_csv(f'{path}/{dataset}_label.csv', header=None, squeeze=True)
    return data, labels


def preprocess(X, y, preprocessor=None):
    # Fit the preprocessor on the first call (train) and reuse it afterwards (val/test)
    if preprocessor is None:
        preprocessor = Preprocessor()
        preprocessor\
            .stack(ct.text.UnicodeNormalizer())\
            .stack(ct.Tokenizer("ja"))\
            .fit(X['article'])
    processed = preprocessor.transform(X['article'])
    dataset = [text_to_instance([token.surface for token in document], int(label))
               for document, label in zip(processed, y)]
    return dataset, preprocessor
```
Next is the YAML file describing the hyperparameters.
config.yaml
```yaml
# hyperparameters for the word embeddings
w2v:
  model_name: all
  vocab_size: 32000
  norm: 2

# model parameters
model:
  hidden_size: 256
  dropout: 0.5

# parameters used during training
training:
  batch_size: 32
  learning_rate: 0.01
  epoch: 30
  patience: 3
```
This time I did not split the YAML; everything lives in a single file. Personally, I find that splitting the YAML too finely makes it easy to forget to update something and makes fixes more of a hassle, so unless the configuration is genuinely complex, I think it is better to keep it in one YAML file.
Next come the train and test functions.
```python
# Training
def train(train_dataset, val_dataset, cfg):
    # Build the vocabulary
    VOCAB_SIZE = cfg.w2v.vocab_size
    vocab = Vocabulary.from_instances(train_dataset + val_dataset, max_vocab_size=VOCAB_SIZE)

    BATCH_SIZE = cfg.training.batch_size
    # Iterator that produces padded mini-batches
    iterator = BucketIterator(batch_size=BATCH_SIZE, sorting_keys=[("tokens", "num_tokens")])
    iterator.index_with(vocab)

    # Use the pretrained Japanese Wikipedia entity vectors published by Tohoku University
    # http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/
    model_name = cfg.w2v.model_name
    norm = cfg.w2v.norm
    cwd = hydra.utils.get_original_cwd()
    params = Params({
        'embedding_dim': 200,
        'padding_index': 0,
        'pretrained_file': os.path.join(cwd, f'embs/jawiki.{model_name}_vectors.200d.txt'),
        'norm_type': norm})
    token_embedding = Embedding.from_params(vocab=vocab, params=params)

    HIDDEN_SIZE = cfg.model.hidden_size
    dropout = cfg.model.dropout
    word_embeddings: TextFieldEmbedder = BasicTextFieldEmbedder({"tokens": token_embedding})
    encoder: Seq2SeqEncoder = PytorchSeq2SeqWrapper(
        nn.LSTM(word_embeddings.get_output_dim(), HIDDEN_SIZE, bidirectional=True, batch_first=True))
    model = ClassifierWithAttn(word_embeddings, encoder, vocab, dropout)
    model.train()

    use_gpu = torch.cuda.is_available()
    if use_gpu:
        model = model.cuda(0)

    LR = cfg.training.learning_rate
    EPOCHS = cfg.training.epoch
    patience = cfg.training.patience if cfg.training.patience > 0 else None
    optimizer = optim.Adam(model.parameters(), lr=LR)
    trainer = Trainer(
        model=model,
        optimizer=optimizer,
        iterator=iterator,
        train_dataset=train_dataset,
        validation_dataset=val_dataset,
        patience=patience,
        cuda_device=0 if use_gpu else -1,
        num_epochs=EPOCHS
    )
    metrics = trainer.train()
    logger.info(metrics)
    return model, metrics


def test(test_dataset, model, writer):
    # Inference
    model.eval()
    with torch.no_grad():
        predicted = [model.forward_on_instance(d)['logits'].argmax() for d in tqdm(test_dataset)]

    # Accuracy
    target = np.array([ins.fields['label'].label for ins in test_dataset])
    predict = np.array(predicted)
    accuracy = accuracy_score(target, predict)

    # Precision / recall
    macro_precision = precision_score(target, predict, average='macro')
    micro_precision = precision_score(target, predict, average='micro')
    macro_recall = recall_score(target, predict, average='macro')
    micro_recall = recall_score(target, predict, average='micro')

    # Log everything to MLflow
    writer.log_metric('accuracy', accuracy)
    writer.log_metric('macro-precision', macro_precision)
    writer.log_metric('micro-precision', micro_precision)
    writer.log_metric('macro-recall', macro_recall)
    writer.log_metric('micro-recall', micro_recall)

    model.cpu()
    writer.log_torch_model(model)
```
The writer instance that appears here is an instance of a class that wraps MLflow's client and handles logging and saving artifacts. There are places where MLflow has to be used outside the with mlflow.start_run(): block, which means the run ID has to be carried around, so I wrote a small wrapper class for it.
```python
class MlflowWriter():
    def __init__(self, experiment_name, **kwargs):
        self.client = MlflowClient(**kwargs)
        try:
            self.experiment_id = self.client.create_experiment(experiment_name)
        except Exception:
            # The experiment already exists, so reuse its ID
            self.experiment_id = self.client.get_experiment_by_name(experiment_name).experiment_id

        self.run_id = self.client.create_run(self.experiment_id).info.run_id

    def log_params_from_omegaconf_dict(self, params):
        for param_name, element in params.items():
            self._explore_recursive(param_name, element)

    def _explore_recursive(self, parent_name, element):
        # Flatten the nested OmegaConf config into dotted parameter names
        if isinstance(element, DictConfig):
            for k, v in element.items():
                if isinstance(v, DictConfig) or isinstance(v, ListConfig):
                    self._explore_recursive(f'{parent_name}.{k}', v)
                else:
                    self.client.log_param(self.run_id, f'{parent_name}.{k}', v)
        elif isinstance(element, ListConfig):
            for i, v in enumerate(element):
                self.client.log_param(self.run_id, f'{parent_name}.{i}', v)

    def log_torch_model(self, model):
        with mlflow.start_run(self.run_id):
            pytorch.log_model(model, 'models')

    def log_param(self, key, value):
        self.client.log_param(self.run_id, key, value)

    def log_metric(self, key, value):
        self.client.log_metric(self.run_id, key, value)

    def log_artifact(self, local_path):
        self.client.log_artifact(self.run_id, local_path)

    def set_terminated(self):
        self.client.set_terminated(self.run_id)
```
Finally, the main function with the Hydra decorator. Since the data is read from local CSV files, it uses Hydra's utility to get the project root path.
```python
@hydra.main(config_path='config.yaml')
def main(cfg: DictConfig):
    # https://medium.com/pytorch/hydra-a-fresh-look-at-configuration-for-machine-learning-projects-50583186b710
    cwd = hydra.utils.get_original_cwd()
    train_X, train_y = load_dataset(os.path.join(cwd, 'data'), 'train')
    val_X, val_y = load_dataset(os.path.join(cwd, 'data'), 'val')
    test_X, test_y = load_dataset(os.path.join(cwd, 'data'), 'test')

    train_dataset, preprocessor = preprocess(train_X, train_y)
    val_dataset, preprocessor = preprocess(val_X, val_y, preprocessor)
    test_dataset, preprocessor = preprocess(test_X, test_y, preprocessor)

    EXPERIMENT_NAME = 'livedoor-news-hydra-exp'
    writer = MlflowWriter(EXPERIMENT_NAME)
    writer.log_params_from_omegaconf_dict(cfg)

    model, metrics = train(train_dataset, val_dataset, cfg)
    test(test_dataset, model, writer)

    # Store Hydra's own output files as MLflow artifacts
    writer.log_artifact(os.path.join(os.getcwd(), '.hydra/config.yaml'))
    writer.log_artifact(os.path.join(os.getcwd(), '.hydra/hydra.yaml'))
    writer.log_artifact(os.path.join(os.getcwd(), '.hydra/overrides.yaml'))
    writer.log_artifact(os.path.join(os.getcwd(), 'main.log'))
    writer.set_terminated()


if __name__ == '__main__':
    main()
```
Hydra also has a Multi-run feature: pass comma-separated values in key=value form on the command line and add the -m option, and Hydra runs every combination of those parameters. Since the output is saved separately for each combination, this feature can be used to run a grid search over the parameters.
```
$ python main.py w2v.model_name=all,entity,word model.hidden_size=32,64,128,256 training.learning_rate=0.01,0.005 -m
```
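With -m, Hydra writes its results under a multirun/ directory by default, with one numbered subdirectory per parameter combination (each containing its own .hydra/ config and log, like a single run). The command above sweeps 3 x 4 x 2 = 24 combinations; the date and time below are just an example:

```
multirun/2020-02-09/03-15-42/
├── 0
├── 1
├── ...
└── 23
```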
Running the above records the contents of every run in MLflow. With the results of a Hydra grid search logged to MLflow, comparing experiments becomes easy. The screenshot below shows the accuracy of each run plotted with its parameters displayed alongside.
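This comparison view is MLflow's tracking UI; with the default local file-based tracking store it can be opened with:

```
$ mlflow ui
# then browse to http://localhost:5000
```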
Summary
In this article I covered how to use Hydra, the configuration management tool developed by Facebook Research, and how to manage hyperparameter input and output with Hydra + MLflow. Compared to taking parameters through argparse, managing them in YAML keeps them easy to survey, and combining YAML with Hydra makes them both easy to configure and easy to wire into Python. Why not take this as a chance to move your hyperparameter management over to YAML + Hydra?