- Introduction
- Results
- Environment
- Environment setup
- Building the dataset
- Preparing the tokenizer
- Pretraining
- Inference
- Uploading the model to Hugging Face
Introduction
While benchmarking the accuracy of various LLMs, I wanted to try building a pretrained model of my own, so I did some research and gave it a go.
This time I use the Colab notebook for building a pretrained model that is published in the article below.
The same content is also written up in detail on Zenn, so if you want the full details, please see the article below.
Results
Here is the output from inference with the model I built. ...With only a few hours of training, the quality is still pretty rough.
The model built in this post is published here:
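Once the model is on the Hub (see the upload section at the end), it can be loaded directly with transformers. A minimal sketch; the repository id below is just a placeholder, so substitute the repository published above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; replace it with the repository published above.
repo_id = "your-name/japanese-mistral-300m-quickstart"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```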
Changes from the reference article
The changes from the article above are as follows:
- Use an A100 for compute
- Train on a newer dataset, the 20230901 ja dump from graelo/wikipedia
- Train with Flash Attention 2, since the A100 supports it (a rough sketch of how this is enabled is shown right after this list)
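The recipe's training script takes care of enabling Flash Attention 2 itself, but for reference, this is roughly how it is switched on when loading a Mistral-architecture model with a recent transformers release. The argument name is version-dependent: attn_implementation is the current one, while older 4.3x releases used use_flash_attention_2=True.

```python
import torch
from transformers import AutoModelForCausalLM

# Rough sketch only: Flash Attention 2 needs an Ampere-or-newer GPU (e.g. A100),
# fp16/bf16 weights, and the flash-attn package installed.
model = AutoModelForCausalLM.from_pretrained(
    "pretrain/checkpoints-mistral-300M-FA2",   # any Mistral-architecture checkpoint
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```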
Environment
- Google Colab A100
Environment setup
```python
# Check the CUDA version
!nvcc --version

# refer: https://github.com/pytorch/pytorch/issues/107960
# Workaround for the "libcuda.so not found" error raised by torch.compile
!ldconfig /usr/lib64-nvidia

!git clone https://github.com/ce-lery/japanese-mistral-300m-recipe.git
%cd japanese-mistral-300m-recipe
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -r requirements.txt
!pip install flash-attn==2.3.4 --no-build-isolation

# Dataset processing fails with the preinstalled version, so upgrade pyarrow
!pip install --upgrade pyarrow
```
Building the dataset
Download the dataset needed for training and format it so it can be used for training.
```python
%%time
!GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/graelo/wikipedia.git
%cd wikipedia/data/20230901/ja/
!git lfs pull --include "train-*-of-0016.parquet"
%cd ../../../../

import csv
import pandas as pd
from datasets import load_dataset

# create dataset for training tokenizer
dataset = load_dataset("wikipedia/data/20230901/ja/",
                       data_files="train-*-of-0016.parquet", split="train")
# Write the dataset's text column out to wiki.txt
dataset.to_csv("wiki.txt", columns=["text"], sep="\t", index=False,
               header=False, quoting=csv.QUOTE_NONE, escapechar="\\")
```
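Optionally (this is not part of the original recipe), you can sanity-check the loaded dataset before it is written out:

```python
# Illustrative check: row count, columns, and a peek at the first article.
print(dataset)
print(dataset[0]["text"][:200])
```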
If necessary, remove the data you no longer need with the following (this also trims wiki.txt down to the first 10% of its lines to keep training short).
```bash
%%bash
rm -r wikipedia/
rm -r spm_tokenizer_neologdn_bytefallback_nofast/model.safetensors
LINES=`wc -l wiki.txt | awk '{print $1}'`
TRAIN_DATA_LINES=$(($LINES*10/100))
head -n $TRAIN_DATA_LINES wiki.txt > wiki2.txt
rm -r wiki.txt
mv wiki2.txt wiki.txt
```
Preparing the tokenizer
I use the already-trained tokenizer from ce-lery/japanese-mistral-300m-base.
```python
# todo: download the pretrained tokenizer
!git clone https://huggingface.co/ce-lery/japanese-mistral-300m-base.git
# Rename it to spm_tokenizer_neologdn_bytefallback_nofast, the directory name the recipe expects
!mv japanese-mistral-300m-base spm_tokenizer_neologdn_bytefallback_nofast
```
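As a quick check (again an addition for illustration, not part of the recipe), you can confirm that the renamed tokenizer directory loads and round-trips Japanese text:

```python
from transformers import AutoTokenizer

# Load the SentencePiece tokenizer from the renamed directory.
tokenizer = AutoTokenizer.from_pretrained(
    "spm_tokenizer_neologdn_bytefallback_nofast", use_fast=False
)
ids = tokenizer("日本語のテスト文です。")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```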
Pretraining
Setting the training parameters
Edit the model size and the model's own parameters in /content/japanese-mistral-300m-recipe/pretrain/train/mistral-300m/config.json (after the edit it looks like the following).
The parameters are based on rinna/japanese-gpt2-small.
{ "architectures": [ "MistralForCausalLM" ], "bos_token_id": 0, "eos_token_id": 0, "hidden_act": "silu", "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 2400, "max_position_embeddings": 2048, "model_type": "mistral", "num_attention_heads": 12, "num_hidden_layers": 12, "num_key_value_heads": 6, "rms_norm_eps": 1e-05, "rope_theta": 10000.0, "sliding_window": 1024, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.35.2", "use_cache": true, "vocab_size": 50257 }
Set the training parameters as follows.
```bash
%%bash
# model_type and config_name are set to gpt2 / gpt2-medium, but they are overwritten with "mistral" internally, so they can be ignored
# 1 epoch = 5657 steps, so 100 steps is roughly 0.02 epoch. On a T4 one epoch would take around 100 hours, so the run is capped at 100 steps
cat << EOS > hf_config_quick_start.json
{
    "model_type": "gpt2",
    "config_name": "gpt2-medium",
    "tokenizer_name": "../spm_tokenizer_neologdn_bytefallback_nofast",
    "train_file": "../wiki.txt",
    "validation_split_percentage": 5,
    "output_dir": "checkpoints-mistral-300M-FA2",
    "do_train": true,
    "do_eval": true,
    "prediction_loss_only": true,
    "remove_unused_columns": false,
    "learning_rate": 6.0e-4,
    "weight_decay": 0.1,
    "adam_beta2": 0.95,
    "max_steps": 100,
    "logging_dir": "checkpoints-mistral-300M-FA2/logs",
    "logging_strategy": "steps",
    "logging_steps": 10,
    "evaluation_strategy": "steps",
    "save_strategy": "steps",
    "eval_steps": 100,
    "save_steps": 100,
    "load_best_model_at_end": true,
    "save_total_limit": 2,
    "warmup_steps": 1000,
    "lr_scheduler_type": "cosine",
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "block_size": 1024,
    "adam_epsilon": 1.0e-4,
    "fp16": true,
    "gradient_accumulation_steps": 256,
    "push_to_hub": false,
    "dataloader_num_workers": 8,
    "optim": "adamw_bnb_8bit",
    "torch_compile": true
}
EOS
```
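For intuition, here is what these settings imply per optimizer step on a single GPU (simple arithmetic, not part of the recipe):

```python
per_device_batch = 16      # per_device_train_batch_size
grad_accum = 256           # gradient_accumulation_steps
block_size = 1024          # tokens per sequence

sequences_per_step = per_device_batch * grad_accum   # 4096 sequences
tokens_per_step = sequences_per_step * block_size    # 4,194,304 tokens
print(sequences_per_step, tokens_per_step)
```

With the trimmed dataset (142,865 training samples in the log below), 100 steps of 4096 sequences each works out to roughly 2.87 epochs, which matches the epoch count reported at the end of training.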
Running the pretraining
```python
%%time
%cd pretrain/
# Delete the empty trained-model folder left over from a previous run
!rm -r checkpoints-mistral-300M-FA2
!deepspeed --no_local_rank train/run_clm.py ../hf_config_quick_start.json --deepspeed --deepspeed_config train/ds_config_zero3.json
%cd ../
```
The log at the end of training looks like this:
```
[INFO|trainer.py:2139] 2024-01-23 13:15:06,193 >> Loading best model from checkpoints-mistral-300M-FA2/checkpoint-100 (score: 8.360694885253906).
{'train_runtime': 4915.0289, 'train_samples_per_second': 83.336, 'train_steps_per_second': 0.02, 'train_loss': 9.417760696411133, 'epoch': 2.87}
100% 100/100 [1:21:55<00:00, 49.15s/it]
[INFO|trainer.py:2881] 2024-01-23 13:15:06,396 >> Saving model checkpoint to checkpoints-mistral-300M-FA2
[INFO|configuration_utils.py:461] 2024-01-23 13:15:06,397 >> Configuration saved in checkpoints-mistral-300M-FA2/config.json
[INFO|configuration_utils.py:564] 2024-01-23 13:15:06,398 >> Configuration saved in checkpoints-mistral-300M-FA2/generation_config.json
[INFO|modeling_utils.py:2193] 2024-01-23 13:15:07,386 >> Model weights saved in checkpoints-mistral-300M-FA2/pytorch_model.bin
[INFO|tokenization_utils_base.py:2428] 2024-01-23 13:15:07,387 >> tokenizer config file saved in checkpoints-mistral-300M-FA2/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2024-01-23 13:15:07,387 >> Special tokens file saved in checkpoints-mistral-300M-FA2/special_tokens_map.json
[INFO|tokenization_t5_fast.py:191] 2024-01-23 13:15:07,388 >> Copy vocab file to checkpoints-mistral-300M-FA2/spiece.model
***** train metrics *****
  epoch                    =       2.87
  train_loss               =     9.4178
  train_runtime            = 1:21:55.02
  train_samples            =     142865
  train_samples_per_second =     83.336
  train_steps_per_second   =       0.02
01/23/2024 13:15:07 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3158] 2024-01-23 13:15:07,401 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2024-01-23 13:15:07,401 >>   Num examples = 8752
[INFO|trainer.py:3163] 2024-01-23 13:15:07,401 >>   Batch size = 16
100% 547/547 [00:46<00:00, 11.68it/s]
***** eval metrics *****
  epoch                   =      2.87
  eval_loss               =    8.3607
  eval_runtime            = 0:00:47.18
  eval_samples            =      8752
  eval_samples_per_second =   185.478
  eval_steps_per_second   =    11.592
  perplexity              = 4275.6648
[INFO|modelcard.py:452] 2024-01-23 13:15:54,937 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
[2024-01-23 13:15:58,618] [INFO] [launch.py:347:main] Process 32974 exits successfully.
/content/japanese-mistral-300m-recipe
CPU times: user 25.4 s, sys: 4.65 s, total: 30.1 s
Wall time: 1h 23min 18s
```
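As a quick cross-check, the perplexity in the log is simply the exponential of the eval loss:

```python
import math

# exp(8.3607) is roughly 4275.7, matching the perplexity reported above.
print(math.exp(8.3607))
```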
Inference
Let's actually run inference with the model we just built.
Inference code
```python
%%time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

MODEL_NAME = "./pretrain/checkpoints-mistral-300M-FA2"
torch.set_float32_matmul_precision('high')

DEVICE = "cuda"
if torch.cuda.is_available():
    print("cuda")
    DEVICE = "cuda"
else:
    print("cpu")
    DEVICE = "cpu"
# DEVICE = "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
).to(DEVICE)
# streamer = TextStreamer(tokenizer)

prompt = "大規模言語モデルとは、"  # "A large language model is..."
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
        do_sample=True,
        early_stopping=False,
        top_p=0.95,
        top_k=50,
        temperature=0.9,
        # streamer=streamer,
        no_repeat_ngram_size=2,
        num_beams=3,
    )

print(outputs.tolist()[0])
outputs_txt = tokenizer.decode(outputs[0])
print(outputs_txt)
```
Result
```
cuda
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
[9114, 2342, 1073, 396, 260, 3528, 316, 42974, 316, 316, 10951, 316, 260, 260, 262, 260, 261, 261, 260, 10951, 42974, 906, 260, 316, 275, 260, 272, 260, 273, 261, 272, 261, 262, 261, 263, 260, 264, 260, 263, 261, 275, 261, 273, 260, 267, 260, 265, 261, 10951, 718, 260, 275, 262, 316, 261, 279, 260, 282, 261, 316, 268, 260, 279, 261, 265, 260, 1904, 260, 268, 261, 401, 260, 284, 260, 266, 260, 278, 260, 283, 261, 264, 262, 262, 42974, 42974, 262, 272, 262, 264, 261, 267, 261, 266, 261, 644, 260, 318, 260, 42974, 265, 263, 262, 1904, 261, 42974]
大規模言語モデルとは、子どもの \ \ ...(the rest of the generated text is meaningless fragments of punctuation and stray characters)
CPU times: user 4.3 s, sys: 324 ms, total: 4.62 s
Wall time: 4.01 s
```
Uploading the model to Hugging Face
Upload the model you built to Hugging Face.
Installing the library and logging in
First, install the library.
```python
!pip install -U "huggingface_hub[cli]"
```
Then log in.
```python
!huggingface-cli login
```
You will then be prompted for a token, so enter a token with write permission created from the Hugging Face settings page.
```
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token:
```
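Alternatively, you can log in from Python rather than the interactive prompt (a small sketch; the token string is a placeholder):

```python
from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder; paste your write-scoped token
```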
Uploading the model
Upload the model with the following command.
```python
!huggingface-cli upload {your repository id} pretrain/checkpoints-mistral-300M-FA2 .
```
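The same thing can also be done from Python with huggingface_hub's upload_folder, in case you prefer scripting it (a sketch; the repository id is a placeholder, and the repository is created first if it does not exist):

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-name/your-model", exist_ok=True)  # no-op if it already exists
api.upload_folder(
    folder_path="pretrain/checkpoints-mistral-300M-FA2",
    repo_id="your-name/your-model",  # placeholder repository id
    repo_type="model",
)
```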
It will be uploaded as shown below.