New Japanese language models have been released by the Tohoku University NLP group.
They are updates to their existing models.
We have updated the Japanese BERT models published by the Tohoku University NLP Group (@NlpTohoku) and added four new models trained on CC-100 and Wikipedia. The accompanying code has also been updated for TensorFlow v2.11. We hope they prove useful for research, education, and development. https://t.co/O4H2llCLyn
— Masatoshi Suzuki (@fivehints) May 19, 2023
The four newly released models are:
- cl-tohoku/bert-base-japanese-v3
- cl-tohoku/bert-base-japanese-char-v3
- cl-tohoku/bert-large-japanese-v2
- cl-tohoku/bert-large-japanese-char-v2
The models with "char" in the name use character-level tokenization; the ones without it presumably use Unidic 2.1.2-based (word-level) tokenization.
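As a quick check, you can compare the two tokenizers directly. A minimal sketch (the example sentence is my own; the word-level tokenizer also needs fugashi and unidic_lite installed):

```python
from transformers import AutoTokenizer

text = "日本の首都は東京です。"

# Word-level model: MeCab/Unidic pre-tokenization followed by WordPiece
word_tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")
print(word_tok.tokenize(text))

# Char-level model: roughly one token per character
char_tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v3")
print(char_tok.tokenize(text))
```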
Trying it with the same code as for the CyberAgent model, it ended up like this.
To run it, fugashi and unidic_lite are required, so you need to pip install them first. unidic_lite may not be necessary for the char models, but I haven't tried.
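Something along these lines should cover it (colorama is also used by the script below):

```
pip install torch transformers fugashi unidic_lite colorama
```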
GPU memory consumption is only about 1.4 GB, so it should run in most environments where CUDA works.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from colorama import Fore, Back, Style, init  # need fugashi, unidic_lite

init(autoreset=True)

# Pick one of the four new models
# model_name = "cl-tohoku/bert-base-japanese-v3"
# model_name = "cl-tohoku/bert-base-japanese-char-v3"
model_name = "cl-tohoku/bert-large-japanese-char-v2"
# model_name = "cl-tohoku/bert-large-japanese-v2"
print("model:" + model_name)

# BERT is an encoder, so is_decoder=True is required to use it for generation
model = AutoModelForCausalLM.from_pretrained(model_name, is_decoder=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prompt = "AIによって私達の暮らしは、"
prompt = "アメリカの首都はワシントン。日本の首都は"
# prompt = "吾輩は猫で"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.pad_token_id,
    )

# Strip the spaces the tokenizer inserts between tokens
output = tokenizer.decode(tokens[0], skip_special_tokens=True).replace(" ", "")
# Print the prompt in yellow and the generated continuation in white
print(f"{Fore.YELLOW}{prompt}{Fore.WHITE}{output[len(prompt):]}")
```
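As an aside, these are masked language models rather than causal ones, which is presumably why generation with this kind of code comes out rough. The task they were actually trained on is closer to fill-mask; a minimal sketch, assuming the base v3 model and an example sentence of my own:

```python
from transformers import pipeline

# Predict the masked token, the task these BERT models were trained on
fill = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v3")
for cand in fill("日本の首都は[MASK]です。"):
    print(cand["token_str"], round(cand["score"], 3))
```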