æè¿ãããªè¨äºãè¦ããã¾ããã
èªåãRAGã¨ãã¡ããã£ã¨åå¼·ãã¦ãããã¦LLMã¢ããªã±ã¼ã·ã§ã³ã®è©ä¾¡å¨ãã¯ãã£ã¨æ°ã«ãªãã¨ããã§ã¯ãã£ãã®ã§ãä¸è¨ã®è¨äºãè¦ã¦ã¡ãã£ã¨åå¼·ãã¦ã¿ãæ°ã«ãªãã¾ããã
ãã£ãããªã®ã§ãè²ã ä½ããªããè©ä¾¡ã«ã¤ãã¦èªåã§èãã¦ã¿ãããã¨æãã¾ãã
- LLMã¢ããªã±ã¼ã·ã§ã³ã®è©ä¾¡
- 試ãã«ä½ã£ã¦ã¿ã
- ä½ã£ã¦ã¿ã
- ä½ã£ãã³ã¼ã
- åèæç®
- ææ³
LLMã¢ããªã±ã¼ã·ã§ã³ã®è©ä¾¡
è©ä¾¡ã®3ã¬ã¤ã¤ã¼
LLMã¢ããªã±ã¼ã·ã§ã³ã®è©ä¾¡ã¨ä¸å£ã«è¨ã£ã¦ããè¤æ°ã®ã¬ã¤ã¤ã¼ã«åãã¦èãããã¨ãã§ããããã§ãã
è©ä¾¡ææ¨ã«ã¯ã¬ã¤ã¤ã¼ã®æ¦å¿µããããã¨ã念é ã«ããã¦ããå¿ è¦ãããã§ãããã
- ã¬ãã«1: LLMæ©è½ã»ã¢ããªã±ã¼ã·ã§ã³ãã®ãã®ã«å¯¾ããè©ä¾¡
- åºåã«å¯¾ããã®è©ä¾¡
- æå¾ ããã¢ã¦ããããï¼Grand Truthã¨å®éã®ã¢ã¦ããããã®æ¯è¼
- åºåã®å¦¥å½æ§ã®è©ä¾¡ï¼LLM as a Judgeã§æ±ãï¼
- ã¬ã¤ãã³ã·ã¼ãªã©ã®éæ©è½è¦ä»¶ã®è©ä¾¡
- ã¬ãã«2: LLMæ©è½ã»ã¢ããªã±ã¼ã·ã§ã³ã«å¯¾ããã¦ã¼ã¶ã¼ã®åå¿ãæåã«å¯¾ããè©ä¾¡
- ã¦ã¼ã¶ã¼ããã®ç´æ¥çãªãã£ã¼ãããã¯ï¼Good/Badãã¿ã³ã§ã®è©ä¾¡ãªã©ï¼
- ã¦ã¼ã¶ã¼ã®å©ç¨ç¶æ³ï¼ã¯ãªãã¯çãåå ¥ãçãªã©ï¼
- ã¬ãã«3: KPIãåä¸ãããã©ããã®è©ä¾¡ LLMアプリケーションの評価入門〜基礎から運用まで徹底解説〜
LLMæ©è½ã»ã¢ããªã±ã¼ã·ã§ã³ãã®ãã®ã«å¯¾ããè©ä¾¡ãè¡ã£ãããã§ãã¬ãã«2ãã¬ãã«3ã®ã¬ã¤ã¤ã¼ã«ã¤ãã¦ãè©ä¾¡ããå¿ è¦ãããããã§ãã
ã¬ãã«2ãã¬ãã«3ã¯ã©ããã¦ãã¦ã¼ã¶ã¼ã«å¯¾ãã¦åºãã¦ã¿ãªãã¨ããããªãé¨åãå¤ãã試è¡åæ°ã¨ãã¦ã¯å°ãªããªããã¡ã§ãããã®ããããªãã¹ãã¬ãã«1ã®æ®µéã§ãã¬ãã«2, ã¬ãã«3ã®ãã¹ãã¾ã§ãããªãã¦ããããåé¡ãã¯è¦ã¤ãããããã§ããã
LLMæ©è½ã»ã¢ããªã±ã¼ã·ã§ã³ãã®ãã®ã«å¯¾ããè©ä¾¡
ã¬ãã«2ãã¬ãã«3ã¾ã§è©ä¾¡ã大äºã ãããããã¬ãã«1ã®ãLLMæ©è½ã»ã¢ããªã±ã¼ã·ã§ã³ãã®ãã®ã«å¯¾ããè©ä¾¡ããå質ã»å¹çè¯ãå®è¡ãã¦ããããããã§ãã
ãLLMæ©è½ã»ã¢ããªã±ã¼ã·ã§ã³ãã®ãã®ã«å¯¾ããè©ä¾¡ããã©ãããã®ãèãã¾ãã
LLMã¢ããªã±ã¼ã·ã§ã³ã®åºåã«å¯¾ããã®è©ä¾¡ãã©ã®ããã«è¡ãã¹ããªã®ããèãã¦ããã¾ãããã
ä¸è¨ã§ãè¿°ã¹ãããã«ãåºåã«å¯¾ããã®è©ä¾¡ã«ã¯å¤§ããåãã¦2ã¤ã®è©ä¾¡æ¹æ³ããã£ã¦ã
- æå¾ ããã¢ã¦ããããï¼Grand Truth)ã¨å®éã®ã¢ã¦ãããããæ¯è¼ãã¦ã¹ã³ã¢ãªã³ã°ãã
- å®ç¾©ããè©ä¾¡åºæºã«åºã¥ãã¦ãã·ã¹ãã ã®åºåã®å¦¥å½æ§ãã¹ã³ã¢ãªã³ã°ï¼åæ ¼/ä¸åæ ¼ãå¤å®ï¼ããã¨ãããã®ãããã¾ãã LLMアプリケーションの評価入門〜基礎から運用まで徹底解説〜
ããè¨ã"精度"ã£ã¦ãã¤ã測å®ããããã§ããããç®åºããã«ã¯ãæå¾ ããã¢ã¦ããããããç¨æããå¿ è¦ãããã¾ãã 決ã¾ã£ãå½¢å¼ã®åçã§ããã°æååä¸è´ãããã³ã°è·é¢ãªã©ã§ä¸æååä½ã§ä¸è´ãå¤å®ããã®ãããã§ãããã
ããã§ã¯ãªãããããç¨åº¦ã®èªç±åçãã許ãã¤ã¤ããæå¾ ããã¢ã¦ãããããã¨æå³ãä¸è´ãã¦ãããã©ãããå¤æããã«ã¯ãã©ããã¦ãæååã¬ãã«ã®å¤å®ã§ã¯é£ããå¥ã®æ¹æ³ãç¨ããå¿ è¦ãããã¾ãã
LLM-as-a-Judge
ãã¡ãã人éã®ç®è¦ã§å¤å®ããã§ãè¯ãã®ã§ããã人éãæ¯åç®è¦ç¢ºèªãã¦ãã¦ã¯éçºå¹çã¯å½ç¶ä¸ããã¾ããã ã¨ãããã¨ã§ãLLMã®åºåããåçãæå¾ ããã¢ã¦ããããã¨æå³çã«åè´ãã¦ãããã©ãããå¤å®ããããã«"LLM-as-a-Judge"ã¨ããããæ¹ãå¤ãã®å ´ååããã¦ãã¾ãã
LLM-as-a-Judgeãã©ãå®ç¾ãããã®ç´°ããããæ¹ã¯æ§ã èããããã§ãããããå人çã«ã¯ãã®ããæ¹ãè¯ããªã¨æãã¾ããã
ãããªæãã®ããã³ããã使ã£ã¦æ¡ç¹ãè¡ã£ã¦ããã¾ãã
system-message
ããªã(evaluation-assistant)ã«ã¯å¥ã®ã¢ã·ã¹ã¿ã³ã(suggestion-assistant)ã®ã¡ãã»ã¼ã¸ãè©ä¾¡ãã¦ããã ãã¾ãã ## suggestion-assistantã®åæ suggestion-assistantã¯~~~~
user-message
suggestion-assistantã®æå¾ã®è¿çãã©ã®ç¨åº¦ä¸è¨ã®ææ¸ä½æããã¥ã¢ã«ã«å¾ã£ã¦ãããã§0ã100ç¹ã§scoreãã¤ãã¦ãã ãã ### ææ¸ã©ã¤ãã£ã³ã°ã®æ¹é - ä¸å¯§ã«å¯¾å¿ãã - ~~~
ãããã
{ "evaluated_text": {è©ä¾¡å¯¾è±¡ã®æç« }, "reason": {å¤æçç±}, "score": {score} }
ã®ãããªå½¢å¼ã§å¿çããããã¨ã§ãã¢ããªã±ã¼ã·ã§ã³ã®è¯ãæªãã確èªã§ããããã«ãã¦ããããã§ãã ãããããµãã«æ¸ãã°ç¢ºãã«åºæºã«æ²¿ã£ãå¤å®ãã§ãããã§ããã
試ãã«ä½ã£ã¦ã¿ã
ããããæ¦è¦ãããã£ãã¨ããã§ãå®éã«è©ä¾¡ãããæãã«ãããã¨ãèãã¦ããã¾ãã
è©ä¾¡è¦³ç¹
ã¡ããã¨ä½ããªãè¨ãåãã¨ãè¨èé£ãã¨ãæ°ã«ããã»ããè¯ããã§ãããä»åã¯å 容ãä¸è´ãã¦ãããã ãã§è©ä¾¡ãããã¨æãã¾ãã
è©ä¾¡ã«ããã£ã¦ä¸è¨ã®æ å ±ãè©ä¾¡ç¨ã®LLMã«å®æ½ããããã¨æãã¾ãã
- score: LLMã®åºåã¨æå¾ ããåçã¨ã®ä¹é¢ã®åº¦åã
- reason: scoreã®å¤æçç±
scoreã«ã¤ãã¦ãåºæºã¯ä¸è¨ã®ããã«ãããã¨æãã¾ãã
- 1.0: LLMã®åºåã¨æå¾ ããåçãåè´ãã¦ãã
- 0.5: æå¾ ããåçã«é¨åçã«ä¸è´ãã¦ãã
- 0.0: LLMã®åºåã¨æå¾ ããåçãå®å ¨ã«ç°ãªã£ã¦ãã
å¾ã¯å¤æçç±ãä¸ç·ã«åºåããã¦chain of thoughtãç¨ãã¦ãããããã¨æãã¾ãã
ããã³ãã
ãã®è©ä¾¡ãè¡ãããã³ããã¯ä¸æ¦ãããªæãã«ãã¦ã¿ã¾ããã ï¼ãã£ã¨è¯ãã®ãããã°èª°ããã£ããæãã¦ãã ããï¼
system message
ããªã(evaluation-assistant)ã«ã¯å¥ã®ã¢ã·ã¹ã¿ã³ã(suggestion-assistant)ã®ã¡ãã»ã¼ã¸ãè©ä¾¡ãã¦ããã ãã¾ãã 以ä¸ã«ç¤ºã質åã«å¯¾ãã¦ã³ã³ããã¹ãã®æ å ±ããã¨ã«ãæå¾ ããåçãsuggestion-assistantãåºåãããã¨ãæå¾ ãã¦ãã¾ãã ### 質å {question} ### ã³ã³ããã¹ã {context} ### æå¾ ããåç {answer}
user message
suggestion-assistantã®åºåãã©ã®ç¨åº¦æå¾ ããåçã«åè´ãã¦ãããã§0.0ã1.0ã®ç¯å²ã§scoreãã¤ãã¦ãã ããã ## ç¹æ°ã®åºæº - 1.0: æå¾ ããåçã®å 容ã¨åè´ãã¦ãã - 0.8: æå¾ ããåçã®å 容ã¨ããããåè´ãã¦ãã¦ããããã«ç°ãªã£ã¦ãã - 0.5: æå¾ ããåçã®å 容ã¨é¨åçã«ä¸è´ãã¦ãã - 0.2: æå¾ ããåçã®å 容ã¨ããããç°ãªã£ã¦ãã¦ãããä¸é¨ã ãåè´ãã¦ãã - 0.0: æå¾ ããåçã®å 容ã¨å®å ¨ã«ç°ãªã£ã¦ãã ## suggestion-assistantã®åºå {output} ## Output score: reason:
ä½ã£ã¦ã¿ã
ã¡ãã£ã¨åã«ãã¹ããã¼ã¿ã»ããã試ãã«ä½æããã¨ããã®ããã£ã¦ããã®ã§ãããã®å»¶é·ã§ä»åã®è©ä¾¡ããã£ã¦ã¿ããã¨æãã¾ãã
æ¸ããã³ã¼ãã¯ãã®å¾ã«ç½®ãã¨ãã¨ãã¦ãæçµçãªã¢ãã¿ãªã³ã°ã®ç»é¢ã¯ãããªæãã«ãªã£ã¦ã¾ããã
ä»åã¯è©ä¾¡ã®è©³ç´°ã«ã¤ãã¦è¦ã¦ã¿ãã¨ãåç精度ã¯QA Correctness
: 0.33ã£ã¦äºã«ãªã£ã¦ã¾ããã
ããã¯äºåã«æ±ºããåºæºãè¸ã¾ããã¨ããæå¾
ããåçã®å
容ã¨ããããç°ãªã£ã¦ãã¦ãããä¸é¨ã ãåè´ãã¦ãããããã¡ãã£ã¨ãã·ã£ã¦ã¬ãã«ã®RAGã«ãªã£ã¦ããã¨ãããã¨ããããã¾ããã
ååçã®æ¡ç¹çç±ã«ã¤ãã¦ãfeedbackãè¦ãã¨ç¢ºèªã§ããããã«ãªã£ã¦ããããããªæãã§ãã
retrievalã®ç²¾åº¦ã«ãããã¦ã¯ãããªæãã§ãã
æ¤ç´¢çµæã«å¯¾ããã³ã¡ã³ãã¨ãããã®ã§ã欲ããæ å ±ãå ¥ã£ã¦ãããã©ããã確èªã§ãã¾ããã
ä»åã®å®è£ ã§ã¯ãä¸èº«ãè¦ã¦ã¿ãæãã¨ãã¦ã¯
- åçã®ç²¾åº¦ããã¾ãè¯ããªã
- é¨åçã«åçã§ãã¦ããç¨åº¦ã§ãååãªå 容ã®åçã¯ã§ãã¦ããªã模æ§
- æ¤ç´¢ã®æ¹ã§æ£ããç®æãåç
§ã§ãã¦ããªã
- ãã®å½±é¿ãªã®ããæçµçãªåçãããã¾ãæ£ãããªã
ã¨ããç¶æ³ãªãã¨ããããã¾ããã
精度ã¯ãã¾ãè¯ãã¯ããã¾ããããä»åè©ä¾¡ã®ã³ã¼ããå ãããã¨ã§ä»ä½ã£ã¦ãRAGã
- ã©ããããã®ç²¾åº¦ãªã®ã
- åå¥è¦ç´ ãã©ããããã®ç²¾åº¦ã§çµã¿åããã£ã¦ããã®ã
ã説æã§ããããã«ãªãã¾ããã
ä½ã£ãã³ã¼ã
ä»åä½ã£ãã³ã¼ãã¯ãã¡ãã«ããã¾ãã
åèæç®
ä¸è¨ã®æç®ãåèã«ããã¦ããã ãã¾ããã
- LLMアプリケーションの評価入門〜基礎から運用まで徹底解説〜
- LLMによるLLMの評価「LLM-as-a-Judge」入門〜基礎から運用まで徹底解説
- LLMアプリケーションの評価の運用についてまとめてみた
ææ³
以ä¸ãRAGã®è©ä¾¡ã©ããã£ã¦ãã£ããè¯ããã ï¼ã¨æã£ã¦ããã¨ããã«ããæãã®ããã°ãè¦ãããã®ã§ãã£ã¦ã¿ãè¨äºã§ããã LLM-as-a-judgeã£ã¦ã©ããã£ã¦ãã£ããè¯ããã ï¼ã£ã¦é·ãéãã£ã¨èãã¦ã¾ãããããããã¡ããã¨ã§ããããã«ãªã£ãæ°ããã¾ãã