Last time, I tried managing RAG experiments with Phoenix. That said, I only evaluated with the features predefined in Phoenix, so this time I want to take a fresh look at how to evaluate the accuracy of a RAG application.
Ragas is a well-known tool for RAG evaluation, so this time I will study evaluation while using it.
Ragas
Now let's get into the main topic of this article: evaluating RAG.
Evaluation metrics used in Ragas
Ragas defines evaluation metrics for Retrieval and Generation separately.
- Generation
  - faithfulness: factual consistency of the answer with the given context
  - answer relevancy: how relevant the generated answer is to the given prompt
- Retrieval
  - context precision
  - context recall
Of course, there are metrics worth considering beyond the ones Ragas lists, but for continuously improving a RAG application it seems sensible to check evaluation results for Retrieval and Generation separately.
Basic evaluation metrics
The metrics Ragas currently provides are listed below; you import whichever ones you need and use them (see the sketch after this list).
- Generation
  - Faithfulness
  - Answer relevancy
- Retrieval
  - Context precision
  - Context recall
  - Context relevancy
  - Context entity recall
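To make the list concrete: in the Ragas release used here (a v0.1-era API is assumed below), each metric is exposed as a ready-made object in `ragas.metrics`, and you pass whichever subset you need to `evaluate()`. A minimal sketch; the availability of `context_relevancy` and `context_entity_recall` depends on the version:

```python
# Minimal sketch, assuming a Ragas v0.1-era API; names may differ in other versions.
from ragas.metrics import (
    faithfulness,           # Generation: consistency of the answer with the context
    answer_relevancy,       # Generation: relevance of the answer to the question
    context_precision,      # Retrieval: are relevant chunks ranked near the top?
    context_recall,         # Retrieval: does the context cover the ground truth?
    context_relevancy,      # Retrieval: relevance of the context to the question
    context_entity_recall,  # Retrieval: entity coverage of the ground truth
)

# Pick the subset you care about and hand it to ragas.evaluate() later.
selected_metrics = [faithfulness, answer_relevancy, context_precision, context_recall]
```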
Faithfulness
This measures the factual consistency of the generated answer against the given context. It is calculated from answer and retrieved context. The answer is scaled to (0,1) range. Higher the better.
This expresses the factual consistency of the generated answer with the given context. Note that it does not check whether the answer matches the ground truth; it measures consistency between the context and the generated answer.
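Roughly speaking, the docs describe it as the fraction of claims in the answer that can be inferred from the retrieved context. A toy illustration of that ratio (not the actual Ragas implementation, which uses an LLM to extract and verify the claims):

```python
# Toy illustration only: the real metric uses an LLM to extract claims from the
# answer and to judge whether each claim is supported by the retrieved context.
answer_claims = [
    "Motosu High School appears in Yuru Camp",  # supported by the context
    "The school is located in Tokyo",           # not supported -> counts against the score
]
supported_claims = 1
faithfulness_score = supported_claims / len(answer_claims)
print(faithfulness_score)  # 0.5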
Answer relevancy
The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.
This expresses how relevant the generated answer is to the given prompt; incomplete or redundant answers are apparently given lower scores.
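Under the hood, Ragas reportedly has an LLM regenerate several questions from the answer and then takes the mean embedding similarity between those and the original question, so vague or off-topic answers produce questions that drift away from the original. A rough sketch of that final averaging step, with hypothetical embedding vectors (the LLM-driven question generation is omitted):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: the original question vs. questions regenerated from the answer.
original_question = np.array([0.9, 0.1, 0.0])
regenerated_questions = [
    np.array([0.8, 0.2, 0.1]),
    np.array([0.7, 0.3, 0.0]),
]

# Answer relevancy is roughly the mean similarity to the original question.
score = float(np.mean([cosine(original_question, q) for q in regenerated_questions]))
print(score)
```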
Context recall, Context precision
Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This metric is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
This metric evaluates whether all of the context items relevant to the ground truth are ranked near the top. It appears to be computed by comparing the question, the ground truth, and the contexts.
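My reading of the docs (not the Ragas source) is that this is a rank-weighted average: for each position k that holds a relevant chunk, precision@k is computed, and those values are averaged over the number of relevant chunks. A small worked example under that reading:

```python
# Worked example of a rank-weighted average of precision@k (my reading of the docs).
relevance = [1, 0, 1, 0]  # 1 = retrieved chunk relevant to the ground truth, in rank order

precisions_at_hits = []
hits = 0
for k, rel in enumerate(relevance, start=1):
    if rel:
        hits += 1
        precisions_at_hits.append(hits / k)  # precision@k at this relevant position

context_precision_score = sum(precisions_at_hits) / max(hits, 1)
print(context_precision_score)  # (1/1 + 2/3) / 2 = 0.833...
```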
Context Relevancy
This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.
This appears to measure the relevancy between the retrieved context and the question.
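My understanding from the docs is that an LLM picks out the sentences in the retrieved context that are actually needed to answer the question, and the score is their share of all context sentences, roughly:

```python
# Rough reading of the metric: share of context sentences needed to answer the question
# (the sentence selection itself is done by an LLM in Ragas).
total_sentences_in_context = 10
sentences_relevant_to_question = 3
context_relevancy_score = sentences_relevant_to_question / total_sentences_in_context
print(context_relevancy_score)  # 0.3
```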
Context entities recall
This metric gives the measure of recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone. Simply put, it is a measure of what fraction of entities are recalled from ground_truths. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in ground_truths, because in cases where entities matter, we need the contexts which cover them.
Based on the entities that appear in both the ground_truths and the contexts, relative to the number of entities in the ground_truths alone, this appears to express what fraction of the ground-truth entities the retrieved context contains.
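Written out, that is |entities(contexts) ∩ entities(ground_truths)| / |entities(ground_truths)|. A tiny worked example with toy entity sets (entity extraction is done by an LLM in the real metric):

```python
# Entity-level recall: which of the ground-truth entities the retrieved context covers (toy values).
ground_truth_entities = {"ライオス・トーデン", "ダンジョン飯", "2024年1月"}
context_entities = {"ダンジョン飯", "2024年1月", "マルシル"}

score = len(ground_truth_entities & context_entities) / len(ground_truth_entities)
print(score)  # 2/3 ≈ 0.67
```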
Trying it out
Now that I have a rough understanding of the evaluation metrics, let's actually try them.
Evaluation
Since the data this time is fetched from Wikipedia, I will evaluate the RAG by asking questions that an LLM alone would not be able to answer.
The evaluation dataset I prepared looks like this:
```python
questions = [
    "ゆるキャンの高校はどこ?",                          # Which high school appears in Yuru Camp?
    "ゆるキャンの3期はいつから放送?",                    # When did season 3 of Yuru Camp start airing?
    "ダンジョン飯の主人公は誰ですか?",                   # Who is the protagonist of Dungeon Meshi?
    "ダンジョン飯のアニメはいつから放送していますか?",    # When did the Dungeon Meshi anime start airing?
]
ground_truths = [
    "本栖高校",           # Motosu High School
    "2024年4月",          # April 2024
    "ライオス・トーデン",   # Laios Touden
    "2024年1月",          # January 2024
]

# Generate the contexts and answers for each question
contexts = []
answers = []
for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))
```
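For reference, `query_engine` above is a LlamaIndex query engine built over the Wikipedia data. A minimal sketch of how it might have been constructed (this assumes LlamaIndex and its Wikipedia reader; package names and import paths differ between versions, and the page titles are my guess from the questions):

```python
# Minimal sketch, assuming LlamaIndex and its Wikipedia reader; not the post's exact setup.
from llama_index.core import VectorStoreIndex
from llama_index.readers.wikipedia import WikipediaReader

# Hypothetical page titles; the post only says the data was fetched from Wikipedia.
documents = WikipediaReader().load_data(pages=["ゆるキャン△", "ダンジョン飯"])

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
```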
Running the evaluation against this gives something like the following:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_correctness,
    answer_relevancy,
    context_recall,
    context_precision,
)

# Prepare the dataset
ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    }
)

# Run the evaluation
result = evaluate(
    ds,
    [faithfulness, answer_correctness, answer_relevancy, context_recall, context_precision],
)
print(result)
```
The output looks like this:
```
{'faithfulness': 0.7500, 'answer_correctness': 0.5322, 'answer_relevancy': 0.8860, 'context_recall': 0.7500, 'context_precision': 0.8750}
```
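The values printed above are averages over the four questions. If you want a per-question breakdown, the result object can be converted to a pandas DataFrame; in the Ragas versions I am aware of this is done with `to_pandas()`, though the exact column names may vary:

```python
# Per-question breakdown instead of the dataset-level averages printed above.
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_recall", "context_precision"]])
```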
You do need to prepare appropriate evaluation questions and ground-truth data, but I found that as long as you have them, RAG can be evaluated in a unified way.
Notebook used this time
References
- GitHub - explodinggradients/ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
- Introduction | Ragas
- RAG評価ツール ragas を試す|npaka
- RAGASを試す
Thoughts
There were quite a few places where things did not work when I ran the documentation examples as-is, so when actually using Ragas you may need to cross-check against the source code. That said, I now have a rough grasp of the concepts and usage of RAG evaluation, so in that respect it was worthwhile.