Amazon Bedrock Advent Calendar 2024

Day 25

Beyond RAGAS!? Trying out "RAGChecker", the RAG evaluation framework proposed by an Amazon research team

Posted at 2024-12-25

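Introduction

A research team at Amazon has published a paper on a new evaluation framework for RAG (Retrieval-Augmented Generation).

In short, the paper identifies the weaknesses of existing RAG evaluation frameworks and proposes a new method that addresses them.

The work has also been accepted to the NeurIPS 2024 Datasets and Benchmarks Track.

Because arXiv accepts submissions without peer review, preprints have to be read with some caution, but a paper accepted as a poster at a well-known conference comes with a reasonable degree of assurance.

The framework used in the study, "RAGChecker", is also available as open source.

I tried RAGChecker myself, and this article describes what I did.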

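Key points of the paper

The main point of RAGChecker is that it evaluates the retrieval part (Retriever) and the generation part (Generator) of a RAG system separately.

In addition, it splits each response into claims and evaluates them one by one, which the authors say yields scores closer to human judgement.

Existing RAG evaluation frameworks

Existing RAG evaluation methods (RAGAS, for example) in most cases treat the whole response as a single unit and score the response as a whole.
With this approach, however, the evaluation tends to become noisy when the response is long.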

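Features of RAGChecker

RAGChecker therefore splits the response into claims.
Suppose we have the following text.

Cats are mammals with an average lifespan of 13-17 years. They are usually nocturnal and have excellent night vision. They also have flexible bodies and can land on their feet even when they fall from a height.

This is split into claims such as

1. Cats are mammals
2. The average lifespan of a cat is 13-17 years
3. Cats are nocturnal
4. Cats have excellent night vision
5. Cats have flexible bodies
6. Cats can land on their feet when they fall from a height

and each claim is evaluated individually. Note that the unit is the claim, not the sentence: the three sentences above yield six claims.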

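The main metrics are organized as follows.

| Module | Metric | Meaning |
| Retriever | Claim recall | Share of the ground-truth claims that are covered by the retrieved chunks |
| Retriever | Context precision | Share of the retrieved chunks that contain at least one relevant claim |
| Generator | Faithfulness | Share of the response claims that are supported by the retrieved chunks |
| Generator | Noise sensitivity | Share of the response claims that are incorrect yet supported by the retrieved chunks, i.e. how much wrong information the generator picks up (lower is better) |
| Generator | Hallucination | Share of the response claims that are incorrect and not found in any retrieved chunk (lower is better) |
| Overall | Precision | Share of the response claims that are correct |
| Overall | Recall | Share of the ground-truth claims that the response covers |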

According to the results in the paper, RAGChecker outperforms existing methods such as RAGAS on five out of six items and comes closest to human evaluation.

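Trying out RAGChecker

From here on, I walk through the steps I took to actually run RAGChecker.

I ran everything in the AWS SageMaker Studio Code Editor environment.
For setting up SageMaker Studio Code Editor, there is a Qiita article that is a very helpful reference.

1. Installation

RAGChecker itself can be installed from PyPI.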

pip install ragchecker
python -m spacy download en_core_web_sm

Next, clone the official public repository as well, since it contains the sample code.

git clone https://github.com/amazon-science/RAGChecker.git

2. Preparing the evaluation dataset

RAGChecker evaluates data supplied in the following JSON format.

{
   "results": [
      {
         "query_id": "000",
         "query": "This is the question for the first example",
         "gt_answer": "This is the ground truth answer for the first example",
         "response": "This is the RAG response for the first example",
         "retrieved_context": [
            {
               "doc_id": "xxx",
               "text": "Content of the first chunk"
            },
            {
               "doc_id": "xxx",
               "text": "Content of the second chunk"
            }
         ]
      },
      ...
   ]
}
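  • query: the question asked by the user
  • gt_answer: the ideal answer (ground truth prepared by a human)
  • response: the answer actually generated by the RAG system
  • retrieved_context: the list of text chunks returned by the retriever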
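
If your RAG pipeline already keeps its outputs in Python, the format above can be assembled with the standard library alone. The sketch below is only an illustration: the helper names `make_record` and `validate` are made up for this example, and treating all five fields as required is an assumption rather than something the RAGChecker documentation states.

```python
import json

# Fields that appear in every record of the format shown above.
REQUIRED_KEYS = {"query_id", "query", "gt_answer", "response", "retrieved_context"}


def make_record(query_id, query, gt_answer, response, chunks):
    """Build one entry of the "results" list from (doc_id, text) pairs."""
    return {
        "query_id": query_id,
        "query": query,
        "gt_answer": gt_answer,
        "response": response,
        "retrieved_context": [
            {"doc_id": doc_id, "text": text} for doc_id, text in chunks
        ],
    }


def validate(dataset):
    """Raise ValueError if a record or chunk lacks one of the expected fields."""
    for record in dataset["results"]:
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"{record.get('query_id')}: missing {sorted(missing)}")
        for chunk in record["retrieved_context"]:
            if not {"doc_id", "text"} <= chunk.keys():
                raise ValueError(f"{record['query_id']}: malformed chunk {chunk}")


dataset = {
    "results": [
        make_record(
            "000",
            "This is the question for the first example",
            "This is the ground truth answer for the first example",
            "This is the RAG response for the first example",
            [("xxx", "Content of the first chunk"),
             ("xxx", "Content of the second chunk")],
        )
    ]
}
validate(dataset)

# ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable in the file.
payload = json.dumps(dataset, ensure_ascii=False, indent=2)
print(payload)
```

Writing `payload` to a file then gives an input that has the same shape as the sample dataset used in the next step.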

First, I run the evaluation on the following sample dataset from the GitHub repository.

checking_inputs.json
{
  "results": [
    {
      "query_id": "0",
      "query": "What's the longest river in the world?",
      "gt_answer": "The Nile is a major north-flowing river in northeastern Africa. It flows into the Mediterranean Sea. The Nile is the longest river in Africa and has historically been considered the longest river in the world, though this has been contested by research suggesting that the Amazon River is slightly longer. Of the world's major rivers, the Nile is one of the smallest, as measured by annual flow in cubic metres of water. About 6,650 km (4,130 mi) long, its drainage basin covers eleven countries: the Democratic Republic of the Congo, Tanzania, Burundi, Rwanda, Uganda, Kenya, Ethiopia, Eritrea, South Sudan, Sudan, and Egypt.",
      "response": "The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa, flowing through countries such as Uganda, Sudan, and Egypt before emptying into the Mediterranean Sea. There is some debate about this title, as recent studies suggest the Amazon River could be longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers (4,350 miles).",
      "retrieved_context": [
        {
          "doc_id": "000",
          "text": "Scientists debate whether the Amazon or the Nile is the longest river in the world. Traditionally, the Nile is considered longer, but recent information suggests that the Amazon may be longer."
        },
        {
          "doc_id": "001",
          "text": "The Nile River was central to the Ancient Egyptians' rise to wealth and power. Since rainfall is almost non-existent in Egypt, the Nile River and its yearly floodwaters offered the people a fertile oasis for rich agriculture."
        },
        {
          "doc_id": "002",
          "text": "The world's longest rivers are defined as the longest natural streams whose water flows within a channel, or streambed, with defined banks."
        },
        {
          "doc_id": "003",
          "text": "The Amazon River could be considered longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers"
        }
      ]
    },
    {
      "query_id": "1",
      "query": "What does the democratic republic of Congo flag represent?",
      "gt_answer": "The national flag of the Democratic Republic of the Congo represents blue for peace, red for \"the blood of the country's martyrs\", yellow for the country's wealth ; and a star for a radiant future for the country.",
      "response": "The flag of the Democratic Republic of the Congo (DRC) features a sky blue field with a red diagonal stripe bordered by narrow yellow edges, and a yellow five-pointed star in the upper left corner. Each element on the flag carries specific symbolism: the blue represents peace, the red symbolizes the blood of the country's martyrs, the yellow denotes the nation's wealth, and the star stands for hope for a better future.",
      "retrieved_context": [
        {
          "doc_id": "000",
          "text": "The flag of the second Republic of Mobutu Sese Seko became the official banner after Mobutu established his dictatorship. This flag was used from 1966 to 1971 and consisted of the same yellow star, now made smaller, situated in the top corner of the hoist side, with a red, yellow-lined band running diagonally across the center. The red symbolized the people's blood; the yellow symbolized prosperity; the blue symbolized hope; and the star represented unity."
        },
        {
          "doc_id": "001",
          "text": "The current flag of the Democratic Republic of Congo, which has been adopted after the approval of a new constitution in 2006, is composed of a blue sheet, red diagonal stripe and a yellow five-pointed star at the top of the left part of the flag. Blue symbolizes peace, red stands for blood of martyrs, yellow color that frames the red stripe denotes prosperity and the star represents hope for a brighter future of the country."
        },
        {
          "doc_id": "002",
          "text": "The blue color in the flag symbolizes peace, the red should remind of the country’s martyrs, the yellow is for the country’s riches and the star represents the future."
        }
      ]
    }
  ]
}

3. Running the test code

I try the sample code using the files under RAGChecker/examples/.

Paste the code from the GitHub tutorial into examples/test.py.

from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# Load the JSON file
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# Set up the evaluator
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32
)

# Evaluate with all metrics
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

Run the following in the SageMaker Studio environment, from the directory that contains the cloned repository:

cd RAGChecker
python examples/test.py

With this command, RAGChecker evaluates the responses using the Llama 3.1 70B model on Bedrock.

To access a model on Amazon Bedrock, you first need to enable it by following the steps below.
https://docs.aws.amazon.com/ja_jp/bedrock/latest/userguide/getting-started.html
When using SageMaker Studio, you also need to grant the IAM role permission to invoke the model.
https://docs.aws.amazon.com/ja_jp/bedrock/latest/userguide/security_iam_id-based-policy-examples.html
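
As a rough idea of what that permission can look like, the sketch below prints a minimal identity-based policy that allows invoking the model used in test.py. The statement ID, the region wildcard, and the decision to scope the policy to a single model are assumptions made for illustration; check the linked documentation for the policy that fits your account.

```python
import json

# Model ID taken from test.py above (without the "bedrock/" prefix).
MODEL_ID = "meta.llama3-1-70b-instruct-v1:0"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInvokeEvaluatorModel",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            # Foundation-model ARNs carry no account ID; "*" leaves the region open.
            "Resource": f"arn:aws:bedrock:*::foundation-model/{MODEL_ID}",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON can be attached to the execution role of the SageMaker Studio space; if you later switch to another model, the `Resource` entry has to be widened accordingly.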

The run.sh script in the same examples directory is supposed to run the same kind of evaluation, but it did not work in my environment.

run.sh
python -m ragchecker.cli \
    --input_path=examples/checking_inputs.json \
    --output_path=examples/checking_outputs.json \
    --extractor_name=bedrock/meta.llama3-70b-instruct-v1:0 \
    --checker_name=bedrock/meta.llama3-70b-instruct-v1:0 \
    --batch_size_extractor=64 \
    --batch_size_checker=64 \
    --metrics all_metrics

4. Viewing the results

Running the script produces metrics like the following.

{
"overall_metrics": {
  "precision": 58.6,
  "recall": 67.5,
  "f1": 60.8
},
"retriever_metrics": {
  "claim_recall": 55.0,
  "context_precision": 100.0
},
"generator_metrics": {
  "context_utilization": 100.0,
  "noise_sensitivity_in_relevant": 31.3,
  "noise_sensitivity_in_irrelevant": 0.0,
  "hallucination": 10.1,
  "self_knowledge": 18.2,
  "faithfulness": 71.7
}
}

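Since I asked for all metrics, the output lists 11 metrics in total, grouped under overall_metrics, retriever_metrics, and generator_metrics.

Looking at the example output file (examples/checking_output.json and the like), I could also confirm how RAGChecker splits the answers into claims and labels each one as Entailment (correct), Neutral (cannot tell), or Contradiction (incorrect).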
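
To make the link between those labels and the scores concrete, the sketch below recomputes the per-sample metrics for the query_id "1" entry of checking_output.json from its label arrays. The formulas are my reading of the arrays, not the library's code: they reproduce the numbers stored in the file for this entry, but the actual implementation may treat edge cases differently. One naming detail to keep in mind is that `answer2response` holds one label per response claim, while `response2answer` holds one label per ground-truth claim.

```python
# Labels copied from the query_id "1" entry of the sample output file.
E, N, C = "Entailment", "Neutral", "Contradiction"

answer2response = [N, E, E, E, E]           # one label per response claim
response2answer = [E, E, C, E, N, E, E, E]  # one label per ground-truth claim
retrieved2response = [                      # rows: response claims, columns: chunks
    [N, E, N], [N, E, E], [C, E, E], [E, N, E], [C, E, E],
]
retrieved2answer = [                        # rows: ground-truth claims, columns: chunks
    [N, E, N], [N, E, E], [N, E, N], [C, E, E],
    [N, N, E], [E, N, E], [E, E, E], [N, E, E],
]


def ratio(flags):
    """Fraction of True values; 0.0 for an empty sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


correct = [label == E for label in answer2response]    # response claim is correct
covered = [label == E for label in response2answer]    # gt claim appears in the response
supported = [E in row for row in retrieved2response]   # response claim backed by some chunk
retrieved = [E in row for row in retrieved2answer]     # gt claim backed by some chunk
n_chunks = len(retrieved2answer[0])
# A chunk counts as relevant if it entails at least one ground-truth claim.
relevant = [any(row[j] == E for row in retrieved2answer) for j in range(n_chunks)]


def noise(in_relevant):
    """Incorrect response claims entailed by a (non-)relevant chunk, over all claims."""
    return ratio(
        not ok and any(row[j] == E and relevant[j] == in_relevant for j in range(n_chunks))
        for ok, row in zip(correct, retrieved2response)
    )


metrics = {
    "precision": ratio(correct),
    "recall": ratio(covered),
    "claim_recall": ratio(retrieved),
    "context_precision": ratio(relevant),
    "faithfulness": ratio(supported),
    "noise_sensitivity_in_relevant": noise(True),
    "noise_sensitivity_in_irrelevant": noise(False),
    "hallucination": ratio(not ok and not s for ok, s in zip(correct, supported)),
    "self_knowledge": ratio(ok and not s for ok, s in zip(correct, supported)),
    # Of the gt claims found in the chunks, how many made it into the response.
    "context_utilization": ratio(c for c, r in zip(covered, retrieved) if r),
}
p, r = metrics["precision"], metrics["recall"]
metrics["f1"] = 2 * p * r / (p + r) if p + r else 0.0
print(metrics)
```

The single incorrect response claim (the first one, describing the flag's layout) is still entailed by a relevant chunk, which is why it shows up under noise sensitivity rather than under hallucination.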

checking_output.json
{
  "results": [
    {
      "query_id": "0",
      "query": "What's the longest river in the world?",
      "gt_answer": "The Nile is a major north-flowing river in northeastern Africa. It flows into the Mediterranean Sea. The Nile is the longest river in Africa and has historically been considered the longest river in the world, though this has been contested by research suggesting that the Amazon River is slightly longer. Of the world's major rivers, the Nile is one of the smallest, as measured by annual flow in cubic metres of water. About 6,650 km (4,130 mi) long, its drainage basin covers eleven countries: the Democratic Republic of the Congo, Tanzania, Burundi, Rwanda, Uganda, Kenya, Ethiopia, Eritrea, South Sudan, Sudan, and Egypt.",
      "response": "The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa, flowing through countries such as Uganda, Sudan, and Egypt before emptying into the Mediterranean Sea. There is some debate about this title, as recent studies suggest the Amazon River could be longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers (4,350 miles).",
      "retrieved_context": [
        {
          "doc_id": "000",
          "text": "Scientists debate whether the Amazon or the Nile is the longest river in the world. Traditionally, the Nile is considered longer, but recent information suggests that the Amazon may be longer."
        },
        {
          "doc_id": "001",
          "text": "The Nile River was central to the Ancient Egyptians' rise to wealth and power. Since rainfall is almost non-existent in Egypt, the Nile River and its yearly floodwaters offered the people a fertile oasis for rich agriculture."
        },
        {
          "doc_id": "002",
          "text": "The world's longest rivers are defined as the longest natural streams whose water flows within a channel, or streambed, with defined banks."
        },
        {
          "doc_id": "003",
          "text": "The Amazon River could be considered longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers"
        }
      ],
      "response_claims": [
        [
          "Nile",
          "is",
          "longest river in the world"
        ],
        [
          "Nile",
          "stretches",
          "approximately 6,650 kilometers"
        ],
        [
          "Nile",
          "stretches",
          "4,130 miles"
        ],
        [
          "Nile",
          "flows through",
          "northeastern Africa"
        ],
        [
          "Nile",
          "flows through",
          "Uganda"
        ],
        [
          "Nile",
          "flows through",
          "Sudan"
        ],
        [
          "Nile",
          "flows through",
          "Egypt"
        ],
        [
          "Nile",
          "empties into",
          "Mediterranean Sea"
        ],
        [
          "Amazon River",
          "could be",
          "longer"
        ],
        [
          "Amazon River",
          "length",
          "about 7,000 kilometers"
        ],
        [
          "Amazon River",
          "length",
          "4,350 miles"
        ]
      ],
      "gt_answer_claims": [
        [
          "The Nile",
          "is",
          "a major north-flowing river"
        ],
        [
          "The Nile",
          "location",
          "northeastern Africa"
        ],
        [
          "The Nile",
          "flows into",
          "the Mediterranean Sea"
        ],
        [
          "The Nile",
          "is",
          "the longest river in Africa"
        ],
        [
          "The Nile",
          "has historically been considered",
          "the longest river in the world"
        ],
        [
          "The Nile",
          "is contested by",
          "research suggesting that the Amazon River is slightly longer"
        ],
        [
          "The Nile",
          "is one of",
          "the smallest of the world's major rivers"
        ],
        [
          "The Nile",
          "measured by",
          "annual flow in cubic metres of water"
        ],
        [
          "The Nile",
          "length",
          "6,650 km"
        ],
        [
          "The Nile",
          "length",
          "4,130 mi"
        ],
        [
          "The Nile's drainage basin",
          "covers",
          "eleven countries"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "the Democratic Republic of the Congo"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Tanzania"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Burundi"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Rwanda"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Uganda"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Kenya"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Ethiopia"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Eritrea"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "South Sudan"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Sudan"
        ],
        [
          "The Nile's drainage basin",
          "includes",
          "Egypt"
        ]
      ],
      "answer2response": [
        "Neutral",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Neutral",
        "Contradiction"
      ],
      "response2answer": [
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment",
        "Contradiction",
        "Neutral",
        "Entailment",
        "Entailment",
        "Neutral",
        "Neutral",
        "Neutral",
        "Neutral",
        "Neutral",
        "Entailment",
        "Neutral",
        "Neutral",
        "Neutral",
        "Neutral",
        "Entailment",
        "Entailment"
      ],
      "retrieved2response": [
        [
          "Neutral",
          "Entailment",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Entailment",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Entailment",
          "Neutral",
          "Neutral",
          "Entailment"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Entailment"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ]
      ],
      "retrieved2answer": [
        [
          "Neutral",
          "Entailment",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Entailment",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Entailment",
          "Entailment",
          "Neutral",
          "Neutral"
        ],
        [
          "Entailment",
          "Neutral",
          "Neutral",
          "Entailment"
        ],
        [
          "Contradiction",
          "Contradiction",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Neutral",
          "Neutral",
          "Neutral"
        ],
        [
          "Neutral",
          "Entailment",
          "Neutral",
          "Neutral"
        ]
      ],
      "metrics": {
        "precision": 0.7272727272727273,
        "recall": 0.5,
        "claim_recall": 0.22727272727272727,
        "context_precision": 0.75,
        "faithfulness": 0.36363636363636365,
        "noise_sensitivity_in_relevant": 0.18181818181818182,
        "noise_sensitivity_in_irrelevant": 0.0,
        "f1": 0.5925925925925926,
        "hallucination": 0.09090909090909091,
        "self_knowledge": 0.5454545454545454,
        "context_utilization": 1.0
      }
    },
    {
      "query_id": "1",
      "query": "What does the democratic republic of Congo flag represent?",
      "gt_answer": "The national flag of the Democratic Republic of the Congo represents blue for peace, red for \"the blood of the country's martyrs\", yellow for the country's wealth ; and a star for a radiant future for the country.",
      "response": "The flag of the Democratic Republic of the Congo (DRC) features a sky blue field with a red diagonal stripe bordered by narrow yellow edges, and a yellow five-pointed star in the upper left corner. Each element on the flag carries specific symbolism: the blue represents peace, the red symbolizes the blood of the country's martyrs, the yellow denotes the nation's wealth, and the star stands for hope for a better future.",
      "retrieved_context": [
        {
          "doc_id": "000",
          "text": "The flag of the second Republic of Mobutu Sese Seko became the official banner after Mobutu established his dictatorship. This flag was used from 1966 to 1971 and consisted of the same yellow star, now made smaller, situated in the top corner of the hoist side, with a red, yellow-lined band running diagonally across the center. The red symbolized the people's blood; the yellow symbolized prosperity; the blue symbolized hope; and the star represented unity."
        },
        {
          "doc_id": "001",
          "text": "The current flag of the Democratic Republic of Congo, which has been adopted after the approval of a new constitution in 2006, is composed of a blue sheet, red diagonal stripe and a yellow five-pointed star at the top of the left part of the flag. Blue symbolizes peace, red stands for blood of martyrs, yellow color that frames the red stripe denotes prosperity and the star represents hope for a brighter future of the country."
        },
        {
          "doc_id": "002",
          "text": "The blue color in the flag symbolizes peace, the red should remind of the country\u2019s martyrs, the yellow is for the country\u2019s riches and the star represents the future."
        }
      ],
      "response_claims": [
        [
          "The flag of the Democratic Republic of the Congo",
          "features",
          "a sky blue field with a red diagonal stripe bordered by narrow yellow edges, and a yellow five-pointed star in the upper left corner"
        ],
        [
          "The blue",
          "represents",
          "peace"
        ],
        [
          "The red",
          "symbolizes",
          "the blood of the country's martyrs"
        ],
        [
          "The yellow",
          "denotes",
          "the nation's wealth"
        ],
        [
          "The star",
          "stands for",
          "hope for a better future"
        ]
      ],
      "gt_answer_claims": [
        [
          "The national flag of the Democratic Republic of the Congo",
          "represents",
          "blue"
        ],
        [
          "Blue",
          "represents",
          "peace"
        ],
        [
          "The national flag of the Democratic Republic of the Congo",
          "represents",
          "red"
        ],
        [
          "Red",
          "represents",
          "the blood of the country's martyrs"
        ],
        [
          "The national flag of the Democratic Republic of the Congo",
          "represents",
          "yellow"
        ],
        [
          "Yellow",
          "represents",
          "the country's wealth"
        ],
        [
          "The national flag of the Democratic Republic of the Congo",
          "represents",
          "a star"
        ],
        [
          "A star",
          "represents",
          "a radiant future for the country"
        ]
      ],
      "answer2response": [
        "Neutral",
        "Entailment",
        "Entailment",
        "Entailment",
        "Entailment"
      ],
      "response2answer": [
        "Entailment",
        "Entailment",
        "Contradiction",
        "Entailment",
        "Neutral",
        "Entailment",
        "Entailment",
        "Entailment"
      ],
      "retrieved2response": [
        [
          "Neutral",
          "Entailment",
          "Neutral"
        ],
        [
          "Neutral",
          "Entailment",
          "Entailment"
        ],
        [
          "Contradiction",
          "Entailment",
          "Entailment"
        ],
        [
          "Entailment",
          "Neutral",
          "Entailment"
        ],
        [
          "Contradiction",
          "Entailment",
          "Entailment"
        ]
      ],
      "retrieved2answer": [
        [
          "Neutral",
          "Entailment",
          "Neutral"
        ],
        [
          "Neutral",
          "Entailment",
          "Entailment"
        ],
        [
          "Neutral",
          "Entailment",
          "Neutral"
        ],
        [
          "Contradiction",
          "Entailment",
          "Entailment"
        ],
        [
          "Neutral",
          "Neutral",
          "Entailment"
        ],
        [
          "Entailment",
          "Neutral",
          "Entailment"
        ],
        [
          "Entailment",
          "Entailment",
          "Entailment"
        ],
        [
          "Neutral",
          "Entailment",
          "Entailment"
        ]
      ],
      "metrics": {
        "precision": 0.8,
        "recall": 0.75,
        "claim_recall": 1.0,
        "context_precision": 1.0,
        "faithfulness": 1.0,
        "noise_sensitivity_in_relevant": 0.2,
        "noise_sensitivity_in_irrelevant": 0.0,
        "f1": 0.7741935483870969,
        "hallucination": 0.0,
        "self_knowledge": 0.0,
        "context_utilization": 0.75
      }
    }
  ],
  "metrics": {
    "overall": {
      "precision": 76.4,
      "recall": 62.5,
      "f1": 68.3
    },
    "retriever": {
      "claim_recall": 61.4,
      "context_precision": 87.5
    },
    "generator": {
      "context_utilization": 87.5,
      "noise_sensitivity_in_relevant": 19.1,
      "noise_sensitivity_in_irrelevant": 0.0,
      "hallucination": 4.5,
      "self_knowledge": 27.3,
      "faithfulness": 68.2
    }
  }
}
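
The top-level "metrics" block at the end of this file appears to be the plain average of the per-sample values, expressed as a percentage with one decimal. That is inferred from the numbers in the file, not from the library source. The sketch below checks it for a few of the metrics; `aggregate` is a name made up for this example.

```python
# Per-sample values copied from the "metrics" blocks of the two results above.
per_sample = [
    {"precision": 0.7272727272727273, "recall": 0.5, "f1": 0.5925925925925926,
     "claim_recall": 0.22727272727272727, "faithfulness": 0.36363636363636365},
    {"precision": 0.8, "recall": 0.75, "f1": 0.7741935483870969,
     "claim_recall": 1.0, "faithfulness": 1.0},
]


def aggregate(samples):
    """Mean of each metric over the samples, as a percentage with one decimal."""
    names = samples[0].keys()
    return {
        name: round(100 * sum(s[name] for s in samples) / len(samples), 1)
        for name in names
    }


overall = aggregate(per_sample)
print(overall)

# For comparison: F1 recomputed from the aggregated precision and recall.
p, r = overall["precision"], overall["recall"]
f1_from_aggregates = round(2 * p * r / (p + r), 1)
print(overall["f1"], f1_from_aggregates)
```

The two F1 values differ (68.3 versus 68.8), so the reported f1 is an average of per-sample F1 scores and should not be expected to equal the harmonic mean of the reported precision and recall.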

Testing with a Japanese dataset

I wanted to know whether Japanese data can be evaluated in the same way, so I prepared the following JSON and tried it.

{
  "results": [
    {
      "query_id": "001",
      "query": "新入社員向けの福利厚生制度について教えてください",
      "gt_answer": "当社の新入社員向け福利厚生制度には以下が含まれます:1)健康保険・厚生年金保険 2)住宅手当(月額20,000円) 3)通勤手当(全額支給) 4)社員食堂の利用補助 5)資格取得支援制度 6)年次有給休暇(初年度10日) 7)育児・介護休業制度。これらは入社時のオリエンテーションで詳しく説明されます。",
      "response": "当社では新入社員に対して、充実した福利厚生制度を用意しています。主な制度として、法定の健康保険と厚生年金保険に加入できます。また、住宅手当として月額20,000円、通勤手当は実費全額支給となっています。社員食堂では、一食あたり300円の補助があります。資格取得支援制度もあり、業務に関連する資格の取得費用を会社が負担します。休暇制度については、年次有給休暇が初年度10日付与され、育児・介護休業制度も整備されています。",
      "retrieved_context": [
        {
          "doc_id": "HR_HANDBOOK_2024",
          "text": "第3章 福利厚生\n3.1 社会保険\n当社では法令に基づき、健康保険および厚生年金保険に加入していただきます。保険料は労使折半となります。\n\n3.2 手当\n住宅手当:月額20,000円を支給します。\n通勤手当:公共交通機関の実費を全額支給します。"
        },
        {
          "doc_id": "HR_HANDBOOK_2024",
          "text": "3.3 その他の福利厚生\n・社員食堂:一食あたり300円の補助があります。\n・資格取得支援:業務に関連する資格取得費用を会社が負担します。\n\n3.4 休暇制度\n・年次有給休暇:初年度10日\n・育児休業:最長2年間\n・介護休業:対象家族1人につき通算93日まで"
        }
      ]
    },
    {
      "query_id": "002",
      "query": "来週の火曜日に予定されている新製品発表会の場所と時間を教えてください",
      "gt_answer": "新製品発表会は12月31日(火)14:00から、本社ビル3階大会議室で開催されます。参加者は13:45までに会場に集合してください。",
      "response": "新製品発表会は本社ビル3階大会議室で14:00から開催されます。13:45までに会場にお集まりください。プレゼンテーション資料は当日配布いたします。",
      "retrieved_context": [
        {
          "doc_id": "COMPANY_CALENDAR_2024",
          "text": "12月31日(火)\n14:00-16:00 新製品発表会\n場所:本社ビル3階大会議室\n集合時間:13:45\n※プレゼンテーション資料は当日配布"
        },
        {
          "doc_id": "OFFICE_MAP_2024",
          "text": "本社ビル3階見取り図\n・大会議室(収容人数:100名)\n・小会議室A(収容人数:20名)\n・小会議室B(収容人数:20名)\n・休憩スペース"
        }
      ]
    }
  ]
}

This produced the following results.

{
"overall_metrics": {
  "precision": 80.8,
  "recall": 95.5,
  "f1": 87.2
},
"retriever_metrics": {
  "claim_recall": 85.5,
  "context_precision": 100.0
},
"generator_metrics": {
  "context_utilization": 100.0,
  "noise_sensitivity_in_relevant": 19.2,
  "noise_sensitivity_in_irrelevant": 0.0,
  "hallucination": 0.0,
  "self_knowledge": 0.0,
  "faithfulness": 100.0
}
}

This shows that the evaluation also works for Japanese.

Experimenting with a different model

Internally, RAGChecker calls Amazon Bedrock models through the open-source library LiteLLM.

This makes it easy to change which LLM is used.

For example, to call Claude 3.5 Sonnet v2, change the model ID as follows.

from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

with open("examples/checking_inputs_japanese.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

evaluator = RAGChecker(
    extractor_name="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",  # changed
    checker_name="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",  # changed
    batch_size_extractor=32,
    batch_size_checker=32
)

evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

extractor_name sets the LLM that splits text into claims, and
checker_name sets the LLM that performs the evaluation.

After switching the model this way, I got the following results.

{
"overall_metrics": {
  "precision": 77.5,
  "recall": 83.0,
  "f1": 80.1
},
"retriever_metrics": {
  "claim_recall": 95.5,
  "context_precision": 75.0
},
"generator_metrics": {
  "context_utilization": 87.5,
  "noise_sensitivity_in_relevant": 22.5,
  "noise_sensitivity_in_irrelevant": 0.0,
  "hallucination": 0.0,
  "self_knowledge": 0.0,
  "faithfulness": 100.0
}
}

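Compared with the earlier run using Llama 3.1 70B, context_precision is 100 with Llama but 75.0 with Claude.

Indeed, the second question in the Japanese dataset focuses on the venue and time of the new product launch, so a floor-plan context such as OFFICE_MAP_2024 has little relevance to it.

This suggests that, at least on this small dataset, Claude evaluates more accurately.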
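
When comparing evaluator models like this, it helps to list every metric that moved rather than picking one by eye. The sketch below does that for the two result sets reported above. The helper name `flatten` is made up for this example, and with only two questions and a single run per model the differences should be read as indicative only.

```python
# Results reported above for the Japanese dataset with each evaluator model.
llama = {
    "overall_metrics": {"precision": 80.8, "recall": 95.5, "f1": 87.2},
    "retriever_metrics": {"claim_recall": 85.5, "context_precision": 100.0},
    "generator_metrics": {
        "context_utilization": 100.0, "noise_sensitivity_in_relevant": 19.2,
        "noise_sensitivity_in_irrelevant": 0.0, "hallucination": 0.0,
        "self_knowledge": 0.0, "faithfulness": 100.0,
    },
}
claude = {
    "overall_metrics": {"precision": 77.5, "recall": 83.0, "f1": 80.1},
    "retriever_metrics": {"claim_recall": 95.5, "context_precision": 75.0},
    "generator_metrics": {
        "context_utilization": 87.5, "noise_sensitivity_in_relevant": 22.5,
        "noise_sensitivity_in_irrelevant": 0.0, "hallucination": 0.0,
        "self_knowledge": 0.0, "faithfulness": 100.0,
    },
}


def flatten(result):
    """Merge the three metric groups into one {metric: value} dict."""
    return {name: value for group in result.values() for name, value in group.items()}


a, b = flatten(llama), flatten(claude)

# Difference (Claude minus Llama) for every metric that changed, largest first.
changed = sorted(
    ((name, round(b[name] - a[name], 1)) for name in a if b[name] != a[name]),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, delta in changed:
    print(f"{name:32s} {delta:+.1f}")
```

The largest shift is the 25-point drop in context_precision discussed above, but recall, context_utilization, and claim_recall also move by 10 points or more, so the choice of evaluator model affects more than one score.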

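Summary

What is RAGChecker?

  • A new method that evaluates RAG at a finer granularity than existing frameworks by scoring responses claim by claim
  • It evaluates the retriever and the generator of a RAG system separately, and at the claim level, so its scores come closer to human judgement

What I found when trying it

  • It is convenient that a single LLM is enough to complete the evaluation (unlike RAGAS, there is no need to prepare an additional embedding model)
  • It also works for Japanese (depending on the model, other languages should work as well)

RAGChecker also integrates with LlamaIndex.

If you are dissatisfied with current RAG evaluation frameworks, or want to monitor the accuracy of your RAG system more precisely, do give RAGChecker a try.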
