åˆã‚ã«

ä»¥ä¸‹ã§BM25ã‚ˆã‚Šã‚‚ç²¾åº¦ãŒã„ã„BM42ãŒç™ºè¡¨ã•ã‚ŒãŸã¨ã‚ã‚‹ã®ã§ã€å®Ÿéš›ã«è§¦ã£ã¦ã¿ã¾ã™

www.atpartners.co.jp

ä»¥ä¸‹ã®è¨˜äº‹ã§ã€éŽåŽ»ã«BM25ã‚’å‹•ã‹ã—ã¦ã„ã¾ã™ã€‚

ayousanz.hatenadiary.jp

ä»¥ä¸‹ã§ä»Šå›žã®è¨˜äº‹ã®ãƒªãƒã‚¸ãƒˆãƒªã‚’å…¬é–‹ã—ã¦ã„ã¾ã™

github.com

é–‹ç™ºç’°å¢ƒ

Windows 11
Python 3.11

ãƒ©ã‚¤ãƒ–ãƒ©ãƒªã®ã‚¤ãƒ³ã‚¹ãƒˆãƒ¼ãƒ«

pip install numpy
pip install transformers
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

BM42ã®indexãŠã‚ˆã³æ¤œç´¢

ä»¥ä¸‹ã§ãƒ‰ã‚ãƒ¥ãƒ¡ãƒ³ãƒˆã®indexã¨æ¤œç´¢ã‚’ã—ã¦ã¿ã¾ã™

from typing import List, Dict
import math
from transformers import AutoTokenizer, AutoModel
import torch

# ã‚µãƒ³ãƒ—ãƒ«ãƒ‡ãƒ¼ã‚¿ã‚»ãƒƒãƒˆ
documents = [
    "Hello world is a common phrase in programming",
    "Python is a popular programming language",
    "Vector databases are useful for similarity search",
    "Machine learning models can be complex",
]

def compute_idf(term: str, documents: List[str]) -> float:
    doc_freq = sum(1 for doc in documents if term in doc.lower())
    return math.log((len(documents) - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

def get_bm42_weights(text: str, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    attentions = outputs.attentions[-1][0, :, 0].mean(dim=0)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    word_weights = {}
    current_word = ""
    current_weight = 0

    for token, weight in zip(tokens[1:-1], attentions[1:-1]):  # Exclude [CLS] and [SEP]
        if token.startswith("##"):
            current_word += token[2:]
            current_weight += weight
        else:
            if current_word:
                word_weights[current_word] = current_weight
            current_word = token
            current_weight = weight

    if current_word:
        word_weights[current_word] = current_weight

    return word_weights

# ãƒ¢ãƒ‡ãƒ«ã¨ãƒˆãƒ¼ã‚¯ãƒŠã‚¤ã‚¶ãƒ¼ã®åˆæœŸåŒ–
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def compute_bm42_score(query: str, document: str, documents: List[str]) -> float:
    query_weights = get_bm42_weights(query, model, tokenizer)
    doc_weights = get_bm42_weights(document, model, tokenizer)

    score = 0
    for term, query_weight in query_weights.items():
        if term in doc_weights:
            idf = compute_idf(term, documents)
            score += query_weight * doc_weights[term] * idf

    return score

def search_bm42(query: str, documents: List[str]) -> List[Dict[str, float]]:
    scores = []
    for doc in documents:
        score = compute_bm42_score(query, doc, documents)
        scores.append({"document": doc, "score": score})

    return sorted(scores, key=lambda x: x["score"], reverse=True)

# ä½¿ç”¨ä¾‹
query = "programming language"

print("BM42 Results:")
for result in search_bm42(query, documents):
    print(f"Score: {result['score']:.4f} - {result['document']}")

çµæžœã¯ä»¥ä¸‹ã®ã‚ˆã†ã«ãªã‚Šã¾ã™

BM25 Results:
Score: 1.9970 - Python is a popular programming language
Score: 0.6398 - Hello world is a common phrase in programming
Score: 0.0000 - Vector databases are useful for similarity search
Score: 0.0000 - Machine learning models can be complex

BM42 Results:
BertSdpaSelfAttention is used but `torch.nn.functional.scaled_dot_product_attention` does not support non-absolute `position_embedding_type` or `output_attentions=True` or `head_mask`. Falling back to the manual attention implementation, but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.
Score: 0.0117 - Python is a popular programming language
Score: 0.0069 - Hello world is a common phrase in programming
Score: 0.0000 - Vector databases are useful for similarity search
Score: 0.0000 - Machine learning models can be complex

yousanã®ãƒ¡ãƒ¢

ãƒã‚¤ãƒ–ãƒªãƒƒãƒ‰æ¤œç´¢ã‚¢ãƒ—ãƒãƒ¼ãƒã€ŒBM42ã€ã‚’å‹•ã‹ã—ã¦ã¿ã‚‹

åˆã‚ã«

é–‹ç™ºç’°å¢ƒ

ãƒ©ã‚¤ãƒ–ãƒ©ãƒªã®ã‚¤ãƒ³ã‚¹ãƒˆãƒ¼ãƒ«

BM42ã®indexãŠã‚ˆã³æ¤œç´¢

åˆã‚ã«

é–‹ç™ºç’°å¢ƒ

ãƒ©ã‚¤ãƒ–ãƒ©ãƒªã®ã‚¤ãƒ³ã‚¹ãƒˆãƒ¼ãƒ«

BM42ã®indexãŠã‚ˆã³æ¤œç´¢

åˆã‚ã«

ãƒ©ã‚¤ãƒ–ãƒ©ãƒªã®ã‚¤ãƒ³ã‚¹ãƒˆãƒ¼ãƒ«

BM42ã®indexãŠã‚ˆã³æ¤œç´¢