Overview
I'm Nakamura (po3rin), a software engineer on the AI/Machine Learning team in M3's engineering group. I like search and Go.
At M3 we took note of ChatGPT's potential early on and are exploring how to use it, but we still have concerns about feeding it real data at scale, so we are working through those with our security team.
Against that backdrop, the "ChatGPT Retrieval Plugin" appeared: a ChatGPT plugin that enables semantic search and retrieval over personal or organizational documents.
As an information-retrieval enthusiast I couldn't sit still, so I tried it locally using the publicly available member-introduction document for M3's AI/Machine Learning team.
```
# The prepared document
Hiromu Nakamura lives in Tokyo and works at a company called M3. He is mainly in charge of M3's search infrastructure, and also develops things like a book recommendation system.
Toshiaki Nomi worked in his previous job on models applying natural language processing to protein analysis. At M3 he is in charge of building recommendation algorithms/systems and developing REST APIs.
Junpei Ukita is a medical school graduate (MD) with a PhD in medicine. For about eight years since his undergraduate days, he analyzed biological data using machine learning (especially deep learning) and researched deep learning. In graduate school, besides his own research, he also set up compute servers and supervised undergraduates' research.
```
The question:
```python
# (omitted: the code is introduced later)...
response = index.query("Give me one engineer on the M3 AI team who graduated from medical school, in the format 'Name: Characteristics'")
print(response)
```
The result:
```
Name: Junpei Ukita
Characteristics: A medical school graduate (MD) with a PhD in medicine, who has spent about eight years analyzing biological data with machine learning (especially deep learning) and researching deep learning. In graduate school, besides his own research, he also set up compute servers and supervised undergraduate research. Furthermore, at M3 he is mainly in charge of the search infrastructure, and the book recommendation system...
```
For some reason it sneakily mixed my profile into the end of his description (laughs), but we got a roughly correct answer. This confirmed that ChatGPT's information sources can indeed be extended.
Just running it would be boring, though. So this time, instead of using a vector search engine the ChatGPT Retrieval Plugin already supports, I implemented a Provider for OpenSearch, which is available on AWS, and tried providing OpenSearch's vector search to ChatGPT.
By reading this post, you will gain an understanding of how the ChatGPT Retrieval Plugin works, along with the know-how to adapt the familiar vector search engine you use every day to the ChatGPT Retrieval Plugin.
- Overview
- What is the ChatGPT Retrieval Plugin
- How ChatGPT talks to plugins
- Trying the ChatGPT Retrieval Plugin with an already-supported vector search engine
- Implementing a Provider so any vector search engine can be used
- Preparing the DataStore
- Implementing delete
- An Elasticsearch Provider can't be built
- Summary
- We're hiring!
What is the ChatGPT Retrieval Plugin
It is a ChatGPT plugin that enables semantic search and retrieval over personal or organizational documents. It fetches the document snippets most relevant to a query from your sources and feeds them to ChatGPT. With it, you can build plugins over internal company documents, personal TODO lists, and so on. Plugins can also require authentication, so you could even build something like a medical-consultation plugin restricted to a service's paying users.
Internally it uses a vector search engine; drawn as a diagram, the structure looks like the following.
ChatGPT sends queries to the API you stand up as the ChatGPT Retrieval Plugin and uses the returned snippets in its answers to the user. Internally, the text-embedding-ada-002 embeddings model is used to obtain vectors for queries and documents, which are then indexed into and searched in the vector search engine. If your vector search engine is one the ChatGPT Retrieval Plugin already supports, you can start using it right away.
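The flow just described can be sketched end to end with a toy in-memory index. The `embed` function below is a deterministic, hypothetical stand-in for the text-embedding-ada-002 API call (real embeddings are 1536-dimensional and capture meaning); the names and the brute-force cosine ranking are illustrative, not the plugin's actual code.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Hash-based stand-in for a real embedding API, so the sketch runs offline.
    digest = hashlib.sha256(text.encode()).digest()
    vec = [digest[i % len(digest)] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

index: dict[str, list[float]] = {}

def upsert(doc_id: str, text: str) -> None:
    index[doc_id] = embed(text)  # the real plugin embeds chunks, not whole docs

def query(text: str, top_k: int = 2) -> list[str]:
    qv = embed(text)
    return sorted(index, key=lambda d: cosine(qv, index[d]), reverse=True)[:top_k]

upsert("profile-1", "Hiromu Nakamura works on search infrastructure")
upsert("profile-2", "Junpei Ukita has an MD and a PhD")
print(query("Junpei Ukita has an MD and a PhD", top_k=1))  # -> ['profile-2']
```

With real embeddings, a paraphrased query would still rank the right profile first; with this hash stand-in, only an exact match is guaranteed to score 1.0.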
How ChatGPT talks to plugins
Before touching the ChatGPT Retrieval Plugin, let's review the mechanism by which ChatGPT uses plugins in general. Understanding this makes the later explanation of the ChatGPT Retrieval Plugin easier to follow, and also lets you build your own plugins beyond the ChatGPT Retrieval Plugin.
A plugin must be stood up as an API, and the API must host the plugin's manifest at /.well-known/ai-plugin.json. Below is an example manifest taken from the documentation.
```json
{
  "schema_version": "v1",
  "name_for_human": "TODO Plugin",
  "name_for_model": "todo",
  "description_for_human": "Plugin for managing a TODO list. You can add, remove and view your TODOs.",
  "description_for_model": "Plugin for managing a TODO list. You can add, remove and view your TODOs.",
  "auth": {
    "type": "none"
  },
  "api": {
    "type": "openapi",
    "url": "http://localhost:3333/openapi.yaml",
    "is_user_authenticated": false
  },
  "logo_url": "https://vsq7s0-5001.preview.csb.app/logo.png",
  "contact_email": "[email protected]",
  "legal_info_url": "http://www.example.com/legal"
}
```
It contains a description of the API for the model, a link to the API's documentation (http://localhost:3333/openapi.yaml), and so on. The API documentation must be written in OpenAPI. ChatGPT reads these settings and learns how to use the API.
In other words, as long as you build an API that returns this manifest, you can implement any plugin right away.
When you run the ChatGPT Retrieval Plugin, an API is started with FastAPI, a Python web framework, and you only need to edit the manifest file to have everything, OpenAPI documentation included, ready to use.
For how to build plugins, see the documentation below.
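As a concrete illustration of "an API that returns the manifest", here is a minimal, hypothetical sketch using only Python's standard library (the Retrieval Plugin itself uses FastAPI; the manifest values reuse the TODO example above):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

MANIFEST = {
    "schema_version": "v1",
    "name_for_human": "TODO Plugin",
    "name_for_model": "todo",
    "description_for_human": "Plugin for managing a TODO list.",
    "description_for_model": "Plugin for managing a TODO list.",
    "auth": {"type": "none"},
    "api": {
        "type": "openapi",
        "url": "http://localhost:3333/openapi.yaml",
        "is_user_authenticated": False,
    },
}

class PluginHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the manifest at the well-known path ChatGPT fetches.
        if self.path == "/.well-known/ai-plugin.json":
            body = json.dumps(MANIFEST).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

def serve(port: int = 0) -> HTTPServer:
    # port=0 picks a free ephemeral port; check server.server_port.
    server = HTTPServer(("127.0.0.1", port), PluginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A real plugin would serve the OpenAPI document alongside this, but the manifest endpoint is the part ChatGPT discovers first.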
Trying the ChatGPT Retrieval Plugin with an already-supported vector search engine
This isn't the main topic of this post, so I'll go through it quickly.
When touching the ChatGPT Retrieval Plugin for the first time, the easiest way is to try it locally with LlamaIndex. LlamaIndex is a mechanism for feeding your own data into OpenAI's LLMs, and it can run locally.
npaka's blog covers how to run it in more detail, so please see that.
This time we'll use Qdrant, a vector search engine written in Rust.
Qdrant provides a container image, so we'll use that.
```shell
$ docker pull qdrant/qdrant
$ docker run -p 6333:6333 qdrant/qdrant
```
Next, set the plugin's environment variables. Qdrant's connection settings have defaults pointing at localhost, so we don't need to configure them this time.
```shell
DATASTORE=qdrant
BEARER_TOKEN=XXXXXXXXX
OPENAI_API_KEY=XXXXXXXXX
```
BEARER_TOKEN can be obtained at https://jwt.io/, and OPENAI_API_KEY can be created at the page below.
Clone the plugin locally, load the environment variables, and start the API server.
```shell
$ git clone https://github.com/openai/chatgpt-retrieval-plugin.git
$ cd chatgpt-retrieval-plugin
$ poetry install
$ poetry run start
```
The API to be called as a plugin is now up. You can check its OpenAPI spec at http://0.0.0.0:8000/docs.
By calling this API, ChatGPT retrieves and uses snippets relevant to a query.
Prepare a document to index into the vector search engine via this API. Write the document's contents into data/sample.txt. For my test, I used the text of the AI/Machine Learning team member-introduction slides we made recently. Prepare whatever document you'd like to try yourself.
For this local test we'll use LlamaIndex.
```python
import os

import openai
from llama_index import SimpleDirectoryReader
from llama_index.indices.vector_store import ChatGPTRetrievalPluginIndex

documents = SimpleDirectoryReader("data").load_data()
openai.api_key = os.getenv("OPENAI_API_KEY")

index = ChatGPTRetrievalPluginIndex(
    documents,
    endpoint_url="http://localhost:8000",
    bearer_token=os.getenv("BEARER_TOKEN"),
)

response = index.query("What kind of person is Hiromu Nakamura, who works at M3?")
print(response)
```
The result:
```
Hiromu Nakamura is a 32-year-old member of M3's AI/Machine Learning team who lives in Tokyo. He is mainly in charge of M3's search infrastructure. He also develops things like a book recommendation system. His hobbies are mahjong, sauna, and weight training.
```
A result came back, built from snippets retrieved from the document.
Implementing a Provider so any vector search engine can be used
Now for the main topic. We'll look inside the ChatGPT Retrieval Plugin and attempt a Provider implementation for OpenSearch, a vector search engine it does not yet support.
OpenSearch supports vector search, and you can choose among several vector search libraries: NMSLIB, Faiss, and Lucene.
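Which library backs a given field is selected in the `method` block of a `knn_vector` mapping. The helper below is an illustrative sketch based on my reading of the OpenSearch k-NN documentation ("hnsw" with "l2" should be valid for all three engines); omitting `method` entirely falls back to the defaults, which is what the provider later in this post relies on.

```python
def knn_vector_mapping(dimension: int, engine: str = "nmslib") -> dict:
    # engine picks the ANN library behind the field; "hnsw" graphs with
    # "l2" distance are available on nmslib, faiss, and lucene alike.
    assert engine in {"nmslib", "faiss", "lucene"}
    return {
        "type": "knn_vector",
        "dimension": dimension,
        "method": {"name": "hnsw", "space_type": "l2", "engine": engine},
    }
```

For example, `knn_vector_mapping(1536, engine="faiss")` produces a field mapping that routes that field's ANN search through Faiss.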
We'll implement it in the following steps.
- Find the code that needs implementing
- Prepare the DataStore
- Implement _upsert
- Implement _query
- Implement delete
- Verify the Provider works
Find the code that needs implementing
To see how to implement a Provider, let's look at the implementations for the currently supported vector search engines. Provider implementations can be found in datastore/providers. For example, the Provider for Qdrant, the Rust-built vector search engine, is in datastore/providers/qdrant_datastore.py.
```python
class QdrantDataStore(DataStore):
    # ...
```
Here we can see that it inherits from the abstract class DataStore. The DataStore implementation is in datastore/datastore.py.
```python
class DataStore(ABC):
    # ...
```
Looking at this class, we can see that we only need to implement the _upsert, _query, and delete methods. Which DataStore gets used is decided in datastore/factory.py.
```python
async def get_datastore() -> DataStore:
    datastore = os.environ.get("DATASTORE")
    assert datastore is not None
    match datastore:
        case "pinecone":
            from datastore.providers.pinecone_datastore import PineconeDataStore

            return PineconeDataStore()
        # (omitted)...
        case "qdrant":
            from datastore.providers.qdrant_datastore import QdrantDataStore

            return QdrantDataStore()
        case _:
            raise ValueError(f"Unsupported vector database: {datastore}")
```
In other words, when we want to use an arbitrary vector search engine, all we have to do is implement a concrete DataStore and add a case for it to the get_datastore function. It looks quite simple.
Preparing the DataStore
Let's get right to it and add datastore/providers/opensearch_datastore.py. At a glance, datastore/providers/pinecone_datastore.py looked closest to the processing OpenSearch needs, so we'll use it as a reference. First, we import the required modules.
```python
import asyncio
import json
import os
from typing import Dict, List, Optional

from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk
from tenacity import retry, stop_after_attempt, wait_random_exponential

from datastore.datastore import DataStore
from models.models import (
    DocumentChunk,
    DocumentChunkMetadata,
    DocumentChunkWithScore,
    DocumentMetadataFilter,
    QueryResult,
    QueryWithEmbedding,
    Source,
)
```
Next, we initialize the OpenSearch client and configure things like the vector dimension.
```python
OPENSEARCH_INDEX = os.environ.get("OPENSEARCH_INDEX")
OPENSEARCH_URL = os.environ.get("OPENSEARCH_URL") or "http://localhost:9200"
OPENSEARCH_USER = os.environ.get("OPENSEARCH_USER")
OPENSEARCH_PASSWORD = os.environ.get("OPENSEARCH_PASSWORD")
assert OPENSEARCH_INDEX is not None
assert OPENSEARCH_URL is not None

es = OpenSearch(
    hosts=[OPENSEARCH_URL],
    basic_auth=f"{OPENSEARCH_USER}:{OPENSEARCH_PASSWORD}",
)

UPSERT_BATCH_SIZE = 100
OUTPUT_DIM = 1536
```
Next, we prepare a mapping to store the data passed to the Provider. To do vector search in OpenSearch, you need to configure the index settings and set the field type to knn_vector.
```python
mapping = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "chunk_id": {"type": "keyword"},
            "document_id": {"type": "keyword"},
            "text": {"type": "text"},
            "text_vector": {
                "type": "knn_vector",
                "dimension": OUTPUT_DIM,
            },
            "source": {"type": "keyword"},
            "source_id": {"type": "keyword"},
            "url": {"type": "text"},
            "created_at": {"type": "date"},
            "author": {"type": "text"},
        }
    },
}
```
For more detailed settings, see the documentation below. This time we're using the default, NMSLIB.
Now we implement OpenSearchDataStore, which inherits from the DataStore class. The other Providers in the ChatGPT Retrieval Plugin create their index at initialization time, so following that convention, OpenSearchDataStore also creates the index (or checks that it exists) when it is initialized.
```python
class OpenSearchDataStore(DataStore):
    def __init__(self):
        # Check if the index name is specified and whether the index already exists
        index_exists = es.indices.exists(index=OPENSEARCH_INDEX)
        if OPENSEARCH_INDEX and not index_exists:
            # Get all fields in the metadata object in a list
            fields_to_index = list(DocumentChunkMetadata.__fields__.keys())

            # Create a new index with the specified name, dimension, and metadata configuration
            try:
                print(
                    f"Creating index {OPENSEARCH_INDEX} with metadata config {fields_to_index}"
                )
                es.indices.create(index=OPENSEARCH_INDEX, body=mapping)
                print(f"Index {OPENSEARCH_INDEX} created successfully")
            except Exception as e:
                print(f"Error creating index {OPENSEARCH_INDEX}: {e}")
                raise e
        elif OPENSEARCH_INDEX and index_exists:
            # Connect to an existing index with the specified name
            try:
                print(f"Connected to index {OPENSEARCH_INDEX} successfully")
            except Exception as e:
                print(f"Error connecting to index {OPENSEARCH_INDEX}: {e}")
                raise e
```
Implementing _upsert
Next up is _upsert. Before the _upsert method is called, the documents have already been split into chunks and embeddings have been generated for them, so we bulk-insert these into OpenSearch. For the finer points of bulk insert configuration, see the documentation.
```python
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
    """
    Takes in a dict from document id to list of document chunks and inserts them into the index.
    Return a list of document ids.
    """
    # Initialize a list of ids to return
    doc_ids: List[str] = []
    # Initialize a list of vectors to upsert
    index_actions = []
    # Loop through the dict items
    for doc_id, chunk_list in chunks.items():
        # Append the id to the ids list
        doc_ids.append(doc_id)
        print(f"Upserting document_id: {doc_id}")
        for chunk in chunk_list:
            print(f"chunk: {chunk.id}")
            print(chunk.text)
            index_action = {
                "_id": f"{doc_id}-{chunk.id}",
                "_op_type": "update",
                "doc_as_upsert": True,
                "doc": {
                    "chunk_id": chunk.id,
                    "text": chunk.text,
                    "text_vector": chunk.embedding,
                    "source": chunk.metadata.source,
                    "source_id": chunk.metadata.source_id,
                    "url": chunk.metadata.url,
                    "created_at": chunk.metadata.created_at,
                    "author": chunk.metadata.author,
                },
            }
            index_actions.append(index_action)

    try:
        bulk(es, index_actions, index=OPENSEARCH_INDEX, raise_on_error=False)
    except Exception as e:
        print(f"Error upserting batch: {e}")
        raise e

    return doc_ids
```
Implementing _query
Next, we implement the _query method. It packs the vector search results into QueryResult objects and returns them.
```python
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
async def _query(
    self,
    queries: List[QueryWithEmbedding],
) -> List[QueryResult]:
    """
    Takes in a list of queries with embeddings and filters and returns a list of query results with matching document chunks and scores.
    """

    # Define a helper coroutine that performs a single query and returns a QueryResult
    async def _single_query(query: QueryWithEmbedding) -> QueryResult:
        print(f"Query: {query.query}")
        q = {
            "query": {
                "knn": {
                    "text_vector": {
                        "vector": query.embedding,
                        "k": query.top_k,
                    }
                }
            }
        }
        try:
            # Query the index with the query embedding, filter, and top_k
            query_response = es.search(index=OPENSEARCH_INDEX, body=json.dumps(q))
        except Exception as e:
            print(f"Error querying index: {e}")
            raise e

        query_results: List[DocumentChunkWithScore] = []
        for result in query_response["hits"]["hits"]:
            score = result["_score"]
            metadata = result["_source"]
            # Remove document id and text from metadata and store it in a new variable
            metadata_without_text = (
                {key: value for key, value in metadata.items() if key != "text"}
                if metadata
                else None
            )

            # If the source is not a valid Source in the Source enum, set it to None
            if (
                metadata_without_text
                and "source" in metadata_without_text
                and metadata_without_text["source"] not in Source.__members__
            ):
                metadata_without_text["source"] = None

            # Create a document chunk with score object with the result data
            result = DocumentChunkWithScore(
                id=result["_id"],
                score=score,
                text=metadata["text"] if metadata and "text" in metadata else None,
                metadata=metadata_without_text,
            )
            query_results.append(result)
        return QueryResult(query=query.query, results=query_results)

    # Use asyncio.gather to run multiple _single_query coroutines concurrently and collect their results
    results: List[QueryResult] = await asyncio.gather(
        *[_single_query(query) for query in queries]
    )
    return results
```
Implementing delete
Finally, we implement delete.
```python
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
async def delete(
    self,
    ids: Optional[List[str]] = None,
    filter: Optional[DocumentMetadataFilter] = None,
    delete_all: Optional[bool] = None,
) -> bool:
    """
    Removes vectors by ids, filter, or everything from the index.
    """
    # Delete all vectors from the index if delete_all is True
    if delete_all == True:
        try:
            print(f"Deleting all vectors from index")
            es.delete_by_query(
                index=OPENSEARCH_INDEX, body={"query": {"match_all": {}}}
            )
            print(f"Deleted all vectors successfully")
            return True
        except Exception as e:
            print(f"Error deleting all vectors: {e}")
            raise e

    # Delete vectors that match the document ids from the index if the ids list is not empty
    if ids != None and len(ids) > 0:
        try:
            print(f"Deleting vectors with ids {ids}")
            for document_id in ids:
                es.delete(index=OPENSEARCH_INDEX, id=document_id)  # type: ignore
            print(f"Deleted vectors with ids successfully")
        except Exception as e:
            print(f"Error deleting vectors with ids: {e}")
            raise e

    return True
```
As the code below shows, delete is called from the DataStore class's upsert, so it is also needed at upsert time. For vector search engines that support upsert natively, this is unnecessary work, so I think it's a spot with room for future improvement (a PR chance?).
```python
class DataStore(ABC):
    async def upsert(
        self, documents: List[Document], chunk_token_size: Optional[int] = None
    ) -> List[str]:
        """
        Takes in a list of documents and inserts them into the database.
        First deletes all the existing vectors with the document id (if necessary, depends on the vector db), then inserts the new ones.
        Return a list of document ids.
        """
        # Delete any existing vectors for documents with the input document ids
        await asyncio.gather(
            *[
                self.delete(
                    filter=DocumentMetadataFilter(
                        document_id=document.id,
                    ),
                    delete_all=False,
                )
                for document in documents
                if document.id
            ]
        )

        chunks = get_document_chunks(documents, chunk_token_size)

        return await self._upsert(chunks)
```
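To see why upsert deletes a document's existing vectors first, consider re-indexing a document that now has fewer chunks: without the delete, chunk ids from the old version would linger in the index. A standalone sketch (a plain dict stands in for the real index; the `doc_id-chunk_index` key scheme is illustrative):

```python
def naive_upsert(store: dict, doc_id: str, chunks: list[str]) -> None:
    # Insert chunks keyed by "docid-chunkindex" without any cleanup.
    for i, chunk in enumerate(chunks):
        store[f"{doc_id}-{i}"] = chunk

def safe_upsert(store: dict, doc_id: str, chunks: list[str]) -> None:
    # The delete-by-document-id step, then insert: no stale chunks remain.
    for key in [k for k in store if k.startswith(f"{doc_id}-")]:
        del store[key]
    naive_upsert(store, doc_id, chunks)
```

Re-indexing a three-chunk document as a one-chunk document with `naive_upsert` leaves the two old chunks behind; `safe_upsert` does not.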
Verify the Provider works
The OpenSearch Provider implementation is now complete. Finally, we add a case to datastore/factory.py so that this Provider is used when the environment variable DATASTORE is set to opensearch.
```python
async def get_datastore() -> DataStore:
    datastore = os.environ.get("DATASTORE")
    assert datastore is not None
    match datastore:
        # ...
        case "opensearch":
            from datastore.providers.opensearch_datastore import OpenSearchDataStore

            return OpenSearchDataStore()
        case _:
            raise ValueError(f"Unsupported vector database: {datastore}")
```
Everything is now in place. If you run the script we made earlier for verification, you should get similar results. With that, the OpenSearch Provider implementation is done.
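For reference, you can also verify the Provider without LlamaIndex by calling the plugin's /query endpoint directly. The request shape below is a sketch based on the plugin's OpenAPI docs as I read them; confirm it against http://0.0.0.0:8000/docs before relying on it.

```python
import json

def build_query_request(bearer_token: str, query: str, top_k: int = 3):
    # Headers and JSON body for POST http://localhost:8000/query
    # (assumed endpoint shape; verify against the plugin's OpenAPI docs).
    headers = {
        "Authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"queries": [{"query": query, "top_k": top_k}]})
    return headers, body
```

Send it with any HTTP client (for example `requests.post("http://localhost:8000/query", headers=headers, data=body)`); the response should contain the matching snippets with their scores.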
An Elasticsearch Provider can't be built
Actually, at first I thought I'd try this with Elasticsearch, the engine I know best, but it turned out that Elasticsearch currently can't be used with the ChatGPT Retrieval Plugin. text-embedding-ada-002 outputs 1536-dimensional vectors, while Lucene, which Elasticsearch uses internally, supports a maximum of 1024 dimensions, so Elasticsearch can't be used as a Provider.
There is an ongoing discussion about raising the dimension limit, so this restriction may disappear in the future.
There is also an article that summarizes these developments in Japanese.
Summary
This time, instead of using a vector search engine the ChatGPT Retrieval Plugin supports, I implemented a Provider for AWS's OpenSearch and tried providing vector search to ChatGPT.
Through this implementation, the conditions for something to be usable as a ChatGPT Retrieval Plugin Provider turned out to be the following.
- It can handle 1536-dimensional vectors
- It can perform upsert/search/delete
Any vector search engine that satisfies the above can be supported just by implementing the DataStore abstract class. You could implement Providers for Vald, Faiss, and others too. If you're a vector search engine implementer, or you have a favorite vector search engine, please give it a try.
I plan to clean up this implementation a bit and submit it as a PR if there are no problems (I haven't used OpenSearch before, so I need to confirm this is the right way to do it).
The momentum to put OpenAI to work is building at our company too, and I'm looking forward to it.
We're hiring!
Our company is looking for members to advance healthcare with the power of information retrieval and machine learning. If you're even slightly interested, let's do a 1-on-1!