Introduction
I built a simple document-search web application that stores data from various sources, such as CSV, PDF, and Excel files, in ElasticSearch and lets you search for the files that contain a given phrase.
The source code is pushed to the repository below.
github.com
Features and behavior
Document search
- Enter any string and press Search
- If any documents contain the search string, their file names and the text inside the files (up to 200 characters) are shown as search results
Document upload
- Clicking the Select button opens a file dialog where you choose any document
- Clicking the Submit button registers the selected document in ElasticSearch
Note: for now, only CSV, PDF, and Excel files are accepted
Technologies used
- Frontend
  - React v18.2.0
  - TypeScript v4.9.5
- Backend
  - Python v3.8.10
  - FastAPI v0.110.0
  - ElasticSearch v7.5.1
  - Docker v20.10.17
Implementation details: backend
Below is the source code, focusing on the parts that integrate with ElasticSearch.
ElasticSearch
The environment is built from the ElasticSearch Docker image.
docker-compose.yml
version: "3"
services:
 sysctl:
  image: alpine
  container_name: sysctl
  command: ["sysctl", "-w", "vm.max_map_count=262144"]
  privileged: true
  networks:
   - esnet
 es01:
  build:
   context: .
   dockerfile: Dockerfile
  container_name: es01
  environment:
   - node.name=es01
   - cluster.initial_master_nodes=es01
   - cluster.name=docker-cluster
   - bootstrap.memory_lock=true
   - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
  ulimits:
   memlock:
    soft: -1
    hard: -1
  volumes:
   - esdata01:/usr/share/elasticsearch/data
  ports:
   - 9200:9200
  networks:
   - esnet
  depends_on:
   - sysctl
volumes:
 esdata01:
  driver: local
networks:
 esnet:
- A volume named esdata01 is created and mounted
- After startup, ElasticSearch stopped with an error saying that memory allocation was insufficient, so the sysctl container sets vm.max_map_count=262144, raising the kernel's limit on memory-mapped areas, to keep it from stopping
Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch:7.5.1
RUN elasticsearch-plugin install analysis-kuromoji
- The image for the ElasticSearch container is specified here
- Because Japanese text is handled as well, the kuromoji plugin (analysis-kuromoji) is added
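The plugin only affects fields that are mapped to one of its analyzers. As a rough sketch, assuming the book index mentioned later in this article stores its text under doc.text, the index could be created with a body like the one below. The helper name build_index_body is hypothetical, and the mapping used by the actual project is not shown in the article.

```python
from typing import Any, Dict


def build_index_body(analyzer: str = "kuromoji") -> Dict[str, Any]:
    """Return a create-index body that analyzes doc.text with the given analyzer.

    "kuromoji" is the analyzer name registered by the analysis-kuromoji plugin.
    The field layout (doc.name, doc.text) follows the documents in this article;
    the field types chosen here are assumptions.
    """
    return {
        "mappings": {
            "properties": {
                "doc": {
                    "properties": {
                        # File names are kept as exact values
                        "name": {"type": "keyword"},
                        # Body text is tokenized by the Japanese analyzer
                        "text": {"type": "text", "analyzer": analyzer},
                    }
                }
            }
        }
    }
```

The returned dictionary would be sent as the JSON body of a PUT request to the index URL before any documents are registered.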
Backend
Request handling
@app.get("/search", response_model=SearchDocumentResponse)
def search(text: Optional[str] = Query(None, description="Search query")):
  return search_document(text)
@app.post("/upload")
async def upload(file: UploadFile = File(...)):
  result = await upload_document(file)
  return JSONResponse(content={"result": result})
- FastAPI is used to define the endpoints for the document search and document upload features
- The search endpoint needs to receive the search string, so it takes it as a query parameter named text
- File data sent by a document upload can be received through the UploadFile class
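For illustration, a client could build the request URL for the search endpoint as follows. This is a minimal sketch: the base URL http://localhost:8000 and the helper name build_search_url are assumptions, not part of the project. It shows that a Japanese search string has to be percent-encoded when it is passed as the text query parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url: str, text: str) -> str:
    """Build the /search URL with the search string as the `text` query parameter."""
    # urlencode percent-encodes non-ASCII characters as UTF-8
    return f"{base_url.rstrip('/')}/search?{urlencode({'text': text})}"


if __name__ == "__main__":
    url = build_search_url("http://localhost:8000", "検索")
    print(url)
    # Decoding the query string gives back the original search string
    print(parse_qs(urlsplit(url).query)["text"][0])
```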
Searching documents in ElasticSearch
import re
import requests
from models.search_document import SearchDocumentResponse, SearchDocumentResult
# Assumption: the search URL does not appear in this listing; "book" is the index name used in this article
ELASTIC_SEARCH_SEARCH_URL = "http://localhost:9200/book/_search"
def search_document(text: str):
  try:
    # Nothing to search for when the query is missing or empty
    if not text:
      return SearchDocumentResponse(count=0, result=[])
    # Build the Elasticsearch query
    if is_alphanumeric(text):
      # Half-width alphanumeric input: regexp query
      query = {"query": {"regexp": {"doc.text": f".*{text}.*"}}}
    else:
      # Japanese input: phrase match
      query = {"query": {"match_phrase": {"doc.text": text}}}
    # Run the query against Elasticsearch
    response = requests.post(ELASTIC_SEARCH_SEARCH_URL, json=query)
    if response.status_code != 200:
      print(
        f"Error: Failed to retrieve data from Elasticsearch. Status code: {response.status_code}"
      )
      return SearchDocumentResponse(count=0, result=[])
    # Get the hits from the response
    data = response.json()["hits"]["hits"]
    # List that collects the search results
    search_results = []
    # Loop over the hits in the response
    for hit in data:
      file_name = hit["_source"]["doc"]["name"]
      # text may be a list or a string, so branch on its type
      text_content = hit["_source"]["doc"]["text"]
      if isinstance(text_content, list):
        # When text is a list, join its elements into a single string
        text_content = "\n".join(text_content)
      search_results.append(
        SearchDocumentResult(file_name=file_name, text=text_content)
      )
    # Wrap the search results in a SearchDocumentResponse and return it
    return SearchDocumentResponse(count=len(search_results), result=search_results)
  except Exception as e:
    print(
      f"Error: An error occurred while processing the response from Elasticsearch: {str(e)}"
    )
    return SearchDocumentResponse(count=0, result=[])
def is_alphanumeric(text):
  # True when the text consists only of half-width letters and digits
  pattern = re.compile(r"^[a-zA-Z0-9]*$")
  return bool(pattern.match(text))
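- The search string is received as the text variable and used to build the search query for ElasticSearch
- I could not find a single query that handles both Japanese and alphanumeric input, so the query is switched dynamically between the two: a regexp query for alphanumeric input and a match_phrase query for Japanese (both are partial-match searches that check whether the given string is contained)
- Multiple documents may match, so the results are collected into a list and returned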
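The switch between the two query types can be isolated into a small pure function, which makes it easy to check without a running cluster. The sketch below mirrors the logic described above; the function name build_query is hypothetical, and treating an empty string as non-alphanumeric is an assumption made here so that an empty input does not turn into a match-everything regexp.

```python
import re
from typing import Any, Dict

# One or more half-width letters or digits, and nothing else
ALPHANUMERIC = re.compile(r"[a-zA-Z0-9]+")


def build_query(text: str) -> Dict[str, Any]:
    """Return the Elasticsearch query body for the given search string."""
    if ALPHANUMERIC.fullmatch(text):
        # Alphanumeric input: regexp query for terms containing the string
        return {"query": {"regexp": {"doc.text": f".*{text}.*"}}}
    # Anything else (for example Japanese): phrase match on the analyzed text
    return {"query": {"match_phrase": {"doc.text": text}}}
```

Because the alphanumeric branch only accepts letters and digits, the search string cannot contain regexp metacharacters and needs no escaping there.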
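Registering documents in ElasticSearch
async def upload_document(file: UploadFile = File(...)):
  # Route the upload to the handler that matches its content type
  content_type = file.content_type
  if content_type == FILE_TYPE_PDF:
    return await upload_pdf(file)
  elif content_type == FILE_TYPE_CSV:
    return await upload_csv(file)
  elif content_type == FILE_TYPE_EXCEL:
    return await upload_excel(file)
  else:
    # Unsupported type: report failure to the caller
    print(f"Unsupported content type: {content_type}")
    return False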
- This time, CSV, Excel, and PDF documents are the ones registered in ElasticSearch
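The FILE_TYPE_* constants are not shown in the article. As an illustration only, the same routing can be written as a lookup table keyed by MIME type. The MIME strings below are the standard ones for PDF, CSV, and .xlsx files, but whether the project uses exactly these values is an assumption, and pick_handler is a hypothetical helper.

```python
from typing import Optional

# Assumed values: standard MIME types for the three supported formats
FILE_TYPE_PDF = "application/pdf"
FILE_TYPE_CSV = "text/csv"
FILE_TYPE_EXCEL = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Maps each supported content type to the name of its upload handler
HANDLER_NAMES = {
    FILE_TYPE_PDF: "upload_pdf",
    FILE_TYPE_CSV: "upload_csv",
    FILE_TYPE_EXCEL: "upload_excel",
}


def pick_handler(content_type: Optional[str]) -> Optional[str]:
    """Return the handler name for a content type, or None if it is unsupported."""
    return HANDLER_NAMES.get(content_type)
```

A table like this keeps the list of accepted formats in one place, so adding a format means adding one entry rather than another branch.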
Registering documents in ElasticSearch (CSV)
from csv import DictReader
import json
import os
from fastapi import File, UploadFile
import requests
from constant.constant import (
  ELASTIC_SEARCH_REQUEST_HEADERS,
  ELASTIC_SEARCH_URL,
)
async def upload_csv(file: UploadFile = File(...)):
  try:
    file_name = file.filename
    file_path = os.path.join(os.getcwd(), file_name)
    # Save the uploaded file
    with open(file_path, "wb") as f:
      f.write(await file.read())
    # Read each row as a dict and keep it as a JSON string
    file_content = []
    with open(file_path, "rt", encoding="utf-8") as csv_file:
      reader = DictReader(csv_file)
      for row in reader:
        row_json = json.dumps(row, ensure_ascii=False)
        file_content.append(row_json)
    # Delete the temporary file, which is no longer needed once parsed
    os.unlink(file_path)
    # Build the data to send to Elasticsearch
    data = {"doc": {"name": file_name, "text": file_content}}
    print("Uploaded file data:", data)
    # Send the document as the JSON request body
    response = requests.post(
      ELASTIC_SEARCH_URL,
      headers=ELASTIC_SEARCH_REQUEST_HEADERS,
      json=data,
    )
    if response.status_code == 201:
      print("Document indexed successfully.")
    else:
      print(
        f"Failed to index document. Status code: {response.status_code}, response: {response.text}"
      )
      return False
    return True
  except Exception as e:
    print(
      f"Error: An error occurred while processing uploading csv to Elasticsearch: {str(e)}"
    )
    return False
- The file name and file contents are read from the UploadFile argument
- To read the contents, the upload is first saved to a temporary file and then read back from that file as strings (reading directly from UploadFile and sending the result to ElasticSearch sometimes caused errors)
- A document with the fields name and text is registered (note: an index named book was created beforehand and is used here)
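As a point of comparison, the rows can also be parsed from the uploaded bytes in memory, without a temporary file. This is only a sketch of one possible approach, not what the project does. The helper name csv_bytes_to_rows is hypothetical, and UTF-8 input is assumed (the utf-8-sig codec also strips a byte-order mark if one is present).

```python
import csv
import io
import json
from typing import List


def csv_bytes_to_rows(raw: bytes, encoding: str = "utf-8-sig") -> List[str]:
    """Decode CSV bytes and return each data row as a JSON string."""
    # newline="" lets the csv module handle line endings itself
    reader = csv.DictReader(io.StringIO(raw.decode(encoding), newline=""))
    # The header row supplies the keys; every other row becomes one JSON object
    return [json.dumps(row, ensure_ascii=False) for row in reader]
```

In the upload handler, the bytes returned by `await file.read()` would be passed in directly, and the resulting list would take the place of file_content.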
Registering documents in ElasticSearch (Excel)
import json
import os
from fastapi import File, UploadFile
import pandas as pd
import requests
from constant.constant import ELASTIC_SEARCH_REQUEST_HEADERS, ELASTIC_SEARCH_URL
async def upload_excel(file: UploadFile = File(...)):
  try:
    file_name = file.filename
    file_path = os.path.join(os.getcwd(), file_name)
    # Save the uploaded file
    with open(file_path, "wb") as f:
      f.write(await file.read())
    # Empty list that will hold the contents of the whole workbook
    book_content = []
    # Read the data of each sheet in the Excel file and add it to the list
    with pd.ExcelFile(file_path) as xls:
      for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet_name)
        records = df.to_dict(orient="records")
        book_content.extend(records)
    # Delete the temporary file, which is no longer needed once parsed
    os.unlink(file_path)
    # Build the data to send to Elasticsearch
    data = {
      "doc": {
        "name": file_name,
        "text": json.dumps(book_content, ensure_ascii=False),
      }
    }
    print("Uploaded file data:", data)
    # Send the document as the JSON request body
    response = requests.post(
      ELASTIC_SEARCH_URL,
      headers=ELASTIC_SEARCH_REQUEST_HEADERS,
      json=data,
    )
    if response.status_code == 201:
      print("Document indexed successfully.")
    else:
      print(
        f"Failed to index document. Status code: {response.status_code}, response: {response.text}"
      )
      return False
    return True
  except Exception as e:
    print(
      f"Error: An error occurred while processing uploading excel to Elasticsearch: {str(e)}"
    )
    return False
- As with CSV, the file name and file contents are read
- pandas reads the contents sheet by sheet into a list, which is finally converted into a single string and registered in ElasticSearch
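One thing to watch for with this approach: pandas may return cell values such as dates as objects that json.dumps cannot serialize by default. One possible workaround, sketched here, is to pass default=str so that such values fall back to their string form. A plain datetime stands in for a spreadsheet date cell, and both the sample records and the helper name records_to_text are made up for illustration.

```python
import json
from datetime import datetime
from typing import Any, Dict, List


def records_to_text(records: List[Dict[str, Any]]) -> str:
    """Serialize row records to one JSON string, stringifying unsupported values."""
    # default=str is called only for values json cannot encode on its own
    return json.dumps(records, ensure_ascii=False, default=str)


if __name__ == "__main__":
    sample = [{"item": "りんご", "date": datetime(2024, 3, 1)}]
    print(records_to_text(sample))
```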
Registering documents in ElasticSearch (PDF)
import os
from fastapi import File, UploadFile
from pdfminer.high_level import extract_text
import requests
from constant.constant import ELASTIC_SEARCH_REQUEST_HEADERS, ELASTIC_SEARCH_URL
async def upload_pdf(file: UploadFile = File(...)):
  try:
    file_name = file.filename
    file_path = os.path.join(os.getcwd(), file_name)
    # Save the uploaded file
    with open(file_path, "wb") as f:
      f.write(await file.read())
    # Extract the text from the PDF
    file_text = extract_text(file_path)
    # Delete the temporary file, which is no longer needed once parsed
    os.unlink(file_path)
    # Build the data to send to Elasticsearch
    data = {"doc": {"name": file_name, "text": file_text}}
    print("Uploaded file data:", data)
    # Send the document as the JSON request body
    response = requests.post(
      ELASTIC_SEARCH_URL,
      headers=ELASTIC_SEARCH_REQUEST_HEADERS,
      json=data,
    )
    if response.status_code == 201:
      print("Document indexed successfully.")
    else:
      print(
        f"Failed to index document. Status code: {response.status_code}, response: {response.text}"
      )
      return False
    return True
  except Exception as e:
    print(
      f"Error: An error occurred while processing uploading pdf to Elasticsearch: {str(e)}"
    )
    return False
- pdfminer was chosen for reading PDFs because it is simple to write and easy to use
- As with CSV and Excel, the file name and file contents are read and registered in ElasticSearch
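All three handlers save the upload under its original file name in the current working directory. A possible alternative, sketched below under the assumption that the parser only needs a file path, is a small context manager built on tempfile.NamedTemporaryFile that avoids name collisions and always cleans up, even when an exception is raised. The name saved_to_temp is hypothetical.

```python
import os
from contextlib import contextmanager
from tempfile import NamedTemporaryFile
from typing import Iterator


@contextmanager
def saved_to_temp(data: bytes, suffix: str = "") -> Iterator[str]:
    """Write data to a uniquely named temporary file and yield its path.

    The file is removed when the with-block exits, whether or not it raised.
    """
    # delete=False so the file can be closed and then reopened by path
    tmp = NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # harmless if the file is already closed
        if os.path.exists(tmp.name):
            os.unlink(tmp.name)
```

In the PDF handler this would wrap the extraction step: open `saved_to_temp(await file.read(), suffix=".pdf")` in a with-statement and call extract_text on the yielded path inside it.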
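With the implementation above, I was able to register documents in ElasticSearch and search them.
Note: there is also the elasticsearch package on PyPI, which I tried first, but I could not get it to connect to the ElasticSearch instance running in Docker, so this time I went with registering documents via plain HTTP requests.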
Conclusion
This time I only handled CSV, Excel, and PDF, but I think documents in all sorts of other formats could be stored in ElasticSearch as well.
Building on this, one could, for example, accumulate an organization's files in ElasticSearch and find the file you need by searching for any passage of text.