LLM returns non-Western date formats like '16世紀' instead of YYYY #118

@monneyboi

Problem

When extracting dates from Japanese Wikipedia articles, the LLM sometimes returns dates in Japanese notation such as 16世紀 (16th century) instead of one of the expected formats (YYYY, YYYY-MM, or YYYY-MM-DD), causing validation errors:

Error extracting [<PropertyType.BIRTH_DATE: 'P569'>, <PropertyType.DEATH_DATE: 'P570'>] with LLM: 2 validation errors for PropertyExtractionResult
properties.0.value
  Value error, Invalid date format: '16世紀'. Must be YYYY, YYYY-MM, or YYYY-MM-DD
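
For reference, the check that rejects this value can be approximated as follows. This is a minimal sketch, not the actual poliloom validator: the names `DATE_PATTERN` and `validate_date` are hypothetical, and only the error message is taken from the traceback above.

```python
import re

# Hypothetical reconstruction of the format check; the real validator
# in poliloom may be implemented differently.
# re.ASCII restricts \d to 0-9, so full-width digits are rejected too.
DATE_PATTERN = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?", re.ASCII)


def validate_date(value: str) -> str:
    """Return value unchanged if it is YYYY, YYYY-MM, or YYYY-MM-DD."""
    if not DATE_PATTERN.fullmatch(value):
        raise ValueError(
            f"Invalid date format: '{value}'. "
            "Must be YYYY, YYYY-MM, or YYYY-MM-DD"
        )
    return value
```

With a check like this, `validate_date("1550")` passes, while `validate_date("16世紀")` raises the `ValueError` shown in the log.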

Current Behavior

The prompts in poliloom/prompts.py do specify the expected format:

- birth_date: Use format YYYY-MM-DD, or YYYY-MM, YYYY for incomplete dates
- death_date: Use format YYYY-MM-DD, or YYYY-MM, YYYY for incomplete dates

However, the LLM sometimes:

  1. Returns dates in non-Western notation (Japanese characters such as 世紀 rather than plain Arabic-numeral dates)
  2. Returns century-level precision, which none of the accepted formats (YYYY, YYYY-MM, YYYY-MM-DD) can represent
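
To tell these two failure modes apart when triaging logs, something like the following could help. It is only a sketch under assumptions: `classify_date_output` and its return labels are hypothetical and not part of poliloom, and it only recognizes centuries written with digits followed by 世紀.

```python
import re
import unicodedata

# Digits followed by the Japanese "century" suffix, e.g. "16世紀".
CENTURY_RE = re.compile(r"([0-9]{1,2})\s*世紀")
ISO_RE = re.compile(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?")


def classify_date_output(value: str) -> str:
    """Label an LLM date string (hypothetical helper for triage)."""
    stripped = value.strip()
    # NFKC folds full-width digits (e.g. "１５５０") to ASCII digits.
    normalized = unicodedata.normalize("NFKC", stripped)
    if ISO_RE.fullmatch(normalized):
        return "valid" if normalized == stripped else "non_ascii_digits"
    if CENTURY_RE.fullmatch(normalized):
        return "century_only"
    return "other"
```

Here `classify_date_output("16世紀")` yields `"century_only"`, which corresponds to the second case in the list above.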

Proposed Solution

Strengthen the date extraction prompts to:

  1. Explicitly require Arabic numerals (0-9) for all date values
  2. Specify the behavior when only century-level precision is available (e.g., skip extraction, or use an approximate year such as "1550" for the 16th century)
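
As a rough illustration of both points, the sketch below shows possible extra prompt wording and the midpoint arithmetic behind the "1550" example. The constant `DATE_FORMAT_RULES` and the function `century_to_approx_year` are hypothetical; the wording is not taken from poliloom/prompts.py, and whether to skip or approximate is still an open choice.

```python
# Hypothetical additional prompt rules; not the current contents of
# DATES_EXTRACTION_SYSTEM_PROMPT.
DATE_FORMAT_RULES = """\
- Write all dates with Arabic numerals (0-9) only, whatever the language of the source text.
- Convert dates written in other scripts or calendars to YYYY-MM-DD, YYYY-MM, or YYYY.
- If the source gives only a century (e.g. "16世紀"), do not extract the date.
"""


def century_to_approx_year(century: int) -> str:
    """Map an ordinal century to its midpoint year as a YYYY string.

    The 16th century spans 1501-1600, so its midpoint is taken as 1550.
    This is the 'approximate year' alternative to skipping extraction.
    """
    if century < 1:
        raise ValueError("century must be a positive ordinal")
    return f"{(century - 1) * 100 + 50:04d}"
```

For example, `century_to_approx_year(16)` gives `"1550"`. Note that approximating this way makes the stored year look more precise than the source actually is, which may be an argument for the skip option instead.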

Affected Files

  • poliloom/poliloom/prompts.py - DATES_EXTRACTION_SYSTEM_PROMPT
  • Potentially also POSITIONS_EXTRACTION_SYSTEM_PROMPT for position start/end dates

Labels: loom (Poliloom core project issues)
