Open
Labels: loom (Poliloom core project issues)
Description
Problem
When extracting dates from Japanese Wikipedia articles, the LLM sometimes returns dates in Japanese notation, such as 16世紀 (16th century), instead of one of the expected formats (YYYY, YYYY-MM, or YYYY-MM-DD). This causes validation errors:
Error extracting [<PropertyType.BIRTH_DATE: 'P569'>, <PropertyType.DEATH_DATE: 'P570'>] with LLM: 2 validation errors for PropertyExtractionResult
properties.0.value
Value error, Invalid date format: '16世紀'. Must be YYYY, YYYY-MM, or YYYY-MM-DD
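For illustration, a check that rejects this value in the same way could look like the sketch below. This is not the project's actual validator: `DATE_PATTERN` and `validate_date` are hypothetical names, and only the error message is taken from the log above.

```python
import re

# Hypothetical pattern for YYYY, YYYY-MM, or YYYY-MM-DD.
# [0-9] is used deliberately: in Python 3 a bare \d on str patterns also
# matches other Unicode decimal digits, such as full-width "１５５０".
DATE_PATTERN = re.compile(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?")


def validate_date(value: str) -> str:
    """Return the value unchanged if it matches an accepted format, else raise."""
    if not DATE_PATTERN.fullmatch(value):
        raise ValueError(
            f"Invalid date format: '{value}'. Must be YYYY, YYYY-MM, or YYYY-MM-DD"
        )
    return value
```

Under these assumptions, `validate_date("1550")` passes, while `validate_date("16世紀")` raises a `ValueError` with the message seen in the log. The sketch checks shape only; it does not verify that the month or day is a real calendar value.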
Current Behavior
The prompts in poliloom/prompts.py do specify the expected format:
- birth_date: Use format YYYY-MM-DD, or YYYY-MM, YYYY for incomplete dates
- death_date: Use format YYYY-MM-DD, or YYYY-MM, YYYY for incomplete dates
However, the LLM sometimes:
- Returns dates in non-Western notation (Japanese characters such as 世紀 alongside or instead of plain Arabic numerals)
- Returns century-level precision, which none of the accepted formats can express
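To see how often each of these two failure modes occurs, rejected values could be bucketed before deciding how to change the prompt. The sketch below is one possible way to do that; `classify_rejected_date`, `CENTURY_MARKERS`, and the bucket names are hypothetical and not part of poliloom.

```python
# Hypothetical markers that signal century-level precision.
CENTURY_MARKERS = ("世紀", "century")


def classify_rejected_date(value: str) -> str:
    """Roughly bucket a date string that failed format validation."""
    lowered = value.lower()
    if any(marker in lowered for marker in CENTURY_MARKERS):
        return "century_precision"
    # Any non-ASCII character (kanji, full-width digits, ...) means the
    # value was not written in the plain numeric format the prompt asks for.
    if not value.isascii():
        return "non_ascii_notation"
    return "other"
```

With this sketch, `16世紀` and `16th century` both land in `century_precision`, a value such as `1990年5月` lands in `non_ascii_notation`, and anything else (for example `May 1990`) is reported as `other`.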
Proposed Solution
Strengthen the date extraction prompts to:
- Explicitly require Arabic numerals (0-9) for all date values
- Specify the behavior when only century-level precision is available (e.g., skip extraction, or use an approximate year such as "1550" for the 16th century)
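As a starting point for discussion, the two points above might translate into something like the following. The wording of `DATE_FORMAT_RULES` is only a suggestion, not text from the existing prompts, and `century_to_approx_year` is a hypothetical helper showing the "approximate year" alternative.

```python
# Suggested extra rules for the date prompts (hypothetical wording).
# This variant picks the "skip extraction" option for century-only dates.
DATE_FORMAT_RULES = """\
- Write every date with Arabic numerals (0-9) only, as YYYY-MM-DD, YYYY-MM, or YYYY.
- Never copy the source language's date notation (for example 16世紀).
- If the source gives only a century, omit the date instead of guessing a year.
"""


def century_to_approx_year(century: int) -> str:
    """Alternative to skipping: map a century to its midpoint year.

    The 16th century spans 1501-1600, so 16 maps to "1550".
    """
    if century < 1:
        raise ValueError("century must be a positive integer")
    return f"{(century - 1) * 100 + 50:04d}"
```

Skipping keeps the extracted data free of invented precision, while the midpoint approach keeps the record but presents a guessed year as if it were known. Which trade-off is right for poliloom is an open question for this issue.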
Affected Files
- poliloom/poliloom/prompts.py - DATES_EXTRACTION_SYSTEM_PROMPT
- Potentially also POSITIONS_EXTRACTION_SYSTEM_PROMPT for position start/end dates