Add detection of PDF/image sub-links and extract text via Gemini #1173

Merged
ericciarla merged 1 commit into firecrawl:main from
mayooear:feat/gemini-sub-links
Feb 12, 2025

Conversation

@mayooear
Contributor

Enhanced Content Extraction with Gemini PDF/Image Analysis

Changes

  • Added PDF and image detection for sub-links within scraped pages
  • Integrated Gemini Vision API for image analysis and bounding box detection
  • Added PDF text extraction using Gemini's document understanding capabilities
  • Appended extracted content to the main markdown for comprehensive analysis
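
The last step in the list above, appending extracted content to the main markdown, could look roughly like the sketch below. This is not the code from the PR: the function name `append_extracted_content`, the `extracted` mapping of source URL to text, and the `--- Extracted from ... ---` label format are all assumptions made for illustration.

```python
from typing import Mapping


def append_extracted_content(markdown: str, extracted: Mapping[str, str]) -> str:
    """Append text extracted from sub-links to the page markdown.

    Hypothetical helper: `extracted` maps a sub-link URL to the text that
    was pulled out of it. The label format is an assumption, not the
    format used in the PR.
    """
    sections = [markdown.rstrip()]
    for url, text in extracted.items():
        # Skip sub-links for which nothing usable was extracted.
        if not text or not text.strip():
            continue
        # Label each block with its source URL so readers can trace it.
        sections.append(f"--- Extracted from {url} ---\n{text.strip()}")
    return "\n\n".join(sections) + "\n"
```

Labeling each block with its source URL keeps the appended text traceable, and skipping empty results avoids adding headers with no content beneath them.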

Technical Details

  • New helper functions:
    • is_pdf_url() & is_image_url() for file type detection
    • gemini_extract_pdf_content() for PDF text extraction
    • gemini_extract_image_data() for image analysis and bounding boxes
  • Modified scrape_url to include 'links' format for sub-link discovery
  • Extracted content is clearly labeled with source URLs in the markdown output
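
For readers without the diff at hand, here is a minimal sketch of what extension-based file type detection might look like. Only the names `is_pdf_url()` and `is_image_url()` come from the description above; the bodies, the `IMAGE_EXTENSIONS` tuple, and the `_url_path` helper are assumptions and may differ from the merged code.

```python
from urllib.parse import urlparse

# Assumed set of image extensions; the PR may recognize a different set.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def _url_path(url: str) -> str:
    """Return the lowercased path of a URL, without query or fragment."""
    return urlparse(url).path.lower()


def is_pdf_url(url: str) -> bool:
    """Guess whether a sub-link points at a PDF, based on its extension."""
    return _url_path(url).endswith(".pdf")


def is_image_url(url: str) -> bool:
    """Guess whether a sub-link points at an image, based on its extension."""
    return _url_path(url).endswith(IMAGE_EXTENSIONS)
```

Checking the parsed path rather than the raw string means a link such as `report.pdf?download=1` is still recognized. An extension check is only a heuristic, though: URLs that serve PDFs or images without a file extension would be missed unless the response's content type is also inspected.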

Impact

This enhancement allows the crawler to discover and analyze content within PDFs and images that would otherwise be missed, providing more comprehensive results for user queries.

@ericciarla
Contributor

Lgtm!

@ericciarla ericciarla merged commit a1b7d6e into firecrawl:main Feb 12, 2025