This document provides a high-level introduction to Crawl4AI, explaining its purpose, core architecture, and main components. It covers the fundamental concepts needed to understand how the system works and how to interact with it through its various interfaces.
For detailed configuration options, see Configuration System (#2.3). For deployment instructions, see Deployment Options (#1.3). For working code examples, see Usage Examples (#12).
Crawl4AI is an open-source web crawling and extraction system optimized for Large Language Model (LLM) applications. It transforms web pages into clean, structured markdown and JSON, making web content immediately usable for RAG systems, AI agents, and data pipelines.
Key Characteristics:
Current Version: 0.8.0 (crawl4ai/__version__.py4)
Sources: README.md1-68 crawl4ai/__version__.py1-9
Crawl4AI is built around a multi-layered architecture with clear separation between browser management, content processing, and data extraction:
Architecture Diagram: Crawl4AI Component Hierarchy
The system follows a clear data flow: a URL enters through `arun()`, the browser strategy fetches and renders the page, the scraping strategy cleans the HTML, the markdown generator and optional extraction strategy transform the content, and a `CrawlResult` object is returned.

Sources: README.md1-230 crawl4ai/async_crawler_strategy.py1-100 crawl4ai/async_configs.py1-50
Crawl4AI provides three primary interfaces for interacting with the system:
The AsyncWebCrawler class is the core Python API. It provides full control over crawling and extraction with an async context manager interface.
Python SDK Entry Points Diagram
Basic Usage:
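A minimal sketch of the async context-manager pattern, mirroring the README quick-start (requires `crawl4ai` and a Playwright browser to be installed; the URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    # The context manager starts the browser and shuts it down cleanly.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(result.markdown)  # clean, LLM-ready markdown


asyncio.run(main())
```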
Sources: README.md92-106 crawl4ai/async_crawler_strategy.py
The crwl command provides quick access to crawling functionality without writing code:
Entry Point: crawl4ai.cli:main defined in pyproject.toml83
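A hedged example of the CLI form shown in the README (exact flags may vary by version):

```shell
# Crawl a page and print markdown output
crwl https://example.com -o markdown
```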
Sources: README.md108-118 pyproject.toml78-84
The Docker deployment exposes a FastAPI server with comprehensive REST endpoints:
Core Endpoints:
- `POST /crawl` - Synchronous crawling (returns results immediately)
- `POST /crawl/job` - Asynchronous job submission (returns `task_id`)
- `GET /crawl/job/{task_id}` - Job status and result retrieval
- `POST /screenshot` - Page screenshot capture
- `POST /pdf` - PDF generation
- `POST /html` - HTML content extraction

Container Launch:
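A typical launch command, assuming the image name and default port (11235) used in the README:

```shell
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
```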
Sources: README.md319-354 deploy/docker/server.py Dockerfile1-205
Understanding how data flows through the system is essential for effective usage:
Data Flow Through Processing Pipeline
Key Data Transformations:

1. Raw HTML → cleaned HTML (scraping strategy strips scripts, styles, and boilerplate)
2. Cleaned HTML → markdown (markdown generation strategy, optionally content-filtered)
3. Page content → structured JSON (optional extraction strategy)
Result Object Structure:
The CrawlResult object (models.py) contains all outputs:
| Field | Type | Description |
|---|---|---|
| `url` | `str` | Original URL crawled |
| `html` | `str` | Raw HTML from browser |
| `cleaned_html` | `str` | Cleaned HTML (scripts/ads removed) |
| `markdown` | `MarkdownGenerationResult` | Contains `raw_markdown` and `fit_markdown` |
| `extracted_content` | `str` | JSON string from extraction strategy |
| `links` | `dict` | Internal/external links |
| `media` | `dict` | Images, videos, audio metadata |
| `screenshot` | `str` | Base64-encoded screenshot (if requested) |
| `success` | `bool` | Whether crawl succeeded |
| `status_code` | `int` | HTTP status code |
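The table above can be sketched as a plain dataclass (illustrative only; the real model lives in `crawl4ai/models.py` and carries more fields):

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResultSketch:
    """Illustrative stand-in for crawl4ai's CrawlResult."""
    url: str
    html: str = ""
    cleaned_html: str = ""
    extracted_content: str = ""            # JSON string from extraction strategy
    links: dict = field(default_factory=dict)
    media: dict = field(default_factory=dict)
    screenshot: str = ""                   # base64, only if requested
    success: bool = False
    status_code: int = 0


# Typical consumer pattern: check success before reading outputs.
r = CrawlResultSketch(url="https://example.com", success=True, status_code=200)
print(r.success, r.status_code)  # → True 200
```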
Sources: README.md360-473 crawl4ai/models.py crawl4ai/content_scraping_strategy.py
Crawl4AI uses a three-level configuration hierarchy that controls different aspects of operation:
Controls browser behavior across all crawls for a given AsyncWebCrawler instance:
Key Parameters:
- `browser_type`: `"chromium"` (default), `"firefox"`, `"webkit"`, or `"undetected"`
- `headless`: Boolean for visible/headless mode
- `viewport_width`, `viewport_height`: Browser window size
- `user_agent_mode`: `"random"`, `"real"`, or custom
- `use_persistent_context`: Enable session persistence
- `proxy_config`: Proxy settings with rotation support

Example:
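A configuration sketch using the parameters above (values are illustrative; requires `crawl4ai`):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Browser-level settings apply to every crawl made by this crawler instance.
browser_config = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=800,
)

crawler = AsyncWebCrawler(config=browser_config)
```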
Defined in: crawl4ai/async_configs.py
Controls behavior for individual crawl requests:
Key Parameters:
- `cache_mode`: `CacheMode.ENABLED`, `DISABLED`, `BYPASS`, `READ_ONLY`, `WRITE_ONLY`
- `extraction_strategy`: CSS, LLM, or Cosine extraction
- `markdown_generator`: Markdown generation strategy
- `content_filter`: Content filtering strategy
- `js_code`: JavaScript to execute on page
- `wait_for`: Wait conditions before scraping
- `screenshot`, `pdf`: Media capture options
- `session_id`: For multi-step crawling workflows

Example:
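A per-request configuration sketch (the selector in `wait_for` is a hypothetical placeholder):

```python
from crawl4ai import CacheMode, CrawlerRunConfig

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,      # skip the local cache for this request
    screenshot=True,                  # capture a base64 screenshot
    wait_for="css:.content-loaded",   # wait for a selector before scraping
)
# Passed per request: await crawler.arun(url=..., config=run_config)
```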
Defined in: crawl4ai/async_configs.py
Controls LLM provider settings for AI-powered extraction and filtering:
Key Parameters:
- `provider`: Format `"provider/model"` (e.g., `"openai/gpt-4o"`, `"anthropic/claude-3-opus"`)
- `api_token`: Authentication token
- `base_url`: Custom endpoint URL (for self-hosted models)
- `backoff_base_delay`, `backoff_max_attempts`: Retry configuration
- `extra_headers`: Custom HTTP headers

Supported Providers:
- OpenAI (`openai/gpt-4o`, `openai/gpt-4o-mini`)
- Anthropic (`anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`)
- Google (`google/gemini-2.0-flash`)
- Ollama (`ollama/llama3.2` - local models)

Example:
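A configuration sketch assuming the API key is supplied via an environment variable:

```python
import os

from crawl4ai import LLMConfig

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=os.getenv("OPENAI_API_KEY"),
)
```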
Defined in: crawl4ai/async_configs.py
Sources: crawl4ai/async_configs.py README.md367-556 tests/async/test_0.4.2_config_params.py1-212
Crawl4AI produces multiple output formats optimized for different use cases:
Raw Markdown (`result.markdown.raw_markdown`): the full page converted to markdown, with headings, lists, and links preserved.

Fit Markdown (`result.markdown.fit_markdown`): filtered markdown containing only the most relevant content (populated when a content filter is configured).

Citation Links (`result.markdown.references_markdown`): page links collected as numbered citation-style references.

When using an extraction strategy, `result.extracted_content` contains a JSON string with structured data:
CSS Extraction:
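The schema consumed by `JsonCssExtractionStrategy` is a plain dict: a base selector plus per-field CSS selectors. A sketch with illustrative selectors, plus the consumer-side parse of the JSON string:

```python
import json

# Illustrative schema: one record per div.product element.
schema = {
    "name": "Products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

# result.extracted_content arrives as a JSON string, so consumers parse it.
extracted = json.dumps([{"title": "Widget", "price": "$9.99", "link": "/w1"}])
items = json.loads(extracted)
print(items[0]["title"])  # → Widget
```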
LLM Extraction:
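A sketch of schema-guided LLM extraction following the documented pattern (the Pydantic model and instruction text are illustrative; requires `crawl4ai` and `pydantic`):

```python
import os

from pydantic import BaseModel

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class PricingEntry(BaseModel):
    model_name: str
    input_price: str


strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
    ),
    schema=PricingEntry.model_json_schema(),
    extraction_type="schema",
    instruction="Extract model names and their input token prices.",
)
# Passed via CrawlerRunConfig(extraction_strategy=strategy)
```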
Links Dictionary:
Media Dictionary:
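The shape of the two dictionaries can be sketched as follows (top-level keys follow the field descriptions above; the per-entry attributes shown are assumptions, and real entries carry more metadata):

```python
# Illustrative shapes only.
links = {
    "internal": [{"href": "/docs", "text": "Docs"}],
    "external": [{"href": "https://github.com/unclecode/crawl4ai", "text": "GitHub"}],
}
media = {
    "images": [{"src": "/logo.png", "alt": "Logo"}],
    "videos": [],
    "audios": [],
}
print(len(links["internal"]), len(media["images"]))  # → 1 1
```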
Sources: README.md148-224 docs/examples/llm_extraction_openai_pricing.py1-56 docs/examples/docker_example.py159-221
Crawl4AI is built on Python's asyncio for concurrent operations:
- `async with AsyncWebCrawler()` context manager for browser lifecycle management
- `arun_many()` for concurrent URL crawling

The caching system uses content hashes to avoid redundant fetches:
- Database: `~/.crawl4ai/crawl4ai.db`
- Content storage: `~/.crawl4ai/html_content/`, `~/.crawl4ai/markdown_content/`

Database structure: `crawl4ai/database.py`
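The hash-based cache idea can be sketched in a few lines (this hashing scheme is illustrative, not crawl4ai's actual implementation):

```python
import hashlib


def cache_key(url: str) -> str:
    # Derive a stable key from the URL for lookup and on-disk filenames.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


cache: dict[str, str] = {}


def fetch(url: str) -> str:
    key = cache_key(url)
    if key in cache:                            # cache hit: skip the fetch
        return cache[key]
    html = f"<html>content of {url}</html>"     # stand-in for a real fetch
    cache[key] = html
    return html


fetch("https://example.com")
print(cache_key("https://example.com") in cache)  # → True
```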
Core functionality is pluggable through strategy interfaces:
- Extraction: `JsonCssExtractionStrategy`, `LLMExtractionStrategy`, `CosineStrategy`
- Content filtering: `PruningContentFilter`, `BM25ContentFilter`, `LLMContentFilter`
- Markdown generation: `DefaultMarkdownGenerator`
- Chunking: `RegexChunking`, `NlpSentenceChunking`, `TopicSegmentationChunking`

Extend by subclassing base strategy classes.
Use `session_id` for multi-step workflows:

Cookies and local storage persist across requests with the same `session_id`.
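A hedged sketch of a two-step workflow reusing one session (URLs are placeholders; requires `crawl4ai` and a Playwright browser):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: load the page; the session keeps the tab state alive.
        await crawler.arun(
            url="https://example.com/list",
            config=CrawlerRunConfig(session_id="workflow-1"),
        )
        # Step 2: same session_id reuses the same cookies and storage.
        result = await crawler.arun(
            url="https://example.com/list?page=2",
            config=CrawlerRunConfig(session_id="workflow-1"),
        )
        print(result.success)


asyncio.run(main())
```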
Sources: README.md520-558 CHANGELOG.md1-262
For version-specific changes and migration guides, see the Release History.