This document provides a high-level introduction to Crawl4AI, explaining its purpose, core architecture, and main components. It covers the fundamental concepts needed to understand how the system works and how to interact with it through its various interfaces.
For detailed configuration options, see Configuration System (#2.3). For deployment instructions, see Deployment Options (#1.3). For working code examples, see Usage Examples (#12).
Crawl4AI is an open-source web crawling and extraction system optimized for Large Language Model (LLM) applications. It transforms web pages into clean, structured markdown and JSON, making web content immediately usable for RAG systems, AI agents, and data pipelines.
Key Characteristics:
Current Version: 0.8.0 (crawl4ai/__version__.py4)
Sources: README.md1-68 crawl4ai/__version__.py1-9
Crawl4AI is built around a multi-layered architecture with clear separation between browser management, content processing, and data extraction:
Architecture Diagram: Crawl4AI Component Hierarchy
The system follows a clear data flow: a URL enters through `arun()`, the browser strategy fetches and renders the page, the scraping strategy cleans the HTML, the markdown generator and optional extraction strategy transform the content, and a `CrawlResult` object is returned.

Sources: README.md1-230 crawl4ai/async_crawler_strategy.py1-100 crawl4ai/async_configs.py1-50
Crawl4AI provides three primary interfaces for interacting with the system:
The AsyncWebCrawler class is the core Python API. It provides full control over crawling and extraction with an async context manager interface.
Python SDK Entry Points Diagram
Basic Usage:
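A minimal sketch of the async context-manager pattern, mirroring the README quick-start (requires `crawl4ai` and a Playwright browser to be installed; the URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    # The context manager starts the browser and shuts it down cleanly.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(result.markdown)  # clean, LLM-ready markdown


asyncio.run(main())
```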
Sources: README.md92-106 crawl4ai/async_crawler_strategy.py
The crwl command provides quick access to crawling functionality without writing code:
Entry Point: crawl4ai.cli:main defined in pyproject.toml83
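A hedged example of the CLI form shown in the README (exact flags may vary by version):

```shell
# Crawl a page and print markdown output
crwl https://example.com -o markdown
```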
Sources: README.md108-118 pyproject.toml78-84
The Docker deployment exposes a FastAPI server with comprehensive REST endpoints:
Core Endpoints:
- `POST /crawl` - Synchronous crawling (returns results immediately)
- `POST /crawl/job` - Asynchronous job submission (returns `task_id`)
- `GET /crawl/job/{task_id}` - Job status and result retrieval
- `POST /screenshot` - Page screenshot capture
- `POST /pdf` - PDF generation
- `POST /html` - HTML content extraction

Container Launch:
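A typical launch command, assuming the image name and default port (11235) used in the README:

```shell
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
```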
Sources: README.md319-354 deploy/docker/server.py Dockerfile1-205
Understanding how data flows through the system is essential for effective usage:
Data Flow Through Processing Pipeline
Key Data Transformations:

1. Raw HTML → cleaned HTML (scraping strategy strips scripts, styles, and boilerplate)
2. Cleaned HTML → markdown (markdown generation strategy, optionally content-filtered)
3. Page content → structured JSON (optional extraction strategy)
Result Object Structure:
The CrawlResult object (models.py) contains all outputs:
| Field | Type | Description |
|---|---|---|
| `url` | `str` | Original URL crawled |
| `html` | `str` | Raw HTML from browser |
| `cleaned_html` | `str` | Cleaned HTML (scripts/ads removed) |
| `markdown` | `MarkdownGenerationResult` | Contains `raw_markdown` and `fit_markdown` |
| `extracted_content` | `str` | JSON string from extraction strategy |
| `links` | `dict` | Internal/external links |
| `media` | `dict` | Images, videos, audio metadata |
| `screenshot` | `str` | Base64-encoded screenshot (if requested) |
| `success` | `bool` | Whether crawl succeeded |
| `status_code` | `int` | HTTP status code |
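The table above can be sketched as a plain dataclass (illustrative only; the real model lives in `crawl4ai/models.py` and carries more fields):

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResultSketch:
    """Illustrative stand-in for crawl4ai's CrawlResult."""
    url: str
    html: str = ""
    cleaned_html: str = ""
    extracted_content: str = ""            # JSON string from extraction strategy
    links: dict = field(default_factory=dict)
    media: dict = field(default_factory=dict)
    screenshot: str = ""                   # base64, only if requested
    success: bool = False
    status_code: int = 0


# Typical consumer pattern: check success before reading outputs.
r = CrawlResultSketch(url="https://example.com", success=True, status_code=200)
print(r.success, r.status_code)  # → True 200
```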
Sources: README.md360-473 crawl4ai/models.py crawl4ai/content_scraping_strategy.py
Crawl4AI uses a three-level configuration hierarchy that controls different aspects of operation:
Controls browser behavior across all crawls for a given AsyncWebCrawler instance:
Key Parameters:
- `browser_type`: `"chromium"` (default), `"firefox"`, `"webkit"`, or `"undetected"`
- `headless`: Boolean for visible/headless mode
- `viewport_width`, `viewport_height`: Browser window size
- `user_agent_mode`: `"random"`, `"real"`, or custom
- `use_persistent_context`: Enable session persistence
- `proxy_config`: Proxy settings with rotation support

Example:
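A configuration sketch using the parameters above (values are illustrative; requires `crawl4ai`):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Browser-level settings apply to every crawl made by this crawler instance.
browser_config = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=800,
)

crawler = AsyncWebCrawler(config=browser_config)
```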
Defined in: crawl4ai/async_configs.py
Controls behavior for individual crawl requests:
Key Parameters:
- `cache_mode`: `CacheMode.ENABLED`, `DISABLED`, `BYPASS`, `READ_ONLY`, `WRITE_ONLY`
- `extraction_strategy`: CSS, LLM, or Cosine extraction
- `markdown_generator`: Markdown generation strategy
- `content_filter`: Content filtering strategy
- `js_code`: JavaScript to execute on page
- `wait_for`: Wait conditions before scraping
- `screenshot`, `pdf`: Media capture options
- `session_id`: For multi-step crawling workflows

Example:
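A per-request configuration sketch (the selector in `wait_for` is a hypothetical placeholder):

```python
from crawl4ai import CacheMode, CrawlerRunConfig

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,      # skip the local cache for this request
    screenshot=True,                  # capture a base64 screenshot
    wait_for="css:.content-loaded",   # wait for a selector before scraping
)
# Passed per request: await crawler.arun(url=..., config=run_config)
```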
Defined in: crawl4ai/async_configs.py
Controls LLM provider settings for AI-powered extraction and filtering:
Key Parameters:
- `provider`: Format `"provider/model"` (e.g., `"openai/gpt-4o"`, `"anthropic/claude-3-opus"`)
- `api_token`: Authentication token
- `base_url`: Custom endpoint URL (for self-hosted models)
- `backoff_base_delay`, `backoff_max_attempts`: Retry configuration
- `extra_headers`: Custom HTTP headers

Supported Providers:
- OpenAI (`openai/gpt-4o`, `openai/gpt-4o-mini`)
- Anthropic (`anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`)
- Google (`google/gemini-2.0-flash`)
- Ollama (`ollama/llama3.2` - local models)

Example:
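A configuration sketch assuming the API key is supplied via an environment variable:

```python
import os

from crawl4ai import LLMConfig

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=os.getenv("OPENAI_API_KEY"),
)
```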
Defined in: crawl4ai/async_configs.py
Sources: crawl4ai/async_configs.py README.md367-556 tests/async/test_0.4.2_config_params.py1-212
Crawl4AI produces multiple output formats optimized for different use cases:
Raw Markdown (`result.markdown.raw_markdown`): the full page converted to markdown, with headings, lists, and links preserved.

Fit Markdown (`result.markdown.fit_markdown`): filtered markdown containing only the most relevant content (populated when a content filter is configured).

Citation Links (`result.markdown.references_markdown`): page links collected as numbered citation-style references.

When using an extraction strategy, `result.extracted_content` contains a JSON string with structured data:
CSS Extraction:
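The schema consumed by `JsonCssExtractionStrategy` is a plain dict: a base selector plus per-field CSS selectors. A sketch with illustrative selectors, plus the consumer-side parse of the JSON string:

```python
import json

# Illustrative schema: one record per div.product element.
schema = {
    "name": "Products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

# result.extracted_content arrives as a JSON string, so consumers parse it.
extracted = json.dumps([{"title": "Widget", "price": "$9.99", "link": "/w1"}])
items = json.loads(extracted)
print(items[0]["title"])  # → Widget
```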
LLM Extraction:
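A sketch of schema-guided LLM extraction following the documented pattern (the Pydantic model and instruction text are illustrative; requires `crawl4ai` and `pydantic`):

```python
import os

from pydantic import BaseModel

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class PricingEntry(BaseModel):
    model_name: str
    input_price: str


strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
    ),
    schema=PricingEntry.model_json_schema(),
    extraction_type="schema",
    instruction="Extract model names and their input token prices.",
)
# Passed via CrawlerRunConfig(extraction_strategy=strategy)
```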
Links Dictionary:
Media Dictionary:
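The shape of the two dictionaries can be sketched as follows (top-level keys follow the field descriptions above; the per-entry attributes shown are assumptions, and real entries carry more metadata):

```python
# Illustrative shapes only.
links = {
    "internal": [{"href": "/docs", "text": "Docs"}],
    "external": [{"href": "https://github.com/unclecode/crawl4ai", "text": "GitHub"}],
}
media = {
    "images": [{"src": "/logo.png", "alt": "Logo"}],
    "videos": [],
    "audios": [],
}
print(len(links["internal"]), len(media["images"]))  # → 1 1
```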
Sources: README.md148-224 docs/examples/llm_extraction_openai_pricing.py1-56 docs/examples/docker_example.py159-221
Crawl4AI is built on Python's asyncio for concurrent operations:
- `async with AsyncWebCrawler()` context manager for browser lifecycle management
- `arun_many()` for concurrent URL crawling

The caching system uses content hashes to avoid redundant fetches:
- Database: `~/.crawl4ai/crawl4ai.db`
- Content storage: `~/.crawl4ai/html_content/`, `~/.crawl4ai/markdown_content/`

Database structure: `crawl4ai/database.py`
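The hash-based cache idea can be sketched in a few lines (this hashing scheme is illustrative, not crawl4ai's actual implementation):

```python
import hashlib


def cache_key(url: str) -> str:
    # Derive a stable key from the URL for lookup and on-disk filenames.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


cache: dict[str, str] = {}


def fetch(url: str) -> str:
    key = cache_key(url)
    if key in cache:                            # cache hit: skip the fetch
        return cache[key]
    html = f"<html>content of {url}</html>"     # stand-in for a real fetch
    cache[key] = html
    return html


fetch("https://example.com")
print(cache_key("https://example.com") in cache)  # → True
```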
Core functionality is pluggable through strategy interfaces:
- Extraction: `JsonCssExtractionStrategy`, `LLMExtractionStrategy`, `CosineStrategy`
- Content filtering: `PruningContentFilter`, `BM25ContentFilter`, `LLMContentFilter`
- Markdown generation: `DefaultMarkdownGenerator`
- Chunking: `RegexChunking`, `NlpSentenceChunking`, `TopicSegmentationChunking`

Extend by subclassing base strategy classes.
Use `session_id` for multi-step workflows:

Cookies and local storage persist across requests with the same `session_id`.
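A hedged sketch of a two-step workflow reusing one session (URLs are placeholders; requires `crawl4ai` and a Playwright browser):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: load the page; the session keeps the tab state alive.
        await crawler.arun(
            url="https://example.com/list",
            config=CrawlerRunConfig(session_id="workflow-1"),
        )
        # Step 2: same session_id reuses the same cookies and storage.
        result = await crawler.arun(
            url="https://example.com/list?page=2",
            config=CrawlerRunConfig(session_id="workflow-1"),
        )
        print(result.success)


asyncio.run(main())
```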
Sources: README.md520-558 CHANGELOG.md1-262
For version-specific changes and migration guides, see the Release History.