Crawl and convert any website into LLM-ready markdown.
Firecrawl Simple is a stripped-down, stable version of Firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed.
playwright is replaced with heros such that fire-engine and scrapingbee are not required for guarded pages.
Only the v1 /scrape, /crawl/{id}, and /crawl routes are supported in Firecrawl Simple; see the OpenAPI spec for details. Also, creditsUsed has been removed from the API response on the /crawl/{id} route.
Posthog, supabase, stripe, langchain, logsnag, sentry, bullboard, and several other deps from the package.json are removed.
This is a lot to maintain by ourselves and we are actively looking for others who would like to help. There are paid part-time maintainer positions available. We currently have bounties on a couple of issues, but would like someone interested in being an active maintainer longer-term.
The upstream firecrawl repo contains the following blurb:
This repository is in development, and we're still integrating custom modules into the mono repo. It's not fully ready for self-hosted deployment yet, but you can run it locally.
Firecrawl's API surface and general functionality were ideal for our Trieve sitesearch product, but we needed a version ready for self-hosting that was easy to contribute to and scale on Kubernetes. Therefore, we decided to fork and begin maintaining a stripped down, stable version.
The biggest deal breaker, and the main reason we maintain this fork, is that fire-engine, Firecrawl's solution for anti-bot pages, is closed source. In addition, we do not need the SaaS and AI dependencies, which pushes our use case far enough from Firecrawl's current mission that merging into upstream does not seem viable at this time.
Add the following services to your docker-compose file. We trust that you can configure Kubernetes or other hosting solutions to run these services.
name: firecrawl
services:
  # Firecrawl services
  playwright-service:
    image: trieve/puppeteer-service-ts:v0.0.6
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
      - MAX_CONCURRENCY=${MAX_CONCURRENCY}
      - TWOCAPTCHA_TOKEN=${TWOCAPTCHA_TOKEN}
    networks:
      - backend
  firecrawl-api:
    image: trieve/firecrawl:v0.0.46
    networks:
      - backend
    environment:
      - REDIS_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
      - REDIS_RATE_LIMIT_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
      - PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
      - PORT=${PORT:-3002}
      - NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
      - BULL_AUTH_KEY=${BULL_AUTH_KEY}
      - TEST_API_KEY=${TEST_API_KEY}
      - HOST=${HOST:-0.0.0.0}
      - SELF_HOSTED_WEBHOOK_URL=${SELF_HOSTED_WEBHOOK_URL}
      - LOGGING_LEVEL=${LOGGING_LEVEL}
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      - playwright-service
    ports:
      - "3002:3002"
    command: ["pnpm", "run", "start:production"]
  firecrawl-worker:
    image: trieve/firecrawl:v0.0.46
    networks:
      - backend
    environment:
      - REDIS_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
      - REDIS_RATE_LIMIT_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
      - PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
      - PORT=${PORT:-3002}
      - NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
      - BULL_AUTH_KEY=${BULL_AUTH_KEY}
      - TEST_API_KEY=${TEST_API_KEY}
      - SCRAPING_BEE_API_KEY=${SCRAPING_BEE_API_KEY}
      - HOST=${HOST:-0.0.0.0}
      - SELF_HOSTED_WEBHOOK_URL=${SELF_HOSTED_WEBHOOK_URL}
      - LOGGING_LEVEL=${LOGGING_LEVEL}
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      - playwright-service
      - firecrawl-api
    command: ["pnpm", "run", "workers"]
  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0
networks:
  backend:
    driver: bridge
We recommend Oxylabs values for the proxy environment variables (PROXY_SERVER, PROXY_USERNAME, PROXY_PASSWORD). Optionally, also consider setting up 2captcha via TWOCAPTCHA_TOKEN.
Firecrawl Simple works as follows:
1. The crawl endpoint starts on a URL and gets the sitemap or HTML for the page, depending on the request.
2. URLs from the sitemap or HTML which match the include and exclude criteria are added to the Redis queue.
3. Workers pick those URLs and get their HTML using the /scrape endpoint on the playwright-service.
4. URLs which have not already been scraped and match the include and exclude criteria, taken from the HTML received from the scrape, get added to the queue by each worker.
5. Steps 2-4 continue until no new links are found or the limit specified on the crawl is reached.
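The loop described above can be sketched in a few lines of Python. This is a simplified, single-process illustration and not the project's actual worker code: an in-memory deque stands in for the Redis queue, the hypothetical get_links callable stands in for the /scrape call on the playwright-service, and the include and exclude criteria are assumed to be regular expressions.

```python
import re
from collections import deque
from typing import Callable


def matches(url: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if url passes the filters (assumed regex semantics)."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    # An empty include list is treated as "include everything".
    return not include or any(re.search(pattern, url) for pattern in include)


def crawl(
    start_url: str,
    get_links: Callable[[str], list[str]],
    include: list[str],
    exclude: list[str],
    limit: int,
) -> list[str]:
    """Breadth-first crawl that stops when the queue is empty or limit is hit."""
    queue = deque([start_url])  # stand-in for the Redis queue
    seen = {start_url}          # URLs already queued or scraped
    scraped: list[str] = []

    while queue and len(scraped) < limit:
        url = queue.popleft()    # a worker picks the next URL
        links = get_links(url)   # hypothetical stand-in for /scrape
        scraped.append(url)
        for link in links:
            # Only new URLs matching the criteria go back on the queue.
            if link not in seen and matches(link, include, exclude):
                seen.add(link)
                queue.append(link)
    return scraped
```

Tracking seen URLs at enqueue time, rather than at scrape time, keeps the same page from being queued twice when several pages link to it.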
Your scaling bottlenecks will be the following, in order:
1. MAX_CONCURRENCY (number of headless puppeteer browsers) on each playwright-service instance
2. The actual number of playwright-service instances you have behind your load balancer
3. The number of firecrawl-worker instances you have (this is very rarely the bottleneck)
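For rough capacity planning, the first two bottlenecks can be combined into a back-of-the-envelope estimate. The sketch below assumes that total browser capacity is simply MAX_CONCURRENCY multiplied by the number of playwright-service instances and that every page takes about the same time to scrape; both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def browser_slots(max_concurrency: int, playwright_instances: int) -> int:
    """Total concurrent headless browsers across all playwright-service instances."""
    return max_concurrency * playwright_instances


def estimate_crawl_seconds(pages: int, slots: int, seconds_per_page: float) -> float:
    """Lower-bound crawl time, assuming pages are scraped in full batches of `slots`."""
    return math.ceil(pages / slots) * seconds_per_page
```

For example, three instances with MAX_CONCURRENCY=10 give 30 slots under these assumptions. Real throughput will be lower once proxy latency and link discovery are taken into account.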
Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
curl -X POST https://<your-url>/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "limit": 100,
      "scrapeOptions": {
        "formats": ["markdown", "html"]
      }
    }'
Returns a crawl job ID and the URL to check the status of the crawl.
{
  "success": true,
  "id": "123-456-789",
  "url": "https://<your-url>/v1/crawl/123-456-789"
}
Used to check the status of a crawl job and get its result.
curl -X GET https://<your-url>/v1/crawl/123-456-789 \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY'
{
  "status": "completed",
  "total": 36,
  "expiresAt": "2024-00-00T00:00:00.000Z",
  "data": [
    {
      "markdown": "[Firecrawl Docs home page![light logo](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
      "metadata": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "language": "en",
        "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
        "ogLocaleAlternate": [],
        "statusCode": 200
      }
    }
  ]
}
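Because /crawl only returns a job ID, a client typically polls the status URL until the job finishes. The following is a minimal sketch of such a loop using only the Python standard library. The only status value shown in this document is "completed", so the sketch treats any other value as "not finished yet"; the function names, polling interval, and poll cap are assumptions for illustration.

```python
import json
import time
import urllib.request
from typing import Callable


def http_get_json(url: str, api_key: str) -> dict:
    """GET a URL with a Bearer token and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_crawl(
    status_url: str,
    api_key: str,
    fetch: Callable[[str, str], dict] = http_get_json,
    interval: float = 2.0,
    max_polls: int = 30,
) -> list:
    """Poll the /v1/crawl/{id} status URL until the job reports "completed"."""
    for _ in range(max_polls):
        body = fetch(status_url, api_key)
        if body.get("status") == "completed":
            return body.get("data", [])
        time.sleep(interval)  # assumed: any other status means still running
    raise TimeoutError(f"crawl not completed after {max_polls} polls")
```

The status_url argument is the "url" field returned when the crawl was submitted. The fetch parameter exists so the loop can be exercised without a running server.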
Used to scrape a URL and get its content in the specified formats.
curl -X POST https://<your-url>/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": ["markdown", "html"]
    }'
Response:
{
  "success": true,
  "data": {
    "markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
    "html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev",
      "statusCode": 200
    }
  }
}
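The same scrape call can be made from Python with the standard library. This sketch mirrors the curl example and the response shape shown above; the helper names are hypothetical, and the base URL is whatever host your firecrawl-api service is exposed on.

```python
import json
import urllib.request


def build_scrape_request(
    base_url: str,
    api_key: str,
    target_url: str,
    formats: tuple[str, ...] = ("markdown", "html"),
) -> urllib.request.Request:
    """Build the POST /v1/scrape request without sending it."""
    payload = json.dumps({"url": target_url, "formats": list(formats)})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/scrape",
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_markdown(response_body: dict) -> str:
    """Pull the markdown out of a decoded /v1/scrape response."""
    if not response_body.get("success"):
        raise ValueError("scrape did not succeed")
    return response_body["data"]["markdown"]
```

To send the request, pass it to urllib.request.urlopen and decode the body with json.load before calling extract_markdown.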