Scrape Service

Overview

The Scrape service provides direct access to raw HTML content from web pages, with optional JavaScript rendering support. This service is perfect for applications that need the complete HTML structure of a webpage, including dynamically generated content.
Try the Scrape service instantly in our interactive playground - no coding required!

Getting Started

Quick Start

from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Initialize the client
sgai_client = Client(api_key="your-api-key")

# Scrape request
response = sgai_client.htmlify(
    website_url="https://example.com",
    render_heavy_js=False,  # Set to True for heavy JavaScript rendering
    branding=True           # Set to True to extract brand design and metadata
)

print("HTML Content:", response.html)
print("Request ID:", response.scrape_request_id)
print("Status:", response.status)
# Optional branding result
if response.branding:
    print("Branding extracted")

Parameters

Parameter        Type     Required  Description
apiKey           string   Yes       The ScrapeGraph API Key.
website_url      string   Yes       The URL of the webpage to scrape.
render_heavy_js  boolean  No        Set to true for heavy JavaScript rendering. Default: false
branding         boolean  No        Return extracted brand design and metadata. Default: false
stealth          boolean  No        Enable stealth mode for anti-bot protection. Consumes additional credits. Default: false
Get your API key from the dashboard

Example Response

{
  "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a",
  "status": "completed",
  "html": "<!DOCTYPE html><html><head><title>Example Page</title></head><body><h1>Welcome to Example.com</h1><p>This is the raw HTML content...</p></body></html>",
  "error": ""
}
The response includes:
  • scrape_request_id: Unique identifier for tracking your request
  • status: Current status of the scraping operation
  • html: Raw HTML content of the webpage
  • error: Error message (if any occurred during scraping)
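The fields above can be checked programmatically before you use the HTML. A minimal sketch (the `handle_scrape_response` helper and the sample dict are illustrative, not part of the SDK):

```python
# Validate a Scrape response payload before using its HTML.
# The sample dict mirrors the documented fields; the helper is illustrative.

def handle_scrape_response(payload: dict) -> str:
    """Return the HTML body, or raise if the scrape did not complete."""
    if payload.get("status") != "completed":
        raise RuntimeError(
            f"Scrape {payload.get('scrape_request_id')} failed: "
            f"{payload.get('error') or 'unknown error'}"
        )
    return payload["html"]

sample = {
    "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a",
    "status": "completed",
    "html": "<!DOCTYPE html><html><body><h1>Welcome</h1></body></html>",
    "error": "",
}

html = handle_scrape_response(sample)
```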
Example Response with Branding

{
  "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a",
  "status": "completed",
  "html": "<!DOCTYPE html><html>...</html>",
  "error": "",
  "branding": {
    "branding": {
      "colorScheme": "light",
      "colors": {
        "primary": "#0B5FFF",
        "accent": "#FF8A00",
        "background": "#FFFFFF",
        "textPrimary": "#111827",
        "link": "#0B5FFF"
      },
      "fonts": [
        { "family": "Inter", "role": "body" }
      ],
      "typography": {
        "fontFamilies": { "primary": "Inter", "heading": "Inter" },
        "fontStacks": { "heading": ["Inter"], "body": ["Inter"] },
        "fontSizes": { "h1": "32px", "h2": "24px", "body": "16px" }
      },
      "spacing": { "baseUnit": 4, "borderRadius": "6px" },
      "components": {
        "input": { "borderColor": "#E5E7EB", "borderRadius": "6px" },
        "buttonPrimary": {
          "background": "#0B5FFF",
          "textColor": "#FFFFFF",
          "borderRadius": "6px",
          "shadow": "..."
        }
      },
      "images": {
        "logo": "https://example.com/logo.svg",
        "favicon": "https://example.com/favicon.ico",
        "ogImage": "https://example.com/og.png"
      },
      "designSystem": { "framework": "tailwind", "componentLibrary": null },
      "confidence": { "overall": 0.86 }
    },
    "metadata": {
      "title": "Example",
      "language": "en",
      "favicon": "https://example.com/favicon.ico"
    }
  }
}
When branding=true is passed, the response includes a branding object with brand design data and page metadata.
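Because the branding payload is nested (a branding object containing its own branding and metadata keys), pulling out individual design tokens is a matter of walking the dict. A sketch against the documented shape (the dict literal reproduces the example above; nothing here calls the API):

```python
# Walk the nested branding payload and extract a few design tokens.
# The dict literal reproduces the documented response shape.

branding_response = {
    "branding": {
        "branding": {
            "colors": {"primary": "#0B5FFF", "accent": "#FF8A00"},
            "typography": {"fontFamilies": {"primary": "Inter", "heading": "Inter"}},
        },
        "metadata": {"title": "Example", "language": "en"},
    }
}

design = branding_response["branding"]["branding"]
metadata = branding_response["branding"]["metadata"]

primary_color = design["colors"]["primary"]                     # "#0B5FFF"
heading_font = design["typography"]["fontFamilies"]["heading"]  # "Inter"
page_title = metadata["title"]                                  # "Example"
```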

Key Features

Raw HTML Access

Get complete HTML structure including all elements

JavaScript Rendering

Optional support for heavy JavaScript rendering

Branding Extraction

Optionally extract brand colors, fonts, typography, UI components, images, and metadata

Fast Processing

Quick extraction for simple HTML content

Reliable Output

Consistent results across different websites

Use Cases

Web Development

  • Extract HTML templates
  • Analyze page structure
  • Test website rendering
  • Debug HTML issues

Data Analysis

  • Parse HTML content
  • Extract specific elements
  • Monitor website changes
  • Build web scrapers

Content Processing

  • Process dynamic content
  • Handle JavaScript-heavy sites
  • Extract embedded data
  • Analyze page performance
Want to learn more about our AI-powered scraping technology? Visit our main website to discover how we’re revolutionizing web data extraction.

JavaScript Rendering

The render_heavy_js parameter controls whether JavaScript should be executed on the target page:

When to Use JavaScript Rendering

  • Single Page Applications (SPAs): React, Vue, Angular apps
  • Dynamic Content: Content loaded via AJAX/fetch
  • Interactive Elements: Dropdowns, modals, infinite scroll
  • Client-side Routing: Hash-based or history API routing

When to Skip JavaScript Rendering

  • Static HTML Pages: Traditional server-rendered content
  • Performance: Faster processing for simple pages
  • Cost Optimization: Lower API usage for basic scraping
  • Reliability: More predictable results for static content
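One way to encode this guidance in code is to default to render_heavy_js=False and opt in only for hosts you know are client-rendered. A sketch (the host list is hypothetical; maintain your own based on the sites you scrape):

```python
# Decide render_heavy_js per URL: default off, opt in for known SPA hosts.
from urllib.parse import urlparse

SPA_HOSTS = {"app.example.com", "dashboard.example.com"}  # hypothetical SPA deployments

def needs_js_rendering(url: str) -> bool:
    return urlparse(url).hostname in SPA_HOSTS

print(needs_js_rendering("https://app.example.com/login"))  # True
print(needs_js_rendering("https://example.com/about"))      # False
```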

Advanced Usage

Async Support

For applications requiring asynchronous execution, the Scrape service provides async support:
from scrapegraph_py import AsyncClient
import asyncio

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        response = await client.htmlify(
            website_url="https://example.com",
            render_heavy_js=True
        )
        print(response)

# Run the async function
asyncio.run(main())

Concurrent Processing

Process multiple URLs concurrently for better performance:
import asyncio
from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key")

    # URLs to scrape
    urls = [
        "https://example.com",
        "https://scrapegraphai.com/",
        "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
    ]

    tasks = [sgai_client.htmlify(website_url=url, render_heavy_js=False) for url in urls]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for {urls[i]}: {response}")
        else:
            print(f"\nPage {i+1} HTML:")
            print(f"URL: {urls[i]}")
            print(f"HTML Length: {len(response.html)} characters")

    await sgai_client.close()

if __name__ == "__main__":
    asyncio.run(main())

Integration Options

Official SDKs

AI Framework Integrations

Best Practices

Performance Optimization

  1. Use render_heavy_js=false for static content
  2. Process multiple URLs concurrently
  3. Cache results when possible
  4. Monitor API usage and costs

Error Handling

  • Always check the status field
  • Handle network timeouts gracefully
  • Implement retry logic for failed requests
  • Log errors for debugging
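A retry wrapper with exponential backoff covers the last two points. This is an illustrative sketch, not SDK code: `do_scrape` is a placeholder for your sgai_client.htmlify call, and in real code you would catch the SDK's specific exception types rather than bare Exception:

```python
# Retry transient scrape failures with exponential backoff (1s, 2s, 4s, ...).
import time

def scrape_with_retries(do_scrape, url: str, max_attempts: int = 3, base_delay: float = 1.0):
    last_error = None
    for attempt in range(max_attempts):
        try:
            return do_scrape(url)
        except Exception as exc:  # narrow this to the SDK's error types in real code
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Scrape failed after {max_attempts} attempts: {last_error}")
```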

Content Processing

  • Validate HTML structure before parsing
  • Handle different character encodings
  • Extract only needed content sections
  • Clean up HTML for further processing
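For extracting only the sections you need, Python's standard-library html.parser is often enough without pulling in a full parser like BeautifulSoup. A sketch that pulls just the page title out of raw scraped HTML:

```python
# Extract the <title> text from raw HTML using only the standard library.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

parser = TitleExtractor()
parser.feed("<!DOCTYPE html><html><head><title>Example Page</title></head><body></body></html>")
print(parser.title)  # Example Page
```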

Example Projects

Check out our cookbook for real-world examples:
  • Web scraping automation tools
  • Content monitoring systems
  • HTML analysis applications
  • Dynamic content extractors

API Reference

For detailed API documentation, see the API Reference.

Support & Resources

Ready to Start?

Sign up now and get your API key to begin scraping web content!