Scrape Service

Overview

The Scrape service provides direct access to raw HTML content from web pages, with optional JavaScript rendering support. This service is perfect for applications that need the complete HTML structure of a webpage, including dynamically generated content.
Try the Scrape service instantly in our interactive playground - no coding required!

Getting Started

Quick Start

from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Initialize the client
sgai_client = Client(api_key="your-api-key")

# Scrape request
response = sgai_client.htmlify(
    website_url="https://example.com",
    render_heavy_js=False,  # Set to True for heavy JavaScript rendering
    branding=True           # Set to True to extract brand design and metadata
)

print("HTML Content:", response.html)
print("Request ID:", response.scrape_request_id)
print("Status:", response.status)
# Optional branding result
if response.branding:
    print("Branding extracted")

Parameters

Parameter        Type     Required  Description
apiKey           string   Yes       The ScrapeGraph API Key.
website_url      string   Yes       The URL of the webpage to scrape.
render_heavy_js  boolean  No        Set to true for heavy JavaScript rendering. Default: false
branding         boolean  No        Return extracted brand design and metadata. Default: false
stealth          boolean  No        Enable stealth mode for anti-bot protection. Consumes additional credits. Default: false
Get your API key from the dashboard

Example Response

{
  "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a",
  "status": "completed",
  "html": "<!DOCTYPE html><html><head><title>Example Page</title></head><body><h1>Welcome to Example.com</h1><p>This is the raw HTML content...</p></body></html>",
  "error": ""
}
The response includes:
  • scrape_request_id: Unique identifier for tracking your request
  • status: Current status of the scraping operation
  • html: Raw HTML content of the webpage
  • error: Error message (if any occurred during scraping)
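The fields above can be checked programmatically before you use the HTML. A minimal sketch (the `handle_scrape_response` helper and the sample dict are illustrative, not part of the SDK):

```python
# Validate a Scrape response payload before using its HTML.
# The sample dict mirrors the documented fields; the helper is illustrative.

def handle_scrape_response(payload: dict) -> str:
    """Return the HTML body, or raise if the scrape did not complete."""
    if payload.get("status") != "completed":
        raise RuntimeError(
            f"Scrape {payload.get('scrape_request_id')} failed: "
            f"{payload.get('error') or 'unknown error'}"
        )
    return payload["html"]

sample = {
    "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a",
    "status": "completed",
    "html": "<!DOCTYPE html><html><body><h1>Welcome</h1></body></html>",
    "error": "",
}

html = handle_scrape_response(sample)
```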
Example Response with Branding

{
  "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a",
  "status": "completed",
  "html": "<!DOCTYPE html><html>...</html>",
  "error": "",
  "branding": {
    "branding": {
      "colorScheme": "light",
      "colors": {
        "primary": "#0B5FFF",
        "accent": "#FF8A00",
        "background": "#FFFFFF",
        "textPrimary": "#111827",
        "link": "#0B5FFF"
      },
      "fonts": [
        { "family": "Inter", "role": "body" }
      ],
      "typography": {
        "fontFamilies": { "primary": "Inter", "heading": "Inter" },
        "fontStacks": { "heading": ["Inter"], "body": ["Inter"] },
        "fontSizes": { "h1": "32px", "h2": "24px", "body": "16px" }
      },
      "spacing": { "baseUnit": 4, "borderRadius": "6px" },
      "components": {
        "input": { "borderColor": "#E5E7EB", "borderRadius": "6px" },
        "buttonPrimary": {
          "background": "#0B5FFF",
          "textColor": "#FFFFFF",
          "borderRadius": "6px",
          "shadow": "..."
        }
      },
      "images": {
        "logo": "https://example.com/logo.svg",
        "favicon": "https://example.com/favicon.ico",
        "ogImage": "https://example.com/og.png"
      },
      "designSystem": { "framework": "tailwind", "componentLibrary": null },
      "confidence": { "overall": 0.86 }
    },
    "metadata": {
      "title": "Example",
      "language": "en",
      "favicon": "https://example.com/favicon.ico"
    }
  }
}
When branding=true is passed, the response includes a branding object with brand design data and page metadata.
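Because the branding payload is nested (a branding object containing its own branding and metadata keys), pulling out individual design tokens is a matter of walking the dict. A sketch against the documented shape (the dict literal reproduces the example above; nothing here calls the API):

```python
# Walk the nested branding payload and extract a few design tokens.
# The dict literal reproduces the documented response shape.

branding_response = {
    "branding": {
        "branding": {
            "colors": {"primary": "#0B5FFF", "accent": "#FF8A00"},
            "typography": {"fontFamilies": {"primary": "Inter", "heading": "Inter"}},
        },
        "metadata": {"title": "Example", "language": "en"},
    }
}

design = branding_response["branding"]["branding"]
metadata = branding_response["branding"]["metadata"]

primary_color = design["colors"]["primary"]                     # "#0B5FFF"
heading_font = design["typography"]["fontFamilies"]["heading"]  # "Inter"
page_title = metadata["title"]                                  # "Example"
```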

Key Features

Raw HTML Access

Get complete HTML structure including all elements

JavaScript Rendering

Optional support for heavy JavaScript rendering

Branding Extraction

Optionally extract brand colors, fonts, typography, UI components, images, and metadata

Fast Processing

Quick extraction for simple HTML content

Reliable Output

Consistent results across different websites

Use Cases

Web Development

  • Extract HTML templates
  • Analyze page structure
  • Test website rendering
  • Debug HTML issues

Data Analysis

  • Parse HTML content
  • Extract specific elements
  • Monitor website changes
  • Build web scrapers

Content Processing

  • Process dynamic content
  • Handle JavaScript-heavy sites
  • Extract embedded data
  • Analyze page performance
Want to learn more about our AI-powered scraping technology? Visit our main website to discover how we’re revolutionizing web data extraction.

JavaScript Rendering

The render_heavy_js parameter controls whether JavaScript should be executed on the target page:

When to Use JavaScript Rendering

  • Single Page Applications (SPAs): React, Vue, Angular apps
  • Dynamic Content: Content loaded via AJAX/fetch
  • Interactive Elements: Dropdowns, modals, infinite scroll
  • Client-side Routing: Hash-based or history API routing

When to Skip JavaScript Rendering

  • Static HTML Pages: Traditional server-rendered content
  • Performance: Faster processing for simple pages
  • Cost Optimization: Lower API usage for basic scraping
  • Reliability: More predictable results for static content
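One way to encode this guidance in code is to default to render_heavy_js=False and opt in only for hosts you know are client-rendered. A sketch (the host list is hypothetical; maintain your own based on the sites you scrape):

```python
# Decide render_heavy_js per URL: default off, opt in for known SPA hosts.
from urllib.parse import urlparse

SPA_HOSTS = {"app.example.com", "dashboard.example.com"}  # hypothetical SPA deployments

def needs_js_rendering(url: str) -> bool:
    return urlparse(url).hostname in SPA_HOSTS

print(needs_js_rendering("https://app.example.com/login"))  # True
print(needs_js_rendering("https://example.com/about"))      # False
```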

Advanced Usage

Async Support

For applications requiring asynchronous execution, the Scrape service provides async support:
from scrapegraph_py import AsyncClient
import asyncio

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        response = await client.htmlify(
            website_url="https://example.com",
            render_heavy_js=True
        )
        print(response)

# Run the async function
asyncio.run(main())

Concurrent Processing

Process multiple URLs concurrently for better performance:
import asyncio
from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key")

    # URLs to scrape
    urls = [
        "https://example.com",
        "https://scrapegraphai.com/",
        "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
    ]

    tasks = [sgai_client.htmlify(website_url=url, render_heavy_js=False) for url in urls]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for {urls[i]}: {response}")
        else:
            print(f"\nPage {i+1} HTML:")
            print(f"URL: {urls[i]}")
            print(f"HTML Length: {len(response.html)} characters")

    await sgai_client.close()

if __name__ == "__main__":
    asyncio.run(main())

Integration Options

Official SDKs

AI Framework Integrations

Best Practices

Performance Optimization

  1. Use render_heavy_js=false for static content
  2. Process multiple URLs concurrently
  3. Cache results when possible
  4. Monitor API usage and costs

Error Handling

  • Always check the status field
  • Handle network timeouts gracefully
  • Implement retry logic for failed requests
  • Log errors for debugging
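A retry wrapper with exponential backoff covers the last two points. This is an illustrative sketch, not SDK code: `do_scrape` is a placeholder for your sgai_client.htmlify call, and in real code you would catch the SDK's specific exception types rather than bare Exception:

```python
# Retry transient scrape failures with exponential backoff (1s, 2s, 4s, ...).
import time

def scrape_with_retries(do_scrape, url: str, max_attempts: int = 3, base_delay: float = 1.0):
    last_error = None
    for attempt in range(max_attempts):
        try:
            return do_scrape(url)
        except Exception as exc:  # narrow this to the SDK's error types in real code
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Scrape failed after {max_attempts} attempts: {last_error}")
```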

Content Processing

  • Validate HTML structure before parsing
  • Handle different character encodings
  • Extract only needed content sections
  • Clean up HTML for further processing
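For extracting only the sections you need, Python's standard-library html.parser is often enough without pulling in a full parser like BeautifulSoup. A sketch that pulls just the page title out of raw scraped HTML:

```python
# Extract the <title> text from raw HTML using only the standard library.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

parser = TitleExtractor()
parser.feed("<!DOCTYPE html><html><head><title>Example Page</title></head><body></body></html>")
print(parser.title)  # Example Page
```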

Example Projects

Check out our cookbook for real-world examples:
  • Web scraping automation tools
  • Content monitoring systems
  • HTML analysis applications
  • Dynamic content extractors

API Reference

For detailed API documentation, see the API Reference.

Support & Resources

Ready to Start?

Sign up now and get your API key to begin scraping web content!