fix: Optimize memory consumption for HttpxHttpClient, fix proxy handling #905

vdusek merged 9 commits into apify:master

Conversation
Something very strange is going on inside `HttpxHttpClient`. It is so weird that someone might want to double-check it and tell me where I went wrong. To exclude the delays and errors that real proxies can introduce, I tested using a local environment. Here is the configuration used:

```python
# crawlee_test.py
import asyncio
import os
from contextlib import suppress

import psutil
from toxiproxy import Toxiproxy

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import ParselCrawler, ParsedHttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool

current_process = psutil.Process(os.getpid())


def log_memory_usage() -> float:
    """Return the RSS of this process and all its children, in MB."""
    current_size_bytes = int(current_process.memory_info().rss)
    for child in current_process.children(recursive=True):
        with suppress(psutil.NoSuchProcess):
            current_size_bytes += int(child.memory_info().rss)
    return current_size_bytes / 1024 / 1024


async def setup_toxiproxy(proxy_count: int = 1000) -> list[str]:
    """Create `proxy_count` local Toxiproxy proxies and return their URLs."""
    toxiproxy = Toxiproxy()
    toxiproxy.destroy_all()
    proxies = []
    i = 0
    while proxy_count > 0:
        try:
            port = 8001 + i
            toxiproxy.create(
                name=f'proxy_{i}',
                upstream='target-server:80',
                enabled=True,
                listen=f'0.0.0.0:{port}',
            )
            proxies.append(f'http://localhost:{port}')
            proxy_count -= 1
        except Exception:  # the port may already be taken, try the next one
            pass
        i += 1
    print(f'Created {len(proxies)} proxies')
    return proxies


async def run() -> None:
    proxies = await setup_toxiproxy(1500)
    session_pool = SessionPool()
    http_client = HttpxHttpClient(
        headers={
            'accept-encoding': 'gzip, deflate, br, zstd',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
            'accept-language': 'en',
        },
    )
    crawler = ParselCrawler(
        http_client=http_client,
        concurrency_settings=ConcurrencySettings(min_concurrency=20, desired_concurrency=50),
        proxy_configuration=ProxyConfiguration(proxy_urls=proxies),
        max_requests_per_crawl=4000,
        session_pool=session_pool,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParsedHttpCrawlingContext) -> None:
        ip = context.parsed_content.jmespath('origin').get()
        memory = log_memory_usage()
        context.log.info(
            f'Processing {context.request.url} with response ip {ip}, {context.session.id} '
            f'{context.proxy_info.url} clients {len(http_client._client_by_proxy_url)} {memory:.2f} MB'
        )

    requests = [Request.from_url(url='http://httpbin.org/get', always_enqueue=True) for _ in range(5000)]
    await crawler.run(requests)


if __name__ == '__main__':
    asyncio.run(run())
```
Also, @janbuchar was right: the binding of `session_id` to `proxy_url` worked incorrectly. As a result, we were creating far more `httpx.AsyncClient` instances than necessary.
janbuchar
left a comment
Just a couple of notes, looks great as a whole
```python
'transport': _HttpxTransport(
    proxy=proxy_url, http1=self._http1, http2=self._http2, verify=self._verify
),
```
I checked now and it looks like the transport doesn't need to know the proxy URL since proxying is handled by the httpx.AsyncClient (it creates a separate transport for that).
```python
raise RuntimeError('Invalid state')
```

```diff
-if session_id is None or request is not None:
+if session_id is None:
```
Did the `request is not None` condition cause any trouble here?
The condition `session_id is None or request is None` will cause proxy rotation for:

```python
info = await config.new_proxy_info(session_id, None, None)
```
Sorry, I mean `request is not None` - the same as it was in the original code.
Yes, it caused the proxy to rotate on each request:

```python
info = await config.new_proxy_info(session_id, request, None)
```

This returned different proxies for the same session.
```diff
     assert info.url == proxy_urls[1]

-    info = await config.new_proxy_info(sessions[2], None, None)
+    info = await config.new_proxy_info(sessions[2], request, None)
```
But that way, we won't test the `request=None` case anywhere, right?
In this test there are 5 calls: 3 with `request=None` and 2 with a `request`. So we check both cases, because the presence of a request should not affect the proxy rotation; each new session gets its own proxy.
True. We should probably test requests with no session though, assuming that there is a use case for this. Maybe @barjin would know more? He wrote the original tiered proxies functionality for JS crawlee.
Added a similar test without the session.
Co-authored-by: Jan Buchar <[email protected]>
vdusek
left a comment
Nice! And thanks for adding the tests. Could you please also update the docstrings of `ProxyConfiguration.new_proxy_info` and `ProxyConfiguration.new_url` to explain the behavior when `None` values are passed in?
vdusek
left a comment
Sorry, last thing, I didn't express myself precisely enough before.
```python
    If called repeatedly with the same request, it is assumed that the request is being retried.
    If a previously used session ID is received, it will return the same proxy url.
    If neither session_id nor request is provided, a new proxy url will be returned on each call.
    """
```
Could you please add the `Args` section and describe all of them? It is probably one of the last public components where it is missing.
Same for the `new_url` method.
(I am sorry, I know you are not the author of that and these are not your changes.)
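For illustration, a hedged sketch of what such an `Args` section could look like, written as a Google-style docstring on an illustrative stub (the wording and the stub class are mine, not crawlee's actual signature or text):

```python
class ProxyConfigurationSketch:
    """Illustrative stub, not crawlee's actual ProxyConfiguration."""

    async def new_proxy_info(self, session_id, request, proxy_tier):
        """Return proxy information for the given context.

        Args:
            session_id: Session requesting the proxy. A previously seen session
                ID returns the same proxy URL; if None, a new proxy may be
                rotated in on each call.
            request: Request being processed. Repeated calls with the same
                request are treated as retries, which can raise the proxy tier.
            proxy_tier: Explicit tier to use instead of the automatically
                tracked one.

        Returns:
            Proxy information, or None when no proxies are configured.
        """
```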
Description

- Reduced the number of `httpx.AsyncClient` instances created by `HttpxHttpClient`
- Fixed the binding of `session_id` to `proxy_url`

Issues
Testing

Updated `tests/unit/proxy_configuration/test_tiers.py::test_successful_request_makes_tier_go_down`. In the previous version, when `session_id` was passed, we should not obtain a new proxy even if the tier changes. The updated test now aligns with both the documentation and the `test_retrying_request_makes_tier_go_up` test.
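As a rough illustration of the tier mechanics those tests cover (a hypothetical simplification, not crawlee's implementation): retrying the same request pushes the tier up, while fresh successful requests let it drift back down.

```python
class TierTracker:
    """Track a proxy tier: retries push it up, new successes decay it."""

    def __init__(self, max_tier: int) -> None:
        self._max_tier = max_tier
        self._tier = 0
        self._seen_requests: set[str] = set()

    def tier_for_request(self, request_id: str) -> int:
        if request_id in self._seen_requests:
            # Same request seen again -> it is being retried: escalate.
            self._tier = min(self._tier + 1, self._max_tier)
        else:
            self._seen_requests.add(request_id)
            # A fresh request -> assume things are going well, drift down.
            self._tier = max(self._tier - 1, 0)
        return self._tier


tracker = TierTracker(max_tier=3)
assert tracker.tier_for_request('r1') == 0
assert tracker.tier_for_request('r1') == 1  # retry -> tier goes up
assert tracker.tier_for_request('r1') == 2
assert tracker.tier_for_request('r2') == 1  # new successful request -> tier goes down
```

The fix discussed above means the tier changing must not hand a different proxy to a request that already carries a `session_id`.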