fix: Optimize memory consumption for HttpxHttpClient, fix proxy handling #905

vdusek merged 9 commits into apify:master

Conversation
Something very strange is going on inside `HttpxHttpClient`. It is so weird that someone might want to double-check it and tell me where I went wrong. To exclude the delays and errors that real proxies can introduce, I tested using a local environment. Here is the configuration used:

```python
# crawlee_test.py
import asyncio
import os
from contextlib import suppress

import psutil
from toxiproxy import Toxiproxy

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import ParselCrawler, ParsedHttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool

current_process = psutil.Process(os.getpid())


def log_memory_usage() -> float:
    """Return the RSS of this process and all its children, in MB."""
    current_size_bytes = int(current_process.memory_info().rss)
    for child in current_process.children(recursive=True):
        with suppress(psutil.NoSuchProcess):
            current_size_bytes += int(child.memory_info().rss)
    return current_size_bytes / 1024 / 1024


async def setup_toxiproxy(proxy_count: int = 1000) -> list[str]:
    """Create `proxy_count` local Toxiproxy proxies and return their URLs."""
    toxiproxy = Toxiproxy()
    toxiproxy.destroy_all()
    proxies = []
    i = 0
    while proxy_count > 0:
        try:
            port = 8001 + i
            toxiproxy.create(
                name=f'proxy_{i}',
                upstream='target-server:80',
                enabled=True,
                listen=f'0.0.0.0:{port}',
            )
            proxies.append(f'http://localhost:{port}')
            proxy_count -= 1
        except Exception:  # the port may already be taken, try the next one
            pass
        i += 1
    print(f'Created {len(proxies)} proxies')
    return proxies


async def run() -> None:
    proxies = await setup_toxiproxy(1500)
    session_pool = SessionPool()
    http_client = HttpxHttpClient(
        headers={
            'accept-encoding': 'gzip, deflate, br, zstd',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
            'accept-language': 'en',
        },
    )
    crawler = ParselCrawler(
        http_client=http_client,
        concurrency_settings=ConcurrencySettings(min_concurrency=20, desired_concurrency=50),
        proxy_configuration=ProxyConfiguration(proxy_urls=proxies),
        max_requests_per_crawl=4000,
        session_pool=session_pool,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParsedHttpCrawlingContext) -> None:
        ip = context.parsed_content.jmespath('origin').get()
        memory = log_memory_usage()
        context.log.info(
            f'Processing {context.request.url} with response ip {ip}, {context.session.id} '
            f'{context.proxy_info.url} clients {len(http_client._client_by_proxy_url)} {memory:.2f} MB'
        )

    requests = [Request.from_url(url='http://httpbin.org/get', always_enqueue=True) for _ in range(5000)]
    await crawler.run(requests)


if __name__ == '__main__':
    asyncio.run(run())
```
Also, @janbuchar was right: the binding of `session_id` to `proxy_url` worked incorrectly. As a result, we were creating far more `httpx.AsyncClient` instances than necessary.
janbuchar
left a comment
Just a couple of notes, looks great as a whole
```python
'transport': _HttpxTransport(
    proxy=proxy_url, http1=self._http1, http2=self._http2, verify=self._verify
),
```
I checked now and it looks like the transport doesn't need to know the proxy URL since proxying is handled by the httpx.AsyncClient (it creates a separate transport for that).
```python
raise RuntimeError('Invalid state')
```

```diff
-if session_id is None or request is not None:
+if session_id is None:
```
Did the `request is not None` condition cause any trouble here?
The condition `session_id is None or request is None` will cause proxy rotation for:

```python
info = await config.new_proxy_info(session_id, None, None)
```
Sorry, I mean `request is not None` - the same as it was in the original code.
Yes, it caused the proxy to rotate on each request:

```python
info = await config.new_proxy_info(session_id, request, None)
```

This returned different proxies for the same session.
```diff
     assert info.url == proxy_urls[1]

-    info = await config.new_proxy_info(sessions[2], None, None)
+    info = await config.new_proxy_info(sessions[2], request, None)
```
But that way, we won't test the `request=None` case anywhere, right?
In this test there are 5 calls: 3 with `request=None` and 2 with a `request`. So we check both cases, because the presence of a request should not affect the proxy rotation; each new session gets its own proxy.
True. We should probably test requests with no session though, assuming that there is a use case for this. Maybe @barjin would know more? He wrote the original tiered proxies functionality for JS crawlee.
Added a similar test without the session.
Co-authored-by: Jan Buchar <[email protected]>
vdusek
left a comment
Nice! And thanks for adding the tests. Could you please also update the docstrings of `ProxyConfiguration.new_proxy_info` and `ProxyConfiguration.new_url` to explain the behavior when `None` values are passed in?
vdusek
left a comment
Sorry, last thing, I didn't express myself precisely enough before.
```python
    If called repeatedly with the same request, it is assumed that the request is being retried.
    If a previously used session ID is received, it will return the same proxy url.
    If neither session_id nor request is provided, a new proxy url will be returned on each call.
    """
```
Could you please add the `Args` section and describe all of them? It is probably one of the last public components where it is missing.
Same for the `new_url` method.
(I am sorry, I know you are not the author of that and these are not your changes.)
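For illustration, a hedged sketch of what such an `Args` section could look like, written as a Google-style docstring on an illustrative stub (the wording and the stub class are mine, not crawlee's actual signature or text):

```python
class ProxyConfigurationSketch:
    """Illustrative stub, not crawlee's actual ProxyConfiguration."""

    async def new_proxy_info(self, session_id, request, proxy_tier):
        """Return proxy information for the given context.

        Args:
            session_id: Session requesting the proxy. A previously seen session
                ID returns the same proxy URL; if None, a new proxy may be
                rotated in on each call.
            request: Request being processed. Repeated calls with the same
                request are treated as retries, which can raise the proxy tier.
            proxy_tier: Explicit tier to use instead of the automatically
                tracked one.

        Returns:
            Proxy information, or None when no proxies are configured.
        """
```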
Description

- Reduced the number of `httpx.AsyncClient` instances created by `HttpxHttpClient`
- Fixed the binding of `session_id` to `proxy_url`

Issues
Testing

Updated `tests/unit/proxy_configuration/test_tiers.py::test_successful_request_makes_tier_go_down`. In the previous version, when `session_id` was passed, we should not obtain a new proxy even if the tier changes. The updated test now aligns with both the documentation and the `test_retrying_request_makes_tier_go_up` test.
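As a rough illustration of the tier mechanics those tests cover (a hypothetical simplification, not crawlee's implementation): retrying the same request pushes the tier up, while fresh successful requests let it drift back down.

```python
class TierTracker:
    """Track a proxy tier: retries push it up, new successes decay it."""

    def __init__(self, max_tier: int) -> None:
        self._max_tier = max_tier
        self._tier = 0
        self._seen_requests: set[str] = set()

    def tier_for_request(self, request_id: str) -> int:
        if request_id in self._seen_requests:
            # Same request seen again -> it is being retried: escalate.
            self._tier = min(self._tier + 1, self._max_tier)
        else:
            self._seen_requests.add(request_id)
            # A fresh request -> assume things are going well, drift down.
            self._tier = max(self._tier - 1, 0)
        return self._tier


tracker = TierTracker(max_tier=3)
assert tracker.tier_for_request('r1') == 0
assert tracker.tier_for_request('r1') == 1  # retry -> tier goes up
assert tracker.tier_for_request('r1') == 2
assert tracker.tier_for_request('r2') == 1  # new successful request -> tier goes down
```

The fix discussed above means the tier changing must not hand a different proxy to a request that already carries a `session_id`.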