I'm trying to add response caching via hishel transports, but I don't see a way to customize the transport used by the Crawlee client, since it is created internally in _get_client():
    def _get_client(self, proxy_url: str | None) -> httpx.AsyncClient:
        """Helper to get a HTTP client for the given proxy URL.

        If the client for the given proxy URL doesn't exist, it will be created and stored.
        """
        if proxy_url not in self._client_by_proxy_url:
            # Prepare a default kwargs for the new client.
            kwargs: dict[str, Any] = {
                'transport': _HttpxTransport(
                    proxy=proxy_url,
                    http1=self._http1,
                    http2=self._http2,
                ),
                'proxy': proxy_url,
                'http1': self._http1,
                'http2': self._http2,
            }

            # Update the default kwargs with any additional user-provided kwargs.
            kwargs.update(self._async_client_kwargs)

            client = httpx.AsyncClient(**kwargs)
            self._client_by_proxy_url[proxy_url] = client

        return self._client_by_proxy_url[proxy_url]
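Reading the snippet above, the kwargs.update(self._async_client_kwargs) call runs after the defaults are built, so a 'transport' key in _async_client_kwargs would replace the default _HttpxTransport. Whether the HttpxHttpClient constructor actually forwards a transport keyword into _async_client_kwargs is an assumption the snippet does not confirm. Here is a minimal sketch of just that merge step, using plain stand-in values (build_client_kwargs and the string placeholders are hypothetical and not part of Crawlee):

```python
from typing import Any, Optional


def build_client_kwargs(proxy_url: Optional[str], user_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Mimic the kwargs merge from the quoted _get_client() with stand-in values."""
    kwargs: dict[str, Any] = {
        'transport': 'default-transport',  # stands in for _HttpxTransport(...)
        'proxy': proxy_url,
    }
    # Same step as in the quoted code: user-provided keys overwrite the defaults.
    kwargs.update(user_kwargs)
    return kwargs


# A user-supplied 'transport' replaces the default one.
merged = build_client_kwargs(None, {'transport': 'custom-transport'})

# Without user kwargs, the default transport is kept.
untouched = build_client_kwargs(None, {})
```

If that assumption holds, a hishel transport passed as a keyword argument might be enough, without subclassing.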
Is there a way to customize the httpx client transport that I'm not seeing?
Or, instead of using a third-party library, does Crawlee have a native way to store long-term, persistent caches of responses?
A somewhat related question, in case it's not possible to customize the transport: is overriding HttpxHttpClient._get_client() the recommended way to use a custom httpx client, or is there a cleaner way?
    hishel_client = await _create_hishel_client(cache_path)

    class HishelCacheClient(HttpxHttpClient):
        def _get_client(self, proxy_url: str | None) -> httpx.AsyncClient:
            return hishel_client

    http_client = HishelCacheClient()
    crawler = BeautifulSoupCrawler(
        http_client=http_client,
    )
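One trade-off of the override above is visible from the code itself: it ignores proxy_url, so every request shares one client, whereas the original method keeps one client per proxy URL. Here is a rough sketch of that difference, using hypothetical stand-in classes (BaseClient, SharedClient, and the object() placeholders are illustrative only, not Crawlee or httpx types):

```python
from typing import Optional


class BaseClient:
    """Hypothetical stand-in for HttpxHttpClient: one client per proxy URL."""

    def __init__(self) -> None:
        self._client_by_proxy_url: dict = {}

    def _get_client(self, proxy_url: Optional[str]) -> object:
        # Create and store a client the first time a proxy URL is seen.
        if proxy_url not in self._client_by_proxy_url:
            self._client_by_proxy_url[proxy_url] = object()
        return self._client_by_proxy_url[proxy_url]


shared_client = object()  # stands in for the hishel-backed httpx client


class SharedClient(BaseClient):
    """Mirrors the override above: always returns the same client."""

    def _get_client(self, proxy_url: Optional[str]) -> object:
        # proxy_url is ignored, so any per-proxy configuration is bypassed.
        return shared_client


base = BaseClient()
shared = SharedClient()
```

With the stand-ins, base hands out a different client for each proxy URL, while shared returns the same object regardless of the argument.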