fix: Allow config change on PlaywrightCrawler #1186
Conversation
janbuchar
left a comment
Nice catch, thank you!
```python
async def test_overwrite_configuration() -> None:
    """Check that the configuration is allowed to be passed to the PlaywrightCrawler."""
    configuration = Configuration(default_dataset_id='my_dataset_id')
    PlaywrightCrawler(configuration=configuration)
```
Could you please add an assert that checks that such a configuration is actually set when passed to the PlaywrightCrawler?
You can use this:

```python
from crawlee import service_locator

# ...

async def test_overwrite_configuration() -> None:
    """Check that the configuration is allowed to be passed to the PlaywrightCrawler."""
    configuration = Configuration(default_dataset_id='my_dataset_id')
    PlaywrightCrawler(configuration=configuration)

    used_configuration = service_locator.get_configuration()
    assert used_configuration is configuration
```
Perfect, nice! Thanks for thinking along! 🙌
I had to push two empty commits in order to make the tests rerun and all pass. They seem a bit flaky (every commit fails a different test). No idea if this is expected or not? Also no idea if my change is causing this, it doesn't feel like it should be? WDYT?
Sorry for that. I have checked, and both failures seem to be somehow related to
Unfortunately it does not work if you first create a browser pool, because
Thanks for the heads up @sherpya! Do you have an executable code snippet that would reproduce the issue?
```python
import os
from datetime import timedelta

from crawlee import ConcurrencySettings, service_locator
from crawlee.browsers import BrowserPool
from crawlee.configuration import Configuration
from crawlee.crawlers import PlaywrightCrawler

from .request import RequestType, make_request
from .routes import router
from ..settings import settings

__all__ = ['WACrawler']


class WACrawler(PlaywrightCrawler):
    def __init__(self, **kwargs):
        # service_locator.set_configuration(
        #     Configuration(persist_storage=False, default_request_queue_id=str(os.getpid())))
        browser_pool = BrowserPool.with_default_plugin(
            headless=kwargs.pop('headless', None),
            browser_type='chromium',
            fingerprint_generator=None,
            browser_launch_options=kwargs.pop('browser_launch_options', None),
            browser_new_context_options=kwargs.pop('browser_new_context_options', None),
            use_incognito_pages=False,
            operation_timeout=timedelta(milliseconds=settings.TIMEOUT),
        )
        concurrency_settings = ConcurrencySettings(max_concurrency=1)
        super().__init__(
            browser_pool=browser_pool,
            concurrency_settings=concurrency_settings,
            request_handler=router,
            request_handler_timeout=timedelta(milliseconds=settings.TIMEOUT),
            configuration=Configuration(persist_storage=False, default_request_queue_id=str(os.getpid())),
            **kwargs,
        )
```

The commented-out code is what I use right now.
I think that you should be able to circumvent this if you set the configuration before the creation of the `BrowserPool`.
Yes, that's what I do; look at the commented code above.
That seems to work, so have you found any limitations to the mentioned approach?
No, it works well, but not in the way this merge should make it work.
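The ordering constraint discussed above can be sketched with a toy model (the class and attribute names here are hypothetical and only loosely mirror crawlee's internals): a global locator hands out the configuration, and once any component has read it, later overrides are rejected. That is why the configuration must be set before anything that consumes it, such as a browser pool, is constructed.

```python
class ServiceLocator:
    """Toy model of a global service locator (illustrative only,
    not the real crawlee implementation)."""

    def __init__(self):
        self._configuration = None
        self._configuration_was_read = False

    def set_configuration(self, configuration):
        # Once a consumer has read the configuration, replacing it
        # would leave that consumer with a stale copy, so refuse.
        if self._configuration_was_read:
            raise RuntimeError('Configuration is already in use and cannot be replaced.')
        self._configuration = configuration

    def get_configuration(self):
        self._configuration_was_read = True
        return self._configuration


service_locator = ServiceLocator()

# Setting the configuration first works:
service_locator.set_configuration({'persist_storage': False})
pool_config = service_locator.get_configuration()  # what a browser pool would read

# ...but overriding it afterwards fails:
try:
    service_locator.set_configuration({'persist_storage': True})
except RuntimeError as err:
    print(err)
```

Under this model, the commented-out `set_configuration` call at the top of `__init__` works precisely because it runs before `BrowserPool` reads the configuration, while passing `configuration=` to `super().__init__()` happens after the pool already exists.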
Description

It is not possible to apply `Configuration` to the Playwright crawler (see issue). This is a proposal for a solution.

Issues
Testing
Added a test that failed before the fix but passes now.
Checklist