feat: Add requests argument to EnqueueLinksFunction #1024
Conversation
src/crawlee/_types.py (outdated)
- """A function for enqueueing new URLs to crawl based on elements selected by a given selector."""
+ """A function for enqueueing new URLs to crawl based on elements selected by a given selector and explicit requests.
+
+ It adds explicitly passed requests to the `RequestManager`.
I can imagine at least two instances of confusing behavior:
1. you pass in `requests` and `selector`; the function enqueues the requests (even though their elements don't match the selector) and also extracts and enqueues additional links from the current page
2. you pass in `requests` and `transform_request_function`, but the function won't be called

I know that the docblock makes it pretty clear that this is what will happen, but I also know that there will be bug reports about this 😁
In my opinion, we should make two overloads and throw a runtime error if someone passes both `requests` and one or more of the other arguments.
Ok, so the function will basically become a two-in-one alias for:
- `add_requests(requests)`
- `add_requests(extract_requests(...))`
I agree with @janbuchar; better to be more strict here.
    label: str | None = None,
    user_data: dict[str, Any] | None = None,
    transform_request_function: Callable[[RequestOptions], RequestOptions | RequestTransformAction] | None = None,
    requests: Sequence[str | Request] | None = None,
I now noticed that the JS counterpart accepts just `urls` as an array of strings. We should either restrict this, or extend the JS version 🙂
If we choose to restrict this one, then most of the other parameters (barring `selector`) would actually start making sense in combination with `urls`.
I would prefer to keep it as it is for consistency, since we use `request: str | Request` everywhere else.
vdusek
left a comment
Shouldn't there be a new protocol `ExtractLinksFunction` in `src/crawlee/_types.py`?
I am not really sure that is necessary.
However,
Ok, so adding it to contexts as well. I updated the change. Is that what you had in mind, @vdusek?
vdusek
left a comment
LGTM, just regarding the docs...
We have the following doc pages that explain `enqueue_links`:
- Crawl all links on website
- Crawl specific links on website
- Crawl website with relative links
We should extend them to explain `extract_links` as well.
I added a whole new section with two code examples to Crawl specific links on website. The other two pages seemed to be focused on something slightly different, so explaining `extract_links` there might be a little distracting from the focus of the example.
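The contrast those doc examples draw could be sketched like this. The context here is a stub standing in for crawlee's crawling context; the real handler signature and methods may differ:

```python
import asyncio


class Context:
    """Stub standing in for crawlee's crawling context (illustration only)."""

    def __init__(self) -> None:
        self.queue: list[str] = []

    async def extract_links(self, *, selector: str = 'a') -> list[str]:
        # Real crawlee would parse the current page here.
        return ['https://example.com/a', 'https://example.com/b']

    async def enqueue_links(self, *, selector: str = 'a') -> None:
        self.queue.extend(await self.extract_links(selector=selector))


async def handler(context: Context) -> None:
    # enqueue_links: extract and enqueue in one step.
    await context.enqueue_links(selector='a')
    # extract_links: inspect the links first, enqueue only some of them.
    links = await context.extract_links(selector='a')
    context.queue.extend(url for url in links if url.endswith('/a'))
```

The second pattern is the point of the split: the caller gets to filter between extraction and enqueueing.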
Description
- Add `requests` argument to `EnqueueLinksFunction`.
- Split `EnqueueLinksFunction` implementations into `extract_links` and `add_requests`.
- Add overload variants of `EnqueueLinksFunction`.
- Raise an error in `EnqueueLinksFunction` implementations if called with mutually exclusive arguments.

Relates to: #906