Add enqueue_links helper #5
Closed
Labels
enhancement: New feature or request.
t-tooling: Issues with this label are in the ownership of the tooling team.
We should provide a similar helper to what we have in crawlee.
https://crawlee.dev/api/core/function/enqueueLinks
In a nutshell, there is a base implementation, which takes a list of URLs, filters them based on the provided options (e.g. globs/regexps or the enqueue strategies), and adds them to the request queue (RQ). Then we have contextual helpers in each crawler, e.g. CheerioCrawler has its own context-aware variant, which operates on the current page and automatically finds all the links (matching the `selector` option, which defaults to just `a`).

The enqueuing strategies are described here:
https://crawlee.dev/api/core/enum/EnqueueStrategy
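
To make the strategy idea concrete, here is a rough Python sketch of how a strategy check could look. Everything in it is an assumption for illustration: the enum member names are loosely modelled on the linked docs, `matches_strategy` is a hypothetical name, and the same-domain strategy is left out because it would need a public suffix list.

```python
from enum import Enum
from urllib.parse import urlsplit


class EnqueueStrategy(Enum):
    """Hypothetical subset of the strategies described in the linked docs."""

    ALL = "all"
    SAME_HOSTNAME = "same-hostname"
    SAME_ORIGIN = "same-origin"


def matches_strategy(url: str, base_url: str, strategy: EnqueueStrategy) -> bool:
    """Return True if `url` may be enqueued from a page at `base_url`."""
    if strategy is EnqueueStrategy.ALL:
        return True

    target, base = urlsplit(url), urlsplit(base_url)
    if strategy is EnqueueStrategy.SAME_HOSTNAME:
        return target.hostname == base.hostname

    # SAME_ORIGIN: scheme, hostname and explicit port must all match.
    # Simplification: default ports are not normalized, so an explicit
    # ":443" is treated as different from no port at all.
    return (target.scheme, target.hostname, target.port) == (
        base.scheme,
        base.hostname,
        base.port,
    )
```

With this sketch, `http://example.com/a` passes `SAME_HOSTNAME` against `https://example.com/` but fails `SAME_ORIGIN`, because the scheme differs.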
We should first come up with basic support for autoscaling, and have `BasicCrawler` and `BeautifulsoupCrawler` classes.

We could start with a simple variant that only works with regexps, and add more features/options going forward.
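
A minimal sketch of what the regexp-only variant might look like, just to anchor the discussion. The names are assumptions, not an agreed API: `InMemoryRequestQueue` is a hypothetical stand-in for the real RQ, and the `include` / `exclude` parameters are placeholders for whatever options we settle on.

```python
import re
from typing import Iterable, Optional


class InMemoryRequestQueue:
    """Hypothetical stand-in for the real request queue (RQ)."""

    def __init__(self) -> None:
        self._seen: set = set()
        self.pending: list = []

    def add_request(self, url: str) -> bool:
        """Add a URL unless it was added before; return True if it was new."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self.pending.append(url)
        return True


def enqueue_links(
    urls: Iterable[str],
    queue: InMemoryRequestQueue,
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> list:
    """Filter `urls` by regexps and add the survivors to `queue`.

    A URL is kept if it matches at least one `include` pattern (or no
    include patterns were given) and matches no `exclude` pattern.
    Returns the URLs that were newly added to the queue.
    """
    include_patterns = [re.compile(p) for p in include or []]
    exclude_patterns = [re.compile(p) for p in exclude or []]

    enqueued = []
    for url in urls:
        if include_patterns and not any(p.search(url) for p in include_patterns):
            continue
        if any(p.search(url) for p in exclude_patterns):
            continue
        if queue.add_request(url):
            enqueued.append(url)
    return enqueued
```

A context-aware variant in a crawler could then collect the `href` values from the current page and pass them to this base function, so globs and strategies can be added later without touching the crawlers.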