Improve the deduplication of requests #178
Closed

Labels: solutioning (The issue is not being implemented but only analyzed and planned.), t-tooling (Issues with this label are in the ownership of the tooling team.)
Context
A while ago, Honza Javorek raised some good points regarding the deduplication process in the request queue (#190).
The first one:

In response, we improved the unique key generation logic in the Python SDK (PR #193) to align with the TS Crawlee. This logic was later copied to crawlee-python and can be found in crawlee/_utils/requests.py.

The second one:
Currently, HTTP headers are not considered in the computation of unique keys. Additionally, we do not offer an option to explicitly bypass request deduplication, unlike the dont_filter option in Scrapy (docs).

Questions
- Should HTTP headers be considered in the unique_key and extended_unique_key computation?
- Should we offer a dont_filter feature?
- If so, what should it be called (e.g. always_enqueue)?
- Should use_extended_unique_key be set as the default behavior?
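A dedup-bypass option could be as simple as randomizing the unique key when the flag is set. The following is a toy in-memory sketch, not crawlee's API; always_enqueue is just one of the names floated above, and the queue is modeled as a plain set.

```python
from uuid import uuid4


def enqueue(seen_keys: set[str], url: str, *, always_enqueue: bool = False) -> bool:
    """Return True if the request was enqueued, False if filtered as a duplicate."""
    # When the caller opts out of deduplication, append a random suffix
    # so the key can never collide with a previously seen one.
    unique_key = f'{url}|{uuid4()}' if always_enqueue else url
    if unique_key in seen_keys:
        return False  # duplicate, filtered out
    seen_keys.add(unique_key)
    return True
```

This mirrors the effect of Scrapy's dont_filter: the request skips the duplicate filter entirely rather than changing how keys are computed for other requests.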