@fnesveda Just FYI: given the current situation, I'm not requesting a review from you, so I added Jirka as a reviewer instead.
src/apify/_utils.py
parsed_url = urlparse(url.strip())
search_params = dict(parse_qsl(parsed_url.query))  # Convert query to a dict

# Remove any 'utm_' parameters
Shouldn't this be more generic? I'd say utm_ isn't the only tracking parameter.
Basically, it's the same as in #193 (comment).
@B4nan Should we add the removal of more tracking parameters (if there are any; TBH I don't know), or rather keep the parity?
I'm convinced that this is way out of scope of the PR.
Same as with the other problem: I'd say we want consistency now. If we should change something, it should be changed in the JS version too (which we surely can do as well).
@jirimoravcik do you have some suggestions on what else to ignore? I don't think we can find a "more generic solution", but we can blacklist more common parameters (at the same time, we need to be careful not to blacklist something that might be used for other purposes).
> I'm convinced that this is way out of scope of the PR.
Indeed. If we want to change this, let's just create an issue for it and keep the same behavior as the JS version.
Yeah, I just wanted to point it out for discussion. E.g. for tracking, there's https://easylist.to/easylist/easyprivacy.txt
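For reference, a minimal sketch of the kind of URL normalization this thread is about. It is not the code from the PR: the `tracking_prefixes` and `keep_url_fragment` parameters are hypothetical, added only to illustrate how a blocklist could be extended later if an issue is created for it. The default keeps the current behavior of stripping only `utm_` parameters.

```python
from typing import Tuple
from urllib.parse import parse_qsl, urlencode, urlparse


def normalize_url(
    url: str,
    *,
    keep_url_fragment: bool = False,              # hypothetical parameter
    tracking_prefixes: Tuple[str, ...] = ('utm_',),  # hypothetical parameter
) -> str:
    """Return a canonical form of the URL (illustrative sketch only)."""
    parsed_url = urlparse(url.strip())
    search_params = parse_qsl(parsed_url.query)  # list of (key, value) pairs

    # Drop tracking parameters, then sort so parameter order does not matter.
    kept = sorted(
        (key, value)
        for key, value in search_params
        if not key.startswith(tracking_prefixes)
    )
    query = urlencode(kept)

    scheme = parsed_url.scheme.lower()
    netloc = parsed_url.netloc.lower()
    path = parsed_url.path.rstrip('/')

    result = f'{scheme}://{netloc}{path}'
    if query:
        result += f'?{query}'
    if keep_url_fragment and parsed_url.fragment:
        result += f'#{parsed_url.fragment}'
    return result


print(normalize_url('https://Example.com/path/?utm_source=x&b=2&a=1#frag'))
# → https://example.com/path?a=1&b=2
```

Passing something like `tracking_prefixes=('utm_', 'fbclid')` would widen the filter, but as noted above, any such list risks removing parameters that a site uses for other purposes.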
Description
Request Queue
- RequestQueue.add_request method by improving its unique key generation logic.
- add_request.
- apify/_utils.py:
  - get_short_base64_hash based on the _hashPayload.
  - normalize_url based on the normalizeUrl.
  - compute_unique_key based on the _computeUniqueKey.
Scrapy integration
- scrapy.Request.dont_filter field in the to_apify_request.
Issues
- uniqueKey generation logic #141
Testing
Unit tests
Manual testing / execution
The YieldPostSpider tests the case of POST requests to the same URL.
The DontFilterSpider tests the case of a Scrapy request with the dont_filter option.
And src/main.py to execute these spiders with Apify:
Execute it using Scrapy:
And using Apify (need to change the Spider in main.py manually):
And it produces the same output 🎉.
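To illustrate why the two spiders behave as described, here is a rough sketch of the unique key idea. It is an assumption-laden illustration, not the implementation from this PR: the `METHOD(hash):url` key format, the `dont_filter` keyword on `compute_unique_key`, and the use of `secrets.token_hex` are all hypothetical, and `url.strip()` stands in for the real URL normalization.

```python
import base64
import hashlib
import secrets
from typing import Optional


def get_short_base64_hash(data: bytes, length: int = 8) -> str:
    """Short SHA-256 based hash with the '+', '/' and '=' characters removed."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.b64encode(digest).decode('ascii')
    for char in '+/=':
        encoded = encoded.replace(char, '')
    return encoded[:length]


def compute_unique_key(
    url: str,
    method: str = 'GET',
    payload: Optional[bytes] = None,
    *,
    dont_filter: bool = False,  # hypothetical flag mirroring scrapy.Request.dont_filter
) -> str:
    """Build a deduplication key for a request (illustrative sketch only)."""
    if dont_filter:
        # Assumption: a random key means the request is never treated as a duplicate.
        return secrets.token_hex(8)

    normalized_url = url.strip()  # stand-in for the real normalization step
    if payload is None:
        return normalized_url

    # Assumed format: include the method and a payload hash so that requests
    # to the same URL with different bodies get different keys.
    return f'{method.upper()}({get_short_base64_hash(payload)}):{normalized_url}'


# Two POST requests to the same URL with different payloads (the YieldPostSpider case).
key_a = compute_unique_key('https://example.com/api', 'POST', b'page=1')
key_b = compute_unique_key('https://example.com/api', 'POST', b'page=2')
print(key_a != key_b)

# The same request repeated with dont_filter (the DontFilterSpider case).
print(compute_unique_key('https://example.com', dont_filter=True))
```

With keys built only from the URL, the second POST request would be dropped as a duplicate. Including the payload hash is what lets both be enqueued.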