Improve unique key generation logic by vdusek · Pull Request #193 · apify/apify-sdk-python

vdusek · 2024-03-11T17:01:18Z

Description

Request Queue

Updates the RequestQueue.add_request method by improving its unique key generation logic.
- The logic is exposed via new optional parameters of add_request.
I also extended the docstring to make clear what the method does (mainly regarding deduplication).
The PR introduces 3 new helper functions in apify/_utils.py:
- get_short_base64_hash, based on _hashPayload.
- normalize_url, based on normalizeUrl.
- compute_unique_key, based on _computeUniqueKey.
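
As a rough illustration of how three helpers like these could fit together, here is a minimal sketch using only the standard library. The `sketch_*` names, the SHA-256 choice, the hash length, and the `METHOD(hash):url` key format are assumptions made for this example; this is not the code in apify/_utils.py.

```python
from __future__ import annotations

import base64
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse


def sketch_short_base64_hash(data: bytes, length: int = 8) -> str:
    """Return the first `length` characters of a URL-safe base64 SHA-256 digest."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).decode('ascii')[:length]


def sketch_normalize_url(url: str, keep_url_fragment: bool = False) -> str:
    """Drop 'utm_' parameters, sort the remaining ones, lowercase the host, trim the trailing slash."""
    parsed = urlparse(url.strip())
    params = [(k, v) for k, v in parse_qsl(parsed.query) if not k.startswith('utm_')]
    normalized = parsed._replace(
        scheme=parsed.scheme.lower(),
        netloc=parsed.netloc.lower(),
        path=parsed.path.rstrip('/'),
        query=urlencode(sorted(params)),
        fragment=parsed.fragment if keep_url_fragment else '',
    )
    return normalized.geturl()


def sketch_compute_unique_key(
    url: str,
    method: str = 'GET',
    payload: bytes | None = None,
    use_extended_unique_key: bool = False,
) -> str:
    """Build a deduplication key from the URL, optionally extended by method and payload hash."""
    normalized_url = sketch_normalize_url(url)
    if not use_extended_unique_key:
        # Assumed default: the key is just the normalized URL.
        return normalized_url
    payload_hash = sketch_short_base64_hash(payload) if payload else ''
    return f'{method.upper()}({payload_hash}):{normalized_url}'
```

With the extended key, two POST requests to the same URL but with different bodies get different keys, which is the case the URL-only key cannot distinguish.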

Scrapy integration

Uses the new Request Queue functionality in the Scrapy integration.
Also adds support for the scrapy.Request.dont_filter field in to_apify_request.
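
One plausible way to honor dont_filter is to give such requests a unique key that can never collide with an earlier one, so that queue deduplication does not drop them. The sketch below shows the idea with a plain dataclass standing in for scrapy.Request. `FakeScrapyRequest`, `sketch_unique_key_for`, the `METHOD:url` key format, and the random UUID are all assumptions for illustration and do not describe what to_apify_request actually does.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass
class FakeScrapyRequest:
    """Hypothetical stand-in for scrapy.Request with only the fields used here."""

    url: str
    method: str = 'GET'
    dont_filter: bool = False


def sketch_unique_key_for(request: FakeScrapyRequest) -> str:
    if request.dont_filter:
        # A random key never matches an earlier one, so the request is not deduplicated.
        return uuid.uuid4().hex
    # Otherwise identical method + URL pairs share a key and are deduplicated.
    return f'{request.method.upper()}:{request.url}'


requests = [FakeScrapyRequest('https://httpbin.org/get', dont_filter=True) for _ in range(3)]
unique_keys = {sketch_unique_key_for(r) for r in requests}
print(len(unique_keys))
```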

Issues

Closing both of these issues:
- Check Request class in Crawlee and replicate uniqueKey generation logic #141
  - Technical debt in Request Queue.
- Scrapy integration silently does just one POST request #190
  - Bug in the Scrapy integration.

Testing

Unit tests

The new code is covered by unit tests.

Manual testing / execution

The YieldPostSpider tests the case of multiple POST requests to the same URL.

It's copied from #issuecomment-1978609687 - thank you @honzajavorek!

# spiders/yield_post.py
import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class YieldPostSpider(BaseSpider):
    name = 'yield-post'

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                'https://httpbin.org/post',
                method='POST',
                body=json.dumps(dict(code=f'CODE{number:0>4}', rate=number)),
                headers={'Content-Type': 'application/json'},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())['data'])
        yield data
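
To see why these three POST requests matter for deduplication, the following sketch compares a URL-only key with a key that also includes the method and a hash of the body. The `url_only_key`, `extended_key`, and `count_enqueued` helpers and the key formats are hypothetical and only illustrate the idea; they are not part of the SDK.

```python
import hashlib
import json
from typing import Callable


def url_only_key(url: str, method: str, body: bytes) -> str:
    # Every request to the same URL gets the same key.
    return url


def extended_key(url: str, method: str, body: bytes) -> str:
    # The key also depends on the method and on a short hash of the body.
    body_hash = hashlib.sha256(body).hexdigest()[:8]
    return f'{method}({body_hash}):{url}'


def count_enqueued(key_fn: Callable[[str, str, bytes], str]) -> int:
    """Count how many of the spider's three POST requests survive key-based deduplication."""
    seen = set()
    for number in range(3):
        body = json.dumps(dict(code=f'CODE{number:0>4}', rate=number)).encode()
        seen.add(key_fn('https://httpbin.org/post', 'POST', body))
    return len(seen)


print(count_enqueued(url_only_key))  # 1
print(count_enqueued(extended_key))  # 3
```

With the URL-only key, only one of the three requests would be kept, which matches the symptom described in #190.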

The DontFilterSpider tests the case of Scrapy requests with the dont_filter option.

It's copied from issuecomment-1978953372 - thank you @honzajavorek!

from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class DontFilterSpider(BaseSpider):
    name = 'dont-filter'

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request('https://httpbin.org/get', method='GET', dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {'something': True}

And src/main.py to execute these spiders with Apify:

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

from .spiders.yield_post import YieldPostSpider as Spider
# from .spiders.dont_filter import DontFilterSpider as Spider


async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Apply Apify settings, it will override the Scrapy project settings
        settings = apply_apify_settings()

        # Execute the spider using Scrapy CrawlerProcess
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()

Execute it using Scrapy:

scrapy crawl 'dont-filter' -o dont_filter_output.json

scrapy crawl 'yield-post' -o yield_post_output.json

And using Apify (you need to change the Spider import in main.py manually):

apify run --purge

And it produces the same output 🎉.

vdusek · 2024-03-11T18:04:22Z

@fnesveda Just FYI; not requesting a review from you, taking into account the current situation, so I added Jirka as a reviewer instead.

jirimoravcik · 2024-03-12T10:20:23Z

src/apify/_utils.py

+        parsed_url = urlparse(url.strip())
+        search_params = dict(parse_qsl(parsed_url.query))  # Convert query to a dict
+
+        # Remove any 'utm_' parameters


Shouldn't this be more generic? I'd say utm_ isn't the only tracking parameter.

Basically, it's the same as in the #193 (comment).

@B4nan Should we add the removal of more tracking parameters (if there are any, TBH I don't know)? Or rather keep the parity.

I'm convinced that this is way out of scope of the PR.

Same as with the other problem: I'd say we want consistency now. If we change something, it should be changed in the JS version too (which we surely can do as well).

@jirimoravcik do you have some suggestions on what else to ignore? I don't think we can find a "more generic solution", but we can blacklist more common parameters (at the same time we need to be careful to not blacklist something that might be used for other purposes).

I'm convinced that this is way out of scope of the PR.

Indeed, if we want to change this, let's just create an issue for it and keep the same behavior as the JS version.

Yeah, I just wanted to point it out for discussion. E.g. for tracking, there's https://easylist.to/easylist/easyprivacy.txt
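
The blacklist idea discussed in this thread could look roughly like the sketch below. The `strip_tracking_params` helper and its `prefixes` / `names` parameters are hypothetical, and `fbclid` and `gclid` appear only as examples of extra parameters a caller might choose to drop. The PR itself only removes 'utm_' parameters, to keep parity with the JS version.

```python
from __future__ import annotations

from urllib.parse import parse_qsl, urlencode, urlparse


def strip_tracking_params(
    url: str,
    prefixes: tuple[str, ...] = ('utm_',),
    names: frozenset[str] = frozenset(),
) -> str:
    """Remove query parameters whose name starts with a given prefix or is explicitly listed."""
    parsed = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.startswith(prefixes) and key not in names
    ]
    return parsed._replace(query=urlencode(kept)).geturl()


# Default behavior: only 'utm_' parameters are removed.
print(strip_tracking_params('https://example.com/?utm_source=a&id=1'))

# Opt-in blacklist of additional parameter names.
print(strip_tracking_params('https://example.com/?fbclid=x&id=1', names=frozenset({'fbclid', 'gclid'})))
```

Keeping the extra names opt-in reflects the concern raised above: a parameter that looks like tracking on one site may carry meaning on another.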

janbuchar

LGTM


Improve unique key generation logic

30c44cf

github-actions bot assigned vdusek Mar 11, 2024

github-actions bot added this to the 85th sprint - Tooling team milestone Mar 11, 2024

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Mar 11, 2024


Changelog and version 1.7.0

47709d9


add use_extended_unique_key as default for Scrapy

eaac2ac

vdusek requested review from janbuchar and jirimoravcik March 11, 2024 17:55

jirimoravcik approved these changes Mar 12, 2024


id_ to request_id

eafc68a

janbuchar approved these changes Mar 12, 2024


caller responsible for handling the exception

5593637

vdusek force-pushed the unique-key-fix branch from a859fb8 to 5593637 on March 12, 2024 14:44

vdusek added 2 commits March 12, 2024 15:56

improving compute_short_hash

82aae6b

just fix the docstring...

4cd9cee

vdusek merged commit 49265e8 into master Mar 12, 2024

vdusek deleted the unique-key-fix branch March 12, 2024 15:20

This was referenced Mar 12, 2024

Check Request class in Crawlee and replicate uniqueKey generation logic #141

Closed

Scrapy integration silently does just one POST request #190

Closed

vdusek mentioned this pull request Jun 10, 2024

Improve the deduplication of requests apify/crawlee-python#178

Closed


vdusek merged 7 commits into master from unique-key-fix


4 participants
