feat: add max requests per crawl to BasicCrawler#198

Merged
vdusek merged 1 commit into master from add-max-requests-per-crawl
Jun 25, 2024

Conversation

@vdusek
Collaborator

@vdusek vdusek commented Jun 19, 2024

Description

  • Add a max_requests_per_crawl option to BasicCrawler that caps the number of requests handled in a single crawl.

Related issues

Testing

  • A new unit test was implemented

Code sample for manual replication:

import asyncio
import logging

from crawlee.autoscaling import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        concurrency_settings=ConcurrencySettings(max_concurrency=1),
        max_requests_per_crawl=4,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links()
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
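
To make it easier to see what the sample above should demonstrate, here is a minimal sketch of how a per-crawl request cap can behave. It is not Crawlee's implementation: `TinyCrawler`, `FAKE_SITE`, and `crawl` are hypothetical names invented for illustration, the "web" is an in-memory dict, and a single worker is assumed (mirroring `max_concurrency=1` in the sample). The real crawler may differ, especially under concurrency.

```python
import asyncio
from collections import deque
from typing import Optional


class TinyCrawler:
    """Hypothetical stand-in for a crawler; NOT Crawlee's BasicCrawler."""

    def __init__(
        self,
        links: dict[str, list[str]],
        max_requests_per_crawl: Optional[int] = None,
    ) -> None:
        self._links = links  # fake "web": URL -> URLs found on that page
        self._max_requests = max_requests_per_crawl  # None means no cap
        self.handled: list[str] = []

    async def _handle(self, url: str) -> list[str]:
        await asyncio.sleep(0)  # placeholder for real network I/O
        return self._links.get(url, [])

    async def run(self, start_urls: list[str]) -> list[str]:
        queue = deque(start_urls)
        seen = set(start_urls)
        while queue:
            # Check the cap before taking the next request off the queue.
            if self._max_requests is not None and len(self.handled) >= self._max_requests:
                break
            url = queue.popleft()
            # Enqueue newly discovered links, skipping ones already seen.
            for link in await self._handle(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            self.handled.append(url)
        return self.handled


# Assumed toy site with six reachable pages.
FAKE_SITE = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/c', 'https://example.com/d'],
    'https://example.com/b': ['https://example.com/e'],
}


def crawl(limit: Optional[int]) -> list[str]:
    """Run a fresh TinyCrawler over FAKE_SITE with the given cap."""
    crawler = TinyCrawler(FAKE_SITE, max_requests_per_crawl=limit)
    return asyncio.run(crawler.run(['https://example.com/']))
```

In this sketch, `crawl(4)` stops after four handled URLs even though more links are still queued, while `crawl(None)` visits every reachable page. Checking the cap before dequeuing, rather than after, means enqueued-but-unhandled links simply stay in the queue.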

Checklist

  • Changes are described in the CHANGELOG.md
  • CI passed

@github-actions github-actions bot added this to the 92nd sprint - Tooling team milestone Jun 19, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Jun 19, 2024
@vdusek vdusek marked this pull request as ready for review June 20, 2024 12:14
@vdusek vdusek requested a review from janbuchar June 20, 2024 12:15
Collaborator

@janbuchar janbuchar left a comment

Except for that default, this looks solid!

@vdusek vdusek requested a review from janbuchar June 20, 2024 12:49
@vdusek vdusek requested a review from janbuchar June 20, 2024 13:09
@vdusek vdusek force-pushed the add-max-requests-per-crawl branch from a3c114c to bb6d121 Compare June 20, 2024 15:45
@vdusek
Collaborator Author

vdusek commented Jun 20, 2024

The tests are failing or timing out. Let's address #194 first and come back to this once it's resolved.

@vdusek vdusek force-pushed the add-max-requests-per-crawl branch 2 times, most recently from 21073fe to 9b6565b Compare June 24, 2024 12:22
@vdusek vdusek force-pushed the add-max-requests-per-crawl branch from 9b6565b to 8f9f136 Compare June 24, 2024 14:48
@vdusek vdusek merged commit b5b3053 into master Jun 25, 2024
@vdusek vdusek deleted the add-max-requests-per-crawl branch June 25, 2024 06:33
@janbuchar janbuchar mentioned this pull request Jun 26, 2024
Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.

2 participants