feat: Capture statistics about the crawler run #142
Conversation
None of the pull request and linked issue has estimate
Force-pushed 37b6be3 to 8792c37
vdusek
left a comment
Good job! Just a few comments in the code and two questions:
- What about defining a user-friendly `__repr__` for `FinalStatistics`?
- I could not make the periodic logger log anything, could you please provide me an example?
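Regarding the first question, a minimal sketch of what a user-friendly `__repr__` could look like, assuming `FinalStatistics` is a dataclass with fields like those in the JSON dumps later in this thread (the actual crawlee class may have more fields):

```python
from dataclasses import dataclass, fields


@dataclass
class FinalStatistics:
    # Field names assumed from the persisted JSON shown in this thread;
    # the real class likely has more fields.
    requests_finished: int = 0
    requests_failed: int = 0
    crawler_runtime: float = 0.0

    def __repr__(self) -> str:
        # @dataclass keeps a user-defined __repr__ instead of generating one.
        parts = ', '.join(f'{f.name}={getattr(self, f.name)!r}' for f in fields(self))
        return f'{type(self).__name__}({parts})'


print(repr(FinalStatistics(requests_finished=16, crawler_runtime=1.84)))
# FinalStatistics(requests_finished=16, requests_failed=0, crawler_runtime=1.84)
```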
vdusek
left a comment
Nice, thanks.
One last thing, just thinking: what about `request_avg_failed_duration` (or finished) when no request failed?
Do we really want to use `timedelta.max`? 🤔
{
"requests_finished": 16,
"requests_failed": 0,
"retry_histogram": [
16
],
"request_avg_failed_duration": 86400000000000.0,
"request_avg_finished_duration": 0.192724,
"requests_finished_per_minute": 521,
"requests_failed_per_minute": 0,
"request_total_duration": 3.083576,
"requests_total": 16,
"crawler_runtime": 1.844269
}
Tough one. I guess `None` would make the most sense here, is that right? We could use that internally and only have infinity in the persisted JSON. That way, `timedelta.max` would no longer be necessary.
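That split could look roughly like this sketch: keep `None` internally when no requests have failed, and convert it to infinity only when writing the persisted JSON (`to_persisted_json` and the exact field handling here are hypothetical, not crawlee's actual API):

```python
import json
import math

# Internal state: averages stay None when there is nothing to average,
# so timedelta.max is never needed anywhere in the code.
stats = {
    'requests_failed': 0,
    'request_avg_failed_duration': None,  # no failed requests yet
    'request_avg_finished_duration': 0.192724,
}


def to_persisted_json(data: dict) -> str:
    # Only at serialization time do the missing averages become infinity.
    serializable = {
        key: math.inf if value is None else value
        for key, value in data.items()
    }
    return json.dumps(serializable, indent=2)


print(to_persisted_json(stats))
```

Note that Python's `json` module serializes `math.inf` as the non-standard literal `Infinity`; whether that is acceptable for the persisted file is part of the open question here.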
vdusek
left a comment
LGTM!
Tested with:
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.enqueue_strategy import EnqueueStrategy
from crawlee.storages import Dataset, RequestQueue

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    rq = await RequestQueue.open()
    await rq.add_request('https://crawlee.dev')
    dataset = await Dataset.open()
    crawler = BeautifulSoupCrawler(request_provider=rq)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_DOMAIN)
        record = {
            'title': context.soup.title.text if context.soup.title else '',
            'url': context.request.url,
        }
        await dataset.push_data(record)

    stats = await crawler.run()
    print(stats)


if __name__ == '__main__':
    asyncio.run(main())

Output:
$ python run.py
INFO:crawlee.autoscaling.snapshotter:Setting max_memory_size of this run to 3.84 GB.
INFO:crawlee.statistics.statistics:crawlee.basic_crawler.basic_crawler request statistics {
"requests_finished": 0,
"requests_failed": 0,
"retry_histogram": [
0
],
"request_avg_failed_duration": null,
"request_avg_finished_duration": null,
"requests_finished_per_minute": 0,
"requests_failed_per_minute": 0,
"request_total_duration": 0.0,
"requests_total": 0,
"crawler_runtime": 0.010923
}
INFO:crawlee.autoscaling.autoscaled_pool:current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
INFO:httpx:HTTP Request: GET https://crawlee.dev "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/javascript-rendering "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/typescript-project "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/avoid-blocking "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/cheerio-crawler-guide "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/jsdom-crawler-guide "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/core/class/AutoscaledPool "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/proxy-management "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/result-storage "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/request-storage "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/utils/namespace/social "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/utils "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/quick-start "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/deployment/aws-cheerio "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/deployment/gcp-cheerio "HTTP/1.1 200 OK"
INFO:crawlee.autoscaling.autoscaled_pool:Waiting for remaining tasks to finish
{
"requests_finished": 16,
"requests_failed": 0,
"retry_histogram": [
16
],
"request_avg_failed_duration": null,
"request_avg_finished_duration": 0.096618,
"requests_finished_per_minute": 936,
"requests_failed_per_minute": 0,
"request_total_duration": 1.545896,
"requests_total": 16,
"crawler_runtime": 1.02534
}
TODO
- `Statistics` class
- `BasicCrawler`