feat: Capture statistics about the crawler run #142
Conversation
None of the pull request and linked issue has estimate
Force-pushed 37b6be3 to 8792c37
vdusek
left a comment
Good job! Just a few comments in the code and two questions:
- What about defining a user-friendly `__repr__` for `FinalStatistics`?
- I could not make the periodic logger log anything, could you please provide me an example?
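Regarding the first question, a minimal sketch of what a user-friendly `__repr__` could look like, assuming `FinalStatistics` is a dataclass with fields like those in the JSON dumps later in this thread (the actual crawlee class may have more fields):

```python
from dataclasses import dataclass, fields


@dataclass
class FinalStatistics:
    # Field names assumed from the persisted JSON shown in this thread;
    # the real class likely has more fields.
    requests_finished: int = 0
    requests_failed: int = 0
    crawler_runtime: float = 0.0

    def __repr__(self) -> str:
        # @dataclass keeps a user-defined __repr__ instead of generating one.
        parts = ', '.join(f'{f.name}={getattr(self, f.name)!r}' for f in fields(self))
        return f'{type(self).__name__}({parts})'


print(repr(FinalStatistics(requests_finished=16, crawler_runtime=1.84)))
# FinalStatistics(requests_finished=16, requests_failed=0, crawler_runtime=1.84)
```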
vdusek
left a comment
Nice, thanks.
One last thing, just thinking: what about `request_avg_failed_duration` (or finished) when no request failed?
Do we really want to use `timedelta.max`? 🤔
{
"requests_finished": 16,
"requests_failed": 0,
"retry_histogram": [
16
],
"request_avg_failed_duration": 86400000000000.0,
"request_avg_finished_duration": 0.192724,
"requests_finished_per_minute": 521,
"requests_failed_per_minute": 0,
"request_total_duration": 3.083576,
"requests_total": 16,
"crawler_runtime": 1.844269
}
Tough one. I guess `None` would make the most sense here, is that right? We could use that internally and only have infinity in the persisted JSON. That way, `timedelta.max` would no longer be necessary.
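That split could look roughly like this sketch: keep `None` internally when no requests have failed, and convert it to infinity only when writing the persisted JSON (`to_persisted_json` and the exact field handling here are hypothetical, not crawlee's actual API):

```python
import json
import math

# Internal state: averages stay None when there is nothing to average,
# so timedelta.max is never needed anywhere in the code.
stats = {
    'requests_failed': 0,
    'request_avg_failed_duration': None,  # no failed requests yet
    'request_avg_finished_duration': 0.192724,
}


def to_persisted_json(data: dict) -> str:
    # Only at serialization time do the missing averages become infinity.
    serializable = {
        key: math.inf if value is None else value
        for key, value in data.items()
    }
    return json.dumps(serializable, indent=2)


print(to_persisted_json(stats))
```

Note that Python's `json` module serializes `math.inf` as the non-standard literal `Infinity`; whether that is acceptable for the persisted file is part of the open question here.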
vdusek
left a comment
LGTM!
Tested with:
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.enqueue_strategy import EnqueueStrategy
from crawlee.storages import Dataset, RequestQueue

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    rq = await RequestQueue.open()
    await rq.add_request('https://crawlee.dev')
    dataset = await Dataset.open()
    crawler = BeautifulSoupCrawler(request_provider=rq)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_DOMAIN)
        record = {
            'title': context.soup.title.text if context.soup.title else '',
            'url': context.request.url,
        }
        await dataset.push_data(record)

    stats = await crawler.run()
    print(stats)


if __name__ == '__main__':
    asyncio.run(main())

Output:
$ python run.py
INFO:crawlee.autoscaling.snapshotter:Setting max_memory_size of this run to 3.84 GB.
INFO:crawlee.statistics.statistics:crawlee.basic_crawler.basic_crawler request statistics {
"requests_finished": 0,
"requests_failed": 0,
"retry_histogram": [
0
],
"request_avg_failed_duration": null,
"request_avg_finished_duration": null,
"requests_finished_per_minute": 0,
"requests_failed_per_minute": 0,
"request_total_duration": 0.0,
"requests_total": 0,
"crawler_runtime": 0.010923
}
INFO:crawlee.autoscaling.autoscaled_pool:current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
INFO:httpx:HTTP Request: GET https://crawlee.dev "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/javascript-rendering "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/typescript-project "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/avoid-blocking "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/cheerio-crawler-guide "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/jsdom-crawler-guide "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/core/class/AutoscaledPool "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/proxy-management "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/result-storage "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/guides/request-storage "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/utils/namespace/social "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/utils "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/quick-start "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/deployment/aws-cheerio "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://crawlee.dev/docs/deployment/gcp-cheerio "HTTP/1.1 200 OK"
INFO:crawlee.autoscaling.autoscaled_pool:Waiting for remaining tasks to finish
{
"requests_finished": 16,
"requests_failed": 0,
"retry_histogram": [
16
],
"request_avg_failed_duration": null,
"request_avg_finished_duration": 0.096618,
"requests_finished_per_minute": 936,
"requests_failed_per_minute": 0,
"request_total_duration": 1.545896,
"requests_total": 16,
"crawler_runtime": 1.02534
}
TODO
- `Statistics` class
- `BasicCrawler`