Skip to content

fix: http clients detect 4xx as errors by default#498

Merged
vdusek merged 7 commits intomasterfrom
http-clients-detect-4xx
Sep 6, 2024
Merged

fix: http clients detect 4xx as errors by default#498
vdusek merged 7 commits intomasterfrom
http-clients-detect-4xx

Conversation

@vdusek
Copy link
Copy Markdown
Collaborator

@vdusek vdusek commented Sep 4, 2024

Description

  • Error codes between 400 and 599 are considered as errors by default.
  • Before this change only codes between 500 and 599 were handled by default.

Issues

Testing

  • I added a new test to test that 4xx status codes are not being retried.

Checklist

  • CI passed

@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 4, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 4, 2024
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Sep 4, 2024
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@vdusek vdusek force-pushed the http-clients-detect-4xx branch from 9fbf479 to 600a05a Compare September 4, 2024 14:33
@vdusek vdusek marked this pull request as ready for review September 4, 2024 14:34
@vdusek vdusek requested a review from janbuchar September 4, 2024 14:34
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@janbuchar
Copy link
Copy Markdown
Collaborator

I haven't read the code yet, but does it retry on server error and fail immediately on client error?

@vdusek
Copy link
Copy Markdown
Collaborator Author

vdusek commented Sep 5, 2024

I haven't read the code yet, but does it retry on server error and fail immediately on client error?

No, we do retry for all error codes.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@janbuchar
Copy link
Copy Markdown
Collaborator

I haven't read the code yet, but does it retry on server error and fail immediately on client error?

No, we do retry for all error codes.

Well, I tried running JS crawlee against httpbin/status/{4xx, 5xx} with CheerioCrawler, this is what I got:

INFO  CheerioCrawler: Starting the crawler.
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. 533 - Internal Server Error:
 {"id":"w6oAmtuwtHhtL97","url":"https://httpbin.org/status/533","retryCount":1}
ERROR CheerioCrawler: Request failed and reached maximum retries. Error: Resource https://httpbin.org/status/406 is not available in the format requested by the Accept header. Skipping resource.
    at CheerioCrawler._abortDownloadOfBody (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:894:19)
    at HttpCrawler.postNavigationHooks (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:447:45)
    at CheerioCrawler._executeHooks (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/basic-crawler.ts:1599:23)
    at CheerioCrawler._handleNavigation (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:625:20)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async CheerioCrawler._runRequestHandler (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:509:13)
    at async CheerioCrawler._runRequestHandler (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/cheerio-crawler.ts:235:9)
    at async wrap (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@apify/src/index.ts:77:27) {"id":"S2gfgn5uedS6zfo","url":"https://httpbin.org/status/406","method":"GET","uniqueKey":"https://httpbin.org/status/406"}
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. 533 - Internal Server Error:
 {"id":"w6oAmtuwtHhtL97","url":"https://httpbin.org/status/533","retryCount":2}
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. 533 - Internal Server Error:
 {"id":"w6oAmtuwtHhtL97","url":"https://httpbin.org/status/533","retryCount":3}
ERROR CheerioCrawler: Request failed and reached maximum retries. Error: 533 - Internal Server Error:
    at CheerioCrawler._parseResponse (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:753:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async CheerioCrawler._runRequestHandler (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:512:28)
    at async CheerioCrawler._runRequestHandler (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/cheerio-crawler.ts:235:9)
    at async wrap (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@apify/src/index.ts:77:27) {"id":"w6oAmtuwtHhtL97","url":"https://httpbin.org/status/533","method":"GET","uniqueKey":"https://httpbin.org/status/533"}
INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":2,"retryHistogram":[1,null,null,1],"requestAvgFailedDurationMillis":442,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":103,"requestTotalDurationMillis":883,"requestsTotal":2,"crawlerRuntimeMillis":1155}
INFO  CheerioCrawler: Error analysis: {"totalErrors":2,"uniqueErrors":2,"mostCommonErrors":["1x: Resource https://httpbin.org/status/406 is not available in the format requested by the Accept header. Skipping resource. (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:894:19)","1x: 533 - Internal Server Error: (/home/janbuchar/Projekty/Apify/repro-mtggoldfish-js/node_modules/@crawlee/src/internals/http-crawler.ts:753:19)"]}
INFO  CheerioCrawler: Finished! Total 2 requests: 0 succeeded, 2 failed. {"terminal":true}

We should keep the parity here...

@B4nan
Copy link
Copy Markdown
Member

B4nan commented Sep 5, 2024

we also have some special handling for blocked status codes like 403 when it comes to sessions (and this needs to be configurable)

https://github.com/search?q=repo%3Aapify%2Fcrawlee%20blocked%20status%20codes&type=code

@vdusek vdusek force-pushed the http-clients-detect-4xx branch from 6c3e0a4 to 38af191 Compare September 6, 2024 10:55
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@vdusek vdusek marked this pull request as draft September 6, 2024 11:02
@vdusek vdusek force-pushed the http-clients-detect-4xx branch from e5d6893 to 898b436 Compare September 6, 2024 11:08
@vdusek vdusek marked this pull request as ready for review September 6, 2024 11:09
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@vdusek
Copy link
Copy Markdown
Collaborator Author

vdusek commented Sep 6, 2024

Update:

we also have some special handling for blocked status codes like 403 when it comes to sessions (and this needs to be configurable)

AFAIK this is already implemented.

@vdusek vdusek requested a review from B4nan September 6, 2024 11:12
This client uses the `HTTPX` library to perform HTTP requests in crawlers (`BasicCrawler` subclasses)
and to manage sessions, proxies, and error handling.

The client raises an `HttpStatusCodeError` when it encounters an error response, defined by default as any
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should apply to any http client, right? Couldn't we just put this into the base client doc and link to it here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should, but the implementation of such behavior is in the specific HTTP clients, not the base. If someone creates their own HTTP client, they are not required to implement it in the same way. As a result, the docstring may not be true.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. In that case, we may want to pull this logic up into BaseHttpClient though. It seems unlikely you'd want to change this behavior, and at the same time, we don't want to make http-client-authors implement this over and over again.

@vdusek vdusek requested a review from janbuchar September 6, 2024 13:00
@vdusek vdusek merged commit 1895dca into master Sep 6, 2024
@vdusek vdusek deleted the http-clients-detect-4xx branch September 6, 2024 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Throw an exception when we receive a 4xx status code

3 participants