Skip to content

fix: JSON handling with Parsel#490

Merged
janbuchar merged 1 commit intomasterfrom
fix-parsel-json
Sep 2, 2024
Merged

fix: JSON handling with Parsel#490
janbuchar merged 1 commit intomasterfrom
fix-parsel-json

Conversation

@janbuchar
Copy link
Copy Markdown
Collaborator

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 2, 2024
@janbuchar janbuchar requested a review from vdusek September 2, 2024 11:41
@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 2, 2024
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Sep 2, 2024
Copy link
Copy Markdown
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. Just a couple of questions:

  1. Shouldn't we handle this in BS crawler as well?

  2. What about adding try-except block when parsing the response? (Same for BS crawler.)

try:
    parsel_selector = await asyncio.to_thread(lambda: Selector(body=context.http_response.read()))
except ValueError as exc:
    raise ValueError(
        f'Failed to parse the response body. Ensure the response is parseble via Parsel: {exc}'
    ) from exc

@janbuchar
Copy link
Copy Markdown
Collaborator Author

  1. Shouldn't we handle this in BS crawler as well?

I'm not sure how to go about that. BS is super tolerant to malformed HTML:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('{"hello": "world"}')
<html><body><p>{"hello": "world"}</p></body></html>

So... let me propose employing the Ostrich algorithm yet again?

  1. What about adding try-except block when parsing the response? (Same for BS crawler.)
try:
    parsel_selector = await asyncio.to_thread(lambda: Selector(body=context.http_response.read()))
except ValueError as exc:
    raise ValueError(
        f'Failed to parse the response body. Ensure the response is parseble via Parsel: {exc}'
    ) from exc

Are you sure the original exception is not clear enough? This will make the traceback longer, which is not great for readability.

Copy link
Copy Markdown
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM then

@janbuchar janbuchar merged commit ebf5755 into master Sep 2, 2024
@janbuchar janbuchar deleted the fix-parsel-json branch September 2, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cannot use CSS on a Selector of type 'json' ( ParselCrawler )

2 participants