Skip to content

feat: Implement ParselCrawler that adds support for Parsel#348

Merged
janbuchar merged 8 commits intoapify:masterfrom
asymness:feat-parsel-crawler
Aug 7, 2024
Merged

feat: Implement ParselCrawler that adds support for Parsel#348
janbuchar merged 8 commits intoapify:masterfrom
asymness:feat-parsel-crawler

Conversation

@asymness
Copy link
Copy Markdown
Contributor

Description

  • Implemented ParselCrawler which adds support for Parsel
  • Added unit tests for ParselCrawler
  • Added example in the docs for ParselCrawler usage

Issues

Testing

  • Testing example included in the docs.

Checklist

  • Changes are described in the CHANGELOG.md
  • CI passed

@asymness asymness marked this pull request as draft July 23, 2024 14:26
Comment on lines +215 to +222
# Simulate ImportError for parsel
with mock.patch.dict('sys.modules', {'parsel': None}):
# Invalidate ParselCrawler import
sys.modules.pop('crawlee.parsel_crawler', None)
sys.modules.pop('crawlee.parsel_crawler.parsel_crawler', None)

with pytest.raises(ImportError) as import_error:
from crawlee.parsel_crawler import ParselCrawler
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 🙌

@asymness asymness force-pushed the feat-parsel-crawler branch from 2d92749 to bcaf1d2 Compare July 23, 2024 18:03
@asymness asymness marked this pull request as ready for review July 23, 2024 18:09
@asymness asymness requested a review from janbuchar July 23, 2024 18:09
Copy link
Copy Markdown
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job, thank you! Just two comments to think about.

class ParselCrawlingContext(HttpCrawlingResult, BasicCrawlingContext):
"""Crawling context used by ParselCrawler."""

parsel: Selector
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick glance at the docs, I believe that the idiomatic name for this would be selector. What do you think?

)

# Regex for identifying email addresses on a webpage.
EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted about parsing e-mail addresses with regexp like this. The presented use case is cool, but there is almost certainly some mistake here.

@fnesveda fnesveda added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 31, 2024
Copy link
Copy Markdown
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; just docs:

  • Could we add link to Parsel library?
  • Could we add API reference link to ParselCrawler?
  • Could we mention the installation of the "parsel" extra?

@janbuchar janbuchar merged commit a3832e5 into apify:master Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add support for Parsel

4 participants