feat: add utility for load and parse Sitemap and SitemapRequestLoader #1169
vdusek merged 45 commits into apify:master
Pull Request Overview
This PR introduces a sitemap utility feature that integrates new routing logic for various sitemap formats and refactors endpoint signatures for consistency.
- Updated request routing in tests/unit/server.py to use a dictionary mapping paths to endpoint handler functions.
- Refactored endpoint functions to include consistent parameters (scope, _receive, send).
- Added a new get_sitemap_endpoint to serve sitemap content and implemented extensive tests in tests/unit/_utils/test_sitemap.py.
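The dictionary-based routing described above could look roughly like the sketch below. This is not the code from tests/unit/server.py: the handler names, the `ROUTES` table, the `send_text` helper, and the `call_app` driver are all hypothetical, and the sketch assumes a plain ASGI callable that routes on the first path segment.

```python
import asyncio
from typing import Any, Awaitable, Callable

Scope = dict[str, Any]
Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]
Endpoint = Callable[[Scope, Receive, Send], Awaitable[None]]


async def send_text(send: Send, body: str, status: int = 200) -> None:
    """Hypothetical helper: send a complete ASGI HTTP response."""
    headers = [(b'content-type', b'text/plain; charset=utf-8')]
    await send({'type': 'http.response.start', 'status': status, 'headers': headers})
    await send({'type': 'http.response.body', 'body': body.encode()})


# Every endpoint shares the same (scope, _receive, send) signature,
# so the router can call any of them without special cases.
async def hello_endpoint(_scope: Scope, _receive: Receive, send: Send) -> None:
    await send_text(send, 'hello')


async def sitemap_endpoint(_scope: Scope, _receive: Receive, send: Send) -> None:
    await send_text(send, '<urlset/>')


async def not_found_endpoint(_scope: Scope, _receive: Receive, send: Send) -> None:
    await send_text(send, 'not found', status=404)


# Hypothetical routing table: first path segment -> handler.
ROUTES: dict[str, Endpoint] = {
    'hello': hello_endpoint,
    'sitemap.xml': sitemap_endpoint,
}


async def app(scope: Scope, receive: Receive, send: Send) -> None:
    """Dispatch on the first path segment; unknown paths fall back to 404."""
    first_segment = scope['path'].strip('/').split('/', 1)[0]
    handler = ROUTES.get(first_segment, not_found_endpoint)
    await handler(scope, receive, send)


async def call_app(path: str) -> tuple[int, bytes]:
    """Drive the app once without a real server; returns (status, body)."""
    messages: list[dict[str, Any]] = []

    async def receive() -> dict[str, Any]:
        return {'type': 'http.request', 'body': b'', 'more_body': False}

    async def send(message: dict[str, Any]) -> None:
        messages.append(message)

    await app({'type': 'http', 'path': path}, receive, send)
    return messages[0]['status'], messages[1]['body']
```

With this shape, `asyncio.run(call_app('/hello/deeper/path'))` still reaches `hello_endpoint`, because only the first segment is consulted. That is the kind of nested-path behavior worth checking when moving away from prefix matching.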
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/server.py | Refactored endpoint function signatures and routing logic; added new sitemap endpoint. |
| tests/unit/_utils/test_sitemap.py | Added comprehensive tests covering XML, gzipped, plain text, and invalid sitemap scenarios. |
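For readers unfamiliar with the sitemap variants listed in the table (XML, gzipped, plain text, invalid), here is a minimal standard-library sketch of telling them apart. It is an illustration only, not the parser added in this PR: `extract_urls` is a hypothetical name, the whole payload is held in memory, and invalid XML is assumed to yield an empty list.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream
LOC_TAG = '{http://www.sitemaps.org/schemas/sitemap/0.9}loc'


def extract_urls(payload: bytes) -> list[str]:
    """Return the URLs found in an XML, gzipped XML, or plain-text sitemap."""
    if payload.startswith(GZIP_MAGIC):
        payload = gzip.decompress(payload)

    stripped = payload.strip()
    if not stripped.startswith(b'<'):
        # Plain-text sitemap: one URL per line.
        lines = stripped.decode('utf-8').splitlines()
        return [line.strip() for line in lines if line.strip()]

    try:
        root = ET.fromstring(stripped)
    except ET.ParseError:
        # Assumption for this sketch: malformed XML produces no URLs.
        return []

    # <loc> appears both in <urlset> and in <sitemapindex> documents.
    return [loc.text.strip() for loc in root.iter(LOC_TAG) if loc.text]
```

A gzipped payload goes through the same code path as plain XML once it is decompressed, so both variants can share one set of assertions in tests.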
Pull Request Overview
This PR introduces a new utility for loading and parsing sitemaps and adds the SitemapRequestLoader to facilitate integrating sitemap-based requests into the framework. Key changes include:
- Refactoring the server routing to support dynamic endpoint functions with a unified signature.
- Adding comprehensive tests for sitemap loading, including gzip and plain text variants.
- Implementing the SitemapRequestLoader and integrating it with the existing request loader framework.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/unit/server.py | Refactored endpoint routing to use a path-to-handler mapping. |
| tests/unit/request_loaders/test_sitemap_request_loader.py | New tests ensuring proper sitemap request loader functionality. |
| tests/unit/_utils/test_sitemap.py | Extensive tests for sitemap parsing and various sitemap formats. |
| src/crawlee/request_loaders/_sitemap_request_loader.py | New implementation of SitemapRequestLoader with background sitemap loading. |
| src/crawlee/request_loaders/__init__.py | Updated `__all__` to export SitemapRequestLoader. |
| src/crawlee/_utils/robots.py | Extended RobotsTxtFile to support sitemap parsing and URL extraction. |
Comments suppressed due to low confidence (2)
tests/unit/server.py:120
- Switching from prefix-based matching to extracting a specific part from the URL may affect routing behavior; please verify that this logic meets all desired routing cases (e.g. deeper nested paths).
path_parts = URL(scope['path']).parts
src/crawlee/_utils/robots.py:89
- The docstring for 'parse_sitemaps' indicates it returns a list of Sitemap instances, but the implementation returns a single Sitemap instance; please update the docstring to accurately reflect the return type.
async def parse_sitemaps(self) -> Sitemap:
janbuchar left a comment
Seems promising, thanks 🙂
Co-authored-by: Jan Buchar <[email protected]>
### Description
- Add `stream` method for `HttpClient`
- Add an async context manager for cleaning up resources when closing a `HttpClient`

Relates: #1169
janbuchar left a comment
I'd appreciate if @vdusek or @Pijukatel could also look into this as it's pretty big. I don't see any issues now.
vdusek left a comment
Looks good! Just maybe could you update the Request loaders guide to cover SitemapRequestLoader as well? 🙂
Looks good to me. I have a question though. This seems like a PR that deals with a problem that is already partially solved by existing packages. Probably not our specific request handling stuff, but at least the sitemap parsing part. For example, this package seems pretty mature: https://github.com/GateNLP/ultimate-sitemap-parser/blob/main/usp/fetch_parse.py#L358 and it has the word "ultimate" in its name, so it must be really good :D
Uh, thank you. I missed that when I was looking at available solutions before implementation. I'll see how promising it will be to use their solution. The only thing that confuses me is the presence of
I took a closer look at their code base. Unfortunately, although their documentation mentions stream processing, their code does not implement it. In our case streaming is quite important, because some sitemaps can take up a huge amount of memory.
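To illustrate what streaming means here, a minimal standard-library sketch follows. It is not the implementation from this PR: `iter_sitemap_urls` and `_drain` are hypothetical names, and the input is assumed to be an iterable of already-decompressed byte chunks, such as the pieces of an HTTP response body.

```python
import xml.etree.ElementTree as ET
from collections.abc import Iterable, Iterator

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
LOC_TAG = f'{NS}loc'
# Entries whose contents can be dropped once their closing tag is seen.
ENTRY_TAGS = {f'{NS}url', f'{NS}sitemap'}


def _drain(parser: ET.XMLPullParser) -> Iterator[str]:
    """Yield URLs from the 'end' events the parser has queued so far."""
    for _event, element in parser.read_events():
        if element.tag == LOC_TAG and element.text:
            yield element.text.strip()
        elif element.tag in ENTRY_TAGS:
            # Free the children of a finished entry. An empty shell stays
            # attached to the root, which is much smaller than the full entry.
            element.clear()


def iter_sitemap_urls(chunks: Iterable[bytes]) -> Iterator[str]:
    """Parse a sitemap chunk by chunk instead of loading the whole document."""
    parser = ET.XMLPullParser(events=('end',))
    for chunk in chunks:
        parser.feed(chunk)
        yield from _drain(parser)
    # close() flushes anything the underlying parser was still buffering
    # and raises ParseError if the document is incomplete.
    parser.close()
    yield from _drain(parser)
```

Because the function is a generator, a consumer can start enqueueing URLs while later chunks are still being downloaded, and memory use depends on the chunk size rather than the sitemap size.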
Thanks for checking it out and for the explanation.
### Description
- Add `SitemapRequestLoader` for convenient work with `Sitemap` and easy integration into the framework
- Add a utility for loading and stream parsing `Sitemap`

### Issues
- `Sitemap` parser utility #1161

### Testing
- `SitemapRequestLoader`