
Conversation

@SuaYoo (Member) commented Oct 3, 2023

Resolves #1222

Changes

  • Limits URL list entry to 1,000 URLs
  • Limits additional URL list entry to 100 URLs
  • Shows the first invalid URL in the list in the error message (sketched below)
  • Quick-and-dirty fix for long URLs wrapping: show each URL in the list on one line, with the entire container scrolling
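
A minimal sketch of the list validation described above, assuming hypothetical names (`URL_LIST_MAX_URLS`, `validateUrlList`) and error strings; the actual component code may differ:

```ts
// Sketch only: mirrors the behavior described in this PR, not the shipped code.
const URL_LIST_MAX_URLS = 1_000; // 100 for seeded crawl "Additional URLs"

function validateUrlList(value: string): { urls: string[]; error?: string } {
  // One URL per non-empty line of the textarea
  const urls = value
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  if (urls.length > URL_LIST_MAX_URLS) {
    return {
      urls,
      error: `Please shorten list to ${URL_LIST_MAX_URLS.toLocaleString()} or fewer URLs.`,
    };
  }

  // Surface the first invalid URL in the error message
  const firstInvalid = urls.find((url) => {
    try {
      new URL(url);
      return false;
    } catch {
      return true;
    }
  });

  if (firstInvalid) {
    return { urls, error: `Invalid URL: ${firstInvalid}` };
  }

  return { urls };
}
```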

Manual testing

  1. Log in and go to Crawling
  2. Click "New Workflow" -> "URL List"
  3. Type in a valid URL and press "Enter". Verify that help text shows "1 URL entered"
  4. Type in an invalid URL and press "Enter". Verify that help text shows invalid URL
  5. Fix the invalid URL and press "Enter". Verify that help text updates to show 2 URLs
  6. Repeat steps 3-5, clicking outside of the textarea instead of pressing "Enter". Verify validation updates as expected
  7. Paste in 1,000 URLs (I generated them with a mock data web tool; see the snippet below this list)
  8. Enter one more URL. Verify that help text updates to show the max error
  9. Go to the review tab. Verify that you're unable to save the workflow while the max error is shown
  10. Repeat with "Seeded Crawl" -> "Additional URLs", checking for max 100 instead of 1,000
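
For step 7, any newline-separated list of valid URLs will do; a throwaway snippet like this (hypothetical, in place of the mock data tool) generates enough to test both the limit and the overflow:

```ts
// Generate 1,000 unique, valid URLs to paste into the textarea.
const urls = Array.from({ length: 1_000 }, (_, i) => `https://example.com/page-${i}`);
console.log(urls.join("\n"));
```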

Screenshots

| Page | Image/video |
| --- | --- |
| New Workflow - URL List - Scope (updated info text) | Screenshot 2023-10-02 at 7 58 37 PM |
| New Workflow - URL List - Scope (invalid URL) | Screenshot 2023-10-02 at 7 59 23 PM |
| New Workflow - URL List - Scope (too many URLs) | Screenshot 2023-10-02 at 7 58 53 PM |

@SuaYoo SuaYoo requested review from Shrinks99, ikreymer and tw4l October 3, 2023 03:26
@SuaYoo SuaYoo marked this pull request as ready for review October 3, 2023 03:27
@Shrinks99 (Contributor) commented Oct 3, 2023

Selecting `htt://` as the protocol is seen as a valid URL. `h://` and `ht` are correctly classified as invalid.

[Screenshots: 2023-10-02 at 11:48:17 PM, 11:48:24 PM]

Any three- or four-character string as the protocol appears to be valid.

[Screenshot: 2023-10-02 at 11:50:47 PM]

HTTP and HTTPS should be valid; I'm unsure if there are any others we should support.


After setting the max pages per crawl value in the org settings to 100, the URL list still accepts up to 1,000 URLs. Anish mentioned checking the max pages per crawl org setting and adjusting this value accordingly. Our standard plan will support up to 2,000 pages per crawl, and while 1,000 is under that either way, we should probably at least check whether the org limit is below 1,000 and, if so, cap the URL list accordingly?

[Screenshots: 2023-10-03 at 12:03:06 AM, 12:03:49 AM]

Personally I think this is kind of an edge case, and I can't really imagine a situation where people would run into this in the wild... But it would be nice not to let people enter a value that will result in not crawling everything they intended?
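
A sketch of the check being suggested, assuming a hypothetical `maxPagesPerCrawl` field on the org settings:

```ts
const DEFAULT_URL_LIST_MAX = 1_000;

// If the org's per-crawl page limit is lower than the default URL list cap,
// use the org limit so users can't enter more URLs than will actually be crawled.
function effectiveUrlListMax(orgMaxPagesPerCrawl?: number): number {
  return orgMaxPagesPerCrawl && orgMaxPagesPerCrawl < DEFAULT_URL_LIST_MAX
    ? orgMaxPagesPerCrawl
    : DEFAULT_URL_LIST_MAX;
}
```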


Other than this, looks good!

@SuaYoo (Member, Author) commented Oct 3, 2023

> Selecting `htt://` as the protocol is seen as a valid URL. `h://` and `ht` are correctly classified as invalid.
>
> HTTP and HTTPS should be valid; I'm unsure if there are any others we should support.

Generally I favor permissive URLs (`new URL('a://b.c')` passes, for example), but since we're crawling webpages, it does make sense to limit to http(s). @ikreymer are there any instances where this wouldn't be the case?
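
For illustration, `new URL` accepts any syntactically valid scheme, so restricting to http(s) needs an explicit protocol check (a sketch, not the shipped validation):

```ts
// new URL() is permissive about schemes: all of these parse without throwing.
new URL("a://b.c");     // protocol is "a:"
new URL("htt://x.com"); // protocol is "htt:"

// Restricting to crawlable schemes means checking the parsed protocol:
function isValidCrawlUrl(value: string): boolean {
  try {
    const { protocol } = new URL(value);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}
```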

> After setting the max pages per crawl value in the org settings to 100, the URL list still accepts up to 1,000 URLs.

Could you log a follow-up issue for this, since it needs to be implemented regardless of the URL list max limit?

@Shrinks99 (Contributor)

> Could you log a follow-up issue for this, since it needs to be implemented regardless of the URL list max limit?

@SuaYoo Spun out into #1245!

@tw4l (Member) commented Oct 3, 2023

>> Selecting `htt://` as the protocol is seen as a valid URL. `h://` and `ht` are correctly classified as invalid.
>>
>> HTTP and HTTPS should be valid; I'm unsure if there are any others we should support.
>
> Generally I favor permissive URLs (`new URL('a://b.c')` passes, for example), but since we're crawling webpages, it does make sense to limit to http(s). @ikreymer are there any instances where this wouldn't be the case?

The crawler won't work for a URL that doesn't start with `http://` or `https://`, so probably good to be restrictive here!

@ikreymer (Member) commented Oct 4, 2023

Yep, we should consider a non-http/https scheme as invalid, as it will fail in the crawler.
We could potentially add a scheme automatically if none is present, defaulting to https, so if the user enters
'example.com' we convert it to 'https://example.com', but maybe that should be a separate issue...
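
If that were implemented, it could look something like this sketch (hypothetical helper, not committed code):

```ts
// Prepend https:// when the user omits a scheme, per the suggestion above.
function withDefaultScheme(value: string): string {
  return /^[a-z][a-z0-9+.-]*:\/\//i.test(value) ? value : `https://${value}`;
}

withDefaultScheme("example.com");        // "https://example.com"
withDefaultScheme("http://example.com"); // unchanged
```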

@SuaYoo (Member, Author) commented Oct 4, 2023

> The crawler won't work for a URL that doesn't start with `http://` or `https://`, so probably good to be restrictive here!

Updated!

@SuaYoo SuaYoo requested a review from Shrinks99 October 4, 2023 02:41
@Shrinks99 (Contributor) left a comment

Tested, works nicely!

@ikreymer ikreymer merged commit 38efecc into main Oct 4, 2023
@SuaYoo SuaYoo deleted the frontend-url-list-max branch October 4, 2023 04:31