how to help the crawler to find links on specific sites #693

Open
@tuehlarsen

Description

It would be useful to be able to see which elements the crawler considers links or clickable, to make it easier to debug why a resource was not indexed.

Perhaps some kind of overlay view, where you could load in metadata and visualize which elements the indexer considers clickable; that would make it easier to see what's going on.
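As a rough illustration of what such an overlay could highlight, here is a minimal sketch of a clickability heuristic. The rules and the `isLikelyClickable` name are hypothetical, invented for this example, not the crawler's actual logic:

```javascript
// Hypothetical heuristic (not the crawler's real logic): decide whether a
// DOM-like node summary should be treated as clickable. An overlay tool
// could outline every element for which this returns true.
function isLikelyClickable(node) {
  const clickableTags = new Set(["A", "BUTTON", "INPUT", "SELECT"]);
  if (clickableTags.has(node.tagName)) return true;           // native controls
  if (node.role === "button" || node.role === "link") return true; // ARIA roles
  if (node.hasOnclick) return true;                           // inline onclick handler
  if (node.tabindex !== undefined && node.tabindex >= 0) return true; // focusable
  return false;
}

// Example nodes as they might be summarized from a crawled page
console.log(isLikelyClickable({ tagName: "A" }));                   // true
console.log(isLikelyClickable({ tagName: "DIV", role: "button" })); // true
console.log(isLikelyClickable({ tagName: "SPAN" }));                // false
```

The `tabindex` check is why custom widgets sometimes get picked up even without a recognized tag or role.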

It would also be good to be able to guide the indexer with a list of CSS selectors to help it find elements to click on.
For example, see the "More" entry in Google search's top menu in crawl ID manual-20230307134720-b23cfddf-bfa, or the SoundCloud podcast in crawl ID manual-20230310092711-7c02d217-c4b; none of these links are found by the crawler.
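A sketch of what such selector guidance could look like in a crawl config. The `clickSelectors` option and the selectors themselves are hypothetical, for illustration only, not an existing crawler feature:

```yaml
# Hypothetical option: per-crawl CSS selectors the crawler would click
# after page load, before extracting links.
seeds:
  - url: https://www.google.com/
clickSelectors:
  - "header [aria-label='More']"   # e.g. a "More" entry in a top menu
  - "button.show-more"             # e.g. "show more" buttons on listing pages
```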

It's not clear whether the indexer can handle JS-triggered downloads, or how it handles downloads in general.
Sometimes a site will do the following to trigger a download of a resource:
the client app calls out to an API, stores the response in memory, and then uses a browser API to save the content to disk. It would be great if the indexer could capture those files too. See e.g. the statstidende.dk/publications PDF download in crawl ID
manual-20230304090628-8b0d5f9f-97b

An example of a library that might be used for this: https://github.com/eligrey/FileSaver.js/
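The in-memory download pattern described above can be sketched like this. The function name, parameters, and endpoint are invented for the example; `fetchFn` and `saveFn` are injected so the flow can be shown outside a browser, whereas in a real page they would be `window.fetch` and FileSaver.js's `saveAs`:

```javascript
// Sketch of the in-memory download pattern: API call, response held in
// memory, then a browser API writes the bytes to disk. No HTTP response is
// ever navigated to, which is why a crawler that only records page loads
// would miss the file.
async function downloadViaMemory(url, filename, fetchFn, saveFn) {
  const resp = await fetchFn(url);  // client app calls out to an API
  const blob = await resp.blob();   // response body is held in memory
  saveFn(blob, filename);           // browser API saves the content to disk
  return blob;
}

// Stand-ins to show the flow; no network or filesystem involved.
const fakeFetch = async () => ({ blob: async () => "PDF-BYTES" });
const saved = [];
downloadViaMemory("/api/publication.pdf", "publication.pdf", fakeFetch,
  (blob, name) => saved.push(name));
```

Capturing this would presumably mean intercepting the underlying API response (or the blob save) rather than waiting for a navigation event.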

Labels

investigation — Research and/or prototyping before dev work
question — Further information is requested, label should be removed once answered
ui/ux — This issue requires UI/UX work
