Request to omit all statements on csarven.ca #2

Closed
csarven opened this issue Jun 14, 2018 · 6 comments

Comments

@csarven

csarven commented Jun 14, 2018

Hi. I'm the current owner of csarven.ca. I would appreciate it if the dataset, map, and anything else could omit all statements pertaining to csarven.ca. If the crawler is discovering any information that's currently on csarven.ca, please let me know and I can remove it. The same goes for any other place that I might have access to. Thanks.

@snarfed
Owner

snarfed commented Nov 3, 2018

oh wow, hey @csarven. apologies, i totally missed this. absolutely, will do.

@snarfed snarfed closed this as completed in ce794b9 Nov 3, 2018
snarfed added a commit that referenced this issue Nov 3, 2018
@snarfed
Owner

snarfed commented Nov 3, 2018

done! your site is now gone from the dataset. details.

fwiw, here's what i see in http://csarven.ca/robots.txt right now. i do get that you may want to allow some crawlers but not others, like indie map.

User-agent: *
Disallow: /archives
Disallow: /scripts
Disallow: /url
Disallow: /labs
Disallow: /presentations
Disallow: /webstream
Disallow: /search
Disallow: /statistical-linked-dataspaces-and-analysis-overview
Allow: /labs/indexability
Allow: /archives/articles
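To illustrate the "allow some crawlers but not others" point, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The `User-agent: indie-map` group is a hypothetical addition, not something present in the robots.txt quoted above, and the `allowed` helper and the example paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the "indie-map" group is an assumption added for
# illustration; only the "*" rules echo part of the file quoted above.
ROBOTS_TXT = """\
User-agent: indie-map
Disallow: /

User-agent: *
Disallow: /archives
Disallow: /scripts
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("indie-map", "http://csarven.ca/"),
        ("wget", "http://csarven.ca/about"),
        ("wget", "http://csarven.ca/archives"),
    ]:
        print(agent, url, allowed(agent, url))
```

A per-crawler group like this only has an effect if the crawler identifies itself with a matching user-agent token and actually honors robots.txt.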

@csarven
Author

csarven commented Nov 5, 2018

Thanks!

Pardon me if I'm looking at the wrong code, but perhaps it would also be worthwhile to update the user-agent value: https://github.com/snarfed/indie-map/blob/master/crawl/wget.sh#L9 ?

@snarfed
Owner

snarfed commented Nov 5, 2018

interesting idea. you mean, set it to Indie Map? i could! technically the user agent still is wget though, right? Indie Map is just the use case? maybe I'm splitting hairs.

@csarven
Author

csarven commented Nov 5, 2018

I would classify indie-map (the software of this repository) as the user-agent, as opposed to a particular library that's doing the fetching. Just as Firefox uses its own library to negotiate resources. But yes, generally it can be arbitrary.

So, you can do something like User-Agent: indie-map or, if you're feeling adventurous, use User-Agent: https://github.com/snarfed/indie-map/ or some other HTTP URI that can provide a structured description of the application (e.g., view-source of https://dokie.li/).
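As a rough sketch of the suggestion above, this is how a crawler written in Python could send an application-level User-Agent with the standard library. The `build_request` helper is hypothetical and is not how indie-map's wget-based crawl is implemented; with wget itself, the `--user-agent` option serves the same purpose.

```python
from urllib.request import Request

# Either value follows the suggestion in this thread; which one to use is a
# judgment call, not an established convention of the project.
NAME_AGENT = "indie-map"
URI_AGENT = "https://github.com/snarfed/indie-map/"


def build_request(url, user_agent=NAME_AGENT):
    """Build a request that identifies the application, not the HTTP library."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("http://example.com/", URI_AGENT)
    # Request normalizes header names, so the key is stored as "User-agent".
    print(req.get_header("User-agent"))
    # Passing req to urllib.request.urlopen would send it over the network.
```

Sending a distinct token also gives site owners something specific to match in a robots.txt User-agent group.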

@snarfed
Owner

snarfed commented Nov 5, 2018

also fwiw, i've only actually done this whole crawl once, and i have no plans to do it again right now, regularly or otherwise.

still, good idea. thanks for the nudge.
