README

About

The founders of Google used Python to fetch pages when building their search engine. In their own words:

"In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python."

I recommend reading the original paper: http://infolab.stanford.edu/~backrub/google.html
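
The URLserver-and-crawlers arrangement described in the quote can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in this repository and not Google's implementation: the names `URLServer`, `crawler`, `run_crawl`, `fake_fetch_links` and the `FAKE_WEB` data are all hypothetical, the "web" is an in-memory dictionary so the example runs offline, and URLs are handed out one at a time rather than in lists.

```python
import queue
import threading

# Hypothetical in-memory "web": page URL -> links found on that page.
# A real crawler would fetch each page over HTTP and parse out its links.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fake_fetch_links(url):
    """Stand-in for 'download the page and extract its links'."""
    return FAKE_WEB.get(url, [])


class URLServer:
    """Hands URLs to crawlers and makes sure each URL is issued only once."""

    def __init__(self, start_urls):
        self._pending = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()
        for url in start_urls:
            self.submit(url)

    def submit(self, url):
        """Queue a URL unless it has already been seen."""
        with self._lock:
            if url in self._seen:
                return
            self._seen.add(url)
        self._pending.put(url)

    def next_url(self):
        """Block until a URL (or a None stop signal) is available."""
        return self._pending.get()

    def mark_done(self):
        self._pending.task_done()

    def wait_until_idle(self):
        """Return once every queued URL has been processed."""
        self._pending.join()

    def stop(self, num_crawlers):
        """Send one stop signal per crawler."""
        for _ in range(num_crawlers):
            self._pending.put(None)

    def seen(self):
        with self._lock:
            return set(self._seen)


def crawler(server, fetch_links):
    """Worker loop: take a URL, fetch its links, report them to the server."""
    while True:
        url = server.next_url()
        if url is None:  # stop signal
            break
        try:
            for link in fetch_links(url):
                server.submit(link)
        finally:
            server.mark_done()


def run_crawl(start_urls, fetch_links, num_crawlers=3):
    """Run a crawl with one URL server and a few crawler threads."""
    server = URLServer(start_urls)
    threads = [
        threading.Thread(target=crawler, args=(server, fetch_links))
        for _ in range(num_crawlers)
    ]
    for thread in threads:
        thread.start()
    server.wait_until_idle()
    server.stop(num_crawlers)
    for thread in threads:
        thread.join()
    return server.seen()
```

Calling `run_crawl(["http://a.example/"], fake_fetch_links)` visits every page reachable from the start URL. Keeping the "seen" set inside the server means the crawlers never need to coordinate with each other directly.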

This project grew out of my curiosity about building a web crawler as a programming challenge.

The stack:

Python 3.6
Flask
BeautifulSoup
PyMySQL

Progress Log

2019/07/16

In progress...

2017/4/28

Start URLs: https://www.reddit.com, https://nytimes.com, http://dmoztools.net
Crawl time: Approx. 4 hours
Unique URLs: 1,003,156
Unique domains: 79,302

2017/4/27

Start URL: https://www.reddit.com
Crawl time: Approx. 4 hours
Unique URLs: 84,477
Unique domains: 2,747
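
For readers who want to reproduce this kind of summary, here is a minimal sketch of how the unique-URL and unique-domain counts could be computed from a list of crawled URLs, along with the average crawl rate. The function names `summarize` and `urls_per_second` are hypothetical, and treating a "domain" as the URL's hostname is an assumption; the log does not say how domains were actually counted.

```python
from urllib.parse import urldefrag, urlsplit


def summarize(urls):
    """Return (unique URL count, unique hostname count) for crawled URLs."""
    unique_urls = set()
    unique_domains = set()
    for raw in urls:
        # Drop the #fragment so the same page is not counted twice.
        url, _fragment = urldefrag(raw.strip())
        if not url:
            continue
        unique_urls.add(url)
        # Assumption: "domain" means the hostname (lowercased by urlsplit).
        host = urlsplit(url).hostname
        if host:
            unique_domains.add(host)
    return len(unique_urls), len(unique_domains)


def urls_per_second(unique_url_count, hours):
    """Average number of unique URLs discovered per second."""
    return unique_url_count / (hours * 3600)
```

Because the crawl times in the log are approximate, the rates are too: roughly 70 unique URLs per second for the three-seed crawl and roughly 6 per second for the single-seed crawl.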
