The founders of Google used Python to fetch pages when building their search engine.
"In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python."
I recommend reading the paper: http://infolab.stanford.edu/~backrub/google.html
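
The URLserver/crawler split described in the quote can be sketched in a few lines. This is only an illustration under stated assumptions, not the code of this project or of Google: the names `URLServer`, `crawl_batch`, `LinkParser`, `FAKE_WEB` and `fake_fetch` are hypothetical, a single loop stands in for the several crawlers, and pages come from an in-memory dictionary instead of the network. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class URLServer:
    """Hypothetical stand-in for a URLserver: hands out batches of unseen URLs."""

    def __init__(self, start_urls):
        self.seen = set(start_urls)
        self.pending = deque(start_urls)

    def next_batch(self, size=2):
        # Pop up to `size` URLs that have not been handed out yet.
        batch = []
        while self.pending and len(batch) < size:
            batch.append(self.pending.popleft())
        return batch

    def submit(self, urls):
        # Queue only URLs that were never seen before.
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.pending.append(url)


def crawl_batch(batch, fetch):
    """Fetch each URL in the batch and return the absolute links found."""
    found = []
    for url in batch:
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            found.append(absolute)
    return found


# Assumption: a tiny fake "web" so the sketch runs offline.
FAKE_WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/#top">A</a>',
}


def fake_fetch(url):
    return FAKE_WEB.get(url, "")


server = URLServer(["http://a.example/"])
while True:
    batch = server.next_batch()
    if not batch:
        break
    server.submit(crawl_batch(batch, fake_fetch))

crawled = sorted(server.seen)
print(crawled)
```

In a real crawler, `fake_fetch` would be replaced by an HTTP client, and several crawler processes would request batches from the server concurrently.
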
This project grew out of my curiosity about building a web crawler as a programming challenge.
- Python 3.6
- Flask
- BeautifulSoup
- PyMySQL
In progress...
Start URLs: https://www.reddit.com, https://nytimes.com, http://dmoztools.net
- Crawl time: approx. 4 hours
- Unique URLs: 1,003,156
- Unique domains: 79,302

Start URL: https://www.reddit.com
- Crawl time: approx. 4 hours
- Unique URLs: 84,477
- Unique domains: 2,747
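
For readers who want to compute similar figures from their own crawl, here is a rough sketch of one way to count unique URLs and unique domains. It is an assumption about how such counts could be derived, not a description of how the numbers above were produced: the helper name `summarize` and the sample list are hypothetical, and "domain" is taken to mean the full hostname (so `www.reddit.com` and `reddit.com` would count separately).

```python
from urllib.parse import urlsplit


def summarize(urls):
    """Return (unique URL count, unique hostname count) for an iterable of URLs."""
    unique_urls = set(urls)
    # Assumption: the hostname is used as the "domain"; URLs without one are skipped.
    unique_domains = set()
    for url in unique_urls:
        host = urlsplit(url).hostname
        if host:
            unique_domains.add(host)
    return len(unique_urls), len(unique_domains)


# Hypothetical sample data, including one duplicate URL.
sample = [
    "https://www.reddit.com/r/python",
    "https://www.reddit.com/r/python",
    "https://www.reddit.com/r/flask",
    "https://nytimes.com/",
]

url_count, domain_count = summarize(sample)
print(url_count, domain_count)
```
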