This repository contains Python scrapers designed to extract business listings and detailed business information from SuperPages, a widely used online business directory. The scrapers help collect essential data for lead generation, market research, and competitive analysis.
➡ Read the full blog here to learn more.
This repository includes two scrapers:

- **SuperPages Listings Scraper** (`superpages_listings_scraper.py`) – Extracts business listings, including:

  - Business Name
  - Address
  - Phone Number
  - Website URL
  - Business Profile Page Link

  The scraper efficiently handles pagination (see the sketch after this list).

- **SuperPages Business Details Scraper** (`superpages_business_details_scraper.py`) – Extracts detailed business information from individual listing pages, including:

  - Business Name
  - Operating Hours
  - Contact Information (e.g., email, additional phone numbers)
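A minimal sketch of that pagination loop: request successive results pages until no listings come back. The `page` query parameter and the `div.result` selector here are assumptions, not the script's actual values; the real logic lives in `superpages_listings_scraper.py`.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.superpages.com/search"

def scrape_all_pages(search_terms, location, max_pages=5):
    """Collect raw listing cards across paginated search results."""
    listings = []
    for page in range(1, max_pages + 1):
        params = {
            "search_terms": search_terms,
            "geo_location_terms": location,
            "page": page,  # assumed pagination parameter
        }
        response = requests.get(BASE_URL, params=params, timeout=10)
        if response.status_code != 200:
            break  # stop on a failed page load
        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.select("div.result")  # placeholder selector
        if not cards:
            break  # empty page means the last page was reached
        listings.extend(card.get_text(" ", strip=True) for card in cards)
    return listings
```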
Ensure that Python is installed on your system. Check the version using:

```bash
python --version
```

Install the required dependencies:

```bash
pip install requests beautifulsoup4
```

- **Requests** – Handles HTTP requests to retrieve web pages.
- **BeautifulSoup** – Parses and extracts structured data from HTML.
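A minimal illustration of how the two libraries fit together, using the SuperPages homepage as an example target:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page with Requests, then hand the HTML to BeautifulSoup for parsing
response = requests.get("https://www.superpages.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True) if soup.title else "No <title> found")
```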
This scraper extracts multiple business listings from SuperPages.
- **Modify the Target Search Query** – Update the `fetch_listings()` function in `superpages_listings_scraper.py` with your desired search terms and location.

- **Run the Scraper**

  ```bash
  python superpages_listings_scraper.py
  ```

- **Extracted Data Format** – The results are saved in `superpages_listings.json` as an array of business listings.
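An illustrative record from `superpages_listings.json` (the values below are placeholders, and the exact key names depend on the script):

```json
[
  {
    "business_name": "Example Plumbing Co.",
    "address": "123 Main St, Los Angeles, CA 90001",
    "phone": "(555) 555-0199",
    "website": "https://www.example.com",
    "profile_url": "https://www.superpages.com/bp/example-plumbing-co"
  }
]
```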
This scraper extracts detailed information from individual business pages.
- **Update the Business URLs** – Modify the `urls` list in `superpages_business_details_scraper.py` with the SuperPages business listing URLs you want to scrape (see the example after these steps).

- **Run the Scraper**

  ```bash
  python superpages_business_details_scraper.py
  ```

- **Extracted Data Format** – The results are saved in `business_details.json`, containing structured details for each business.
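For instance, the `urls` edit might look like this (the URLs below are placeholders, not real listing pages):

```python
# In superpages_business_details_scraper.py, replace the placeholders
# below with real SuperPages business profile URLs
urls = [
    "https://www.superpages.com/bp/example-business-1",
    "https://www.superpages.com/bp/example-business-2",
]
```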
To make our SuperPages scraper more robust and faster, we can use Crawlbase Smart Proxy. Smart Proxy provides IP rotation and anti-bot protection, helping us avoid rate limits and blocks during long data collection.
The example below shows how to use Crawlbase Smart Proxy in Python.
```python
import requests

# Replace _USER_TOKEN_ with your Crawlbase Token
proxy_url = 'http://_USER_TOKEN_:@smartproxy.crawlbase.com:8012'

def request_with_crawlbase_smart_proxy(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    }
    proxies = {"http": proxy_url, "https": proxy_url}

    try:
        # verify=False disables SSL certificate verification for the proxy connection
        response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
        response.raise_for_status()
        return response.text  # Return page content instead of full response object
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
```

Note: Sign up on Crawlbase and get an API token.
With Crawlbase Smart Proxy, requests appear to originate from different locations, ensuring uninterrupted scraping.
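A minimal usage sketch of the helper above (the search URL pattern is an assumption; adjust it to match what the scrapers actually request):

```python
# Fetch a SuperPages search results page through the Smart Proxy
html = request_with_crawlbase_smart_proxy(
    "https://www.superpages.com/search?search_terms=plumbers&geo_location_terms=Los%20Angeles%2C%20CA"
)
if html:
    print(f"Fetched {len(html)} characters of HTML")
```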
- Enhance scraping logic to extract additional business details (e.g., social media links, customer reviews).
- Improve error handling for failed or slow requests.
- Support automated discovery of business detail page links from search listings.
This project is ideal for businesses, marketers, and analysts looking to gather SuperPages data for lead generation and market insights. 🚀