This package provides tools for conducting algorithm audits of web search and
includes a scraper built on selenium with tools for geolocating, conducting,
and saving searches. It also includes a modular parser built on selectolax
for decomposing a SERP into a list of components with categorical classifications
and position-based specifications.
- 0.10.2: Documented running the browser backends without a GUI -- on a headless server, CI runner, or container -- via an Xvfb virtual display (new README section); the backends must run headed, so a no-display host needs a virtual display
- 0.10.1: Reorganized the flat parse modules into a single WebSearcher.parsers package (public entrypoints unchanged; deep imports of the old flat paths must switch) and hardened the parse pipeline -- every component is classified before any is parsed, and Component.to_dict() now returns a copy -- with output byte-identical (snapshot-pinned). Also dropped the snakeviz/ipykernel dev dependencies to evict the transitive tornado advisories
- 0.10.0: Reliable CAPTCHA detection from the /sorry/ block-redirect URL (not just the page text), with the browser backends capturing the live URL and HTML on a blocked request. Automated the geotargets locations refresh (update_locations_file, a tracked CSV + ledger, and a weekly cron). Richer parsed output under the two-tier result schema: right-hand knowledge-panel entity facts, evlb_* video details, item visible/timestamp flags, and the per-result error moved into details (breaking output); local_results sub_type is now a closed set (breaking output). Added SearchEngine.to_record()/save_record(), optimized the parse hot path, and renamed the internal search_methods subpackage to searchers (the public SearchEngine imports are unchanged)
- 0.9.0: Breaking internal rewrite of the parse pipeline onto selectolax (lexbor backend) for ~2x faster parsing, dropping the BeautifulSoup + lxml runtime dependencies. The parse_serp/SearchEngine API and output schema are unchanged, but make_soup/load_soup now return a selectolax node and the right-hand knowledge-panel rows are retyped to type=side_bar. Also broadens kp-wholepage knowledge-panel coverage, adds selection_* component types and a features.main_layout field, and ships the demos in-package via a single ws-demo command
See CHANGELOG.md for a longer history of changes by version.
# Install from PyPI
pip install WebSearcher
# Or install with uv
uv add WebSearcher
# Install development version from GitHub
pip install git+https://github.com/gitronald/WebSearcher@dev

WebSearcher ships runnable demos inside the package, so they work straight after pip install WebSearcher. Search and parse a query with ws-demo search, passing the query as the first argument:
uv run ws-demo search "election news"

This collects the SERP, parses it, and saves the outputs (described below). The other demos run the same way: ws-demo parse <file> (offline parse of one HTML file), ws-demo searches (a battery of queries spanning component types), ws-demo headers <query> (custom request headers), and ws-demo locations <query> (localized search). Search results change constantly, especially for news, but you can review the parsed components of any saved query with ws-demo show (add --details for a details column, --list to enumerate saved queries):
uv run ws-demo show "election news"

WebSearcher v0.9.0a0 | qry='election news' | 22 components
type              title                                                         url
----------------  ------------------------------------------------------------  ------------------------------------------------------------
ad                Latest Election News                                          https://www.election-integrity.org/news
top_stories       Latest on California governor election as public awaits r...  https://www.usatoday.com/story/news/politics/elections/20...
top_stories       California election results still undecided as Los Angele...  https://www.foxnews.com/politics/california-election-resu...
top_stories       California Governor Primary Election 2026 Live Results        https://www.nbcnews.com/politics/2026-primary-elections/c...
local_news        San Mateo County elections division has more than 100K ba...  https://localnewsmatters.org/2026/06/05/san-mateo-county-...
local_news        Sorry, Silicon Valley, it isn’t that easy to buy an election  https://sfstandard.com/2026/06/03/matt-mahan-silicon-vall...
general           California pushes back on Trump's primary election ...        https://www.nbcsandiego.com/news/local/california-trump-c...
general           5 things to know about California's election results          https://calmatters.org/politics/2026/06/primary-election-...
videos            Latest on California governor, L.A. mayor primary electio...  https://www.youtube.com/watch?v=--eGQRVD6ms
videos            KTLA 5 News Election Coverage: Votes continue to be ... Y...  https://www.youtube.com/watch?v=wMXxRGZHjKg
general           Elections 2026                                                https://www.npr.org/sections/elections/
general           Ballotpedia.org                                               https://ballotpedia.org/Main_Page
general           Election Night Results                                        https://electionresults.sos.ca.gov/
By default, the demo saves its outputs to a directory (data/demo-ws-{version}/) as JSON lines files: serps.json (the HTML plus search metadata), parsed.json (the parsed results and features), and searches.json (the search metadata only, excluding HTML).
ls -hal data/demo-ws-v0.9.0a0/
total 1020K
drwxr-xr-x 2 user user 4.0K 2024-11-11 10:55 ./
drwxr-xr-x 8 user user 4.0K 2024-11-11 10:54 ../
-rw-r--r-- 1 user user 16K 2024-11-11 10:55 parsed.json
-rw-r--r-- 1 user user 2.0K 2024-11-11 10:55 searches.json
-rw-r--r-- 1 user user 990K 2024-11-11 10:55 serps.json
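
Since the outputs are JSON lines files, they can be read back with the standard library alone. The sketch below is a minimal illustration, assuming one JSON object per line. read_json_lines is a hypothetical helper (not part of WebSearcher), and the sample records and their "qry" key are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def read_json_lines(path):
    """Read a JSON lines file into a list of dicts (one object per line)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Demonstration on a throwaway file; a real run would point at e.g. parsed.json.
# The records below are invented placeholders, not actual WebSearcher output.
demo_path = Path(tempfile.mkdtemp()) / "demo.json"
demo_path.write_text(
    '{"qry": "election news"}\n{"qry": "immigration news"}\n',
    encoding="utf-8",
)
demo_records = read_json_lines(demo_path)
print(len(demo_records))
```

Reading line by line keeps memory use flat for large files such as serps.json, which stores the full HTML of every search.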
Example search and parse pipeline (via requests):
import WebSearcher as ws
se = ws.SearchEngine() # 1. Initialize collector
se.search('election news') # 2. Conduct a search
se.parse_serp() # 3. Parse search results
se.save_serp(append_to='serps.json') # 4. Save HTML and metadata
se.save_parsed(append_to='parsed.json') # 5. Save parsed results

import WebSearcher as ws
# Initialize collector with method and other settings
se = ws.SearchEngine(
    method="selenium",
    selenium_config={
        "headless": False,
        "use_subprocess": False,
        "driver_executable_path": "",
        "version_main": None,  # auto-detected from installed Chrome when None
    }
)

se.search('election news')
# 2026-05-26 09:14:22.318 | INFO | WebSearcher.searchers | 200 | election news

The example below is primarily for parsing search results as you collect HTML.
See ws.parse_serp(html) for parsing existing HTML data.
se.parse_serp()
# Show first result
se.parsed.results[0]
{'section': 'main',
'cmpt_rank': 0,
'sub_rank': 0,
'type': 'ad',
'sub_type': 'standard',
'title': 'Latest Election News',
'url': 'https://www.election-integrity.org/news',
'text': 'Latest Election News',
'cite': 'https://www.election-integrity.org',
'details': None,
'serp_rank': 0}

Every result shares the same lean core fields (type, sub_type, title,
url, text, cite, plus the section / cmpt_rank / sub_rank / serp_rank
rank metadata). Anything extra lives in details, which is either None
(a clean row) or a dict that always carries a type:
# clean row -- nothing extra
{..., 'details': None}
# typed content payload (a specific label)
{..., 'details': {'type': 'ratings', 'rating': '4.6', 'n_reviews': '6.3K'}}
{..., 'details': {'type': 'hyperlinks', 'items': [{'url': '...', 'text': '...'}]}}
# metadata-only row (generic 'item' type): a parse error, a hidden
# carousel-tail card, an extracted timestamp/thumbnail, etc.
{..., 'details': {'type': 'item', 'error': 'no subcomponents parsed'}}
{..., 'details': {'type': 'item', 'visible': False, 'heading': 'What people are saying'}}
{..., 'details': {'type': 'item', 'timestamp': '2 hours ago', 'img_url': 'https://...'}}

The reserved metadata keys (error, visible, timestamp, img_url) are
recorded only when they carry information (visible only when False, the
others when present), so the common case keeps details as None.
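
When working with parsed rows downstream, the two-tier shape means every access to details has to allow for None. The sketch below shows one way to do that with plain dicts. The helper names are hypothetical (not part of WebSearcher), and the sample rows are made up to mirror the shapes shown above.

```python
from collections import Counter


def details_type(result):
    """Return the details type label, or None for a clean row."""
    details = result.get("details")
    return None if details is None else details.get("type")


def has_parse_error(result):
    """True when the row's details carry the reserved 'error' key."""
    details = result.get("details") or {}
    return "error" in details


def is_visible(result):
    """Rows count as visible unless details explicitly records visible=False."""
    details = result.get("details") or {}
    return details.get("visible", True)


# Invented rows following the documented shapes (core fields trimmed for brevity)
sample_results = [
    {"type": "ad", "details": None},
    {"type": "general", "details": {"type": "ratings", "rating": "4.6", "n_reviews": "6.3K"}},
    {"type": "general", "details": {"type": "item", "error": "no subcomponents parsed"}},
    {"type": "videos", "details": {"type": "item", "visible": False, "heading": "What people are saying"}},
]

type_counts = Counter(details_type(r) for r in sample_results)
error_rows = [r for r in sample_results if has_parse_error(r)]
visible_rows = [r for r in sample_results if is_visible(r)]
```

Treating a missing visible key as True follows the rule that the flag is recorded only when it is False.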
Recommended: Append HTML and metadata as lines to a JSON file for larger or ongoing collections.
se.save_serp(append_to='serps.json')

Alternative: Save individual HTML files in a directory, named by a provided or (default) generated serp_id. Useful for smaller qualitative explorations where you want to quickly look at what is showing up. No metadata is saved, but timestamps could be recovered from the files themselves.
se.save_serp(save_dir='./serps')

Save to a JSON lines file.
se.save_parsed(append_to='parsed.json')

To conduct localized searches--from a location of your choice--you only need
one additional data point: The "Canonical Name" of each location. These are
available online, and can be downloaded using a built-in function
(ws.download_locations()) to check for the most recent version.
A brief guide on how to select a canonical name and use it to conduct a
localized search is available in a Jupyter notebook here.
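
As a rough illustration of selecting a canonical name, the sketch below filters a locations table by place name and country. It assumes the column layout of Google's published geotargets CSV (including "Name", "Canonical Name", and "Country Code"); the file that ws.download_locations() produces may differ. find_canonical_names is a hypothetical helper, and the rows use placeholder IDs.

```python
import csv
import io

# Illustrative rows only; IDs are placeholders, not real criteria IDs.
SAMPLE_CSV = '''Criteria ID,Name,Canonical Name,Parent ID,Country Code,Target Type,Status
1,Boston,"Boston,Massachusetts,United States",10,US,City,Active
2,Boston,"Boston,England,United Kingdom",20,GB,City,Active
3,Massachusetts,"Massachusetts,United States",30,US,State,Active
'''


def find_canonical_names(csv_file, name, country_code=None):
    """Return canonical names whose Name matches, optionally within one country."""
    matches = []
    for row in csv.DictReader(csv_file):
        if row["Name"].lower() != name.lower():
            continue
        if country_code and row["Country Code"] != country_code:
            continue
        matches.append(row["Canonical Name"])
    return matches


all_bostons = find_canonical_names(io.StringIO(SAMPLE_CSV), "Boston")
boston_us = find_canonical_names(io.StringIO(SAMPLE_CSV), "Boston", country_code="US")
print(boston_us)
```

Place names are often ambiguous, so narrowing by country code (or checking the full canonical string by eye) helps avoid localizing a search to the wrong city.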
The browser backends (selenium -- the default -- plus the optional patchright and
zendriver) drive a real, visible Chrome: search engines reliably block Chrome's own
--headless mode, so the browser must run headed. On a server, CI runner, or container with
no display ($DISPLAY unset), a headed Chrome has nothing to attach to and won't launch. (The
requests backend is pure HTTP and needs no display -- this only applies to the browser
backends.)
The fix is Xvfb, an in-memory X display server: it lets Chrome run genuinely headed -- no headless code path, no monitor, no GPU. Install it (Debian/Ubuntu):
sudo apt-get install -y xvfb

Then wrap your collection command with xvfb-run:
env -u DISPLAY xvfb-run -a --server-args="-screen 0 1920x1080x24" \
    python your_collection_script.py

- env -u DISPLAY removes any inherited display so the run can't silently fall back to a real one (e.g. an X-forwarded SSH session) -- the display Xvfb creates is then the only one in scope.
- xvfb-run -a auto-picks a free display number, so concurrent jobs don't collide.
- -screen 0 1920x1080x24 gives a realistic window geometry.
The collection code itself is unchanged:
import WebSearcher as ws
se = ws.SearchEngine() # default browser backend, headed
se.search("immigration news")
se.parse_serp()
se.save_serp(append_to="serps.json")

If you parallelize collection across processes, one shared Xvfb covers them all -- child
workers inherit the parent's DISPLAY, so wrap the top-level command once rather than starting
an Xvfb per worker.
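
If a job is launched from Python rather than a shell, the same wrapper can be assembled as an argument list. This is a minimal sketch, not part of WebSearcher: needs_virtual_display and wrap_with_xvfb are hypothetical helpers, and the script name is the placeholder used above.

```python
import os
import shlex


def needs_virtual_display(environ=None):
    """True when DISPLAY is unset or empty, so a headed browser has nothing to attach to."""
    environ = os.environ if environ is None else environ
    return not environ.get("DISPLAY")


def wrap_with_xvfb(command, screen="1920x1080x24"):
    """Prefix a command (list of args) with the env/xvfb-run wrapper shown above."""
    return [
        "env", "-u", "DISPLAY",               # drop any inherited display
        "xvfb-run", "-a",                     # auto-pick a free display number
        f"--server-args=-screen 0 {screen}",  # realistic window geometry
        *command,
    ]


collection_cmd = ["python", "your_collection_script.py"]
wrapped_cmd = wrap_with_xvfb(collection_cmd)
print(shlex.join(wrapped_cmd))
```

The resulting list can be handed to subprocess.run(wrapped_cmd, check=True) once at the top level, so worker processes started by the script inherit the single Xvfb display.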
Happy to have help! If you see a component that we aren't covering yet, please add it using the process below. If you aren't sure about how to write a parser, you can also create an issue and I'll try to check it out. When creating that type of issue, providing the query that produced the new component and the time it was seen are essential, a screenshot of the component would be helpful, and the HTML would be ideal. Feel free to reach out if you have questions or need help.
- Examine parser names in /parsers/components/__init__.py
- Find parser file as /parsers/components/{cmpt_name}.py.

- Add classifier to classifiers/{main,footer,headers}.py
- Add parser as new file in /parsers/components
- Add new parser to imports and catalogue in /parsers/components/__init__.py
Run tests:
uv run pytest tests/ -q
Update snapshots:
uv run pytest tests/ --snapshot-update
Show snapshot diffs with -vv:
uv run pytest tests/ -vv
Run a specific snapshot test by serp_id prefix:
uv run pytest tests/ -k "4f4d0fed0592"
Tests load from the consolidated compressed corpus tests/fixtures/serps.json.bz2. After adding or updating records, refresh the snapshots:
uv run pytest tests/ --snapshot-update

Test Workflow (.github/workflows/test.yml)
Runs the test suite on every push to dev.
Release Workflow (.github/workflows/publish.yml)
Publishes to PyPI when a pull request is merged into master:
- Builds the package using uv
- Publishes using trusted publishing (no API tokens required)
To release a new version:
- Merge dev into master via PR
- Once merged, the package is automatically published to PyPI
Many of the packages I've found for collecting web search data via Python are no longer maintained, but others are still ongoing and interesting or useful. The primary strengths of WebSearcher are its parser, which provides a level of detail that enables examinations of SERP composition by recording the type and position of each result, and its modular design, which has allowed us to (intermittently) maintain it for so long and to cover such a wide array of component types (currently 45 without considering sub_types). Feel free to add to the list of packages or services through a pull request if you are aware of others:
- https://github.com/jarun/googler
- http://googolplex.sourceforge.net
- https://github.com/Jayin/google.py
- https://github.com/ecoron/SerpScrap
- https://github.com/henux/cli-google
- https://github.com/Kaiz0r/netcrawler
- https://github.com/nabehide/WebSearch
- https://github.com/NikolaiT/se-scraper
- https://github.com/rrwen/search_google
- https://github.com/howie6879/magic_google
- https://github.com/rohithpr/py-web-search
- https://github.com/MarioVilas/googlesearch
- https://github.com/aviaryan/python-gsearch
- https://github.com/nickmvincent/you-geo-see
- https://github.com/anthonyhseb/googlesearch
- https://github.com/KokocGroup/google-parser
- https://github.com/vijayant123/google-scrap
- https://github.com/BirdAPI/Google-Search-API
- https://github.com/bisoncorps/search-engine-parser
- https://github.com/the-markup/investigation-google-search-audit
- http://googlesystem.blogspot.com/2008/04/google-search-rest-api.html
- https://valentin.app
- https://app.samuelschmitt.com/
Copyright (C) 2017-2026 Ronald E. Robertson
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.