Proof of Concept for using Scrapy to write and execute scrapers that obtain open civic data.
- Install the necessary version of Python using `pyenv`
- If necessary, `pip install poetry`
- `poetry install`

You should now have the `scrapy` tool installed in your Poetry environment.
Warning: there is currently a problem with the project structure: importing from the `core` package in a scraper nested a couple of layers deep in the `scrapers` package fails. I can only run this with a modification to `PYTHONPATH` (which PyCharm adds by default, so running in PyCharm is fine). To set `PYTHONPATH` if you want to run this in a terminal:
- Find the path to your Poetry environment folder: `poetry env info`
- Set `PYTHONPATH` to include both the root folder of this repo and the `site-packages` folder of the Poetry env:

  `export PYTHONPATH='/home/jesse/repo/openstates/scrapy-test:/home/jesse/.cache/pypoetry/virtualenvs/scrapy-test-vH1KNbGC-py3.9/lib/python3.9/site-packages'`

- You will need to change the paths in the command above to match those on your machine.
python -m scrapy.cmdline crawl nv-bills -a session=2023Special35 -O nv-bills-scout.json -s "ITEM_PIPELINES={}"
- This command disables the `DropStubsPipeline` (`ITEM_PIPELINES={}`), which by default drops stub entities.
- Results are output to `nv-bills-scout.json`.
python -m scrapy.cmdline crawl nv-bills -a session=2023Special35 -O nv-bills.json -s "DOWNLOADER_MIDDLEWARES={}"
- This command disables the `ScoutOnlyDownloaderMiddleware` (`DOWNLOADER_MIDDLEWARES={}`), which by default ignores requests that are not marked `is_scout` in the `meta` property of the request.
- Results are output to `nv-bills.json`.
- Please note that the scraper is not fully ported over yet, so there is still missing data.
John did a scrapy PoC in the (private) Plural Engineering Experiments repo.
- Very popular: easy to find developers who are already familiar with it
- Very mature: battle-tested layers of abstraction, with the flexibility to meet our goals
- Reduces the overall surface area of "in-house" code we need to maintain
One way to think of success for this project: can it achieve most or all of the goals of the spatula project, without requiring much custom code?
- Can we run a scout scrape that returns basic info on entities without making the extra requests required for retrieving full info on entities?
- Can we run a normal scrape that does not output the partial/stub info generated by scout code?
- Are there barriers involved in using necessary elements of `openstates-core` code here? For instance, we want to be able to easily port code and continue to use the core entity models with their helper functions, etc.
- Can the scraper output an equivalent directory of JSON files that can be compared 1:1 to an existing scraper?
We have a repository of existing open data scrapers in openstates-scrapers. These form the baseline of quality and expected output for scrapers in this test repository.
Those scrapers rely on some shared code from a PyPI package called `openstates`, the code for which is found here. There is currently a barrier to adding that shared code to this repo (see Problems below), so some of that shared code is temporarily copied into this repo.
Some technical notes regarding porting code:
- All the scrapers in the `scrapers_next` folder use the spatula scraper framework. A few of the ones in the `scrapers` folder do as well (see `nv/bills.py`). But most of the scrapers in the `scrapers` folder use an older framework called scrapelib.
- There are often multiple requests needed to compile enough data to fully represent a Bill, so these sequences of requests and parsing can end up looking like long procedures (scrapelib) or nested abstractions where it's not clear how they are tied together (spatula). In scrapy, we should handle this by yielding Requests that pass along the partial entity using `cb_kwargs` (see the sketch after this list).
  - In spatula, you'll see a pattern where subsequent parsing functions access `self.input` to get that partial data. In scrapy, passed-down partial data is available as a named kwarg, such as `bill` or `bill_stub`.
- Fundamentally, the CSS and XPath selectors remain the same; just some of the syntax around them changes:
  - `doc.xpath()` or `self.root.xpath()` becomes `response.xpath()`
  - `CSS("#title").match_one(self.root).text` becomes `response.css("#title::text").get()`
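As a rough sketch of both points (a hypothetical spider; the name, URL, and selectors are placeholders, not the actual nv-bills code), a scrapy callback chain that passes a partial bill along with `cb_kwargs` and reads fields with `response.css()`/`response.xpath()` might look like this:

```python
import scrapy


class ExampleBillSpider(scrapy.Spider):
    # Hypothetical sketch only; name, URL, and selectors are placeholders.
    name = "example-bills"
    start_urls = ["https://example.gov/bills"]  # placeholder URL

    def parse(self, response):
        # Build a partial bill from the listing page, then follow the detail link,
        # handing the partial entity to the next callback via cb_kwargs.
        for link in response.css("table.bills a"):
            partial = {"title": link.css("::text").get()}
            yield response.follow(
                link,
                callback=self.parse_detail,
                cb_kwargs={"bill": partial},  # replaces spatula's self.input
            )

    def parse_detail(self, response, bill):
        # The partial data arrives as the named kwarg `bill`.
        # e.g. CSS("#summary").match_one(self.root).text -> response.css("#summary::text").get()
        bill["summary"] = response.xpath("//div[@id='summary']/text()").get()
        yield bill
```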
The scrapy-based scrapers need to perform at least as well as the equivalent scraper in that repo.
The most important expectation to meet is that the new scraper must be at least as information-complete and accurate as the old scraper. Is the output the same (or better)? See documentation on Open States scrapers.
Old Open States scrapers will output a JSON file for each scraped item to a local directory, `./_data`:
jesse@greenbookwork:~/repo/openstates/openstates-scrapers/scrapers/_data$ cd nv
jesse@greenbookwork:~/repo/openstates/openstates-scrapers/scrapers/_data/nv$ ls -alh
total 56K
drwxrwxr-x 2 jesse jesse 4.0K Nov 6 18:29 .
drwxrwxr-x 6 jesse jesse 4.0K Nov 5 19:36 ..
-rw-rw-r-- 1 jesse jesse 2.1K Nov 6 18:28 bill_9f6f717c-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 2.1K Nov 6 18:28 bill_a2526bec-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 13K Nov 6 18:29 bill_a54dad84-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 1.9K Nov 6 18:29 bill_a8d58f30-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 2.1K Nov 6 18:29 bill_abc9288c-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 3.8K Nov 6 18:28 jurisdiction_ocd-jurisdiction-country:us-state:nv-government.json
-rw-rw-r-- 1 jesse jesse 171 Nov 6 18:28 organization_9dac2f10-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 187 Nov 6 18:28 organization_9dac2f11-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 189 Nov 6 18:28 organization_9dac2f12-7d04-11ee-aeef-01ae5adc5576.json
The above output was created by running the following command from within the `scrapers` subdirectory:
poetry run python -m openstates.cli.update --scrape nv bills session=2023Special35
(Nevada session `2023Special35` is a nice example because it is quick: only 5 bills.)
We can compare this output to the `nv-bills.json` output mentioned above.
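If it helps, a rough comparison sketch (not part of this repo; the paths and field names below are assumptions and may need adjusting to the actual output schemas) could load both outputs and diff the bill identifiers and titles:

```python
import json
from pathlib import Path

# Rough comparison sketch; adjust paths and field names to the real output schemas.
old_bills = {}
for path in Path("openstates-scrapers/scrapers/_data/nv").glob("bill_*.json"):
    data = json.loads(path.read_text())
    old_bills[data.get("identifier")] = data.get("title")

new_bills = {
    item.get("identifier"): item.get("title")
    for item in json.loads(Path("nv-bills.json").read_text())
}

print("only in old scraper:", sorted(set(old_bills) - set(new_bills)))
print("only in new scraper:", sorted(set(new_bills) - set(old_bills)))
for ident in sorted(set(old_bills) & set(new_bills)):
    if old_bills[ident] != new_bills[ident]:
        print("title mismatch for", ident)
```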
Other evaluation criteria:
- A scraper for bills should accept a `session` argument that takes a legislative session identifier (string). See example.
- Comments that provide context for otherwise-opaque HTML selectors/traversal are helpful!
- Long procedures should be broken into reasonably-sized functions. Often it makes sense to have a separate function for handling sub-entities, e.g. `add_actions()`, `add_sponsors()`, `add_versions()`, etc.
- `Request`s that are required to get the `BillStub` level of info should have `"is_scout": True` set in the `meta` arg. This allows us to run the scraper in "scout" mode: only running the minimum requests needed for basic info, so we can frequently assess when new entities are posted (and avoid flooding the source with requests). A sketch of these conventions follows this list.
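A minimal sketch of how those conventions can fit together in a spider (hypothetical names, URL, fields, and selectors; not copied from this repo):

```python
import scrapy


class ExampleBillsSpider(scrapy.Spider):
    # Hypothetical sketch of the criteria above; all names and URLs are placeholders.
    name = "example-bills"

    def __init__(self, session=None, *args, **kwargs):
        # The session identifier arrives via `-a session=2023Special35`.
        super().__init__(*args, **kwargs)
        self.session = session

    def start_requests(self):
        # This request is enough to produce BillStub-level info, so it is
        # marked as a scout request and still runs during a scout scrape.
        yield scrapy.Request(
            f"https://example.gov/bills?session={self.session}",  # placeholder URL
            callback=self.parse_listing,
            meta={"is_scout": True},
        )

    def parse_listing(self, response):
        for row in response.css("table.bills tr"):
            stub = {"identifier": row.css("td.bill-id::text").get()}
            yield stub  # stub item: kept in scout mode, dropped in a normal scrape
            # Detail requests are not marked is_scout, so a scout run skips them.
            yield response.follow(
                row.css("a::attr(href)").get(),
                callback=self.parse_detail,
                cb_kwargs={"bill": dict(stub)},
            )

    def parse_detail(self, response, bill):
        # Sub-entities are handled by small helper functions, not one long procedure.
        self.add_sponsors(bill, response)
        yield bill

    def add_sponsors(self, bill, response):
        bill["sponsors"] = response.css("ul.sponsors li::text").getall()
```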
- The `spatula` library specifies an older version of the `attrs` package as a dependency. `scrapy` also has `attrs` as a dependency. These versions conflict. And since `openstates` has `spatula` as a dependency, we currently cannot add `openstates` as a dependency to this project! To work around this quickly, I copied a bunch of library code out of the `openstates-core` repo and into the `core` package within this repo. This is a very temporary solution.
- Scraper is not fully ported.
- Pass input properties from the command line to the scraper using the `-a` flag, i.e. `-a session=2023Special35`. This allows us to copy `os-update` behavior, where we can pass runtime parameters to the scraper.
- Override scrapy settings at runtime with the `-s` flag. This allows us to set things like which item pipelines and middlewares are enabled at runtime, which lets us switch between scout and normal scrape behavior.
- Item pipelines do things with items returned by scrapers. We use one to drop "stub" items when in normal scrape mode.
- Downloader middleware allows us to change the behavior of a Request before it is made. Currently we require the scraper to mark `is_scout: True` on the `meta` property of the Request, so that we can ignore non-scout requests when desired. A rough sketch of both pieces is below.
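A minimal sketch of how such a pipeline and middleware can be written (the repo's actual `DropStubsPipeline` and `ScoutOnlyDownloaderMiddleware` may be implemented differently, e.g. identifying stubs by item class rather than a field):

```python
from scrapy.exceptions import DropItem, IgnoreRequest


class DropStubsPipelineSketch:
    # Sketch: drop stub items during a normal scrape.
    def process_item(self, item, spider):
        if item.get("is_stub"):  # assumed marker; the real pipeline may check the item type
            raise DropItem("stub item dropped in normal scrape mode")
        return item


class ScoutOnlyDownloaderMiddlewareSketch:
    # Sketch: during a scout scrape, ignore any request not marked is_scout in meta.
    def process_request(self, request, spider):
        if not request.meta.get("is_scout"):
            raise IgnoreRequest("non-scout request skipped in scout mode")
        return None  # let scout-marked requests proceed normally
```

Classes like these would be enabled in the project's scrapy settings via `ITEM_PIPELINES` and `DOWNLOADER_MIDDLEWARES`, which is why passing `-s "ITEM_PIPELINES={}"` or `-s "DOWNLOADER_MIDDLEWARES={}"` at runtime disables them.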