All notable changes to this project will be documented in this file. The format of this file follows these guidelines. This project adheres to Semantic Versioning.
- Nothing
- `selenium` 4.1.0 -> 4.5.0
- CircleCI `setup_remote_docker` version 19.03.13 -> 20.10.17
- Nothing
- Nothing
- `install-chrome.sh` was failing; the error message suggested it needs an `apt-get update -y` before `apt-get install`, so the update was added
- `dev-env` 0.6.19 -> 0.6.21
- change `generate-circleci-config.py` to start using CircleCI Scheduled Pipelines
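The `install-chrome.sh` fix above follows the standard apt pattern of refreshing package indexes before installing; a minimal sketch (the package name is illustrative, not necessarily what the script installs):

```shell
#!/usr/bin/env bash
set -e

# A stale package index is a common cause of "Unable to locate package"
# and 404 errors during apt-get install, so refresh it first.
apt-get update -y
apt-get install -y --no-install-recommends google-chrome-stable
```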
- Nothing
- added `resource_class: medium` to the CircleCI config generated by `generate-circleci-config.py`
- `dev-env` 0.6.17 -> 0.6.19
- Nothing
- added sample spider `alpine_releases.py`
- added `--pretty` command line option to `run-sample.sh`
- simple approach to skipping CircleCI build, test and deploy of runtime and runtime lite docker images - very useful during development when upgrading major things like Python and/or OS versions
- added explicit resource class to CircleCI config
- `dev-env` 0.6.13 -> 0.6.17
- `python-dateutil` 2.8.1 -> 2.8.2
- `selenium` 3.141.0 -> 4.1.0
- `bin/install-chromedriver.sh` was failing for newer versions of Chromium because the format returned by `chromium-browser --version` changed; fixed this problem
- for runtime lite, Alpine base image 3.12 -> 3.15
- refined `bin/install-chromedriver.sh` output when installing on Alpine
- simonsdave/bionic-dev-env:v0.6.14 -> simonsdave/focal-dev-env:v0.6.16
- fixed `install-chrome.sh` usage message
- added 2022 to `License`
- removed LGTM workflows and badges in main README.md
- Nothing
- fixed how `generate-circleci-config.py` uses/calls `int-test-run-all-spiders-in-ci-pipeline.py`
- Nothing
- Nothing
- fixed a silly bug in `int-test-run-all-spiders-in-ci-pipeline.py` that made the command unusable; also put in real Python logging and real command line option handling for this command
- Nothing
- Nothing
- update `generate-circleci-config.py` to eliminate the need for requirements.txt in spider repos
- runtime docker images no longer mark the samples' `__init__.py` as executable
- Nothing
- added optional `--samples` command line option to `spiders.py`
- added optional `samples` argument to the `SpiderDiscovery()` constructor
- added `categories` to spider metadata; if no categories are specified then the name of the package containing the spider is assumed to be the category name; the only place categories are currently used is in the API as a means to group spiders
- added `absoluteFilename` property to spider metadata; this value is generated by Cloudfeaster
- added `fullyQualifiedClassName` property to spider metadata; this value is generated by Cloudfeaster
- docker based development environment now parses the repo's setup.py for pre-reqs that need to be installed when the development docker image is built; this change enabled the removal of requirements.txt from the repo's root directory
- change format of metadata returned by `spiders.py` and `cloudfeaster.spider.Spider`
- Nothing
- Circle CI pipeline now saves generated python distributions as Circle CI artifacts
- added `int-test-run-all-spiders-in-ci-pipeline.py` which is intended for use in spider repo CI pipelines
- Nothing
- Nothing
- use `update-alternatives` in runtime docker image so `python` "points to" `python3.7`
- `cloudfeaster-lite` docker image is now based on Alpine 3.12 (used to be Alpine 3.8)
- `install-chrome.sh` now able to install both Chrome and Chromium based on command line switches
- `install-chromedriver.sh` determines which version of chromedriver to install based on which version of Chrome or Chromium is installed; see this for a complete description of the version selection process
- default Chrome command line options are now
  - `--headless`
  - `--window-size=1280x1024`
  - `--no-sandbox`
  - `--disable-dev-shm-usage`
  - `--disable-gpu`
  - `--disable-software-rasterizer`
  - `--single-process`
  - `--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36`
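The `update-alternatives` change above can be sketched as follows (a sketch assuming the standard Ubuntu interpreter paths, not the image's exact Dockerfile commands):

```shell
# Register python3.7 as an alternative for the "python" command
# and select it, so "python" resolves to python3.7 in the image.
update-alternatives --install /usr/bin/python python /usr/bin/python3.7 1
update-alternatives --set python /usr/bin/python3.7

# Confirm which interpreter "python" now points to.
python --version
```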
- Nothing
- Nothing
- `install-dev-env-scripts.sh` now requires a virtual env
- `dev-env` v0.6.12 -> v0.6.13
- Nothing
- Nothing
- `dev-env` 0.6.11 -> 0.6.12
- Nothing
- Nothing
- fixed CircleCI pipeline for releases
- Nothing
- generate `cloudfeaster` and `cloudfeaster-lite` docker images which can be used as the basis for building docker images of spiders
- remove extra whitespace @ end of `generate-circleci-config.py` output
- `spiders.py` output now includes a `_metadata` property
- Nothing
- added `spiders.py` to enable infrastructure spider discovery
- added `get-clf-version.sh` to encapsulate the pattern of parsing `setup.py` to extract the cloudfeaster version
- added runtime docker image for running spiders
- `_metadata.spider.name` in spider output is now the name of the file containing the spider rather than the spider's class name; this change was made as a result of learning more about the spider hosting infrastructure
- selenium 3.14.0 -> 3.141.0
- `generate-circleci-config.py` now generates a CircleCI config file that packages all spiders in a docker image
- `install-chromedriver.sh` now installs the ChromeDriver version based on the installed Google Chrome version
- removed `Browser.wait_for_login_to_complete()` and `Browser.wait_for_signin_to_complete()` because they used an old sync pattern and the methods really weren't being used anymore
- add CircleCI docker executor authenticated pull
- per this article, added explicit version to `setup_remote_docker` in CircleCI pipeline
- add CircleCI docker executor authenticated pull for CircleCI config generated by `generate-circleci-config.py`
- Nothing
- Nothing
- Nothing
- logging level in `generate-circleci-config.py` changed from `INFO` to `DEBUG`, which is intended to make it simpler to debug CI pipeline crawl failures
- Nothing
- Nothing
- fix: `generate-circleci-config.py` was generating references to the docker image `simonsdave/cloudfeaster-bionic-dev-env` instead of `simonsdave/cloudfeaster-dev-env`
- fix: docker image badge in main `README.md`
- Nothing
- add clair-cicd docker image vulnerability assessment to CircleCI pipeline
- add LGTM badges to main `README.md`
- `dev-env` 0.6.10 -> 0.6.11
- changes to `generate-circleci-config.py` to improve reliability of capturing crawl results when crawls fail
- Nothing
- Nothing
- fix `bin/install-dev-env-scripts.sh` to work without `dev_env/dev-env-version.txt`
- Nothing
- Nothing
- dev-env v0.6.8 -> v0.6.10
- eliminated the nasty looking `Warning: apt-key output should not be parsed (stdout is not a terminal)` message generated by `bin/install-chrome.sh`
- Nothing
- Nothing
- `run-spider.sh` now outputs only JSON
- dev-env v0.6.7 -> v0.6.8
- Nothing
- Nothing
- `generate-circleci-config.py` adds back `run-pip-check.sh` to the generated CircleCI pipeline, which now works after the upgrade to Python 3.7
- Nothing
- Nothing
- `pip3 install` -> `python3.7 -m pip install`
- dev-env v0.6.6 -> v0.6.7
- Nothing
- Nothing
- fix bug in `generate-circleci-config.py` which was generating a `KeyError: 'CRAWL_OUTPUT'` error
- Nothing
- add comprehensive artifact storage for spiders run by the CircleCI workflow generated by `generate-circleci-config.py`
- remove debugging statement from `run-all-spiders.sh`
- Nothing
- Nothing
- fix: `generate-circleci-config.py` had outstanding problems from the Python 2.7 -> 3.7 migration
- usability: improve usability of `run-all-spiders.sh` and `run-spider.sh` in spider repos
- docs: fix docker image badge in main README.md
- Nothing
- add `--verbose` command line argument to `docker_image_integration_tests.sh`
- add `--verbose` and `--debug` command line options to `run-sample.sh`
- add `CrawlDebugger` and use in sample spiders; start of improving debugging
- setting `CLF_DEBUG` can now be used to generate `spiderLog` and `chromeDriverLog` in spider output
- add `CrawlResponse.SC_UNKNOWN`
- when a spider fails, all attempts are made to take a screenshot of the browser window
- `spiderArgs` in crawl results is now `crawlArgs`
- `run-spider.sh` now accepts the full file name of a spider rather than just the base name, so `run-spider.sh xe_exchange_rates` is now `run-spider.sh xe_exchange_rates.py`
- python-dateutil 2.8.0 -> 2.8.1
- dev-env v0.5.25 -> v0.6.6
- MATERIAL CHANGE: Python 2.7 -> Python 3.7
- Nothing
- Nothing
- Nothing
- remove Snyk from CI pipeline & docs
- add more `BeautifulSoup` and `Scrapy` doc references
- dev-env 0.5.21 -> 0.5.25
- add Codecov upload to CircleCI pipeline
- `SpiderCrawler` has `chromedriver_log_file`, allowing callers access to the ChromeDriver debug logs when the `debug` property for `SpiderCrawler` is set to `True`
- `_debug` property in crawl response under all circumstances
- Nothing
- fix logging of CLF_CHROME value
- Nothing
- Nothing
- `bin/install-chromedriver.sh` installs chromedriver 2.46 -> 2.43, motivated by this
- Nothing
- Nothing
- selenium 3.141.0 -> selenium==3.14.0 motivated by this
- Nothing
- Nothing
- `dev-env` 0.5.20 -> 0.5.21
- `install-dev-env-scripts.sh` now uses the `install-dev-env.sh` `--dev-env-version` command line option
- Nothing
- Nothing
- install `dev-env` using `install-dev-env.sh` instead of `pip install`
- Nothing
- Nothing
- dev-env 0.5.19 -> 0.5.20
- Nothing
- add `bin/generate-circleci-config.py` to setup.py
- Nothing
- Nothing
- Nothing
- fix `bin/generate-circleci-config.py` that was generating incorrect CircleCI config
- Nothing
- add `bin/check-circleci-config.sh` to setup.py as a script; should have done this when adding `bin/check-circleci-config.sh`
- Nothing
- Nothing
- add `bin/check-consistent-clf-version.sh`
- add `bin/generate-circleci-config.py`
- `bin/install_chrome.sh` -> `bin/install-chrome.sh`
- `bin/install_chromedriver.sh` -> `bin/install-chromedriver.sh`
- remove `bin/chromedriver_version.sh` since this script was no longer used
- remove `dev-env-version` label from docker image since this label is no longer used
- add `check-consistent-clf-version.sh` to `setup.py` as a script which is installed as part of the Cloudfeaster python package
- Nothing
- Nothing
- add `install-dev-env-scripts.sh` for use in CircleCI pipeline
- Nothing
- Nothing
- add `check-consistent-dev-env-version.sh` to CircleCI pipeline
- add `run-bandit.sh` to CircleCI pipeline
- add `.cut-release-version.sh` in support of using new revs of dev-env
- renamed `run_sample.sh` -> `run-sample.sh`
- the `ttlInSeconds` property in spider metadata is now `ttl` and the value associated with the property is now a string instead of an integer; the string has the form `<number><duration>` where `<number>` is a non-zero integer and `<duration>` is one of `s`, `m`, `h` or `d`, representing seconds, minutes, hours and days respectively
- the `maxCrawlTimeInSeconds` spider metadata property is now `maxCrawlTime` and is also a string instead of an integer; the string has the form `<number><duration>` where `<number>` is a non-zero integer and `<duration>` is one of `s` or `m`, representing seconds and minutes respectively
- `dev-env` 0.5.15 -> 0.5.19
- sha1 -> sha256 after running bandit
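The `<number><duration>` strings described above can be parsed with a short helper; this sketch is illustrative only and is not part of Cloudfeaster's API:

```python
import re

# Seconds per duration unit: s = seconds, m = minutes, h = hours, d = days.
_UNITS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}


def parse_duration(value):
    """Parse a '<number><duration>' string (ex '90s' or '15m') into seconds."""
    match = re.fullmatch(r'(\d+)([smhd])', value)
    if not match or int(match.group(1)) == 0:
        raise ValueError('expected <non-zero number><s|m|h|d> - got %r' % value)
    return int(match.group(1)) * _UNITS[match.group(2)]
```

For example, `parse_duration('15m')` converts a `maxCrawlTime` of fifteen minutes into 900 seconds.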
- Nothing
- `bin/install-dev-env-scripts.sh` can now be used by spider repos to add dev env host scripts to a spider repo's host env
- Nothing
- Nothing
- added `run-all-spiders.sh` and `run-spider.sh`
- by default Chrome is now started with `--no-sandbox`, which should mean that Chrome can run as root, which removes a whole host of complexity
- Nothing
- Nothing
- .travis.yml now runs `run_repo_security_scanner.sh`
- added `xe_exchange_rates.py` sample spider
- added sha1 hash of a spider's args to spider output
- ChromeDriver 2.38 -> 2.46
- Selenium 3.12.0 -> 3.141.0
- twine 1.11.0 -> 1.12.1
- dateutil 2.7.3 -> 2.7.5
- material simplification of the way to use `run_sample.sh`
- removed `bank_of_canada_daily_exchange_rates.py` sample spider
- removed `spiderhost.py`, `spiderhost.sh`, `spiders.py` and `spiders.sh`
- Nothing
- Selenium 3.11.0 -> 3.12.0
- python-dateutil 2.7.2 -> 2.7.3
- spider metadata changed to camel case instead of snake case to get closer to these JSON style guidelines
- crawl results metadata now grouped in the `_metadata` property and use camel case instead of snake case
- crawl results are now validated against this jsonschema
- added spiders.sh and spiderhost.sh to enable the API for a docker image containing spiders to be expressed in a manner that's independent from Python and Webdriver
- Nothing
- support pip 10.x
- simonsdave/cloudfeaster docker image is now based on Ubuntu 16.04
- ChromeDriver 2.37 -> 2.38
- Nothing
- added cloudfeaster/samples/pypi.py sample spider
- spider metadata: url string property is now validated using jsonschema uri format instead of pattern
- selenium 3.9.0 -> 3.11.0
- python-dateutil 2.6.1 -> 2.7.2
- ChromeDriver 2.35 -> 2.37
- twine 1.10.0 -> 1.11.0
- identifying_factors and authenticating_factors properties will now always appear in `spiders.py` output
- Nothing
- Nothing
- samples/pypi_spider.py -> samples/pythonwheels_spider.py
- spider metadata property name change = max_concurrency -> max_concurrent_crawls
- Nothing
- Nothing
- Selenium 3.8.1 -> 3.9.0
- Nothing
- Nothing
- ChromeDriver 2.34 -> 2.35
- Nothing
- the max concurrency per spider property is now part of the output from `Spider.get_validated_metadata()` regardless of whether or not it is specified as part of the explicit spider metadata declaration
- added `paranoia_level` to spider metadata
- added `max_crawl_time_in_seconds` to spider metadata
- `ttl_in_seconds` now has an upper bound of 86,400 (1 day in seconds)
- `max_concurrency` now has an upper bound of 25
- Selenium 3.7.0 -> 3.8.1
- ChromeDriver 2.33 -> 2.34
- breaking change: `ttl` -> `ttl_in_seconds` in spider metadata
- Nothing
- added `.prep-for-release-master-branch-changes.sh` so package version number is automatically bumped when cutting a release
- `.prep-for-release-master-branch-changes.sh` now generates Python packages for PyPI from release branch
- bug fix in `.prep-for-release-release-branch-changes.sh` so links in main `README.md` work correctly after a release
- removed `cloudfeaster.util` module since it wasn't used
- added --log command line option to spiders.py
- added --samples command line option to spiders.py
- `cloudfeaster.webdriver_spider.WebElement` now has an `is_element_present()` method that functions just like `cloudfeaster.webdriver_spider.Browser`'s
- per this article, headless Chrome is now available and Cloudfeaster will use it by default, which means we're also able to remove the need for Xvfb; this is a really nice simplification and reduction in required crawling resources; also, because we're removing Xvfb, `bin/spiderhost.sh` was also removed
- selenium 3.3.3 -> 3.7.0
- requests 2.13.0 -> >=2.18.2
- ndg-httpsclient 0.4.2 -> 0.4.3
- ChromeDriver 2.29 -> 2.33
- simonsdave/cloudfeaster docker image now uses the latest version of pip
- removed all code related to Signal FX
- pypi_spider.py now included with distro in cloudfeaster.samples
- upgrade selenium 3.0.2 -> 3.3.3
- upgrade chromedriver 2.27 -> 2.29
- Nothing
- added _crawl_time to crawl results
- upgrade to ChromeDriver 2.27 from 2.24
- Nothing
- Nothing
- fix crawl response key errors: `_status` & `_status_code` in crawl response were missing the leading underscore for the following responses
  - `SC_CTR_RAISED_EXCEPTION`
  - `SC_INVALID_CRAWL_RETURN_TYPE`
  - `SC_CRAWL_RAISED_EXCEPTION`
  - `SC_SPIDER_NOT_FOUND`
- Nothing
- Nothing
- dev env upgraded to docker 1.12
- BREAKING CHANGE = selenium 2.53.6 -> 3.0.1, which required an upgrade to ChromeDriver from 2.22 to 2.24 since it turns out 2.22 does not work with selenium 3.0.1
- spider version # in crawl results now include hash algo along with the hash value
- BREAKING CHANGE = the spidering infrastructure augments crawl results with data such as the time to crawl, spider name & version number, etc; in order to more easily differentiate crawl results from augmented data, the top level property names for all augmented data are now prefixed with an underscore; as an example, below is the new output from running the PyPI sample spider
```
>./pypi_spider.py | jq .
{
  "virtualenv": {
    "count": 46718553,
    "link": "http://pypi-ranking.info/module/virtualenv",
    "rank": 5
  },
  "_status_code": 0,
  "setuptools": {
    "count": 63758431,
    "link": "http://pypi-ranking.info/module/setuptools",
    "rank": 2
  },
  "simplejson": {
    "count": 182739575,
    "link": "http://pypi-ranking.info/module/simplejson",
    "rank": 1
  },
  "requests": {
    "count": 53961784,
    "link": "http://pypi-ranking.info/module/requests",
    "rank": 4
  },
  "six": {
    "count": 54950976,
    "link": "http://pypi-ranking.info/module/six",
    "rank": 3
  },
  "_spider": {
    "version": "sha1:ccb6a042dd11f2f7fb7b9541d4ec888fc908a8ef",
    "name": "__main__.PyPISpider"
  },
  "_crawl_time_in_ms": 4773,
  "_status": "Ok"
}
```
- upgrade dev env to docker 1.12
- Nothing
- Nothing
- fixed bug that was duplicating crawl response data in `CrawlResponseOk`
- Nothing
- support docker 1.12
- version bumps for dependencies:
- chromedriver 2.22
- selenium 2.53.6
- requests 2.11.0
- ndg-httpsclient 0.4.2
- set of simplifications in dev env setup
- temporary removal of authenticated proxy support
- Cloudfeaster spiders can be developed on pretty much any operating system/browser combination that can run Selenium, but Cloudfeaster Services always runs spiders on Ubuntu and Chrome; some web sites present different responses to browser requests based on the originating browser and/or operating system; if, for example, development of a spider is done on Mac OS X using Chrome, the xpath expressions embedded in a spider may not be valid when the spider is run on Ubuntu using Chrome; to address this disconnect, spider authors can force Cloudfeaster Services to use a user agent header that matches their development environment by providing a value for the `user_agent` argument of the `Browser` class' constructor
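A sketch of pinning the user agent as described above; the user agent string itself is real, but how the `Browser` constructor is invoked within a spider is assumed here, so the call is shown as a comment:

```python
# User agent of the development machine's browser (macOS Chrome in this example),
# so Cloudfeaster Services sends request headers matching the dev environment.
DEV_USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/73.0.3683.86 Safari/537.36'
)

# Assumed usage inside a spider (full constructor signature not shown above):
#
#   browser = Browser(url, user_agent=DEV_USER_AGENT)
```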
- added proxy support to permit use of anonymity networks like those listed below; proxy support is exposed by 2 new flags in `spiderhost.py` (`--proxy` and `--proxy-user`)
```
>spiderhost.py --help
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spider hosts accept the name of a spider, the arguments to run the spider and
optionally proxy server details. armed with all this info the spider host runs
a spider and dumps the result to stdout.

Options:
  -h, --help            show this help message and exit
  --log=LOGGING_LEVEL   logging level
                        [DEBUG,INFO,WARNING,ERROR,CRITICAL,FATAL] - default =
                        ERROR
  --proxy=PROXY         proxy - default = None
  --proxy-user=PROXY_USER
                        proxy-user - default = None
>
>spiderhost.py --proxy=abc
Usage: spiderhost.py <spider> [<arg1> ... <argN>]
spiderhost.py: error: option --proxy: required format is host:port
>
>spiderhost.py --proxy-user=abc
Usage: spiderhost.py <spider> [<arg1> ... <argN>]
spiderhost.py: error: option --proxy-user: required format is user:password
>
```
- colorama now required to be at least version 0.3.5 instead of exactly 0.3.5
- command line args to bin/spiderhost.sh have been simplified - now just take spider name and spider args just as you'd expect - no more url encoding of args and ----- indicating no spider args
- like the changes to bin/spiderhost.sh, bin/spiderhost.py now just accepts regular command line arguments of a spider name and spider args - much easier
- bin/spiders.sh is no longer needed; callers now access bin/spiders.py directly rather than getting at bin/spiders.py through bin/spiders.sh
- not really the initial release but intro'ed CHANGELOG.md late
- initial clf commit to github was 13 Oct '13