Change Log

All notable changes to this project will be documented in this file. Format of this file follows these guidelines. This project adheres to Semantic Versioning.

[%RELEASE_VERSION%] - [%RELEASE_DATE%]

Added

  • Nothing

Changed

  • selenium 4.1.0 -> 4.5.0
  • CircleCI setup_remote_docker version 19.03.13 -> 20.10.17

Removed

  • Nothing

[0.9.59] - [2022-01-09]

Added

  • Nothing

Changed

  • install-chrome.sh was failing - error message suggested it needs an apt-get update -y before apt-get install so added the update
  • dev-env 0.6.19 -> 0.6.21
  • change generate-circleci-config.py to start using CircleCI Scheduled Pipelines

Removed

  • Nothing

[0.9.58] - [2022-01-03]

Added

  • added resource_class: medium to the CircleCI config generated by generate-circleci-config.py

Changed

  • dev-env 0.6.17 -> 0.6.19

Removed

  • Nothing

[0.9.57] - [2022-01-02]

Added

  • added sample spider alpine_releases.py
  • added --pretty command line option to run-sample.sh
  • simple approach to skipping CircleCI build, test and deploy of runtime and runtime lite docker images - very useful during development when upgrading major things like Python and/or OS versions
  • added explicit resource class to CircleCI config

Changed

  • dev-env 0.6.13 -> 0.6.17
  • python-dateutil 2.8.1 -> 2.8.2
  • selenium 3.141.0 -> 4.1.0
  • bin/install-chromedriver.sh was failing for newer versions of chromium because the format returned by "chromium-browser --version" changed - fix this problem
  • for runtime lite Alpine base image 3.12 -> 3.15
  • refined bin/install-chromedriver.sh output when installing on Alpine
  • simonsdave/bionic-dev-env:v0.6.14 -> simonsdave/focal-dev-env:v0.6.16
  • fixed install-chrome.sh usage message
  • added 2022 to License

Removed

  • removed LGTM workflows and badges in main README.md

[0.9.56] - [2021-03-09]

Added

  • Nothing

Changed

  • fixed how generate-circleci-config.py uses/calls int-test-run-all-spiders-in-ci-pipeline.py

Removed

  • Nothing

[0.9.55] - [2021-03-04]

Added

  • Nothing

Changed

  • fixed a silly bug in int-test-run-all-spiders-in-ci-pipeline.py that made the command unusable - also put in real python logging for this command and real command line option handling

Removed

  • Nothing

[0.9.54] - [2021-03-04]

Added

  • Nothing

Changed

  • update generate-circleci-config.py to eliminate the need for requirements.txt in spider repos
  • runtime docker images no longer mark the samples __init__.py as executable

Removed

  • Nothing

[0.9.53] - [2021-03-03]

Added

  • added optional --samples command line option to spiders.py
  • added optional samples argument to SpiderDiscovery() constructor
  • added categories to spider metadata - if no categories are specified then the name of the package containing the spider is assumed to be the category name - the only place that categories are currently used is in the API as a means to group spiders
  • added absoluteFilename property to spider metadata - this value is generated by Cloudfeaster
  • added fullyQualifiedClassName property to spider metadata - this value is generated by Cloudfeaster

Changed

  • docker based development environment now parses repo's setup.py for pre-reqs that need to be installed when the development docker image is built - this change enabled the removal of requirements.txt from the repo's root directory
  • change format of metadata returned by spiders.py and cloudfeaster.spider.Spider

Removed

  • Nothing

[0.9.52] - [2021-01-29]

Added

  • Circle CI pipeline now saves generated python distributions as Circle CI artifacts
  • added int-test-run-all-spiders-in-ci-pipeline.py which is intended for use in spider repo CI pipelines

Changed

  • Nothing

Removed

  • Nothing

[0.9.51] - [2021-01-24]

Added

  • use update-alternatives in runtime docker image so python "points to" python3.7

Changed

  • cloudfeaster-lite docker image is now based on Alpine 3.12 (used to be Alpine 3.8)

  • install-chrome.sh now able to install both Chrome and Chromium based on command line switches

  • install-chromedriver.sh determines which version of chromedriver to install based on which version of Chrome or Chromium is installed - see this for a complete description of the version selection process

  • default Chrome command line options are now

    • --headless
    • --window-size=1280x1024
    • --no-sandbox
    • --disable-dev-shm-usage
    • --disable-gpu
    • --disable-software-rasterizer
    • --single-process
    • --user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36
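As a rough illustration of how the default options listed above might be assembled, the sketch below builds the argument list in plain Python. The function name `default_chrome_args` and its `extra_args` parameter are hypothetical and are not part of Cloudfeaster; with Selenium, each string could be passed to `Options.add_argument()`.

```python
# Hypothetical helper for illustration only; Cloudfeaster's real
# implementation may assemble these options differently.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/73.0.3683.86 Safari/537.36"
)


def default_chrome_args(user_agent=DEFAULT_USER_AGENT, extra_args=()):
    """Return the default Chrome command line options as a list of strings."""
    args = [
        "--headless",
        "--window-size=1280x1024",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--disable-software-rasterizer",
        "--single-process",
        "--user-agent=%s" % user_agent,
    ]
    # Caller supplied options are appended after the defaults.
    args.extend(extra_args)
    return args
```

Keeping the defaults in one function makes it easy to override the user agent or append extra switches without repeating the full list.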

Removed

  • Nothing

[0.9.50] - [2020-12-29]

Added

  • Nothing

Changed

  • install-dev-env-scripts.sh now requires virtual env
  • dev-env v0.6.12 -> v0.6.13

Removed

  • Nothing

[0.9.49] - [2020-12-27]

Added

  • Nothing

Changed

  • dev-env 0.6.11 -> 0.6.12

Removed

  • Nothing

[0.9.48] - [2020-12-05]

Added

  • Nothing

Changed

  • fixed CircleCI pipeline for releases

Removed

  • Nothing

[0.9.47] - [2020-12-05]

Added

  • generate cloudfeaster and cloudfeaster-lite docker images which can be used as the basis for building docker images of spiders

Changed

  • remove extra whitespace @ EO generate-circleci-config.py output
  • spiders.py output now includes a _metadata property

Removed

  • Nothing

[0.9.46] - [2020-11-24]

Added

  • added spiders.py to enable infrastructure spider discovery
  • added get-clf-version.sh to encapsulate the pattern of parsing setup.py to extract the cloudfeaster version
  • added runtime docker image for running spiders

Changed

  • _metadata.spider.name in spider output is now name of file containing spider rather than spider's class name. This change was made as a result of learning more about the spider hosting infrastructure.
  • selenium 3.14.0 -> 3.141.0
  • generate-circleci-config.py now generates a CircleCI config file that packages all spiders in a docker image
  • install-chromedriver.sh now installs the ChromeDriver version based on the Google Chrome version

Removed

  • removed Browser.wait_for_login_to_complete() and Browser.wait_for_signin_to_complete() because they used an old sync pattern and the methods really weren't being used anymore

[0.9.45] - [2020-10-25]

Added

  • add CircleCI docker executor authenticated pull
  • per this article, added explicit version to setup_remote_docker in CircleCI pipeline
  • add CircleCI docker executor authenticated pull for CircleCI config generated by generate-circleci-config.py

Changed

  • Nothing

Removed

  • Nothing

[0.9.44] - [2020-08-30]

Added

  • Nothing

Changed

  • logging level in generate-circleci-config.py changed from INFO to DEBUG which is intended to make it simpler to debug CI pipeline crawl failures

Removed

  • Nothing

[0.9.43] - [2020-08-29]

Added

  • Nothing

Changed

  • fix: generate-circleci-config.py which was generating references to the docker image simonsdave/cloudfeaster-bionic-dev-env instead of simonsdave/cloudfeaster-dev-env
  • fix: docker image badge in main README.md

Removed

  • Nothing

[0.9.42] - [2020-08-28]

Added

  • add clair-cicd docker image vulnerability assessment to CircleCI pipeline
  • add LGTM badges to main README.md

Changed

  • dev-env 0.6.10 -> 0.6.11
  • changes to generate-circleci-config.py to improve reliability of capturing crawl results when crawls fail

Removed

  • Nothing

[0.9.41] - [2020-04-11]

Added

  • Nothing

Changed

  • fix bin/install-dev-env-scripts.sh to work without dev_env/dev-env-version.txt

Removed

  • Nothing

[0.9.40] - [2020-04-10]

Added

  • Nothing

Changed

  • dev-env v0.6.8 -> v0.6.10
  • eliminated the nasty looking Warning: apt-key output should not be parsed (stdout is not a terminal) message generated by bin/install-chrome.sh

Removed

  • Nothing

[0.9.39] - [2020-03-23]

Added

  • Nothing

Changed

  • run-spider.sh now outputs only json
  • dev-env v0.6.7 -> v0.6.8

Removed

  • Nothing

[0.9.38] - [2020-02-22]

Added

  • Nothing

Changed

  • generate-circleci-config.py adds back run-pip-check.sh to generated CircleCI pipeline which now works after upgrade to Python 3.7

Removed

  • Nothing

[0.9.37] - [2020-02-21]

Added

  • Nothing

Changed

  • pip3 install -> python3.7 -m pip install
  • dev-env v0.6.6 -> v0.6.7

Removed

  • Nothing

[0.9.36] - [2020-02-16]

Added

  • Nothing

Changed

  • fix bug in generate-circleci-config.py which was generating a KeyError: 'CRAWL_OUTPUT' error.

Removed

  • Nothing

[0.9.35] - [2020-02-16]

Added

  • add comprehensive artifact storage for spiders run by the CircleCI workflow generated by generate-circleci-config.py

Changed

  • remove debugging statement from run-all-spiders.sh

Removed

  • Nothing

[0.9.34] - [2020-02-16]

Added

  • Nothing

Changed

  • fix: generate-circleci-config.py had outstanding problems from Python 2.7 -> 3.7
  • usability: improve usability of run-all-spiders.sh and run-spider.sh in spider repos
  • docs: fix docker image badge in main README.md

Removed

  • Nothing

[0.9.33] - [2020-02-15]

Added

  • add --verbose command line argument to docker_image_integration_tests.sh
  • add --verbose and --debug command line options to run-sample.sh
  • add CrawlDebugger and use in sample spiders - start of improving debugging
  • setting CLF_DEBUG can now be used to generate spiderLog and chromeDriverLog in spider output
  • add CrawlResponse.SC_UNKNOWN
  • when a spider fails every attempt is made to take a screenshot of the browser window

Changed

  • spiderArgs in crawl results now crawlArgs
  • run-spider.sh now accepts full file name of spider rather than just base name - so run-spider.sh xe_exchange_rates is now run-spider.sh xe_exchange_rates.py
  • python-dateutil 2.8.0 -> 2.8.1
  • dev-env v0.5.25 -> v0.6.6
  • :MATERIAL CHANGE: Python 2.7 -> Python 3.7

Removed

  • Nothing

[0.9.32] - [2019-11-11]

Added

  • Nothing

Changed

  • Nothing

Removed

  • remove Snyk from CI pipeline & docs

[0.9.31] - [2019-08-05]

Added

  • add more BeautifulSoup and Scrapy doc references
  • dev-env 0.5.21 -> 0.5.25
  • add Codecov upload to CircleCI pipeline

Changed

  • SpiderCrawler has chromedriver_log_file allowing callers access to the ChromeDriver debug logs when the debug property for SpiderCrawler is set to True

Removed

  • _debug property in crawl response under all circumstances

[0.9.30] - [2019-07-18]

Added

  • Nothing

Changed

  • fix logging of CLF_CHROME value

Removed

  • Nothing

[0.9.29] - [2019-07-08]

Added

  • Nothing

Changed

  • bin/install-chromedriver.sh installs chromedriver 2.46 -> 2.43 motivated by this

Removed

  • Nothing

[0.9.28] - [2019-07-08]

Added

  • Nothing

Changed

  • selenium 3.141.0 -> selenium==3.14.0 motivated by this

Removed

  • Nothing

[0.9.27] - [2019-06-23]

Added

  • Nothing

Changed

  • dev-env 0.5.20 -> 0.5.21
  • install-dev-env-scripts.sh now uses the install-dev-env.sh --dev-env-version command line option

Removed

  • Nothing

[0.9.26] - [2019-06-23]

Added

  • Nothing

Changed

  • install dev-env using install-dev-env.sh instead of pip install

Removed

  • Nothing

[0.9.25] - [2019-06-23]

Added

  • Nothing

Changed

  • dev-env 0.5.19 -> 0.5.20

Removed

  • Nothing

[0.9.24] - [2019-06-05]

Added

  • add bin/generate-circleci-config.py to setup.py

Changed

  • Nothing

Removed

  • Nothing

[0.9.23] - [2019-06-04]

Added

  • Nothing

Changed

  • fix bin/generate-circleci-config.py that was generating incorrect CircleCI config

Removed

  • Nothing

[0.9.22] - [2019-06-03]

Added

  • add bin/check-circleci-config.sh to setup.py as script - should have done this when adding bin/check-circleci-config.sh

Changed

  • Nothing

Removed

  • Nothing

[0.9.21] - [2019-05-23]

Added

  • add bin/check-consistent-clf-version.sh
  • add bin/generate-circleci-config.py

Changed

  • bin/install_chrome.sh -> bin/install-chrome.sh
  • bin/install_chromedriver.sh -> bin/install-chromedriver.sh

Removed

  • remove bin/chromedriver_version.sh since this script was no longer used
  • remove dev-env-version label from docker image since this label is no longer used

[0.9.20] - [2019-05-19]

Added

  • add check-consistent-clf-version.sh to setup.py as script which is installed as part of the Cloudfeaster python package

Changed

  • Nothing

Removed

  • Nothing

[0.9.19] - [2019-05-19]

Added

  • add install-dev-env-scripts.sh for use in CircleCI pipeline

Changed

  • Nothing

Removed

  • Nothing

[0.9.18] - [2019-05-18]

Added

  • add check-consistent-dev-env-version.sh to CircleCI pipeline
  • add run-bandit.sh to CircleCI pipeline
  • add .cut-release-version.sh in support of using new revs of dev-env

Changed

  • renamed run_sample.sh -> run-sample.sh
  • the ttlInSeconds property in spider metadata is now ttl and the value associated with the property is now a string instead of an integer - the string has the form <number><duration> where <number> is a non-zero integer and <duration> is one of s, m, h or d representing seconds, minutes, hours and days respectively
  • the maxCrawlTimeInSeconds spider metadata property is now maxCrawlTime and is also a string instead of an integer - the string has the form <number><duration> where <number> is a non-zero integer and <duration> is one of s or m representing seconds and minutes respectively
  • dev-env 0.5.15 -> 0.5.19
  • sha1 -> sha256 after running bandit
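The `<number><duration>` format used by ttl and maxCrawlTime above could be parsed along the following lines. This is a minimal sketch, not Cloudfeaster's code: the function name `parse_duration` and the `allowed_units` parameter are assumptions made for illustration.

```python
import re

# Seconds per duration unit, as described in the entries above.
_SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 60 * 60, "d": 24 * 60 * 60}


def parse_duration(value, allowed_units="smhd"):
    """Convert a '<number><duration>' string such as '90s' into seconds.

    Raises ValueError if the string is malformed, the number is zero,
    or the unit is not one of allowed_units.
    """
    match = re.fullmatch(r"([0-9]+)([a-z])", value)
    if match is None:
        raise ValueError("expected <number><duration>, got %r" % value)
    number, unit = int(match.group(1)), match.group(2)
    if number == 0 or unit not in allowed_units:
        raise ValueError("invalid duration %r" % value)
    return number * _SECONDS_PER_UNIT[unit]
```

Under these assumptions a ttl would be parsed with the default units, so `parse_duration("5m")` returns 300, while a maxCrawlTime would be parsed with `allowed_units="sm"` so that hours and days are rejected.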

Removed

  • Nothing

[0.9.17] - [2019-04-15]

Added

  • bin/install-dev-env-scripts.sh can now be used by spider repos to add dev env host scripts to a spider repo's host env

Changed

  • Nothing

Removed

  • Nothing

[0.9.16] - [2019-04-01]

Added

  • added run-all-spiders.sh and run-spider.sh
  • by default Chrome now started with --no-sandbox which should mean that Chrome can run as root which simplifies a whole host of complexity

Changed

  • Nothing

Removed

  • Nothing

[0.9.15] - [2019-03-30]

Added

  • .travis.yml now runs run_repo_security_scanner.sh
  • added xe_exchange_rates.py sample spider
  • added sha1 hash of spiders args to spider output

Changed

  • ChromeDriver 2.38 -> 2.46
  • Selenium 3.12.0 -> 3.141.0
  • twine 1.11.0 -> 1.12.1
  • dateutil 2.7.3 -> 2.7.5
  • material simplification of the way to use run_sample.sh

Removed

  • removed bank_of_canada_daily_exchange_rates.py sample spider
  • removed spiderhost.py, spiderhost.sh, spiders.py and spiders.sh

[0.9.14] - [2018-05-30]

Added

  • Nothing

Changed

  • Selenium 3.11.0 -> 3.12.0
  • python-dateutil 2.7.2 -> 2.7.3
  • spider metadata changed to camel case instead of snake case to get closer to these JSON style guidelines
  • crawl results metadata now grouped in the _metadata property and use camel case instead of snake case
  • crawl results are now validated against this jsonschema
  • added spiders.sh and spiderhost.sh to enable the API for a docker image containing spiders to be expressed in a manner that's independent from Python and Webdriver

Removed

  • Nothing

[0.9.13] - [2018-04-24]

Added

Changed

  • simonsdave/cloudfeaster docker image is now based on Ubuntu 16.04
  • ChromeDriver 2.37 -> 2.38

Removed

  • Nothing

[0.9.12] - [2018-04-07]

Added

Changed

  • spiders meta data - url string property is now validated using jsonschema uri format instead of pattern
  • selenium 3.9.0 -> 3.11.0
  • python-dateutil 2.6.1 -> 2.7.2
  • ChromeDriver 2.35 -> 2.37
  • twine 1.10.0 -> 1.11.0
  • identifying_factors and authenticating_factors properties will now always appear in spiders.py output

Removed

  • Nothing

[0.9.11] - [2018-02-26]

Added

  • Nothing

Changed

  • samples/pypi_spider.py -> samples/pythonwheels_spider.py
  • spider metadata property name change = max_concurrency -> max_concurrent_crawls

Removed

  • Nothing

[0.9.10] - [2018-02-09]

Added

  • Nothing

Changed

  • Selenium 3.8.1 -> 3.9.0

Removed

  • Nothing

[0.9.9] - [2018-02-02]

Added

  • Nothing

Changed

Removed

  • Nothing

[0.9.8] - [2018-01-10]

Added

  • max concurrency per spider property is now part of the output from Spider.get_validated_metadata() regardless of whether or not it is specified as part of the explicit spider metadata declaration
  • added paranoia_level to spider metadata
  • added max_crawl_time_in_seconds to spider metadata
  • ttl_in_seconds now has an upper bound of 86,400 (1 day in seconds)
  • max_concurrency now has an upper bound of 25

Changed

  • Selenium 3.7.0 -> 3.8.1
  • ChromeDriver 2.33 -> 2.34
  • breaking change ttl -> ttl_in_seconds in spider metadata

Removed

  • Nothing

[0.9.7] - [2017-11-27]

Added

  • added .prep-for-release-master-branch-changes.sh so package version number is automatically bumped when cutting a release
  • .prep-for-release-master-branch-changes.sh now generates Python packages for PyPI from release branch

Changed

  • bug fix in .prep-for-release-release-branch-changes.sh so links in main README.md work correctly after a release

Removed

  • removed cloudfeaster.util module since it wasn't used

[0.9.6] - [2017-11-25]

Added

  • added --log command line option to spiders.py
  • added --samples command line option to spiders.py
  • cloudfeaster.webdriver_spider.WebElement now has an is_element_present() method that functions just like the one on cloudfeaster.webdriver_spider.Browser

Changed

  • per this article headless Chrome is now available and Cloudfeaster will use it by default which means we're also able to remove the need for Xvfb which is a really nice simplification and reduction in required crawling resources - also, because we're removing Xvfb, bin/spiderhost.sh was also removed
  • selenium 3.3.3 -> 3.7.0
  • requests 2.13.0 -> >=2.18.2
  • ndg-httpsclient 0.4.2 -> 0.4.3
  • ChromeDriver 2.29 -> 2.33
  • simonsdave/cloudfeaster docker image now uses the latest version of pip

Removed

  • removed all code related to Signal FX

[0.9.5] - [2017-04-17]

Added

  • pypi_spider.py now included with distro in cloudfeaster.samples

Changed

  • upgrade selenium 3.0.2 -> 3.3.3
  • upgrade chromedriver 2.27 -> 2.29

Removed

  • Nothing

[0.9.4] - [2017-03-05]

Added

  • added _crawl_time to crawl results

Changed

Removed

  • Nothing

[0.9.3] - [2017-03-03]

Added

  • Nothing

Changed

  • fix crawl response key errors - _status & _status_code in crawl response were missing the leading underscore for the following responses
    • SC_CTR_RAISED_EXCEPTION
    • SC_INVALID_CRAWL_RETURN_TYPE
    • SC_CRAWL_RAISED_EXCEPTION
    • SC_SPIDER_NOT_FOUND

Removed

  • Nothing

[0.9.2] - [2017-02-12]

Added

  • Nothing

Changed

  • dev env upgraded to docker 1.12
  • BREAKING CHANGE = selenium 2.53.6 -> 3.0.1 which resulted in requiring an upgrade to ChromeDriver 2.24 from 2.22 and it turns out 2.22 does not work with selenium 3.0.1
  • spider version # in crawl results now include hash algo along with the hash value
  • BREAKING CHANGE = the spidering infrastructure augments crawl results with data such as the time to crawl, spider name & version number, etc - in order to more easily differentiate crawl results from augmented data, the top level property names for all augmented data are now prefixed with an underscore - as an example, the following shows the new output from running the PyPI sample spider
>./pypi_spider.py | jq .
{
  "virtualenv": {
    "count": 46718553,
    "link": "http://pypi-ranking.info/module/virtualenv",
    "rank": 5
  },
  "_status_code": 0,
  "setuptools": {
    "count": 63758431,
    "link": "http://pypi-ranking.info/module/setuptools",
    "rank": 2
  },
  "simplejson": {
    "count": 182739575,
    "link": "http://pypi-ranking.info/module/simplejson",
    "rank": 1
  },
  "requests": {
    "count": 53961784,
    "link": "http://pypi-ranking.info/module/requests",
    "rank": 4
  },
  "six": {
    "count": 54950976,
    "link": "http://pypi-ranking.info/module/six",
    "rank": 3
  },
  "_spider": {
    "version": "sha1:ccb6a042dd11f2f7fb7b9541d4ec888fc908a8ef",
    "name": "__main__.PyPISpider"
  },
  "_crawl_time_in_ms": 4773,
  "_status": "Ok"
}
  • upgrade dev env to docker 1.12
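The underscore convention described above makes it straightforward for a consumer to separate what the spider returned from what the infrastructure added. A minimal sketch follows; the function name `split_crawl_results` is hypothetical and is not part of Cloudfeaster.

```python
def split_crawl_results(crawl_results):
    """Separate spider data from infrastructure-added properties.

    Top level keys starting with an underscore are treated as augmented
    data; every other key is treated as data returned by the spider.
    """
    spider_data = {}
    augmented_data = {}
    for key, value in crawl_results.items():
        target = augmented_data if key.startswith("_") else spider_data
        target[key] = value
    return spider_data, augmented_data


# Abridged version of the sample output shown above.
sample = {
    "six": {"count": 54950976, "rank": 3},
    "_status_code": 0,
    "_crawl_time_in_ms": 4773,
    "_status": "Ok",
}
spider_data, augmented_data = split_crawl_results(sample)
```

With the abridged sample, `spider_data` holds only the `six` entry and `augmented_data` holds the three underscore-prefixed properties.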

Removed

  • Nothing

[0.9.1] - [2016-08-17]

Added

  • Nothing

Changed

  • fixed bug that was duplicating crawl response data in CrawlResponseOk

Removed

  • Nothing

[0.9.0] - [2016-08-16]

Added

  • support docker 1.12

Changed

  • version bumps for dependencies:
    • chromedriver 2.22
    • selenium 2.53.6
    • requests 2.11.0
    • ndg-httpsclient 0.4.2
  • set of simplifications in dev env setup

Removed

  • temporary removal of authenticated proxy support

[0.8.0] - [2016-06-14]

Added

  • Cloudfeaster spiders can be developed on pretty much any operating system/browser combination that can run Selenium, but Cloudfeaster Services always runs spiders on Ubuntu and Chrome. Some web sites present different responses to browser requests based on the originating browser and/or operating system. If, for example, a spider is developed on Mac OS X using Chrome, the xpath expressions embedded in the spider may not be valid when the spider is run on Ubuntu using Chrome. To address this disconnect, spider authors can force Cloudfeaster Services to use a user agent header that matches their development environment by providing a value for the user_agent argument of the Browser class' constructor.

[0.7.0] - [2016-05-03]

Added

>spiderhost.py --help
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spider hosts accept the name of a spider, the arguments to run the spider and
optionally proxy server details. armed with all this info the spider host runs
a spider and dumps the result to stdout.

Options:
  -h, --help            show this help message and exit
  --log=LOGGING_LEVEL   logging level
                        [DEBUG,INFO,WARNING,ERROR,CRITICAL,FATAL] - default =
                        ERROR
  --proxy=PROXY         proxy - default = None
  --proxy-user=PROXY_USER
                        proxy-user - default = None
>
>spiderhost.py --proxy=abc
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spiderhost.py: error: option --proxy: required format is host:port
>
>spiderhost.py --proxy-user=abc
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spiderhost.py: error: option --proxy-user: required format is user:password
>
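The format checks shown in the error messages above (host:port and user:password) can be approximated with custom option types. The sketch below uses the standard library's argparse and is an assumption made for illustration only: it is not spiderhost.py's actual code, and the function names are hypothetical.

```python
import argparse
import re


def host_colon_port(value):
    """Validate 'host:port' and return a (host, port) tuple."""
    match = re.fullmatch(r"([^:\s]+):([0-9]+)", value)
    if match is None:
        raise argparse.ArgumentTypeError("required format is host:port")
    return match.group(1), int(match.group(2))


def user_colon_password(value):
    """Validate 'user:password' and return a (user, password) tuple."""
    user, sep, password = value.partition(":")
    if not sep or not user or not password:
        raise argparse.ArgumentTypeError("required format is user:password")
    return user, password


def build_parser():
    """Build a parser exposing only the two proxy related options."""
    parser = argparse.ArgumentParser(prog="spiderhost.py")
    parser.add_argument("--proxy", type=host_colon_port, default=None)
    parser.add_argument("--proxy-user", type=user_colon_password, default=None)
    return parser
```

Validating in the option type keeps the error reporting inside the parser, so a malformed value is rejected before any spider is run.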

[0.6.0] - [2016-01-24]

Changed

  • colorama now req'd to be @ least version 0.3.5 instead of only 0.3.5
  • command line args to bin/spiderhost.sh have been simplified - now just take spider name and spider args just as you'd expect - no more url encoding of args and ----- indicating no spider args
  • like the changes to bin/spiderhost.sh, bin/spiderhost.py now just accepts regular command line arguments of a spider name and spider args - much easier

Removed

  • bin/spiders.sh is no longer needed - callers now access bin/spiders.py directly rather than getting at bin/spiders.py through bin/spiders.sh

[0.5.0] - [2015-05-10]

  • not really the initial release but intro'ed CHANGELOG.md late
  • initial clf commit to github was 13 Oct '13