Releases: clearlydefined/crawler
v2.0.1
v2.0.0
Release tag: v2.0.0
Upgrade Notes
No steps are required to upgrade to this release as a user of ClearlyDefined. Anyone running a local harvester will need to pull the latest crawler image from Docker Hub and restart their crawler.
All major changes are related to data output changes brought in by updates to license identification tools and the license extraction process.
Note: A request for a definition does not initiate a harvest request when a definition already exists. A harvest request is required to update the raw tool results from which the definition is constructed. Harvesting also takes significant time, so expect a delay between making the harvest request and seeing the results reflected in a definition request.
What’s changed
Major Changes
Update license detection tools
- Update licensee scan tool from v9.12 to v9.16.1 by @yashkohli88 in #549
- Update scancode-toolkit from v30.1.0 to v32.1.0 by @lumaxis in #537
Modifications to ClearlyDefined license extraction
- Update PodExtract tool version by @qtomlinson in #566
- Derive license from info.license over classifiers in pypi registry data by @qtomlinson in #586
Minor Changes
New traversal policy
- Introduce “reharvestAlways” traversal policy to make re-harvest simpler by @qtomlinson in #598
New “reharvestAlways” policy behavior:
- When the tool result for a component is available, the tool will be rerun and tool result updated, similar to the "always" policy.
- When the tool result for a component is not available, the component will be fetched and the tool will be run. This differs from the “always” policy which skips running when the results do not already exist.
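The difference between the two policies described above can be summarized in a small decision function. This is only an illustrative sketch of the behavior stated in these notes, not the crawler's implementation; the function name `should_run_tool` and its parameters are hypothetical.

```python
def should_run_tool(policy: str, result_exists: bool) -> bool:
    """Return True when the tool should be (re)run for a component.

    Hypothetical helper: it models only the two policies described in
    these release notes and is not taken from the crawler's code.
    """
    if policy == "reharvestAlways":
        # Existing result: rerun and update it.
        # Missing result: fetch the component and run the tool.
        return True
    if policy == "always":
        # Per the note above, "always" reruns existing results but
        # skips components that have no stored result yet.
        return result_exists
    raise ValueError(f"policy not covered by this sketch: {policy}")
```

With this model, the two policies differ only in the case where no tool result exists yet.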
Other minor changes
- Remove rimraf by @lumaxis in #558
- Update spdx parsing which includes support for passing in LicenseRef map by @ljones140 in #606
Bug Fixes and Patches
Development related
- add sha and version to ‘/’ endpoint by @elrayle in #574
- Fix fetching latest version for some pod components by @qtomlinson in #588
- Make scancode parallelism configurable by @RomanIakovlev in #612
DevOps
- Deploy production crawler to Clearly Defined’s Azure account, along with MSFT by @ljones140 in #608
- Deploys to dev on master merge by @ljones140 in #601
- Deploy dev crawler via GitHub action by @ljones140 in #599
- tests should run for changes in prod and have the option to run manually by @elrayle in #592
- Add separate workflow step for testing Docker build by @lumaxis in #580
- docs: add SECURITY.md by @nickvidal in #584
Dependencies
- Bump express from 4.18.2 to 4.19.2 by @dependabot in #564
- Bump debug from 4.1.1 to 4.3.5 by @dependabot in #581
- Bump braces and patch-package by @dependabot in #582
- Updated deprecated dependency request-promise-native by @yashkohli88 in #576
- Cleanup dependencies by @lumaxis in #557
New Contributors
- @nickvidal made their first contribution in #584
- @ljones140 made their first contribution in #591
Full Changelog: v1.2.0...v2.0.0
v1.2.0
Release Highlights
Release tag: v1.2.0
This release includes a single configuration change that allows deployments to specify the location of the queues that hold input for the crawler.
Upgrade Notes
No Action Required. Optionally, you can set a configuration to control where input queues will be constructed.
What’s changed
Changes: v1.1.0..v1.2.0
Minor Changes
Configure location of queues
- Make Crawler queue in Azure separate from Azure results storage (#591) (@ljones140)
Release v1.2.0 adds support for running the crawler queues in a different Azure account from the results storage blobs. The requirement came from organizations that want to submit results to clearlydefinedprod Azure but do not want the queues in the same Azure account.
The crawler configuration takes an additional environment variable, CRAWLER_QUEUE_AZURE_CONNECTION_STRING. If it is provided, the crawler uses that storage account for the queues. If it is not provided, the crawler uses the same connection defined in CRAWLER_AZBLOB_CONNECTION_STRING.
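The fallback between the two environment variables can be sketched as follows. This is a minimal illustration of the rule described above, not the crawler's own configuration code; the function name `resolve_queue_connection_string` is hypothetical, and treating an empty value as "not provided" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_queue_connection_string(
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick the connection string to use for the crawler queues.

    Hypothetical helper illustrating the fallback described in the notes.
    """
    # Prefer the queue-specific setting when it is provided.
    # Assumption: an empty string counts as "not provided".
    queue_connection = env.get("CRAWLER_QUEUE_AZURE_CONNECTION_STRING")
    if queue_connection:
        return queue_connection
    # Otherwise fall back to the connection used for results storage.
    return env.get("CRAWLER_AZBLOB_CONNECTION_STRING")
```

Passing the environment as a parameter keeps the sketch easy to exercise with a plain dictionary instead of real environment variables.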
Bug Fixes and Patches
- docs: add SECURITY.md (#584) (@nickvidal)
- Add separate job for testing Docker build (#580) (@lumaxis)
v1.1.0
Release Highlights
Release tag: v1.1.0
There is one change of interest:
- Conda was added as a package manager source. Details on usage are provided below under the Add Conda support section.
Upgrade Notes
No Action Required. Optionally, you can start requesting harvests for Conda packages. See details below.
What’s changed
Changes: v1.0.2..v1.1.0
Minor Changes
Add Conda support
There is one significant change in this release: support for the Conda package manager. It is classified as minor because it is additive and does not impact the functioning of previously supported package managers.
Conda exposes packages in a different format from other Python repositories like pypi. Conda is a Python environment locked to a specific Python version. It deals with packages locked to a specific version for a given version of the channel. Because the packages are managed for compatibility, this ensures they do not break due to one incompatibility or another, similar to how you'd ship a Docker container.
The primary consumption point is the "packages" themselves, which are accompanied by scripts that modify the environment and set up the packages and dependencies, which are then consumed by the setup application. The packages may also contain DLLs, scripts, compiled Python binaries (.pyc), Python code, etc.
The structure of Conda repositories and their indexing process are described in Channels and generating an index (Conda docs).
Conda has three main channels: anaconda-main, anaconda-r, and conda-forge which is more geared toward business uses.
We crawl both the packages and the source code (not always specified) for the licensing metadata and other metadata about the package.
The source from which the Conda packages are created is often, but not always, provided via a URL that links to a compressed source file hosted externally, sometimes on GitHub or another website. Note that this is a file and not a git repository.
The main Conda package is hosted on the Conda channels themselves. It is compressed and contains the licensing information, compilers, environment configuration scripts, dependencies, etc. that are needed to make the package work.
Coordinates syntax:
- type (required) - selects the Conda provider to use (values: conda | condasource)
- provider (required) - channel from which the package will be crawled (values: conda-forge | anaconda-main | anaconda-r)
- namespace (optional) - architecture and OS of the package to be crawled (e.g. win64, linux-aarch64). If no architecture is specified, any architecture is chosen.
- package name (required): name of the package
- revision (optional): package version and optional build version (format: (${version} | )-(${buildversion} | )) (e.g. 0.3.0, 0.3.0-py36hffe2fc). If it is a conda coordinate type, the build version of the package is usually a conda-specific representation of the build tools and environment configuration, and build iteration of the package (e.g. for a Python 3.9 environment, buildversion is py39H443E). If none is specified, the latest one will be selected using the package's timestamp.
Examples:
- conda/conda-forge/linux-aarch64/numpy/1.13.0
- condasource/conda-forge/linux-aarch64/numpy/1.13.0
- conda/conda-forge/-/numpy/1.13.0/
- conda/conda-forge/linux-aarch64/numpy/-py36
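To make the coordinate syntax concrete, the sketch below splits a coordinate string into its parts. It is an illustration based only on the syntax and examples above, not the crawler's parser. The names `CondaCoordinates` and `parse_conda_coordinates` are hypothetical, and two assumptions are made: a "-" namespace means "not specified" (inferred from the third example), and the version itself contains no hyphen, so the first hyphen in the revision separates version from build version.

```python
from dataclasses import dataclass
from typing import Optional

VALID_TYPES = ("conda", "condasource")
VALID_PROVIDERS = ("conda-forge", "anaconda-main", "anaconda-r")


@dataclass
class CondaCoordinates:
    type: str
    provider: str
    namespace: Optional[str]      # architecture/OS, None when unspecified
    name: str
    version: Optional[str]        # None when unspecified
    build_version: Optional[str]  # None when unspecified


def parse_conda_coordinates(text: str) -> CondaCoordinates:
    """Parse 'type/provider/namespace/name[/revision]' (hypothetical helper)."""
    # Tolerate a leading or trailing slash, as in one of the examples.
    parts = text.strip("/").split("/")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 segments, got {len(parts)}")
    type_, provider, namespace, name = parts[:4]
    revision = parts[4] if len(parts) == 5 else ""
    if type_ not in VALID_TYPES:
        raise ValueError(f"unknown type: {type_}")
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    # Assumption: the first hyphen separates version from build version.
    version, _, build_version = revision.partition("-")
    return CondaCoordinates(
        type=type_,
        provider=provider,
        namespace=None if namespace == "-" else namespace,
        name=name,
        version=version or None,
        build_version=build_version or None,
    )
```

For instance, under these assumptions `parse_conda_coordinates("conda/conda-forge/linux-aarch64/numpy/-py36")` yields no version and a build version of `py36`.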
Conda-forge is a community effort and packages are published by opening PRs on their GitHub repository as described in Contributing packages (Conda Forge docs).
Bug Fixes and Patches
Development related
- Pin reuse version to the most recent 3.0.1 (#559) (@qtomlinson)
- Fix ENOENT error during harvesting Conda components (#575) (@qtomlinson)
DevOps
Dependencies
- Bump follow-redirects from 1.15.5 to 1.15.6 (#562) (@dependabot[bot])
v1.0.2
Release Highlights
Release tag: v1.0.2
This is a patch release with bug fixes.
Upgrade Notes
No Action Required
What’s changed
Changes: v1.0.1..v1.0.2
Bug Fixes and Patches
Bug Fixes
- Rename variable for consistency (#570) (@lumaxis)
- Fix URL to fetch Go packages with latest version (#569) (@yashkohli88)
v1.0.0
Release v1.0.0 is a re-release of the current production crawler, which was last released on Dec 5, 2022. A more recent release on Apr 2, 2024 was triggered by a merge of master into prod; I would expect that one to be release 1.1.0. The purpose of the v1.0.0 release is to establish a known baseline as the starting point for the transition to using Semantic Versioning for the released versions. The purpose of the v1.1.0 release is to capture the changes that exist in master at this moment in time.
Releases are published as Docker images to Docker Hub. Future releases will be published to Docker Hub and GitHub Packages.
Release Highlights
Release tag: v1.0.0
NOTE: The version in package.json differs from the release tag because it was previously set and could not be changed.
Breaking Changes
none
Upgrade Notes
No Action Required
What’s changed
This release is identical to the code that has been the production release since Dec 5, 2022.
previous-release:
- tag: v0.1.1 tagged but not published as a release
- date: 8-3-2022
Changes: v0.1.1..v1.0.0
v1.0.1
Release Highlights
Release tag: v1.0.1
This is a patch release with bug fixes, dependency updates, documentation improvements, and devops maintenance related to running tests.
Upgrade Notes
No Action Required
What’s changed
Changes: v1.0.0..v1.0.1
Bug Fixes and Patches
Bug Fixes
- Fix extracting license information for pypi packages (#518) (@qtomlinson)
- Fix harvesting git components (#517) (@qtomlinson)
- fixing bundler install error by locking version (#512) (@mpcen)
- Exclude .git directory content when calculating package file count (#525) (@qtomlinson)
- lowercasing package names for nuget api fetching (#515) (@mpcen)
Documentation
- Update README.md - fix docker run for Mac OS (#560) (@yashkohli88)
- Update readme - describe request type (#526) (@qtomlinson)
Update Dependencies
- Update Node version and dependencies (#522) (@lumaxis)
- Bump @babel/traverse from 7.12.9 to 7.23.9 (#545) (@dependabot[bot])
- Bump follow-redirects from 1.15.1 to 1.15.5 (#544) (@dependabot[bot])
- Bump axios from 0.27.2 to 1.6.0 (#543) (@dependabot[bot])
- Bump xml2js from 0.4.23 to 0.5.0 (#542) (@dependabot[bot])
- Bump luxon from 2.3.0 to 2.5.2 (#511) (@dependabot[bot])
- update @clearlydefined/spdx to 0.1.7 (#530) (@qtomlinson)
DevOps/Maintenance