See this gist about post-hoc action input pruning and its limitations w.r.t. caching.
buck2 has Starlark APIs for (limited) dynamic deps (i.e. dynamic action creation, albeit only with input subsetting): https://buck2.build/docs/rule_authors/dynamic_dependencies/
Bazel's internal dependency engine (Skyframe) is capable of representing such dependencies, and internal rulesets make use of this functionality (e.g. ThinLTO and C++20 module support in rules_cc); however, Starlark rules have no way to express such graphs.
In the specific case described in the previous gist, the dynamism is constrained enough (i.e. the number of actions is known statically; only the relevant subset of inputs is "dynamic") that we can attempt to model it in Bazel with TreeArtifacts (1, 2). At a high level this works by:
1. running a (quick) action that has all of the files listed as inputs and produces (for each actual action) a TreeArtifact containing the slimmed-down set of inputs
2. running the actual action with the TreeArtifact from #1 listed as its inputs
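The first step above can be sketched in Python as a small driver. This is only an illustration under stated assumptions: the function names, the a.src file name, and the regex-based scan are hypothetical stand-ins, and headers are keyed by basename for brevity. A real driver would ask the actual tool which headers it reads (for C++, something like clang-scan-deps), and the output directory would be the one Bazel provides for a directory declared with ctx.actions.declare_directory.

```python
import re
import shutil
from pathlib import Path

# Hypothetical stand-in for real dependency discovery: matches quoted
# includes only, e.g. `#include "b.header"`.
INCLUDE_RE = re.compile(r'^\s*#include\s+"([^"]+)"', re.MULTILINE)


def discover_used_headers(source: Path, candidates: dict[str, Path]) -> set[str]:
    """Return the names of candidate headers transitively included by `source`."""
    used: set[str] = set()
    pending = [source]
    while pending:
        current = pending.pop()
        for name in INCLUDE_RE.findall(current.read_text()):
            if name in candidates and name not in used:
                used.add(name)
                pending.append(candidates[name])
    return used


def write_pruned_dir(source: Path, headers: list[Path], out_dir: Path) -> set[str]:
    """Populate `out_dir` with only the headers that `source` actually uses."""
    candidates = {h.name: h for h in headers}  # assumption: basenames are unique
    used = discover_used_headers(source, candidates)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in sorted(used):
        shutil.copyfile(candidates[name], out_dir / name)
    return used
```

The second step would then list only the resulting directory (plus the source file) as its inputs, so unused headers never reach the sandbox.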
This kind of a priori pruning of unused inputs has a couple of upsides:
- no potential correctness issues: unlike with input pruning (i.e. unused_inputs_list), where Bazel has to take it on faith that the action genuinely didn't rely on any of the inputs that are being listed as unused, here the sandbox enforces that "unused" inputs actually are unused
  - (note that the correctness issues are limited in scope to local incremental builds and cannot result in cache corruption due to the nature of unused_inputs_list and its interaction with caching; see the previous gist for details)
- better interaction with caching: because #2 has the narrowed set of inputs listed when it's executed, it's genuinely not sensitive to the "unused" inputs, even on clean builds
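The caching upside can be illustrated with a toy model in Python. This is not how Bazel computes action keys; it only assumes that a cache key is some digest over the command and every declared input, which is enough to show the effect. The file names and contents are made up.

```python
import hashlib


def action_key(command: str, inputs: dict[str, bytes]) -> str:
    """Toy cache key: a digest over the command line and every declared input."""
    h = hashlib.sha256()
    h.update(command.encode())
    for path in sorted(inputs):
        h.update(path.encode())
        h.update(hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()


def without(inputs: dict[str, bytes], name: str) -> dict[str, bytes]:
    """Return a copy of `inputs` with one entry dropped (the a priori pruning)."""
    return {k: v for k, v in inputs.items() if k != name}


all_inputs = {
    "a.src": b'#include "b.header"\n',
    "b.header": b"int b();\n",
    "d.header": b"int d();\n",  # declared as an input but never included
}
# Simulate an edit to the unused header.
edited = {**all_inputs, "d.header": b"int d(int);\n"}

# Every declared input feeds the key, so the edit invalidates it.
full_changed = action_key("compile", all_inputs) != action_key("compile", edited)

# With the pruned input set, the unused header never reaches the key.
pruned_changed = (
    action_key("compile", without(all_inputs, "d.header"))
    != action_key("compile", without(edited, "d.header"))
)
```

In this model full_changed is true and pruned_changed is false, which mirrors the claim above: the compile step is insensitive to the unused header even when nothing has been built locally before.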
The downsides are that:
- some work is done twice; i.e. we'll end up parsing inputs to discover used headers in both #1 and #2 (in the steps listed above)
- a little bit of work needs to be done to create a tool that drives the actual tool to discover headers and then assembles the TreeArtifact
  - potentially an added bit of maintenance burden? probably not burdensome in practice though, provided your tool has a way to discover headers and stop
- getting the symlinks right is a little tricky; see:
  - bazelbuild/bazel#20891
  - bazelbuild/bazel#15454
  - example usage of relative symlinks in a TreeArtifact: https://github.com/bazel-contrib/rules_oci/pull/559/files
  - also note that there are maybe some bad interactions with BwoB + RBE? (unclear if this is still an issue when files are expressed as inputs correctly to the TreeArtifact-producing action)
- this only provides upside if the tool invocation in #1 is very fast compared with #2
- leaning heavily on this technique (instead of expressing the exact set of used headers statically) reduces the fidelity of the static build graph
- it's a tradeoff between user burden and perf/static information, as always
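On the symlink point in the list above: if the pruned directory is populated with symlinks rather than copies, one option is to store link targets relative to the link's own directory, as in the rules_oci example linked there. A minimal Python sketch follows; link_relative is a hypothetical helper, and whether relative links behave well in a given Bazel setup depends on the issues referenced above.

```python
import os
from pathlib import Path


def link_relative(target: Path, link: Path) -> None:
    """Create `link` as a symlink whose stored path is relative to its own directory."""
    link.parent.mkdir(parents=True, exist_ok=True)
    # relpath is purely lexical: it computes how to get from the link's
    # directory to the target without touching the filesystem.
    relative_target = os.path.relpath(target, start=link.parent)
    os.symlink(relative_target, link)
```

Because the stored path is relative, the link keeps resolving if the whole tree (sources plus output directory) is relocated together, which an absolute link would not.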
Note
This approach is not dissimilar to Bazel's "include scanning" feature for rules_cc, which is effectively Blaze-only (i.e. Google-internal only). (Also see here.) The motivation for include scanning is to reduce the number/size of files that must be sent to remote workers when using RBE (remote build execution). Unlike this approach, include scanning uses a brittle grep-based tool to prune down the referenced headers of an action (that is the case today; there has been discussion about using clang-scan-deps instead). Additionally, as far as I know, include scanning does not actually rewrite action cache keys and thus has similar caveats to unused_inputs_list.
Note
If you're familiar with how build systems model ThinLTO or C++20 Modules the above probably sounds familiar and janky — as mentioned, Bazel does use its native dynamic dependency capabilities (1, 2a, 2b) to model these language/toolchain features.
Run the following:
1. bazel build //:a --disk_cache=./disk-cache
   - note the times in the output
2. bazel clean --expunge
3. modify one of the unused headers (i.e. d.header, e.header)
4. bazel build //:a --disk_cache=./disk-cache
   - 1 action should run (scan deps), 1 should hit in the cache (compile)
   - the compile action's output should have a timestamp earlier than the scan deps invocation's output
This demonstrates that even on a cold, clean build, the action cache key for the compile action is scoped down to the headers that are actually used (unlike with the unused_inputs_list approach).
Note that an actual rules_cc-esque ruleset would probably run several validation actions:
- check that each public header parses and does not have any implicit header dependencies (--process_headers_in_dependencies + parse_headers feature)
- check that no (non-public) headers are unused by a library
  - afaik Bazel does not do this
- mostly unrelated to header input pruning and requires support from the tool: check that no source files are implicitly relying on transitively available headers (layering_check)
  - details on layering_check in clang here
  - note that clang actually has support for specifying a user-facing module name (i.e. a bazel target label!) in the module maps so that layering check errors are more useful to users
    - see here
tangentially relevant: ninja-build/ninja#1303