title |
---|
Project Parsing |
- The dbt parse command
- Partial parsing profile config and CLI flags
- Experimental parser CLI flag
At the start of every dbt invocation, dbt reads all the files in your project, extracts information, and constructs a manifest containing every object (model, source, macro, etc). Among other things, dbt uses the ref(), source(), and config() macro calls within models to set properties, infer dependencies, and construct your project's DAG.
Parsing projects can be slow, especially as projects get bigger—hundreds of models, thousands of files—which is frustrating in development. There are a handful of ways to optimize dbt performance today:
- LibYAML bindings for PyYAML
- Partial parsing, which avoids re-parsing unchanged files between invocations
- An experimental parser, which extracts information from simple models much more quickly
- RPC server, which keeps a manifest in memory, and re-parses the project at server startup/hangup
These optimizations can be used in combination to reduce parse time from minutes to seconds. At the same time, each has some known limitations, so they are disabled by default.
dbt uses PyYAML to read and validate YAML files in your project. PyYAML is written in pure Python, but it can leverage LibYAML (written in C, and much faster) if it's available on your system. Whenever it parses your project, dbt first checks whether LibYAML is available.
You can test to see if LibYAML is installed by running this command in the environment where you've installed dbt:
python -c "from yaml import CLoader"
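The same check can be scripted, for example in a CI step. The sketch below reads PyYAML's __with_libyaml__ module attribute and reports a separate status when PyYAML itself is not installed. The function name libyaml_status and the three status strings are hypothetical, chosen only for illustration.

```python
import importlib


def libyaml_status():
    """Report whether PyYAML can use the LibYAML C bindings."""
    try:
        yaml = importlib.import_module("yaml")
    except ImportError:
        # PyYAML is not installed in this environment at all.
        return "pyyaml-missing"
    # PyYAML sets __with_libyaml__ to True when its C extension imported cleanly.
    if getattr(yaml, "__with_libyaml__", False):
        return "libyaml"
    return "pure-python"


if __name__ == "__main__":
    print(libyaml_status())
```

A result of pure-python means YAML files will still load, only more slowly.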
After parsing your project, dbt stores an internal project manifest in a file called partial_parse.msgpack. When partial parsing is enabled, dbt will use that internal manifest to determine which files have been changed (if any) since it last parsed the project. Then, it will only parse the changed files, or files related to those changes.
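To make the idea of "only parse what changed" concrete, here is a minimal sketch of content-based change detection. It is not dbt's implementation: the functions file_checksums and changed_files, the choice of SHA-256, and the file patterns are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksums(project_dir, patterns=("*.sql", "*.yml")):
    """Map each matching file's relative path to a SHA-256 digest of its contents."""
    root = Path(project_dir)
    checksums = {}
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                checksums[path.relative_to(root).as_posix()] = digest
    return checksums


def changed_files(previous, current):
    """Return paths that were added, modified, or deleted between two snapshots."""
    added_or_modified = {p for p, d in current.items() if previous.get(p) != d}
    deleted = set(previous) - set(current)
    return sorted(added_or_modified | deleted)
```

A comparison like this only sees file contents, so anything that lives outside the files, such as an environment variable, would go unnoticed.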
Partial parsing is off by default, and it can be enabled via profile config or CLI flags. In development, partial parsing can significantly reduce the time spent waiting at the start of a run, which translates to faster dev cycles and iteration.
Use caution when enabling partial parsing in dbt, as there are known limitations today:
- A change in environment variables does not trigger a re-parse. Files which depend on env_var may be incorrect on subsequent parses.
- Changes to macros called within a model's config() block will not result in re-parsing that model.
- A file that depends on "volatile" Jinja variables, such as run_started_at or invocation_id, will quickly get stale. A file is not re-parsed in subsequent invocations if the file's contents have not changed.
- If certain inputs change between runs, dbt will trigger a full re-parse. Today those inputs are:
  - --vars
  - profiles.yml content
  - dbt_project.yml content
  - installed packages
  - dbt version
If you ever get into a bad state, you can disable partial parsing and trigger a full re-parse with the --no-partial-parse CLI flag, or by deleting target/partial_parse.msgpack.
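Deleting the saved manifest can be scripted as part of a reset step. The helper below is a small sketch that assumes the project uses the default target directory named target. The function name clear_partial_parse and its parameters are hypothetical.

```python
from pathlib import Path


def clear_partial_parse(project_dir=".", target_dir="target"):
    """Delete the saved manifest so the next dbt invocation re-parses everything.

    Returns True if a file was removed, False if there was nothing to delete.
    """
    # Assumption: the manifest lives at <project_dir>/<target_dir>/partial_parse.msgpack.
    manifest = Path(project_dir) / target_dir / "partial_parse.msgpack"
    if manifest.is_file():
        manifest.unlink()
        return True
    return False
```

Calling clear_partial_parse() from the project root before a run has the same effect as removing the file by hand.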
At parse time, dbt needs to extract the contents of ref(), source(), and config() from all models in the project. Traditionally, dbt has extracted those values by rendering the Jinja in every model file, which can be slow. In v0.20.0, we're trying out a new way to statically analyze model files, leveraging tree-sitter, which we're calling an "experimental parser". You can see the code for an initial Jinja2 grammar here.
dbt --use-experimental-parser parse
dbt --use-experimental-parser run
dbt --use-experimental-parser test
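To illustrate what static extraction means, as opposed to rendering, here is a deliberately simplified sketch. It uses regular expressions rather than tree-sitter, so it is not the experimental parser or its grammar. The function extract_static is hypothetical; it handles only ref() and source() with literal string arguments and leaves config() out for brevity. Whenever it sees anything else, it returns None to signal that a full Jinja render would be needed.

```python
import re

JINJA_EXPRESSION = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)
SIMPLE_CALL = re.compile(r"(ref|source)\((.*)\)", re.DOTALL)
STRING_LITERAL = re.compile(r"""(['"])([^'"]*)\1""")


def extract_static(sql):
    """Return {'refs': [...], 'sources': [...]}, or None if rendering is required."""
    # Statement blocks ({% ... %}) and comments ({# ... #}) are out of scope here.
    if "{%" in sql or "{#" in sql:
        return None
    found = {"refs": [], "sources": []}
    for expression in JINJA_EXPRESSION.findall(sql):
        call = SIMPLE_CALL.fullmatch(expression.strip())
        if call is None:
            return None  # some other Jinja expression: fall back to a full render
        name, raw_args = call.groups()
        args = []
        for piece in raw_args.split(","):
            literal = STRING_LITERAL.fullmatch(piece.strip())
            if literal is None:
                return None  # non-literal argument, e.g. a variable or nested call
            args.append(literal.group(2))
        found["refs" if name == "ref" else "sources"].append(tuple(args))
    return found
```

The conservative fallback is the important design choice: a static pass should only claim a model when it is sure of the result, and hand everything else to the renderer.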
For now, the experimental parser works only with models, and only with models whose Jinja is limited to those three special macros (ref, source, config). The experimental parser is at least 3x faster than a full Jinja render. Based on testing with data from dbt Cloud, we believe the current grammar can statically parse 60% of models in the wild. So for the average project, we'd hope to see a 40% speedup in the model parser. You can check this by running dbt parse and dbt --use-experimental-parser parse, and comparing the target/perf_info.json produced by each.
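The 40% figure follows from the two numbers above if it is read as a reduction in model-parsing time. The short calculation below works through that arithmetic. The function name is hypothetical, and the calculation assumes every model costs the same to render, which real projects will not match exactly.

```python
def expected_parse_time_fraction(static_share, static_speedup):
    """Fraction of the original model-parsing time that remains.

    static_share: share of models the static parser can handle (0..1).
    static_speedup: how many times faster static parsing is than a Jinja render.
    """
    rendered = 1.0 - static_share            # models that still need a full render
    static = static_share / static_speedup   # models parsed statically, at reduced cost
    return rendered + static


remaining = expected_parse_time_fraction(static_share=0.60, static_speedup=3.0)
# 0.40 + 0.60 / 3 = 0.60 of the original time, i.e. a 40% reduction
print(f"time remaining: {remaining:.0%}, reduction: {1 - remaining:.0%}")
```

Because 3x is described as a lower bound, the actual reduction could be somewhat larger for projects where most models are simple.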
The experimental parser is off by default. We believe it can offer some speedup to 95% of projects.
Do not use the experimental parser if you've overridden the ref, source, or config macro with a custom implementation.