[Beta] Python models #1754
This needs another pass-through, but I'd like to open it up for feedback now. Aiming to get these docs live & ready for beta testers as soon as there's a beta prerelease to start using.
This is so extremely exciting :) Jeremy H beat me to some of the things I initially noticed (explaining why view + ephemeral materializations are not supported, a spelling typo), but I left some minor aesthetic comments as well. One bigger-picture thought (which I don't think should be a blocker at all to releasing this tomorrow) that came to mind as I was reading:
```python
def add_one(x):
    return x + 1
```
Are functions only:
- user-definable inside of the same file, or
- importable from a public PyPI package?

Or can you do something chaotic like `import add_one from my_python_model` in `my_second_python_model.py`?
Related but broader question: I assume there are no Python macros? Either Jinja to template Python, or a global space for `add_one` to be defined once.
Currently you cannot do that. It is possible to support (though not in a very pretty way). @jtcohen6 maybe that's something to call out, that we don't support it.
I really consider this an open question for us!
I'm disinclined to do this one, given the chaos it could lead to, and over-reliance on the local file system:
> Or can you do something chaotic like `import add_one from my_python_model` in `my_second_python_model.py`?
I think we have two good options to pursue:
- Some platforms support registering these functions as "named" / persistent UDFs. This would be the case for "dbt should know about `function`s." These UDFs have some drawbacks: inconsistent support, and the main use case seems to be defining a Python UDF that then gets called from within a SQL model. That seems valuable for exposing a prediction function output of a trained ML model, less so for a utility like `add_one`.
- Potentially more promising: most of these platforms allow you to upload your own home-spun "Python package," as a `.zip`. You could then `import` it and use its functions, as if it were a public PyPI package.
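For illustration, a minimal Snowpark sketch of that first option, registering a persistent "named" UDF. The stage location and registration details here are hypothetical, not what dbt would actually do:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import IntegerType

def add_one(x: int) -> int:
    return x + 1

def register_add_one(session: Session) -> None:
    # Persist the UDF so it can later be called from SQL models
    session.udf.register(
        func=add_one,
        name="add_one",
        return_type=IntegerType(),
        input_types=[IntegerType()],
        is_permanent=True,
        stage_location="@my_stage",  # hypothetical stage
        replace=True,
    )
```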
I agree that you shouldn't be able to do arbitrary imports from functions defined in other models in the `models/` directory.

> most of these platforms allow you to upload your own home-spun "Python package," as a `.zip`. You could then `import` it and use its functions, as if it were a public PyPI package

Where would the code that gets zipped and uploaded be source controlled? This feels very analogous to the before-state of https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18 - code is magically available but no one really knows where it came from.

If we can resolve that, I do like the idea of an internal functions package.

Would finding `.py` functions defined in the `macros` directory (or adding `functions/` to your `macro-paths`) and embedding them all into an internal package be good enough at breaking reliance on the filesystem? This assumes that you're disinclined to enable imports from inside a model because it would be bad practice, not that it would be hard to implement technically.
> Where would the code that gets zipped and uploaded be source controlled?
Yup — 100% needs to live within the dbt project, and be version-controlled along with it.
I think that "upload" step will require different behind-the-scenes implementations across different data warehouses. We could even opt for UDFs as the backing implementation, if a given warehouse doesn't support arbitrary package upload. There's risk of a leaky abstraction here, but it feels like a place where dbt can help in a big way, and where we won't see the full value of Python models without some answer here.
Questions we need to answer:
- Is this "upload" step part of `dbt deps`? What if these reusable methods live in the root project, rather than a dbt package?
- How do we support the transition from local → cloud development (if we want to preserve that workflow)?
- Those Python models should ideally be (a) pure functions with (b) thorough unit testing. Can dbt help here? Or should we encourage standard best-practice Python development (`pytest`, etc.)?
@lostmygithubaccount @ChenyuLInx This feels like one of our highest-priority spikes during the v1.3 beta → final period.
> **Additional setup:** The `user` field in the `dbt-spark` profile, usually optional, is required for Python modeling.
>
> **Note:** Python models will be created and run as notebooks in your Databricks workspace. The notebooks will be created within the personal workspace of the `user` running dbt.
Should we add that we don't delete the notebook, so you can use it for interactive development of the model and bring the code back?
Good call!
Well done! I think it is already giving a lot of info (enough for people to start).

I just suggested a few changes. The most controversial being that I'd be in favor of using `DataFrame` everywhere instead of `data frame`.
> In v1.3, dbt Core is adding support for **Python models**.
>
> dbt's Python capabilities are an extension of its capabilities with SQL models. We recommend that you read this page first, and then read: ["dbt Models (Python)"](python-models)
Might be worth mentioning also that what says `.sql` below would need to be replaced by `.py` for Python. No need to go into too much detail on that page, but just calling it out.
I tried a few different permutations of this, but I wasn't happy with any. I'm going to stick with a minimal note for now. We have a lot of future work to do to figure out the right ways to integrate Python capabilities / references into the existing docs that say SQL, SQL, `.sql`.
Thank you for the quick reviews and excellent feedback @lostmygithubaccount @ChenyuLInx @KiraFuruichi @joellabes @jeremyholtzman @b-per!! Thrilled that this turned into a real team effort.
You have the conn, Furuichi
Good by me! Made this change.
> :::caution
>
> The implementation for Python models on GCP (BigQuery + Dataproc) is the roughest of the three. Running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.
>
> We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we're very open to hearing your thoughts in Slack or the GitHub discussions!
>
> :::
@dataders This reads a bit too harsh right now. My goal here is to set clear expectations for any BQ users around both (a) the setup experience, and (b) the possibility that this functionality is cut from v1.3 (final). Definitely open to your wordsmithing.
Suggested change:

> :::caution
>
> Here be dragons; this implementation for Python models on GCP (BigQuery + Dataproc) needs the most love of the three before the official release.
>
> For example, running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.
>
> We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we'd love your help; please share your thoughts in Slack or the GitHub discussions!
>
> :::
beta docs look great, @jtcohen6! small nits, non-blocking.
> Each Python model lives in a `.py` file in your `models/` folder. It defines a function named **`model()`**, which has two parameters:
> - **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
> - **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
Suggested change: "the connection to the **backend** engine" (instead of "backing engine").
not sure if this was a typo so just flagging
> The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax)
Suggested change: add a trailing period after the ["DataFrame API + syntax"](#dataframe-api--syntax) link.
> When you `dbt run --select python_model`, dbt will prepare and pass in both arguments (`dbt` and `session`). All you have to do is define the `model()` function that accepts them.
>
> ### Referencing other models
>
> Your Python model will want to read data from other models (SQL or Python) or sources. Do this using the `dbt.ref()` function. The same idea applies for raw source tables, via `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
Suggested change:

> Use the `dbt.ref()` function to let your Python model read data from other models (SQL or Python) or sources. The same idea applies for raw source tables from `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
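Putting those quoted pieces together, a minimal sketch of the pattern the docs describe (the upstream model name is hypothetical):

```python
def model(dbt, session):
    # dbt prepares and passes in both arguments at `dbt run` time;
    # dbt.ref() returns a DataFrame pointing at the upstream model
    upstream_df = dbt.ref("upstream_model_name")

    # ...transformation logic goes here...

    # the model() function must return a single DataFrame
    return upstream_df
```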
> The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It is not possible to pass another function or more complex data structure. The reason: dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
Suggested change:

> The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It's not possible to pass another function or a more complex data structure. The reason is that dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
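As a sketch of the literal-only constraint described above (the config keys shown are ordinary model configs; the upstream model name is hypothetical):

```python
def model(dbt, session):
    # OK: literal strings, booleans, and numbers only;
    # dbt reads these statically at parse time, without running the code
    dbt.config(
        materialized="table",
        enabled=True,
    )
    # Not OK: dbt.config(materialized=pick_materialization()),
    # since a function call can't be analyzed statically
    return dbt.ref("upstream_model_name")
```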
> #### Accessing project context
>
> dbt Python models do not use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available via the `dbt` class, passed in as an argument to the `model()` function.
dbt Python models do not use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available via the `dbt` class, passed in as an argument to the `model()` function. | |
dbt Python models don't use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available from the `dbt` class, passed in as an argument to the `model()` function. |
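A sketch of what accessing that context might look like, assuming the `dbt` class exposes `config.get()` and `this` (check the beta docs for the exact surface area):

```python
def model(dbt, session):
    # Assumption: dbt.config.get() reads a configured value at runtime
    materialization = dbt.config.get("materialized")

    # Assumption: dbt.this identifies the database object for this model
    print(f"Materializing {dbt.this} as {materialization}")

    return dbt.ref("upstream_model_name")  # hypothetical upstream model
```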
> **Additional setup:** The `user` field in your [Spark connection profile](spark-profile), usually optional, is required for running Python models.
Suggested change:

> **Additional setup:** The `user` field in your [Spark connection profile](spark-profile) (which is usually optional) is required for running Python models.
> The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and saves its final result back to BigQuery.
Suggested change: "...and save its final result back to BigQuery" (instead of "saves").
> **Additional setup:**
> - Create or use an existing [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets)
> - Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster), with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)
Suggested change: drop the comma, i.e. "...an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors)...".
> ## New and changed documentation
>
> - **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time, on data warehouses that support Python runtimes.
Suggested change: drop the comma, i.e. "...for the first time on data warehouses that support Python runtimes."
website/docs/terms/dataframe.md
> A DataFrame is a way of storing and manipulating tabular data in Python. (It's also used in other languages popular for data processing, such as R and Scala.)
>
> It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philly each year:
Consider using "Philadelphia" instead of "Philly", to be less US-centric and less colloquial.
Suggested change:

> It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philadelphia each year:
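For context, the chained transformation this glossary sentence leads into might look like the following pandas sketch (the schema here is hypothetical, not the docs' actual example):

```python
import pandas as pd

# Hypothetical schema: one row per person per year
df = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "city": ["Philadelphia", "Philadelphia", "Pittsburgh", "Philadelphia"],
    "year": [2020, 2020, 2020, 2021],
})

# Filter, group, and count in one chained expression
philly_by_year = (
    df[df["city"] == "Philadelphia"]
    .groupby("year")
    .agg(num_people=("person_id", "count"))
    .reset_index()
)
print(philly_by_year)
```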
Reading this felt like visiting my childhood home as an adult after many years and a new family lives there. Just when I thought I got out of the python data transformation world -- they pull me back in!

some considerations:
- "DataFrame" or "dataframe"? relatedly, do all references of the word need to link out to the glossary?
- I'm not sure what the WH-specific "Installing Packages" advice is regarding. As an antecedent step to defining model-level requirements?
- I'm 92% sure that `dbt.ref()` and the like are better categorized as methods, not functions... but don't care that much.
> - **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
> - **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform.
does the end user ever have to modify these parameters to be different than `dbt` and `session`? If not, why expose them to users at all?
The end user needs to access classmethods of `dbt`, and some properties of `session` as well. We think it's better to document those and make it clearer where they're coming from, rather than pulling them out of the aether.
We could have gone for a "script"-style approach, where `dbt` and `session` appear to be available "globally":

```python
def some_fn(x):
    ...

some_df = dbt.ref("some_model")
final_df = some_df.apply(some_fn)
final_df
```
Where really we just take users' code, and turn it into:

```python
def model(dbt, session):
    def some_fn(x):
        ...

    some_df = dbt.ref("some_model")
    final_df = some_df.apply(some_fn)
    return final_df  # we add this return?
```
It's still an option we could consider, if people like that a lot better! (@ChenyuLInx)
```python
# DataFrame representing an upstream model
upstream_model = dbt.ref("upstream_model_name")

# DataFrame representing an upstream source
upstream_source = dbt.source("upstream_source_name", "table_name")
```
are these lazily-evaluated references? or does `dbt.source()` read the reference into memory at time of definition?
Lazy evaluation if you're using a DataFrame API worth its salt — so yes! As soon as you convert to Pandas, though, that will read everything into memory.
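A PySpark sketch of the distinction (the model and column names are hypothetical):

```python
def model(dbt, session):
    # Lazily evaluated: these calls only build up a query plan;
    # nothing is read from the warehouse yet
    upstream = dbt.ref("upstream_model_name")
    active = upstream.filter(upstream["status"] == "active")

    # Converting to pandas forces execution and pulls the full result
    # into memory (Snowpark's equivalent is .to_pandas()):
    # pandas_df = active.toPandas()

    return active
```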
> **Note:** Python models will be created and run as notebooks in your Databricks workspace. The notebook will be created within the personal workspace of the `user` running dbt, and named after the model it is executing. dbt will update the notebook on subsequent runs of the same model, but it will not delete it—so you can use the notebook for quicker interactive development. Just remember to update the code in your dbt model, based on your in-notebook iteration, before the next `dbt run`.
>
> **Installing packages:** We recommend configuring packages on the interactive cluster which you will be using to run your Python models.
are we recommending to initially install all the packages you might use before creating python models, in lieu of specifying requirements as part of a Python model's specification? As in, it's not easy/possible/practical to configure model-specific python environments, so just do it once manually outside of dbt?
This totally varies based on the backend / implementation. On Databricks, it is possible to tell a cluster to run `pip install <some_package>` in a first notebook cell. That can take quite a bit of time, though. Their recommendation (and it makes sense to me) is to configure the needed packages on the cluster ahead of time.

By configuring packages in that way, it does mean that two models cannot have conflicting package version requirements. I think that's a recipe for disaster, anyway...
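For the model-level side of this, a hedged sketch of declaring packages via `dbt.config()`, assuming the `packages` config discussed for the beta is available (the package pin and model name are hypothetical):

```python
def model(dbt, session):
    # Assumption: the `packages` config asks the platform to make these
    # importable; on Databricks, pre-installing them on the cluster
    # avoids the per-run `pip install` cost described above
    dbt.config(packages=["scikit-learn==1.1.1"])

    import sklearn  # noqa: F401  (available once the environment provides it)

    return dbt.ref("upstream_model_name")
```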
Hey @jtcohen6! Just some suggestions. Mostly would like to see a bit of context and examples for the beta, so people who see this first want to try it out!
> - [Weigh in on our developing best practices](https://github.com/dbt-labs/docs.getdbt.com/discussions/1811)
> - Join the **#beta-feedback-python-models** channel in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community/)
>
> Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Suggested change:

> Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, discuss dbt Python models with colleagues and friends.
I was trying to be a little cute here. Not sure if I'm achieving it, might be better to cut altogether:
Suggested change:

> Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions; leave thoughts in Slack; bring up dbt + Python in casual conversation with colleagues and friends.
Thanks for the thorough reviews @nghi-ly @dataders @runleonarun!! I'm working through your feedback, haven't gotten all the way through, but pushing up some intermediate progress.
Good question! How does the docs team feel about this? Every instance, just the first one on a given page, something in between?
You’re 100% right. Deliciously, in our implementation, dbt.ref() is a classmethod… pointing to a function named ref()! At the risk of some more inconsistency, I’ve done my best to replace instances of “function” with “method,” when that’s the one I mean.
At the risk of too much text—this page will definitely want to be split up before final release!—I've added a more exciting intro, as well as a placeholder example, before we get into the nitty-gritty implementation details.
Looks great!
resolves #1664

### Description & motivation

The first set of "dbt Python model" functionality will be included in v1.3.0-b1 (planned release: next week or the week after). We'll be asking folks to beta-test, and we need to give them docs so that they can.

### TODO before merge

### Future — shouldn't block merge for beta

### Prerelease docs

If this change is related to functionality in a prerelease version of dbt (delete if not applicable):

### Checklist

If you added new pages (delete if not applicable):
- `website/sidebars.js`