
[Beta] Python models #1754

Merged: 10 commits merged into current from feat/python-models-beta on Aug 1, 2022

Conversation

@jtcohen6 (Collaborator) commented Jul 20, 2022:

resolves #1664

Description & motivation

The first set of "dbt Python model" functionality will be included in v1.3.0-b1 (planned release: next week or the week after). We'll be asking folks to beta-test, and we need to give them docs so that they can.

  • Add initial docs for Python models
  • Add v1.3 to the version picker (although it doesn't seem to be showing up?)
  • Initialize v1.3 migration guide

TODO before merge

  • Prose / code snippets I called out as TODO
  • Review based on code implementation that actually ends up in v1.3.0-b1
  • In places where the syntax differs, should we include warehouse/engine-specific code options for Snowpark + PySpark (+ Pandas)? — I limited the differences as much as possible

Future — shouldn't block merge for beta

  • Demo / walk-through that @lostmygithubaccount is putting together. Should this go in the docs, or a DevHub post + GitHub repo?
  • Updates to many, many, many more pages throughout the docs that make reference to dbt models being SQL only

Prerelease docs

If this change is related to functionality in a prerelease version of dbt (delete if not applicable):

Checklist

If you added new pages (delete if not applicable):

  • The page has been added to website/sidebars.js
  • The new page has a unique filename

@netlify netlify bot commented Jul 20, 2022:

Deploy Preview for docs-getdbt-com ready!
🔨 Latest commit: 3e4c332
🔍 Latest deploy log: https://app.netlify.com/sites/docs-getdbt-com/deploys/62e68a7120ea8000083b2382
😎 Deploy Preview: https://deploy-preview-1754--docs-getdbt-com.netlify.app

@github-actions bot added labels on Jul 20, 2022: content (Improvements or additions to content), size: large (This change will take more than a week to address and might require more than one person)
@jtcohen6 jtcohen6 force-pushed the feat/python-models-beta branch from 2f01cbe to 231b515 Compare July 28, 2022 16:08
@jtcohen6 jtcohen6 marked this pull request as ready for review July 28, 2022 16:08
@jtcohen6 (Collaborator, Author) commented:

This needs another pass-through, but I'd like to open it up for feedback now. Aiming to get these docs live & ready for beta testers, as soon as there's a beta prerelease to start using.

@nghi-ly nghi-ly self-assigned this Jul 28, 2022
@KiraFuruichi (Contributor) commented:

This is so extremely exciting :) Jeremy H beat me to some of the things I initially noticed (explaining why view + ephemeral materializations are not supported, spelling typo), but I left some minor aesthetic comments as well.

Bigger picture (which I don't think should be a blocker at all to releasing this tomorrow), a few things I was thinking about as I read this:

  • General use cases for Python dbt models: I think this doc covers really well what dbt Python models can do in comparison to what current dbt models do. Are there certain things we don't want people trying out yet? Are there certain scenarios we can provide (or link to) to help folks/beta testers find use cases for Python models? I'm sure if they're trying them out in the beta, they already have some ideas of their own, which is why I don't think this is a blocker or anything.
  • Python model location: I know it's mentioned here that Python models should live in the /models directory, which makes total sense to me, but I was also left with a sense of, "But where in the /models directory?" I wonder down the line how we could help folks figure out where Python models should live, especially as they start being used for DS/ML. Right now, it makes a ton of sense for Python models to fall under the normal dimensional modeling/modular data modeling structure teams have in place (since [assumption] they are likely to be purely transformative models like current dbt models). This is also somewhat related to the comment in the UDF section around a function endpoint, as you consider Python functions to potentially be reusable across multiple dbt models.
  • So freaking excited to see a data frame glossary page! In the next few weeks I'm going to (actually do my job and) expand on that glossary page to (selfishly) build in some SEO and follow the format we have going for other glossary pages (if this is cool with you).

Comment on lines +197 to +199
```python
def add_one(x):
    return x + 1
```
Contributor:

Are functions only:

  • user-definable inside of the same file
  • importable from a public PyPI package?

Or can you do something chaotic like import add_one from my_python_model in my_second_python_model.py?

Related but broader question: I assume there are no Python macros? Either Jinja to template Python, or a global space for add_one to be defined once.

Contributor:

Currently you cannot do that. It would be possible to support (though not in a very pretty way). @jtcohen6 maybe that's something to call out: that we don't support it.

@jtcohen6 (Collaborator, Author):

I really consider this an open question for us!

I'm disinclined to do this one, given the chaos it could lead to, and over-reliance on the local file system:

Or can you do something chaotic like import add_one from my_python_model in my_second_python_model.py?

I think we have two good options to pursue:

  • Some platforms support registering these functions as "named" / persistent UDFs. This would be the case for "dbt should know about functions." These UDFs have some drawbacks, including inconsistent support, and the main use case seems to be defining a Python UDF that then gets called from within a SQL model. That seems valuable for exposing the prediction function of a trained ML model, less so for a utility like add_one.
  • Potentially more promising: most of these platforms allow you to upload your own home-spun "Python package" as a .zip file. You could then import it and use its functions, as if it were a public PyPI package.

Contributor:

I agree that you shouldn't be able to do arbitrary imports from functions defined in other models in the models/ directory.

most of these platforms allow you to upload your own home-spun "Python package" as a .zip file. You could then import it and use its functions, as if it were a public PyPI package

Where would the code that gets zipped and uploaded be source controlled? This feels very analogous to the before-state of https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18 - code is magically available but no one really knows where it came from.

If we can resolve that, I do like the idea of an internal functions package.

Would finding .py functions defined in the macros directory (or adding functions/ to your macro-paths) and embedding them all into an internal package be good enough at breaking reliance on the filesystem? This assumes that you're disinclined to enable imports from inside a model because it would be bad practice, not that it would be hard to implement technically.

@jtcohen6 (Collaborator, Author):

Where would the code that gets zipped and uploaded be source controlled?

Yup — 100% needs to live within the dbt project, and be version-controlled along with it.

I think that "upload" step will require different behind-the-scenes implementations across different data warehouses. We could even opt for UDFs as the backing implementation, if a given warehouse doesn't support arbitrary package upload. There's risk of a leaky abstraction here, but it feels like a place where dbt can help in a big way, and where we won't see the full value of Python models without some answer here.

Questions we need to answer:

  • Is this "upload" step part of dbt deps? What if these reusable methods live in the root project, rather than a dbt package?
  • How to support the transition from local → cloud development (if we want to preserve that workflow)?
  • Those Python models should ideally be (a) pure functions with (b) thorough unit testing. Can dbt help here? Or should we encourage standard best-practice Python development (pytest, etc.)?

@lostmygithubaccount @ChenyuLInx This feels like one of our highest-priority spikes during the v1.3 beta → final period.
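For context, here's a minimal sketch of the pattern that is supported today: a helper function defined and applied within the same model file. The model name, column names, and the to_pandas() conversion (which assumes a Snowpark-style DataFrame) are all illustrative, not prescriptive.

```python
def add_one(x):
    # Helper defined in the same .py file as the model; not importable from other models
    return x + 1

def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns a DataFrame pointing at the upstream model (name is hypothetical)
    upstream_df = dbt.ref("upstream_model")

    # Convert to Pandas to apply a plain Python function (assumes Snowpark's to_pandas())
    df = upstream_df.to_pandas()
    df["SOME_NUMBER_PLUS_ONE"] = df["SOME_NUMBER"].apply(add_one)

    return df
```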



**Additional setup:** The `user` field in the `dbt-spark` profile, usually optional, is required for Python modeling.

**Note:** Python models will be created and run as notebooks in your Databricks workspace. The notebooks will be created within the personal workspace of the `user` running dbt.
Contributor:

Should we add that we don't delete the notebook, so you can use it for interactive development of the model and bring the code back?

@jtcohen6 (Collaborator, Author):

Good call!

@b-per (Contributor) left a comment:

Well done! I think it is already giving a lot of info (enough for people to start).

I just suggested a few changes. The most controversial is that I'd be in favor of using "DataFrame" everywhere instead of "data frame".


In v1.3, dbt Core is adding support for **Python models**.

dbt's Python capabilities are an extension of its capabilities with SQL models. We recommend that you read this page first, and then read: ["dbt Models (Python)"](python-models)
Contributor:

Might also be worth mentioning that anything that says .sql below would need to be replaced by .py for Python. No need to go into too much detail on that page, but just calling it out.

@jtcohen6 (Collaborator, Author):

I tried a few different permutations of this, but I wasn't happy with any of them. I'm going to stick with a minimal note for now. We have a lot of future work to do to figure out the right ways to integrate Python capabilities / references into the existing docs that say SQL, SQL, .sql

@jtcohen6 jtcohen6 requested a review from dataders as a code owner July 29, 2022 11:31
@jtcohen6 (Collaborator, Author) commented:

Thank you for the quick reviews and excellent feedback @lostmygithubaccount @ChenyuLInx @KiraFuruichi @joellabes @jeremyholtzman @b-per!! Thrilled that this turned into a real team effort.

So freaking excited to see a data frame glossary page! In the next few weeks I'm going to (actually do my job and) expand on that glossary page to (selfishly) build in some SEO and follow the format we have going for other glossary pages (if this is cool with you).

You have the conn, Furuichi

The most controversial being that I'd be in favor of using DataFrame everywhere instead of data frame

Good by me! Made this change.

Comment on lines 488 to 494
:::caution

The implementation for Python models on GCP (BigQuery + Dataproc) is the roughest of the three. Running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.

We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we're very open to hearing your thoughts in Slack or the GitHub discussions!

:::
@jtcohen6 (Collaborator, Author):

@dataders This reads a bit too harsh right now. My goal here is to set clear expectations for any BQ users around both (a) the setup experience, (b) the possibility that this functionality is cut from v1.3 (final). Definitely open to your wordsmithing

Contributor:

Suggested change
:::caution
The implementation for Python models on GCP (BigQuery + Dataproc) is the roughest of the three. Running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.
We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we're very open to hearing your thoughts in Slack or the GitHub discussions!
:::
:::caution
Here be dragons; this implementation for Python models on GCP (BigQuery + Dataproc) needs the most love of the three before the official release.
For example, running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.
We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we'd love your help; please share your thoughts in Slack or the GitHub discussions!
:::

@nghi-ly (Contributor) left a comment:

beta docs look great, @jtcohen6 ! small nits. non-blocking.


Each Python model lives in a `.py` file in your `models/` folder. It defines a function named **`model()`**, which has two parameters:
- **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
Contributor:

Suggested change
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
- **`session`**: A class representing the connection to the backend engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.

@nghi-ly (Contributor), Jul 29, 2022:

not sure if this was a typo so just flagging

- **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.

The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax)
Contributor:

Suggested change
The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax)
The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax).

When you `dbt run --select python_model`, dbt will prepare and pass in both arguments (`dbt` and `session`). All you have to do is define the `model()` function that accepts them.

### Referencing other models
Your Python model will want to read data from other models (SQL or Python) or sources. Do this using the `dbt.ref()` function. The same idea applies for raw source tables, via `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
Contributor:

Suggested change
Your Python model will want to read data from other models (SQL or Python) or sources. Do this using the `dbt.ref()` function. The same idea applies for raw source tables, via `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
Use the `dbt.ref()` function to let your Python model read data from other models (SQL or Python) or sources. The same idea applies for raw source tables from `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
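As a rough illustration of the paragraph above (the model, source, and column names are placeholders), both methods return DataFrames that can be combined like any other:

```python
def model(dbt, session):
    # Upstream dbt model and raw source table (names are hypothetical)
    customers = dbt.ref("stg_customers")
    orders = dbt.source("jaffle_shop", "raw_orders")

    # Join the two DataFrames; this syntax works in both Snowpark and PySpark
    joined = customers.join(orders, customers["customer_id"] == orders["customer_id"])

    return joined
```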


</File>

The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It is not possible to pass another function or more complex data structure. The reason: dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
Contributor:

Suggested change
The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It is not possible to pass another function or more complex data structure. The reason: dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It's not possible to pass another function or a more complex data structure. The reason is because dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
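To make that concrete, here's a short sketch; the config values and the commented-out helper are purely illustrative:

```python
def model(dbt, session):
    # OK: literal values that dbt can read while statically parsing the model
    dbt.config(materialized="table", enabled=True)

    # NOT OK: dbt can't evaluate a function call (or any other expression) at parse time
    # dbt.config(materialized=pick_materialization())

    return dbt.ref("upstream_model")  # hypothetical upstream model
```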


#### Accessing project context

dbt Python models do not use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available via the `dbt` class, passed in as an argument to the `model()` function.
Contributor:

Suggested change
dbt Python models do not use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available via the `dbt` class, passed in as an argument to the `model()` function.
dbt Python models don't use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available from the `dbt` class, passed in as an argument to the `model()` function.


<div warehouse="Databricks">

**Additional setup:** The `user` field in your [Spark connection profile](spark-profile), usually optional, is required for running Python models.
Contributor:

Suggested change
**Additional setup:** The `user` field in your [Spark connection profile](spark-profile), usually optional, is required for running Python models.
**Additional setup:** The `user` field in your [Spark connection profile](spark-profile) (which is usually optional) is required for running Python models.


<div warehouse="BigQuery">

The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and saves its final result back to BigQuery.
Contributor:

Suggested change
The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and saves its final result back to BigQuery.
The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and save its final result back to BigQuery.


**Additional setup:**
- Create or use an existing [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets)
- Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster), with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)
Contributor:

Suggested change
- Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster), with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)
- Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)


## New and changed documentation

- **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time, on data warehouses that support Python runtimes.
Contributor:

Suggested change
- **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time, on data warehouses that support Python runtimes.
- **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time on data warehouses that support Python runtimes.


A DataFrame is a way of storing and manipulating tabular data in Python. (It's also used in other languages popular for data processing, such as R and Scala.)

It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philly each year:
Contributor:

Consider using Philadelphia instead of Philly, to be less US-centric and less colloquial.

Contributor:

Suggested change
It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philly each year:
It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philadelphia each year:
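For instance, in Pandas (one common DataFrame library), that chained transformation might look roughly like this; the toy data and column names are made up:

```python
import pandas as pd

# Toy stand-in for "one row per person per year in the Eastern US"
df = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2022],
    "city": ["Philadelphia", "New York", "Philadelphia", "Philadelphia", "Boston"],
})

# Filter to Philadelphia, then count people per year
philly_per_year = (
    df[df["city"] == "Philadelphia"]
    .groupby("year")
    .size()
    .reset_index(name="num_people")
)

print(philly_per_year)
```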

@dataders (Contributor) left a comment:

reading this felt like visiting my childhood home as an adult after many years and a new family lives there. Just when I thought I got out of the python data transformation world -- they pull me back in!

some considerations:

  1. "DataFrame" or "dataframe"? relatedly, do all references of the word need to link out to the glossary?
  2. I'm not sure what the WH-specific "Installing Packages" advice is regarding. As an antecedent step to defining model level requirements?
  3. I'm 92% sure that dbt.ref() and the like are better categorized as methods, not functions... but I don't care that much.

Comment on lines 28 to 29
- **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
Contributor:

Does the end user ever have to modify these parameters to be something other than dbt and session? If not, why expose them to users at all?

@jtcohen6 (Collaborator, Author):

The end user needs to access classmethods of dbt, and some properties of session as well. We think it's better to document those and make it clearer where they're coming from, rather than pulling them out of the aether.

We could have gone for a "script"-style approach, where dbt and session appear to be available "globally":

```python
def some_fn(x):
    ...

some_df = dbt.ref("some_model")
final_df = some_df.apply(some_fn)

final_df
```

Where really we just take users' code, and turn it into:

```python
def model(dbt, session):
    def some_fn(x):
        ...

    some_df = dbt.ref("some_model")
    final_df = some_df.apply(some_fn)

    return final_df  # we add this return?
```

It's still an option we could consider, if people like that a lot better! (@ChenyuLInx)

Comment on lines +55 to +59
```python
# DataFrame representing an upstream model
upstream_model = dbt.ref("upstream_model_name")

# DataFrame representing an upstream source
upstream_source = dbt.source("upstream_source_name", "table_name")
```
Contributor:

Are these lazily-evaluated references? Or does dbt.source() read the reference into memory at time of definition?

@jtcohen6 (Collaborator, Author):

Lazy evaluation if you're using a DataFrame API worth its salt — so yes! As soon as you convert to Pandas, though, that will read everything into memory.
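A quick sketch of that distinction, assuming a Snowpark-style DataFrame where to_pandas() is the conversion method (the model and column names are hypothetical):

```python
def model(dbt, session):
    orders = dbt.ref("orders")                     # lazy: just a reference, nothing is read yet
    large = orders.filter(orders["amount"] > 100)  # still lazy: adds a step to the query plan
    pdf = large.to_pandas()                        # executes the query and pulls rows into memory
    pdf["amount_doubled"] = pdf["amount"] * 2      # from here on, everything is in-memory Pandas
    return pdf
```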


**Note:** Python models will be created and run as notebooks in your Databricks workspace. The notebook will be created within the personal workspace of the `user` running dbt, and named after the model it is executing. dbt will update the notebook on subsequent runs of the same model, but it will not delete it—so you can use the notebook for quicker interactive development. Just remember to update the code in your dbt model, based on your in-notebook iteration, before the next `dbt run`.

**Installing packages:** We recommend configuring packages on the interactive cluster which you will be using to run your Python models.
Contributor:

Are we recommending initially installing all the packages you might use before creating Python models, in lieu of specifying requirements as part of a Python model's specification? As in, it's not easy/possible/practical to configure model-specific Python environments, so just do it once manually outside of dbt?

@jtcohen6 (Collaborator, Author):

This totally varies based on the backend / implementation. On Databricks, it is possible to tell a cluster to run `pip install <some_package>` in a first notebook cell. That can take quite a bit of time, though. Their recommendation (and it makes sense to me) is to configure the needed packages on the cluster ahead of time.

By configuring packages in that way, it does mean that two models cannot have conflicting package version requirements. I think that's a recipe for disaster, anyway...

@runleonarun (Collaborator) left a comment:

Hey @jtcohen6! Just some suggestions. Mostly I'd like to see a bit of context and examples for the beta, so people who see this first want to try it out!

- [Weigh in on our developing best practices](https://github.com/dbt-labs/docs.getdbt.com/discussions/1811)
- Join the **#beta-feedback-python-models** channel in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community/)

Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Collaborator:

Suggested change
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, discuss dbt Python models with colleagues and friends.

@jtcohen6 (Collaborator, Author):

I was trying to be a little cute here. Not sure if I'm achieving it, might be better to cut altogether:

Suggested change
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions; leave thoughts in Slack; bring up dbt + Python in casual conversation with colleagues and friends.

@jtcohen6 (Collaborator, Author) commented Jul 30, 2022:

Thanks for the thorough reviews @nghi-ly @dataders @runleonarun!!

I'm working through your feedback, haven't gotten all the way through, but pushing up some intermediate progress.


@dataders

"DataFrame" or "dataframe"?

Benoit says “DataFrame,” and I think that’s the most technically correct. It looks a bit stuffy to me, but I’m okay with the formality for now — it’s a new concept for many.

relatedly, do all references of the word need to link out to the glossary?

Good question! How does the docs team feel about this? Every instance, just the first one on a given page, something in between?

I'm 92% sure that dbt.ref() and the like are better categorized as methods, not functions... but don't care that much.

You’re 100% right. Deliciously, in our implementation, dbt.ref() is a classmethod… pointing to a function named ref()! At the risk of some more inconsistency, I’ve done my best to replace instances of “function” with “method,” when that’s the one I mean.


@runleonarun

Mostly I'd like to see a bit of context and examples for the beta, so people who see this first want to try it out!

At the risk of too much text—this page will definitely want to be split up before final release!—I've added a more exciting intro, as well as a placeholder example, before we get into the nitty-gritty implementation details.

@runleonarun (Collaborator) left a comment:

Looks great!

@jtcohen6 jtcohen6 merged commit d74c7f5 into current Aug 1, 2022
@jtcohen6 jtcohen6 deleted the feat/python-models-beta branch August 1, 2022 20:30