
[Beta] Python models #1754

Merged: 10 commits merged into current from feat/python-models-beta on Aug 1, 2022

Conversation

@jtcohen6 (Collaborator) commented Jul 20, 2022:

resolves #1664

Description & motivation

The first set of "dbt Python model" functionality will be included in v1.3.0-b1 (planned release: next week or the week after). We'll be asking folks to beta-test, and we need to give them docs so that they can.

  • Add initial docs for Python models
  • Add v1.3 to the version picker (although it doesn't seem to be showing up?)
  • Initialize v1.3 migration guide

TODO before merge

  • Prose / code snippets I called out as TODO
  • Review based on code implementation that actually ends up in v1.3.0-b1
  • In places where the syntax differs, should we include warehouse/engine-specific code options for Snowpark + PySpark (+ Pandas)? — I limited the differences as much as possible

Future — shouldn't block merge for beta

  • Demo / walk-through that @lostmygithubaccount is putting together. Should this go in the docs, or a DevHub post + GitHub repo?
  • Updates to many, many, many more pages throughout the docs that make reference to dbt models being SQL only

Prerelease docs

If this change is related to functionality in a prerelease version of dbt (delete if not applicable):

Checklist

If you added new pages (delete if not applicable):

  • The page has been added to website/sidebars.js
  • The new page has a unique filename

@netlify netlify bot commented Jul 20, 2022:

Deploy Preview for docs-getdbt-com ready!
🔨 Latest commit: 3e4c332
🔍 Latest deploy log: https://app.netlify.com/sites/docs-getdbt-com/deploys/62e68a7120ea8000083b2382
😎 Deploy Preview: https://deploy-preview-1754--docs-getdbt-com.netlify.app

@github-actions bot added labels on Jul 20, 2022: content (Improvements or additions to content), size: large (This change will take more than a week to address and might require more than one person)
@jtcohen6 jtcohen6 force-pushed the feat/python-models-beta branch from 2f01cbe to 231b515 Compare July 28, 2022 16:08
@jtcohen6 jtcohen6 marked this pull request as ready for review July 28, 2022 16:08
@jtcohen6 (Collaborator, Author) commented:

This needs another pass-through, but I'd like to open it up for feedback now. Aiming to get these docs live & ready for beta testers, as soon as there's a beta prerelease to start using.

@nghi-ly nghi-ly self-assigned this Jul 28, 2022
@KiraFuruichi (Contributor) commented:

This is so extremely exciting :) Jeremy H beat me to some of the things I initially noticed (explaining why view + ephemeral materializations are not supported, spelling typo), but I left some minor aesthetic comments as well.

Bigger picture (which I don't think should be a blocker at all to releasing this tomorrow), a few things I was thinking about as I read this:

  • General use cases for Python dbt models: I think this doc covers really well what dbt Python models can do in comparison to what current dbt models do. Are there certain things we don't want people trying out yet? Are there certain scenarios we can provide (or link to) to help folks/beta testers find use cases for Python models? I'm sure if they're trying them out in the beta, they already have some ideas of their own, which is why I don't think this is a blocker or anything.
  • Python model location: I know it's mentioned here that Python models should live in the /models directory, which makes total sense to me, but I was also left with a sense of, "But where in the /models directory?" I wonder down the line how we could help folks figure out where Python models should live, especially as they start being used for DS/ML. Right now, it makes a ton of sense for Python models to fall under the normal dimensional modeling/modular data modeling structure teams have in place (since [assumption] they are likely to be purely transformative models like current dbt models). This is also somewhat related to the comment in the UDF section around a function endpoint, as you consider Python functions to potentially be reusable across multiple dbt models.
  • So freaking excited to see a data frame glossary page! In the next few weeks I'm going to (actually do my job and) expand on that glossary page to (selfishly) build in some SEO and follow the format we have going for other glossary pages (if this is cool with you).

Comment on lines +197 to +199
```python
def add_one(x):
    return x + 1
```
Contributor:

Are functions only:

  • user-definable inside of the same file
  • importable from a public PyPI package?

Or can you do something chaotic like import add_one from my_python_model in my_second_python_model.py?

Related but broader question: I assume there are no Python macros? Either Jinja to template Python, or a global space for add_one to be defined once.

Contributor:

Currently you cannot do that. It would be possible to support (though not in a very pretty way). @jtcohen6 maybe that's something to call out: that we don't support it.

@jtcohen6 (Collaborator, Author):

I really consider this an open question for us!

I'm disinclined to do this one, given the chaos it could lead to, and over-reliance on the local file system:

Or can you do something chaotic like import add_one from my_python_model in my_second_python_model.py?

I think we have two good options to pursue:

  • Some platforms support registering these functions as "named" / persistent UDFs. This would be the case for "dbt should know about functions." These UDFs have some drawbacks, including inconsistent support, and the main use case seems to be defining a Python UDF that then gets called from within a SQL model. That seems valuable for exposing the prediction function of a trained ML model, less so for a utility like add_one.
  • Potentially more promising: most of these platforms allow you to upload your own home-spun "Python package" as a .zip file. You could then import it and use its functions, as if it were a public PyPI package.

Contributor:

I agree that you shouldn't be able to do arbitrary imports from functions defined in other models in the models/ directory.

most of these platforms allow you to upload your own home-spun "Python package" as a .zip file. You could then import it and use its functions, as if it were a public PyPI package

Where would the code that gets zipped and uploaded be source controlled? This feels very analogous to the before-state of https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18 - code is magically available but no one really knows where it came from.

If we can resolve that, I do like the idea of an internal functions package.

Would finding .py functions defined in the macros directory (or adding functions/ to your macro-paths) and embedding them all into an internal package be good enough at breaking reliance on the filesystem? This assumes that you're disinclined to enable imports from inside a model because it would be bad practice, not that it would be hard to implement technically.

@jtcohen6 (Collaborator, Author):

Where would the code that gets zipped and uploaded be source controlled?

Yup — 100% needs to live within the dbt project, and be version-controlled along with it.

I think that "upload" step will require different behind-the-scenes implementations across different data warehouses. We could even opt for UDFs as the backing implementation, if a given warehouse doesn't support arbitrary package upload. There's risk of a leaky abstraction here, but it feels like a place where dbt can help in a big way, and where we won't see the full value of Python models without some answer here.

Questions we need to answer:

  • Is this "upload" step part of dbt deps? What if these reusable methods live in the root project, rather than a dbt package?
  • How to support the transition from local → cloud development (if we want to preserve that workflow)?
  • Those Python models should ideally be (a) pure functions with (b) thorough unit testing. Can dbt help here? Or should we encourage standard best-practice Python development (pytest, etc.)?

@lostmygithubaccount @ChenyuLInx This feels like one of our highest-priority spikes during the v1.3 beta → final period.
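For context, here's a minimal sketch of the pattern that is supported today: a helper function defined and applied within the same model file. The model name, column names, and the to_pandas() conversion (which assumes a Snowpark-style DataFrame) are all illustrative, not prescriptive.

```python
def add_one(x):
    # Helper defined in the same .py file as the model; not importable from other models
    return x + 1

def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns a DataFrame pointing at the upstream model (name is hypothetical)
    upstream_df = dbt.ref("upstream_model")

    # Convert to Pandas to apply a plain Python function (assumes Snowpark's to_pandas())
    df = upstream_df.to_pandas()
    df["SOME_NUMBER_PLUS_ONE"] = df["SOME_NUMBER"].apply(add_one)

    return df
```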



**Additional setup:** The `user` field in the `dbt-spark` profile, usually optional, is required for Python modeling.

**Note:** Python models will be created and run as notebooks in your Databricks workspace. The notebooks will be created within the personal workspace of the `user` running dbt.
Contributor:

Should we add that we don't delete the notebook, so you can use it for interactive development of the model and bring the code back?

@jtcohen6 (Collaborator, Author):

Good call!

@b-per (Contributor) left a comment:

Well done! I think it is already giving a lot of info (enough for people to start).

I just suggested a few changes. The most controversial is that I'd be in favor of using "DataFrame" everywhere instead of "data frame".


In v1.3, dbt Core is adding support for **Python models**.

dbt's Python capabilities are an extension of its capabilities with SQL models. We recommend that you read this page first, and then read: ["dbt Models (Python)"](python-models)
Contributor:

Might also be worth mentioning that anything that says .sql below would need to be replaced by .py for Python. No need to go into too much detail on that page, but just calling it out.

@jtcohen6 (Collaborator, Author):

I tried a few different permutations of this, but I wasn't happy with any of them. I'm going to stick with a minimal note for now. We have a lot of future work to do to figure out the right ways to integrate Python capabilities / references into the existing docs that say SQL, SQL, .sql

@jtcohen6 jtcohen6 requested a review from dataders as a code owner July 29, 2022 11:31
@jtcohen6 (Collaborator, Author) commented:

Thank you for the quick reviews and excellent feedback @lostmygithubaccount @ChenyuLInx @KiraFuruichi @joellabes @jeremyholtzman @b-per!! Thrilled that this turned into a real team effort.

So freaking excited to see a data frame glossary page! In the next few weeks I'm going to (actually do my job and) expand on that glossary page to (selfishly) build in some SEO and follow the format we have going for other glossary pages (if this is cool with you).

You have the conn, Furuichi

The most controversial being that I'd be in favor of using DataFrame everywhere instead of data frame

Good by me! Made this change.

Comment on lines 488 to 494
:::caution

The implementation for Python models on GCP (BigQuery + Dataproc) is the roughest of the three. Running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.

We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we're very open to hearing your thoughts in Slack or the GitHub discussions!

:::
@jtcohen6 (Collaborator, Author):

@dataders This reads a bit too harsh right now. My goal here is to set clear expectations for any BQ users around both (a) the setup experience, (b) the possibility that this functionality is cut from v1.3 (final). Definitely open to your wordsmithing

Contributor:

Suggested change
:::caution
The implementation for Python models on GCP (BigQuery + Dataproc) is the roughest of the three. Running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.
We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we're very open to hearing your thoughts in Slack or the GitHub discussions!
:::
:::caution
Here be dragons; this implementation for Python models on GCP (BigQuery + Dataproc) needs the most love of the three before the official release.
For example, running PySpark on Dataproc requires more manual setup and configuration. Clusters require sufficient resources or auto-scaling policies to handle concurrent Python model runs.
We have made the code available for the beta, but we are reserving the right to leave it out of the final v1.3 release if the experience is too unfriendly to end users. If you're a GCP expert, we'd love your help; please share your thoughts in Slack or the GitHub discussions!
:::

@nghi-ly (Contributor) left a comment:

beta docs look great, @jtcohen6 ! small nits. non-blocking.


Each Python model lives in a `.py` file in your `models/` folder. It defines a function named **`model()`**, which has two parameters:
- **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
Contributor:

Suggested change
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
- **`session`**: A class representing the connection to the backend engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.

@nghi-ly (Contributor), Jul 29, 2022:

not sure if this was a typo so just flagging

- **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.

The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax)
Contributor:

Suggested change
The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax)
The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or Pandas DataFrame. On PySpark (Databricks + BigQuery), this should be a PySpark DataFrame (converted back from Pandas if needed). For more about choosing between Pandas and native DataFrames, see ["DataFrame API + syntax"](#dataframe-api--syntax).

When you `dbt run --select python_model`, dbt will prepare and pass in both arguments (`dbt` and `session`). All you have to do is define the `model()` function that accepts them.

### Referencing other models
Your Python model will want to read data from other models (SQL or Python) or sources. Do this using the `dbt.ref()` function. The same idea applies for raw source tables, via `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
Contributor:

Suggested change
Your Python model will want to read data from other models (SQL or Python) or sources. Do this using the `dbt.ref()` function. The same idea applies for raw source tables, via `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
Use the `dbt.ref()` function to let your Python model read data from other models (SQL or Python) or sources. The same idea applies for raw source tables from `dbt.source()`. Those functions return DataFrames, pointing to the upstream source, model, seed, or snapshot.
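As a rough illustration of the paragraph above (the model, source, and column names are placeholders), both methods return DataFrames that can be combined like any other:

```python
def model(dbt, session):
    # Upstream dbt model and raw source table (names are hypothetical)
    customers = dbt.ref("stg_customers")
    orders = dbt.source("jaffle_shop", "raw_orders")

    # Join the two DataFrames; this syntax works in both Snowpark and PySpark
    joined = customers.join(orders, customers["customer_id"] == orders["customer_id"])

    return joined
```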


</File>

The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It is not possible to pass another function or more complex data structure. The reason: dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
Contributor:

Suggested change
The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It is not possible to pass another function or more complex data structure. The reason: dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
The `config()` function accepts _only_ literal values (strings, booleans, and numeric types). It's not possible to pass another function or a more complex data structure. The reason is because dbt statically analyzes the arguments to `config()` while parsing your model, without actually executing any of your Python code.
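To make that concrete, here's a short sketch; the config values and the commented-out helper are purely illustrative:

```python
def model(dbt, session):
    # OK: literal values that dbt can read while statically parsing the model
    dbt.config(materialized="table", enabled=True)

    # NOT OK: dbt can't evaluate a function call (or any other expression) at parse time
    # dbt.config(materialized=pick_materialization())

    return dbt.ref("upstream_model")  # hypothetical upstream model
```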


#### Accessing project context

dbt Python models do not use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available via the `dbt` class, passed in as an argument to the `model()` function.
Contributor:

Suggested change
dbt Python models do not use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available via the `dbt` class, passed in as an argument to the `model()` function.
dbt Python models don't use Jinja to render compiled code. Compared to SQL models, Python models have very limited access to global project context. That context is made available from the `dbt` class, passed in as an argument to the `model()` function.


<div warehouse="Databricks">

**Additional setup:** The `user` field in your [Spark connection profile](spark-profile), usually optional, is required for running Python models.
Contributor:

Suggested change
**Additional setup:** The `user` field in your [Spark connection profile](spark-profile), usually optional, is required for running Python models.
**Additional setup:** The `user` field in your [Spark connection profile](spark-profile) (which is usually optional) is required for running Python models.


<div warehouse="BigQuery">

The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and saves its final result back to BigQuery.
Contributor:

Suggested change
The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and saves its final result back to BigQuery.
The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, and save its final result back to BigQuery.


**Additional setup:**
- Create or use an existing [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets)
- Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster), with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)
Contributor:

Suggested change
- Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster), with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)
- Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)


## New and changed documentation

- **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time, on data warehouses that support Python runtimes.
Contributor:

Suggested change
- **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time, on data warehouses that support Python runtimes.
- **[Python models](building-models-with-python)** are natively supported in `dbt-core` for the first time on data warehouses that support Python runtimes.


A DataFrame is a way of storing and manipulating tabular data in Python. (It's also used in other languages popular for data processing, such as R and Scala.)

It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philly each year:
Contributor:

Consider using Philadelphia instead of Philly, to be less US-centric and less colloquial.

Contributor:

Suggested change
It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philly each year:
It's possible to string together a number of DataFrame transformations. For example, if `df` represents a DataFrame containing one row per person living in the Eastern United States over the last decade, we can calculate the number of people living in Philadelphia each year:
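For instance, in Pandas (one common DataFrame library), that chained transformation might look roughly like this; the toy data and column names are made up:

```python
import pandas as pd

# Toy stand-in for "one row per person per year in the Eastern US"
df = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2022],
    "city": ["Philadelphia", "New York", "Philadelphia", "Philadelphia", "Boston"],
})

# Filter to Philadelphia, then count people per year
philly_per_year = (
    df[df["city"] == "Philadelphia"]
    .groupby("year")
    .size()
    .reset_index(name="num_people")
)

print(philly_per_year)
```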

@dataders (Contributor) left a comment:

reading this felt like visiting my childhood home as an adult after many years and a new family lives there. Just when I thought I got out of the python data transformation world -- they pull me back in!

some considerations:

  1. "DataFrame" or "dataframe"? relatedly, do all references of the word need to link out to the glossary?
  2. I'm not sure what the WH-specific "Installing Packages" advice is regarding. As an antecedent step to defining model level requirements?
  3. I'm 92% sure that dbt.ref() and the like are better categorized as methods, not functions... but I don't care that much.

Comment on lines 28 to 29
- **`dbt`**: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG.
- **`session`**: A class representing the connection to the backing engine, which allows you to interact with the data platform. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument named `session`.
Contributor:

Does the end user ever have to modify these parameters to be something other than dbt and session? If not, why expose them to users at all?

@jtcohen6 (Collaborator, Author):

The end user needs to access classmethods of dbt, and some properties of session as well. We think it's better to document those and make it clearer where they're coming from, rather than pulling them out of the aether.

We could have gone for a "script"-style approach, where dbt and session appear to be available "globally":

```python
def some_fn(x):
    ...

some_df = dbt.ref("some_model")
final_df = some_df.apply(some_fn)

final_df
```

Where really we just take users' code, and turn it into:

```python
def model(dbt, session):
    def some_fn(x):
        ...

    some_df = dbt.ref("some_model")
    final_df = some_df.apply(some_fn)

    return final_df  # we add this return?
```

It's still an option we could consider, if people like that a lot better! (@ChenyuLInx)

Comment on lines +55 to +59
```python
# DataFrame representing an upstream model
upstream_model = dbt.ref("upstream_model_name")

# DataFrame representing an upstream source
upstream_source = dbt.source("upstream_source_name", "table_name")
```
Contributor:

Are these lazily-evaluated references? Or does dbt.source() read the reference into memory at time of definition?

@jtcohen6 (Collaborator, Author):

Lazy evaluation if you're using a DataFrame API worth its salt — so yes! As soon as you convert to Pandas, though, that will read everything into memory.
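A quick sketch of that distinction, assuming a Snowpark-style DataFrame where to_pandas() is the conversion method (the model and column names are hypothetical):

```python
def model(dbt, session):
    orders = dbt.ref("orders")                     # lazy: just a reference, nothing is read yet
    large = orders.filter(orders["amount"] > 100)  # still lazy: adds a step to the query plan
    pdf = large.to_pandas()                        # executes the query and pulls rows into memory
    pdf["amount_doubled"] = pdf["amount"] * 2      # from here on, everything is in-memory Pandas
    return pdf
```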


**Note:** Python models will be created and run as notebooks in your Databricks workspace. The notebook will be created within the personal workspace of the `user` running dbt, and named after the model it is executing. dbt will update the notebook on subsequent runs of the same model, but it will not delete it—so you can use the notebook for quicker interactive development. Just remember to update the code in your dbt model, based on your in-notebook iteration, before the next `dbt run`.

**Installing packages:** We recommend configuring packages on the interactive cluster which you will be using to run your Python models.
Contributor:

Are we recommending initially installing all the packages you might use before creating Python models, in lieu of specifying requirements as part of a Python model's specification? As in, it's not easy/possible/practical to configure model-specific Python environments, so just do it once manually outside of dbt?

@jtcohen6 (Collaborator, Author):

This totally varies based on the backend / implementation. On Databricks, it is possible to tell a cluster to run `pip install <some_package>` in a first notebook cell. That can take quite a bit of time, though. Their recommendation (and it makes sense to me) is to configure the needed packages on the cluster ahead of time.

By configuring packages in that way, it does mean that two models cannot have conflicting package version requirements. I think that's a recipe for disaster, anyway...

@runleonarun (Collaborator) left a comment:

Hey @jtcohen6! Just some suggestions. Mostly I'd like to see a bit of context and examples for the beta, so people who see this first want to try it out!

- [Weigh in on our developing best practices](https://github.com/dbt-labs/docs.getdbt.com/discussions/1811)
- Join the **#beta-feedback-python-models** channel in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community/)

Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Collaborator:

Suggested change
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, discuss dbt Python models with colleagues and friends.

@jtcohen6 (Collaborator, Author):

I was trying to be a little cute here. Not sure if I'm achieving it, might be better to cut altogether:

Suggested change
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions, leave thoughts in Slack, talk about it with colleagues and friends.
Below, you'll see sections entitled "❓ **Our questions**." We're working to develop our opinionated recommendations ahead of the final release this October—and you can help! Comment in the GitHub discussions; leave thoughts in Slack; bring up dbt + Python in casual conversation with colleagues and friends.

@jtcohen6 (Collaborator, Author) commented Jul 30, 2022:

Thanks for the thorough reviews @nghi-ly @dataders @runleonarun!!

I'm working through your feedback, haven't gotten all the way through, but pushing up some intermediate progress.


@dataders

"DataFrame" or "dataframe"?

Benoit says “DataFrame,” and I think that’s the most technically correct. It looks a bit stuffy to me, but I’m okay with the formality for now — it’s a new concept for many.

relatedly, do all references of the word need to link out to the glossary?

Good question! How does the docs team feel about this? Every instance, just the first one on a given page, something in between?

I'm 92% sure that dbt.ref() and the like are better categorized as methods, not functions... but don't care that much.

You’re 100% right. Deliciously, in our implementation, dbt.ref() is a classmethod… pointing to a function named ref()! At the risk of some more inconsistency, I’ve done my best to replace instances of “function” with “method,” when that’s the one I mean.


@runleonarun

Mostly I'd like to see a bit of context and examples for the beta, so people who see this first want to try it out!

At the risk of too much text—this page will definitely want to be split up before final release!—I've added a more exciting intro, as well as a placeholder example, before we get into the nitty-gritty implementation details.

@runleonarun (Collaborator) left a comment:

Looks great!

@jtcohen6 jtcohen6 merged commit d74c7f5 into current Aug 1, 2022
@jtcohen6 jtcohen6 deleted the feat/python-models-beta branch August 1, 2022 20:30