Merge branch 'current' into ly-docs-qs-core-codespace
nghi-ly committed Mar 30, 2023
2 parents a2c17d7 + df8971a commit 8074f8d
Showing 30 changed files with 1,066 additions and 240 deletions.
2 changes: 1 addition & 1 deletion website/blog/2021-11-15-november-21-product-email.md
@@ -60,7 +60,7 @@ Hear their take, and share your own by [registering here](https://coalesce.getdb

### Things to Listen To 🎧

- Julien Le Dem joined the [Analytics Engineer Podcast](https://roundup.getdbt.com/p/ep-10-why-data-lineage-matters-w?utm_campaign=Monthly%20Product%20Updates&utm_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-9SoWbfj9_ZRDew6i8p8yand1JSmLh7yfridIrLwO7bgHTUmnbKcRp3AEKCO8pOytotdxAo) to talk about how OS projects become standards, and why data lineage in particular is in need of an open standard. 
- Julien Le Dem joined the [Analytics Engineer Podcast](https://roundup.getdbt.com/p/ep-10-why-data-lineage-matters-w?utm_campaign=Monthly%20Product%20Updates&utm_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-9SoWbfj9_ZRDew6i8p8yand1JSmLh7yfridIrLwO7bgHTUmnbKcRp3AEKCO8pOytotdxAo) to talk about how OS projects become standards, and why <Term id="data-lineage" /> in particular is in need of an open standard. 

- [The rise of the Analytics Engineer](https://youtu.be/ixyzF4Dy9Us?utm_campaign=Monthly%20Product%20Updates&utm_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-9SoWbfj9_ZRDew6i8p8yand1JSmLh7yfridIrLwO7bgHTUmnbKcRp3AEKCO8pOytotdxAo): Anna, dbt Labs Director of Community, joined Thoughtspot to talk about the evolution of analytics engineering, or the emergence of the "full stack data analyst."

2 changes: 1 addition & 1 deletion website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md
@@ -113,7 +113,7 @@ A couple examples:

If your team’s dbt users are analysts rather than engineers, they still may need to be able to dig into the root cause of a failing dbt [source freshness test](/docs/build/sources).

Having your upstream extract + load jobs configured in Airflow means that analysts can pop open the Airflow UI to monitor for issues (as they would a GUI-based [ETL tool](https://www.getdbt.com/analytics-engineering/etl-tools-a-love-letter/)), rather than opening a ticket or bugging an engineer in Slack. The Airflow UI provides the common interface that analysts need to self-serve, up to the point of action needing to be taken.
Having your upstream extract + load jobs configured in Airflow means that analysts can pop open the Airflow UI to monitor for issues (as they would a GUI-based <Term id="etl">ETL tool</Term>), rather than opening a ticket or bugging an engineer in Slack. The Airflow UI provides the common interface that analysts need to self-serve, up to the point of action needing to be taken.

![airflow dashboard](/img/blog/airflow-dbt-dashboard.png "airflow dashboard")

2 changes: 1 addition & 1 deletion website/blog/2021-11-29-open-source-community-growth.md
@@ -46,7 +46,7 @@ Here are the tools I chose to use:

- dbt seeds data from offline sources and performs necessary transformations on data after it's been loaded into BigQuery.

- OpenLineage collects data lineage and performance metadata as models run, so I can identify issues and find bottlenecks. Also, to be the subject ecosystem for this study :)
- OpenLineage collects <Term id="data-lineage" /> and performance metadata as models run, so I can identify issues and find bottlenecks. Also, to be the subject ecosystem for this study :)

- Superset visualizes and analyzes results, creates dashboards, and helps me communicate with stakeholders.

2 changes: 2 additions & 0 deletions website/blog/2022-07-26-pre-commit-dbt.md
@@ -10,6 +10,8 @@ date: 2022-08-03
is_featured: true
---

*Editor's note — since the creation of this post, the package pre-commit-dbt's ownership has moved to another team and it has been renamed to [dbt-checkpoint](https://github.com/dbt-checkpoint/dbt-checkpoint). A redirect has been set up, meaning that the code example below will still work. It is also possible to replace `repo: https://github.com/offbi/pre-commit-dbt` with `repo: https://github.com/dbt-checkpoint/dbt-checkpoint` in your `.pre-commit-config.yaml` file.*
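*For instance, the relevant entry in `.pre-commit-config.yaml` might look like the sketch below once you point it at the new repository — the `rev` tag and hook `id` are illustrative, so substitute the release and hooks you actually use.*

```yaml
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.0.0  # illustrative tag — pin the release you actually use
    hooks:
      - id: check-model-has-tests  # example hook; swap in the dbt-checkpoint hooks you rely on
```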

At dbt Labs, we have [best practices](https://docs.getdbt.com/docs/guides/best-practices) we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least `unique` and `not_null` tests on their primary key. But how can we enforce rules like this?

That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions. They might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models do not have any tests defined and aren't enforcing your conventions.
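For reference, the primary key convention mentioned above is declared in a model's properties file with just a few lines of YAML — a minimal sketch, with illustrative model and column names:

```yaml
version: 2

models:
  - name: customers          # illustrative model name
    columns:
      - name: customer_id    # the model's primary key
        tests:
          - unique
          - not_null
```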
2 changes: 1 addition & 1 deletion website/blog/2022-08-31-august-product-update.md
@@ -54,7 +54,7 @@ You’ll hear more in [Tristan’s keynote](https://coalesce.getdbt.com/agenda/k

I just discovered the treasure trove of excellent resources from dbt Labs consulting partners, and want to start sharing more here. Here’s a few you might have missed over the summer:

- **Reduce ETL costs:** I’ve only just seen [this blog](https://www.mighty.digital/blog/how-dbt-helped-us-reduce-our-etl-costs-significantly) from Mighty Digital, but found it to be a super practical (and concise) introductory guide to rethinking your ETL pipeline with dbt.
- **Reduce ETL costs:** I’ve only just seen [this blog](https://www.mighty.digital/blog/how-dbt-helped-us-reduce-our-etl-costs-significantly) from Mighty Digital, but found it to be a super practical (and concise) introductory guide to rethinking your <Term id="etl">ETL pipeline</Term> with dbt.
- **Explore data:** [Part two of a series on exploring data](https://vivanti.com/2022/07/28/exploring-data-with-dbt-part-2-extracting/) brought to you by Vivanti. This post focuses on working with <Term id="json" /> objects in dbt, but I also recommend the preceding post if you want to see how they spun up their stack.
- **Track historical changes:** [](https://blog.montrealanalytics.com/using-dbt-snapshots-with-dev-prod-environments-e5ed63b2c343)Snapshots are a pretty handy feature for tracking changes in dbt, but they’re often overlooked during initial onboarding. [Montreal Analytics explains how to set them up](https://blog.montrealanalytics.com/using-dbt-snapshots-with-dev-prod-environments-e5ed63b2c343) in dev/prod environments
- **Learn dbt:** Have some new faces on the data team that might need an introduction to dbt? Our friends at GoDataDriven are hosting a [virtual dbt Learn Sept 12-14](https://www.tickettailor.com/events/dbtlabs/752537).
2 changes: 1 addition & 1 deletion website/blog/2022-10-24-demystifying-event-streams.md
@@ -28,7 +28,7 @@ Under the hood, the Merit platform consists of a series of microservices. Each o

![](/img/blog/2022-10-24-demystifying-event-streams/merit-platform.png)

In the past we relied upon an ETL tool (Stitch) to pull data out of microservice databases and into Snowflake. This data would become the main dbt sources used by our report models in BI.
In the past we relied upon an <Term id="etl" /> tool (Stitch) to pull data out of microservice databases and into Snowflake. This data would become the main dbt sources used by our report models in BI.

![](/img/blog/2022-10-24-demystifying-event-streams/merit-platform-stitch.png)

2 changes: 1 addition & 1 deletion website/blog/2022-11-22-move-spreadsheets-to-your-dwh.md
@@ -53,7 +53,7 @@ A big benefit of using seeds is that your file will be checked into source contr

## ETL tools

An obvious choice if you have data to load into your warehouse would be your existing [ETL tool](https://www.getdbt.com/analytics-engineering/etl-tools-a-love-letter/) such as Fivetran or Stitch, which I'll dive into in this section. Below is a summary table highlighting the core benefits and drawbacks of certain ETL tooling options for getting spreadsheet data in your data warehouse.
An obvious choice if you have data to load into your warehouse would be your existing [ETL tool](https://www.getdbt.com/analytics-engineering/etl-tools-a-love-letter/) such as Fivetran or Stitch, which I'll dive into in this section. Below is a summary table highlighting the core benefits and drawbacks of certain <Term id="etl" /> tooling options for getting spreadsheet data in your data warehouse.

### Summary table

2 changes: 1 addition & 1 deletion website/blog/2023-01-17-grouping-data-tests.md
@@ -21,7 +21,7 @@ Between these two extremes lies a gap where human intelligence goes. Analytics e

## Grouped checks

Group-based checks can be important for fully articulating good "business rules" against which to assess data quality. For example, groups could reflect either computationally-relevant dimensions of the <Term id="etl"/> process (e.g. data loaded from different sources) or semantically-relevant dimensions of the real-world process that our data captures (e.g. repeated measures pertaining to many individual customers, patients, product lines, etc.) Such checks can make existing tests more rigorous while others are only expressible at the grouped level.
Group-based checks can be important for fully articulating good "business rules" against which to assess data quality. For example, groups could reflect either computationally-relevant dimensions of the <Term id="etl" /> process (e.g. data loaded from different sources) or semantically-relevant dimensions of the real-world process that our data captures (e.g. repeated measures pertaining to many individual customers, patients, product lines, etc.) Such checks can make existing tests more rigorous while others are only expressible at the grouped level.

### Only expressible
Some types of checks can only be expressed by group. For example, in a dataset containing train schedules across a transit system, an `ARRIVAL_TIME` field might not be unique; however, it would (hopefully) always be unique for a specific `TRACK` and `STATION`.
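That kind of grouped uniqueness is commonly written with the `dbt_utils.unique_combination_of_columns` test — a sketch, assuming the dbt-utils package is installed and a `train_schedules` model with lowercase column names:

```yaml
version: 2

models:
  - name: train_schedules    # illustrative model name
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - track
            - station
            - arrival_time
```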
@@ -0,0 +1,23 @@
---
title: "Connect Starburst/Trino"
description: "Configure Starburst/Trino connection."
sidebar_label: "Connect Starburst/Trino"
---

The following are the required fields for setting up a connection with a Starburst Enterprise, Starburst Galaxy, or Trino cluster. Unless specified otherwise, "cluster" refers to any of these products' clusters.

| Field | Description | Examples |
| --- | --- | --- |
| **Host** | The hostname of your cluster. Don't include the HTTP protocol prefix. | `mycluster.mydomain.com` |
| **Port** | The port to connect to your cluster. By default, it's 443 for TLS-enabled clusters. | `443` |
| **User** | The username (of the account) to log in to your cluster. When connecting to Starburst Galaxy clusters, you must include the role of the user as a suffix to the username.<br/><br/> | Format for Starburst Enterprise or Trino: <br/> <ul><li>`user.name`</li><li>`[email protected]`</li></ul><br/>Format for Starburst Galaxy:<br/> <ul><li>`[email protected]/role`</li></ul> |
| **Password** | The user's password. | |
| **Database** | The name of a catalog in your cluster. | `my_postgres_catalog` |
| **Schema** | The name of a schema in your cluster that exists within the specified catalog.  | `my_schema` |


## Roles in Starburst Enterprise
When connecting to a Starburst Enterprise cluster with built-in access controls enabled, you won't be able to provide the role as a suffix to the username, so the default role for the provided username will be used instead.

## Schemas and databases
When selecting the database (catalog) and the schema, make sure the user has read and write access to both. This selection does not limit your ability to query the catalog; instead, it serves as the default location where tables and views are materialized. This _default_ can be changed later from within your dbt project.
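For comparison, dbt Core users supply these same fields through a `profiles.yml` target for the `dbt-trino` adapter. A rough sketch with placeholder values — the authentication `method` in particular is an assumption and depends on how your cluster is configured:

```yaml
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: trino
      method: ldap                     # assumption — match your cluster's authentication
      host: mycluster.mydomain.com     # no HTTP protocol prefix
      port: 443
      user: [email protected]/role     # Starburst Galaxy format; omit /role for Starburst Enterprise or Trino
      password: my_password            # placeholder
      database: my_postgres_catalog    # catalog name
      schema: my_schema
      threads: 4
```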
97 changes: 96 additions & 1 deletion website/docs/docs/quickstarts/dbt-cloud/bigquery-qs.md
@@ -184,7 +184,102 @@ Later, you can connect your business intelligence (BI) tools to these views and

## Build models on top of other models

<Snippet src="quickstarts/build-models-atop-other-models" />
<Snippet src="quickstarts/intro-build-models-atop-other-models" />

1. Create a new SQL file, `models/stg_customers.sql`, with the SQL from the `customers` CTE in our original query.
2. Create a second new SQL file, `models/stg_orders.sql`, with the SQL from the `orders` CTE in our original query.

<File name='models/stg_customers.sql'>

```sql
select
    id as customer_id,
    first_name,
    last_name
from `dbt-tutorial`.jaffle_shop.customers
```

</File>

<File name='models/stg_orders.sql'>

```sql
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from `dbt-tutorial`.jaffle_shop.orders
```

</File>

3. Edit the SQL in your `models/customers.sql` file as follows:

<File name='models/customers.sql'>

```sql
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)

select * from final
```

</File>

4. Execute `dbt run`.

This time, when you performed `dbt run`, dbt created separate views/tables for `stg_customers`, `stg_orders`, and `customers`. dbt inferred the order in which to run these models: because `customers` depends on `stg_customers` and `stg_orders`, dbt builds `customers` last. You do not need to explicitly define these dependencies.
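If you want to see how dbt wired up those dependencies, peek at the compiled SQL under `target/compiled/`: each `{{ ref(...) }}` is replaced with the relation dbt built for that model. A sketch of one resolved reference on BigQuery — the project and dataset names depend on your own connection settings:

```sql
-- Illustrative compiled output: {{ ref('stg_customers') }} resolves to the
-- project and dataset configured for your dbt connection, not these names.
select * from `your-gcp-project`.`your_dataset`.`stg_customers`
```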

#### FAQs {#faq-2}

<FAQ src="Runs/run-one-model" />
<FAQ src="Models/unique-model-names" />
<FAQ src="Project/structure-a-project" alt_header="As I create more models, how should I keep my project organized? What should I name my models?" />


<Snippet src="quickstarts/test-and-document-your-project" />

97 changes: 96 additions & 1 deletion website/docs/docs/quickstarts/dbt-cloud/databricks-qs.md
@@ -263,7 +263,102 @@ Later, you can connect your business intelligence (BI) tools to these views and

## Build models on top of other models

<Snippet src="quickstarts/build-models-atop-other-models" />
<Snippet src="quickstarts/intro-build-models-atop-other-models" />

1. Create a new SQL file, `models/stg_customers.sql`, with the SQL from the `customers` CTE in our original query.
2. Create a second new SQL file, `models/stg_orders.sql`, with the SQL from the `orders` CTE in our original query.

<File name='models/stg_customers.sql'>

```sql
select
    id as customer_id,
    first_name,
    last_name
from jaffle_shop_customers
```

</File>

<File name='models/stg_orders.sql'>

```sql
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from jaffle_shop_orders
```

</File>

3. Edit the SQL in your `models/customers.sql` file as follows:

<File name='models/customers.sql'>

```sql
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)

select * from final
```

</File>

4. Execute `dbt run`.

This time, when you performed `dbt run`, dbt created separate views/tables for `stg_customers`, `stg_orders`, and `customers`. dbt inferred the order in which to run these models: because `customers` depends on `stg_customers` and `stg_orders`, dbt builds `customers` last. You do not need to explicitly define these dependencies.
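If you want to confirm how dbt wired up those dependencies, check the compiled SQL under `target/compiled/`: each `{{ ref(...) }}` is replaced with the table or view dbt built for that model. A sketch of one resolved reference on Databricks — the schema name depends on your own connection settings:

```sql
-- Illustrative compiled output: {{ ref('stg_customers') }} resolves to the
-- schema configured for your dbt connection, not this name.
select * from your_schema.stg_customers
```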

#### FAQs {#faq-2}

<FAQ src="Runs/run-one-model" />
<FAQ src="Models/unique-model-names" />
<FAQ src="Project/structure-a-project" alt_header="As I create more models, how should I keep my project organized? What should I name my models?" />


<Snippet src="quickstarts/test-and-document-your-project" />

