Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modernize legacy insert by period materialization #410

Conversation

HorvathDanielMarton
Copy link

@HorvathDanielMarton HorvathDanielMarton commented Sep 3, 2021

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

The goal of this change is to get the insert_by_period materialization compatible with Snowflake while using the currently latest dbt (0.21.0b2) and keeping the same functionality.

The insert_by_period materialization was introduced in this discourse post and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are just so large and/or complex that a simple --full-refresh is not adequate because the build will be incredibly slow, inefficient, and can even get stuck.

Unfortunately, the currently available version does not support Snowflake and it’s advised to refactor it by using the refreshed code for the incremental materialization as a good starting point. As a result, this change resolves dbt-labs/dbt-labs-experimental-features#32.

The materialization itself was completely built from the current incremental materialization and the modifications done to it was inspired by the original insert_by_period.

The macros that were originally in the insert_by_period.sql was moved to a helpers.sql file where the Snowflake version of get_period_boundaries(), and get_period_sql() are introduced. Also, a new check_for_period_filter() macro has been added.

The materialization seems fully functional and so far works as expected, nevertheless, I'm still testing it against production models.

Ideas

  1. Currently, in the config we can only set a period of day, week, month, etc... as a unit. However, we might want to build our models fortnightly (ie. 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of periods as a chunk. The fortnightly example could look like this:
{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'week',
        interval_size = 2,
        start_date = "2021-01-01",
        stop_date = "2021-08-31"
    )
}}
  1. In some of our more complex models with multiple sources, we might need to offset the __PERIOD_FILTER__ by some time (for example a month lookback time, but it should be configurable). Currently, this materialization doesn't support this, but that is surely a shortcoming for some of our models where insert_by_period would come in handy because of their massive size & complexity. Also, it seems like our team is not the only ones who bumped into this: Feature: Add optional lookback_interval param to insert_by_period materialization #394

A weird behavior

The materialization can produce a weird behavior if it’s executed on an already existing model because the “last incremental run” may have a very short overlap with the specified date range. Let’s take the following config as an example:

{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'day',
        start_date = "2021-08-28",
        stop_date = "2021-08-31"
    )
}}

This is going to build a model with events 2021-08-28 < created_at < 2021-08-31, so it will have 3 days, thus in 3 steps it will be built, given it’s a fresh start and the run is uninterrupted. After building the 1st day, if the run is terminated, the next run will have 3 steps, again. It's not unlikely, that the last step will only insert a handful of events (many magnitudes smaller, than the previous ones). Like on this CLI output:

Screen Shot 2021-08-31 at 17 05 03

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause - which is basically this:

where created_at > '2021-08-30 23:59:59.852'::timestamp 
   and  created_at <= '2021-08-31 23:59:59.852'::timestamp -- this isn't relevant at all
   and  created_at <  '2021-08-30 23:59:59.999'::timestamp)

So the last iteration is only looking for events that happened between 2021-08-30 23:59:59.852 and 2021-08-30 23:59:59.999, which is far from even being a second. It would be nice to deal with this, so it won't cause suspicion that something is buggy.

This is most conspicuous for large models with a lot of events.

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have "dispatched" any new macro(s) so non-core adapters can also use them (e.g. the star() source)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@HorvathDanielMarton HorvathDanielMarton changed the title Modernize legacy insert by period macro Modernize legacy insert by period materialization Sep 6, 2021
select
coalesce(max({{timestamp_field}}), '{{start_date}}')::timestamp as start_timestamp,
coalesce(
{{dbt_utils.dateadd('millisecond',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nanosecond is more accurate, i.e. timestamp(9), millisecond only covers timestamp(3).


select
start_timestamp,
stop_timestamp,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downstream, load_result truncates these timestamp(9) to timestamp(6), which breaks timestamp comparisons downstream, thus inserting duplicates if incremental is rerun after a completed load.

The fix is to cast both start_timestamp and stop_timestamp here to varchar, so that load_result gets a string and not a timestamp.

@joellabes
Copy link
Contributor

Whew! Thanks for all your work on this @HorvathDanielMarton.

I just spent some time talking about this very macro in the dbt Community Slack as well.

In short, we (the dbt Labs Developer Experience team) are very interested in the future of this materialization because it does some important work. What we're not certain about yet is exactly how it fits into the wider dbt ecosystem:

  • Should it be the fifth native materialization?
  • An additional set of config on the existing incremental materialization?
  • Be expanded to cover all four core adapters in scope of this package (Snowflake, BigQuery, Redshift, Postgres)?
  • Be extracted from this package and remain a Redshift-only oddity? Unlikely, given your work here.

I promise this PR won't languish forever, but please bear with us while we work out where we're headed 🧭

@joellabes
Copy link
Contributor

Worth noting: anyone else who comes along in the meantime is of course welcome to make a copy of the code in this PR and apply it to their own project!

@HorvathDanielMarton
Copy link
Author

Worth noting: anyone else who comes along in the meantime is of course welcome to make a copy of the code in this PR and apply it to their own project!

@joellabes Thank you! To make it easier for the curious ones, just update your packages.yml to include the branch as follows

packages:
  - git: "https://github.com/sspinc/dbt-utils.git"
    revision: fix-legacy-insert-by-period-macro

and execute dbt deps. After that you’ll be ready to play around with this materialization.


{%- set period_filter -%}
("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
"{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' + interval '1 {{period}}' and

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line will not work properly if period = "month"
example:

select  '2018-12-31 00:00:00'::timestamp + interval '4 month' + interval '1 month', 
        '2018-12-31 00:00:00'::timestamp + interval '5 month'

2019-05-30 00:00:00 & 2019-05-31 00:00:00 => one day will not be processed.
Redshift can deal with the brackets, like '{{start_timestamp}}'::timestamp + (interval '{{offset}} {{period}}' + interval '1 {{period}}'), Snowflake will not like it.
Suggestion (that I cannot test right away): calculate offset_plus_one as offset + 1 and use it in setting the filter, like

  {%- set period_filter -%}
    ("{{timestamp_field}}" >  '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
     "{{timestamp_field}}" <  '{{stop_timestamp}}'::timestamp)
  {%- endset -%}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this block and it works ok. Feel free to use it.

{% macro get_period_sql(target_cols_csv, sql, timestamp_field, period, start_timestamp, stop_timestamp, offset) -%}

  {%- set offset_plus_one = offset + 1 -%}

  {%- set period_filter -%}
    ("{{timestamp_field}}" >  '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
     "{{timestamp_field}}" <  '{{stop_timestamp}}'::timestamp)
  {%- endset -%}

  {%- set filtered_sql = sql | replace("__PERIOD_FILTER__", period_filter) -%}

  select
    {{target_cols_csv}}
  from (
    {{filtered_sql}}
  )

{%- endmacro %}

@laconc laconc added enhancement New feature or request and removed enhancement New feature or request labels Dec 2, 2021
@joellabes joellabes mentioned this pull request Jan 23, 2022
12 tasks
@joellabes
Copy link
Contributor

Folks on this thread will probably be interested in this Discussion: #487

I'd appreciate your input!

@smitsrr
Copy link

smitsrr commented Sep 19, 2022

@joellabes
Copy link
Contributor

Unsurprisingly, I can't transfer a PR from one repo to another, but if you want to reopen the PR over there I'll be happy to merge it 👍

@joellabes joellabes closed this Sep 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Modernize insert_by_period materialization
6 participants