
Update 2022-04-19-complex-deduplication.md
lbenezra-FA authored May 12, 2022
1 parent e21f67f commit f09f1fc
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions website/blog/2022-04-19-complex-deduplication.md
@@ -15,7 +15,7 @@ Let’s get rid of these dupes and send you on your way to do the rest of the *s

<!--truncate-->

You’re here because your duplicates are *special* duplicates. These special dupes are not the basic ones that have same exact values in every column and duplicate primary keys that can be easily fixed by haphazardly throwing in a `distinct` (yeah that’s right, I called using `distinct` haphazard!). These are *partial* duplicates, meaning your entity of concern's primary key is not unique *on purpose* (or perhaps you're just dealing with some less than ideal data syncing). You may be capturing historical, type-two slowly changing dimensional data, or incrementally building a table with an append-only strategy, because you actually want to capture some change over time for the entity your recording (or, as mentioned, your loader may just be appending data indiscriminately on a schedule without much care for your time and sanity). Whatever has brought you here, you know have a table where the *grain* is not your entity’s primary key, but instead the entity’s primary key + the column values that you’re tracking. Confused? Let’s look at an example.
You’re here because your duplicates are *special* duplicates. These special dupes are not the basic ones that have the same exact values in every column and duplicate primary keys that can be easily fixed by haphazardly throwing in a `distinct` (yeah that’s right, I called using `distinct` haphazard!). These are *partial* duplicates, meaning your entity of concern's primary key is not unique *on purpose* (or perhaps you're just dealing with some less-than-ideal data syncing). You may be capturing historical, type-two slowly changing dimensional data, or incrementally building a table with an append-only strategy, because you actually want to capture some change over time for the entity you're recording. (Or, as mentioned, your loader may just be appending data indiscriminately on a schedule without much care for your time and sanity.) Whatever has brought you here, you now have a table where the *grain* is not your entity’s primary key, but instead the entity’s primary key + the column values that you’re tracking. Confused? Let’s look at an example.

Here’s your raw table:

@@ -68,7 +68,7 @@ Here’s a brief overview of the steps we’ll take:

> Step 1 walks you through how to build a hashed entity id from column values using a macro. You’ll use this key in Step 2 to find the true duplicates and clean them out.
The idea in this step is to enable checking for duplicates in the data by attaching a unique key to the hashed values of the columns that make up the entity grain you want to track. It’s important to note here that the *`[dbt_utils.surrogate_key](https://github.com/dbt-labs/dbt-utils/blob/0.8.2/macros/sql/surrogate_key.sql)`* will not create a unique key yet! Instead, it will create a key that will be the same as the key of another row, as long as the column values we’ve selected for our entity grain are the same. *This is intentional and critical!*  The specific non-uniqueness is how we’ll catch our sneaky duplicates.
The idea in this step is to enable checking for duplicates in the data by attaching a unique key to the hashed values of the columns that make up the entity grain you want to track. It’s important to note here that the *[dbt_utils.surrogate_key](https://github.com/dbt-labs/dbt-utils/blob/0.8.2/macros/sql/surrogate_key.sql)* macro will not create a unique key yet! Instead, it will create a key that will be the same as the key of another row, as long as the column values we’ve selected for our entity grain are the same. *This is intentional and critical!* This specific non-uniqueness is how we’ll catch our sneaky duplicates.

In our example, you can see that the `surrogate_key` function builds the same `grain_id` for the two rows we know are duplicates, rows 2 and 3, with row 3 being the most recent row.

@@ -79,9 +79,9 @@ In our example, you can see that the `surrogate_key` function builds the same `g
| c8b91b84808caaf5870d707866b59c | 1_submitted | 1 | cool | submitted | 2022-03-03 |
| 283ff22afb622dcc6a7da373ae1a0fb | 2_pending | 2 | cool | pending | 2022-02-27 |
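
If it helps to see that behavior in isolation, here’s a minimal sketch of a `dbt_utils.surrogate_key` call that would produce a `grain_id` like the ones above (the column names are assumptions based on the example table, and the `source` reference is illustrative):

```sql
select
    *,
    -- hash only the columns that define the entity grain we care about;
    -- duplicate rows share these values, so they get the same grain_id
    {{ dbt_utils.surrogate_key(['entity_id', 'important_status']) }} as grain_id
from {{ source('name', 'table_name') }}
```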

Remember, it’s important to only look for duplicate rows for the values that indicate a *true* difference between the rows of data the data; e.g., in type-two data, `updated_at_date` doesn’t mean that the other columns that we’ve decided we’re concerned with have changed since the previous time it was loaded, so that column doesn’t necessarily indicate a true difference between rows (though it usually indicates that something has changed, but that change may be outside our scope of concern in this case). But a change in `important_status`, for our purposes, would indicate a change in the data that you’d probably want to track. If you aren’t applying this technique to type-two data, but instead wanting to remove everything except the most recent data, you may have just a few columns that indicate a true difference between rows (an id at the right grain, and or an id at a larger grain + timestamp).
Remember, it’s important to only look for duplicate rows for the values that indicate a *true* difference between the rows of data; e.g., in type-two data, `updated_at_date` doesn’t mean that the other columns that we’ve decided we’re concerned with have changed since the previous time it was loaded, so that column doesn’t necessarily indicate a true difference between rows (it usually indicates that *something* has changed, but that change may be outside our scope of concern in this case). But a change in `important_status`, for our purposes, would indicate a change in the data that you’d probably want to track. If you aren’t applying this technique to type-two data, but instead want to remove everything except the most recent data, you may have just a few columns that indicate a true difference between rows (an id at the right grain, and/or an id at a larger grain + timestamp).

To build our grain_id key, we use the pure gold of the *[dbt_utils package](https://hub.getdbt.com/dbt-labs/dbt_utils/0.8.0/)*. If you’re unsure of what this package is, stop reading right now and make sure this is installed in your dbt project. It will bring joy to your life and ease to your struggling!
To build our `grain_id` key, we use the pure gold of the *[dbt_utils package](https://hub.getdbt.com/dbt-labs/dbt_utils/0.8.0/)*. If you’re unsure of what this package is, stop reading right now and make sure it’s installed in your dbt project. It will bring joy to your life and ease to your struggles!

`dbt_utils.star` is the *star* [Editor’s note: 🤦‍♀️] of the show here; it allows you to grab all the columns *except* the ones you list. If you only have a couple of columns, it may be easier just to list them for the `cols` variable instead of using the `star` function.

@@ -101,7 +101,7 @@ To build our grain_id key, we use the pure gold of the *[dbt_utils package](http

```sql
{% endmacro %}
```
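
As a rough sketch, a macro along these lines could look like the following; the signature, the `exclude` argument, and the `split` step are assumptions for illustration rather than the post’s exact code:

```sql
{% macro build_key_from_columns(relation, exclude=[]) %}

    {#- grab every column in the relation except the ones we want to ignore -#}
    {%- set cols = dbt_utils.star(from=relation, except=exclude) -%}

    {#- star returns a comma-separated string of column names,
        so split it back into a list for surrogate_key -#}
    {%- set col_list = cols.split(',') -%}

    {#- hash the combined column values: rows with identical values
        in these columns get identical keys -#}
    {{ dbt_utils.surrogate_key(col_list) }}

{% endmacro %}
```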

For each row of data, this macro grabs each value from all the columns, except the columns we specify in the exclude list. Then it creates a hash-key using `dbt_utils.surrogate_key` that will reflect the uniqueness of the column values (i.e. if the combination of values is *not* unique, the `surrogate_key` will be the same, which is what we want to capture). The columns in the exclude list are values that we want to ignore when looking for a change in the data table (like `unimportant_value,`a column whose fluctuations we don’t want to indicate a real difference between rows). Call the macro above to create a column in your base or staging layer, and call it `grain_id`, so we can filter out the changes where **`count(grain_id) > 1`:
For each row of data, this macro grabs each value from all the columns, except the columns we specify in the exclude list. Then it creates a hash-key using `dbt_utils.surrogate_key` that will reflect the uniqueness of the column values (i.e., if the combination of values is *not* unique, the `surrogate_key` will be the same, which is what we want to capture). The columns in the exclude list are values that we want to ignore when looking for a change in the data table (like `unimportant_value`, a column whose fluctuations we don’t want to indicate a real difference between rows). Call the macro above to create a column in your base or staging layer, and call it `grain_id`, so we can filter out the changes where `count(grain_id) > 1`:

```sql
{{ build_key_from_columns(source('name', 'table_name')) }} as grain_id,
```

@@ -111,7 +111,7 @@ For each row of data, this macro grabs each value from all the columns, except t
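
If you want to eyeball how many partial duplicates you’re dealing with before filtering, a throwaway query like this works (the `ref` name is hypothetical, standing in for whichever model has `grain_id` attached):

```sql
select
    grain_id,
    count(*) as occurrences
from {{ ref('stg_table_name') }}  -- hypothetical staging model with grain_id attached
group by grain_id
having count(*) > 1
order by occurrences desc
```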

> Step 2 walks you through how to filter out duplicates based on your new `grain_id` from Step 1.
To get rid of dupes, find the previous `grain_id` (remember this is a hash-key of all the values in a row), compare it to the most recent `grain_id` **as ordered by a reliable timestamp. If they are not equal, then mark it as a real difference in the data, meaning you’ll keep it! Notice that we `coalesce` our window function with a string `‘first_record’`, so that an entity’s first record, which naturally has no `previous_grain_id`, won’t have a `null` in that column and throw off all our downstream comparisons.
To get rid of dupes, find the previous `grain_id` (remember this is a hash-key of all the values in a row) and compare it to the most recent `grain_id` as ordered by a reliable timestamp. If they are not equal, then mark it as a real difference in the data, meaning you’ll keep it! Notice that we `coalesce` our window function with a string `'first_record'`, so that an entity’s first record, which naturally has no `previous_grain_id`, won’t have a `null` in that column and throw off all our downstream comparisons.

```sql
mark_real_diffs as (
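    -- NOTE: everything from here down is a sketch of how this CTE could be
    -- finished, based on the description above; the column names entity_id and
    -- updated_at_date, and the upstream CTE name hashed_table, are assumptions.

    select
        *,
        coalesce(
            lag(grain_id) over (
                partition by entity_id
                order by updated_at_date
            ),
            'first_record'
        ) as previous_grain_id

    from hashed_table

),

filter_real_diffs as (

    -- keep only the rows where the hashed grain actually changed
    select *
    from mark_real_diffs
    where grain_id != previous_grain_id

)

select * from filter_real_diffs
```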
