In 2023, Databricks introduced system tables to provide insights into platform usage. While Databricks offers helpful demos, I encountered areas for improvement during implementation.
Addressing Demo Limitations
For instance, I noticed the predicted daily cost occasionally dropped below zero. However appealing a negative bill might sound, it is not something I’d present to stakeholders. Additionally, the demo lacked dashboard filters, which prevented users from focusing on specific data.
Boosting Security and Usability
With the public preview of row-level security (RLS), I integrated this feature to ensure data privacy in a data mesh architecture. This addition enhances security and allows for more tailored data access.
This post details the extensions and workarounds I used to create a secure, user-friendly dashboard with a predictive model adapted to the data.
Prerequisites
- You should have at least one Unity Catalog-enabled workspace.
- An account admin should enable system tables through the Unity Catalog REST API.
- More information on enabling system tables can be found in the official Databricks documentation; an illustrative sketch of the API call follows this list.
- To query tables with row filters (for RLS), use a cluster with runtime version 12.2 or higher.
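The exact call depends on your environment, but an illustrative sketch of enabling a system schema through the Unity Catalog REST API could look like this (the host, token, metastore ID and schema name below are placeholders; verify the endpoint against the official documentation):
import requests
# Placeholders only; the systemschemas endpoint follows the pattern documented by Databricks for enabling system tables.
host = "https://<workspace-url>"
token = "<personal-access-token>"
metastore_id = "<metastore-id>"
schema_name = "compute"  # the system schema to enable
response = requests.put(
    f"{host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/{schema_name}",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()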
Coding Journey
Demo by Databricks
Here you can find the Forecast Billing Demo by Databricks.
Code and Visualisations
Import Necessary SQL Functions
from pyspark.sql.functions import sum, col, to_date, month, year, concat_ws, current_date, date_sub, when, lit, add_months, lower, regexp_extract, format_number, avg, count
Create a Temporary View
I preferred to use a temporary view, but for this notebook the same functionality could also be achieved with a dataframe.
In a federated architecture, it is highly recommended to use tags, for example, to facilitate cross-charging. In the code, I added ‘department’ as an illustrative tag. You can create your own key-value pairs that align with your business.
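To illustrate (the cluster values below are placeholders, not part of the demo), such a tag can be attached through the custom_tags field when a cluster is defined, for example in the Clusters API payload:
# Hypothetical cluster payload fragment; only custom_tags matters here for cross-charging.
cluster_spec = {
    "cluster_name": "analytics-etl",      # placeholder name
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime
    "node_type_id": "Standard_DS3_v2",    # cloud-specific; adjust to your provider
    "num_workers": 2,
    "custom_tags": {
        "department": "analytics"         # later visible in the tags column of the clusters system table
    },
}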
Moreover, I constructed a view layer on top of both the billing.usage and the compute.clusters system tables to enforce row-level security (a sketch of how such a secured view could be defined follows this list). More information on the views:
- vw_clusters
I created this view to enforce row-level security on the compute.clusters system table. Row filter functions cannot be applied directly to system tables, so this view is the only extra layer; the columns are identical. The table and view hold information about the clusters within the Databricks account, such as cluster ID, name, node types, scaling and other settings.
- vw_usage
I created vw_usage for the same reason: to apply row-level security. It exposes all the columns from the billing.usage system table, an hourly fact log of Databricks consumption by SKU name and workspace. Other important columns are usage unit, usage quantity and usage metadata.
- dim_workspace
To be able to refer to workspace names (instead of just IDs) in Databricks monitoring dashboards, my colleague Chris created a mapping between the two. The columns in this table are workspace ID and workspace name.
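The production view definitions are specific to our environment, but a minimal sketch of a secured view on top of a system table, assuming illustrative group names such as 'finops_admins' and a 'dept_' group prefix, could look like this:
# Illustrative sketch only: is_account_group_member() is a built-in Databricks SQL function,
# but the group names and the department-based rule are assumptions, not the real definitions.
spark.sql("""
CREATE OR REPLACE VIEW support.finops.vw_clusters AS
SELECT *
FROM system.compute.clusters
WHERE is_account_group_member('finops_admins')                    -- admins see every cluster
   OR is_account_group_member(concat('dept_', tags.department))   -- others only see their own department
""")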
I added additional explanatory comments in the script.
%sql
CREATE OR REPLACE TEMPORARY VIEW vw_cost
AS
WITH cte_cluster_tags AS (
SELECT DISTINCT
cluster_id
,tags.department as department
,ROW_NUMBER() OVER (
PARTITION BY cluster_id
ORDER BY
create_time DESC
) AS rn -- This numbers rows based on the specified partition. 1 represents the newest tag value.
FROM
support.finops.vw_clusters -- View for secured access
)
SELECT
coalesce(ws.workspace_name, ws.workspace_id) as workspace
,u.usage_date
,u.sku_name
,cast(u.usage_quantity AS DOUBLE) AS dbus
,cast(lp.pricing.default * usage_quantity AS DOUBLE) AS cost_at_list_price
,c.department
FROM
support.finops.vw_usage u
INNER JOIN system.billing.list_prices lp
ON u.cloud = lp.cloud
AND u.sku_name = lp.sku_name
AND u.usage_start_time BETWEEN lp.price_start_time AND COALESCE(lp.price_end_time, '2099-12-31')
LEFT JOIN cte_cluster_tags c
ON u.usage_metadata.cluster_id = c.cluster_id
AND c.rn = 1 -- To make sure that incorrect duplicates are avoided if they ever exist.
LEFT JOIN support.finops.dim_workspace ws
ON u.workspace_id = ws.workspace_id
WHERE
u.usage_unit = 'DBU'
AND u.usage_date BETWEEN current_date() - 93 -- 3 months
AND current_date() -1 -- Not including today, because day is not finished
ORDER BY
u.usage_date
,workspace
,u.sku_name
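Once the cell has run, the temporary view can be queried from Python like any other table; a quick, illustrative sanity check:
display(spark.table("vw_cost").limit(10))  # vw_cost is the temporary view created above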
Prepare Functions for Multiselect Widgets
As the built-in multiselect widget lacks an easy 'select all' option, I created custom functions that add a single 'ALL' entry to the choices. Figure 1 provides an example of a widget with an 'ALL' option.
The first function creates a distinct list of values from a certain column. This could be parameterised to a larger extent.
def cost_col_dist_vals(column_name):
distinct_values_df = spark.sql(f"select distinct coalesce({column_name}, 'NULL') as {column_name} from vw_cost order by {column_name}") # Query or view/table name could be a parameter
distinct_values = [row[column_name] for row in distinct_values_df.select(column_name).collect()] # convert df column to list
return distinct_values
def create_multi_widget(distinct_values, label, default_choice='ALL'):
if default_choice == 'ALL':
distinct_values.insert(0,'ALL') # insert ALL option at the top
multi_widget = dbutils.widgets.multiselect(label, default_choice, distinct_values)
return multi_widget
def get_multiselect_vals(distinct_values, label):
    if 'ALL' in getArgument(label):  # if 'ALL' is selected (possibly alongside other options), 'ALL' overrides
        selected_values = distinct_values
        if isinstance(selected_values, list):
            selected_values = ','.join(map("'{0}'".format, selected_values))  # convert list to quoted, comma-separated string
    elif getArgument(label).count(',') == 0:  # 0 or 1 option(s) selected
        selected_values = [getArgument(label), 'dummy']  # add a dummy value so the IN clause does not break
        selected_values = ','.join(map("'{0}'".format, selected_values))  # convert list to quoted, comma-separated string
    else:  # multiple options selected: a comma-separated string without quotes around the options
        selected_values = "'" + getArgument(label).replace(",", "','") + "'"
    return selected_values
Get Distinct Values for Multiselect Filter and Create Widgets
# Get distinct values for multi-select widgets
dist_departments = cost_col_dist_vals('department')
dist_sku_names = cost_col_dist_vals('sku_name')
dist_workspaces = cost_col_dist_vals('workspace')
# Create widgets
create_widget_department = create_multi_widget(dist_departments, 'Department')
create_widget_sku_names = create_multi_widget(dist_sku_names, 'SKU Name')
create_widget_workspaces = create_multi_widget(dist_workspaces, 'Workspace')
Install Prophet
!pip install prophet
Prophet is a time-series forecasting model developed by Facebook.
Prepare Forecast Functions
This section has a couple of fundamental differences compared to the demo code.
- Model type
No model type is specified in the demo notebook, which means the default linear model is used (source: Prophet documentation). With a linear model and a downward slope, you will eventually see negative predicted values if the forecast period is long enough. In this use case, negative Databricks usage, and therefore negative cost, is impossible. The Prophet documentation suggests specifying a logistic model with a cap and a floor; if a floor is set, a cap must also be in place.
- Minimum training dataset size
The minimum number of training records in the demo notebook is 10 (the dataset here is workspace_hist_df). However, I noticed predictions that deviated strongly from the actual trend, so I increased the minimum to 30, which can still be too limited in some cases. You can experiment with this threshold, and with the workspace filter, since the filter affects the number of data points. In Figure 2, you can see that values above €500 are predicted while the actual values only range between €0 and €50.
- Country holidays
Prophet considers holidays in its forecasts. The country specified in the demo is US. However, since most of the people who use this Databricks account at my company are working in the Netherlands, I used ‘NL’.
- Do not show cap in the y-axis range of the graph
I added plot_cap = False as the function’s default setting. I used a cap value of 1,000,000, far above the maximum actual daily cost and a level that will likely never be reached. If the cap were included in the y-axis range, the actual and forecast fluctuations would be much less visible.
from prophet import Prophet
# Predict daily values for the next 3 months
forecast_frequency='d'
forecast_periods=31*3
interval_width=0.8
include_history=True
def generate_forecast(history_df, growth = 'logistic', convert_to_pd = False, display_graph = True):
# convert to pandas df
if convert_to_pd:
history_pd = history_df.toPandas()
else:
history_pd = history_df
# drop missing values
history_pd = history_pd.dropna()
if history_pd.shape[0] > 30:
# train and configure the model
model = Prophet(interval_width=interval_width, growth = growth) # for forecast values with cap (and floor) (source: https://facebook.github.io/prophet/docs/saturating_forecasts.html)
model.add_country_holidays(country_name='NL')
model.fit(history_pd)
# make predictions
future_pd = model.make_future_dataframe(periods=forecast_periods, freq=forecast_frequency, include_history=include_history)
if 'cap' in history_pd.columns:
future_pd['cap'] = history_pd['cap'].iloc[0]
if 'floor' in history_pd.columns:
future_pd['floor'] = history_pd['floor'].iloc[0]
forecast_pd = model.predict(future_pd)
if display_graph:
model.plot(forecast_pd, plot_cap = False, include_legend=True)
# add back y to the history dataset
f_pd = forecast_pd[['ds', 'yhat', 'yhat_upper', 'yhat_lower']].set_index('ds')
# join history and forecast
results_pd = f_pd.join(history_pd[['ds','y']].set_index('ds'), how='left')
results_pd.reset_index(level=0, inplace=True)
results_pd['ds'] = results_pd['ds'].dt.date
else:
# not enough data to predict, return history
for c in ['yhat', 'yhat_upper', 'yhat_lower']:
history_pd[c] = history_pd['y']
results_pd = history_pd[['ds','y','yhat', 'yhat_upper', 'yhat_lower']]
return results_pd
Create Forecast Visual
Prophet’s documentation indicates that the input to its main functions should always be a dataframe with two columns: ds (the date) and y (the output variable used for training), so I adhered to this format and grouped the workspace-filtered query by date (ds). Additionally, Databricks trains an additional model for each independent variable, yet the information about adding variables is quite limited. Because of that, and because managing more models would take extra effort, especially in combination with the filters, I skipped adding variables.
The drawback is that the model does not take variation in the other variables into account. On the other hand, this blog describes a specialised model with filter functionalities that performs well in terms of predicted values.
Error handling for empty datasets is implemented. It ensures that the output of the cell displays the following if there are no records: “Please verify if you have a workspace selected in the filter widget. Interrupt and re-run the dashboard if you make multiple selection changes.”
# Retrieve selected workspace_ids
selected_workspaces = get_multiselect_vals(dist_workspaces, 'Workspace')
# vw_cost is limited by the last three months
workspace_hist_df = spark.sql(f"""
select
usage_date as ds
,sum(cost_at_list_price) as y
from vw_cost
where workspace in ({selected_workspaces})
group by ds
order by ds
""")
# Specify cap and floor (source: https://facebook.github.io/prophet/docs/saturating_forecasts.html#saturating-minimum)
workspace_hist_df = workspace_hist_df.toPandas()
workspace_hist_df['cap'] = 1000000
workspace_hist_df['floor'] = 0
results_pd = generate_forecast(workspace_hist_df, display_graph = False)
# Although the floor is set to 0, predicted values reach below 0, so negative values are converted to 0
zero_columns_to_convert = ['yhat', 'yhat_upper', 'yhat_lower']
results_pd[zero_columns_to_convert] = results_pd[zero_columns_to_convert].applymap(lambda x: max(0, x))
# Set up date filter widgets
from datetime import datetime, date
if len(results_pd) == 0:
    import pandas as pd
    error_message = ['Please verify if you have a workspace selected in the filter widget. Interrupt and re-run the dashboard if you make multiple selection changes.']
    error_df = pd.DataFrame()
    error_df['Error Message'] = error_message
    display(error_df)
else:
## Define default date values
default_min_date = str(min(results_pd['ds'])) # results pd is a pandas df
default_max_date = str(max(results_pd['ds'])) # results pd is a pandas df
## Define all widget options
distinct_dates = results_pd['ds'].astype(str).unique() # retrieve distinct dates and convert to string for readability
    ## Create widgets. This order was chosen because the most recently created widget appears to the left of existing widgets. You can change the widget order in the widgets pane at the top of the notebook.
dbutils.widgets.dropdown('Minimum Date', default_min_date , distinct_dates)
dbutils.widgets.dropdown('Maximum Date', default_max_date , distinct_dates)
## Retrieve selected values from date widgets
min_date = date.fromisoformat(getArgument('Minimum Date'))
max_date = date.fromisoformat(getArgument('Maximum Date'))
## Filter results_pd by date
filtered_results_pd = results_pd[(results_pd['ds'] >= min_date) & (results_pd['ds'] <= max_date)]
# Create the visualization
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['y'][:-4], name='actual usage'))
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['yhat'], name='forecast cost (pricing list)'))
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['yhat_upper'], name='uncertainty interval upper bound', line = dict(color='grey', width=1, dash='dot'))) # renamed this (compared to demo) for more clarity
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['yhat_lower'], name='uncertainty interval lower bound', line = dict(color='grey', width=1, dash='dot'))) # renamed this (compared to demo) for more clarity
fig.update_layout(margin_b=15,margin_t=10) # to make it look better in the dashboard
fig.show()
The following image (Figure 3) contains an example output.
Create Daily Cost by SKU
Just like the original demo, the last two visuals do not include forecasts. The visualisations themselves differ from the demo’s, though, and show how cost is distributed over SKUs.
- Daily Cost by SKU Family (Actual)
As an extension of the actual line in the forecast graph, I zoomed in on the cost by SKU over time (Figure 4). Since the sku_name column would produce too many hard-to-read categories, I used the SKU mapping from the demo and refer to it as SKU Family.
- Total Cost by SKU Name (Actual)
To still be able to see the cost distribution per SKU name, users can study the visualisation below (Figure 5). To inspect the distribution for a specific date or date range, they can adjust the minimum and maximum date filters. A sketch of the aggregation behind this visual follows the code cell below.
In the following code cell, I retrieve the filter values and insert them into the query string. For this reason, I build the query with spark.sql in a Python cell and work with a dataframe rather than a %sql cell.
# Get filter values
selected_departments = get_multiselect_vals(dist_departments, 'Department')
selected_sku_names = get_multiselect_vals(dist_sku_names, 'SKU Name')
selected_workspaces = get_multiselect_vals(dist_workspaces, 'Workspace') # included in this cell to force a re-run when the widget selection changes
min_date_str = getArgument('Minimum Date') # included in this cell to force a re-run when the widget selection changes
max_date_str = getArgument('Maximum Date') # included in this cell to force a re-run when the widget selection changes
cost_by_sku_family_df = spark.sql(f"""
select
sub.usage_date as `Date`
,sub.sku_name as `SKU Name`
,CASE WHEN sub.sku_name LIKE "%ALL_PURPOSE%" THEN "ALL_PURPOSE"
WHEN sub.sku_name LIKE "%JOBS%" THEN "JOBS"
WHEN sub.sku_name LIKE "%DLT%" THEN "DLT"
WHEN sub.sku_name LIKE "%SQL%" THEN "SQL"
WHEN sub.sku_name LIKE "%INFERENCE%" THEN "MODEL_INFERENCE"
ELSE "OTHER"
END AS `SKU Family`
,sum(sub.cost_at_list_price) as `Cost (List Prices)` -- based on the list_prices table, which does not include special price arrangements with Databricks
from (
select
workspace
,usage_date
,sku_name
,cost_at_list_price
,coalesce(b.`department`, 'NULL') as `department`
FROM vw_cost b
) sub
where sub.workspace in ({selected_workspaces})
and sub.usage_date between '{min_date_str}' and '{max_date_str}'
and sub.sku_name in ({selected_sku_names})
and sub.`department` in ({selected_departments})
group by
sub.usage_date
,sub.sku_name
,`SKU Family`
""")
display(cost_by_sku_family_df)
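For the Total Cost by SKU Name visual (Figure 5), the same filtered result can be aggregated one step further. A minimal sketch, reusing cost_by_sku_family_df and the sum/col imports from earlier (the result name is mine):
total_cost_by_sku_df = (
    cost_by_sku_family_df
    .groupBy("SKU Name")                                                # one row per SKU name over the selected dates
    .agg(sum("Cost (List Prices)").alias("Total Cost (List Prices)"))   # sum/col come from pyspark.sql.functions
    .orderBy(col("Total Cost (List Prices)").desc())
)
display(total_cost_by_sku_df)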
Omitted Elements
- billing_forecast table
I did not create a table or view, in order to keep the catalog clean. In the demo, a table is created and then queried to build the graph, but everything worked functionally without it.
- detailed_billing_forecast view
I did not want to “smoothen” the data, as mentioned in the demo; I intended to display the data as it is.
Notebook Settings
- When a filter input is changed, only the commands that call getArgument() for the changed widget are re-run.
- The default notebook language is Python.
Limitations and Opportunities for Improvement
Since the Prophet documentation states that training and testing are automated, I did not spend much effort tweaking the settings; doing so remains an opportunity to improve the model. Model accuracy could also be improved by adding regressors (a sketch follows below) or by training a separate model per additional variable, although the latter is not recommended in the official Prophet documentation. Another opportunity mentioned earlier is specifying a larger minimum historical dataset.
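As a minimal sketch of the regressor option, assuming an extra column such as 'is_weekend' is added to the training data (this is not part of the demo):
import pandas as pd
from prophet import Prophet
history_pd = workspace_hist_df.copy()  # pandas dataframe with ds, y, cap and floor columns
history_pd['is_weekend'] = pd.to_datetime(history_pd['ds']).dt.dayofweek.isin([5, 6]).astype(int)
model = Prophet(interval_width=0.8, growth='logistic')
model.add_country_holidays(country_name='NL')
model.add_regressor('is_weekend')  # Prophet fits a coefficient for this extra column
model.fit(history_pd)
# The same 'is_weekend' column (plus cap and floor) must also be added to the future dataframe before model.predict().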
Dashboard
Below, you can see the final dashboard (Figure 6). The dashboard can be used to support decision-making related to budgeting and FinOps.
In Summary
This post highlights key improvements made to the Databricks billing forecast demo. These enhancements focus on model selection, filtering capabilities, and security. Instead of a linear model, a logistic model now provides more accurate predictions, avoiding unrealistic outcomes. Interactive filtering widgets allow for a more flexible analysis scope. Adding row-level security ensures data privacy and a more precise data representation. To optimise the process, some steps with unclear values were removed. The code was restructured to facilitate the implementation of security filters. These changes result in more informative visualisations that are valuable for stakeholders. While these improvements enhance the demo significantly, further refinements are possible.
Devoteam helps you optimise your Databricks costs
With a team of 1,000+ data consultants holding over 960 certifications across leading cloud platforms like AWS, Google Cloud, Microsoft, Databricks and Snowflake, Devoteam helps you gain control of your Databricks spending.