In 2023, Databricks introduced system tables to provide insights into platform usage. While Databricks offers helpful demos, I encountered areas for improvement during implementation.
Addressing Demo Limitations
For instance, I noticed the predicted daily cost occasionally dropped below zero. However appealing a negative bill might sound, it is not something I’d present to stakeholders. Additionally, the demo lacked dashboard filters, which prevented users from focusing on specific data.
Boosting Security and Usability
With the public preview of row-level security (RLS), I integrated this feature to ensure data privacy in a data mesh architecture. This addition enhances security and allows for more tailored data access.
This post details the extensions and workarounds I used to create a secure, user-friendly dashboard with a predictive model adapted to the data.
Prerequisites
- You should have at least one Unity Catalog-enabled workspace.
- An account admin should enable system tables through the Unity Catalog REST API.
- More information on enabling system tables can be found in the official Databricks documentation; an illustrative sketch of the API call follows this list.
- To query tables with row filters (for RLS), use a cluster with runtime version 12.2 or higher.
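The exact call depends on your environment, but an illustrative sketch of enabling a system schema through the Unity Catalog REST API could look like this (the host, token, metastore ID and schema name below are placeholders; verify the endpoint against the official documentation):
import requests
# Placeholders only; the systemschemas endpoint follows the pattern documented by Databricks for enabling system tables.
host = "https://<workspace-url>"
token = "<personal-access-token>"
metastore_id = "<metastore-id>"
schema_name = "compute"  # the system schema to enable
response = requests.put(
    f"{host}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas/{schema_name}",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()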
Coding Journey
Demo by Databricks
Here you can find the Forecast Billing Demo by Databricks.
Code and Visualisations
Import Necessary SQL Functions
from pyspark.sql.functions import sum, col, to_date, month, year, concat_ws, current_date, date_sub, when, lit, add_months, lower, regexp_extract, format_number, avg, count
Create a Temporary View
I preferred to use a temporary view, but for this notebook the same functionality could also be achieved with a dataframe.
In a federated architecture, it is highly recommended to use tags, for example, to facilitate cross-charging. In the code, I added ‘department’ as an illustrative tag. You can create your own key-value pairs that align with your business.
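To illustrate (the cluster values below are placeholders, not part of the demo), such a tag can be attached through the custom_tags field when a cluster is defined, for example in the Clusters API payload:
# Hypothetical cluster payload fragment; only custom_tags matters here for cross-charging.
cluster_spec = {
    "cluster_name": "analytics-etl",      # placeholder name
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime
    "node_type_id": "Standard_DS3_v2",    # cloud-specific; adjust to your provider
    "num_workers": 2,
    "custom_tags": {
        "department": "analytics"         # later visible in the tags column of the clusters system table
    },
}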
Moreover, I constructed a view layer on top of both the billing.usage and the compute.clusters system tables to enforce row-level security (a sketch of how such a secured view could be defined follows this list). More information on the views:
- vw_clusters
I created this view to enforce row-level security on the compute.clusters system table. Row filter functions cannot be applied directly to system tables, so this view is the only extra layer; the columns are identical. The table and view hold information about the clusters within the Databricks account, such as cluster ID, name, node types, scaling and other settings.
- vw_usage
I created vw_usage for the same reason: to apply row-level security. It exposes all the columns from the billing.usage system table, an hourly fact log of Databricks consumption by SKU name and workspace. Other important columns are usage unit, usage quantity and usage metadata.
- dim_workspace
To be able to refer to workspace names (instead of just IDs) in Databricks monitoring dashboards, my colleague Chris created a mapping between the two. The columns in this table are workspace ID and workspace name.
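The production view definitions are specific to our environment, but a minimal sketch of a secured view on top of a system table, assuming illustrative group names such as 'finops_admins' and a 'dept_' group prefix, could look like this:
# Illustrative sketch only: is_account_group_member() is a built-in Databricks SQL function,
# but the group names and the department-based rule are assumptions, not the real definitions.
spark.sql("""
CREATE OR REPLACE VIEW support.finops.vw_clusters AS
SELECT *
FROM system.compute.clusters
WHERE is_account_group_member('finops_admins')                    -- admins see every cluster
   OR is_account_group_member(concat('dept_', tags.department))   -- others only see their own department
""")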
I added additional explanatory comments in the script.
%sql
CREATE OR REPLACE TEMPORARY VIEW vw_cost
AS
WITH cte_cluster_tags AS (
SELECT DISTINCT
cluster_id
,tags.department as department
,ROW_NUMBER() OVER (
PARTITION BY cluster_id
ORDER BY
create_time DESC
) AS rn -- This numbers rows based on the specified partition. 1 represents the newest tag value.
FROM
support.finops.vw_clusters -- View for secured access
)
SELECT
coalesce(ws.workspace_name, ws.workspace_id) as workspace
,u.usage_date
,u.sku_name
,cast(u.usage_quantity AS DOUBLE) AS dbus
,cast(lp.pricing.default * usage_quantity AS DOUBLE) AS cost_at_list_price
,c.department
FROM
support.finops.vw_usage u
INNER JOIN system.billing.list_prices lp
ON u.cloud = lp.cloud
AND u.sku_name = lp.sku_name
AND u.usage_start_time BETWEEN lp.price_start_time AND COALESCE(lp.price_end_time, '2099-12-31')
LEFT JOIN cte_cluster_tags c
ON u.usage_metadata.cluster_id = c.cluster_id
AND c.rn = 1 -- To make sure that incorrect duplicates are avoided if they ever exist.
LEFT JOIN support.finops.dim_workspace ws
ON u.workspace_id = ws.workspace_id
WHERE
u.usage_unit = 'DBU'
AND u.usage_date BETWEEN current_date() - 93 -- 3 months
AND current_date() -1 -- Not including today, because day is not finished
ORDER BY
u.usage_date
,workspace
,u.sku_name
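Once the cell has run, the temporary view can be queried from Python like any other table; a quick, illustrative sanity check:
display(spark.table("vw_cost").limit(10))  # vw_cost is the temporary view created above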
Prepare Functions for Multiselect Widgets
As the built-in multiselect widget lacks an easy 'select all' option, I created custom functions that add a single 'ALL' entry to the choices. Figure 1 provides an example of a widget with an 'ALL' option.
The first function creates a distinct list of values from a certain column. This could be parameterised to a larger extent.
def cost_col_dist_vals(column_name):
distinct_values_df = spark.sql(f"select distinct coalesce({column_name}, 'NULL') as {column_name} from vw_cost order by {column_name}") # Query or view/table name could be a parameter
distinct_values = [row[column_name] for row in distinct_values_df.select(column_name).collect()] # convert df column to list
return distinct_values
def create_multi_widget(distinct_values, label, default_choice='ALL'):
if default_choice == 'ALL':
distinct_values.insert(0,'ALL') # insert ALL option at the top
multi_widget = dbutils.widgets.multiselect(label, default_choice, distinct_values)
return multi_widget
def get_multiselect_vals(distinct_values, label):
    if 'ALL' in getArgument(label):  # if 'ALL' is selected (possibly alongside other options), 'ALL' overrides
        selected_values = distinct_values
        if isinstance(selected_values, list):
            selected_values = ','.join(map("'{0}'".format, selected_values))  # convert list to quoted, comma-separated string
    elif getArgument(label).count(',') == 0:  # 0 or 1 option(s) selected
        selected_values = [getArgument(label), 'dummy']  # add a dummy value so the IN clause does not break
        selected_values = ','.join(map("'{0}'".format, selected_values))  # convert list to quoted, comma-separated string
    else:  # multiple options selected: a comma-separated string without quotes around the options
        selected_values = "'" + getArgument(label).replace(",", "','") + "'"
    return selected_values
Get Distinct Values for Multiselect Filter and Create Widgets
# Get distinct values for multi-select widgets
dist_departments = cost_col_dist_vals('department')
dist_sku_names = cost_col_dist_vals('sku_name')
dist_workspaces = cost_col_dist_vals('workspace')
# Create widgets
create_widget_department = create_multi_widget(dist_departments, 'Department')
create_widget_sku_names = create_multi_widget(dist_sku_names, 'SKU Name')
create_widget_workspaces = create_multi_widget(dist_workspaces, 'Workspace')
Install Prophet
!pip install prophet
Prophet is a time-series forecasting model developed by Facebook.
Prepare Forecast Functions
This section has a couple of fundamental differences compared to the demo code.
- Model type
No model type is specified in the demo notebook, which means the default linear model is used (source: Prophet documentation). With a linear model and a downward slope, you will eventually see negative predicted values if the forecast period is long enough. In this use case, negative Databricks usage, and therefore negative cost, is impossible. The Prophet documentation suggests specifying a logistic model with a cap and a floor; if a floor is set, a cap must also be in place.
- Minimum training dataset size
The minimum number of training records in the demo notebook is 10 (the dataset here is workspace_hist_df). However, I noticed predictions that deviated strongly from the actual trend, so I increased the minimum to 30, which can still be too limited in some cases. You can experiment with this threshold, and with the workspace filter, since the filter affects the number of data points. In Figure 2, you can see that values above €500 are predicted while the actual values only range between €0 and €50.
- Country holidays
Prophet considers holidays in its forecasts. The country specified in the demo is US. However, since most of the people who use this Databricks account at my company are working in the Netherlands, I used ‘NL’.
- Do not show cap in the y-axis range of the graph
I added plot_cap = False as the function’s default setting. I used a cap value of 1,000,000, far above the maximum actual daily cost and a level that will likely never be reached. If the cap were included in the y-axis range, the actual and forecast fluctuations would be much less visible.
from prophet import Prophet
# Predict daily values for the next 3 months
forecast_frequency='d'
forecast_periods=31*3
interval_width=0.8
include_history=True
def generate_forecast(history_df, growth = 'logistic', convert_to_pd = False, display_graph = True):
# convert to pandas df
if convert_to_pd:
history_pd = history_df.toPandas()
else:
history_pd = history_df
# drop missing values
history_pd = history_pd.dropna()
if history_pd.shape[0] > 30:
# train and configure the model
model = Prophet(interval_width=interval_width, growth = growth) # for forecast values with cap (and floor) (source: https://facebook.github.io/prophet/docs/saturating_forecasts.html)
model.add_country_holidays(country_name='NL')
model.fit(history_pd)
# make predictions
future_pd = model.make_future_dataframe(periods=forecast_periods, freq=forecast_frequency, include_history=include_history)
if 'cap' in history_pd.columns:
future_pd['cap'] = history_pd['cap'].iloc[0]
if 'floor' in history_pd.columns:
future_pd['floor'] = history_pd['floor'].iloc[0]
forecast_pd = model.predict(future_pd)
if display_graph:
model.plot(forecast_pd, plot_cap = False, include_legend=True)
# add back y to the history dataset
f_pd = forecast_pd[['ds', 'yhat', 'yhat_upper', 'yhat_lower']].set_index('ds')
# join history and forecast
results_pd = f_pd.join(history_pd[['ds','y']].set_index('ds'), how='left')
results_pd.reset_index(level=0, inplace=True)
results_pd['ds'] = results_pd['ds'].dt.date
else:
# not enough data to predict, return history
for c in ['yhat', 'yhat_upper', 'yhat_lower']:
history_pd[c] = history_pd['y']
results_pd = history_pd[['ds','y','yhat', 'yhat_upper', 'yhat_lower']]
return results_pd
Create Forecast Visual
Prophet’s documentation indicates that the input to its main functions should always be a dataframe with two columns: ds (the date) and y (the output variable used for training), so I adhered to this format and grouped the workspace-filtered query by date (ds). Additionally, Databricks trains an additional model for each independent variable, yet the information about adding variables is quite limited. Because of that, and because managing more models would take extra effort, especially in combination with the filters, I skipped adding variables.
The drawback is that the model does not take variation in the other variables into account. On the other hand, this blog describes a specialised model with filter functionalities that performs well in terms of predicted values.
Error handling for empty datasets is implemented. It ensures that the output of the cell displays the following if there are no records: “Please verify if you have a workspace selected in the filter widget. Interrupt and re-run the dashboard if you make multiple selection changes.”
# Retrieve selected workspace_ids
selected_workspaces = get_multiselect_vals(dist_workspaces, 'Workspace')
# vw_cost is limited by the last three months
workspace_hist_df = spark.sql(f"""
select
usage_date as ds
,sum(cost_at_list_price) as y
from vw_cost
where workspace in ({selected_workspaces})
group by ds
order by ds
""")
# Specify cap and floor (source: https://facebook.github.io/prophet/docs/saturating_forecasts.html#saturating-minimum)
workspace_hist_df = workspace_hist_df.toPandas()
workspace_hist_df['cap'] = 1000000
workspace_hist_df['floor'] = 0
results_pd = generate_forecast(workspace_hist_df, display_graph = False)
# Although the floor is set to 0, predicted values reach below 0, so negative values are converted to 0
zero_columns_to_convert = ['yhat', 'yhat_upper', 'yhat_lower']
results_pd[zero_columns_to_convert] = results_pd[zero_columns_to_convert].applymap(lambda x: max(0, x))
# Set up date filter widgets
from datetime import datetime, date
if len(results_pd) == 0:
    import pandas as pd
    error_message = ['Please verify if you have a workspace selected in the filter widget. Interrupt and re-run the dashboard if you make multiple selection changes.']
    error_df = pd.DataFrame()
    error_df['Error Message'] = error_message
    display(error_df)
else:
## Define default date values
default_min_date = str(min(results_pd['ds'])) # results pd is a pandas df
default_max_date = str(max(results_pd['ds'])) # results pd is a pandas df
## Define all widget options
distinct_dates = results_pd['ds'].astype(str).unique() # retrieve distinct dates and convert to string for readability
    ## Create widgets. This order was chosen because the most recently created widget appears to the left of existing widgets. You can change the widget order in the widgets pane at the top of the notebook.
dbutils.widgets.dropdown('Minimum Date', default_min_date , distinct_dates)
dbutils.widgets.dropdown('Maximum Date', default_max_date , distinct_dates)
## Retrieve selected values from date widgets
min_date = date.fromisoformat(getArgument('Minimum Date'))
max_date = date.fromisoformat(getArgument('Maximum Date'))
## Filter results_pd by date
filtered_results_pd = results_pd[(results_pd['ds'] >= min_date) & (results_pd['ds'] <= max_date)]
# Create the visualization
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['y'][:-4], name='actual usage'))
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['yhat'], name='forecast cost (pricing list)'))
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['yhat_upper'], name='uncertainty interval upper bound', line = dict(color='grey', width=1, dash='dot'))) # renamed this (compared to demo) for more clarity
fig.add_trace(go.Scatter(x=filtered_results_pd['ds'], y=filtered_results_pd['yhat_lower'], name='uncertainty interval lower bound', line = dict(color='grey', width=1, dash='dot'))) # renamed this (compared to demo) for more clarity
fig.update_layout(margin_b=15,margin_t=10) # to make it look better in the dashboard
fig.show()
The following image (Figure 3) contains an example output.
Create Daily Cost by SKU
Just like the original demo, the last two visuals do not include forecasts. The visualisations themselves differ from the demo’s, though, and show how cost is distributed over SKUs.
- Daily Cost by SKU Family (Actual)
As an extension of the actual line in the forecast graph, I zoomed in on the cost by SKU over time (Figure 4). Since the sku_name column would produce too many hard-to-read categories, I used the SKU mapping from the demo and refer to it as SKU Family.
- Total Cost by SKU Name (Actual)
To still be able to see the cost distribution per SKU name, users can study the visualisation below (Figure 5). To inspect the distribution for a specific date or date range, they can adjust the minimum and maximum date filters. A sketch of the aggregation behind this visual follows the code cell below.
In the following code cell, I retrieve the filter values and insert them into the query string. For this reason, I build the query with spark.sql in a Python cell and work with a dataframe rather than a %sql cell.
# Get filter values
selected_departments = get_multiselect_vals(dist_departments, 'Department')
selected_sku_names = get_multiselect_vals(dist_sku_names, 'SKU Name')
selected_workspaces = get_multiselect_vals(dist_workspaces, 'Workspace') # included in this cell to force a re-run when the widget selection changes
min_date_str = getArgument('Minimum Date') # included in this cell to force a re-run when the widget selection changes
max_date_str = getArgument('Maximum Date') # included in this cell to force a re-run when the widget selection changes
cost_by_sku_family_df = spark.sql(f"""
select
sub.usage_date as `Date`
,sub.sku_name as `SKU Name`
,CASE WHEN sub.sku_name LIKE "%ALL_PURPOSE%" THEN "ALL_PURPOSE"
WHEN sub.sku_name LIKE "%JOBS%" THEN "JOBS"
WHEN sub.sku_name LIKE "%DLT%" THEN "DLT"
WHEN sub.sku_name LIKE "%SQL%" THEN "SQL"
WHEN sub.sku_name LIKE "%INFERENCE%" THEN "MODEL_INFERENCE"
ELSE "OTHER"
END AS `SKU Family`
,sum(sub.cost_at_list_price) as `Cost (List Prices)` -- based on the list_prices table, which does not include special price arrangements with Databricks
from (
select
workspace
,usage_date
,sku_name
,cost_at_list_price
,coalesce(b.`department`, 'NULL') as `department`
FROM vw_cost b
) sub
where sub.workspace in ({selected_workspaces})
and sub.usage_date between '{min_date_str}' and '{max_date_str}'
and sub.sku_name in ({selected_sku_names})
and sub.`department` in ({selected_departments})
group by
sub.usage_date
,sub.sku_name
,`SKU Family`
""")
display(cost_by_sku_family_df)
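For the Total Cost by SKU Name visual (Figure 5), the same filtered result can be aggregated one step further. A minimal sketch, reusing cost_by_sku_family_df and the sum/col imports from earlier (the result name is mine):
total_cost_by_sku_df = (
    cost_by_sku_family_df
    .groupBy("SKU Name")                                                # one row per SKU name over the selected dates
    .agg(sum("Cost (List Prices)").alias("Total Cost (List Prices)"))   # sum/col come from pyspark.sql.functions
    .orderBy(col("Total Cost (List Prices)").desc())
)
display(total_cost_by_sku_df)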
Omitted Elements
- billing_forecast table
I did not create a table or view, in order to keep the catalog clean. In the demo, a table is created and then queried to build the graph, but everything worked functionally without it.
- detailed_billing_forecast view
I did not want to “smoothen” the data, as mentioned in the demo; I intended to display the data as it is.
Notebook Settings
- When a filter input is changed, only the commands that call getArgument() for the changed widget are re-run.
- The default notebook language is Python.
Limitations and Opportunities for Improvement
Since the Prophet documentation states that training and testing are automated, I did not spend much effort tweaking the settings; doing so remains an opportunity to improve the model. Model accuracy could also be improved by adding regressors (a sketch follows below) or by training a separate model per additional variable, although the latter is not recommended in the official Prophet documentation. Another opportunity mentioned earlier is specifying a larger minimum historical dataset.
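As a minimal sketch of the regressor option, assuming an extra column such as 'is_weekend' is added to the training data (this is not part of the demo):
import pandas as pd
from prophet import Prophet
history_pd = workspace_hist_df.copy()  # pandas dataframe with ds, y, cap and floor columns
history_pd['is_weekend'] = pd.to_datetime(history_pd['ds']).dt.dayofweek.isin([5, 6]).astype(int)
model = Prophet(interval_width=0.8, growth='logistic')
model.add_country_holidays(country_name='NL')
model.add_regressor('is_weekend')  # Prophet fits a coefficient for this extra column
model.fit(history_pd)
# The same 'is_weekend' column (plus cap and floor) must also be added to the future dataframe before model.predict().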
Dashboard
Below, you can see the final dashboard (Figure 6). The dashboard can be used to support decision-making related to budgeting and FinOps.
In Summary
This post highlights key improvements made to the Databricks billing forecast demo. These enhancements focus on model selection, filtering capabilities, and security. Instead of a linear model, a logistic model now provides more accurate predictions, avoiding unrealistic outcomes. Interactive filtering widgets allow for a more flexible analysis scope. Adding row-level security ensures data privacy and a more precise data representation. To optimise the process, some steps with unclear values were removed. The code was restructured to facilitate the implementation of security filters. These changes result in more informative visualisations that are valuable for stakeholders. While these improvements enhance the demo significantly, further refinements are possible.
Devoteam helps you optimise your Databricks costs
With a team of 1,000+ data consultants holding over 960 certifications across leading cloud platforms like AWS, Google Cloud, Microsoft, Databricks and Snowflake, Devoteam helps you gain control of your Databricks spending.