The ML.EXPLAIN_PREDICT function

This document describes the ML.EXPLAIN_PREDICT function, which lets you generate a predicted value and a set of feature attributions for each instance of the input data. Feature attributions indicate how much each feature in your model contributed to the final prediction for a given instance. ML.EXPLAIN_PREDICT is essentially an extended version of ML.PREDICT.
Syntax
ML.EXPLAIN_PREDICT(
  MODEL `project_id.dataset.model_name`,
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [top_k AS top_k_features]
    [, threshold AS threshold]
    [, integrated_gradients_num_steps AS integrated_gradients_num_steps]
    [, approx_feature_contrib AS approx_feature_contrib]))
Arguments
ML.EXPLAIN_PREDICT takes the following arguments:

- project_id: Your project ID.
- dataset: The BigQuery dataset that contains the model.
- model_name: The name of the model.
- table: The name of the input table that contains the data to be evaluated.

  If table is specified, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules. Any unused columns in the table are passed through to the output columns.

- query_statement: The GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the query_statement clause in GoogleSQL, see Query syntax.

  If query_statement is specified, the input column names from the query must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules. Any unused columns from the query are passed through to the output columns.

  If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then only the input columns present in the TRANSFORM clause can appear in query_statement.

- top_k_features: an INT64 value that specifies how many top feature attribution pairs are generated for each row of input data. The features are ranked by the absolute values of their attributions.

  By default, top_k_features is set to 5. If its value is greater than the number of features in the training data, the attributions of all features are returned.

- threshold: a FLOAT64 value that specifies the cutoff between the two labels for binary classification models. Predictions above the threshold are positive predictions; predictions below the threshold are negative predictions. Feature attributions are returned only for the predicted label.

  The threshold value must be between 0.0 and 1.0. The default value is 0.5.

- integrated_gradients_num_steps: an INT64 value that specifies the number of steps to sample between the example being explained and its baseline. This value is used to approximate the integral in the integrated gradients attribution method. Increasing the value improves the precision of the feature attributions, but makes the computation slower and more expensive.

  This option applies only to deep neural network (DNN) models, which use the integrated gradients attribution method. The default value is 15.

- approx_feature_contrib: a BOOL value that indicates whether to use the approximate feature contribution method in XGBoost model explanation. This option applies only to boosted tree and random forest models.

  This capability is provided by the XGBoost library; BigQuery ML only passes the option through to it. For more information, see Package 'xgboost' and search for approxcontrib. The default value is FALSE.
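To build intuition for the integrated_gradients_num_steps option, the following Python sketch (not BigQuery ML code; the toy model f(x) = x**3 and its zero baseline are made up for illustration) approximates the integrated gradients integral with a Riemann sum, showing how more steps reduce the approximation error at the cost of more computation:

```python
def grad_f(x):
    # Analytic gradient of the toy model f(x) = x**3.
    return 3 * x ** 2

def integrated_gradient(x, baseline, num_steps):
    # Midpoint Riemann sum of the gradient along the straight path
    # from the baseline to the example being explained.
    total = 0.0
    for k in range(num_steps):
        alpha = (k + 0.5) / num_steps          # sample point on the path
        total += grad_f(baseline + alpha * (x - baseline))
    return (x - baseline) * total / num_steps

x, baseline = 2.0, 0.0
exact = x ** 3 - baseline ** 3                 # f(x) - f(baseline)
coarse = integrated_gradient(x, baseline, num_steps=5)
fine = integrated_gradient(x, baseline, num_steps=50)
# More steps -> a smaller gap between the attribution and the true
# prediction difference, mirroring the precision/cost trade-off above.
print(abs(coarse - exact), abs(fine - exact))
```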
Output
ML.EXPLAIN_PREDICT
returns the following columns in addition to any
passthrough columns:
- predicted_<label_column_name>: the predicted value of the label for regression models, or the predicted label class for classification models.
- probability: a FLOAT64 value that contains the probability of the predicted label class. This column is present only for classification models.
- top_feature_attributions: an ARRAY<STRUCT> value that contains the attributions of the top k features to the final prediction:
  - top_feature_attributions.feature: a STRING value that contains the feature name.
  - top_feature_attributions.attribution: a FLOAT64 value that contains the attribution of the feature to the final prediction.
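The ranking behind top_feature_attributions can be sketched in Python (the feature names and attribution values below are hypothetical, not actual model output):

```python
# Candidate per-feature attributions for one input row.
attributions = {
    "column1": 0.8, "column2": -1.5, "column3": 0.1,
    "column4": -0.3, "column5": 2.0,
}
top_k = 3
# Features are ranked by the absolute value of their attribution,
# so large negative contributions rank as high as large positive ones.
top = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
print(top)  # → [('column5', 2.0), ('column2', -1.5), ('column1', 0.8)]
```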
- baseline_prediction_value: a FLOAT64 value that contains one of the following:
  - For linear models, the baseline_prediction_value value is the intercept of the model.
  - For DNN models, the baseline_prediction_value value is the mean across all numerical features and NULL for other types of features.
  - For boosted tree and random forest models, the baseline_prediction_value value is equal to the bias term, which is the expected output of the model over the training dataset. See the Tree SHAP documentation for more information.
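For linear models, the relationship between the intercept baseline and the feature attributions can be sketched as follows (Python, with made-up weights and feature values; not actual BigQuery ML output):

```python
# A tiny linear model: prediction = intercept + sum(weight * value).
intercept = 1.0                                # baseline_prediction_value
weights = {"column1": 0.5, "column2": -2.0}    # learned coefficients
row = {"column1": 4.0, "column2": 1.5}         # one input row

# Each feature's attribution is its contribution relative to the intercept,
# so the baseline plus all attributions reproduces the prediction exactly.
attributions = {f: weights[f] * row[f] for f in weights}
prediction = intercept + sum(attributions.values())
print(prediction, attributions)
```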
- prediction_value: The raw prediction value.
  - For regression models, this is a FLOAT64 value that contains the value of the column identified by predicted_<label_column_name>.
  - For classification models, this is a FLOAT64 value that contains the logit value (also called the log-odds) for the predicted class. The predicted class probabilities are obtained by applying the softmax transformation to the logit values.
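The softmax transformation mentioned above can be sketched in Python (the logit values are invented for illustration; they are not real model output):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]   # one logit (log-odds) per class
probs = softmax(logits)
# The predicted class is the one with the largest logit; its softmax output
# corresponds to the value reported in the probability column.
print(probs)
```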
- approximation_error: a FLOAT64 value that contains one of the following:

  Exact attribution methods like Tree SHAP satisfy the following equation:

  $$\texttt{baseline_prediction_value} + \sum{\texttt{feature_attributions}} = \texttt{prediction_value}$$

  Because the feature attributions fully account for the difference between the baseline and the predicted value, there is no approximation error for these methods, and this column value is 0. Exact attribution methods are used for linear models, boosted tree models, and random forest models.

  Integrated gradients is an approximate attribution method whose approximation error is defined as follows:

  $$\frac{|\texttt{prediction_value} - \texttt{baseline_prediction_value} - \sum{\texttt{feature_attributions}}|}{|\texttt{prediction_value} - \texttt{baseline_prediction_value}|}$$

  For integrated gradients, this column value is greater than 0. The integrated gradients method is used with DNN models.
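The approximation_error formula can be worked through in Python (all prediction, baseline, and attribution values below are hypothetical numbers, not real model output):

```python
def approximation_error(prediction_value, baseline_prediction_value, attributions):
    # The error is the share of the prediction-minus-baseline gap
    # that the feature attributions fail to explain.
    explained = sum(attributions)
    gap = prediction_value - baseline_prediction_value
    return abs(gap - explained) / abs(gap)

# Exact method (e.g. Tree SHAP): attributions sum to the full gap -> error 0.
exact = approximation_error(7.0, 2.0, [3.0, 1.5, 0.5])
# Integrated gradients: attributions only approximate the gap -> error > 0.
approx = approximation_error(7.0, 2.0, [3.1, 1.4, 0.45])
print(exact, approx)
```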
Examples
The following examples assume that your model and input table are in your default project.
Explain a prediction generated by a linear regression model
The following example explains a prediction for a linear regression model by generating the top three attributions.
Assume a linear regression model stored in mydataset.mymodel was trained with the table mydataset.mytable, which has the following columns:
label
column1
column2
column3
column4
column5
SELECT *
FROM
  ML.EXPLAIN_PREDICT(
    MODEL `mydataset.mymodel`,
    (
      SELECT label, column1, column2, column3, column4, column5
      FROM `mydataset.mytable`),
    STRUCT(3 AS top_k_features))
Explain a prediction generated by a boosted tree or a random forest binary classification model
The following example explains a prediction generated by a boosted tree or a random forest binary classification model. It generates the top three attributions with a custom threshold.
Assume a boosted tree or a random forest binary classification model stored in mydataset.mymodel was trained with the table mydataset.mytable, which has the following columns:
label
column1
column2
column3
column4
column5
SELECT *
FROM
  ML.EXPLAIN_PREDICT(
    MODEL `mydataset.mymodel`,
    (
      SELECT label, column1, column2, column3, column4, column5
      FROM `mydataset.mytable`),
    STRUCT(3 AS top_k_features, 0.7 AS threshold))
Explain a prediction generated by a DNN classifier model
The following example explains a prediction generated by a DNN classifier model.
Assume a DNN classifier model stored in mydataset.mymodel was trained with the table mydataset.mytable, which has the following columns:
label
column1
column2
column3
column4
column5
SELECT *
FROM
  ML.EXPLAIN_PREDICT(
    MODEL `mydataset.mymodel`,
    (
      SELECT label, column1, column2, column3, column4, column5
      FROM `mydataset.mytable`),
    STRUCT(3 AS top_k_features, 30 AS integrated_gradients_num_steps))
What's next
- For information about Explainable AI, see BigQuery Explainable AI overview.
- For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.