Monitor health, resource utilization, and jobs
As a BigQuery administrator, you can use resource charts to monitor your organization's health, slots use, and jobs performance over time. The following charts are available:
-
Monitor operational health of BigQuery. BigQuery real-time operational health monitoring is a centralized monitoring system that lets you observe BigQuery usage across the organization in multiple locations.
-
View administrative resource utilization. Primary use cases for administrative resource charts include root-cause analysis for capacity planning based on historical resource utilization and diagnosing performance changes.
To view the operational health dashboard, you must have at least one reservation which you access with an administration project. Resource Charts are available to all users with the right permissions.
Required roles
To get the permissions that you need to view all data in the administrative resource charts, ask your administrator to grant you the following IAM roles on your organization:
-
BigQuery Resource Viewer (
roles/bigquery.resourceViewer
) -
BigQuery Metadata Viewer (
roles/bigquery.metadataViewer
)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to view all data in the administrative resource charts. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to view all data in the administrative resource charts:
-
bigquery.jobs.listExecutionMetadata or bigquery.jobs.listAll
on the organization -
bigquery.reservationAssignments.list
on the reservation administration project -
bigquery.capacityCommitments.list
on the reservation administration project -
To see summary data for operational health:
bigquery.tables.get or bigquery.tables.list
on the organization -
To see partial data at the reservation level:
-
bigquery.reservations.list
on the reservation administration project -
bigquery.reservationAssignments.list
on the reservation administration project
-
-
To see data at project level:
bigquery.jobs.listAll
on the project
You might also be able to get these permissions with custom roles or other predefined roles.
Monitor operational health across an organization
The operational health dashboard displays key metrics for your organization and its reservations across all of the location where you have reservations. You can use this dashboard to monitor the following metrics:
- Slot usage
- Shuffle usage
- Job concurrency
- Errors
- Job duration
- Bytes processed
- Total storage
To monitor the operational health, follow these steps:
Go to the BigQuery page.
In the project menu, select the administration project that you used to purchase slots and create reservations.
In the navigation panel, select Administration > Monitoring. By default, the operational health chart displays summary metrics for all locations and all reservations.
To view more frequently updated data, you can toggle Live data. If Live data is enabled, then data is automatically refreshed every five minutes. If Live data is disabled, then the maximum data staleness is approximately one hour. The last update time for data is displayed.
To view the detailed view, select the values for which you want to view the detailed timeline charts:
- In the Select location list, select a specific region.
- In the Select reservation list, select the reservation.
- In the Select project list, select the project.
- In the Time range list, select a time range.
- To get more information from one of the key metrics charts, click Explore more. To get more information in Jobs explorer, click Explore more from the Top active queries chart and Insights table.
BigQuery provides this information by querying the following
INFORMATION_SCHEMA
views:
INFORMATION_SCHEMA.JOBS
INFORMATION_SCHEMA.JOBS_TIMELINE
INFORMATION_SCHEMA.RESERVATIONS
INFORMATION_SCHEMA.TABLE_STORAGE
Views
The Operational health tab displays the following views:
Summary view
The summary view provides a view of the health of subsystems including reservations and regions in the last 30 minutes for your organization.
To view the summary view, do the following:
- In the Operational Health chart, in the Select location list, select All locations.
Detailed view
The detailed view shows detailed timeline charts of different metrics at a location or a reservation level.
To view the detailed view, do the following:
- In the Operational Health chart, in the Select location list, select the region or edition.
You can further refine the data displayed in the detailed view using the following optional filters:
- In the Select reservation list, select the reservation name.
- In the Select project list, select the project.
Charts
The Operational health tab displays the following chart configuration options:
- Summary table (from Summary view only)
- Filter and search data
- Metric timeline charts (from Detailed view only)
- Error donut chart
- Top job list bar chart
- Insights table
Summary table
The summary tables display following metrics:
- Slot usage. The average slot utilization. The total slot capacity equals baseline capacity and autoscaled max capacity.
- Shuffle usage. The maximum of all average reservation shuffle usage ratio.
- Concurrency. The maximum number of jobs that ran simultaneously.
- Bytes processed. The total bytes processed by completed jobs.
- Job duration. The average run time of the jobs completed, calculated from the time when the job was created until the time when the job ended.
- Total storage. The total active logical bytes at the last updated time.
There are two summary tables presented in the view by location or reservations. Each table row represents one location or one reservation usage.
The tables display metrics for the 30 minutes prior to the last update. If Live data is enabled, then data is automatically refreshed every five minutes. If Live data is disabled, then the maximum data staleness is approximately one hour.
The table cell is color coded if the metric is higher than the threshold. All thresholds are predefined and can't be customized. Color code is applied when there is an increased number of performance insights for repeated jobs and higher than normal metric value. If the metric value over the 30 minutes prior to the last update is greater than the P99 value of the last week's usage for the same day, the table cell is marked with a dark blue background. If the metrics value over the 30 minutes is between the P95 - P99 value of the last week's usage for the same day, the table cell is marked with a light blue background.
Filter data
You can filter data in charts based on the following values:
- Location. Regions where the organization has had activity within the past 30 days are available. The summary view presents a region-wise summary table for tracking usage across all active regions.
- Reservation. This field is available only for detailed view. All reservations owned by the viewing administrative projects are available. You can only choose one reservation.
- Project. This field is available only for detailed view. All projects that have traffic in the organization are available. You can select only one project.
- Time range. This field is available only for detailed view. It affects the Metrics timeline charts and the Error donut chart in the detailed view. The data scale in the Metrics timeline charts is adjusted automatically according to the selected time range.
Top active queries chart
This chart shows the top ten jobs with active resources that are sorted in descending order. By using the drop-down menu, you can select a sorting option based upon slot usage or job. duration. The job ID and relevant resource usage number are presented in the bar chart. Select Explore more or the job ID in the top active queries chart to view more details in Jobs explorer. To learn more about the execution details and diagnose any performance issues for your BigQuery jobs, see query execution graph.
Error donut chart
This chart shows the proportion of the top causes of failure in the selected time period. In the summary view, it defaults to cover the last 30 minutes. In the detailed view, the time range selector controls its coverage. You can group the errors by type, owner project, or reservation. The count of failed jobs is presented in the donut chart.
Metric timeline charts
These charts display an overview of supported metrics across a maximum of 30 days. The default time period is one hour. As the selected time period changes, the granularity of each data point in the chart is scaled automatically.
These charts display an aggregated value over a region or a reservation. Displaying data for multiple regions or multiple reservations is not supported.
The Metric timeline charts support the following metrics:
- Slot usage. The average slot usage in the selected time period. It displays the overall slot usage from all jobs and failed jobs separately.
- Job concurrency. The maximum number of jobs that ran simultaneously in the selected time period. It displays the count of running jobs and pending jobs separately and is shown as a stacked bar chart.
- Bytes scanned. The total bytes processed by completed jobs in the selected time period.
- Shuffle. The maximum of all reservation average shuffle usage ratio in the selected time period.
- Job duration. The average run time of jobs completed in the selected time period, calculated from job creation time to job end time.
- Error by type. The Error donut chart shows the proportion of different causes of failure in the selected time period.
Besides the metric trends, the these charts display reference lines of the P95 and P99 metrics value of the last week's usage for the same day. For the Job concurrency chart, it shows the threshold for the sum of pending and running jobs. Those referenced values are used as the color coding thresholds in the summary table.
To learn more about the chart, click Explore more. You are redirected to the Resource Utilization tab.
Insights table
This table aggregates quota errors, access denied errors, and performance insights gathered during job executions within the selected time period. Insights are aggregated at reservation level. Each row provides information about the insight type, location, reservation, insight detail, and sample job IDs. Click job IDs to view more job details in Jobs explorer. In the summary view, the default observation timeframe covers the last 30 minutes. In the detailed view, the time range selector controls the specific time period for which observation data is displayed.
Monitor operational health for a single project
The project operational health dashboard is the default view you get if you don't have access to the entire organization, or your project doesn't own any reservation. This view can help project analysts monitor system health for their projects, much like the operational health dashboard at the organizational level, but it shows only project-level data in its charts and filters.
View resource utilization
Data in the administrative resource charts is based on the
INFORMATION_SCHEMA.JOBS
view and the
INFORMATION_SCHEMA.JOBS_TIMELINE
view,
which update in real time, going back a maximum of 30 days.
You can view your BigQuery resource utilization by using the administrative resource charts. These resource charts have the following three primary interactive components:
The event timeline chart displays an overview of data across a maximum of 30 days. The default is 1 day.
The main chart displays the selected metrics–slot usage, job concurrency, job performance, errors, or bytes processed–over time for the organization or the reservation administrative project. The chart includes a table legend that correlates with data in the chart, showing more detailed statistics aggregated over the time period.
A side panel to pick different charts, breakdowns, and filter options.
The following diagram provides an overview of various administrative chart options in BigQuery:
Before you begin
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
You can create a separate Google Cloud project reservation admin project to manage the reservation by giving it a descriptive name likebq-COMPANY_NAME-admin
. -
Make sure that billing is enabled for your Google Cloud project.
-
Enable the BigQuery Reservation API.
For more information, see Enable the BigQuery Reservation API.
In the Google Cloud console, view your slot quotas:
To purchase slots, you must have enough slot quota for the region in which you want to purchase slots.
If your slot quota for the region is less than the number of slots you want to purchase, see Request a quota increase.
View administrative resource charts
By default, you have access to Edition administrative resource charts when navigating from the reservation administrative project. You can toggle between the on-demand administrative resource charts and the Edition administrative resource charts from the reservation administrative project (Preview).
To view administrative resource charts, follow these steps:
Go to the BigQuery page.
In the project menu, select the administrative project that you used to purchase slots and create reservations.
In the navigation panel, select Administration > Monitoring.
In the Chart options > Scope section, you can select the Billing model from the list, which includes three types of Editions and on-demand. By default, the billing model is set to the Enterprise edition.
In the Chart options > Chart Configurations section, do the following:
- From the Chart list, select the chart type. For information about different chart options, see Chart options.
- From the Metrics list, select the available metrics type.
- From the Group by list, select the dimension by which you want to group the data in the chart. For information about different dimensions, see Group by options.
- From the Filters list, select the available filter options. For information about different filters, see Filters and search.
- Set the timeframe for which you want to view the data in the chart.
Chart configurations
You can adjust the view of your administrative resource charts by changing the following chart configuration options:
Chart options
BigQuery provides the following metric types to display in the charts:
Slot Usage. Average slot usage. If the Group by option is set to Overall, or if Filters are either None or a single reservation, then capacity and borrowed slot usage metrics are also displayed.
- Total Usage. Average slot usage in the selected time period.
- Borrowed Slot Usage. Average usage in excess of the reservation's capacity. This is usually idle slots borrowed from other reservations and commitments, or rounding imprecision in usage data.
- Commitment Capacity. The edition commitment size.
- Baseline Capacity. The reservation baseline size.
- Autoscale Capacity. The autoscaled slot size.
- Maximum Capacity. The maximum number of slots the user have access to.
Job Concurrency. Maximum number of jobs that ran simultaneously in the selected time period.
Job Performance. Average job run time of completed jobs, calculated from job creation time to job end time.
Failed Jobs. Average slot usage in the selected time period from jobs that failed.
Bytes Processed. Total bytes processed by completed jobs in the selected time period.
Shuffle Usage. Average shuffle usage ratio in the selected time period (preview).
Table component
The table displays metrics that are relevant to the time period and dimension that you selected in the administrative resource chart.
The Slot Usage chart displays the Average slot usage for all of the jobs that ran during the selected time period. Jobs that didn't finish within the selected time period include only slots used within the time period.
For the Job Performance chart, the table component displays the following metrics:
Number of jobs. The number of completed jobs ran or partially ran within the selected time period.
Average job duration. The average job duration for completed jobs that at least partially ran within the selected time period.
Median job duration. The median job duration for completed jobs that at least partially ran within the selected time period.
Min job duration. The minimum job duration for completed jobs that at least partially ran within the selected time period. This column is hidden by default, but you can enable it in Column display options.
Max job duration. The maximum job duration for completed jobs that at least partially ran within the selected time period. This column is hidden by default, but you can enable it in Column display options.
For the Failed Jobs chart, the table component displays the following data:
Average slot usage. Same as the Slot Usage chart, but only for failed jobs.
Total failed jobs. The number of jobs that failed within the selected time period.
Group by options
Based on the type of chart, you can group data in the chart view by several dimensions:
Overall. The overall view is a high-level overview of the selected metric over time across the entire organization. For slot usage, capacity and usage are shown on the same chart to give an idea of overall slot utilization over time.
Project. The project view shows the selected metric grouped by project. If an organization has more than 10 projects, then only the top 10 projects with the highest slot usage, job concurrency, job duration, or failed jobs in the selected time period are shown. For slot usage, all other projects are grouped into the Other category at the bottom of the chart for easier comparison to overall capacity.
All other group-by views are similar to the project view, with data grouped by the respective dimension.
Timeframe option
You can modify the time period in the following ways:
Select and drag the time period into the event timeline chart.
Select and drag the time period into the main chart.
The alignment period updates automatically as the selected timeframe changes. The smaller the alignment period, the more detailed the view. To better view resources that change frequently, for example the Slot Usage option, reduce the alignment period.
Filter and search
To narrow down the chart data, apply filters in the filter panel. Some filters are only available for certain charts. The Reservations, Folders, Projects, and Users filters are populated with the respective resources that have consumed slots in the selected timeframe. For example, if a project hasn't been used in the last 30 days, it does not appear in the project filter list.
The chart refreshes after you apply filters to show data within the selected parameters.
View project level administrative charts data
You can adjust the scope in the Chart options to generate an overview of administrative charts at the project level. This view can help project analysts monitor resource utilization for owned projects, similar to administrative charts at the organizational level, with the exception of some chart options and group by options that are not applicable. This view shows the overall resource utilization regardless of billing mode within the contextual project.
When you navigate to the administrative charts from a non-admin project with organization level permission(s), the administrative chart switches to the project level by default. You can switch back to the correct view using the Scope drop down in the Chart options.
Billing
Administrative resource charts are available at no additional cost. Queries used to populate these charts aren't billed and don't use slots in user-owned reservations. Queries that process too much data are timed out.
What's next
- Learn about reservations.
- Learn about purchasing slots.
- Learn how to estimate slot capacity requirements.
- Learn how to view slot recommendations and insight.