Probabilistic Deep Learning for Wind Turbines

How to apply Gaussian Processes on big data

Michael Berk
Towards Data Science
7 min read · Nov 1, 2021


Model speed can be a deal breaker on large datasets. Leveraging an empirical study, we will look at two dimensionality reduction techniques and how they can be applied to Gaussian Processes.

Figure 1: overview of the method. CNN is a convolutional neural net and GPR/VGPR are different Gaussian Process Regressions. Image by author.

Regarding implementation of the method, anyone familiar with the basics of conditional probability can develop a Gaussian Process model. However, to fully leverage the capabilities of the framework, a fair amount of in-depth knowledge is required. Gaussian Processes are also not very computationally efficient, but their flexibility makes them a common choice for niche regression problems.

Without further ado, let’s dive in.

Technical TLDR

Gaussian Processes (GPs) are non-parametric Bayesian models. They look to model the covariance between data points, which makes them impractical for datasets larger than roughly 10,000 data points. However, their flexibility is unparalleled.

To run GPs on large datasets, we outline two dimension reduction methods. The first is a convolutional neural network that learns latent features; the cited paper reduced 2501 features down to 4, which made the problem computationally tractable. From there, the authors leveraged a Sparse Gaussian Process, which reduced the number of data points needed and further improved model speed.

Overall, the exact Gaussian Process achieved the highest accuracy, but the Sparse Gaussian Process exhibited similar accuracy with a much faster run time.

Here’s the paper.

But what’s actually going on?

Let’s slow down a bit and discuss how the feature reduction methods work.

1 — Background

When creating wind farms, which are just arrays of wind turbines, placement of the turbines is very important. Air deflected by upstream turbines (their wake) can dramatically impact the efficiency of downstream turbines. To optimize this configuration, we turn to computational fluid dynamics.

Photo by Nicholas Doherty on Unsplash

Most fluid (air, water, etc.) simulations rely both on empirical data and physics equations. However, many datasets contain large volumes of noisy and complex data. In our case, the dataset cited in the paper was collected from a wind farm in Texas over 2 years and contained 2501 features for each turbine. The number of rows was not disclosed.

If we are looking to develop real-time optimizations, we’re going to have to simplify our data. And that’s where autoencoders come in.

2.1 — Autoencoders

Autoencoders are a subset of the encoder-decoder framework where the output has the same structure as the input: the network learns to compress the data and then reconstruct it. In simple terms, we’re looking to perform two steps.

Figure 2: autoencoder framework. The key is that data input and output structure is the same. Image by author.

First, we encode our data into a small number of latent variables. These latent variables can then be used to train our model which can dramatically reduce training time. In the paper, the authors were able to encode 2501 raw features to 4 latent features.

Second, we decode those latent variables back to the original structure of our data. We need this step because we can’t work with latent variables — we must have real predictions.

Note that here we’re using autoencoders for feature reduction, but there are many other use cases, some of which include anomaly detection and data “cleaning”.
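
As a rough illustration, here is a minimal dense autoencoder sketch in Keras. This is not the paper’s architecture; the hidden layer size and variable names are assumptions. It compresses 2501 raw features into 4 latent features and reconstructs them:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Encoder: 2501 raw features -> 4 latent features (layer sizes are illustrative).
inputs = tf.keras.Input(shape=(2501,))
h = layers.Dense(256, activation="relu")(inputs)
latent = layers.Dense(4, name="latent")(h)

# Decoder: 4 latent features -> reconstruction of the original 2501 features.
h = layers.Dense(256, activation="relu")(latent)
outputs = layers.Dense(2501)(h)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, latent)
autoencoder.compile(optimizer="adam", loss="mse")

# autoencoder.fit(X, X, epochs=50, batch_size=64)  # X: (n_samples, 2501)
# X_latent = encoder.predict(X)                    # (n_samples, 4) features for the GP
```

Note that the model is trained to reproduce its own input, so the reconstruction loss is what forces the 4-dimensional bottleneck to retain as much information as possible.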

2.2 — Convolutional Autoencoders

One of the most popular autoencoder frameworks leverages convolutional neural nets (CNNs).

Figure 3: convolution example — src. Image by author.

CNNs are neural nets where a “window” (kernel) is repeatedly moved over the data. They are very popular for image classification because they are effective and easy to understand, and they remain effective when scaling beyond two-dimensional data.

We apply a CNN autoencoder framework to our wind dataset and end up with a significantly smaller set of features. We are now ready to fit the “real” model.
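
To make this concrete, here is a sketch of what a 1-D convolutional encoder could look like, treating each turbine’s 2501 features as a one-dimensional signal. Again, this is not the paper’s exact architecture; the kernel sizes, filter counts, and decoder are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Convolutional encoder: slide small kernels over the 2501-value signal.
inputs = tf.keras.Input(shape=(2501, 1))
x = layers.Conv1D(16, kernel_size=5, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(pool_size=4)(x)
x = layers.Conv1D(8, kernel_size=5, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
latent = layers.Dense(4, name="latent")(x)

# Simple dense decoder back to the original shape (a mirrored conv stack also works).
x = layers.Dense(2501)(latent)
outputs = layers.Reshape((2501, 1))(x)

conv_autoencoder = Model(inputs, outputs)
conv_autoencoder.compile(optimizer="adam", loss="mse")
```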

3 — Gaussian Process Regression

In this section we will provide a high-level explanation of Gaussian Process (GP) models. The paper also implemented a multilayer perceptron and active learning models, but both showed lower accuracy or computational efficiency relative to the GP models.

The Most Accurate Method — Exact Gaussian Process

Exact Gaussian Processes are non-parametric Bayesian models. We start with a prior assumption about our data, then leverage covariances between data points to update that assumption. In the end, we wind up with an estimate of the probability of our dependent variable (Y) given our latent features (X). This probability is also called the posterior probability.

Figure 4: Bayes theorem broken down by components. Image by author.

For our example application, we’re looking to find the probability that we observe wind turbine flow given our 4 latent features. That entire probability is broken down by Bayes theorem in figure 4.
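
Written out, with Y the turbine flow and X the 4 latent features, the theorem in figure 4 reads:

p(Y | X) = p(X | Y) · p(Y) / p(X)

where p(Y | X) is the posterior, p(X | Y) the likelihood, p(Y) the prior, and p(X) the evidence (a normalizing constant).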

Each of the three components on the right can be estimated; however, the main component that requires some engineering is the prior.

The prior probability distribution, often just called the “prior”, is the probability distribution of our Y variable prior to looking at our X values. To calculate this baseline, we just assume it’s normally distributed and estimate a mean and standard deviation using our Y’s.

Estimation of the other two components is out of the scope of this post, but check the comments for some useful links.

From a performance standpoint, exact Gaussian Processes have a time complexity of O(n³), where n is the number of rows in our data. For all intents and purposes, this bounds our dataset size to ~10,000 data points. Below we outline a method that reduces the runtime complexity to O(nm²), where m is the number of inducing points, a small set of representative data points derived from our original dataset.
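
For reference, a minimal exact GP fit looks roughly like the sketch below. It uses scikit-learn rather than the paper’s implementation, and the data is a synthetic stand-in for the 4 latent features and the flow measurements:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in data: 500 rows of 4 latent features and a noisy target.
rng = np.random.default_rng(0)
X_latent = rng.normal(size=(500, 4))
y_flow = X_latent @ rng.normal(size=4) + 0.1 * rng.normal(size=500)

# RBF kernel for smooth covariance between points, plus a noise term.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)  # O(n^3) fit
gpr.fit(X_latent, y_flow)

# Predictions come with uncertainty estimates (posterior mean and std).
y_mean, y_std = gpr.predict(X_latent, return_std=True)
```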

The Fastest Method — Sparse Gaussian Process

To improve the speed of the model fit, the authors implemented a Sparse Gaussian Process (SGP).

In short, SGPs leverage a set of m inducing points that best approximate our observed data, and we then fit our model using only those points. This comes at some cost to accuracy: while the change is highly dependent upon the dataset, for the wind turbine dataset the models built on the 4 latent features exhibited decreases in accuracy of 7%, 20%, 14%, and 5%, respectively.

So, if the purpose of these models is real-time optimization, a Sparse Gaussian Process may be a solid alternative.

Let’s quickly look at how these m data points are chosen. Unfortunately, the data referenced in the paper is not publicly available, so we will create our own.

Figure 5: training set of 1,000 data points (blue) that roughly follow our latent data-generating function (black)— src. Image by author.

In figure 5 above, we can see our training data represented by blue X’s. These are generated using the latent function (the solid black line) and some random noise. Out of the 1000 training points, our goal is to estimate m=30 points that best approximate our training data.

By only fitting with 30 points, we hope to significantly reduce runtime complexity without sacrificing accuracy. The optimal 30 points are shown below in figure 6.

Figure 6: optimal inducing variables where m=30— src. Image by author.

Without getting too technical, we look to describe our m data points using their mean and covariance. After finding optimal values of each, we can then sample 30 data points from a normal distribution N(mean_m, cov_m). If you want a deep dive, here’s a terrific resource.

Now that we have much fewer data points, we can fit our model much faster without sacrificing too much accuracy.
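
The paper’s code is not public, but a sparse GP with m = 30 inducing points can be sketched with GPflow, one of several libraries that implement this. The data below is synthetic, loosely mirroring figure 5:

```python
import numpy as np
import gpflow

# Synthetic 1-D data roughly like figure 5: 1,000 noisy points around a latent function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
Y = np.sin(X) + 0.3 * rng.normal(size=(1000, 1))

# 30 inducing points, initialized evenly; their locations are optimized during training.
Z = np.linspace(0, 10, 30).reshape(-1, 1)

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)

# Optimizing the loss moves the inducing points (and kernel parameters) so that
# m = 30 points summarize the full 1,000-point dataset: O(n·m²) instead of O(n³).
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

mean, var = model.predict_f(np.linspace(0, 10, 100).reshape(-1, 1))
```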

Finally, here’s a fun animation of the optimization process.

Figure 7: animation of a sparse gaussian process optimization — src. Image by author.

And there you have it — a high level overview of GPs and how to apply them to latent features derived with a CNN.

Summary

In this post we discussed how to simplify computationally intractable models. We first used a convolutional neural net to find latent features, which reduced the number of columns in our dataset. We then fit a Sparse Gaussian Process to reduce the number of rows in our dataset.

The exact Gaussian Process exhibited the highest accuracy; however, in cases where real-time optimization is necessary, the sparse option may be better.

Thanks for reading! I’ll be writing 29 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.
