Time-Series Extreme Event Forecasting With Neural Networks at Uber
Classical time-series models, such as those found in the standard R forecast package (Hyndman & Khandakar, 2008), are popular methods to provide a univariate base-level forecast.

Our contributions are as follows:

• We propose a new LSTM-based architecture and train a single model using heterogeneous time-series.

• Experiments based on proprietary and public data are presented, showing the generalization and scalability power of the discussed model.

The rest of this paper is structured as follows: Section 2 provides a brief background on classical and neural network based time-series forecasting models. Section 3 describes the data and, more specifically, how it was constructed and preprocessed to be used as input to the LSTM model. Section 4 describes the architectural changes to our initial LSTM model. Sections 5 and 6 provide results and subsequent discussion.

¹ Uber Technologies, San Francisco, CA, USA. Correspondence to: Nikolay Laptev <[email protected]>, Jason Yosinski <[email protected]>, Li Erran Li <[email protected]>, Slawek Smyl <[email protected]>.

ICML 2017 Time Series Workshop, Sydney, Australia. Copyright 2017 by the author(s).

Figure 1. Real-world time-series examples: (a) creating an input for the model requires two sliding windows, one for x and one for y; (b) a scaled sample input to our model.

Creating a training dataset requires a sliding window X (input) and Y (output) of, respectively, the desired look-back and forecast horizon. X and Y are comprised of (batch, time, features). See Figure 1 (a) for an example of X and Y.
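To make the windowing concrete, the construction above can be sketched as follows. This is a minimal NumPy illustration; the helper name and the window sizes are ours, not from the production pipeline.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slide a window over a 1-D series to build (batch, time, features) tensors.

    X holds `lookback` past points; Y holds the next `horizon` points.
    """
    X, Y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        Y.append(series[i + lookback:i + lookback + horizon])
    # Add a trailing feature axis: (batch, time) -> (batch, time, features=1)
    X = np.asarray(X)[..., np.newaxis]
    Y = np.asarray(Y)[..., np.newaxis]
    return X, Y

series = np.arange(10.0)
X, Y = make_windows(series, lookback=4, horizon=2)
# X.shape == (5, 4, 1), Y.shape == (5, 2, 1)
```

Per-window normalization (or de-trending, as discussed in Section 3) would be applied to each X window before training.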
2. Background

Extreme event prediction has become a popular topic for estimating peak electricity demand, traffic jam severity and surge pricing for ride sharing, among other applications (Friederichs & Thorarinsdottir, 2012). In fact, there is a branch of statistics known as extreme value theory (EVT) (de Haan & Ferreira, 2006) that deals directly with this challenge. To address the peak forecasting problem, univariate time-series and machine learning approaches have been proposed.

While univariate time-series approaches directly model the temporal domain, they suffer from a frequent retraining requirement (Ye & Keogh, 2009). Machine learning models are often used in conjunction with univariate time-series models, resulting in a bulky two-step process for addressing the extreme event forecasting problem (Opitz, 2015). LSTMs, like traditional time-series approaches, can model the temporal domain well while also modeling nonlinear feature interactions and residuals (Assaad et al., 2008).

We found that the vanilla LSTM model's performance is worse than our baseline. Thus, we propose a new architecture that leverages an autoencoder for feature extraction, achieving superior performance compared to our baseline.

3. Data

At Uber we have anonymized access to rider and driver data from hundreds of cities. While we have a plethora of data, challenges arise due to the data sparsity found in new cities and for special events. To circumvent the lack of data we use additional features, including weather information (e.g., precipitation, wind speed, temperature) and city-level information (e.g., current trips, current users, local holidays). An example of a raw dataset is shown in Figure 1 (b).

Neural networks are sensitive to unscaled data (Hochreiter & Schmidhuber, 1997), therefore we normalize every mini-batch. Furthermore, we found that de-trending the data, as opposed to de-seasoning, produces better results.

4. Modeling

In this section we first present the strategy used for uncertainty computation in our model, and then, in Section 4.2, we propose a new scalable neural network architecture for time-series forecasting.

4.1. Uncertainty estimation

The extreme event problem is probabilistic in nature, and robust uncertainty estimation in neural network based time-series forecasting is therefore critical. A number of approaches exist for uncertainty estimation, ranging from Bayesian methods to those based on bootstrap theory (Gal, 2016). In our work we combine the Bootstrap and Bayesian approaches to produce a simple, robust and tight uncertainty bound with good coverage and provable convergence properties (Li & Maddala, 1996).

Listing 1. Practical implementation of estimating the uncertainty bound

    vals = []
    for r in range(100):
        vals.append(model.eval(input, dropout=random(0, 1)))
    mean = np.mean(vals)
    var = np.var(vals)

The implementation of this approach is extremely simple and practical (see Listing 1). Figures 2 (a) and (b) describe the uncertainty derivation and the underlying model used.

Figure 2. Model and forecast uncertainty: (a) model and forecast uncertainty derivation; (b) model uncertainty is estimated via the architecture on the left, while the forecast uncertainty is estimated via the architecture on the right.

The uncertainty calculation above is included for completeness of the proposed end-to-end forecasting model and can be replaced by other uncertainty measures. We leave the discussion of the approximation bound, the comparison with other methods (Kendall & Gal, 2017) and other detailed uncertainty experiments for a longer version of the paper.

4.2. Heterogeneous forecasting with a single model

It is impractical to train a model per time-series for millions of metrics. Furthermore, training a single vanilla LSTM does not produce competitive results. Thus, we propose a novel model architecture that provides a single model for heterogeneous forecasting. As Figure 3 (b) shows, the model first primes the network by automatic feature extraction, which is critical to capture complex time-series dynamics during special events at scale. This is contrary to standard feature extraction methods, where the features are manually derived; see Figure 3 (a). Feature vectors are then aggregated via an ensemble technique (e.g., averaging or other methods). The final vector is then concatenated with the new input and fed to the LSTM forecaster for prediction. Using this approach, we achieved an average 14.09% improvement over the multilayer LSTM model trained over a set of raw inputs.

Note that there are different ways to include the extra features produced by the auto-encoder in Figure 3 (b). The extra features can be included by extending the input size or by increasing the depth of the LSTM forecaster in Figure 3 (b), thereby removing the LSTM auto-encoder. Having a separate auto-encoder module, however, produced better results in our experience. Other details on design choices are left for the longer version of the paper.

5. Results

This section provides empirical results of the described model for special-event and general time-series forecasting accuracy. Training was conducted using an AWS GPU instance with Tensorflow¹. Unless otherwise noted, SMAPE was used as the forecast error metric, defined as SMAPE = (100/n) Σ_{t=1}^{n} |ŷ_t − y_t| / ((|ŷ_t| + |y_t|)/2). The described production neural network model was trained on thousands of time-series with thousands of data points each.

¹ In production, the learned weights and the Tensorflow graph were exported into an equivalent target language.

5.1. Special Event Forecasting Accuracy

A five-year daily history of completed trips across the top US cities in terms of population was used to provide forecasts across all major US holidays. Figure 4 shows the average SMAPE with the corresponding uncertainty. The uncertainty is measured as the coefficient of variation, defined as c_v = σ/μ. We find that one of the hardest holidays for predicting expected Uber trips is Christmas Day, which corresponds to the greatest error and uncertainty. The longer version of the paper will contain a more detailed error and uncertainty evaluation per city. The results presented show a 2%-18% forecast accuracy improvement compared to the current proprietary method comprising a univariate time-series and machine learned model.

5.2. General Time-Series Forecasting Accuracy

This section describes the forecasting accuracy of the trained model on general time-series. Figure 5 shows the forecasting performance of the model on new time-series relative to the current proprietary forecasting solution. Note that we train a single neural network, compared to the per-query training requirement of the proprietary model. Similar preprocessing to that described in Section 3 was applied to each time-series. Figure 6 shows the performance of the same model on the public M3 benchmark, consisting of ≈ 1500 monthly time-series (Makridakis & Hibon, 2000). Both experiments indicate an exciting opportunity in the time-series field: a single generic neural network model capable of producing high-quality forecasts for heterogeneous time-series, relative to specialized classical time-series models.
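For reference, the SMAPE metric used for evaluation above can be computed as in the following minimal sketch; the helper name is ours, and the call at the bottom is purely illustrative.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    SMAPE = (100/n) * sum(|yhat_t - y_t| / ((|yhat_t| + |y_t|) / 2)).
    Assumes no point has |y_t| + |yhat_t| == 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

# Illustrative call: smape([100, 200], [110, 180]) is roughly 10.03
```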
Figure 3. (a) Classical time-series features that are manually derived (Hyndman et al., 2015). (b) An auto-encoder can provide a powerful feature extraction used for priming the neural network.
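To illustrate the data flow of Figure 3 (b), the priming step can be sketched as below. This is a hypothetical NumPy mock-up, not Uber's implementation: the real model uses an LSTM auto-encoder as the feature extractor, which we stub out here with simple summary statistics.

```python
import numpy as np

def encode(window):
    # Stand-in for the LSTM auto-encoder's feature extractor:
    # summarize each window with a few simple statistics.
    return np.array([window.mean(), window.std(), window.max() - window.min()])

def prime_input(history_windows, new_input):
    """Average the per-window feature vectors (the ensemble step) and
    concatenate the result with the new input for the forecaster."""
    features = np.stack([encode(w) for w in history_windows])
    aggregated = features.mean(axis=0)  # e.g., averaging
    return np.concatenate([new_input, aggregated])

windows = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])]
x_new = np.array([5.0, 6.0])
primed = prime_input(windows, x_new)
# primed has len(new_input) + 3 entries and would be fed to the LSTM forecaster
```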
Figure 5. Forecasting errors for production queries relative to the current proprietary model.

6. Discussion

We have presented an end-to-end neural network architecture for special event forecasting at Uber. We have shown its performance and scalability on Uber data. Finally, we have demonstrated the model's general forecasting applicability on Uber data and on the M3 public monthly data.

From our experience, there are three criteria for picking a neural network model for time-series: (a) the number of time-series, (b) the length of the time-series and (c) the correlation among the time-series. If (a), (b) and (c) are high, then the neural network might be the right choice; otherwise, a classical time-series approach may work best.

Our future work will be centered around utilizing the uncertainty information for neural network debugging and performing further research towards a general forecasting model for heterogeneous time-series forecasting and feature extraction, with similar use-cases as the generic ImageNet model used for general image feature extraction and classification (Deng et al., 2009).
de Haan, L. and Ferreira, A. Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering, 2006.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

Ye, L. and Keogh, E. Time series shapelets: A new primitive for data mining. In KDD. ACM, 2009.