
Time-series Extreme Event Forecasting with Neural Networks at Uber

Nikolay Laptev 1 Jason Yosinski 1 Li Erran Li 1 Slawek Smyl 1

Abstract

Accurate time-series forecasting during high variance segments (e.g., holidays) is critical for anomaly detection, optimal resource allocation, budget planning and other related tasks. At Uber, accurate prediction of completed trips during special events can lead to more efficient driver allocation, resulting in decreased wait times for riders.

State of the art methods for handling this task often rely on a combination of univariate forecasting models (e.g., Holt-Winters) and machine learning methods (e.g., random forest). Such a system, however, is hard to tune and scale, and it is difficult to add exogenous variables to it.

Motivated by the recent resurgence of Long Short Term Memory networks, we propose a novel end-to-end recurrent neural network architecture that outperforms the current state of the art event forecasting methods on Uber data and generalizes well to a public M3 dataset used for time-series forecasting competitions.

1. Introduction

Accurate demand time-series forecasting during high variance segments (e.g., holidays, sporting events) is critical for anomaly detection, optimal resource allocation, budget planning and other related tasks. This problem is challenging because extreme event prediction depends on numerous external factors that can include weather, city population growth or marketing changes (e.g., driver incentives) (Horne & Manzenreiter, 2004).

Classical time-series models, such as those found in the standard R forecast package (Hyndman & Khandakar, 2008), are popular methods for providing a univariate base-level forecast. To incorporate exogenous variables, a machine learning approach, often based on a Quantile Random Forest (Meinshausen, 2006), is employed. This state of the art approach is effective at accurately modeling special events; however, it is not flexible and does not scale, due to its high retraining frequency.

Classical time-series models usually require manual tuning to set seasonality and other parameters. Furthermore, while there are time-series models that can incorporate exogenous variables (Wei, 1994), they suffer from the curse of dimensionality and require frequent retraining. To deal with exogenous variables more effectively, a combination of univariate modeling and a machine learned model for handling residuals was introduced in (Opitz, 2015). The resulting two-stage model, however, is hard to tune and requires manual feature extraction and frequent retraining, which is prohibitive for millions of time-series.

Relatively recently, time-series modeling based on the Long Short Term Memory (LSTM) technique (Hochreiter & Schmidhuber, 1997) gained popularity due to its end-to-end modeling, ease of incorporating exogenous variables and automatic feature extraction abilities (Assaad et al., 2008). Given a large amount of data across numerous dimensions, it was shown that an LSTM approach can model complex nonlinear feature interactions (Ogunmolu et al., 2016), which is critical for modeling complex extreme events.

Our initial LSTM implementation did not show superior performance relative to the state of the art approach described above. In Section 4 we discuss the key changes to our initial LSTM architecture that were required to achieve good performance at scale for single-model, heterogeneous time-series forecasting.

This paper makes the following contributions:

• We propose a new LSTM-based architecture and train a single model using heterogeneous time-series.

• Experiments based on proprietary and public data are presented, showing the generalization and scalability power of the discussed model.

The rest of this paper is structured as follows: Section 2 provides a brief background on classical and neural network based time-series forecasting models. Section 3 describes the data and, more specifically, how it was constructed and preprocessed to be used as input to the LSTM model.

1 Uber Technologies, San Francisco, CA, USA. Correspondence to: Nikolay Laptev <[email protected]>, Jason Yosinski <[email protected]>, Li Erran Li <[email protected]>, Slawek Smyl <[email protected]>.

ICML 2017 Time Series Workshop, Sydney, Australia. Copyright 2017 by the author(s).

Figure 1. Real-world time-series examples. (a) Creating an input for the model requires two sliding windows, for x and for y. (b) A scaled sample input to our model.
Section 4 describes the architectural changes to our initial LSTM model. Sections 5 and 6 provide results and subsequent discussion.

2. Background

Extreme event prediction has become a popular topic for estimating peak electricity demand, traffic jam severity and surge pricing for ride sharing, among other applications (Friederichs & Thorarinsdottir, 2012). In fact, there is a branch of statistics known as extreme value theory (EVT) (de Haan & Ferreira, 2006) that deals directly with this challenge. To address the peak forecasting problem, univariate time-series and machine learning approaches have been proposed.

While univariate time-series approaches directly model the temporal domain, they suffer from a frequent retraining requirement (Ye & Keogh, 2009). Machine learning models are often used in conjunction with univariate time-series models, resulting in a bulky two-step process for addressing the extreme event forecasting problem (Opitz, 2015). LSTMs, like traditional time-series approaches, can model the temporal domain well, while also modeling nonlinear feature interactions and residuals (Assaad et al., 2008).

We found that the vanilla LSTM model's performance is worse than our baseline. Thus, we propose a new architecture that leverages an autoencoder for feature extraction, achieving superior performance compared to our baseline.

3. Data

At Uber we have anonymized access to rider and driver data from hundreds of cities. While we have a plethora of data, challenges arise due to the data sparsity found in new cities and for special events. To circumvent the lack of data, we use additional features including weather information (e.g., precipitation, wind speed, temperature) and city level information (e.g., current trips, current users, local holidays). An example of a raw dataset is shown in Figure 1 (b).

Creating a training dataset requires two sliding windows, X (input) and Y (output), of, respectively, the desired look-back and forecast horizon. X and Y are comprised of (batch, time, features) dimensions. See Figure 1 (a) for an example of X and Y.

Neural networks are sensitive to unscaled data (Hochreiter & Schmidhuber, 1997); therefore we normalize every mini-batch. Furthermore, we found that de-trending the data, as opposed to de-seasoning, produces better results. A sketch of this preprocessing step is shown below.
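The following is a minimal sketch of the windowing and per-mini-batch scaling just described. The look-back of 28 days, the horizon of 7 days, and the helper names are illustrative assumptions; the paper does not specify its exact preprocessing code.

    import numpy as np

    def make_windows(series, lookback=28, horizon=7):
        # Slide two aligned windows over one series to build (X, Y) pairs.
        X, Y = [], []
        for i in range(len(series) - lookback - horizon + 1):
            X.append(series[i : i + lookback])
            Y.append(series[i + lookback : i + lookback + horizon])
        # Add a trailing feature axis: (batch, time, features).
        return np.asarray(X)[..., None], np.asarray(Y)[..., None]

    def normalize(X, eps=1e-8):
        # Scale each window by its own mean and standard deviation,
        # mimicking the per-mini-batch normalization described above.
        mu = X.mean(axis=1, keepdims=True)
        sigma = X.std(axis=1, keepdims=True)
        return (X - mu) / (sigma + eps)

    series = np.sin(np.arange(365) / 7.0)  # toy daily series
    X, Y = make_windows(series)
    X = normalize(X)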
4. Modeling

In this section we first present the strategy used for uncertainty computation in our model; then, in Section 4.2, we propose a new scalable neural network architecture for time-series forecasting.

4.1. Uncertainty estimation

The extreme event problem is probabilistic in nature, and robust uncertainty estimation in neural network based time-series forecasting is therefore critical. A number of approaches exist for uncertainty estimation, ranging from Bayesian methods to those based on bootstrap theory (Gal, 2016). In our work we combine the Bootstrap and Bayesian approaches to produce a simple, robust and tight uncertainty bound with good coverage and provable convergence properties (Li & Maddala, 1996).

Listing 1. Practical implementation of estimating the uncertainty bound
    vals = []
    for r in range(100):
        vals.append(model.eval(input, dropout=random(0, 1)))
    mean = np.mean(vals)
    var = np.var(vals)
The implementation of this approach is extremely simple and practical (see Listing 1). Figures 2 (a) and (b) describe the uncertainty derivation and the underlying model used. A concrete version of Listing 1 is sketched below.
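For concreteness, the sketch below implements the same Monte Carlo dropout idea against tf.keras. The framework choice, the toy network, and the fixed dropout rate are our assumptions (the paper's listing additionally randomizes the dropout rate on every pass); treat it as an illustration, not the production implementation.

    import numpy as np
    import tensorflow as tf

    # A toy forecaster with dropout; the 28-step look-back is arbitrary.
    inputs = tf.keras.Input(shape=(28, 1))
    h = tf.keras.layers.LSTM(32)(inputs)
    h = tf.keras.layers.Dropout(0.5)(h)
    outputs = tf.keras.layers.Dense(1)(h)
    model = tf.keras.Model(inputs, outputs)

    x = np.random.randn(1, 28, 1).astype("float32")
    # training=True keeps dropout stochastic at prediction time (MC dropout),
    # so repeated forward passes yield a distribution over forecasts.
    vals = np.stack([model(x, training=True).numpy().ravel() for _ in range(100)])
    mean, var = vals.mean(axis=0), vals.var(axis=0)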

Figure 2. Model and forecast uncertainty. (a) Model and forecast uncertainty derivation. (b) Model uncertainty is estimated via the architecture on the left, while the forecast uncertainty is estimated via the architecture on the right.
The uncertainty calculation above is included for completeness of the proposed end-to-end forecasting model and can be replaced by other uncertainty measures. We leave the discussion of the approximation bound, the comparison with other methods (Kendall & Gal, 2017) and other detailed uncertainty experiments for a longer version of the paper.
4.2. Heterogeneous forecasting with a single model

It is impractical to train a model per time-series for millions of metrics. Furthermore, training a single vanilla LSTM does not produce competitive results. Thus, we propose a novel model architecture that provides a single model for heterogeneous forecasting. As Figure 3 (b) shows, the model first primes the network by automatic feature extraction, which is critical for capturing complex time-series dynamics during special events at scale. This is contrary to standard feature extraction methods, where the features are manually derived, see Figure 3 (a). Feature vectors are then aggregated via an ensemble technique (e.g., averaging or other methods). The final vector is then concatenated with the new input and fed to the LSTM forecaster for prediction. Using this approach, we have achieved an average 14.09% improvement over a multilayer LSTM model trained over a set of raw inputs.

Note that there are different ways to include the extra features produced by the auto-encoder in Figure 3 (b). The extra features can be included by extending the input size, or by increasing the depth of the LSTM Forecaster in Figure 3 (b) and thereby removing the LSTM auto-encoder. Having a separate auto-encoder module, however, produced better results in our experience. Other details on design choices are left for the longer version of the paper. A rough sketch of this wiring is given below.
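The sketch below is one possible tf.keras wiring of the Figure 3 (b) design, under our own assumptions about layer sizes, the bottleneck width, and the averaging ensemble; the paper leaves these design details to its longer version.

    import tensorflow as tf

    T, F = 28, 1  # look-back length and feature count (illustrative values)

    # LSTM auto-encoder: the bottleneck state acts as an automatic feature vector.
    window = tf.keras.Input(shape=(T, F))
    feat = tf.keras.layers.LSTM(16)(window)
    recon = tf.keras.layers.RepeatVector(T)(feat)
    recon = tf.keras.layers.LSTM(16, return_sequences=True)(recon)
    recon = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(F))(recon)
    autoencoder = tf.keras.Model(window, recon)   # trained on reconstruction loss
    encoder = tf.keras.Model(window, feat)        # reused to prime the forecaster

    # Forecaster: feature vectors (e.g., averaged across several windows) are
    # concatenated with the new input window and fed to an LSTM forecaster.
    new_input = tf.keras.Input(shape=(T, F))
    features = tf.keras.Input(shape=(16,))
    tiled = tf.keras.layers.RepeatVector(T)(features)   # align features with time
    merged = tf.keras.layers.Concatenate()([new_input, tiled])
    h = tf.keras.layers.LSTM(32)(merged)
    forecast = tf.keras.layers.Dense(1)(h)
    forecaster = tf.keras.Model([new_input, features], forecast)

Averaging the encoder outputs of several input windows before the concatenation is the simplest of the ensemble choices mentioned in the text.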
5. Results

This section provides empirical results of the described model for both special event and general time-series forecasting accuracy. Training was conducted on an AWS GPU instance with Tensorflow¹. Unless otherwise noted, SMAPE was used as the forecast error metric, defined as SMAPE = (100/n) · Σ_t |ŷ_t − y_t| / ((|ŷ_t| + |y_t|)/2). The described production Neural Network Model was trained on thousands of time-series with thousands of data points each.

¹ In production, the learned weights and the Tensorflow graph were exported into an equivalent target language.
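In code, the SMAPE metric above reads as follows (a vectorized numpy sketch of the stated formula; the function name is ours):

    import numpy as np

    def smape(y_hat, y):
        # Symmetric mean absolute percentage error, in percent.
        return 100.0 * np.mean(np.abs(y_hat - y) / ((np.abs(y_hat) + np.abs(y)) / 2.0))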
5.1. Special Event Forecasting Accuracy

A five year daily history of completed trips across the top US cities in terms of population was used to provide forecasts across all major US holidays. Figure 4 shows the average SMAPE with the corresponding uncertainty. The uncertainty is measured as the coefficient of variation, defined as cv = σ/μ. We find that one of the hardest holidays for predicting expected Uber trips is Christmas day, which corresponds to the greatest error and uncertainty. The longer version of the paper will contain a more detailed error and uncertainty evaluation per city. The results presented show a 2%-18% forecast accuracy improvement compared to the current proprietary method comprising a univariate time-series model and a machine learned model.

5.2. General Time-Series Forecasting Accuracy

This section describes the forecasting accuracy of the trained model on general time-series. Figure 5 shows the forecasting performance of the model on new time-series relative to the current proprietary forecasting solution. Note that we train a single Neural Network, compared to the per-query training requirement of the proprietary model. Preprocessing similar to that described in Section 3 was applied to each time-series. Figure 6 shows the performance of the same model on the public M3 benchmark consisting of ≈ 1500 monthly time-series (Makridakis & Hibon, 2000).

Both experiments indicate an exciting opportunity in the time-series field: having a single generic neural network model capable of producing high quality forecasts for heterogeneous time-series, relative to specialized classical time-series models.

Figure 3. Single model heterogeneous forecast. (a) Classical time-series features that are manually derived (Hyndman et al., 2015). (b) An auto-encoder can provide powerful feature extraction used for priming the Neural Network.

Figure 6. Forecast on a public M3 dataset. A single neural network was trained on Uber data and compared against the M3-specialized models.

Figure 4. Individual holiday performance.

6. Discussion
We have presented an end-to-end neural network architecture for special event forecasting at Uber. We have shown its performance and scalability on Uber data. Finally, we have demonstrated the model's general forecasting applicability on Uber data and on the public M3 monthly data.

From our experience, there are three criteria for picking a neural network model for time-series: (a) the number of time-series, (b) the length of the time-series, and (c) the correlation among the time-series. If (a), (b) and (c) are high, then a neural network might be the right choice; otherwise, a classical time-series approach may work best.
Figure 5. Forecasting errors for production queries relative to the current proprietary model.

Our future work will be centered around utilizing the uncertainty information for neural net debugging, and on further research towards a general forecasting model for heterogeneous time-series forecasting and feature extraction, with use-cases similar to the generic ImageNet model used for general image feature extraction and classification (Deng et al., 2009).

References

Assaad, Mohammad, Boné, Romuald, and Cardot, Hubert. A new boosting algorithm for improved time-series forecasting with recurrent neural networks. Inf. Fusion, 2008.

de Haan, L. and Ferreira, A. Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering, 2006.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Friederichs, Petra and Thorarinsdottir, Thordis L. Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics, 2012.

Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Comput., 1997.

Horne, John D. and Manzenreiter, Wolfram. Accounting for mega-events. International Review for the Sociology of Sport, 39(2):187–203, 2004.

Hyndman, Rob J. and Khandakar, Yeasmin. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008.

Hyndman, Rob J., Wang, Earo, and Laptev, Nikolay. Large-scale unusual time series detection. In ICDM, pp. 1616–1619, 2015.

Kendall, Alex and Gal, Yarin. What uncertainties do we need in Bayesian deep learning for computer vision? 2017.

Li, Hongyi and Maddala, G. S. Bootstrapping time series models. Econometric Reviews, 15(2):115–158, 1996.

Makridakis, Spyros and Hibon, Michèle. The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451–476, 2000.

Meinshausen, Nicolai. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

Ogunmolu, Olalekan P., Gu, Xuejun, Jiang, Steve B., and Gans, Nicholas R. Nonlinear systems identification using deep dynamic neural networks. CoRR, 2016.

Opitz, T. Modeling asymptotically independent spatial extremes based on Laplace random fields. ArXiv e-prints, 2015.

Wei, William Wu-Shyong. Time Series Analysis. Addison-Wesley, Reading, 1994.

Ye, Lexiang and Keogh, Eamonn. Time series shapelets: A new primitive for data mining. In KDD, ACM, 2009.