Transformer For Times Series: An Application To The S&P500
Pierre Brugière∗ and Gabriel Turinici†
CEREMADE
Université Paris Dauphine - PSL
Place du Marechal de Lattre de Tassigny
75016 PARIS, FRANCE
March 6, 2024
Abstract
Transformer models have been used extensively, and with good results, in a wide range of machine learning applications, including Large Language Models and image generation. Here, we inquire into the applicability of this approach to financial time series. We first describe the dataset construction for two prototypical situations: a mean-reverting synthetic Ornstein-Uhlenbeck process on the one hand and real S&P500 data on the other hand. Then, we present in detail the proposed Transformer architecture and finally we discuss some encouraging results. For the synthetic data we predict the next move rather accurately, and for the S&P500 we obtain some interesting results related to quadratic variation and volatility prediction.
2 Methodology
2.1 First notations and time series embedding
We use the encoder part of a transformer model for time series prediction. From
a time series of unidimensional variables (y1 , y2 , · · · , ym ) we build a probabilistic
classifier, which from any partial sequence (ϕ(yi ), ϕ(yi+1 ), . . ., ϕ(yi+l−1 )) of
length l calculates the probabilities for yi+l to belong to each interval of a fixed
interval list; these intervals will be referred to as "buckets" hereafter. Here,
ϕ(.) is an embedding function from R into Rd which will be discussed later.
Transformers are highly efficient in large language models, where the yi represent
words (or tokens) and ϕ(yi) denotes their encodings in a d-dimensional space.
The value d is called the dimension of the model and is often set to 512 or
1024. Here, to achieve dimension d, the unidimensional variables yi are encoded
into new variables xi = ϕ(yi) with ϕ(yi) = (yi, yi^2/2, . . . , yi^d/d!). We also
incorporate the possibility of including a positional encoding to the process.
Different values for d can be chosen, but in our base case we take d = 2l . Details
regarding positional encoding are provided later in the paper.
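For illustration, here is a minimal NumPy sketch of this embedding; the function name phi and the dimension used in the example are illustrative, the formula itself is the one given above.

import numpy as np
from math import factorial

def phi(y, d):
    """Embed a scalar y into R^d as (y, y^2/2, ..., y^d/d!), as described above."""
    return np.array([y ** k / factorial(k) for k in range(1, d + 1)])

# Example: embed one observation into dimension d = 4
print(phi(0.5, 4))   # [0.5, 0.125, 0.0208..., 0.0026...]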
Ideally, the conditional laws L(Yn+l |(Xn , Xn+1 , · · · , Xn+l−1 )) should con-
verge as n increases for the model to perform well. To investigate this prop-
erty, we first utilize simulated data, for which these conditional laws indeed
have a limit. Here, the synthetic dataset is simulated from a mean reverting
Ornstein-Uhlenbeck process, which involves observable variables yi and hidden
state variables hi . We assess the performance of the model by comparing its
results to the values derived from L(Yn+l |Fn+l−1 ). Here, Fn+l−1 contains all
the variables (observable or hidden) up to time n + l − 1, providing the best
possible estimator. Hence, in our simulations, Fn+l−1 encompasses not only the
knowledge of the (xi )i≤n+l−1 but also includes the hidden variables (hi )i≤n+l−1 .
Subsequently, we test the method on market data from the S&P500, initially
predicting the next-day return and then the next-day quadratic variation, from
which a volatility prediction could be derived.
method. In our case we would have:
D1 = {[x]1, [x]l+1, · · · }
D2 = {[x]2, [x]l+2, · · · }
. . .
Dl = {[x]l, [x]2l, · · · }.
4. A bootstrap method to build, as in method 3, several datasets made of
sequences [x]i, where each [x]i is picked at random from the dataset of
all sequences, and then to apply an ensemble method to this collection of
datasets.
Here, we use method 2 and split the dataset of sequences into {[x]1 , [x]2 , · · · [x]m1 }
for training and {[x]m1 +1 , [x]m1 +2 · · · [x]m−l } for testing.
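As an illustration of this splitting, here is a minimal NumPy sketch (the array x below is a stand-in for the embedded series of shape (m, d); variable names are illustrative):

import numpy as np

# Stand-in for the embedded (and positionally encoded) series, shape (m, d)
x = np.random.default_rng(0).standard_normal((1000, 16))

def make_sequences(x, l):
    """Overlapping sequences [x]_1, ..., [x]_{m-l}; [x]_i stacks x_i, ..., x_{i+l-1}."""
    m = len(x)
    return np.stack([x[i:i + l] for i in range(m - l)])   # shape (m - l, l, d)

sequences = make_sequences(x, l=32)
m1 = int(0.8 * len(sequences))                 # chronological split as in method 2
x_train, x_test = sequences[:m1], sequences[m1:]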
We use θ > 0 to get a stationary process for the (yi )i∈N and take in our
simulations µ = 0, dt = 1, σ = 1 and θ = 1.
As only one trajectory is modeled (see Figure 2), we choose arbitrarily h0 = 0
and from there, the hidden values (h1 , h2 , · · · hm ) are simulated as well as the
(y1, y2, . . . , ym). Note that the various sequences extracted from this time series
with method 2 do not have identical distributions, as each of them is associated
with a different initial hidden value h0, h1, h2, etc. This being said, as we can
calculate explicitly L(Yn+l |Fn+l−1) = L(Yn+l |Hn+l−1), it is not an issue to
evaluate the performance of the model by comparing it to the best possible
classifier, which is obtained by calculating the probabilities knowing all the
variables observed and hidden at the previous time step. The variables used are
recapitulated in Figure 1.
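A minimal sketch of this simulation step is given below. Since the exact specification (1) of the process is not reproduced here, the sketch assumes the exact transition of a mean-reverting Ornstein-Uhlenbeck process for the hidden state and uses a placeholder definition for the observations yi:

import numpy as np

rng = np.random.default_rng(0)
m, mu, dt, sigma, theta = 24131, 0.0, 1.0, 1.0, 1.0

h = np.zeros(m + 1)                       # hidden states, h_0 = 0 chosen arbitrarily
a = np.exp(-theta * dt)                   # exact Ornstein-Uhlenbeck transition (assumption)
s = sigma * np.sqrt((1 - a ** 2) / (2 * theta))
for i in range(m):
    h[i + 1] = mu + a * (h[i] - mu) + s * rng.standard_normal()

# Placeholder: equation (1) of the paper defines the exact link between the
# hidden states h_i and the observations y_i; here we simply take the moves.
y = np.diff(h)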
Figure 1: Construction of the datasets. The simulation produces [h0, h1, · · · , hm] (shape (m + 1,)); the series is embedded into [e1, e2, · · · , em] (shape (m, d)) and, after adding the positional encoding, gives the dataset [[x]1, [x]2, · · · , [x]m−l] with [x]i = [e]i + P (shape (m − l, l, d)). The hidden values [hl, hl+1, · · · , hm−1] yield the target probabilities yh = [[yh]1, [yh]2, · · · , [yh]m−l] (both of shape (m − l,)). The sequences are split into x_train = [[x]1, [x]2, · · · , [x]t] (shape (t, l, d)) and x_test = [[x]t+1, [x]t+2, · · · ] (shape (m − l − t, l, d)).
that, ∀j ∈ [0, d/2 − 1], each of its 2 × 2 diagonal blocks Mk(j) is defined by:

Mk(j) = [ cos(kwj)  sin(kwj) ; −sin(kwj)  cos(kwj) ]
then,
1. ∀k ∈ [0, d − 1], Tk is an orthogonal matrix, i.e. Tk Tk⊥ = Tk⊥ Tk = Idd
2. Tk+t = Tk Tt
3. ∀t ∈ [0, d − 1], ∥pt∥² = d/2
4. pt+k = Tk pt
5. for all vector columns pt and pt+k of P the quantity ⟨pt , pt+k ⟩ only depends
on k and is maximum for k = 0.
6. ∃ a function ψ such that P P⊥ = [ψ(|i − j|)]i,j
Proof.
1. Simple calculation, as cos²(kwj) + sin²(kwj) = 1
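These properties can be checked numerically. The sketch below assumes the standard sinusoidal encoding of Vaswani et al. [2]; the frequencies wj and the values of l and d are illustrative only.

import numpy as np

l, d = 32, 64                          # illustrative sequence length and model dimension
j = np.arange(d // 2)
w = 1.0 / 10000 ** (2 * j / d)         # frequencies w_j (assumed as in Vaswani et al. [2])

t = np.arange(l)[:, None]
P = np.zeros((l, d))
P[:, 0::2] = np.sin(t * w)             # even coordinates
P[:, 1::2] = np.cos(t * w)             # odd coordinates

# Property 3: every positional vector has squared norm d/2
print(np.allclose((P ** 2).sum(axis=1), d / 2))                        # True

# Property 6: the Gram matrix of the positional vectors depends only on |i - j|
G = P @ P.T
print(np.isclose(G[0, 3], G[5, 8]), np.isclose(G[2, 6], G[10, 14]))    # True True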
2.5 The Transformer model for classification
After defining some buckets B1, B2, . . . , Bk, the objective of the model is, for a
sequence (xm+1, xm+2, · · · , xm+l), to estimate the probabilities (p1, p2, · · · , pk)
for ym+l+1 (resp. y²m+l+1 for the quadratic variation prediction) to belong to these buckets. In
the base case, we take the number of buckets equal to 7, obtaining the list
]−∞, α1], ]α1, α2], · · · , ]α6, +∞[ with the αi corresponding to the quantiles of
the observed yi (resp. yi²) from the training set. Therefore, each bucket contains
the same proportion of the yi (resp. yi²) from the training set, up to rounding errors.
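A minimal sketch of this bucket construction (function names are illustrative):

import numpy as np

def make_buckets(y_train, k=7):
    """Boundaries alpha_1 < ... < alpha_{k-1}: empirical quantiles of the training
    observations, so that each bucket ]-inf,a1], ]a1,a2], ..., ]a_{k-1},+inf[
    holds roughly 1/k of them."""
    return np.quantile(y_train, np.linspace(0, 1, k + 1)[1:-1])

def to_one_hot(y, alphas):
    """One-hot encoding of the bucket of each observation (buckets closed on the right)."""
    k = len(alphas) + 1
    return np.eye(k)[np.searchsorted(alphas, y)]

# Illustration on random data
y_train = np.random.default_rng(0).standard_normal(1000)
alphas = make_buckets(y_train, k=7)
targets = to_one_hot(y_train, alphas)      # shape (1000, 7)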
The model used is the encoder part of a Transformer, which is a natural
choice when solving a classification problem, and we extend the Encoder with
some dense layers at the end for the classification task. We call:
In the learning phase, for parameter estimation, the sequences are processed
in batches. The default batch size is 32; it can be changed in model.fit. Here,
in our base case, we take the batch size to be 64.
Below is an example of Python code that sets the training parameters of the model:
model.fit(
    x_train,
    yh_train,
    validation_split=0.2,
    epochs=30,
    batch_size=64,
    callbacks=my_callbacks,
)
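The callbacks themselves are not specified here; a plausible minimal definition (purely an assumption, the paper does not say which callbacks are used) is an early-stopping callback on the validation loss:

import keras

# Assumption: early stopping on the validation loss; other callbacks could be used instead.
my_callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]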
3. a "Special Purpose" block for class prediction, once the sequences have
been "contextualised" by the Encoder.
3. An Additive Layer
res = x + inputs
Figure 3: Structure of the program. Input Tensor → Multihead Attention → Dropout → Add → Layer Normalisation → Conv1D → Conv1D → Add → Layer Normalisation (Encoder block iterated 6 times) → Dense (could be iterated) → Dropout → Output Tensor.
This layer has no learnable parameters and adds each unit of the dropout
layer's output instance (xd1, xd2, · · · , xdl) to the corresponding unit of the input
instance (x1, x2, · · · , xl).
x̃ji −→ γj x̃ji + βj
The number of learnable parameters is 2d.
Then comes the Position-wise Feed Forward block with:
5. A 1D Convolution Layer (see Keras [3]).
x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
Each filter corresponds to a transformation which is affine if "use_bias=True"
(which is the default) and otherwise linear. Here, the shape of the input
tensor is (None, l, d), the convolution slides along "axis=1" and, as
"kernel_size=1", the transformation is done one xi at a time:

xi −→ W xi + b

with W and b learnable, of respective shapes (ff_dim, d) and (ff_dim,) (the same
weights are shared across the l positions). The output tensor is of shape (None,
l, ff_dim). In Vaswani [2] ff_dim = 4 × d and the activation function is "relu";
we use here the same multiplier 4 and the same activation function. The number
of learnable parameters is (d + 1) × ff_dim.
6. A Dropout Layer
x = layers.Dropout(dropout)(x)
This layer finishes the Position-wise Feed Forward block. Then comes:
8. An Additive Layer
res = x + inputs
9. A Normalization Layer
x = layers.LayerNormalization(axis=-1, epsilon=1e-6)(res)
This Encoder Block is iterated 6 times in Vaswani [2]. This is also what
we do in our base case by defining "num_transformer_blocks=6".
Note that our program performs better when the last Normalization Layer
9 is moved in front of the Multihead Layer, and this is what we do in our
base case, as indicated by the red arrow in Figure 3.
The last part of the model is specific to our classification objective: its
input tensor of shape (None, l, d) must be transformed into a tensor of
shape (None, k). This last block is made of the following layers.
10. A GlobalAveragePooling1D layer.
By default the average is done along the sequence axis, transforming an
input tensor of shape (None, l, d) into an output tensor of shape (None, d).
Here, the model performs better when the average is done along the features
(channels) axis. By adding the argument data_format="channels_first",
the layer treats the input tensor as if its shape were (None, d, l), therefore
calculating the averages along the last axis. Therefore, with this argument,
in our model the output tensor has the shape (None, l).
11. A Dense Layer with a number of neurons equal to "mlp_units". We take
in our base case "mlp_units=[10]" and implement a single layer with no
iteration. To implement a succession of p Dense Layers, the numbers of
units of these successive layers are entered as a list "mlp_units=[n1, n2, . . . , np]".

x = layers.Dense(dim, activation="relu")(x)
13. A final Dense Layer which produces an output of shape (n_classes,), i.e.
(7,).

outputs = layers.Dense(n_classes, activation="softmax")(x)
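Putting the layers described above together, a minimal Keras sketch of the whole classifier could look as follows. This is only our reading of the description (with the pre-normalisation variant of the base case); the function names and default hyperparameters are illustrative, not the paper's exact code.

import keras
from keras import layers

def transformer_encoder(inputs, d, num_heads, ff_dim, dropout):
    # Base-case variant: the last Normalization Layer is moved in front of the Multihead block
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d // num_heads,
                                  dropout=dropout)(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs
    # Position-wise Feed Forward block
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=d, kernel_size=1)(x)
    return x + res

def build_model(l, d, n_classes=7, num_heads=4, num_transformer_blocks=6,
                mlp_units=(10,), dropout=0.1):
    ff_dim = 4 * d                                 # same multiplier 4 as in Vaswani [2]
    inputs = keras.Input(shape=(l, d))
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, d, num_heads, ff_dim, dropout)
    # "Special Purpose" classification block
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)   # shape (None, l)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_model(l=32, d=64)   # illustrative dimensions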
The Multihead Attention block maps an input tensor of shape (None, l, d) to an output tensor of the same shape (None, l, d).
We call K, Q and V the matrices whose lines are the vectors x1^K, . . . , xl^K, the
vectors x1^Q, . . . , xl^Q and the vectors x1^V, . . . , xl^V respectively. These matrices
are of respective dimensions (l, dk), (l, dk), (l, dv).
We call Attention Matrix A the matrix of dimension (l, l) defined as:

A = softmax( (1/√dk) Q K⊥ )

The softmax function is applied line by line, so that the sum of each line of A
is one, and then we calculate the quantity A V.
The last step is to use a learnable matrix WO of shape (nb_heads × dv, d) to
transform z (the matrix whose lines are z1, . . . , zl) into the matrix whose lines
are z1 WO, . . . , zl WO, of shape (l, d).
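For concreteness, a NumPy sketch of this single-head attention computation (the matrices are random placeholders):

import numpy as np

def attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(d_k)) applied line by line, then output A V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # shape (l, l)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)                  # each line of A sums to one
    return A @ V                                          # shape (l, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))                  # l = 4, d_k = d_v = 8
out = attention(Q, K, V)                                  # shape (4, 8)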
3.3.2 Categorical accuracy
The categorical accuracy provides another measure of performance by measuring
the percentage of correct bucket predictions (the bucket predicted for yi+1 is
the one maximising the q_i^j for j ∈ ⟦1, k⟧).
There are k buckets. At the beginning, the model does not know how to
use the [x]i to predict, makes predictions at random and therefore predicts
correctly in 1/k of the cases: we observe an accuracy of 1/k = 1/7 ≈ 14.28%.
After training with the simulated data, the accuracy approaches 30%, which
is not far from the best possible accuracy derived with the Ti .
Indeed, the best possible accuracy for bucket prediction is reached when the
target probabilities Ti are used and when the predicted bucket is defined as
the one with the highest target probability; with this method, the Categorical
Accuracy reached from the sample of 24131 observations is 31.85% on the train
set and 32.00% on the test set.
that the classifier should ideally be able to approximate, and compare the quan-
tities H(Ti , Qi ) to our targets H(Ti , Ti ). The comparison is done by calculating
the average of these two quantities on the train set and on the test set.
When the model starts training, it does not know how to use past observations
and therefore finds a probability Q, independent of the [x]i, which solves

min_Q − Σ_{i=1,...,k} (1/k) ln(qi)

Therefore, the solution is, ∀i ∈ ⟦1, k⟧, qi = 1/k = 1/7, which results in a loss function
equal to ln(k) = ln(7) ≈ 1.9459, which is what we get at the beginning of the
optimisation process. As the optimisation progresses, we expect the loss function
to get closer to

Average_i {H(Ti, Ti)} ∼ 1.63
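These two reference values can be checked numerically; in the sketch below, the target probabilities Ti are random placeholders, whereas in the paper they come from the known conditional law of the simulated process:

import numpy as np

k = 7
print(np.log(k))                                   # 1.9459..., loss of the uniform prediction

# Average H(T_i, T_i) over a set of target probability vectors (placeholders here)
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(k), size=1000)
print(np.mean(-np.sum(T * np.log(T), axis=1)))     # average entropy of the targets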
3.4 Parameters of the model
We make the arbitrary choice in our base model to work with sequences of length
32. A big difference between NLP and time series is that for time series
there is no standard method to embed numbers. We choose here an arbitrary
function ϕ for the embedding. Note that, when we calculate in the Multihead
Attention block the scalar products ⟨ϕ(xi), ϕ(xj)⟩, the choice of ϕ induces a
kernel K(xi, xj) = ⟨ϕ(xi), ϕ(xj)⟩; such kernels are, generally speaking, important
tools in classification and prediction problems.
The model uses the categorical cross-entropy as the loss function, Adam
as the optimizer and the "categorical accuracy" as the metric.
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["categorical_accuracy"])
We use 80% of the time series for learning and 20% for testing. In the learning
phase the model uses 80% of the learning sequences for calibration and 20% for
validation.
4 Results
4.1 Prediction for the mean-reverting Ornstein-Uhlenbeck synthetic data
We present here some results obtained in our base model; the parameters, inspired
by the base model of Vaswani et al. [2], are given in Table 1.
We considered here a trajectory of the synthetic stochastic process (1) with
24131 data points, the same number as the daily observations we will use for the
prediction on the S&P500 market index. The results are given in Table 2.
We see that the model enables accurate prediction of the bucket probabilities
(the blue points correspond to the perfect probability predictions obtained with
the Ti), but a significant number of instances is needed to train it.
Table 2: Results for the synthetic data.

Number of epochs | Number of observations | Train set: Loss H(P,Q) / Accuracy | Test set: Loss H(P,Q) / Accuracy | Train set: H(P,T) / H(T,T) | Test set: H(P,T) / H(T,T)
30 | 24131  | 1.681 / 30.33% | 1.697 / 28.66% | 1.626 / 1.630 | 1.636 / 1.628
30 | 241310 | 1.656 / 30.60% | 1.656 / 30.74% | 1.631 / 1.631 | 1.630 / 1.629
40 | 241310 | 1.657 / 30.56% | 1.657 / 30.77% | 1.631 / 1.631 | 1.630 / 1.629
Figure 4: Predictions for the synthetic stochastic process (1) with 24131 observations, 30 epochs.

Figure 5: Predictions for the synthetic stochastic process (1) with 241310 observations, 40 epochs.
which are the daily log returns observed (see Figure 6).
In the first program we predict the bucket for yi, i.e., we try to guess the
probability distribution of the daily log return for the next business day. In the
second program we predict the bucket for yi² (the quadratic variation of the log
prices) for the next business day.
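A sketch of how these two targets are obtained from a series of daily closing prices (the prices array below is a stand-in for the actual S&P500 data):

import numpy as np

prices = np.array([4700.0, 4712.5, 4695.3, 4730.1, 4741.8])   # stand-in closing prices

y = np.diff(np.log(prices))    # daily log returns y_i = ln(P_i / P_{i-1})
y2 = y ** 2                    # daily quadratic variation, target of the second program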
Figure 6: Left: Log prices of the S&P500 over the period. Right: first 300
daily log returns y1 , ..., y300 .
As all are equal, this means that the model is failing to make an interesting
prediction because it fails to take into account the conditional information [x]i .
We did not spend too much effort trying to improve these predictions, for
which we had limited hopes, and directed our efforts towards the yi² prediction
objective.
4.4 Prediction of the next quadratic variation y²i+l

Here, the aim is, from an observed [x]i, to predict the bucket for y²i+l. This is the
first step towards predicting the gamma cost of hedging options and future volatility. As
statistical studies, using GARCH models for example, have advocated for some
degree of predictability, we expect our predictions to surpass random
chance in this case.
The model parameters remain the same as before and we continue to refrain
from using positional encoding, which appears to hinder learning. Fifty epochs seem
to be enough to calibrate the model and we get the following results.
Figure 7: Bucket predictions for yi+l on the test set for the S&P500.
5 Conclusion
In this study, we applied a transformer encoder architecture to time series
prediction. The significant difference from LLM applications of transformers
is that here we are dealing with numerical data (of dimension 1) and not
tokens embedded in spaces of dimension 512 or 1024 (with a specific logic
behind the embedding process). We felt compelled to embed the numbers into a
Figure 8: Bucket predictions for y²i+l on the test set for the S&P500.

Figure 9: Naive classifier for y²i+l on the test set for the S&P500.
References
[1] Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu (2023) Are Transformers Effective for Time Series Forecasting? The Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023 (AAAI-23).
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017) Attention Is All You Need, arXiv:1706.03762v7, 2 August 2023.
[3] Conv1D layer, https://keras.io/api/layers/convolution_layers/convolution1d/, retrieved Feb. 29th 2024.
[4] Eli Simhayev, Kashif Rasul, Niels Rogge (2023) Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer), https://huggingface.co/blog/, 16 June 2023.
[5] Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton (2016) Layer Normalization, arXiv:1607.06450v1, 21 July 2016.
[6] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, Liang Sun (2023) Transformers in Time Series: A Survey, Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23).
[7] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu (2020) On Layer Normalization in the Transformer Architecture, arXiv:2002.04745v2, 29 June 2020.
[8] TensorFlow v2.15.0.post1 (2024) https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization, last updated 2024-01-11 UTC.