Transformer For Times Series: An Application To The S&P500
Pierre Brugière∗ and Gabriel Turinici†
CEREMADE
Université Paris Dauphine - PSL
Place du Marechal de Lattre de Tassigny
75016 PARIS, FRANCE
March 6, 2024
Abstract
Transformer models have been used extensively, and with good results, in a wide range of machine learning applications, including Large Language Models and image generation. Here, we inquire into the applicability of this approach to financial time series. We first describe the dataset construction for two prototypical situations: a mean-reverting synthetic Ornstein-Uhlenbeck process on the one hand and real S&P500 data on the other hand. Then, we present in detail the proposed Transformer architecture and finally we discuss some encouraging results. For the synthetic data we predict the next move rather accurately, and for the S&P500 we obtain some interesting results related to quadratic variation and volatility prediction.
2 Methodology
2.1 First notations and time series embedding
We use the encoder part of a transformer model for time series prediction. From
a time series of unidimensional variables (y1 , y2 , · · · , ym ) we build a probabilistic
classifier, which from any partial sequence (ϕ(yi ), ϕ(yi+1 ), . . ., ϕ(yi+l−1 )) of
length l calculates the probabilities for yi+l to belong to each interval of a fixed
interval list; these intervals will be referred to as "buckets" hereafter. Here,
ϕ(.) is an embedding function from R into Rd which will be discussed later.
Transformers are highly efficient in large language models, where the yi represent
words (or tokens) and ϕ(yi) denotes their encodings in a d-dimensional space.
The value d is called the dimension of the model and is often set to 512 or
1024. Here, to achieve dimension d, the unidimensional variables yi are encoded
into new variables xi = ϕ(yi) with ϕ(yi) = (yi, yi^2/2, . . . , yi^d/d!). We also
incorporate the possibility of including a positional encoding to the process.
Different values for d can be chosen, but in our base case we take d = 2l . Details
regarding positional encoding are provided later in the paper.
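For illustration, here is a minimal NumPy sketch of this embedding; the function name phi and the dimension used in the example are illustrative, the formula itself is the one given above.

import numpy as np
from math import factorial

def phi(y, d):
    """Embed a scalar y into R^d as (y, y^2/2, ..., y^d/d!), as described above."""
    return np.array([y ** k / factorial(k) for k in range(1, d + 1)])

# Example: embed one observation into dimension d = 4
print(phi(0.5, 4))   # [0.5, 0.125, 0.0208..., 0.0026...]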
Ideally, the conditional laws L(Yn+l |(Xn , Xn+1 , · · · , Xn+l−1 )) should con-
verge as n increases for the model to perform well. To investigate this prop-
erty, we first utilize simulated data, for which these conditional laws indeed
have a limit. Here, the synthetic dataset is simulated from a mean reverting
Ornstein-Uhlenbeck process, which involves observable variables yi and hidden
state variables hi . We assess the performance of the model by comparing its
results to the values derived from L(Yn+l |Fn+l−1 ). Here, Fn+l−1 contains all
the variables (observable or hidden) up to time n + l − 1, providing the best
possible estimator. Hence, in our simulations, Fn+l−1 encompasses not only the
knowledge of the (xi )i≤n+l−1 but also includes the hidden variables (hi )i≤n+l−1 .
Subsequently, we test the method on market data from the S&P500, initially
predicting the next-day return and then the next-day quadratic variation, from
which a volatility prediction could be derived.
method. In our case we would have:
D1 = {[x]1, [x]l+1, · · · }
D2 = {[x]2, [x]l+2, · · · }
. . .
Dl = {[x]l, [x]2l, · · · }.
4. A bootstrap method to build, as in method 3, several datasets made of
sequences [x]i, where each [x]i is picked at random from the dataset of
all sequences, and then to apply an ensemble method to this collection of
datasets.
Here, we use method 2 and split the dataset of sequences into {[x]1 , [x]2 , · · · [x]m1 }
for training and {[x]m1 +1 , [x]m1 +2 · · · [x]m−l } for testing.
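As an illustration of this splitting, here is a minimal NumPy sketch (the array x below is a stand-in for the embedded series of shape (m, d); variable names are illustrative):

import numpy as np

# Stand-in for the embedded (and positionally encoded) series, shape (m, d)
x = np.random.default_rng(0).standard_normal((1000, 16))

def make_sequences(x, l):
    """Overlapping sequences [x]_1, ..., [x]_{m-l}; [x]_i stacks x_i, ..., x_{i+l-1}."""
    m = len(x)
    return np.stack([x[i:i + l] for i in range(m - l)])   # shape (m - l, l, d)

sequences = make_sequences(x, l=32)
m1 = int(0.8 * len(sequences))                 # chronological split as in method 2
x_train, x_test = sequences[:m1], sequences[m1:]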
We use θ > 0 to get a stationary process for the (yi )i∈N and take in our
simulations µ = 0, dt = 1, σ = 1 and θ = 1.
As only one trajectory is modeled (see Figure 2), we choose arbitrarily h0 = 0
and from there, the hidden values (h1 , h2 , · · · hm ) are simulated as well as the
(y1, y2, . . . , ym). Note that the various sequences extracted from this time series
with method 2 do not have identical distributions, as each of them is associated
with a different initial hidden value h0, h1, h2, etc. This being said, as we can
calculate explicitly L(Yn+l |Fn+l−1) = L(Yn+l |Hn+l−1), it is not an issue to
evaluate the performance of the model by comparing it to the best possible
classifier, which is obtained by calculating the probabilities knowing all the
variables observed and hidden at the previous time step. The variables used are
recapitulated in Figure 1.
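A minimal sketch of this simulation step is given below. Since the exact specification (1) of the process is not reproduced here, the sketch assumes the exact transition of a mean-reverting Ornstein-Uhlenbeck process for the hidden state and uses a placeholder definition for the observations yi:

import numpy as np

rng = np.random.default_rng(0)
m, mu, dt, sigma, theta = 24131, 0.0, 1.0, 1.0, 1.0

h = np.zeros(m + 1)                       # hidden states, h_0 = 0 chosen arbitrarily
a = np.exp(-theta * dt)                   # exact Ornstein-Uhlenbeck transition (assumption)
s = sigma * np.sqrt((1 - a ** 2) / (2 * theta))
for i in range(m):
    h[i + 1] = mu + a * (h[i] - mu) + s * rng.standard_normal()

# Placeholder: equation (1) of the paper defines the exact link between the
# hidden states h_i and the observations y_i; here we simply take the moves.
y = np.diff(h)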
Figure 1: Construction of the datasets. The simulation produces [h0, h1, · · · , hm] (shape (m + 1,)); the series is embedded into [e1, e2, · · · , em] (shape (m, d)) and, after adding the positional encoding, gives the dataset [[x]1, [x]2, · · · , [x]m−l] with [x]i = [e]i + P (shape (m − l, l, d)). The hidden values [hl, hl+1, · · · , hm−1] yield the target probabilities yh = [[yh]1, [yh]2, · · · , [yh]m−l] (both of shape (m − l,)). The sequences are split into x_train = [[x]1, [x]2, · · · , [x]t] (shape (t, l, d)) and x_test = [[x]t+1, [x]t+2, · · · ] (shape (m − l − t, l, d)).
that, ∀j ∈ [0, d/2 − 1], each of its 2 × 2 diagonal blocks Mk(j) is defined by:

Mk(j) = [ cos(kwj)  sin(kwj) ; −sin(kwj)  cos(kwj) ]
then,
1. ∀k ∈ [0, d − 1], Tk is an orthogonal matrix, i.e. Tk Tk⊥ = Tk⊥ Tk = Idd
2. Tk+t = Tk Tt
3. ∀t ∈ [0, d − 1], ∥pt∥² = d/2
4. pt+k = Tk pt
5. for all vector columns pt and pt+k of P the quantity ⟨pt , pt+k ⟩ only depends
on k and is maximum for k = 0.
6. ∃ a function ψ such that P P⊥ = [ψ(|i − j|)]i,j
Proof.
1. Simple calculation, as cos²(kwj) + sin²(kwj) = 1
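These properties can be checked numerically. The sketch below assumes the standard sinusoidal encoding of Vaswani et al. [2]; the frequencies wj and the values of l and d are illustrative only.

import numpy as np

l, d = 32, 64                          # illustrative sequence length and model dimension
j = np.arange(d // 2)
w = 1.0 / 10000 ** (2 * j / d)         # frequencies w_j (assumed as in Vaswani et al. [2])

t = np.arange(l)[:, None]
P = np.zeros((l, d))
P[:, 0::2] = np.sin(t * w)             # even coordinates
P[:, 1::2] = np.cos(t * w)             # odd coordinates

# Property 3: every positional vector has squared norm d/2
print(np.allclose((P ** 2).sum(axis=1), d / 2))                        # True

# Property 6: the Gram matrix of the positional vectors depends only on |i - j|
G = P @ P.T
print(np.isclose(G[0, 3], G[5, 8]), np.isclose(G[2, 6], G[10, 14]))    # True True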
2.5 The Transformer model for classification
After defining some buckets B1, B2, . . . , Bk, the objective of the model is, for a
sequence (xm+1, xm+2, · · · , xm+l), to estimate the probabilities (p1, p2, · · · , pk)
for ym+l+1 (resp. y²m+l+1 for the quadratic variation prediction) to belong to these buckets. In
the base case, we take the number of buckets equal to 7, obtaining the list
]−∞, α1], ]α1, α2], · · · , ]α6, +∞[ with the αi corresponding to the quantiles of
the observed yi (resp. yi²) from the training set. Therefore, each bucket contains
the same proportion of the yi (resp. yi²) from the training set, up to rounding errors.
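A minimal sketch of this bucket construction (function names are illustrative):

import numpy as np

def make_buckets(y_train, k=7):
    """Boundaries alpha_1 < ... < alpha_{k-1}: empirical quantiles of the training
    observations, so that each bucket ]-inf,a1], ]a1,a2], ..., ]a_{k-1},+inf[
    holds roughly 1/k of them."""
    return np.quantile(y_train, np.linspace(0, 1, k + 1)[1:-1])

def to_one_hot(y, alphas):
    """One-hot encoding of the bucket of each observation (buckets closed on the right)."""
    k = len(alphas) + 1
    return np.eye(k)[np.searchsorted(alphas, y)]

# Illustration on random data
y_train = np.random.default_rng(0).standard_normal(1000)
alphas = make_buckets(y_train, k=7)
targets = to_one_hot(y_train, alphas)      # shape (1000, 7)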
The model used is the encoder part of a Transformer, which is a natural
choice when solving a classification problem, and we extend the Encoder with
some dense layers at the end for the classification task. We call:
In the learning phase, for parameter estimation, the sequences are processed
in batches. The default batch size is 32; it can be changed in model.fit. Here,
in our base case, we take the batch size to be 64.
Below is an example of Python code that sets the training parameters of the model:
model.fit(
    x_train,
    yh_train,
    validation_split=0.2,
    epochs=30,
    batch_size=64,
    callbacks=my_callbacks,
)
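The callbacks themselves are not specified here; a plausible minimal definition (purely an assumption, the paper does not say which callbacks are used) is an early-stopping callback on the validation loss:

import keras

# Assumption: early stopping on the validation loss; other callbacks could be used instead.
my_callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]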
3. a "Special Purpose" block for class prediction, once the sequences have
been "contextualised" by the Encoder.
3. An Additive Layer
res = x + inputs
Figure 3: Structure of the program. Input Tensor → Multihead Attention → Dropout → Add → Layer Normalisation → Conv1D → Conv1D → Add → Layer Normalisation (Encoder block iterated 6 times) → Dense (could be iterated) → Dropout → Output Tensor.
This layer has no learnable parameters and adds each unit of the dropout
layer's output instance (xd1, xd2, · · · , xdl) to the corresponding unit of the input
instance (x1, x2, · · · , xl).
x̃ji −→ γj x̃ji + βj
The number of learnable parameters is 2d.
Then comes the Position-wise Feed Forward block with:
5. A 1D Convolution Layer (see Keras [3]).
x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
Each filter corresponds to a transformation which is affine if "use_bias=True"
(which is the default) and otherwise linear. Here, the shape of the input
tensor is (None, l, d), the convolution slides along "axis=1" and, as
"kernel_size=1", the transformation is done one xi at a time:

xi −→ W xi + b

with W and b learnable, of respective shapes (ff_dim, d) and (ff_dim,) (the same
weights are shared across the l positions). The output tensor is of shape (None,
l, ff_dim). In Vaswani [2] ff_dim = 4 × d and the activation function is "relu";
we use here the same multiplier 4 and the same activation function. The number
of learnable parameters is (d + 1) × ff_dim.
6. A Dropout Layer
x = layers.Dropout(dropout)(x)
This layer finishes the Position-wise Feed Forward block. Then comes:
8. An Additive Layer
res = x + inputs
9. A Normalization Layer
x = layers.LayerNormalization(axis=-1, epsilon=1e-6)(res)
This Encoder Block is iterated 6 times in Vaswani [2]. This is also what
we do in our base case by defining "num_transformer_blocks=6".
Note that our program performs better when the last Normalization Layer
9 is moved in front of the Multihead Layer, and this is what we do in our
base case, as indicated by the red arrow in Figure 3.
The last part of the model is specific to our classification objective: its
input tensor of shape (None, l, d) must be transformed into a tensor of
shape (None, k). This last block is made of the following layers.
10. A GlobalAveragePooling1D layer.
By default the average is done along the sequence axis, transforming an
input tensor of shape (None, l, d) into an output tensor of shape (None, d).
Here, the model performs better when the average is done along the features
(channels) axis. By adding the argument data_format="channels_first",
the layer treats the input tensor as if its shape were (None, d, l), therefore
calculating the averages along the last axis. Therefore, with this argument,
in our model the output tensor has the shape (None, l).
11. A Dense Layer with a number of neurons equal to "mlp_units". We take
in our base case "mlp_units=[10]" and implement a single layer with no
iteration. To implement a succession of p Dense Layers, the numbers of
units of these successive layers are entered as a list "mlp_units=[n1, n2, . . . , np]".

x = layers.Dense(dim, activation="relu")(x)
13. A final Dense Layer which produces an output of shape (n_classes,), i.e.
(7,).

outputs = layers.Dense(n_classes, activation="softmax")(x)
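Putting the layers described above together, a minimal Keras sketch of the whole classifier could look as follows. This is only our reading of the description (with the pre-normalisation variant of the base case); the function names and default hyperparameters are illustrative, not the paper's exact code.

import keras
from keras import layers

def transformer_encoder(inputs, d, num_heads, ff_dim, dropout):
    # Base-case variant: the last Normalization Layer is moved in front of the Multihead block
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d // num_heads,
                                  dropout=dropout)(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs
    # Position-wise Feed Forward block
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=d, kernel_size=1)(x)
    return x + res

def build_model(l, d, n_classes=7, num_heads=4, num_transformer_blocks=6,
                mlp_units=(10,), dropout=0.1):
    ff_dim = 4 * d                                 # same multiplier 4 as in Vaswani [2]
    inputs = keras.Input(shape=(l, d))
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, d, num_heads, ff_dim, dropout)
    # "Special Purpose" classification block
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)   # shape (None, l)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_model(l=32, d=64)   # illustrative dimensions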
The Multihead Attention block maps an input tensor of shape (None, l, d) to an output tensor of the same shape (None, l, d).
We call K, Q and V the matrices whose lines are the vectors x1^K, . . . , xl^K, the
vectors x1^Q, . . . , xl^Q and the vectors x1^V, . . . , xl^V respectively. These matrices
are of respective dimensions (l, dk), (l, dk), (l, dv).
We call Attention Matrix A the matrix of dimension (l, l) defined as:

A = softmax( (1/√dk) Q K⊥ )

The softmax function is applied line by line, so that the sum of each line of A
is one, and then we calculate the quantity A V.
The last step is to use a learnable matrix WO of shape (nb_heads × dv, d) to
transform z (the matrix whose lines are z1, . . . , zl) into the matrix whose lines
are z1 WO, . . . , zl WO, of shape (l, d).
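For concreteness, a NumPy sketch of this single-head attention computation (the matrices are random placeholders):

import numpy as np

def attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(d_k)) applied line by line, then output A V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # shape (l, l)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)                  # each line of A sums to one
    return A @ V                                          # shape (l, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))                  # l = 4, d_k = d_v = 8
out = attention(Q, K, V)                                  # shape (4, 8)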
3.3.2 Categorical accuracy
The categorical accuracy provides another measure of performance by measuring
the percentage of correct bucket predictions (the bucket predicted for yi+1 is
the one maximising the q_i^j for j ∈ ⟦1, k⟧).
There are k buckets. At the beginning, the model does not know how to
use the [x]i to predict, makes predictions at random and therefore predicts
correctly in 1/k of the cases: we observe an accuracy of 1/k = 1/7 ≈ 14.28%.
After training with the simulated data, the accuracy approaches 30%, which
is not far from the best possible accuracy derived with the Ti .
Indeed, the best possible accuracy for bucket prediction is reached when the
target probabilities Ti are used and when the predicted bucket is defined as
the one with the highest target probability; with this method, the Categorical
Accuracy reached from the sample of 24131 observations is 31.85% on the train
set and 32.00% on the test set.
that the classifier should ideally be able to approximate, and compare the quan-
tities H(Ti , Qi ) to our targets H(Ti , Ti ). The comparison is done by calculating
the average of these two quantities on the train set and on the test set.
When the model starts training, it does not know how to use past observations
and therefore finds a probability Q, independent of the [x]i, which solves

min_Q − Σ_{i=1,...,k} (1/k) ln(qi)

Therefore, the solution is, ∀i ∈ ⟦1, k⟧, qi = 1/k = 1/7, which results in a loss function
equal to ln(k) = ln(7) ≈ 1.9459, which is what we get at the beginning of the
optimisation process. As the optimisation progresses, we expect the loss function
to get closer to

Average_i {H(Ti, Ti)} ∼ 1.63
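These two reference values can be checked numerically; in the sketch below, the target probabilities Ti are random placeholders, whereas in the paper they come from the known conditional law of the simulated process:

import numpy as np

k = 7
print(np.log(k))                                   # 1.9459..., loss of the uniform prediction

# Average H(T_i, T_i) over a set of target probability vectors (placeholders here)
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(k), size=1000)
print(np.mean(-np.sum(T * np.log(T), axis=1)))     # average entropy of the targets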
3.4 Parameters of the model
We make the arbitrary choice in our base model to work with sequences of length
32. A big difference between NLP and time series is that for time series
there is no standard method to embed numbers. We choose here an arbitrary
function ϕ for the embedding. Note that, when we calculate in the Multihead
Attention block the scalar products ⟨ϕ(xi), ϕ(xj)⟩, the choice of ϕ induces a
kernel K(xi, xj) = ⟨ϕ(xi), ϕ(xj)⟩; such kernels are, generally speaking, important
tools in classification and prediction problems.
The model uses the categorical cross-entropy as the loss function, Adam
as the optimizer and the "categorical accuracy" as the metric.
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["categorical_accuracy"])
We use 80% of the time series for learning and 20% for testing. In the learning
phase the model uses 80% of the learning sequences for calibration and 20% for
validation.
4 Results
4.1 Prediction for the mean-reverting Ornstein-Uhlenbeck synthetic data
We present here some results obtained in our base model; the parameters, inspired
by the base model of Vaswani et al. [2], are given in Table 1.
We considered here a trajectory of the synthetic stochastic process (1) with
24131 data points, the same number as the daily observations we will use for the
prediction on the S&P500 market index. The results are given in Table 2.
We see that the model enables accurate prediction of the bucket probabilities
(the blue points correspond to the perfect probability predictions obtained with
the Ti), but a significant number of instances is needed to train it.
Table 2: Results for the synthetic data.

Number of epochs | Number of observations | Train set: Loss H(P,Q) / Accuracy | Test set: Loss H(P,Q) / Accuracy | Train set: H(P,T) / H(T,T) | Test set: H(P,T) / H(T,T)
30 | 24131  | 1.681 / 30.33% | 1.697 / 28.66% | 1.626 / 1.630 | 1.636 / 1.628
30 | 241310 | 1.656 / 30.60% | 1.656 / 30.74% | 1.631 / 1.631 | 1.630 / 1.629
40 | 241310 | 1.657 / 30.56% | 1.657 / 30.77% | 1.631 / 1.631 | 1.630 / 1.629
Figure 4: Predictions for the synthetic stochastic process (1) with 24131 observations, 30 epochs.

Figure 5: Predictions for the synthetic stochastic process (1) with 241310 observations, 40 epochs.
which are the daily log returns observed (see Figure 6).
In the first program we predict the bucket for yi, i.e., we try to guess the
probability distribution of the daily log return for the next business day. In the
second program we predict the bucket for yi² (the quadratic variation of the log
prices) for the next business day.
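A sketch of how these two targets are obtained from a series of daily closing prices (the prices array below is a stand-in for the actual S&P500 data):

import numpy as np

prices = np.array([4700.0, 4712.5, 4695.3, 4730.1, 4741.8])   # stand-in closing prices

y = np.diff(np.log(prices))    # daily log returns y_i = ln(P_i / P_{i-1})
y2 = y ** 2                    # daily quadratic variation, target of the second program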
Figure 6: Left: Log prices of the S&P500 over the period. Right: first 300
daily log returns y1 , ..., y300 .
As all are equal, this means that the model is failing to make an interesting
prediction because it fails to take into account the conditional information [x]i .
We did not spend too much effort trying to improve these predictions, for
which we had limited hopes, and directed our efforts towards the yi² prediction
objective.
4.4 Prediction of the next quadratic variation y²i+l

Here, the aim is, from an observed [x]i, to predict the bucket for y²i+l. This is the
first step towards predicting the gamma cost of hedging options and future volatility. As
statistical studies, using GARCH models for example, have advocated for some
degree of predictability, we expect our predictions to surpass random
chance in this case.
The model parameters remain the same as before and we continue to refrain
from using positional encoding, which appears to hinder learning. Fifty epochs seem
to be enough to calibrate the model and we get the following results.
Figure 7: Bucket predictions for yi+l on the test set for the S&P500.
5 Conclusion
In this study, we applied a transformer encoder architecture to time series
prediction. The significant difference from LLM applications of transformers
is that here we are dealing with numerical data (of dimension 1) and not
tokens embedded in spaces of dimension 512 or 1024 (with a specific logic
behind the embedding process). We felt compelled to embed the numbers into a
Figure 8: Bucket predictions for y²i+l on the test set for the S&P500.

Figure 9: Naive classifier for y²i+l on the test set for the S&P500.
References
[1] Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu (2023) Are Transformers Effective for Time Series Forecasting? The Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023 (AAAI-23).
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017) Attention Is All You Need, arXiv:1706.03762v7, 2 August 2023.
[3] Conv1D layer, https://keras.io/api/layers/convolution_layers/convolution1d/, retrieved Feb. 29th 2024.
[4] Eli Simhayev, Kashif Rasul, Niels Rogge (2023) Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer), https://huggingface.co/blog/, 16 June 2023.
[5] Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton (2016) Layer Normalization, arXiv:1607.06450v1, 21 July 2016.
[6] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, Liang Sun (2023) Transformers in Time Series: A Survey, Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23).
[7] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu (2020) On Layer Normalization in the Transformer Architecture, arXiv:2002.04745v2, 29 June 2020.
[8] TensorFlow v2.15.0.post1 (2024) https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization, last updated 2024-01-11 UTC.