Policy Research Working Paper                    8790




      Estimation of the ex ante Distribution
    of Returns for a Portfolio of U.S. Treasury
           Securities via Deep Learning
                                Andrea Foresti




Market and Counterparty Risk Team
March 2019


  Abstract
 This paper presents different deep neural network architectures designed to forecast the distribution of returns on a portfolio of U.S. Treasury securities. A long short-term memory model and a convolutional neural network are tested as the main building blocks of each architecture. The models are then augmented by cross-sectional data and the portfolio's empirical distribution. The paper also presents the fit and generalization potential of each approach.




 This paper is a product of the Market and Counterparty Risk Team. It is part of a larger effort by the World Bank to
 provide open access to its research and make a contribution to development policy discussions around the world. Policy
 Research Working Papers are also posted on the Web at http://www.worldbank.org/research. The author may be contacted
 at aforesti@worldbank.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
        Estimation of the ex ante Distribution of Returns
            for a Portfolio of U.S. Treasury Securities
                        via Deep Learning
                                Andrea Foresti
                               The World Bank




JEL: C45, C58, G17
Keywords: Machine Learning, Neural Networks, Convolution, LSTM, Market Risk
1 Introduction
Deep learning is a branch of machine learning that may be loosely defined as a collection of statistical algorithms that feature different layers of operations (usually more than three). These algorithms can be used in either a supervised or unsupervised manner.
Deep learning methods are being actively explored for a variety of practical purposes and have, in recent years, found considerable success. In particular, these methods can be used for both regression (e.g. see [Lathuilière et al., 2018]) and classification tasks (e.g. see [Krizhevsky et al., 2012]). Both these fields of application are very significant in the context of finance.

Moreover, deep learning is of particular interest in all those tasks that present complex input and output data, as the data coming into the model, flowing from one layer to another and finally exiting as output can be arranged in complex multi-dimensional structures.
As we will later see, for instance, classification tasks often entail the specification of the level of confidence across the available classes, which allows a more nuanced answer than a simple binary (yes/no) attribution.
As a result of all these advantages, in the past few years deep learning has started to see fervent research and actual application in finance (as, for instance, recognized in [Board, 2017]).
As an illustration, we have so far seen examples in:

   • time series forecasting (e.g. [Sun et al., 2018])

   • portfolio construction (e.g. [Heaton et al., 2017])

   • credit analysis (e.g. for default prediction see [Hosaka, 2019] or [Hamori et al., 2018])

   There has been, however, limited publicly available research in the field of Market Risk.




2 Deep neural networks
A deep neural network is a deep learning model composed of several layers of computations
that are loosely inspired by the neurological structure of the brain.
The layers consist of different components (neurons) that are connected to the output of the previous layer according to some pattern. Every layer enacts a transformation on both the content and the shape of the data that flow through it.
The oldest and most common type of layer is the so-called dense or fully connected layer. This type of layer transforms the data by multiplying its inputs by a weight matrix and, optionally, adding a fixed vector (called the bias) to the result:

\[ X_l = X_{l-1} \cdot W_{l-1}^{l} + b_l \tag{1} \]

where $X_l$ is the output of the $l$-th layer in the network, $W_{l-1}^{l}$ is the weight matrix that transforms the data from layer $l-1$ to layer $l$ and $b_l$ is the bias vector. From equation 1 we see that only one bias vector is applied in every layer, while the size of the weight matrix is determined by the dimensions of the input and of the desired output.
The output data of a layer are then usually transformed by a function operating element-wise (called the activation function).
The most commonly used activation functions are:

                     Activation    Definition
                     Binary        $f(x) = 0$ for $x < 0$; $f(x) = 1$ for $x \geq 0$
                     Sigmoid       $f(x) = \frac{1}{1 + e^{-x}}$
                     TanH          $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
                     ReLU          $f(x) = 0$ for $x < 0$; $f(x) = x$ for $x \geq 0$

                               Table 1: Common activations


    When the outputs of a layer are instead directly passed to the following layers, this is
normally referred to as a linear activation.
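
As a minimal sketch (not the implementation used in this paper), the dense layer of equation 1 and the activations of Table 1 can be written with NumPy as follows. The function names and the toy numbers are illustrative assumptions.

```python
import numpy as np

# Activation functions from Table 1, applied element-wise.
def binary(x):
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.where(x < 0, 0.0, x)

def dense(x_prev, weights, bias, activation=None):
    """Equation 1 followed by an optional activation (None = linear)."""
    z = x_prev @ weights + bias
    return z if activation is None else activation(z)

# Toy example (hypothetical values): one sample, 3 inputs, 2 outputs.
x = np.array([[1.0, -2.0, 0.5]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -2.0]])
b = np.array([0.5, 0.5])

linear_out = dense(x, W, b)        # linear activation
relu_out = dense(x, W, b, relu)    # negative outputs are set to zero
```

The weight matrix has shape (inputs, outputs), which is what fixes its size once the input and output dimensions are chosen.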
What is interesting is that both continuous and discontinuous activation functions are attractive for different purposes during learning.
The fully connected layer works very well for a variety of different tasks but is poorly suited to the analysis of data that are structured in multiple dimensions (such as time-dependent multivariate variables).
While, in theory, a deep neural network composed only of fully connected layers would be able to tackle such a problem, it would in practice present two main problems:

  1. in order to be able to learn effectively, the model would need to be of a considerable
     size (at a minimum hundreds of thousands of parameters);

  2. such a model design would have a tendency to overfit the training data.

    The first obstacle has been successfully tackled by recent research developments (starting with Geoffrey Hinton's seminal paper [Hinton et al., 2006]) and the availability of ever-greater computing power. The second one, however, has prompted research into different learning architectures that are better suited to extract meta-features from the input data.
    One particular solution has been to introduce specialized types of layers to work alongside the fully connected ones.

2.1 Recurrent neural networks and LSTM
A first approach to modeling time-dependent factors is to use a so-called recurrent neural network (RNN). Recurrent neural networks have the ability to consider time-dependent data, as they work in a way that allows the same set of weights to be applied sequentially to all the items contained in a sample.
RNNs have traditionally been very hard to train and thus they were not used extensively. This changed with the introduction of the Long Short-Term Memory (LSTM) model [Hochreiter and Schmidhuber, 1997]. The LSTM model has proven very successful in natural language processing and has spurred many applications such as text prediction.

   The center of the LSTM model is the LSTM cell. A variable number of these cells comprise an LSTM layer, either working in parallel or stacked sequentially.




                                Figure 1: LSTM model cell

    Figure 1 shows that the LSTM model cell takes as inputs both the current item in the data sequence ($x_t$, which can itself be a vector) and the result of the computation on the preceding data item ($k_{t-1}$). There are also some additional data, the hidden state of the model, that are computed and passed on to the next element of the data sequence. The ability of the LSTM model to continually update its state across the input sequence is what allows it to better fit sequential data.
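
To make the flow of data through the cell more concrete, here is a minimal NumPy sketch of one step of an LSTM cell. It assumes the commonly used formulation with input, forget and output gates, which may differ in detail from the cell drawn in Figure 1. The names `k_prev` (previous output) and `state_prev` (the additional state passed along the sequence), the weight layout and the sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, k_prev, state_prev, W, U, b):
    """One LSTM step. W: (4h, d), U: (4h, h), b: (4h,).
    Rows are stacked as [input gate, forget gate, output gate, candidate]."""
    h = k_prev.shape[0]
    z = W @ x_t + U @ k_prev + b
    i = sigmoid(z[0:h])            # how much new information to store
    f = sigmoid(z[h:2 * h])        # how much of the old state to keep
    o = sigmoid(z[2 * h:3 * h])    # how much of the state to expose
    g = np.tanh(z[3 * h:4 * h])    # candidate values
    state_t = f * state_prev + i * g
    k_t = o * np.tanh(state_t)
    return k_t, state_t

# Hypothetical sizes: 9 inputs per time step, 4 units in the cell.
rng = np.random.default_rng(0)
d, h = 9, 4
W = rng.normal(scale=0.1, size=(4 * h, d))
U = rng.normal(scale=0.1, size=(4 * h, h))
b = np.zeros(4 * h)

k, state = np.zeros(h), np.zeros(h)
for x_t in rng.normal(size=(5, d)):   # the same weights are reused at every step
    k, state = lstm_cell_step(x_t, k, state, W, U, b)
```

The loop makes explicit that the same set of weights is applied sequentially to every item of the sequence, while `k` and `state` carry information forward.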


2.2 Convolution
A different approach, the use of convolutions, has proven very successful in the field of computer vision and especially object recognition. The use of convolutional neural networks to analyze multivariate time-dependent data appears enticing since these two domains (images and financial time-series data) share similarly shaped data (2D, potentially stacked).
In both cases, the data are likely to contain meta-features embedded in the interdependencies across the x and y dimensions.
A convolution operation, in the context of deep learning, is defined as the repeated application of a filter (kernel) with fixed weights to different areas of the input data.
More precisely, the output of a convolution operation with kernel size (R,C) and stride S is defined as:

\[ \mathrm{Output}_{i,j} = \sum_{r=0}^{R-1} \sum_{c=0}^{C-1} \mathrm{Input}_{i \cdot S + r,\; j \cdot S + c} \cdot \mathrm{Filter}_{r,c} \tag{2} \]
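
The double sum in equation (2) can be sketched directly in NumPy. This is an illustrative, unoptimized version that assumes no padding (so the filter is only applied where it fits entirely inside the input); the function name and the toy input are assumptions.

```python
import numpy as np

def conv2d_valid(inp, filt, stride=1):
    """Apply `filt` (R x C) to `inp` as in equation (2), without padding."""
    R, C = filt.shape
    out_rows = (inp.shape[0] - R) // stride + 1
    out_cols = (inp.shape[1] - C) // stride + 1
    out = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            window = inp[i * stride:i * stride + R, j * stride:j * stride + C]
            out[i, j] = np.sum(window * filt)   # sum over r and c
    return out

# Toy example: a 4 x 4 input, a 2 x 2 filter of ones and stride 2.
inp = np.arange(16, dtype=float).reshape(4, 4)
filt = np.ones((2, 2))
out = conv2d_valid(inp, filt, stride=2)
```

With these inputs each output element is the sum of one non-overlapping 2 x 2 block, so the 4 x 4 input is reduced to a 2 x 2 output.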



    Convolutions are particularly powerful when used in sequence, as they can achieve a significant dimensionality reduction on the data, while automatically identifying and extracting the most significant features (see e.g. [Krizhevsky et al., 2012]).
In particular, there is ample evidence of the effectiveness of using several convolutions paired with data pooling (i.e. applying a filter that iterates on the data similarly to a convolution, but that reduces the data in each window to a scalar using a function such as the average or maximum value).
Another advantage of convolutional neural networks is that they are faster to train than regular feed-forward networks composed solely of fully connected layers. On the other hand, their power often translates into a tendency to overfit the data (i.e. fit the noise in the training data set while having poor performance in the out-of-sample portion of the data).
However, several solutions to the problem of overfitting have been proposed, including:


   • better feature selection;
   • expansion of the number of training cases;
   • input data normalization;
   • parameter regularization;
   • dropout;
   • batch normalization.

2.3 Loss functions
The models presented in this paper all belong to the family of supervised learning methods. This means that the parameters of the model are determined so as to minimize a specific error ('loss') quantity.
The loss function quantifies this error in terms of the distance between a value estimated by the model and the actual ('true') value recorded for the observation.
Common loss functions used in machine learning are:
   • Mean Square Error (L2 loss)
   • Mean Absolute Error (L1 loss)
   • Hinge Loss (Maximum margin loss)
   • Cross Entropy Loss
The actual function used depends on the particular problem being studied. For instance, the first two losses are used in regression tasks whereas the last two are used in classification models.
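
For illustration, the first three losses in the list can be sketched in a few lines of NumPy (the cross-entropy loss is defined in section 3.5). The hinge loss is written here for labels coded as -1/+1, which is an assumption about the encoding; the toy values are hypothetical.

```python
import numpy as np

def mean_square_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def hinge_loss(labels, scores):
    """Labels are assumed to be -1 or +1; scores are raw model outputs."""
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))

# Regression example.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])
mse = mean_square_error(y_true, y_pred)     # (0 + 0.25 + 1) / 3
mae = mean_absolute_error(y_true, y_pred)   # (0 + 0.5 + 1) / 3

# Classification example: both observations are on the correct side,
# but the first one lies inside the margin and is still penalized.
hinge = hinge_loss(np.array([1.0, -1.0]), np.array([0.5, -2.0]))
```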

2.4 Data and parameter regularization
Input data are routinely scaled in deep learning in order to achieve faster and better convergence of the model weights. This technique is effectively used to prevent model underfit (as opposed to parameter regularization, which aims to prevent overfit).
It is, for instance, common practice to use either MinMax scaling or Standard scaling. In the first case, the data are rescaled between two arbitrary values (usually zero and one). In the second case, the data are rescaled to have zero mean and unit variance.
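
The two scaling schemes can be sketched as follows. This is a minimal column-wise version written for illustration; the function names are assumptions, and a constant column would need special handling since it leads to a division by zero.

```python
import numpy as np

def minmax_scale(x, low=0.0, high=1.0):
    """Rescale every column of x to the interval [low, high]."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return low + (x - x_min) * (high - low) / (x_max - x_min)

def standard_scale(x):
    """Rescale every column of x to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Hypothetical data: 3 observations of 2 features on very different scales.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])
scaled_minmax = minmax_scale(data)
scaled_standard = standard_scale(data)
```

In practice the minimum, maximum, mean and standard deviation would be estimated on the training data only and then reused on the validation data.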




2.4.1   L1 and L2 norms
When the problem is, on the other hand, model overfit, kernel weights, biases and layer outputs are instead regularized. A common way to do that is via L1 and L2 norms. These are added to the loss function (scaled by an arbitrary parameter λ) in order to discourage their target from becoming too big.
More precisely (given a vector β of parameters):

\[ \begin{aligned} L_1 &= \mathrm{Loss}(\beta) + \lambda \|\beta\|_1 = \mathrm{Loss}(\beta) + \lambda \sum_{i=0}^{n} |\beta_i| \\ L_2 &= \mathrm{Loss}(\beta) + \lambda \|\beta\|_2^2 = \mathrm{Loss}(\beta) + \lambda \sum_{i=0}^{n} \beta_i^2 \end{aligned} \tag{3} \]
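
A short numerical sketch of equation (3), with hypothetical values for the parameters, the base loss and λ:

```python
import numpy as np

def l1_penalty(beta, lam):
    return lam * np.sum(np.abs(beta))

def l2_penalty(beta, lam):
    return lam * np.sum(beta ** 2)

beta = np.array([0.5, -1.0, 2.0])   # hypothetical parameter vector
lam = 0.01                          # hypothetical regularization strength
base_loss = 1.0                     # hypothetical unregularized loss

loss_l1 = base_loss + l1_penalty(beta, lam)   # 1 + 0.01 * 3.5
loss_l2 = base_loss + l2_penalty(beta, lam)   # 1 + 0.01 * 5.25
```

The L2 penalty grows quadratically, so it weighs the largest parameter (2.0) much more heavily than the L1 penalty does.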


2.4.2   Batch normalization
Batch normalization (see [Ioffe and Szegedy, 2015]) works by normalizing a layer's activations (i.e. subtracting the batch mean and dividing by the batch standard deviation) using the values recorded during each mini-batch. It is another method that has been shown to improve model generalization.
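
The normalization step can be sketched as below. This covers only the forward pass during training; the scale `gamma`, shift `beta` and the small constant `eps` are not discussed in the text above and are included as assumptions, following the usual formulation of the method.

```python
import numpy as np

def batch_norm(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each column of a (batch, units) array over the mini-batch."""
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    normalized = (activations - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta

# Hypothetical mini-batch: 3 samples, 2 units with very different scales.
batch = np.array([[1.0, 50.0],
                  [2.0, 60.0],
                  [3.0, 70.0]])
normed = batch_norm(batch)
```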


2.4.3   Dropout
Introducing node dropout is yet another method used to achieve better model robustness.
In this case a fraction p of the outputs of a pre-determined layer is randomly selected at every gradient update and the values of those nodes are set to zero.
For best effectiveness, the dropout fraction p is normally set to a value around 0.5.
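
A minimal sketch of dropout applied to the outputs of a layer during training. Rescaling the surviving outputs by 1 / (1 - p), so that their expected value is unchanged, is a common convention and is an assumption here, as it is not described in the text above.

```python
import numpy as np

def dropout(outputs, p, rng):
    """Set each output to zero with probability p and rescale the others."""
    keep_mask = rng.random(outputs.shape) >= p
    return outputs * keep_mask / (1.0 - p)

rng = np.random.default_rng(42)
layer_outputs = np.ones(10_000)           # hypothetical layer outputs
dropped = dropout(layer_outputs, p=0.5, rng=rng)
share_zeroed = np.mean(dropped == 0.0)    # close to p
```

A new mask is drawn at every call, which corresponds to selecting a different set of nodes at every gradient update.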




3 An application of deep learning on a portfolio of U.S. Treasury bonds
I here analyze the ability of several deep learning architectures to accurately predict the future distribution of returns of a linear portfolio of US Treasury exposures. Such a problem is of particular interest in the field of Market Risk management.
It has to be noted here that a classical value-at-risk (VaR) analysis is not immediately amenable to the types of models presented here. This is because the use of the classical backtesting performance measure (i.e. the distance between predicted and actual breaches over the observations) as a loss function during training would invariably lead the optimizer to the trivial scalar solution.
In fact, if we define VaR for a variable X as:

\[ \mathrm{VaR}_\alpha(X) = \inf \{ x \in \mathbb{R} : F_X(x) > \alpha \} \tag{4} \]

and I backtest over a set of n training observations using the actual portfolio returns, we see that a model that consistently predicted the (n · α)-th worst return would indeed achieve a perfect score.
In other words, the model would not learn from the data, instead moving to the (fixed) solution that corresponds to the number of breaches that should be observed during training. This clearly would not have any generalization value. For this reason, it is a lot more appealing to consider the whole future distribution, instead of a single statistic calculated on it.
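
The degenerate solution can be illustrated numerically. In this sketch the returns are simulated (they are not the data used in this paper), and a breach is counted whenever the realized return is at or below the predicted threshold, which is an assumption about the backtest convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 1000, 0.01
returns = 0.01 * rng.standard_normal(n)   # simulated portfolio returns

expected_breaches = int(n * alpha)        # 10 breaches expected at the 1% level

# A "model" that ignores its inputs and always predicts the same number:
# the (n * alpha)-th worst return in the training sample.
constant_prediction = np.sort(returns)[expected_breaches - 1]

observed_breaches = int(np.sum(returns <= constant_prediction))
```

The constant prediction produces exactly the expected number of breaches in-sample, so a loss based only on the breach count cannot distinguish it from a model that actually reacts to the data.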


3.1 Portfolio
The portfolio weights for this analysis are fixed in advance as follows:

                     Term      3M    6M    1Y    2Y    3Y    5Y    7Y     10Y    30Y
                                                                                         (5)
                     Weight    0.1   0.1   0.1   0.3   0.2   0.1   0.05   0.03   0.02
They will be kept constant across all experiments. A relaxation of this constraint is certainly deserving of future research.




3.2 Market data
I consider 10 years of daily data for US Treasury zero rates (provided by Bloomberg):



                                  Term    Ticker
                                  3M      I02503M Index
                                  6M      I02506M Index
                                  1Y      I0251Y Index
                                  2Y      I0252Y Index
                                  3Y      I0253Y Index
                                  5Y      I0255Y Index
                                  7Y      I0257Y Index
                                  10Y     I02510Y Index
                                  30Y     I02530Y Index

                                    Table 2: Zero rates

   Using these rates, I compute the daily returns on the corresponding zero-coupon bonds:

\[ r_{t-1,t} = \frac{e^{-r_t \cdot \tau}}{e^{-r_{t-1} \cdot \tau}} \tag{6} \]

where $r_t$ is the zero rate observed on day $t$ and $\tau$ is the term of the rate.
These returns are then arranged in a [500 x 9] matrix for every observation. In other words, every matrix contains the latest 500 observations of these returns.
The choice of the past 500 observations is motivated by the widely adopted convention in market risk of using the previous two years of returns for the estimation of the parameters used for ex ante risk calculations.
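
The construction of the input matrices can be sketched as follows. The rates below are simulated rather than taken from Bloomberg, they are assumed to be expressed as decimals, and the terms are assumed to be measured in years; the function names are illustrative.

```python
import numpy as np

# Terms of the nine zero rates, in years (3M ... 30Y).
terms = np.array([0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 30.0])

def zero_coupon_returns(zero_rates, terms):
    """Equation (6): ratio of consecutive zero-coupon prices, per term."""
    prices = np.exp(-zero_rates * terms)   # shape (days, 9)
    return prices[1:] / prices[:-1]        # shape (days - 1, 9)

def rolling_windows(returns, window=500):
    """Stack the latest `window` rows available at each observation date."""
    last = returns.shape[0]
    return np.stack([returns[i - window:i] for i in range(window, last + 1)])

# Simulated rates: 600 days of random-walk zero rates around 2%.
rng = np.random.default_rng(1)
rates = 0.02 + 0.0005 * rng.standard_normal((600, 9)).cumsum(axis=0)

returns = zero_coupon_returns(rates, terms)   # (599, 9)
windows = rolling_windows(returns)            # (100, 500, 9)
```

Each element of `windows` is one [500 x 9] observation, and consecutive observations overlap in all but one row.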

In addition to the return matrix, some model versions presented here augment the returns data with cross-sectional data. These models are referred to in this document as 'mixed' models.
The basic data used in these cases are presented in Table 3.




                     Description               Ticker
                     US CPI YOY                CPI YOY Index
                     EU CPI YOY                ECCPEMUY Index
                     JP CPI YOY                JNCPIYOY Index
                     EUR 10yr swap rate        EUSA10 Curncy
                     GBP 10yr swap rate        BPSW10 Curncy
                     USD 10yr swap rate        USSW10 Curncy
                     JPY 10yr swap rate        JYSWAP10 Curncy
                     AUD 10yr swap rate        ADSWAP10 Curncy
                     3M USD Zero rate          I02503M Index
                     6M USD Zero rate          I02506M Index
                     1Y USD Zero rate          I0251Y Index
                     2Y USD Zero rate          I0252Y Index
                     3Y USD Zero rate          I0253Y Index
                     5Y USD Zero rate          I0255Y Index
                     7Y USD Zero rate          I0257Y Index
                     10Y USD Zero rate         I02510Y Index
                     30Y USD Zero rate         I02530Y Index
                     10Y US Treasury yield     USGG10YR Index
                     2Y US Treasury yield      USGG2YR Index

                           Table 3: Base cross-sectional data

    In addition, some simple computed features are constructed using the cross-sectional
variables as building blocks (Table 4).




                                           Formula
                10Y US Treasury yield         -       2Y US Treasury yield
                USD CPI                      avg      50d
                USD CPI                      avg      100d
                USD CPI                      avg      200d
                3M USD Zero rate             avg      50d
                6M USD Zero rate             avg      50d
                1Y USD Zero rate             avg      50d
                2Y USD Zero rate             avg      50d
                3Y USD Zero rate             avg      50d
                5Y USD Zero rate             avg      50d
                7Y USD Zero rate             avg      50d
                10Y USD Zero rate            avg      50d
                30Y USD Zero rate            avg      50d

                          Table 4: Calculated cross-sectional data

    The rationale behind pre-computing certain features is to achieve a better economy in model estimation.
Another data source provided to some versions of the models is the empirical (historical) portfolio distribution. It is calculated by using the returns matrix associated with each observation to compute a vector of portfolio returns. The histogram of these returns is then used (with every portfolio return carrying equal weight).
For T observations of a returns vector with members $r_t$ and a vector $L = [l_1 \ldots l_n]$ of bucket delimiters, the empirical distribution is defined as the vector:

\[ H = \begin{bmatrix} h_1 \\ \vdots \\ h_{n-1} \end{bmatrix} \quad \text{where} \quad h_i = \sum_{t=1}^{T} I_i^t, \qquad I_i^t = \begin{cases} 1 & \text{if } l_i < r_t \leq l_{i+1} \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

The buckets are delimited according to this rule:

\[ \left[ -\infty;\; \left\{ \min(R) + \frac{\max(R) - \min(R)}{20} \cdot i \;\; \forall i \in [0..20] \right\};\; \infty \right] \tag{8} \]

where R is the vector of one-day portfolio returns (defined as $r_t \cdot position$, where $position$ is the vector in (5)).
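
Equations (7) and (8) can be sketched as follows. The asset returns are simulated, and the helper names are assumptions. The bucket search is written so that each bucket is open on the left and closed on the right, as in equation (7).

```python
import numpy as np

weights = np.array([0.1, 0.1, 0.1, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02])

def bucket_edges(portfolio_returns, n_steps=20):
    """Equation (8): 21 equally spaced delimiters between min and max,
    plus -inf and +inf at the two ends."""
    low, high = portfolio_returns.min(), portfolio_returns.max()
    inner = low + (high - low) / n_steps * np.arange(n_steps + 1)
    return np.concatenate(([-np.inf], inner, [np.inf]))

def empirical_distribution(portfolio_returns, edges):
    """Equation (7): number of returns with l_i < r_t <= l_(i+1), per bucket."""
    bucket = np.searchsorted(edges, portfolio_returns, side="left") - 1
    return np.bincount(bucket, minlength=len(edges) - 1)

# Simulated [500 x 9] matrix of one-day returns for one observation.
rng = np.random.default_rng(2)
asset_returns = 0.001 * rng.standard_normal((500, 9))
portfolio_returns = asset_returns @ weights

edges = bucket_edges(portfolio_returns)
hist = empirical_distribution(portfolio_returns, edges)
```

The 23 delimiters define 22 buckets. Dividing `hist` by the number of observations would turn the counts into frequencies that sum to one.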

3.3 Target data
The model will try to forecast the distribution of the actual (forward-looking) 10-day
returns of the portfolio.

These returns are calculated as:
\[ \mathrm{Ret}^{ptf}_{i} = \prod_{t=i+1}^{i+11} \left( 1 + r^{t}_{i-1} \right) \cdot position \tag{9} \]

The returns range is then divided into buckets that are delimited using the same rule defined in equation (8). These buckets can be seen as the classes to which the models will have to assign probabilities.


3.4 Softmax activation
The final layer of all the architectures presented here is a fully connected layer with a softmax activation function. This particular activation outputs a vector whose values sum to one.
The softmax transformation of a vector x (with n elements) into an output vector y is defined as:

\[ y_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \quad \text{for } i = 1..n \tag{10} \]

The softmax activation function is widely used in classification applications. It allows the model output to be interpreted as the confidence that the state of the world (i.e. future portfolio returns) belongs to each of the available classes.
This means that, in the context of this analysis, the output classes are defined as the discrete bins of the returns distribution.
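
Equation (10) can be sketched in NumPy as shown below. Subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the result unchanged; it is an implementation choice made here for illustration.

```python
import numpy as np

def softmax(x):
    """Equation (10), computed in a numerically stable way."""
    shifted = x - np.max(x)     # does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([1.0, 2.0, 3.0])   # hypothetical outputs of the last dense layer
probs = softmax(logits)              # non-negative values that sum to one
```

Larger inputs receive larger probabilities, and the ordering of the inputs is preserved in the output.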

3.5 Loss function
The loss function used to evaluate the model is the categorical cross-entropy. Over n observations on a model that outputs C classes (i.e. return buckets), cross-entropy (E) is defined in this context as:

\[ E = -\sum_{i=1}^{n} \sum_{c=1}^{C} I_c \ln(p_i(c)) \tag{11} \]

with
    $I_c$: indicator, equal to 0 if class c does not contain the actual return and to 1 if class c contains the actual return;
    $p_i(c)$: probability assigned by the model on sample i for class c.

The choice of this particular loss function results in a loss of zero for a model with perfect foresight, and in an infinite loss whenever a zero probability is wrongly assigned to the class that contains the actual return.
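
The two limiting cases described above can be checked with a small sketch of equation (11). Since the indicator is one only for the class that contains the actual return, the double sum reduces to the log-probability assigned to that class. The function name and the toy inputs are assumptions.

```python
import numpy as np

def categorical_cross_entropy(true_classes, probs):
    """Equation (11). `true_classes[i]` is the index of the bucket that
    contains the actual return of sample i; `probs` has shape (n, C)."""
    n = len(true_classes)
    picked = probs[np.arange(n), true_classes]
    with np.errstate(divide="ignore"):    # log(0) is allowed and gives -inf
        return -np.sum(np.log(picked))

true_classes = np.array([0, 2])

perfect = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])
wrong_zero = np.array([[0.0, 1.0, 0.0],   # zero probability on the true class
                       [0.0, 0.0, 1.0]])

loss_perfect = categorical_cross_entropy(true_classes, perfect)
loss_wrong = categorical_cross_entropy(true_classes, wrong_zero)
```

For reference, a model that spreads its probability uniformly over 22 buckets would incur a loss of ln(22), roughly 3.09, per sample.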



3.6 Models
The models were implemented using a Keras 2.2.4 front-end (see [Chollet et al., 2015]) and a TensorFlow 1.12.0 (see [Abadi et al., 2015]) back-end. The first model tested is a simple stacked LSTM model where three LSTM layers operating in sequence are followed by three dense layers:




                             Figure 2: LSTM implementation

    The second model is a 2D convolutional model with max pooling. The kernel size for the first convolution is (20,3): this is done to recognize longer-term dependencies in the returns data. The following convolutional layers apply an almost-standard (3,3) convolution filter.
The convolutions use a 'valid' padding strategy with a unit stride. This results in a fast loss of dimensionality across the returns data.
No regularization was applied.
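
The loss of dimensionality can be worked out with simple arithmetic. The sketch below only tracks the effect of the convolutions on a [500 x 9] input; the max pooling layers are left out because their sizes are not stated in the text, so the shapes shown are not the actual shapes inside the model.

```python
def valid_conv_output_shape(input_shape, kernel_shape, stride=1):
    """Output size of a convolution with 'valid' padding, per dimension."""
    return tuple((n - k) // stride + 1 for n, k in zip(input_shape, kernel_shape))

input_shape = (500, 9)                                          # days x terms
after_first = valid_conv_output_shape(input_shape, (20, 3))     # (481, 7)
after_second = valid_conv_output_shape(after_first, (3, 3))     # (479, 5)
after_third = valid_conv_output_shape(after_second, (3, 3))     # (477, 3)
```

With only nine columns, the cross-sectional dimension is exhausted after a handful of such layers, while the time dimension shrinks much more slowly.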




                      Figure 3: Simple convolution implementation

   The third model is an enhanced version of the LSTM model. This iteration employs the original LSTM specification but is augmented by the cross-sectional data that enter through a series of fully connected layers.
The idea behind this extension is to give the model a chance to anchor the predictions to state data or, in other words, to consider different 'regimes'.




                         Figure 4: Mixed LSTM implementation

   The same augmentation is applied to the original convolutional model.




                      Figure 5: Mixed convolution implementation

    The mixed LSTM and convolutional models are then again augmented using the historical distribution of portfolio returns calculated using the past 500 zero-coupon returns available in each observation.
The distribution is segmented using the same buckets used for the output layer and it enters the models through a separate set of dense layers.
When the historical portfolio distribution is added to the mixed LSTM model, I obtain the 'mixed historical LSTM' model (Figure 6).




                  Figure 6: Mixed historical LSTM implementation

   The mixed convolutional model becomes the 'mixed historical convolutional' model when the empirical historical portfolio distribution is added to it (Figure 7).




                Figure 7: Mixed historical convolutional implementation

    The only regularization applied is one pass of the MinMax data scaler for the cross-sectional data (when used by the model).

3.7 Benchmark models
In order to better assess the performance of the models introduced here, a couple of reference approaches are also considered.
The first one is a classic multivariate Gaussian parametric model. It is calculated by computing a covariance matrix from the return matrix of each observation. An exponential weighting with a decay factor of 0.94 is used for the estimation. The expected return is fixed at zero.
The covariance matrix is used in conjunction with the portfolio weights to determine a forward-looking portfolio standard deviation value, thus determining the ex ante portfolio distribution.
The second benchmark used is the historical distribution of portfolio returns described in equation 7.
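
The parametric benchmark can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the inputs are simulated daily returns expressed as deviations from zero, the one-day standard deviation is scaled to the 10-day horizon with the square root of time, and the bucket delimiters are a hypothetical symmetric grid.

```python
import math
import numpy as np

weights = np.array([0.1, 0.1, 0.1, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02])

def ewma_covariance(returns, decay=0.94):
    """Exponentially weighted covariance with zero expected return.
    The most recent row (the last one) receives the largest weight."""
    T = returns.shape[0]
    w = decay ** np.arange(T - 1, -1, -1)
    w = w / w.sum()
    return (returns * w[:, None]).T @ returns

def gaussian_bucket_probabilities(sigma, edges):
    """Probability of each bucket under a zero-mean Gaussian."""
    cdf = np.array([0.5 * (1.0 + math.erf(e / (sigma * math.sqrt(2.0))))
                    for e in edges])
    return np.diff(cdf)

rng = np.random.default_rng(3)
daily_returns = 0.001 * rng.standard_normal((500, 9))    # simulated

cov = ewma_covariance(daily_returns)
sigma_1d = math.sqrt(weights @ cov @ weights)
sigma_10d = sigma_1d * math.sqrt(10)                      # assumed scaling

inner = np.linspace(-3 * sigma_10d, 3 * sigma_10d, 21)    # hypothetical grid
edges = np.concatenate(([-np.inf], inner, [np.inf]))
probs = gaussian_bucket_probabilities(sigma_10d, edges)
```

The resulting vector has one probability per bucket and can therefore be scored with the same cross-entropy loss as the neural network outputs.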



4 Results
The models are tested using a randomized 0.8/0.2 split of the data into training and validation sets. This results in 2,068 training cases and 517 validation cases.
The training data set is further randomized at the beginning of each training epoch.
The optimization was run using an Adam optimizer (with the original parameters introduced in [Kingma and Ba, 2014]) and a minibatch size of 128. Each model was run with a so-called 'patience' parameter of 10 for early termination. This means that the optimization would continue for 10 epochs after the validation loss stopped improving before terminating.
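
The effect of the patience parameter can be sketched in plain Python. This is a simplified illustration of the stopping rule and not the Keras implementation; the function name and the sequence of validation losses are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=10):
    """Return (epoch at which training stops, epoch of the best loss)."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch       # no improvement for `patience` epochs
    return len(validation_losses) - 1, best_epoch

# Hypothetical run: the loss improves for three epochs and then stalls.
losses = [2.5, 2.2, 2.0] + [2.1] * 15
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

In this example the best validation loss is reached at epoch 2 (counting from zero) and training stops at epoch 12, ten epochs later.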




                    Figure 8: Training and validation loss per epoch



   I now choose the candidate version of each model using these criteria (in order of decreasing importance):


   • least validation loss

   • least training loss

   • latest epoch

After training the models presented here, this resulted in choosing these models:

         Model                             Epoch     Training loss   Validation loss
         LSTM                                07           2.05            2.00
         Convolutional                       22           2.05            2.00
         Mixed LSTM                          22           1.98            1.95
         Mixed convolutional                 19           2.05            2.00
         Mixed historical LSTM               44           1.76            1.80
         Mixed historical convolutional      49           1.66            1.76
         Parametric                           -           2.19            2.17
         Historical                           -           2.07            2.02

                             Table 5: Optimal model performance

    In order to better judge the models, the shape of the produced distribution is considered.
It is indeed very important to have models that produce meaningful distribution shapes and that react to different data.
Machine learning models have a natural tendency to quickly converge onto degenerate solutions that nevertheless have a strong explanatory power (e.g. constant solutions).
This, in a way, could be considered as a special case of overfit.
So, to measure the variability of the distributions produced by each model, I consider the Bhattacharyya distance (introduced in [Bhattacharyya, 1946]):

\[ BD = -\ln \left( \sum_{x \in X} \sqrt{p(x)\, q(x)} \right) \tag{12} \]

For every model, I compute the Bhattacharyya distance between the latest available returns distribution and all the other distributions generated over the rest of the data.
The results are illustrated in Figure 9.
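
Equation (12) can be sketched for two discrete distributions defined over the same buckets. The example distributions are hypothetical.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Equation (12) for two discrete distributions over the same buckets."""
    coefficient = np.sum(np.sqrt(p * q))
    with np.errstate(divide="ignore"):    # no overlap gives an infinite distance
        return -np.log(coefficient)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

distance_same = bhattacharyya_distance(p, p)        # identical distributions
distance_pq = bhattacharyya_distance(p, q)          # similar distributions
distance_disjoint = bhattacharyya_distance(np.array([1.0, 0.0]),
                                           np.array([0.0, 1.0]))
```

A model that always outputs the same distribution would show a distance of zero against every other date, which is the degenerate behavior this measure is meant to reveal.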




Figure 9: Bhattacharyya distance from latest distribution

   In order to offer a better comparison, the same distance is also applied to the empirical portfolio distributions (i.e. the results of the historical model).

It is clearly possible to appreciate how the complexity of the model, while maybe not significantly improving the training and validation performance, allows the creation of more reactive models.
This hints at the possibility that more complex models might offer better generalization power.
In particular, we see that the naive application of both the LSTM and the convolutional model results in degenerate behavior of the distance among the predicted distributions.

For the sake of illustration, Figure 10 presents the distributions predicted by each model for the latest available observation in the data set.




Figure 10: Latest predicted distribution




5 Conclusion
The results I obtained corroborate the notion that the use of deep learning methods for financial applications is viable. In particular, a finding which is consistent with the existing deep learning literature is that models need to cross a complexity threshold in order to achieve significant efficacy.
Whether the models presented here have already achieved this critical mass is, though, not clear.
Furthermore, it has to be underlined that these tools are still poorly understood and much more difficult to calibrate than more traditional methods. It is not immediately apparent which features are more important than others (especially considering the limited nature of these experiments).
Moreover, the long time to convergence for some of the models indicates that more could be done in order to speed up learning.

Finally, these results suggest a few opportunities for further research:

   • Introducing a larger number of factors onto which the portfolio is mapped;

   • Testing different model architectures, hyper-parameters and regularization techniques.
     The use of auto-encoders for dimensionality reduction is, in this context, particularly
     appealing in light of possible real-world applications (i.e. models featuring hundreds
     or thousands of different factors);

   • Exploring different cross-sectional data features;

   • Verifying the soundness of the model(s) for different portfolio structures;

   • Investigating the use of different penalty functions that would be of greater interest in
     the field of risk measurement (e.g. weighted cross-entropy).
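
As an illustration of the last item, the sketch below shows one possible form of a weighted cross-entropy. It is an assumption and not a penalty function tested in this paper: the per-bin weights, the five-bin layout (ordered from the worst losses to the best gains) and all names are hypothetical, chosen only to show how errors in the loss tail could be penalized more heavily.

```python
import numpy as np


def weighted_cross_entropy(p_true, q_pred, weights, eps=1e-12):
    """Cross-entropy between p_true and q_pred with a separate weight for each bin."""
    p_true = np.asarray(p_true, dtype=float)
    # Clip predictions away from zero so that the logarithm stays finite.
    q_pred = np.clip(np.asarray(q_pred, dtype=float), eps, 1.0)
    weights = np.asarray(weights, dtype=float)
    return float(-np.sum(weights * p_true * np.log(q_pred)))


# Hypothetical target distribution over five return bins (bin 0 = worst losses).
p_true = np.array([0.10, 0.20, 0.40, 0.20, 0.10])

# Assumed weighting scheme: errors in the loss tail count more than errors elsewhere.
tail_weights = np.array([3.0, 1.5, 1.0, 1.0, 1.0])
uniform_weights = np.ones(5)

# Two mirror-image predictions: one underestimates the left tail, one the right tail.
miss_left_tail = np.array([0.02, 0.20, 0.48, 0.20, 0.10])
miss_right_tail = np.array([0.10, 0.20, 0.48, 0.20, 0.02])

loss_left = weighted_cross_entropy(p_true, miss_left_tail, tail_weights)
loss_right = weighted_cross_entropy(p_true, miss_right_tail, tail_weights)
```

With uniform weights the two predictions receive the same penalty, whereas with the tail weights the prediction that underestimates large losses is penalized more. How such weights should be set for a given risk measure is left open here.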




References
[Abadi et al., 2015] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C.,
  Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp,
  A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg,
  J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
  Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas,
  F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015).
  TensorFlow: Large-scale machine learning on heterogeneous systems. Software available
  from tensorflow.org.
[Bhattacharyya, 1946] Bhattacharyya, A. (1946). On a measure of divergence between two
  multinomial populations. Sankhyā: The Indian Journal of Statistics, pages 401–406.
[Board, 2017] Financial Stability Board (2017). Artificial intelligence and machine learning
  in financial services. November, available at:
  http://www.fsb.org/2017/11/artificialintelligence-and-machine-learning-in-financialservice/
  (accessed 30th January, 2018).
[Chollet et al., 2015] Chollet, F. et al. (2015). Keras. https://keras.io.
[Hamori et al., 2018] Hamori, S., Kawai, M., Kume, T., Murakami, Y., and Watanabe, C.
  (2018). Ensemble learning or deep learning? application to default risk analysis. Journal
  of Risk and Financial Management, 11(1):12.
[Heaton et al., 2017] Heaton, J., Polson, N., and Witte, J. H. (2017). Deep learning for
  finance: deep portfolios. Applied Stochastic Models in Business and Industry, 33(1):3–12.
[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning
  algorithm for deep belief nets. Neural computation, 18(7):1527–1554.
[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long
  short-term memory. Neural computation, 9(8):1735–1780.
[Hosaka, 2019] Hosaka, T. (2019). Bankruptcy prediction using imaged financial ratios
  and convolutional neural networks. Expert Systems with Applications, 117:287–299.
[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015). Batch normalization:
  Accelerating deep network training by reducing internal covariate shift. arXiv preprint
  arXiv:1502.03167.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic
  optimization. arXiv preprint arXiv:1412.6980.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet
  classification with deep convolutional neural networks. In Advances in neural information
  processing systems, pages 1097–1105.

[Lathuilière et al., 2018] Lathuilière, S., Mesejo, P., Alameda-Pineda, X., and Horaud, R.
  (2018). A comprehensive analysis of deep regression. arXiv preprint arXiv:1803.08450.

[Sun et al., 2018] Sun, S., Wei, Y., and Wang, S. (2018). Adaboost-lstm ensemble
  learning for financial time series forecasting. In International Conference on
  Computational Science, pages 590–597. Springer.



