Policy Research Working Paper 8790

Estimation of the ex ante Distribution of Returns for a Portfolio of U.S. Treasury Securities via Deep Learning

Andrea Foresti
The World Bank, Market and Counterparty Risk Team
March 2019

Abstract

This paper presents different deep neural network architectures designed to forecast the distribution of returns on a portfolio of U.S. Treasury securities. A long short-term memory model and a convolutional neural network are tested as the main building blocks of each architecture. The models are then augmented by cross-sectional data and the portfolio’s empirical distribution. The paper also presents the fit and generalization potential of each approach.

This paper is a product of the Market and Counterparty Risk Team. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/research. The author may be contacted at aforesti@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

JEL: C45, C58, G17
Keywords: Machine Learning, Neural Networks, Convolution, LSTM, Market Risk

1 Introduction

Deep learning is a branch of machine learning that may be loosely defined as a collection of statistical algorithms that feature several layers of operations (usually more than three). These algorithms can be used in either a supervised or an unsupervised manner.

Deep learning methods are being actively explored for a variety of practical purposes and have, in recent years, found considerable success. In particular, these methods can be used for both regression (e.g. see [Lathuilière et al., 2018]) and classification tasks (e.g. see [Krizhevsky et al., 2012]). Both fields of application are very significant in the context of finance.

Moreover, deep learning is of particular interest in all those tasks that present complex input and output data, as the data coming into the model, flowing from one layer to another and finally exiting as output can be arranged in complex multi-dimensional structures. As we will see later, for instance, classification tasks often entail the specification of a level of confidence across the available classes, thus allowing a more nuanced answer than a simple binary (yes/no) attribution.

As a result of all these advantages, in the past few years deep learning has become the subject of active research and practical application in finance (as recognized, for instance, in [Board, 2017]). As an illustration, we have so far seen examples in:

• time series forecasting (e.g. [Sun et al., 2018])
• portfolio construction (e.g. [Heaton et al., 2017])
• credit analysis (e.g. for default prediction see [Hosaka, 2019] or [Hamori et al., 2018])

There has been, however, limited publicly available research in the field of Market Risk.

2 Deep neural networks

A deep neural network is a deep learning model composed of several layers of computations that are loosely inspired by the neurological structure of the brain. The layers consist of different components (neurons) that are connected to the output of the previous layer according to some pattern. Every layer enacts a transformation on both the content and the shape of the data that flow through it.

The oldest and most common type of layer is the so-called dense or fully connected layer. This type of layer transforms its inputs by multiplying them by a weight matrix and, optionally, adding a fixed term (called the bias) to the result:

$$X_l = X_{l-1} \cdot W_{l-1}^{l} + b_l \tag{1}$$

where $X_l$ is the output of the $l$-th layer in the network, $W_{l-1}^{l}$ is the weight matrix that transforms the data from layer $l-1$ to layer $l$ and $b_l$ is the bias vector. From equation (1) we see that a single bias vector is applied in each layer, while the size of the weight matrix is determined by the dimensions of the input and of the desired output.

The output data of a layer are then usually transformed by a function operating element-wise (called the activation function). The most commonly used activation functions are:

Activation   Definition
Binary       f(x) = 0 for x < 0;  1 for x >= 0
Sigmoid      f(x) = 1 / (1 + e^{-x})
TanH         f(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})
ReLU         f(x) = 0 for x < 0;  x for x >= 0

Table 1: Common activations

When the outputs of a layer are instead passed directly to the following layers, this is normally referred to as a linear activation. Interestingly, both continuous and discontinuous activation functions are attractive for different purposes during learning.

The fully connected layer works very well for a variety of different tasks but is poorly suited to the analysis of data that are structured in multiple dimensions (such as time-dependent multivariate variables). While, in theory, a deep neural network composed only of fully connected layers would be able to tackle such a problem, it would in practice present two main problems:

1. in order to be able to learn effectively, the model would need to be of a considerable size (at a minimum hundreds of thousands of parameters);
2. such a model design would have a tendency to overfit the training data.

The first obstacle has been successfully tackled by recent research developments (starting with Geoffrey Hinton’s seminal paper [Hinton et al., 2006]) and the availability of ever-greater computing power. The second one, however, has prompted research into different learning architectures that are better suited to extract meta-features from the input data.

One particular solution has been to introduce specialized types of layers to work alongside the fully connected ones.

2.1 Recurrent neural networks and LSTM

A first approach to modeling time-dependent factors is to use a so-called recurrent neural network (RNN). Recurrent neural networks are able to handle time-dependent data because they apply the same set of weights sequentially to all the items contained in a sample.

RNNs have traditionally been very hard to train and were therefore not used extensively. This changed with the introduction of the Long Short-Term Memory (LSTM) model [Hochreiter and Schmidhuber, 1997]. The LSTM model has proven very successful in natural language processing and has spurred many applications that entail, for instance, text prediction.

The center of the LSTM model is the LSTM cell. A variable number of these cells comprise an LSTM layer, either working in parallel or stacked sequentially.

Figure 1: LSTM model cell

Figure 1 shows that the LSTM model cell takes as inputs both the current item in the data sequence ($x_t$, which can itself be a vector) and the result of the previous computation on the preceding data item ($k_{t-1}$). There are also some additional data, the hidden state of the model, that are computed and passed on to the next element of the data sequence.
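
To make the data flow through a cell more concrete, the following is a minimal sketch of one step of a scalar LSTM cell in its commonly used gated form (input, forget and output gates). It is not necessarily the exact cell depicted in Figure 1: the mapping of `k_prev` to $k_{t-1}$ and of `c_prev` to the state carried along the sequence, as well as all weight values, are assumptions made purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, k_prev, c_prev, params):
    """One step of a scalar LSTM cell (commonly used gated formulation).

    x_t    : current item of the sequence
    k_prev : output computed on the preceding item (k_{t-1} in the text)
    c_prev : internal state carried along the sequence (assumed mapping)
    params : hypothetical weights, one (w_x, w_k, b) triple per gate
    """
    def gate(name, squash):
        w_x, w_k, b = params[name]
        return squash(w_x * x_t + w_k * k_prev + b)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old state to keep
    o = gate("output", sigmoid)        # how much of the state to expose
    g = gate("candidate", math.tanh)   # proposed new content

    c_t = f * c_prev + i * g           # updated internal state
    k_t = o * math.tanh(c_t)           # output passed on to the next step
    return k_t, c_t

def run_sequence(xs, params):
    """Apply the same weights sequentially to every item of a sample."""
    k, c = 0.0, 0.0
    outputs = []
    for x in xs:
        k, c = lstm_step(x, k, c, params)
        outputs.append(k)
    return outputs

# Illustrative weights only; in practice they are learned during training.
GATES = ("input", "forget", "output", "candidate")
params = {name: (0.5, 0.5, 0.0) for name in GATES}
outputs = run_sequence([0.01, -0.02, 0.015], params)
```

The same `params` are reused at every step of `run_sequence`, which is the weight sharing that distinguishes a recurrent layer from a fully connected one.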
The ability of the LSTM model to continually update its state across the input sequence is what allows it to better fit sequential data.

2.2 Convolution

A different approach, the use of convolutions, has proven very successful in the field of computer vision and especially object recognition. The use of convolutional neural networks to analyze multivariate time-dependent data appears enticing since these two domains (images and financial time-series data) share similarly shaped data (2D, potentially stacked). In both cases, the data are likely to contain meta-features embedded in the interdependencies across the x and y dimensions.

A convolution operation, in the context of deep learning, is defined as the repeated application of a filter (kernel) with fixed weights to different areas of the input data. More precisely, the output of a convolution operation with kernel size (R, C) and stride S is defined as:

$$Output_{i,j} = \sum_{r=0}^{R-1} \sum_{c=0}^{C-1} Input_{i \cdot S + r,\; j \cdot S + c} \cdot Filter_{r,c} \tag{2}$$

Convolutions are particularly powerful when used in sequence, as they can achieve a significant dimensionality reduction on the data while automatically identifying and extracting the most significant features (see e.g. [Krizhevsky et al., 2012]). In particular, there is ample evidence of the effectiveness of using several convolutions paired with data pooling (i.e. applying a filter that iterates on the data similarly to a convolution, but that reduces the data in each window to a scalar using a function such as the average or the maximum value).

Another advantage of convolutional neural networks is that they are faster to train than regular feed-forward networks composed solely of fully connected layers. On the other hand, their power often translates into a tendency to overfit the data (i.e. to fit the noise in the training data set while performing poorly on the out-of-sample portion of the data).
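
As an illustration of equation (2) and of pooling, here is a minimal pure-Python sketch that applies a single filter without padding and then reduces windows to their maximum. The helper names (`out_size`, `conv2d_valid`, `max_pool`) and the toy 4 x 4 input are assumptions introduced for this example only.

```python
def out_size(n, k, stride=1):
    """Output length along one axis when windows are not padded."""
    return (n - k) // stride + 1

def conv2d_valid(inp, kernel, stride=1):
    """Apply one fixed-weight filter across a 2D input, as in equation (2)."""
    R, C = len(kernel), len(kernel[0])
    rows = out_size(len(inp), R, stride)
    cols = out_size(len(inp[0]), C, stride)
    return [[sum(inp[i * stride + r][j * stride + c] * kernel[r][c]
                 for r in range(R) for c in range(C))
             for j in range(cols)]
            for i in range(rows)]

def max_pool(inp, size, stride=None):
    """Reduce each (size x size) window to its maximum value."""
    stride = stride or size
    rows = out_size(len(inp), size, stride)
    cols = out_size(len(inp[0]), size, stride)
    return [[max(inp[i * stride + r][j * stride + c]
                 for r in range(size) for c in range(size))
             for j in range(cols)]
            for i in range(rows)]

# Toy input and a 2 x 2 averaging filter (illustrative values only).
grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
averaging = [[0.25, 0.25], [0.25, 0.25]]

feature_map = conv2d_valid(grid, averaging)   # 3 x 3 map of local means
pooled = max_pool(grid, 2)                    # 2 x 2 map of local maxima
```

Each unpadded convolution shrinks an axis of length `n` to `out_size(n, k, stride)`, which is the source of the dimensionality reduction mentioned above.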
Several solutions to the problem of overfitting have been proposed, including:

• better feature selection;
• expansion of the number of test cases;
• input data normalization;
• parameter regularization;
• dropout;
• batch normalization.

2.3 Loss functions

The models presented in this paper all belong to the family of supervised learning methods. This means that the parameters of the model are determined so as to minimize a specific error ('loss') quantity. The loss function quantifies this error in terms of the distance between the value estimated by the model and the actual ('true') value recorded for the observation.

Common functions used in Machine Learning are:

• Mean Square Error (L2 loss)
• Mean Absolute Error (L1 loss)
• Hinge Loss (Maximum margin loss)
• Cross Entropy Loss

The actual function used depends on the particular problem being studied. For instance, the first two losses are used in regression tasks whereas the last two are used in classification models.

2.4 Data and parameter regularization

Input data are routinely scaled in deep learning in order to achieve faster and better convergence of the model weights. This technique is effectively used to prevent model underfit (as opposed to parameter regularization, which aims to prevent overfit).

It is, for instance, common practice to use either MinMax scaling or Standard scaling. In the first case, the data are rescaled between two arbitrary values (usually zero and one). In the second case, the data are rescaled to have zero mean and unit variance.

2.4.1 L1 and L2 norms

When the problem is, on the other hand, model overfit, kernel weights, biases and layer outputs are instead regularized. A common way to do that is via L1 and L2 norms. These are added to the penalty function (scaled by an arbitrary parameter λ) in order to discourage their target from becoming too big.
More precisely (given a vector β of parameters):

$$L_1 = Loss(\beta) + \lambda \lVert \beta \rVert_1 = Loss(\beta) + \lambda \cdot \sum_{i=0}^{n} |\beta_i|$$
$$L_2 = Loss(\beta) + \lambda \lVert \beta \rVert_2^2 = Loss(\beta) + \lambda \cdot \sum_{i=0}^{n} \beta_i^2 \tag{3}$$

2.4.2 Batch normalization

Batch normalization (see [Ioffe and Szegedy, 2015]) works by normalizing a layer’s activations (i.e. subtracting the batch mean and dividing by the batch standard deviation) using the values recorded during each mini-batch. It is another method that has been shown to improve model generalization.

2.4.3 Dropout

Introducing node dropout is yet another method used to achieve better model robustness. In this case a random fraction p of the outputs of a pre-determined layer is selected at every gradient update and the values for those nodes are set to zero. For best effect, the dropout fraction p is normally set to a value around 0.5.

3 An application of deep learning on a portfolio of U.S. Treasury bonds

Here I analyze the ability of several deep learning architectures to accurately predict the future distribution of a linear portfolio of US Treasury exposures. Such a problem is of particular interest in the field of Market Risk management.

It should be noted that a classical value-at-risk (VaR) analysis cannot readily be performed using the types of models presented here. This is because using the classical backtesting performance measure (i.e. the distance between predicted and actual breaches over the observations) as a loss function during training would invariably lead the optimizer to the trivial scalar solution.

In fact, if we define VaR for a variable X as:

$$VaR_{\alpha}(X) = \inf \{ x \in \mathbb{R} : F_X(x) > \alpha \} \tag{4}$$

and I backtest over a set of n training observations using the actual portfolio returns, we see that a model that consistently predicted the (n · α)-th worst return would indeed achieve a perfect score.
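
The degenerate solution can be reproduced in a few lines. The sketch below uses synthetic Gaussian returns (an assumption; no market data are involved) and counts a breach whenever the realized return falls below the forecast level. A single constant, picked from the sorted training returns, produces exactly the expected number of breaches without using any information from the inputs.

```python
import random

def count_breaches(returns, var_forecasts):
    """Number of observations whose return falls below the forecast level."""
    return sum(1 for r, v in zip(returns, var_forecasts) if r < v)

def constant_var_forecast(returns, alpha):
    """The trivial solution: one fixed number chosen so that exactly
    round(n * alpha) training returns fall below it (assuming no ties)."""
    k = round(len(returns) * alpha)    # breaches expected over the sample
    return sorted(returns)[k]          # the (k + 1)-th worst return

random.seed(0)
returns = [random.gauss(0.0, 0.01) for _ in range(1000)]   # synthetic data
alpha = 0.01

v = constant_var_forecast(returns, alpha)
breaches = count_breaches(returns, [v] * len(returns))
expected = round(len(returns) * alpha)
```

Here `breaches` equals `expected`, so a loss based only on the breach count cannot distinguish this constant from a model that has genuinely learned from the data.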
In other words, the model would not learn from the data, moving instead to the (fixed) solution that corresponds to the number of breaches that should be observed during training. This clearly would not have any generalization value.

For this reason, it is a lot more appealing to consider the whole future distribution, instead of a single statistic calculated on it.

3.1 Portfolio

The portfolio weights for this analysis are fixed in advance as follows:

Term     3M    6M    1Y    2Y    3Y    5Y    7Y    10Y   30Y
Weight   0.1   0.1   0.1   0.3   0.2   0.1   0.05  0.03  0.02     (5)

They will be kept constant across all experiments. A relaxation of this constraint certainly deserves future research.

3.2 Market data

I consider 10 years of daily data for US Treasury zero rates (provided by Bloomberg):

Term   Ticker
3M     I02503M Index
6M     I02506M Index
1Y     I0251Y Index
2Y     I0252Y Index
3Y     I0253Y Index
5Y     I0255Y Index
7Y     I0257Y Index
10Y    I02510Y Index
30Y    I02530Y Index

Table 2: Zero rates

Using these rates, I compute the daily returns on the corresponding zero-coupon bonds:

$$r_{t-1,t} = \frac{e^{-r_t \cdot \tau}}{e^{-r_{t-1} \cdot \tau}} \tag{6}$$

where, on the right-hand side, $r_t$ denotes the zero rate observed at time $t$ and $\tau$ the term of the rate.

These returns are then arranged in a [500 x 9] matrix for every observation. In other words, every matrix contains the latest 500 observations of these returns. The choice of the past 500 observations is motivated by the widely adopted convention in market risk of using the previous two years of returns for the estimation of the parameters used for ex ante risk calculations.

In addition to the return matrix, some model versions presented here augment the returns data with cross-sectional data. These models are referred to in this document as 'mixed' models. The basic data used in these cases are presented in Table 3.
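
Before turning to the cross-sectional data, the construction of the returns matrices described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the paper: rates are taken to be continuously compounded decimals, terms are expressed in years, and the flat, slowly rising curve is made up for the example.

```python
import math

def zero_coupon_return(rate_prev, rate_now, tau):
    """Price ratio of a zero-coupon bond of term tau between two dates,
    following equation (6). The ratio is a gross return (close to 1)."""
    return math.exp(-rate_now * tau) / math.exp(-rate_prev * tau)

def returns_table(rates, taus):
    """rates[t][j] is the zero rate on day t for term taus[j] (in years)."""
    return [[zero_coupon_return(prev[j], now[j], tau)
             for j, tau in enumerate(taus)]
            for prev, now in zip(rates, rates[1:])]

def rolling_windows(rows, length):
    """All blocks of `length` consecutive rows, oldest first."""
    return [rows[i:i + length] for i in range(len(rows) - length + 1)]

taus = [0.25, 0.5, 1, 2, 3, 5, 7, 10, 30]   # 3M ... 30Y, in years

# Hypothetical flat curve at 2% that shifts up by one basis point per day.
rates = [[0.02 + 0.0001 * t] * len(taus) for t in range(6)]

table = returns_table(rates, taus)      # 5 daily rows x 9 terms
windows = rolling_windows(table, 3)     # 3-row windows for brevity; the
                                        # text uses 500 rows per observation
```

With 500-row windows, each element of `windows` would correspond to one [500 x 9] observation matrix.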

Description              Ticker
US CPI YOY               CPI YOY Index
EU CPI YOY               ECCPEMUY Index
JP CPI YOY               JNCPIYOY Index
EUR 10yr swap rate       EUSA10 Curncy
GBP 10yr swap rate       BPSW10 Curncy
USD 10yr swap rate       USSW10 Curncy
JPY 10yr swap rate       JYSWAP10 Curncy
AUD 10yr swap rate       ADSWAP10 Curncy
3M USD Zero rate         I02503M Index
6M USD Zero rate         I02506M Index
1Y USD Zero rate         I0251Y Index
2Y USD Zero rate         I0252Y Index
3Y USD Zero rate         I0253Y Index
5Y USD Zero rate         I0255Y Index
7Y USD Zero rate         I0257Y Index
10Y USD Zero rate        I02510Y Index
30Y USD Zero rate        I02530Y Index
10Y US Treasury yield    USGG10YR Index
2Y US Treasury yield     USGG2YR Index

Table 3: Base cross-sectional data

In addition, some simple computed features are constructed using the cross-sectional variables as building blocks (Table 4).

Formula
10Y US Treasury yield - 2Y US Treasury yield
USD CPI avg 50d
USD CPI avg 100d
USD CPI avg 200d
3M USD Zero rate avg 50d
6M USD Zero rate avg 50d
1Y USD Zero rate avg 50d
2Y USD Zero rate avg 50d
3Y USD Zero rate avg 50d
5Y USD Zero rate avg 50d
7Y USD Zero rate avg 50d
10Y USD Zero rate avg 50d
30Y USD Zero rate avg 50d

Table 4: Calculated cross-sectional data

The rationale behind pre-computing certain features is to achieve a better economy in model estimation.

Another data source provided to some versions of the models is the empirical (historical) portfolio distribution. It is calculated using the returns matrix associated with each observation to compute a vector of portfolio returns. The histogram of these returns is then used (the portfolio returns having equal weights). For T observations of a returns vector with members $r_t$ and a vector $L = [l_1 \ldots l_n]$ of bucket delimiters, the empirical distribution is defined as this vector:

$$H = \begin{bmatrix} h_1 \\ \vdots \\ h_{n-1} \end{bmatrix} \quad \text{where} \quad h_i = \sum_{t=1}^{T} I_i^t, \qquad I_i^t = \begin{cases} 1 & \text{if } l_i < r_t \le l_{i+1} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

The buckets are delimited according to this rule:

$$\left[ -\infty;\ \left\{ \min(R) + \frac{\max(R) - \min(R)}{20} \cdot i \quad \forall i \in [0..20] \right\};\ \infty \right] \tag{8}$$

where R is the vector of one-day portfolio returns (defined as $r_t \cdot position$, where $position$ is the vector in (5)).

3.3 Target data

The model will try to forecast the distribution of the actual (forward-looking) 10-day returns of the portfolio. These returns are calculated as:

$$Ret^{ptf}_i = \left[ \prod_{t=i+1}^{i+11} (1 + r_t) - 1 \right] \cdot position \tag{9}$$

The returns range is then divided into buckets that are delimited using the same rule defined in equation (8). These buckets can be seen as the classes to which the models will have to assign probabilities.

3.4 Softmax activation

The final layer of all the architectures presented here is a fully connected layer with a softmax activation function. This particular activation outputs a vector whose values sum to one. The softmax transformation of a vector x (with n elements) into an output vector y is defined as:

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \quad \text{for } i = 1..n \tag{10}$$

The softmax activation function is widely used in classification applications. It allows the model output to be interpreted as the confidence that the state of the world (i.e. future portfolio returns) belongs to each of the available classes. This means that, in the context of this analysis, the output classes are defined as the discrete bins of the returns distribution.

3.5 Loss function

The loss function used to evaluate the model is the categorical cross-entropy. Over n observations on a model that outputs C classes (i.e.
return buckets), entropy (E) is defined in this context as:

$$E = -\sum_{i=1}^{n} \sum_{c=1}^{C} I_c \ln(p_i(c)) \tag{11}$$

with
$I_c$: indicator equal to 1 if class c contains the actual return and 0 if it does not;
$p_i(c)$: probability assigned by the model on sample i to class c.

The choice of this particular loss function results in a loss of zero for a model with perfect foresight, and in an infinite loss whenever a zero probability is wrongly assigned to the class that contains the actual return.

3.6 Models

The models were implemented using a Keras 2.2.4 front-end (see [Chollet et al., 2015]) and a TensorFlow 1.12.0 back-end (see [Abadi et al., 2015]).

The first model tested is a simple stacked LSTM model where three LSTM layers operating in sequence are followed by three dense layers:

Figure 2: LSTM implementation

The second model is a 2D convolutional model with max pooling. The kernel size for the first convolution is (20,3): this is done to recognize longer-term dependencies in the returns data. The following convolutional layers apply an almost-standard (3,3) convolution filter. The convolutions use a 'valid' padding strategy with a unit stride. This results in a fast loss of dimensionality across the returns data. No regularization was applied.

Figure 3: Simple convolution implementation

The third model is an enhanced version of the LSTM model. This iteration employs the original LSTM specification but is augmented by the cross-sectional data, which enter through a series of fully connected layers.

The idea behind this extension is to give the model a chance to anchor the predictions to state data or, in other words, to consider different 'regimes'.

Figure 4: Mixed LSTM implementation

The same augmentation is applied to the original convolutional model.
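
All of the architectures above share the softmax output layer of equation (10) and the cross-entropy loss of equation (11). The following sketch illustrates the two formulas in plain Python; it is not the Keras implementation used for the models, and the count of 22 buckets is an assumption derived from the 23 delimiters listed in equation (8).

```python
import math

def softmax(scores):
    """Equation (10): turn raw scores into probabilities that sum to one."""
    shift = max(scores)                 # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probabilities, true_classes):
    """Equation (11): minus the summed log-probability of the realized buckets.

    probabilities[i][c] is the model probability of bucket c on sample i;
    true_classes[i] is the index of the bucket holding the actual return.
    """
    total = 0.0
    for probs, c in zip(probabilities, true_classes):
        if probs[c] == 0.0:
            return math.inf             # zero probability on the true bucket
        total -= math.log(probs[c])
    return total

n_buckets = 22                          # assumed: 23 delimiters in eq. (8)
uniform = softmax([0.0] * n_buckets)    # a model with no information
loss_uniform = categorical_cross_entropy([uniform], [5])

perfect = [0.0] * n_buckets
perfect[5] = 1.0                        # all mass on the realized bucket
loss_perfect = categorical_cross_entropy([perfect], [5])
```

Under these assumptions an uninformed model that spreads its probability evenly incurs a loss of ln(22) on each sample, which gives a rough sense of scale for per-sample losses.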

Figure 5: Mixed convolution implementation

The mixed LSTM and convolutional models are then again augmented using the historical distribution of portfolio returns calculated using the past 500 zero-coupon returns available in each observation. The distribution is segmented using the same buckets used for the output layer and it enters the models through a separate set of dense layers.

When the historical portfolio distribution is added to the mixed LSTM model, I obtain the 'mixed historical LSTM' model (Figure 6).

Figure 6: Mixed historical LSTM implementation

The mixed convolutional model becomes the 'mixed historical convolutional' model when the empirical historical portfolio distribution is added to it (Figure 7).

Figure 7: Mixed historical convolutional implementation

The only regularization applied is one pass of the MinMax data scaler for the cross-sectional data (when used by the model).

3.7 Benchmark models

In order to better assess the performance of the models introduced here, two reference approaches are also considered.

The first one is a classic multivariate Gaussian parametric model. It is calculated using the return matrix for each observation and computing from it a covariance matrix. An exponential weighting with a decay factor of 0.94 is used for the estimation. The expected return is fixed at zero. The covariance matrix is used in conjunction with the portfolio weights to determine a forward-looking portfolio standard deviation value, thus determining the ex ante portfolio distribution.

The second benchmark used is the historical distribution of portfolio returns described in equation (7).

4 Results

The models are tested using a randomized 0.8/0.2 split of the data set into training and validation subsets. This results in 2,068 training cases and 517 validation cases. The training data set is further randomized at the beginning of each training epoch.
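
The split described above can be sketched with the standard library alone. The function names and the seeding scheme are hypothetical, and the total of 2,585 observations is simply the sum of the two counts reported in the text.

```python
import random

def randomized_split(n_obs, train_fraction, seed=0):
    """Shuffle observation indices once and cut them into two disjoint sets."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(n_obs * train_fraction)
    return indices[:n_train], indices[n_train:]

def epoch_order(train_indices, epoch, seed=0):
    """Fresh ordering of the training set for a given epoch."""
    rng = random.Random(seed * 100003 + epoch)   # hypothetical seeding scheme
    order = list(train_indices)
    rng.shuffle(order)
    return order

n_obs = 2068 + 517                      # total implied by the counts above
train_idx, valid_idx = randomized_split(n_obs, 0.8)
first_epoch = epoch_order(train_idx, epoch=0)
```

Only the order of the training cases changes from one epoch to the next; the validation set stays fixed so that validation losses remain comparable across epochs.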
The optimization was run using an Adam optimizer (with the original parameters introduced in [Kingma and Ba, 2014]) and a minibatch size of 128. Each model was run with a so-called 'patience' parameter of 10 for early termination. This means that the optimization would wait for 10 epochs after the validation loss stopped improving before terminating.

Figure 8: Training and validation loss per epoch

I now choose the candidate model versions using these criteria (in order of decreasing importance):

• least validation loss
• least training loss
• latest epoch

After training the models presented here, this resulted in choosing these models:

Model                            Epoch   Training loss   Validation loss
LSTM                             7       2.05            2.00
Convolutional                    22      2.05            2.00
Mixed LSTM                       22      1.98            1.95
Mixed convolutional              19      2.05            2.00
Mixed historical LSTM            44      1.76            1.80
Mixed historical convolutional   49      1.66            1.76
Parametric                       -       2.19            2.17
Historical                       -       2.07            2.02

Table 5: Optimal model performance

In order to better judge the models, the shape of the produced distribution is considered. It is indeed very important to have models that produce meaningful distribution shapes and that react to different data. Machine learning models have a natural tendency to converge quickly onto degenerate solutions that nevertheless have strong explanatory power (e.g. constant solutions). This, in a way, could be considered a special case of overfit.

So, to measure the variability of the distributions produced by each model, I consider the Bhattacharyya distance (introduced in [Bhattacharyya, 1946]):

$$BD = -\ln \left( \sum_{x \in X} \sqrt{p(x)\, q(x)} \right) \tag{12}$$

For every model, I compute the Bhattacharyya distance between the latest available returns distribution and all the other distributions generated over the rest of the data. The results are illustrated in Figure 9.

Figure 9: Bhattacharyya distance from latest distribution

In order to offer a better comparison, the same distance is also applied to the empirical portfolio distributions (i.e.
the historical model distribution).

It is clear that the complexity of the model, while perhaps not significantly improving the training and validation performance, allows the creation of more reactive models. This hints at the possibility that more complex models might offer better generalization power.

In particular, we see that the naive application of both the LSTM and the convolutional model results in degenerate behavior of the distance among the predicted distributions.

For the sake of illustration, Figure 10 presents the distributions predicted by each model for the latest available observation in the data set.

Figure 10: Latest predicted distribution

5 Conclusion

The results I obtained corroborate the notion that the use of deep learning methods for financial applications is viable.

In particular, a finding that is consistent with the existing deep learning literature is that models need to cross a complexity threshold in order to achieve significant efficacy. Whether the models presented here have already achieved this critical mass is, though, not clear.

Furthermore, it has to be underlined that these tools are still poorly understood and much more difficult to calibrate than more traditional methods. It is not immediately apparent which features are more important than others (especially considering the limited nature of these experiments). Moreover, the long time to convergence for some of the models indicates that more could be done to speed up learning.

Finally, these results suggest a few opportunities for further research:

• Introducing a larger number of factors onto which the portfolio is mapped;
• Testing different model architectures, hyper-parameters and regularization techniques. The use of auto-encoders for dimensionality reduction is, in this context, particularly appealing in light of possible real-world applications (i.e.
models featuring hundreds or thousands of different factors);
• Exploring different cross-sectional data features;
• Verifying the soundness of the model(s) for different portfolio structures;
• Investigating the use of different penalty functions that would be of more interest in the field of risk measurement (e.g. weighted cross-entropy).

References

[Abadi et al., 2015] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

[Bhattacharyya, 1946] Bhattacharyya, A. (1946). On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics, pages 401–406.

[Board, 2017] Board, F. S. (2017). Artificial intelligence and machine learning in financial services. November, available at: http://www.fsb.org/2017/11/artificialintelligence-and-machine-learning-in-financialservice/ (accessed 30th January, 2018).

[Chollet et al., 2015] Chollet, F. et al. (2015). Keras. https://keras.io.

[Hamori et al., 2018] Hamori, S., Kawai, M., Kume, T., Murakami, Y., and Watanabe, C. (2018). Ensemble learning or deep learning? Application to default risk analysis. Journal of Risk and Financial Management, 11(1):12.

[Heaton et al., 2017] Heaton, J., Polson, N., and Witte, J. H. (2017). Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry, 33(1):3–12.

[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006).
A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

[Hosaka, 2019] Hosaka, T. (2019). Bankruptcy prediction using imaged financial ratios and convolutional neural networks. Expert Systems with Applications, 117:287–299.

[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

[Lathuilière et al., 2018] Lathuilière, S., Mesejo, P., Alameda-Pineda, X., and Horaud, R. (2018). A comprehensive analysis of deep regression. arXiv preprint arXiv:1803.08450.

[Sun et al., 2018] Sun, S., Wei, Y., and Wang, S. (2018). Adaboost-LSTM ensemble learning for financial time series forecasting. In International Conference on Computational Science, pages 590–597. Springer.