Information-theoretic approaches to portfolio selection

Nathan LASSANCE

Doctoral Thesis 2 | 2020

Université catholique de Louvain

LOUVAIN INSTITUTE OF DATA ANALYSIS AND MODELING IN ECONOMICS AND STATISTICS (LIDAM)

Université catholique de Louvain, Louvain School of Management, LIDAM & Louvain Finance

Thesis submitted in partial fulfillment of the requirements for the degree of Docteur en sciences économiques et de gestion

Dissertation committee:
Prof. Frédéric Vrins (UCLouvain, BE), Advisor
Prof. Kris Boudt (Ghent University, BE)
Prof. Victor DeMiguel (London Business School, UK)
Prof. Guofu Zhou (Washington University, USA)
Prof. Marco Saerens (UCLouvain, BE), President

Academic year 2019-2020

“Find a job you enjoy doing, and you will never have to work a day in your life.”

Mark Twain

Contents

Abstract
Acknowledgments
Research accomplishments
List of Figures
List of Tables
List of Notation
Introduction

1 Research background
1.1 Mean-variance approaches
1.1.1 Definitions
1.1.2 Estimation risk
1.1.3 Robust mean-variance portfolios
1.2 Higher-moment approaches
1.2.1 Efficient portfolios
1.2.2 Downside-risk criteria
1.2.3 Indirect approaches
1.3 Risk-parity approaches
1.3.1 Asset-risk parity
1.3.2 Factor-risk parity
1.3.3 Criticisms
1.4 Information-theoretic approaches
1.5 Thesis contributions

2 Minimum Rényi entropy portfolios
2.1 Introduction
2.2 The notion of entropy
2.2.1 Shannon entropy
2.2.2 Rényi entropy
2.3 Rényi entropy and risk measurement
2.3.1 Connection with deviation risk measures
2.3.2 The subadditivity property
2.3.3 Exponential Rényi entropy as a flexible risk measure
2.3.4 Appeal of the case α ∈ [0, 1]
2.4 Rényi entropy and portfolio selection
2.4.1 Definition
2.4.2 Entropy and higher moments: A Gram-Charlier expansion
2.5 Robust m-spacings estimation
2.5.1 Motivation and expression for the m-spacings estimator
2.5.2 Properties of the m-spacings estimator
2.5.2.1 Asymptotic bias
2.5.2.2 Robustness to outliers
2.6 Out-of-sample performance
2.6.1 Data and methodology
2.6.2 Results
2.7 Conclusion
2.8 Appendix
2.8.1 Proofs of results
2.8.1.1 Proposition 2.1
2.8.1.2 Theorem 2.1
2.8.1.3 Proposition 2.2
2.8.1.4 Theorem 2.2
2.8.1.5 Proposition 2.3
2.8.1.6 Proposition 2.4
2.8.2 Additional empirical results

3 Optimal portfolio diversification via independent component analysis
3.1 Introduction
3.2 PCA versus ICA: from decorrelation to independence
3.2.1 Principal component analysis
3.2.2 Independent component analysis
3.2.3 Numerical illustration
3.3 Factor-variance parity via uncorrelated factors
3.3.1 Derivation of the factor-variance-parity portfolios
3.3.2 Arbitrariness of the decorrelation criterion
3.4 Higher-moment diversification via ICA
3.4.1 IC-variance-parity portfolios
3.4.2 diversification
3.4.3 Data-driven shrinkage portfolio
3.5 Factor-risk parity with higher-moment risk measures
3.5.1 Parsimonious estimation of higher moments with ICs
3.5.2 IC-risk parity with modified Value-at-Risk
3.6 Out-of-sample performance
3.6.1 Data and methodology
3.6.2 Calibration of K and δ
3.6.3 Results
3.7 Conclusion
3.8 Appendix
3.8.1 Proofs of results
3.8.1.1 Proposition 3.1
3.8.1.2 Theorem 3.1
3.8.1.3 Theorem 3.2
3.8.1.4 Excess kurtosis of PCVP portfolios in Example 3.5
3.8.1.5 Proposition 3.2
3.8.1.6 Proposition 3.3
3.8.2 FastICA algorithm
3.8.3 Theoretical properties of IC-kurtosis-parity portfolios
3.8.4 Long-only factor-variance-parity portfolios

4 Robust portfolio selection using sparse estimation of comoment tensors
4.1 Introduction
4.2 Dimension reduction
4.2.1 Curse of dimensionality
4.2.2 Reducing dimensionality via principal components
4.2.3 Approximation of comoment tensors
4.3 Sparse higher-comoment tensors
4.3.1 Independent factor model
4.3.2 Approximation via independent component analysis
4.3.3 Sparse estimate of optimal portfolio
4.4 Empirical analysis
4.4.1 Data and methodology
4.4.2 Independence of PCs versus ICs
4.4.3 Results
4.5 Conclusion
4.6 Appendix
4.6.1 Proofs of results
4.6.1.1 Proposition 4.1
4.6.1.2 Corollary 4.1
4.6.1.3 Theorem 4.1
4.6.1.4 Proposition 4.2
4.6.2 Robustness test: daily returns

5 Portfolio selection: A target-distribution approach
5.1 Introduction
5.2 Minimum-divergence portfolio
5.2.1 General formulation
5.2.2 The Kullback-Leibler divergence
5.3 Targeting a generalized-normal return distribution
5.3.1 Minimum-divergence portfolio under a generalized-normal target
5.3.2 The case of Gaussian asset returns
5.4 Targeting a Gaussian return distribution
5.4.1 Decomposition of the KL divergence
5.4.2 The Dirac-delta target-return distribution
5.5 A reference-portfolio approach
5.6 Estimation of the minimum-divergence portfolio
5.6.1 Estimation for the generalized-normal target return
5.6.2 Estimation of the portfolio-return entropy $H(P)$
5.6.3 Estimation of the expectation $\mathbb{E}[|P - \hat{\alpha}|^{\gamma}]$
5.6.4 Estimation for the Gaussian target return
5.7 Out-of-sample performance
5.7.1 Data and methodology
5.7.2 Reported portfolio strategies
5.7.3 Results
5.7.3.1 Full sample
5.7.3.2 Financial crisis
5.8 Conclusion
5.9 Appendix
5.9.1 Proofs of results
5.9.1.1 Proposition 5.1
5.9.1.2 Proposition 5.2
5.9.1.3 Theorem 5.1
5.9.1.4 Theorem 5.2
5.9.1.5 Proposition 5.3
5.9.1.6 Theorem 5.3
5.9.1.7 Proposition 5.4
5.9.1.8 Corollary 5.1
5.9.2 Choice of kernel bandwidth
5.9.3 Out-of-sample performance of all considered portfolio strategies
5.9.3.1 Comparison with reference portfolios
5.9.3.2 Comparison with higher-moment portfolios
5.9.3.3 Results for the generalized-normal target with γ = 4
5.9.4 Entropy and diversification
5.9.4.1 The case of i.i.d. asset returns
5.9.4.2 Empirical point of view

6 Conclusion
6.1 Summary of main results
6.2 Open questions for future research
6.2.1 Specific questions
6.2.2 General questions
6.2.2.1 Properties of independent components
6.2.2.2 The notion of diversification

References

Abstract

Ever since modern portfolio theory was introduced by Harry Markowitz in 1952, a plethora of papers have been written on the mean-variance investment problem. However, because asset returns are non-Gaussian, the mean and variance are insufficient to represent their full distribution, which depends on higher moments too. Higher-moment portfolio selection is more complex: a smaller literature has been dedicated to this problem, and no consensus has emerged about how investors should allocate their wealth when higher moments cannot be ignored. Among the proposed alternatives, researchers have recently considered information theory, and entropy in particular, as a new framework to tackle this problem. Entropy is an appealing criterion because it measures the amount of randomness embedded in a random variable from the shape of its density function, thus accounting for all moments. The application of information theory to portfolio selection is, however, nascent, and much remains to be explored. Therefore, in this thesis, we explore the portfolio-selection problem from an information-theoretic angle, accounting for higher moments.

We review the relevant literature and mathematical concepts in Chapter 1. Then, we consider in Chapter 2 a natural alternative to the popular minimum-variance portfolio strategy, using Rényi entropy as information-theoretic criterion. We show that the exponential Rényi entropy fulfills natural properties as a risk measure. However, although Rényi entropy has some nice features, we show that it can be an undesirable investment criterion because it may lead to portfolios with worse higher moments than those obtained by minimizing the variance. For this reason, we turn in Chapters 3 to 5 to different ways of applying entropy, thereby revisiting two popular frameworks, risk parity and expected utility, to account for higher moments.

In Chapter 3, we investigate the factor-risk-parity portfolio, a popular strategy among practitioners that aims to diversify the portfolio-return risk across uncorrelated factors underlying the asset returns. We show that although principal component analysis (PCA) is very useful for dimension reduction, its resulting factor-risk-parity portfolio is suboptimal. Indeed, PCA merely provides one choice of uncorrelated factors out of infinitely many others, and one would prefer to be diversified over independent factors rather than merely uncorrelated ones. Instead, we propose to diversify the risk across maximally independent factors, provided by independent component analysis (ICA). We show theoretically that this solves the issues related to principal components and provides a natural way of reducing the kurtosis of portfolio returns.

In Chapter 4, we apply ICA in a different way in order to obtain robust estimates of moment-based portfolios, such as those based on expected utility. It is well known that these portfolios are difficult to estimate, particularly in high dimensions, because the number of comoments quickly explodes with the number of assets. We propose to address this curse of dimensionality by projecting the asset returns on a small set of maximally independent factors provided by ICA, and neglecting their remaining dependence. In doing so, we obtain sparse approximations of the comoment tensors of asset returns. This drastically decreases the dimensionality of the problem and leads to well-performing and computationally efficient investment strategies with low turnover.

In Chapter 5, we introduce an alternative to the utility function for capturing investors’ preferences. The utility function is praised by academics but is difficult to specify when higher moments matter. Because investors ultimately care about the distribution of their portfolio returns, our proposal is to capture their preferences via a target-return distribution. The optimal portfolio is then the one whose distribution minimizes the Kullback-Leibler divergence with respect to the target distribution. Our theoretical exploration shows that Shannon entropy plays a central role as higher-moment criterion in this framework, and our empirical analysis confirms that this strategy outperforms mean-variance portfolios out of sample.

Acknowledgments

When I embarked on my PhD journey more than three years ago, holding this thesis in my hands was for me the final, faraway destination. Today, while writing these lines, I see it more as the first piece of the research career I aspire to pursue. Looking back, I am thankful to many people for making me fall in love with scientific research, and making these years so enjoyable.

I could not have hoped for a better advisor and mentor than Prof. Frédéric Vrins to guide me in this journey. Working by your side from the beginning of my master’s thesis to the end of this PhD thesis was truly inspiring and entertaining. Far from being just a supervisor, you were continuously involved in my projects, and few PhD students can boast this. Many times, I went on Overleaf late in the evening and was surprised to find you scratching your head over some mathematical problem we had discussed during the day. I keep being inspired by your innate curiosity, rigor and broad knowledge; talking with you about entropy, portfolio selection, independent component analysis and the like was always captivating. I will be forever grateful for everything you have taught me; our collaboration has led to several papers that I am very proud of. I will do my best to pass on this legacy in the future.

I wish to warmly thank the members of my dissertation committee. Prof. Marco Saerens (UCLouvain), for presiding over the committee. Your course on Java programming during my bachelor’s degree was among my favorite ones. Prof. Kris Boudt (Ghent University), you are the author I cite the most in this thesis, and this says a lot about how influential your work has been to me. Your continuous feedback on my papers has definitely contributed to the quality of this work. I am thankful for that. My greatest recognition goes to Prof. Victor DeMiguel (London Business School), a renowned expert in portfolio selection. I wouldn’t have imagined having the chance to work with someone of your standing. While one may think that top professors are inaccessible, you are actually one of the nicest scholars I have met, and I am thankful you graciously accepted my research visit at LBS. Needless to say, I learnt a lot during those three months, particularly on scientific writing during our several day-long rewriting sessions on our joint paper. I keep fond memories of those days. Finally, I am overwhelmed to say that Prof. Guofu Zhou (Washington University), another recognized expert in finance and portfolio selection, has agreed to join the committee. Thank you, I feel truly honored.

I am thankful to other people I discussed research with. Prof. Mikael Petitjean (UCLouvain), your practical point of view was very refreshing. Thank you for your motivating enthusiasm about my work. Prof. Gianluca Fusai (Cass Business School) and Domenico Mignacca (Qatar Investment Authority), for discussions on diversification measurement in portfolio selection. Prof. Massimo Guidolin (Bocconi University), for accepting my post-doctoral visit in a few months. I look forward to working with you. I also thank conference participants and anonymous reviewers for their extensive feedback on my papers. I had the pleasure to pass on some of my knowledge to master’s thesis students: Alexander Ackaert, Guillaume Dequenne, Emmanuel Masson, Loïc Parmentier and Rodolphe Vanderveken. Thank you guys, I hope you had a good time. Finally, I enjoyed sharing my office with you, Cheikh Mbaye. Witnessing your impressive mathematical skills made me stay humble about mine.

This thesis wouldn’t have been possible without financial and administrative support. The aspirant FNRS grant of the Fonds de la Recherche Scientifique (F.R.S.-FNRS) allowed me to pursue research in the best possible conditions. I am incredibly grateful for the research focus it provided me during my PhD. I also thank the Fédération Wallonie-Bruxelles for financing my research visit at the London Business School. My gratitude goes to Sandrine Delhaye, who made the process around the doctoral program so smooth. I was proud to work within the LIDAM institute and the CORE research center. Thank you for hosting me and giving me a nice office. Working in such an active environment was exciting. I particularly thank Catherine Germain for her help in all the administrative processes. Finally, it was great being a fellow of the Louvain Finance (LFIN) research center. Thank you to all those involved in organizing high-quality seminars, conferences, doctoral days and courses, among other activities.

Life wouldn’t be as fun without friends around. Les deux sacrés, Matthieu and Schouls. Thanks for giving me the motivation to run a marathon, and never failing to make me laugh. Le Vince, I look forward to your own PhD defense! Maxime and Adeline, thanks for keeping me company and cooking so many delicious meals when I was living in London. Greg, for our traditional Mexicain-Guinness-MarioKart night out, and for being my fervent partner in supporting Les Diables Rouges. Chloé, Greg, Mathou, Meg and Paul for our many board game nights. Le groupe des vrais amis, my oldest group of friends, for visiting me in London.

My family constitutes the most important people in my life. Aubry, you often tell me that I am your source of inspiration. Believe me, I feel just the same. I am so impressed by your capacity to follow your passions, and by your finishing your first book just a few weeks ago. Somehow, we both found our own way to get published :) I am proud to be your brother. Maman, I am afraid that listing all the ways in which you have supported me up to this day would make these acknowledgments longer than the thesis itself... Thank you for your unconditional love, trust and support. This thesis wouldn’t exist without you. Zorro, for your daily dose of cuteness and craziness, and for keeping me company when I was working from home. Finally, my girlfriend’s family: Eric, Yvelise and Gauthier. Thank you for always making me feel at home in Mussy.

Chloé, you know better than anyone else how much this PhD mattered to me, and I cannot thank you enough for your support, even when I was living abroad. Your smile, your love and your courage keep uplifting me every day. I feel so lucky to have you by my side; these last seven years have been wonderful. I am also indebted to you for offering me the notebooks in which I developed most of the mathematical results in this thesis. I dedicate this work to you.

Research accomplishments

International journal papers

Lassance, N., & Vrins, F. (2018). A comparison of pricing and hedging performances of equity derivatives models. Applied Economics, 50(10):1122–1137.
Lassance, N., & Vrins, F. (2019). Minimum Rényi entropy portfolios. Annals of Operations Research, https://doi.org/10.1007/s10479-019-03364-2.

Papers in progress (submitted work)

Lassance, N., DeMiguel, V., & Vrins, F. (2019). Optimal portfolio diversification via independent component analysis. London Business School working paper.
Lassance, N., & Vrins, F. (2019). Robust portfolio selection using sparse estimation of comoment tensors. Louvain Finance working paper.
Lassance, N., & Vrins, F. (2019). Portfolio selection: A target-distribution approach.

International conference and seminar presentations

“A comparison of pricing and hedging performances of equity derivatives models”:
• Internal seminar for the quant team of ING (Brussels, BE, 2019)

“Minimum Rényi entropy portfolios”:
• PhD day in math finance and financial engineering (Louvain-la-Neuve, BE, 2017)
• Actuarial and Financial Mathematics Conference (Brussels, BE, 2018)
• 35th Annual Conference of the French Finance Association (Paris, FR, 2018)
• Belgian Financial Research Forum (Brussels, BE, 2018)

“Optimal portfolio diversification via independent component analysis”:
• INFORMS Annual Meeting (Phoenix, USA, 2018)
• Louvain Finance internal seminar (Mons, BE, 2019)
• Actuarial and Financial Mathematics Conference (Brussels, BE, 2019)
• Financial Econometrics Conference (Toulouse, FR, 2019, presented by V. DeMiguel)

“Portfolio selection: A target-distribution approach”:
• 9th General AMaMeF Conference (Paris, FR, 2019)
• Louvain Finance internal seminar (Mons, BE, 2019)
• Paris Financial Management Conference (Paris, FR, 2019)
• Actuarial and Financial Mathematics Conference (Brussels, BE, 2020, accepted)

List of Figures

I.1 Minimum-variance portfolio on S&P500 and U.S. 10y Treasury Note

1.1 Sensitivity of modified VaR to volatility, skewness and kurtosis

2.1 The exponential Shannon entropy violates subadditivity for two i.i.d. Student t random variables with ν < 1
2.2 Decreasing α increases the sensitivity to tail uncertainty of the exponential Rényi entropy
2.3 Coefficients of the second-order Gram-Charlier expansion of Rényi entropy
2.4 The minimum Rényi entropy portfolio balances variance and kurtosis minimization
2.5 Impact of kurtosis on the exponential Rényi entropy
2.6 Impact of the parameter m on the robustness to outliers of the m-spacings estimator of the exponential Rényi entropy
2.7 Impact of the parameter m on the stability of the minimum Rényi entropy portfolio

3.1 Mutual information versus correlation
3.2 Mutual information of principal versus independent components
3.3 Arbitrariness of the factor-variance-parity portfolio
3.4 Excess kurtosis of PC- and IC-variance-parity portfolios

4.1 Gain of parameters for the estimation of higher moments via independent versus principal components
4.2 Time evolution of number of factors K
4.3 Non-linear correlation of principal versus independent components
4.4 Boxplots of the MMVaR, MMVaRPC and MMVaRIC portfolios for the 17Ind dataset

5.1 Schematic representation of the minimum-divergence portfolio
5.2 Generalized-normal target-return density for different values of γ
5.3 Function C(γ) as a function of γ
5.4 Minimum-divergence portfolio with Gaussian asset returns
5.5 Decomposition of the KL divergence for the Gaussian target return
5.6 Illustration of the reference-portfolio approach
5.7 Boxplots of portfolio weights for the MV and G2MV portfolios

List of Tables

2.1 List of datasets considered in the empirical study of Chapter 2
2.2 Out-of-sample performance of minimum-variance and minimum Rényi entropy portfolios
2.3 Mean, volatility and break-even transaction cost for the minimum-variance and minimum Rényi entropy portfolios

3.1 Out-of-sample performance of factor-risk-parity portfolios (parts 1/2 and 2/2)
3.2 Out-of-sample performance of long-only factor-risk-parity portfolios

4.1 Comparison of cardinality of comoment tensors
4.2 Out-of-sample performance of estimation strategies
4.3 Robustness test: Out-of-sample performance for daily returns

5.1 Minimum-divergence portfolio for the Gaussian target return
5.2 Dirac-delta minimum-divergence portfolio versus mean-variance portfolio
5.3 List of datasets considered in the empirical study of Chapter 5
5.4 List of portfolio strategies considered in the empirical study of Chapter 5
5.5 Out-of-sample performance during the full sample
5.6 Out-of-sample performance during the financial crisis
5.7 Out-of-sample performance of all considered portfolios (parts 1/3, 2/3 and 3/3)
5.8 Portfolio-weight concentration of minimum-divergence versus reference portfolios

List of Notation

Typesetting convention

$a, b, c, d, k$: Real constants
$\alpha, \beta, \gamma$: Real constants
$i, j, k$: Vector or matrix indices
$K, N$: Dimension
$t$: Time index
$\mathbf{x}$: Column vector
$x_i$: $i$th entry of vector $\mathbf{x}$
$\mathbf{A}$: Matrix
$\mathbf{A}_i$: $i$th row of matrix $\mathbf{A}$
$A_{ij}$: $(i, j)$ entry of matrix $\mathbf{A}$
$X$: Univariate random variable
$X_{(i:T)}$: $i$th order statistic of $X$ from sample of size $T$
$\mathbf{X}$: Multivariate random variable
$X_i$: $i$th component of $\mathbf{X}$

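Vector and matrix operations

$\mathbf{x}'$: Transpose of vector $\mathbf{x}$
$\|\mathbf{x}\|$: Euclidean norm of vector $\mathbf{x}$
$\mathrm{diag}(\mathbf{x})$: Diagonal matrix made of entries of $\mathbf{x}$
$\mathrm{sign}(\mathbf{x})$: Vector of signs of entries of $\mathbf{x}$
$L(\hat{\mathbf{w}}, \mathbf{w})$: Expected quadratic-utility loss from approximating $\mathbf{w}$ by $\hat{\mathbf{w}}$
$H[\mathbf{p}]$: Inverse Herfindahl index of vector $\mathbf{p}$
$\mathbf{A}'$: Transpose of matrix $\mathbf{A}$
$\mathbf{A}^{-1}$: Inverse of matrix $\mathbf{A}$
$\|\mathbf{A}\|$: Frobenius norm of matrix $\mathbf{A}$
$\det(\mathbf{A})$: Determinant of matrix $\mathbf{A}$
$\mathrm{trace}(\mathbf{A})$: Trace of matrix $\mathbf{A}$
$\sharp(\mathbf{A})$: Cardinality of matrix $\mathbf{A}$
$\mathbf{A} \otimes \mathbf{B}$: Kronecker (tensor) product of matrices (tensors) $\mathbf{A}$ and $\mathbf{B}$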

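Statistical operators

$B(\hat{\theta})$: Bias of estimator $\hat{\theta}$
$\mu_X, \mathbb{E}[X]$: Mean of $X$
$\hat{\mu}_X$: Sample mean of $X$
$\boldsymbol{\mu}_{\mathbf{X}}$: Mean vector of $\mathbf{X}$
$\hat{\boldsymbol{\mu}}_{\mathbf{X}}$: Sample mean vector of $\mathbf{X}$
$\sigma_X, \sigma_X^2$: Standard deviation (volatility) and variance of $X$
$\hat{\sigma}_X, \hat{\sigma}_X^2$: Sample standard deviation (volatility) and variance of $X$
$\mathbf{R}_{\mathbf{X}}$: Correlation matrix of $\mathbf{X}$
$\boldsymbol{\Sigma}_{\mathbf{X}}$: Covariance matrix of $\mathbf{X}$
$\hat{\boldsymbol{\Sigma}}_{\mathbf{X}}$: Sample covariance matrix of $\mathbf{X}$
$m_{i,X}$: $i$th moment of $X$
$\zeta_X$: Skewness of $X$
$\kappa_X$: Excess kurtosis of $X$
$\mathbf{M}_k(\mathbf{X})$: Comoment tensor/matrix of $\mathbf{X}$ of order $k$
$X^{\star}$: Standardized (zero-mean, unit-variance) copy of $X$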

Information-theoretic operators

$H[p_X]$: Shannon entropy of discrete distribution $p_X$
$H(X), H_1(X)$: Shannon entropy of $X$
$H[f_X], H_1[f_X]$: Shannon entropy of density $f_X$
$H_1^{\exp}(X), H_1^{\exp}[f_X]$: Exponential Shannon entropy of $X$ with density $f_X$
$H_\alpha(X), H_\alpha[f_X]$: Rényi entropy of $X$ with density $f_X$
$H_\alpha^{\exp}(X), H_\alpha^{\exp}[f_X]$: Exponential Rényi entropy of $X$ with density $f_X$
$\hat{H}_\alpha(m, T)$: $m$-spacings estimator of $H_\alpha(X)$ from sample of size $T$
$\hat{H}(m, T)$: $m$-spacings estimator of $H(X)$ from sample of size $T$
$\hat{H}_\alpha^{\exp}(m, T)$: $m$-spacings estimator of $H_\alpha^{\exp}(X)$ from sample of size $T$
$H_\alpha^{GC}(X)$: Second-order Gram-Charlier expansion of $H_\alpha(X)$
$\langle f_X | f_Y \rangle$: Kullback-Leibler divergence between $X$ and $Y$
$\langle f_X | f_Y \rangle_\alpha$: Rényi divergence between $X$ and $Y$
$I(\mathbf{X})$: Mutual information of $\mathbf{X}$

Particular functions

$f_X(\cdot)$: Probability density function (pdf) of $X$
$F_X(\cdot)$: Cumulative distribution function (cdf) of $X$
$F_X^{-1}(\cdot), Q_X(\cdot)$: Quantile function (inverse cdf) of $X$
$\mathrm{VaR}_X(\cdot)$: Value-at-Risk of $X$
$\mathrm{MVaR}_X(\cdot)$: Modified Value-at-Risk of $X$
$\mathrm{CVaR}_X(\cdot)$: Conditional Value-at-Risk of $X$
$\phi(\cdot)$: Standard Gaussian density
$\Phi(\cdot)$: Standard Gaussian cdf
$z_\varepsilon$: Standard Gaussian quantile at confidence level $\varepsilon$
$\mathbb{P}(\cdot)$: Probability measure
$L(\cdot)$: Lebesgue measure
$U(\cdot)$: Utility function
$\sigma_{EF}(\cdot)$: Efficient-frontier volatility function
$G(\cdot)$: FastICA entropy-estimation function
$\mathrm{EIF}_{\hat{\theta}}(\cdot)$: Empirical influence function of estimator $\hat{\theta}$
$\Gamma(\cdot)$: Gamma function
$\psi(\cdot)$: Digamma function
$B(\cdot, \cdot)$: Beta function
${}_1F_1(\cdot)$: Confluent hypergeometric function

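Particular random variables, vectors, matrices and sets

$\mathbf{X}$: $N$ asset returns
$\hat{\mathbf{X}}$: Projection of asset returns on first $K$ principal components
$\mathbf{w}$: $N$ portfolio weights
$P, P(\mathbf{w})$: Portfolio return
$\hat{P}$: Projection of portfolio return on first $K$ principal components
$\mathbf{w}_{MV}$: Minimum-variance portfolio
$\hat{\mathbf{w}}_{MV,K}$: Estimator of $\mathbf{w}_{MV}$ based on first $K$ principal components
$\mathbf{w}_{\mu_0}$: Mean-variance portfolio with mean $\mu_0$
$\mathbf{w}_\lambda$: Mean-variance-efficient portfolio with risk-aversion coefficient $\lambda$
$\mathbf{w}_{SR}$: Maximum-Sharpe-ratio (tangent) portfolio
$\mathbf{w}_{MRE}$: Minimum Rényi entropy portfolio
$\mathbf{w}_{IC}$: Minimum-variance IC-variance-parity portfolio
$T$: Target return or sample size
$\mathbf{w}_{f_T}$: Minimum-divergence portfolio for a target-return density $f_T$
$\mathbf{w}_{\alpha,\beta,\gamma}$: Minimum-divergence portfolio for a generalized-normal target
$\mathbf{w}_{\alpha,\beta}$: Minimum-divergence portfolio for a Gaussian target
$\mathbf{w}_\alpha$: Minimum-divergence portfolio for a Dirac-delta target
$\mathbf{w}_0$: Reference portfolio
$\mathbf{w}_\psi$: Portfolio minimizing a function $\psi$ of moments
$\hat{\mathbf{w}}_{\psi,K}$: Estimator of $\mathbf{w}_\psi$ based on $K$ principal components
$\hat{\mathbf{w}}_{\psi,K}^{\dagger}$: Estimator of $\mathbf{w}_\psi$ based on $K$ independent components
$\boldsymbol{\mu}$: Asset mean-return vector
$\boldsymbol{\Sigma}$: Asset-return covariance matrix
$\hat{\boldsymbol{\Sigma}}_\delta$: Shrinkage estimator of $\boldsymbol{\Sigma}$ with shrinkage intensity $\delta$
$\tilde{\boldsymbol{\Sigma}}$: Estimator of $\boldsymbol{\Sigma}$ based on first $K$ principal components
$\mathbf{V}_N$: Matrix of $N$ eigenvectors of $\boldsymbol{\Sigma}$
$\mathbf{V}$: Matrix of $K \le N$ eigenvectors of $\boldsymbol{\Sigma}$
$\boldsymbol{\Lambda}_N$: Diagonal matrix of $N$ eigenvalues of $\boldsymbol{\Sigma}$
$\boldsymbol{\Lambda}$: Diagonal matrix of $K \le N$ eigenvalues of $\boldsymbol{\Sigma}$
$\mathbf{R}$: $K \times K$ rotation matrix
$\mathbf{R}(\theta)$: $2 \times 2$ rotation matrix
$\mathbf{S}$: $K$ independent factors
$\mathbf{Y}^{\star}$: First $K$ principal components
$\mathbf{Y}$: Arbitrary rotation $\mathbf{R}$ of principal components
$\mathbf{R}^{\dagger}$: Rotation matrix minimizing the mutual information of $\mathbf{Y}$
$\mathbf{Y}^{\dagger}$: $K$ independent components
$\hat{\mathbf{M}}_k(\mathbf{Y}^{\dagger})$: Approximation of tensor $\mathbf{M}_k(\mathbf{Y}^{\dagger})$ assuming $\mathbf{Y}^{\dagger}$ are independent
$\hat{\mathbf{M}}_k(\hat{\mathbf{X}})$: Approximation of tensor $\mathbf{M}_k(\hat{\mathbf{X}})$ assuming $\mathbf{Y}^{\dagger}$ are independent
$\tilde{\mathbf{w}}(\mathbf{R})$: Exposures on factors $\mathbf{Y} = \mathbf{R}\mathbf{Y}^{\star}$
$\tilde{\mathbf{w}}, \tilde{\mathbf{w}}(\mathbf{I}_K)$: Exposures on principal components
$\tilde{\mathbf{w}}(\mathbf{R}^{\dagger})$: Exposures on independent components
$\mathbf{1}$: $N$-dimensional vector of ones
$\mathbf{1}^{\pm}$: $N$-dimensional vector with entries in $\{-1, 1\}$
$\mathbf{1}_K$: $K \le N$-dimensional vector of ones
$\mathbf{1}_K^{\pm}$: $K \le N$-dimensional vector with entries in $\{-1, 1\}$
$\mathbf{I}_N$: $N \times N$ identity matrix
$\Omega_X$: Support set of random variable $X$
$\mathcal{W}$: Set of allowed portfolios $\mathbf{w}$
$\tilde{\mathcal{W}}$: Set of allowed exposures on principal components
$\tilde{\mathcal{W}}^{\dagger}$: Set of allowed exposures on independent components
$\tilde{\mathcal{W}}(\mathbf{R})$: Set of allowed exposures on factors $\mathbf{Y} = \mathbf{R}\mathbf{Y}^{\star}$
$\mathcal{W}_{FVP}(\mathbf{R})$: Set of $2^{K-1}$ factor-variance-parity portfolios for factors $\mathbf{Y} = \mathbf{R}\mathbf{Y}^{\star}$
$SO(K)$: Set of $K \times K$ rotation matrices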

Other symbols

$\sigma_{X_i|P}$: Volatility contribution of asset $X_i$ to the portfolio return $P$
$\mathrm{MVaR}_{Y_i^{\dagger}|\hat{P}}(\varepsilon)$: Modified VaR contribution of independent component $Y_i^{\dagger}$ to the reduced portfolio return $\hat{P}$
$\hat{f}_P^{GME}$: Gaussian-mixture estimator of the density $f_P$
$\hat{f}_P^{KDE}$: Kernel-density estimator of the density $f_P$
$\mathcal{N}(\mu, \sigma)$: Gaussian law
$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$: Multivariate Gaussian law
$\mathcal{GN}(\alpha, \beta, \gamma)$: Generalized-normal law
$\mathcal{T}(\nu)$: Student t law with $\nu$ degrees of freedom
$\delta_\alpha$: Dirac-delta density centered in $\alpha$
$C(\mathbf{w}; f_T)$: Objective function for the minimum-divergence portfolio $\mathbf{w}_{f_T}$
$C(\mathbf{w}; \alpha, \beta, \gamma)$: Objective function for the minimum-divergence portfolio $\mathbf{w}_{\alpha,\beta,\gamma}$ given a generalized-normal target return $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$
$C(\mathbf{w}; \alpha, \beta)$: Objective function for the minimum-divergence portfolio $\mathbf{w}_{\alpha,\beta}$ given a Gaussian target return $T \sim \mathcal{N}(\alpha, \beta)$
$C(\mathbf{w}; \alpha)$: Objective function for the minimum-divergence portfolio $\mathbf{w}_\alpha$ given a Dirac-delta target-return density $f_T = \delta_\alpha$
$\hat{C}(\mathbf{w}; \gamma)$: Estimate of $C(\mathbf{w}; \alpha, \beta, \gamma)$ given a reference portfolio $\mathbf{w}_0$
$\hat{C}(\mathbf{w})$: Estimate of $C(\mathbf{w}; \alpha, \beta)$ given a reference portfolio $\mathbf{w}_0$
$\delta$: Shrinkage intensity
$R$: Number of rolling windows
$O(N^k)$: Polynomial in $N$ of order $k$
$\kappa_{\min}$: Minimum portfolio-return excess kurtosis
$\kappa_{\max}$: Maximum portfolio-return excess kurtosis
$\kappa_{IC}$: Excess kurtosis of an IC-variance-parity portfolio
$\kappa_{PC}$: Excess kurtosis of a PC-variance-parity portfolio
$\rho_G(X, Y)$: Correlation of $G$-transform of $X$ and $Y$

Introduction

“How should I optimally allocate my wealth?” That is the investor’s question. The portfolio-selection problem is too multifaceted to hope for the existence of one ideal solution, but providing a suitable framework to think about and answer this question remains of utmost importance. Indeed, portfolio selection underpins a substantial portion of the activities conducted by financial institutions and private investors, as well as by individuals for whom financial markets are made increasingly accessible at a low cost via recent developments such as exchange-traded funds and digital investment platforms. In turn, the portfolio-selection rules followed by these agents have important consequences for the financial and economic system as a whole.

The 1950s marked a turning point with the 1952 seminal paper of Harry Markowitz— “Portfolio selection”—for which he was awarded the Nobel Prize in 1990 (Markowitz 1952). Markowitz was the first to provide a rigorous mathematical framework to study the portfolio-selection problem, departing from heuristic and less systematic strategies used by investors until then. He introduced in particular the concept of mean-variance efficient frontier: a portfolio of assets is mean-variance efficient if and only if no other portfolio with the same mean return can achieve a lower variance. He also recognized the central notion of diversification, realizing that one can reduce variance by combining imperfectly correlated assets. The mean-variance theory of Markowitz was the start of the so-called modern portfolio theory, which deeply influenced financial markets and led down the road to a plethora of academic papers devoted to this theory and its extensions.
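As a sketch of the efficiency definition above, written in the notation of this thesis (weights $\mathbf{w}$, mean vector $\boldsymbol{\mu}$, covariance matrix $\boldsymbol{\Sigma}$, vector of ones $\mathbf{1}$), the mean-variance portfolio with target mean $\mu_0$ can be stated as the program below. The full-investment constraint and the absence of any other constraint are assumptions made here for illustration. The second display illustrates the diversification effect for two assets held with nonnegative weights $w_1, w_2$ and correlation $\rho$, symbols introduced only for this example.

```latex
\[
\mathbf{w}_{\mu_0} = \arg\min_{\mathbf{w}} \; \mathbf{w}'\boldsymbol{\Sigma}\mathbf{w}
\quad \text{subject to} \quad
\mathbf{w}'\boldsymbol{\mu} = \mu_0, \qquad \mathbf{w}'\mathbf{1} = 1 .
\]
\[
\sigma_P^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2 w_1 w_2 \rho\, \sigma_1 \sigma_2
\;\le\; (w_1\sigma_1 + w_2\sigma_2)^2 ,
\]
```

The inequality is strict whenever $\rho < 1$ and both weights are positive, so the portfolio volatility then lies below the weighted average of the individual volatilities.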

Markowitz’s theory exclusively relies on the mean and variance of portfolio returns to find the optimal portfolio. This makes sense if one of two sufficient conditions is met: investors’ preferences depend only on mean and variance, or the multivariate distribution of asset returns is Gaussian (more precisely, elliptical). Actually, Markowitz (2014) and others also showed that these sufficient conditions might not be necessary, because the optimal portfolio of the investor can often be well approximated by some portfolio on the mean-variance efficient frontier. Leaving that aside, the issue with Markowitz’s theory is that the two sufficient conditions above are often violated in practice. Much evidence contradicts the first assumption; see the paper of Scott and Horvath (1980). It is well known, for example, that investors prefer distributions that are skewed to the right rather than skewed to the left. The second assumption is not realistic either; empirical asset-return distributions are not Gaussian and, in particular, display probabilities for extreme returns that are much higher than predicted by the Gaussian distribution; see the early work of Mandelbrot (1963).

Let us take an example to illustrate this non-Gaussian behavior. Consider a U.S. investor who wants to invest in the bond and equity markets. For this purpose, he considers an investment set made of the CBOE 10-year Treasury Note Yield index (TNX) and the S&P500 index (SPX). He collects daily prices from January 1962 to August 2019; see Figure I.1 for the historical returns of the two indices. He decides to follow the conservative minimum-variance portfolio strategy, and computes from the above data that the minimum-variance portfolio invests 36.93% in TNX and 63.07% in SPX. The corresponding portfolio-return annualized mean and volatility are 5.09% and 13.24%, respectively. The investor now wants to assess the historical tail risk associated with this portfolio. To do so, he computes the annualized 1% Value-at-Risk (VaR); that is, the minimum annual loss one can expect to face with a 1% probability. If the investor assumes that the asset returns are Gaussian, he can easily compute that the 1% VaR is 25.71%. However, is the Gaussian assumption reasonable? Figure I.1 shows that it is not: the minimum-variance portfolio-return density has much fatter tails than the Gaussian. In particular, the annualized 1% VaR implied by the actual historical density is 33.96%, which is 8.25 percentage points more than the Gaussian VaR. Thus, the Gaussian assumption underpinning the mean-variance portfolio-selection paradigm is dangerous. It underestimates the probability of extreme events and leads to suboptimal portfolios with respect to the higher moments of portfolio returns for investors who truly care about them.
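The Gaussian figure in this example can be checked with a short computation. The sketch below assumes that the Gaussian VaR at level p is computed as the standard-normal quantile times the volatility, minus the mean; this convention is an assumption, but it reproduces the 25.71% quoted above. The function name `gaussian_var` is illustrative.

```python
from statistics import NormalDist


def gaussian_var(mean, vol, level=0.01):
    """Gaussian Value-at-Risk, reported as a positive loss.

    Assumes returns are N(mean, vol^2); the VaR at `level` is the loss
    that is exceeded with probability `level`.
    """
    z = NormalDist().inv_cdf(1.0 - level)  # about 2.326 for level = 1%
    return z * vol - mean


# Annualized mean and volatility quoted in the text.
var_1pct = gaussian_var(mean=0.0509, vol=0.1324)
print(f"Gaussian 1% VaR: {var_1pct:.2%}")

# Gap with the historical VaR of 33.96% reported in the text.
gap = 0.3396 - var_1pct
print(f"Historical minus Gaussian VaR: {gap:.2%}")
```

The historical VaR itself cannot be reproduced here because it depends on the return data, which is why it enters only as the number reported in the text.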

For this reason, extensions of Markowitz’s mean-variance portfolio theory that account for the non-Gaussianity of asset returns have been put forward. However, such extensions are challenging, for two main reasons.

1. Higher moments are difficult to estimate accurately because they are very sensitive to outliers. For example, a sample of size 100 from a unit-variance random variable containing a value of 5 will have an excess kurtosis of at least 5^4/100 − 3 = 3.25.

2. Whereas mean-variance portfolio selection essentially boils down to the maximization of a quadratic utility function, the literature review of Chapter 1 shows that going beyond the first two moments can be accomplished in many different ways. In particular, no consensus emerges from the literature about how investors should allocate their wealth when higher moments cannot be ignored.
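The arithmetic in the first item can be checked numerically. The sketch below builds a hypothetical sample of 100 points that contains the value 5 and has sample mean zero and sample variance one (both taken as assumptions behind the bound), then compares its excess kurtosis with 5^4/100 − 3.

```python
import numpy as np

n = 100
outlier = 5.0

# Hypothetical construction of the other 99 points: 49 at +1, 49 at -1 and
# one at 0, rescaled and shifted so the full sample has mean 0 and variance 1.
base = np.concatenate([np.full(49, 1.0), np.full(49, -1.0), [0.0]])
base = base / np.sqrt(np.mean(base**2))      # mean 0, mean square 1
shift = -outlier / (n - 1)                   # makes the total sum zero
scale = np.sqrt((n - outlier**2) / (n - 1) - shift**2)
sample = np.append(shift + scale * base, outlier)

mean = sample.mean()
var = sample.var()                           # population variance (ddof=0)
excess_kurtosis = np.mean((sample - mean) ** 4) / var**2 - 3.0
lower_bound = outlier**4 / n - 3.0           # 5^4/100 - 3

print(mean, var, excess_kurtosis, lower_bound)
```

The single observation at 5 contributes 5^4/100 = 6.25 to the fourth moment on its own, which is where the lower bound comes from; the remaining points can only add to it.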

Figure I.1: Minimum-variance portfolio on S&P500 and U.S. 10y Treasury Note

[Figure omitted. Panel (a), “Additive returns”: additive returns of TNX and SPX, 1962 to 2019. Panel (b), “Density of minimum-variance portfolio”: density of the minimum-variance portfolio and the corresponding Gaussian density, plotted against the daily return from -5% to 5%.]

Notes. The top figure depicts the additive returns of the CBOE 10 year Treasury Note Yield index (TNX) and the S&P500 index (SPX), from January 1962 to August 2019. The bottom figure depicts the return density of the minimum-variance portfolio, obtained via kernel density estimation with Gaussian kernels and the bandwidth of Silverman (1986). The Gaussian density with the same mean and variance is depicted on the bottom figure too.

Given these considerations, the purpose of this thesis is to tackle the higher-moment portfolio-selection problem. We accomplish this from a novel angle based on information theory. This branch of mathematics was born with the 1948 seminal paper of Claude Shannon: “A mathematical theory of communication” (Shannon 1948). Information theory was originally used in signal processing to answer different questions related to the information contained in data signals. In particular, Shannon (1948) showed that the limit to which a signal can be compressed without losing information is given by a quantity known as entropy. Intuitively, the higher the entropy of a signal, the more random it is, and the more difficult it is to compress. In physics, the second law of thermodynamics, according to which the entropy of an isolated system never decreases, is often rephrased as “disorder always increases.” In probability theory, the entropy of a random variable measures its randomness, or uncertainty. What is particularly interesting in our portfolio-selection context is that entropy measures the uncertainty of the whole distribution of the random variable, going beyond the variance. Thus, because investors aim to minimize the risk of their portfolio returns, several researchers have recently proposed using entropy as an investment criterion; see Section 1.4 for a detailed review of the literature. However, its application to portfolio selection is merely nascent and much remains to be explored about the potential benefits of entropy in this context.

In Chapter 1, we begin by reviewing the relevant literature and mathematical concepts related to portfolio selection, namely: mean-variance approaches, higher-moment approaches, risk-parity approaches and information-theoretic approaches.

In Chapter 2, we study a natural alternative to the popular minimum-variance portfolio strategy that consists in minimizing the Rényi entropy (Rényi 1961) of portfolio returns. The Rényi entropy recovers Shannon entropy as a special case, and has more flexibility as it features a parameter that can be tuned to control the relative contributions of the central and tail parts of the distribution in the final entropy value. The proposed portfolio strategy coincides with the minimum-variance portfolio for Gaussian asset returns but, otherwise, also accounts for higher moments. The exponential of Rényi entropy is shown to provide a more natural risk measure than the plain Rényi entropy: it is closely connected to the class of deviation risk measures. However, it violates subadditivity in some extreme scenarios. Although Rényi entropy has some desirable features, we show via a Gram-Charlier expansion and an empirical study that minimizing Rényi entropy can yield portfolios with worse higher moments than minimizing the variance. Thus, whereas minimizing entropy may seem like a natural higher-moment alternative to minimizing the variance, it can actually be undesirable with respect to higher moments.

For this reason, we turn in Chapters 3 to 5 to different ways of applying entropy in portfolio selection that, as we show, are theoretically appealing with respect to higher moments. In particular, we consider two popular frameworks. First, risk parity, which is popular among practitioners (Chapter 3). Second, expected utility, which is praised by academics (Chapter 4 and Chapter 5). Our aim is to make them suitable in the presence of higher moments.

In Chapter 3, we study a popular heuristic approach used by practitioners, called the factor-risk-parity portfolio, introduced by Meucci (2009). Its aim is to diversify the portfolio-return risk across uncorrelated factors underlying the asset returns. A natural choice of factors proposed in the literature is the principal components. However, as we show, the factor-risk-parity problem is actually ill-posed: there exist an infinite number of sets of uncorrelated factors, and they are all associated with a different factor-risk-parity portfolio. This makes the principal-components choice arbitrary. We propose to solve this arbitrariness issue by relying on factors that are maximally independent, rather than merely uncorrelated. These so-called independent components are obtained via independent component analysis (ICA), and correspond to the rotation of the principal components with minimum Shannon entropy. In addition, we show theoretically that relying on the independent components provides a natural way of reducing the portfolio-return kurtosis, which is desirable in terms of tail risk. This is confirmed on empirical data. Thus, we provide theoretical foundations for the use of factor-risk parity as a higher-moment portfolio strategy, provided one relies on the independent components as factors.

Whereas risk parity is popular among practitioners, most academics dismiss it as a heuristic approach without clear theoretical foundations. Indeed, risk parity is normatively debatable as “diversification” is not, as such, a financial objective. Instead, academics favor the expected-utility approach, such as the mean-variance portfolio, which maximizes the expected portfolio-return utility. We turn to this framework in the next two chapters.

In Chapter 4, we show that ICA provides a simple and elegant framework to obtain robust estimates of moment-based portfolios, such as those based on expected utility via a Taylor series expansion. These portfolios suffer from the curse of dimensionality because the number of comoments to estimate quickly explodes with the number of assets and the moment order considered. This severely impacts the stability and out-of-sample performance of the investment strategy. We tackle the curse of dimensionality by enhancing robustness via a sparse representation of the comoment tensors of asset returns. We achieve this by projecting the asset returns onto a small set of maximally independent factors, found via ICA. We show that this solves the curse of dimensionality: by neglecting the remaining dependence of the independent components, we drastically reduce the number of free parameters in the comoment tensors. This leads to well-performing, low-turnover investment strategies that are computationally efficient.

In Chapter 5, we revisit the expected-utility framework. A difficulty in the application of expected utility is that, as we explain in Chapter 1, specifying the utility function is arduous when higher moments matter. In particular, commonly used utility functions such as the constant-relative-risk-aversion one are locally quadratic, and lead to portfolios close to the mean-variance optimal one. Therefore, in Chapter 5, we introduce an alternative way of capturing all investors’ preferences via a single function, which is more natural than the utility function. Specifically, we recognize that investors ultimately care for the distribution of their portfolio returns, and we propose a framework whereby investors’ preferences are captured via a target-return distribution. The corresponding strategy is the minimum-divergence portfolio, whose return distribution is as close as possible to the target-return one, as measured by the Kullback-Leibler divergence (relative entropy). Relying on a generalized-normal target return, we show that we recover Markowitz’s efficient frontier when asset returns are Gaussian and when targeting a Dirac-delta distribution. For the Gaussian target return, we show that the objective function admits a natural decomposition in three terms that respectively measure the fit to the target-return mean, variance and higher moments. The latter are naturally accounted for via the Shannon entropy of standardized portfolio returns, which drives higher moments toward those of the Gaussian. Empirical results confirm that our strategy helps to improve higher moments compared to mean-variance portfolios.

The thesis contributions raise open questions to be addressed in future research, some of which are discussed in Chapter 6. All original results are stated in propositions, and the most important ones in theorems. Appendices at the end of each chapter contain proofs for all results, as well as supplementary materials.

I wish you a pleasant reading of this thesis!

Chapter 1

Research background

In this first chapter, we provide a review of the relevant literature and the mathematical concepts related to portfolio selection that are needed to contextualize, understand and appreciate the contributions of this thesis, covered in Chapters 2 to 5. We divide this chapter into five sections:

1. Mean-variance approaches (Section 1.1)

2. Higher-moment approaches (Section 1.2)

3. Risk-parity approaches (Section 1.3)

4. Information-theoretic approaches (Section 1.4)

5. Thesis contributions (Section 1.5)

Concerning the notation, we adopt the following convention:

• Deterministic scalars: lower case and plain text (a)

• Deterministic vectors: lower case and bold text (a) in column format

• Matrices: upper case and bold text (A)

• Univariate random variables: upper case and plain text (X)

• Multivariate random variables: upper case and bold text (X) in column format

1.1 Mean-variance approaches

Mean-variance portfolio theory marks the start of modern portfolio theory. It remains the baseline approach employed by both academics and practitioners. We review the related definitions in Section 1.1.1, the issue of estimation risk in Section 1.1.2, and robust mean-variance portfolios in Section 1.1.3.


1.1.1 Definitions

The so-called modern portfolio theory was born in a series of papers by Markowitz (1952, 1959), Roy (1952), Sharpe (1962) and Merton (1971, 1972) among the most important ones.¹ They together introduced a quantitative framework to formally describe and solve the investment problem of an investor who wants to optimize the mean-variance trade-off of his portfolio returns. The solution to this problem is summarized by the celebrated mean-variance efficient frontier.

Let us formalize the mean-variance portfolio-selection problem. We take as given an investment set of N assets among which the investor can allocate his wealth. At some time t, the ith asset is represented by its price $S_{i,t}$. Given a time frequency τ (e.g. daily, monthly), the arithmetic return of the ith asset from t to t + τ is then

$$X_{i,t,\tau} := \frac{S_{i,t+\tau} - S_{i,t}}{S_{i,t}}. \tag{1.1}$$

In this thesis, we focus on the single-period investment problem; we do not model the dynamics of asset returns. For this reason, we assume that the asset returns $X_{i,t,\tau}$ are i.i.d. over time and we drop the index t. The time frequency τ will also be clear from the context, and we adopt the simpler notation

$$X_{i,t,\tau} \leftarrow X_i.$$

The N asset returns together form the multivariate random variable $\mathbf{X} = (X_1, \dots, X_N)'$. In this thesis, we consider that there is no risk-free asset: the volatility $\sigma_{X_i} > 0$ for all i.

Define the asset mean-return vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ having entries $\mu_i := E[X_i]$ and $\Sigma_{ij} := E[(X_i - \mu_i)(X_j - \mu_j)]$, respectively. We always assume that $\boldsymbol{\Sigma}$ is of full rank and thus invertible. The investor’s portfolio problem is how to split his wealth among the N assets. The portfolio is represented by a vector of weights $\mathbf{w} = (w_1, \dots, w_N)'$. Unless otherwise stated, $\mathbf{w}$ belongs to the set

$$\mathcal{W} := \{\mathbf{w} \in \mathbb{R}^N \mid \mathbf{1}'\mathbf{w} = 1\}, \tag{1.2}$$

where $\mathbf{1}$ is an N-dimensional vector of ones. That is, the portfolio weights sum to one. The resulting return of this portfolio is

$$P = P(\mathbf{w}) := \mathbf{w}'\mathbf{X} \tag{1.3}$$

with mean and variance

$$(\mu_P, \sigma_P^2) = (\mathbf{w}'\boldsymbol{\mu}, \mathbf{w}'\boldsymbol{\Sigma}\mathbf{w}). \tag{1.4}$$

¹ It is little known that Roy (1952) independently discovered the basic tenets of mean-variance portfolio theory. Markowitz (1999, p.5) states that “On the basis of Markowitz (1952), I am often called the father of modern portfolio theory (MPT), but Roy (1952) can claim an equal share of this honor.”
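The definitions (1.1) to (1.4) can be traced on a toy data set. The sketch below uses hypothetical prices for two assets and replaces the population moments by their sample counterparts, purely to show how the objects fit together; none of the numbers come from the thesis.

```python
import numpy as np

# Hypothetical prices for N = 2 assets over 4 dates (rows are dates).
prices = np.array([
    [100.0, 50.0],
    [102.0, 49.0],
    [101.0, 51.0],
    [104.0, 52.0],
])

# Arithmetic returns as in (1.1): (S_{t+tau} - S_t) / S_t.
returns = prices[1:] / prices[:-1] - 1.0

# Portfolio weights in the set W of (1.2): they must sum to one.
w = np.array([0.4, 0.6])
assert np.isclose(w.sum(), 1.0)

# Sample moments used here as stand-ins for mu and Sigma.
mu = returns.mean(axis=0)
sigma = np.cov(returns, rowvar=False, bias=True)

# Portfolio return (1.3) and its mean and variance (1.4).
p = returns @ w
mu_p = w @ mu
var_p = w @ sigma @ w

print(mu_p, var_p)
```

Computing the mean and variance of the portfolio-return series `p` directly gives the same two numbers, which is the content of (1.4).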

Having introduced the necessary notation, we report the main properties of mean-variance-efficient portfolios. We use the concise notation of Bodnar et al. (2013).

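Property 1.1 (Mean-variance-efficient portfolios). Let the matrix $\mathbf{Q}$ be defined as

$$\mathbf{Q} := \boldsymbol{\Sigma}^{-1} - \frac{\boldsymbol{\Sigma}^{-1}\mathbf{1}\mathbf{1}'\boldsymbol{\Sigma}^{-1}}{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\mathbf{1}}. \tag{1.5}$$

We have the following properties:

(i) The minimum-variance portfolio is given by

$$\mathbf{w}_{MV} := \operatorname*{argmin}_{\mathbf{w} \in \mathcal{W}} \sigma_P^2 \implies \mathbf{w}_{MV} = \frac{\boldsymbol{\Sigma}^{-1}\mathbf{1}}{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\mathbf{1}}, \tag{1.6}$$

with return mean and variance

$$(\mu_{MV}, \sigma_{MV}^2) := \left(\frac{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\mathbf{1}}, \frac{1}{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\mathbf{1}}\right). \tag{1.7}$$

(ii) Given a mean return $\mu_0$, the mean-variance portfolio is given by

$$\mathbf{w}_{\mu_0} := \operatorname*{argmin}_{\mathbf{w} \in \mathcal{W}} \sigma_P^2 \ \text{ subject to } \ \mu_P = \mu_0 \implies \mathbf{w}_{\mu_0} = \mathbf{w}_{MV} + \frac{\mu_0 - \mu_{MV}}{\boldsymbol{\mu}'\mathbf{Q}\boldsymbol{\mu}} \mathbf{Q}\boldsymbol{\mu}. \tag{1.8}$$

(iii) $\mathbf{w}_{\mu_0}$ is a mean-variance-efficient portfolio if $\mu_0 \geq \mu_{MV}$.

(iv) The mean-variance-efficient portfolio can also be defined via the quadratic utility function:

$$\mathbf{w}_\lambda := \operatorname*{argmax}_{\mathbf{w} \in \mathcal{W}} \ \mu_P - \frac{\lambda}{2}\sigma_P^2 \implies \mathbf{w}_\lambda = \mathbf{w}_{MV} + \frac{1}{\lambda}\mathbf{Q}\boldsymbol{\mu}, \tag{1.9}$$

with the equivalence $\mathbf{w}_\lambda = \mathbf{w}_{\mu_0}$ if $\lambda = \frac{\boldsymbol{\mu}'\mathbf{Q}\boldsymbol{\mu}}{\mu_0 - \mu_{MV}}$ for $\mu_0 \geq \mu_{MV}$.

(v) The tangent portfolio maximizing the Sharpe ratio (Sharpe 1994),

$$SR_P := \frac{\mu_P}{\sigma_P}, \tag{1.10}$$

is given by

$$\mathbf{w}_{SR} := \operatorname*{argmax}_{\mathbf{w} \in \mathcal{W}} SR_P \implies \mathbf{w}_{SR} = \frac{\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}. \tag{1.11}$$

(vi) The mean-variance efficient frontier is given by the function

$$\sigma_{EF}(\mu_0) := \sqrt{\sigma_{MV}^2 + \frac{(\mu_0 - \mu_{MV})^2}{\boldsymbol{\mu}'\mathbf{Q}\boldsymbol{\mu}}} \tag{1.12}$$

for all $\mu_0 \geq \mu_{MV}$.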
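The closed forms of Property 1.1 are easy to evaluate numerically, which also gives a way to check them against each other. The sketch below uses a hypothetical three-asset mean vector and covariance matrix (illustrative numbers only) and evaluates (1.5) to (1.12).

```python
import numpy as np

# Hypothetical inputs for N = 3 assets (illustrative numbers only).
mu = np.array([0.05, 0.08, 0.12])
sigma = np.array([
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.02],
    [0.00, 0.02, 0.16],
])
ones = np.ones(len(mu))
sigma_inv = np.linalg.inv(sigma)
denom = ones @ sigma_inv @ ones

# Matrix Q of (1.5).
q = sigma_inv - np.outer(sigma_inv @ ones, ones @ sigma_inv) / denom

# (i) Minimum-variance portfolio (1.6) and its moments (1.7).
w_mv = sigma_inv @ ones / denom
mu_mv = w_mv @ mu
var_mv = 1.0 / denom

# (ii) Mean-variance portfolio for a target mean mu_0 above mu_MV (1.8).
mu_0 = mu_mv + 0.02
w_mu0 = w_mv + (mu_0 - mu_mv) / (mu @ q @ mu) * (q @ mu)

# (iv) Quadratic-utility portfolio (1.9), with lambda chosen to match mu_0.
lam = (mu @ q @ mu) / (mu_0 - mu_mv)
w_lam = w_mv + (q @ mu) / lam

# (v) Tangent portfolio (1.11).
w_sr = sigma_inv @ mu / (ones @ sigma_inv @ mu)

# (vi) Efficient-frontier volatility (1.12) at mu_0.
vol_ef = np.sqrt(var_mv + (mu_0 - mu_mv) ** 2 / (mu @ q @ mu))

print(w_mv, w_mu0, w_sr, vol_ef)
```

With these inputs, the volatility of `w_mu0` computed directly from the covariance matrix coincides with `vol_ef`, and `w_lam` coincides with `w_mu0`, as stated in items (iv) and (vi).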

1.1.2 Estimation risk

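Suppose that the investor wants to determine his portfolio at a given point in time to optimize his expected quadratic utility. Clearly, the investor does not know the true mean and covariance matrix of asset returns, and thus, will need to estimate his mean-variance-efficient portfolio via historical asset-return data. The standard estimation procedure to perform this is called the sample plug-in estimate, which consists of three steps.

1. Given a fixed time frequency τ (say, monthly), collect a sample of T historical return observations for each asset; for example, ten years of past monthly returns (T = 120). We denote by $\mathbf{X}_t = (X_{1,t}, \dots, X_{N,t})'$ the asset-return vector at each time t = 1, ..., T.

2. Compute the sample (maximum-likelihood) estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

$$\hat{\boldsymbol{\mu}} := \frac{1}{T}\sum_{t=1}^{T} \mathbf{X}_t, \tag{1.13}$$

$$\hat{\boldsymbol{\Sigma}} := \frac{1}{T}\sum_{t=1}^{T} (\mathbf{X}_t - \hat{\boldsymbol{\mu}})(\mathbf{X}_t - \hat{\boldsymbol{\mu}})'. \tag{1.14}$$

3. Compute the sample mean-variance-efficient portfolio by plugging $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ in place of the true $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

$$\hat{\mathbf{w}}_{MV} = \frac{\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{1}}{\mathbf{1}'\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{1}}, \qquad \hat{\mathbf{w}}_\lambda = \hat{\mathbf{w}}_{MV} + \frac{1}{\lambda}\hat{\mathbf{Q}}\hat{\boldsymbol{\mu}}, \qquad \hat{\mathbf{w}}_{SR} = \frac{\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}}{\mathbf{1}'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}}. \tag{1.15}$$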
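The three steps above can be sketched as follows. The data are simulated from hypothetical "true" parameters, an assumption made only so that the plug-in weights can be compared with their population counterparts; the helper name `plug_in_portfolios` and the value of the risk-aversion coefficient are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" monthly parameters, used only to simulate data.
mu_true = np.array([0.005, 0.007, 0.010])
sigma_true = np.array([
    [0.0016, 0.0004, 0.0000],
    [0.0004, 0.0036, 0.0008],
    [0.0000, 0.0008, 0.0064],
])

# Step 1: a sample of T = 120 monthly return vectors (simulated as Gaussian).
T = 120
x = rng.multivariate_normal(mu_true, sigma_true, size=T)

# Step 2: sample (maximum-likelihood) estimators (1.13) and (1.14).
mu_hat = x.mean(axis=0)
centered = x - mu_hat
sigma_hat = centered.T @ centered / T


# Step 3: plug the estimates into the closed forms, as in (1.15).
def plug_in_portfolios(mu, sigma, lam):
    ones = np.ones(len(mu))
    sigma_inv = np.linalg.inv(sigma)
    denom = ones @ sigma_inv @ ones
    q = sigma_inv - np.outer(sigma_inv @ ones, ones @ sigma_inv) / denom
    w_mv = sigma_inv @ ones / denom
    w_lam = w_mv + (q @ mu) / lam
    w_sr = sigma_inv @ mu / (ones @ sigma_inv @ mu)
    return w_mv, w_lam, w_sr


w_mv_hat, w_lam_hat, w_sr_hat = plug_in_portfolios(mu_hat, sigma_hat, lam=5.0)
w_mv_true, w_lam_true, w_sr_true = plug_in_portfolios(mu_true, sigma_true, lam=5.0)

print("estimated:", w_mv_hat, w_lam_hat, w_sr_hat)
print("true:     ", w_mv_true, w_lam_true, w_sr_true)
```

Rerunning the sketch with other seeds gives a rough feel for how much the estimated weights move from sample to sample.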

The sample plug-in estimate of the mean-variance portfolio is very sensitive to estimation risk because treating $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ as if they were equal to their population counterparts ignores that sample estimators carry potentially substantial estimation errors with them. They can be very different from the true mean and covariance matrix of asset returns. For this reason, Michaud (1989, p.33–34) has called sample mean-variance portfolios “estimation-error maximizers,” in the sense that “mean-variance optimization significantly overweights (underweights) those securities that have large (small) estimated returns, negative (positive) correlations and small (large) variances. These securities are, of course, the ones most likely to have large estimation errors.” This severely affects the out-of-sample performance of mean-variance portfolios. Out of sample means here that the future realized performance depends on portfolio weights estimated via past return data.

For mean-variance portfolios, it is well documented that estimation risk arises mainly from the mean-return vector µ rather than the covariance matrix Σ. This can be explained by a combination of two phenomena.

1. The variance of sample means tends to be much larger than the variance of sample variances and covariances; see Merton (1980). As a result, “even if the expected return on the market were known to be a constant for all time, it would take a very long history of returns to obtain an accurate estimate” (Merton 1980, p.326). To get an idea of what “a very long history of returns” might represent, DeMiguel et al. (2009b) show that, assuming Gaussian asset returns, the estimation window required for the sample tangent portfolio to outperform the equally weighted portfolio,

$$\mathbf{w}_{EW} := \mathbf{1}/N, \tag{1.16}$$

is, for example, about 600 months (50 years) when N = 50 and the true Sharpe ratios of the tangent and equally weighted portfolios are 0.40 and 0.20, respectively.

2. The mean-variance portfolio is much more sensitive to changes in $\boldsymbol{\mu}$ than in $\boldsymbol{\Sigma}$; see Best and Grauer (1991) and Chopra and Ziemba (1993). In particular, for a moderate value of the risk-aversion coefficient λ, the latter find that errors in means are about ten times more important than errors in variances, which are in turn about twice as important as errors in covariances.

As a result, the minimum-variance portfolio $\mathbf{w}_{MV}$, which is the only portfolio on the efficient frontier that does not require an estimate of asset mean returns, has been shown to outperform any other mean-variance portfolio in terms of out-of-sample Sharpe ratio and turnover (the magnitude of changes in portfolio weights over time); see, among others, Jorion (1986), Chan et al. (1999), Jagannathan and Ma (2003), DeMiguel et al. (2009a), DeMiguel et al. (2009b) and Kourtis et al. (2012). In particular, Jagannathan and Ma (2003, p.1652) conclude that “the estimation error in the sample mean is so large nothing much is lost in ignoring the mean altogether”.

In order to formalize the superiority of the sample minimum-variance portfolio over the sample mean-variance portfolio, we follow Kan and Zhou (2007) and Kourtis et al. (2012). They consider the case where the investment set is made of N risky assets and one risk-free asset. In that case, the mean-variance-efficient portfolio is given by the risky-asset weights

$$\mathbf{w}_\lambda = \frac{1}{\lambda}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}, \tag{1.17}$$

and a weight of $1 - \mathbf{1}'\mathbf{w}_\lambda$ on the risk-free asset. We wish to compare this portfolio with the minimum-variance portfolio that invests in the risky assets only (otherwise, it would trivially invest 100% in the risk-free asset), given by $\mathbf{w}_{MV}$ in (1.6). To that end, we denote the quadratic utility associated with a portfolio $\mathbf{w}$ as

$$U(\mathbf{w}) := \mu_{P(\mathbf{w})} - \frac{\lambda}{2}\sigma_{P(\mathbf{w})}^2, \tag{1.18}$$

and we define the expected quadratic-utility loss resulting from approximating the true portfolio $\mathbf{w}$ via a sample estimator $\hat{\mathbf{w}}$ as

$$L(\hat{\mathbf{w}}, \mathbf{w}) := U(\mathbf{w}) - E(U(\hat{\mathbf{w}})), \tag{1.19}$$

where the expectation is taken with respect to the true distribution of asset returns $\mathbf{X}$. In this setup, assuming Gaussian asset returns, Kan and Zhou (2007) show that²

$$L(\hat{\mathbf{w}}_\lambda, \mathbf{w}_\lambda) = \frac{1}{2\lambda}\left[(1 - c_1)SR^2 + c_2\right], \tag{1.20}$$

where $SR^2 = \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ is the squared Sharpe ratio of the mean-variance portfolio $\mathbf{w}_\lambda$ and

$$c_1 = \frac{T}{T - N - 2}\left(2 - \frac{T(T - 2)}{(T - N - 1)(T - N - 4)}\right), \tag{1.21}$$

$$c_2 = \frac{NT(T - 2)}{(T - N - 1)(T - N - 2)(T - N - 4)}. \tag{1.22}$$

² Note that because $\mathbf{w}_\lambda$ maximizes $U(\mathbf{w})$, $L(\hat{\mathbf{w}}, \mathbf{w}_\lambda)$ in (1.19) is always nonnegative.

Moreover, Kourtis et al. (2012) show that

$$L(\hat{\mathbf{w}}_{MV}, \mathbf{w}_\lambda) = \frac{SR^2}{2\lambda} - \mu_{MV} + \frac{\lambda(T - 2)}{2(T - N - 1)}\sigma_{MV}^2. \tag{1.23}$$

To assess which portfolio among the sample mean-variance portfolio $\hat{\mathbf{w}}_\lambda$ and sample minimum-variance portfolio $\hat{\mathbf{w}}_{MV}$ yields the lowest quadratic-utility loss, Kourtis et al. (2012) compute the number of observations T required to have $L(\hat{\mathbf{w}}_\lambda, \mathbf{w}_\lambda) < L(\hat{\mathbf{w}}_{MV}, \mathbf{w}_\lambda)$. They find that T is often unrealistically large. To give one example, assume monthly observations and a constant annual mean return of 10% for each asset, and fix N = 25, SR = 0.2 and λ = 1. Then, Kourtis et al. (2012) find that the number of observations required is T = 828; that is, 69 years of monthly observations!
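The two loss formulas (1.20) and (1.23) can be evaluated for any sample size. The sketch below implements them as plain functions; the function names are illustrative, and the value of the minimum-variance portfolio variance in the usage lines is hypothetical, since the text does not report it. For that reason the sketch is not expected to reproduce the T = 828 figure.

```python
def kan_zhou_loss(T, N, sr2, lam):
    """Expected utility loss (1.20) of the sample mean-variance portfolio."""
    if T <= N + 4:
        raise ValueError("the denominators in (1.21)-(1.22) require T > N + 4")
    c1 = T / (T - N - 2) * (2 - T * (T - 2) / ((T - N - 1) * (T - N - 4)))
    c2 = N * T * (T - 2) / ((T - N - 1) * (T - N - 2) * (T - N - 4))
    return ((1 - c1) * sr2 + c2) / (2 * lam)


def kourtis_mv_loss(T, N, sr2, lam, mu_mv, var_mv):
    """Expected utility loss (1.23) of the sample minimum-variance portfolio."""
    if T <= N + 1:
        raise ValueError("the denominator in (1.23) requires T > N + 1")
    return sr2 / (2 * lam) - mu_mv + lam * (T - 2) / (2 * (T - N - 1)) * var_mv


# Setting loosely based on the example in the text: N = 25, SR = 0.2, lambda = 1,
# and a monthly mean of 10%/12. The variance below is a hypothetical placeholder.
N, sr2, lam = 25, 0.2**2, 1.0
mu_mv, var_mv = 0.10 / 12, 0.001

for T in (60, 120, 240, 480):
    loss_mean_var = kan_zhou_loss(T, N, sr2, lam)
    loss_min_var = kourtis_mv_loss(T, N, sr2, lam, mu_mv, var_mv)
    print(T, round(loss_mean_var, 4), round(loss_min_var, 4))
```

The loss of the sample mean-variance portfolio shrinks toward zero as T grows, whereas the loss of the sample minimum-variance portfolio converges to the utility gap between the two true portfolios.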

1.1.3 Robust mean-variance portfolios

As we have seen in Section 1.1.2, the sample plug-in estimation is largely suboptimal for portfolio-selection purposes. In particular, a wide body of research shows that the sample minimum-variance portfolio largely outperforms the sample mean-variance portfolios, both theoretically and empirically. Still, the sample minimum-variance portfolio is itself subject to estimation risk because

(i) it is well known that the sample covariance matrix carries substantial estimation errors in high-dimensional settings as it is made of O(N^2) parameters;

(ii) the sample covariance matrix is the most efficient estimator assuming Gaussian asset returns, but its efficiency is very sensitive even to slight deviations from Gaussianity; see Huber (2004) and DeMiguel and Nogales (2009).

The second issue is particularly worrying as asset returns are well known to significantly deviate from Gaussianity; see Section 1.2.

As a result, many researchers have put forward robust minimum/mean-variance portfolios that outperform their sample counterparts out of sample. Fabozzi et al. (2010) and Section 3 of Kolm et al. (2014) provide a detailed review of such portfolios, including for downside risk measures. In this section, we review five well-known approaches that will be used later in our empirical analyses. The first three approaches are directed at the minimum-variance portfolio, and the latter two at the mean-variance portfolio:

1. Shrinkage estimation of covariance matrix

2. Robust M-estimation of portfolio weights

3. Constraints on portfolio weights

4. Shrinkage estimation of portfolio weights

5. Bayesian estimation of mean-return vector

Note that, in this thesis, we take a quite general meaning of robustness. A robust portfolio is a portfolio that is stable and well-performing out of sample and, in the same vein, a robust estimator of some statistics is an estimator that yields a robust portfolio. Depending on the context, robustness can then be achieved in different ways, such as avoiding inputs that are difficult to estimate (e.g., the equally weighted or the minimum-variance portfolio), improving the robustness to outliers (e.g., the m-spacings estimator of entropy used in Chapter 2) or reducing the number of parameters to estimate (e.g., the sparse estimation approach proposed in Chapter 4).

1. Shrinkage estimation of covariance matrix

One of the most commonly used approaches to reduce the amount of estimation error in the sample covariance matrix $\hat{\boldsymbol{\Sigma}}$ is to rely instead on a shrinkage estimator of the form

$$\hat{\boldsymbol{\Sigma}}_\delta := (1 - \delta)\hat{\boldsymbol{\Sigma}} + \delta\hat{\mathbf{F}}, \quad \delta \in [0, 1], \tag{1.24}$$

where δ is the shrinkage intensity and $\hat{\mathbf{F}}$ is the target covariance matrix, taken as a sparse estimator of the true covariance matrix $\boldsymbol{\Sigma}$. The rationale is to combine the unbiased but inefficient sample covariance matrix $\hat{\boldsymbol{\Sigma}}$ with the biased but efficient target covariance matrix $\hat{\mathbf{F}}$. Shrinkage is an old technique in statistics (James and Stein 1961), but has been applied in portfolio selection for covariance-matrix estimation only starting with a series of papers published by Ledoit and Wolf (2003, 2004a, 2004b). They consider as shrinkage target, respectively, the single-factor model of Sharpe (1963), a scalar multiple of the identity matrix and a constant-correlation model. They calibrate the shrinkage intensity via an estimator $\hat{\delta}^\star$ of the shrinkage intensity $\delta^\star$ minimizing the Frobenius norm:

$$\delta^\star := \operatorname*{argmin}_{\delta \in [0, 1]} \left\| \hat{\boldsymbol{\Sigma}}_\delta - \boldsymbol{\Sigma} \right\|, \tag{1.25}$$

where the Frobenius norm of a matrix $\mathbf{A}$ is defined as $\|\mathbf{A}\| := \sqrt{\operatorname{trace}(\mathbf{A}'\mathbf{A})}$. In particular, $\hat{\delta}^\star$ can be computed in closed form. Other noteworthy approaches used in portfolio selection to shrink the covariance matrix are the shrinkage of the inverse covariance matrix of Kourtis et al. (2012) and the non-linear shrinkage approach of Ledoit and Wolf (2017). An overview of recent approaches in shrinkage of high-dimensional covariance matrices is provided in Fan et al. (2016).
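The shrinkage estimator (1.24) and the criterion (1.25) can be illustrated on simulated data. The sketch below is not the Ledoit and Wolf estimator of the intensity: it computes an "oracle" intensity that assumes the true covariance matrix is known, which is only possible in a simulation. The target (a scalar multiple of the identity), the function names and the simulated inputs are all illustrative assumptions.

```python
import numpy as np


def shrink_covariance(sigma_hat, target, delta):
    """Linear shrinkage estimator (1.24)."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return (1.0 - delta) * sigma_hat + delta * target


def frobenius_norm(a):
    """Frobenius norm, sqrt(trace(A'A))."""
    return np.sqrt(np.trace(a.T @ a))


def oracle_intensity(sigma_hat, target, sigma_true):
    """Intensity minimizing the norm in (1.25) when the true Sigma is known.

    The squared norm is a convex quadratic in delta, so the minimizer over
    [0, 1] is the unconstrained minimizer clipped to that interval.
    """
    a = sigma_hat - sigma_true
    b = target - sigma_hat
    denom = np.sum(b * b)
    if denom == 0.0:
        return 0.0
    return float(np.clip(-np.sum(a * b) / denom, 0.0, 1.0))


rng = np.random.default_rng(1)
N, T = 10, 30
sigma_true = np.diag(np.linspace(0.5, 1.5, N))        # hypothetical true covariance
x = rng.multivariate_normal(np.zeros(N), sigma_true, size=T)
sigma_hat = np.cov(x, rowvar=False, bias=True)
target = np.trace(sigma_hat) / N * np.eye(N)          # scalar multiple of the identity

delta_star = oracle_intensity(sigma_hat, target, sigma_true)
sigma_shrunk = shrink_covariance(sigma_hat, target, delta_star)

print("oracle intensity:", delta_star)
print("error of sample estimator:", frobenius_norm(sigma_hat - sigma_true))
print("error of shrunk estimator:", frobenius_norm(sigma_shrunk - sigma_true))
```

Because δ = 0 is always feasible, the shrunk estimator can never have a larger Frobenius error than the sample estimator under the oracle intensity; in practice the intensity has to be estimated from the data.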

2. Robust M-estimation of portfolio weights

Another well-known approach, made popular by DeMiguel and Nogales (2009), is the M-estimation of portfolio weights. The idea is to replace the variance, which is based on the squared loss function, by a measure of risk based on a robust loss function. That is, a loss function that grows less rapidly than the squared loss. The M-estimator of portfolio-return risk is defined as

$$s_\rho(\mathbf{w}, m) := \frac{1}{T}\sum_{t=1}^{T} \rho(\mathbf{w}'\mathbf{X}_t - m), \tag{1.26}$$

where ρ is a convex symmetric loss function with unique minimum at zero, and m is the M-estimator of portfolio mean return:

$$m := \operatorname*{argmin}_{x \in \mathbb{R}} s_\rho(\mathbf{w}, x). \tag{1.27}$$

This definition is a generalization of the sample mean and variance, which are obtained for the squared loss function $\rho(x) = x^2$. DeMiguel and Nogales (2009) propose to minimize $s_\rho(\mathbf{w}, x)$ with respect to x and $\mathbf{w}$ at the same time to find the M-portfolio:

$$\mathbf{w}_\rho := \operatorname*{argmin}_{\mathbf{w} \in \mathcal{W},\, x \in \mathbb{R}} s_\rho(\mathbf{w}, x). \tag{1.28}$$

The use of a more robust (slowly growing) loss function makes the M-portfolio more robust to deviations from normality (such as jumps and fat tails) than the sample minimum-variance portfolio. DeMiguel and Nogales (2009) use Huber’s loss function defined as

$$\rho(x) := \begin{cases} x^2/2 & \text{if } |x| \leq c \\ c(|x| - c/2) & \text{if } |x| > c \end{cases} \tag{1.29}$$

for some return threshold c. In particular, DeMiguel and Nogales (2009) show analytically that the M-portfolio has a bounded influence function. The innovative approach of DeMiguel and Nogales (2009), compared to previous studies on M-estimation of portfolio weights such as Vaz-de Melo and Camara (2003), is that robust estimation and portfolio optimization are performed in one step, as seen in (1.28). This avoids the issues related to plug-in approaches.
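The building blocks (1.26), (1.27) and (1.29) can be sketched for a fixed weight vector. The code below does not solve the joint problem (1.28); it only evaluates Huber's loss, the M-risk, and the M-estimator of the portfolio mean. The solver (iteratively reweighted averaging), the function names and the simulated data with one crash observation are illustrative assumptions, not taken from DeMiguel and Nogales (2009).

```python
import numpy as np


def huber_loss(x, c):
    """Huber's loss (1.29): quadratic near zero, linear in the tails."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, 0.5 * x**2, c * (np.abs(x) - 0.5 * c))


def m_risk(w, m, returns, c):
    """M-estimator of portfolio-return risk (1.26) under Huber's loss."""
    return float(np.mean(huber_loss(returns @ w - m, c)))


def m_location(w, returns, c, n_iter=200):
    """M-estimator of the portfolio mean return (1.27).

    Computed by iteratively reweighted averaging, an illustrative solver.
    """
    p = returns @ w
    m = np.median(p)
    for _ in range(n_iter):
        r = np.abs(p - m)
        # Observations beyond the threshold c are down-weighted by c / |residual|.
        weights = np.where(r <= c, 1.0, c / np.maximum(r, 1e-300))
        m = np.sum(weights * p) / np.sum(weights)
    return float(m)


rng = np.random.default_rng(2)
returns = 0.01 * rng.standard_normal((500, 2))
returns[0] = [-0.5, -0.5]                 # one hypothetical crash observation
w = np.array([0.5, 0.5])
c = 0.01

m_hat = m_location(w, returns, c)
sample_mean = float((returns @ w).mean())
print("sample mean:", sample_mean, "Huber location:", m_hat)
print("M-risk at Huber location:", m_risk(w, m_hat, returns, c))
```

The crash observation enters the sample mean with full weight, whereas under Huber's loss its influence on the location estimate is capped by the threshold c.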

3. Constraints on portfolio weights

Another natural approach to improve the robustness of estimated portfolios is to add constraints on portfolio weights. By limiting the space of allowed portfolios, we expect the estimated portfolios to be less sensitive to changes in the return data and, concurrently, to require less rebalancing over time. The central reference in this area is DeMiguel et al. (2009a) who study the minimum-variance portfolio under a constraint on the norm of portfolio weights. They show that their framework nests several well-known approaches such as the equally weighted portfolio of DeMiguel et al. (2009b), the no-short-selling constraint of Jagannathan and Ma (2003) and the minimum-variance portfolio under the shrinkage estimators of the covariance matrix discussed above. In particular, they consider portfolio-weight constraints of LASSO and Ridge type; that is, constraints on the $\ell_1$ and $\ell_2$ norm, respectively. Specifically, they propose to minimize the portfolio-return variance $\mathbf{w}'\boldsymbol{\Sigma}\mathbf{w}$ under the constraints $\mathbf{1}'\mathbf{w} = 1$ and $\|\mathbf{w}\|_1 \leq \delta$ or $\|\mathbf{w}\|_2 \leq \delta$. The $\ell_1$-norm constraint limits the amount of short-selling, while the $\ell_2$-norm constraint shrinks the minimum-variance portfolio toward the equally weighted portfolio. Moreover, DeMiguel et al. (2009a) show that norm-constrained minimum-variance portfolios can be interpreted as resulting from Bayesian estimation: they are the portfolios corresponding to the mode of the posterior distribution of minimum-variance portfolio weights given a prior distribution on portfolio weights that is either a Laplace distribution ($\ell_1$ norm) or a Gaussian distribution ($\ell_2$ norm). This Bayesian interpretation shows that norm-constrained minimum-variance portfolios account for estimation risk. One drawback of the $\ell_1$- and $\ell_2$-norm constraints, however, is that they do not account for the fact that some assets may feature higher estimation risk than others, and thus, should see their weights be relatively more constrained. Realizing this, Levy and Levy (2014) introduce the global variance-based constraint of the form

$$\sum_{i=1}^{N} \left(w_i - \frac{1}{N}\right)^2 \frac{\hat{\sigma}_{X_i}}{\frac{1}{N}\sum_{j=1}^{N}\hat{\sigma}_{X_j}} \leq \delta. \tag{1.30}$$

The rationale is “to impose more stringent constraints on stocks with relatively high standard deviations, as the estimation errors for these stocks’ parameters, and hence the potential economic loss, are larger than for stocks with relatively low standard deviations” (p.375). Finally, many other types of portfolio-weight constraints have been considered. For example, cardinality constraints (Bertsimas and Shioda 2009) that control the number of assets one can invest in, and portfolio-turnover constraints (Schreiner 1980, Olivares-Nadal and DeMiguel 2018) that limit the amount of rebalancing needed from one period to the next.
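The constraint functions discussed in this paragraph are simple to evaluate for a given weight vector. The sketch below computes the two norms and the left-hand side of (1.30), and also checks the identity that explains why the norm-one constraint limits short-selling: for weights summing to one, the norm equals one plus twice the total short position. Function names and inputs are illustrative, and no constrained optimization is performed.

```python
import numpy as np


def l1_norm(w):
    """Sum of absolute weights."""
    return float(np.sum(np.abs(w)))


def l2_norm(w):
    """Euclidean norm of the weights."""
    return float(np.sqrt(np.sum(np.square(w))))


def short_position(w):
    """Total short position, reported as a positive number."""
    return float(-np.sum(np.minimum(w, 0.0)))


def gvbc(w, vol_hat):
    """Left-hand side of the global variance-based constraint (1.30)."""
    w = np.asarray(w, dtype=float)
    vol_hat = np.asarray(vol_hat, dtype=float)
    n = len(w)
    return float(np.sum((w - 1.0 / n) ** 2 * vol_hat / vol_hat.mean()))


# Hypothetical weights (summing to one) and estimated volatilities.
w = np.array([0.7, 0.5, -0.2])
vol_hat = np.array([0.10, 0.20, 0.30])
w_ew = np.full(3, 1.0 / 3.0)

print("l1 norm:", l1_norm(w), "= 1 + 2 * short =", 1 + 2 * short_position(w))
print("l2 norm:", l2_norm(w), "vs equally weighted:", l2_norm(w_ew))
print("GVBC:", gvbc(w, vol_hat), "vs equally weighted:", gvbc(w_ew, vol_hat))
```

The equally weighted portfolio attains the smallest Euclidean norm among portfolios whose weights sum to one and sets the left-hand side of (1.30) to zero, which is why tightening either bound pulls the solution toward it.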

4. Shrinkage estimation of portfolio weights

Another popular approach to reducing estimation risk in optimal portfolio weights is to combine two portfolio strategies. If the estimation errors of the two portfolios are not perfectly correlated with one another, this can yield a portfolio with less estimation risk than either of the two strategies on its own. For example, Kan and Zhou (2007) shrink the sample mean-variance portfolio toward the sample minimum-variance portfolio, DeMiguel et al. (2009b) shrink the sample minimum-variance portfolio toward the equally weighted portfolio, and Tu and Zhou (2011) shrink the sample mean-variance portfolio (among other strategies) toward the equally weighted portfolio. Let us describe Kan and Zhou (2007)’s approach in more detail. They consider portfolio weights of the form

\[ w_{KZ} := \frac{1}{\lambda} \left( c\, \hat{\Sigma}^{-1} \hat{\mu} + d\, \hat{\Sigma}^{-1} \mathbf{1} \right), \tag{1.31} \]

with $1 - \mathbf{1}'w_{KZ}$ invested in the risk-free asset. Then, they derive the coefficients $c$ and $d$ that minimize the expected quadratic-utility loss in (1.19) assuming asset returns are Gaussian. In particular, as they show, it is always optimal to invest in the sample minimum-variance portfolio; that is, $d \neq 0$ (unless $\mathbf{1}'\hat{\Sigma}^{-1}\hat{\mu} = 0$). In the case of this thesis where the risk-free asset is not part of the investment universe, one can follow DeMiguel et al. (2009b) and Frahm and Memmel (2010) and rely on the normalized weights $w_{KZ} / (\mathbf{1}'w_{KZ})$. Finally, one may want to consider other criteria than the expected quadratic-utility loss to find the optimal combination of portfolios. To that aim, DeMiguel et al. (2013) consider the shrinkage portfolios studied by Kan and Zhou (2007), DeMiguel et al. (2009b) and Tu and Zhou (2011), and derive analytical expressions for the shrinkage coefficients considering several calibration criteria such as minimum variance or maximum Sharpe ratio. They also consider non-parametric calibration via smoothed bootstrap.

5. Bayesian estimation of mean-return vector

A final approach that has been extensively studied to improve the estimation of mean-variance portfolios consists in improving the estimation of the asset-return mean vector µ. Green et al. (2013) list more than 300 papers dealing with this problem. This is a challenging problem because

(i) mean-variance portfolios are highly sensitive to µ;

(ii) the variance of the sample mean can quickly explode even for moderate deviations of the data from Gaussianity (Huber 1964);

(iii) asset returns are not stationary over time and, in particular, display very low autocorrelation (Campbell et al. 1997).

Let us describe here the Bayesian approach to this problem, reviewed by Avramov and Zhou (2010) and Vanderveken (2019). We focus more particularly on the influential work of Jorion (1986), who derives a James-Stein estimator of µ via a Bayesian procedure. Stein (1955) and James and Stein (1961) consider the risk function

\[ \mathrm{Risk}(\mu) := \int_{\Omega_X} Q(\mu, \hat{\mu}(x))\, f_{X|\mu}(x)\, dx, \tag{1.32} \]

where $Q(\mu, \hat{\mu}(x)) := (\mu - \hat{\mu}(x))' \Sigma^{-1} (\mu - \hat{\mu}(x))$ is the quadratic loss function, and show that the sample mean $\hat{\mu}$ is inadmissible in the sense that there exists another estimator of $\mu$ that achieves a lower $\mathrm{Risk}(\mu)$ for all values of $\mu$. This estimator is a shrinkage estimator of the form

\[ \hat{\mu}_{JS} := (1 - \delta)\, \hat{\mu} + \delta \mu_0 \mathbf{1} \tag{1.33} \]

for a given target mean return µ0 and a shrinkage intensity δJS that has an easy analytical expression. Berger (1978) then showed that considering the square of the quadratic loss function leads to an estimator that is very robust to the exact functional form of the loss. This estimator is given by (1.33) with a shrinkage intensity of the form

\[ \delta = \frac{b}{d + T (\hat{\mu} - \mu_0 \mathbf{1})' \Sigma^{-1} (\hat{\mu} - \mu_0 \mathbf{1})} \tag{1.34} \]

with $b \in [0, 2(N - 2)]$ and weak conditions on $d$. Jorion (1986)’s insight was then to show that considering a conjugate prior on $\mu$ leads to an estimator of the type (1.33)–(1.34) with $b = d = N + 2$ and $\mu_0 = \mu_{MV}$. As he showed, this leads to mean-variance portfolios with larger quadratic utility than the sample mean-variance portfolio ($\delta = 0$) and the sample minimum-variance portfolio ($\delta = 1$). Whereas Jorion (1986) focused on an application of the Bayesian framework to mean-variance portfolios, we refer to Bodnar et al. (2017) for an application to the minimum-variance portfolio.
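The estimator (1.33)–(1.34) with $b = d = N + 2$ can be sketched numerically as follows. This is only an illustration under stated assumptions: the unknown $\Sigma$ is replaced by the plain sample covariance matrix (without any finite-sample scaling), $\mu_{MV}$ is read as the mean return of the sample minimum-variance portfolio, and the function name `jorion_shrinkage_mean` as well as the simulated data are hypothetical.

```python
import numpy as np

def jorion_shrinkage_mean(returns):
    """Shrink the sample mean toward the minimum-variance portfolio mean.

    returns: (T, N) array of asset returns.
    Assumption: the sample covariance is plugged in for Sigma in (1.34).
    """
    returns = np.asarray(returns, dtype=float)
    T, N = returns.shape
    mu_hat = returns.mean(axis=0)
    cov_hat = np.cov(returns, rowvar=False)
    ones = np.ones(N)

    # Target: mean return of the sample minimum-variance portfolio.
    cov_inv_ones = np.linalg.solve(cov_hat, ones)
    mu_0 = (cov_inv_ones @ mu_hat) / (cov_inv_ones @ ones)

    # Shrinkage intensity (1.34) with b = d = N + 2.
    diff = mu_hat - mu_0 * ones
    distance = diff @ np.linalg.solve(cov_hat, diff)
    delta = (N + 2) / ((N + 2) + T * distance)

    # Shrinkage estimator (1.33).
    mu_js = (1 - delta) * mu_hat + delta * mu_0 * ones
    return mu_js, delta, mu_0

# Simulated returns (made-up parameters, for illustration only).
rng = np.random.default_rng(0)
sample = rng.normal(loc=[0.01, 0.02, 0.03, 0.015], scale=0.05, size=(120, 4))
mu_js, delta, mu_0 = jorion_shrinkage_mean(sample)
print("delta:", delta)
print("sample mean:", sample.mean(axis=0))
print("shrunk mean:", mu_js)
```

The intensity $\delta$ lies strictly between 0 and 1 and increases when the sample means are close to the target relative to their sampling noise, so that the dispersion of the estimated means is reduced by the factor $1 - \delta$.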

The above approaches provide robust minimum-variance and mean-variance portfolios. Compared to their sample non-robust counterparts, they perform well in terms of out-of-sample mean-variance trade-off and remain more stable over time. However, these estimation strategies remain in the mean-variance framework, which is quite restrictive when asset returns are not Gaussian and investors care about higher moments. Going beyond the first two moments is the subject of the next section.

1.2 Higher-moment approaches

To proceed, let us introduce some additional notation and definitions related to higher moments. Let $m_{k,P} := \mathbb{E}[(P - \mathbb{E}[P])^k]$ be the portfolio-return $k$th central moment, and $M_3(X)$ and $M_4(X)$ be the asset-return coskewness and cokurtosis matrices defined as

\[ M_3(X) := \mathbb{E}\left[ (X - \mu)(X - \mu)' \otimes (X - \mu)' \right], \tag{1.35} \]
\[ M_4(X) := \mathbb{E}\left[ (X - \mu)(X - \mu)' \otimes (X - \mu)' \otimes (X - \mu)' \right], \tag{1.36} \]

where the Kronecker product $\otimes$ between two matrices $A$ and $B$ of size $m \times n$ and $p \times q$ is the $mp \times nq$ matrix

\[ A \otimes B := \begin{pmatrix} A_{11}B & \dots & A_{1n}B \\ \vdots & \ddots & \vdots \\ A_{m1}B & \dots & A_{mn}B \end{pmatrix}. \tag{1.37} \]

Then, we have that the third and fourth portfolio-return central moments can be written concisely as (Briec et al. 2007, Boudt et al. 2008)

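\[ m_{3,P} = w' M_3(X) (w \otimes w), \tag{1.38} \]
\[ m_{4,P} = w' M_4(X) (w \otimes w \otimes w). \tag{1.39} \]

Further, we define the portfolio-return skewness and excess kurtosis as

\[ (\zeta_P, \kappa_P) := \left( \frac{m_{3,P}}{\sigma_P^3},\ \frac{m_{4,P}}{\sigma_P^4} - 3 \right). \tag{1.40} \]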
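As a sanity check of (1.35)–(1.40), the sketch below builds sample analogues of the coskewness and cokurtosis matrices with Kronecker products and verifies that the resulting portfolio moments coincide with those computed directly from the portfolio-return series. The simulated Student-t data, the weights and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 3
X = rng.standard_t(df=5, size=(T, N))   # simulated fat-tailed data
w = np.array([0.5, 0.3, 0.2])           # illustrative portfolio weights

Xc = X - X.mean(axis=0)                 # centered observations

# Sample analogues of (1.35)-(1.36): averages of
# x x' (kron) x' and x x' (kron) x' (kron) x'.
M3 = np.zeros((N, N**2))
M4 = np.zeros((N, N**3))
for x in Xc:
    xx = np.outer(x, x)                 # N x N
    x_row = x[None, :]                  # 1 x N
    M3 += np.kron(xx, x_row)            # N x N^2
    M4 += np.kron(np.kron(xx, x_row), x_row)  # N x N^3
M3 /= T
M4 /= T

# Portfolio central moments via (1.38)-(1.39).
m3 = w @ M3 @ np.kron(w, w)
m4 = w @ M4 @ np.kron(np.kron(w, w), w)

# Same moments computed directly from the centered portfolio returns.
P = Xc @ w
m3_direct = np.mean(P**3)
m4_direct = np.mean(P**4)

# Skewness and excess kurtosis as in (1.40).
sigma_P = np.sqrt(np.mean(P**2))
skewness = m3 / sigma_P**3
excess_kurtosis = m4 / sigma_P**4 - 3

print(m3, m3_direct)
print(m4, m4_direct)
print(skewness, excess_kurtosis)
```

The two ways of computing the moments agree up to floating-point error. The matrices have $N^2$ and $N^3$ columns, which already hints at why higher-moment estimation becomes demanding as $N$ grows.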

The mean-variance-efficient portfolio in (1.9) is optimal assuming that asset returns are Gaussian or that investors’ preferences can be fully described by a quadratic utility function irrespective of the distribution of asset returns. Neither of these two assumptions is reasonable in practice.

1. Asset returns are not Gaussian, and thus, the choice of portfolio weights does not only impact the portfolio-return mean and variance, but also its higher moments. Evidence for the non-Gaussian behavior of asset returns is abundant; see, for example, Fama (1963), Mandelbrot (1963), Simkowitz and Beedle (1980), Das and Uppal (2004) and Massacci (2017). Gormsen and Jensen (2019) recently provide an in-depth study on the properties of higher-moment risk in stock returns. Asset returns, and particularly equity returns, feature so-called stylized facts (Cont 2001) such as negative skewness, positive excess kurtosis and jumps that are not depicted by the Gaussian distribution.

2. Investors do not make decisions according to a quadratic utility function. This was realized early on, for example by Jean (1971), Ingersoll (1975) and Scott and Horvath (1980). In particular, the latter show that risk-averse rational investors display positive preferences for odd moments (such as mean and skewness) and negative preferences for even moments (such as variance and kurtosis).

Thus, the assumption that asset returns are Gaussian or that utility functions are quadratic, while admittedly helpful for the sake of mathematical tractability, is not realistic in practice. As Hanoch and Levy (1970, p.181) put it: “In the real world, investors’ utility functions and investment probability distributions of returns may assume highly complex or irregular forms. However, most theoretical discussions of choice under risk have dealt with relatively simple forms, for example, quadratic utility functions and normal probability distributions, in order to make more manageable the description and testing of investment decision rules.” As a result, the mean-variance-efficient portfolio advocated by Markowitz (1952) and many others can be largely suboptimal, particularly in times of crisis when non-Gaussianity is very pronounced (Massacci 2017). Harvey and Siddique (2000) and Ang et al. (2006), for example, show that investors are ready to give up some mean return, or accept more volatility, in exchange for larger skewness or lower kurtosis, leading to less downside risk.

For this reason, researchers have put forward portfolio strategies that account for the higher moments of portfolio returns, instead of merely the mean and variance; see Briec et al. (2013) and Section 3.3.5 of Kolm et al. (2014) for reviews. While the theoretical precept for mean-variance strategies is clear (choose a portfolio on the mean-variance efficient frontier by maximizing your expected quadratic utility), there is no such consensus in the literature when it comes to higher moments. Higher-moment approaches can be grouped into four classes:

1. Approaches that aim to find a portfolio on the higher-moment efficient surface via an expected-utility approach.

2. Approaches that aim to find a portfolio on the higher-moment efficient surface without specifying a utility function.

3. Approaches that optimize higher moments via alternative performance criteria that may result in portfolios outside of the efficient surface.

These first three classes are direct in the sense that they explicitly optimize higher moments and thus need to estimate them.

4. Indirect approaches that improve higher moments without the need to estimate them.

Let us now describe these four classes in turn.

1.2.1 Efficient portfolios

The first two classes concern portfolios located on the higher-moment efficient surface. That is, for a selected number of moments, no other portfolio can perform better on all the selected moments.

The first class considers extensions of the quadratic utility function that include the effect of higher moments. Typically, this is achieved by relying on cubic or quartic utility functions; see Levy (1969), Hanoch and Levy (1970) and, more recently, Jondeau and Rockinger (2006), Guidolin and Timmermann (2008), Harvey et al. (2010) and Martellini and Ziemann (2010). This approach is justified because, for an infinitely differentiable utility function $U$, a Taylor series expansion around the mean $\mu_P$ gives the expected portfolio-return utility

\[ \mathbb{E}[U(P)] = \sum_{k=0}^{\infty} \frac{U^{(k)}(\mu_P)}{k!}\, m_{k,P}. \tag{1.41} \]

Hence, one obtains the cubic and quartic utility functions by discarding the terms of order $k > 3$ and $k > 4$, respectively. For example, Martellini and Ziemann (2010) assume that the investor has a constant relative risk aversion (CRRA) utility function, also known as power utility function, given by

\[ U(x) = \frac{x^{1-\lambda} - 1}{1 - \lambda}. \tag{1.42} \]

In this case, the four-moment utility function is given by

\[ \mathbb{E}[U(P)] = U(\mu_P) - \frac{\lambda}{2}\, m_{2,P} + \frac{\lambda(\lambda + 1)}{6}\, m_{3,P} - \frac{\lambda(\lambda + 1)(\lambda + 2)}{24}\, m_{4,P}. \tag{1.43} \]

The portfolio maximizing (1.43) is located on the mean-variance-skewness-kurtosis efficient surface in the sense that no other portfolio can dominate it on all four moments. However, several studies show that this Taylor-series-expansion approach is not adequate because, for commonly used utility functions such as CRRA and CARA, it results in portfolios that are barely impacted by the third and fourth moments; see Levy and Markowitz (1979), Cremers et al. (2005) and Markowitz (2014). The latter concludes that “a careful choice from a mean–variance efficient frontier will approximately maximize expected utility for a wide variety of concave (risk-averse) utility functions.” Thus, commonly used utility functions are said to be locally quadratic. Other utility functions that arguably better account for higher-moment preferences have been proposed, such as disappointment-aversion utility (Ang et al. 2005, Dahlquist et al. 2017) or S-shaped utility (Cremers et al. 2005). Still, it seems that the utility function may not be the simplest way for investors to design optimal portfolios in the presence of higher moments.

Given the mentioned difficulties associated with expected utility, the second class still aims to find a portfolio on the higher-moment efficient surface, but without the need to specify a utility function. Let us mention three examples. Lai et al. (1991) solve the mean-variance-skewness portfolio via a polynomial-goal-programming problem, whose merit is that it “requires only the investor’s preferences for mean and skewness of portfolio return, whereas the latter [the utility approach] has to specify an investor’s exact utility function, which is generally unknown or too complicated to be used reliably” (p.297). Athayde and Flores (2004) minimize the portfolio-return variance for a given mean and skewness. As they put it, “the utility function approach may not seem reasonable for fund managers, especially those who need to report to their clients the criteria used in selecting the portfolios” (p.1336). Briec et al. (2007) propose a primal approach whereby one searches for the largest improvements in mean, variance and skewness relative to a benchmark portfolio. This approach is guaranteed to give a global solution and does not require a utility function either. Specifically, given a benchmark portfolio wb, the mean-variance-skewness portfolio is computed by solving the problem

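\[ \begin{aligned} \max_{\delta \in [0,1],\, w \in \mathcal{W}} \quad & \delta \\ \text{subject to} \quad & w'\mu \ge w_b'\mu, \\ & w'\Sigma w \le w_b'\Sigma w_b\, (1 - \delta), \\ & w' M_3(X)(w \otimes w) \ge w_b' M_3(X)(w_b \otimes w_b)\, (1 + s_b \delta), \end{aligned} \tag{1.44} \]

where $s_b = \mathrm{sign}\left( w_b' M_3(X)(w_b \otimes w_b) \right)$.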

1.2.2 Downside-risk criteria

The third class departs from portfolios on the higher-moment surface and, instead, relies on more intuitive performance criteria; see Chen et al. (2011) for a review. These alternative performance criteria often aim to minimize downside risk. Contrary to the two previous approaches, portfolios that minimize downside risk are in general no longer located on the higher-moment efficient surface. Downside risk is typically measured via the Value-at-Risk (VaR) and conditional VaR (CVaR). The VaR at confidence level ε is defined as

\[ \mathrm{VaR}_P(\varepsilon) := -F_P^{-1}(\varepsilon), \tag{1.45} \]

where $F_P^{-1}$ is the inverse cdf (quantile function) of $P$. The VaR has appealing theoretical properties and is easily interpretable for investors. However, it has been criticized because it is not always subadditive (Artzner 1999). That is, there exist random variables $X$ and $Y$ for which

\[ \mathrm{VaR}_X(\varepsilon) + \mathrm{VaR}_Y(\varepsilon) < \mathrm{VaR}_{X+Y}(\varepsilon), \tag{1.46} \]

which means that the VaR criterion can favor concentration. An alternative downside-risk criterion to the VaR is the CVaR (a.k.a. expected shortfall, tail VaR), defined as

\[ \mathrm{CVaR}_P(\varepsilon) := -\mathbb{E}\left[ P \mid P \le -\mathrm{VaR}_P(\varepsilon) \right]. \tag{1.47} \]
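Definitions (1.45) and (1.47) translate directly into empirical estimators based on a sample of portfolio returns. The sketch below uses the empirical quantile in place of $F_P^{-1}$; the function name `historical_var_cvar` and the simulated Gaussian returns are illustrative assumptions, and this simple historical estimator is only one of several possible choices.

```python
import numpy as np

def historical_var_cvar(returns, eps=0.01):
    """Empirical VaR and CVaR at level eps, following (1.45) and (1.47)."""
    returns = np.asarray(returns, dtype=float)
    q = np.quantile(returns, eps)        # empirical quantile F_P^{-1}(eps)
    var = -q
    # Average return in the left tail, i.e. conditional on P <= -VaR.
    cvar = -returns[returns <= q].mean()
    return var, cvar

# Simulated daily returns (Gaussian, made-up volatility of 1% per day).
rng = np.random.default_rng(0)
simulated = rng.normal(loc=0.0, scale=0.01, size=100_000)
var_1pct, cvar_1pct = historical_var_cvar(simulated, eps=0.01)
print(var_1pct, cvar_1pct)
```

By construction the CVaR estimate is at least as large as the VaR estimate, since it averages the returns that lie at or below the quantile.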

CVaR measures the expected loss in the left tail. As shown by Acerbi and Tasche (2003), the CVaR is subadditive. However, Lim et al. (2011) show that it is very fragile when used in portfolio optimization due to estimation errors. Portfolio strategies that rely on VaR and CVaR as risk criteria, such as minimum and mean-VaR/CVaR portfolios, have been studied by several authors; see Basak and Shapiro (2001), Campbell et al. (2001), Favre and Galeano (2002), Alexander and Baptista (2002, 2004) and El Ghaoui et al. (2003). See also Lwin et al. (2017) for a review of the use of VaR in portfolio optimization. A popular approach is to rely on second-order Cornish-Fisher expansions of the VaR and CVaR, known as modified VaR and CVaR, which have the advantage of being functions of the first four portfolio-return moments only; see Boudt et al. (2008). For example, the modified VaR (MVaR) is defined as

\[ \mathrm{MVaR}_P(\varepsilon) := -\mu_P - \sigma_P \left( z_\varepsilon + \frac{1}{6}(z_\varepsilon^2 - 1)\,\zeta_P + \frac{1}{24}(z_\varepsilon^3 - 3 z_\varepsilon)\,\kappa_P - \frac{1}{36}(2 z_\varepsilon^3 - 5 z_\varepsilon)\,\zeta_P^2 \right), \tag{1.48} \]

Figure 1.1: Sensitivity of modified VaR to volatility, skewness and kurtosis

[Figure: modified VaR (vertical axis, ranging from 0.03 to 0.1) as a function of volatility $\sigma_P \in [0.1/\sqrt{252}, 0.3/\sqrt{252}]$, skewness $\zeta_P \in [-0.5, 0.5]$ and excess kurtosis $\kappa_P \in [5, 20]$ (horizontal axis), with one curve per moment.]

Notes. The figure depicts the sensitivity of the modified VaR in (1.48) to volatility $\sigma_P$ (green), skewness $\zeta_P$ (blue) and excess kurtosis $\kappa_P$ (red). We fix $\varepsilon = 1\%$ and $\mu_P = 0$, and vary the moments in the intervals $\sigma_P \in [0.1/\sqrt{252}, 0.3/\sqrt{252}]$, $\zeta_P \in [-0.5, 0.5]$ and $\kappa_P \in [5, 20]$, which are realistic values for daily portfolio returns. We vary each moment individually, keeping the other moments constant and equal to the middle of the interval.

where $z_\varepsilon := \Phi^{-1}(\varepsilon)$ is the standard Gaussian quantile. Gregoriou and Gueyie (2003) propose the modified Sharpe ratio, defined as the ratio $\mu_P / \mathrm{MVaR}_P(\varepsilon)$, as an extension of the Sharpe ratio. As the MVaR will often be used as higher-moment criterion in this thesis, it is interesting to get some insights into its connection with the moments. For this purpose, we depict in Figure 1.1, for $\varepsilon = 1\%$ and $\mu_P = 0$, how the MVaR changes by varying the moments $\sigma_P, \zeta_P, \kappa_P$ in the intervals $\sigma_P \in [0.1/\sqrt{252}, 0.3/\sqrt{252}]$, $\zeta_P \in [-0.5, 0.5]$ and $\kappa_P \in [5, 20]$. These are realistic values for daily portfolio returns. We vary each moment individually, keeping the other moments constant and equal to the middle of the interval. The figure shows that the impact of the moments is the desired one (an increase in volatility and kurtosis increases MVaR, an increase in skewness decreases MVaR). Moreover, whereas MVaR is very sensitive to volatility and kurtosis, the sensitivity to skewness is minor.
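The sensitivity analysis can be reproduced with a few lines of code. The sketch below implements (1.48) using the Gaussian quantile from Python's standard library and evaluates the MVaR at the end points of the three intervals, holding the other moments at the middle of their interval. The function name `modified_var` is a hypothetical helper introduced for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def modified_var(mu, sigma, skew, ex_kurt, eps=0.01):
    """Modified (Cornish-Fisher) VaR as in (1.48)."""
    z = NormalDist().inv_cdf(eps)        # standard Gaussian quantile
    cornish_fisher = (
        z
        + (z**2 - 1) * skew / 6
        + (z**3 - 3 * z) * ex_kurt / 24
        - (2 * z**3 - 5 * z) * skew**2 / 36
    )
    return -mu - sigma * cornish_fisher

# Intervals considered in the text, with mu_P = 0 and eps = 1%.
sigma_lo, sigma_hi = 0.1 / sqrt(252), 0.3 / sqrt(252)
skew_lo, skew_hi = -0.5, 0.5
kurt_lo, kurt_hi = 5.0, 20.0
sigma_mid = (sigma_lo + sigma_hi) / 2
skew_mid = (skew_lo + skew_hi) / 2
kurt_mid = (kurt_lo + kurt_hi) / 2

# Vary one moment at a time, keeping the others at the mid-point.
print("volatility:", modified_var(0.0, sigma_lo, skew_mid, kurt_mid),
      modified_var(0.0, sigma_hi, skew_mid, kurt_mid))
print("skewness:  ", modified_var(0.0, sigma_mid, skew_lo, kurt_mid),
      modified_var(0.0, sigma_mid, skew_hi, kurt_mid))
print("kurtosis:  ", modified_var(0.0, sigma_mid, skew_mid, kurt_lo),
      modified_var(0.0, sigma_mid, skew_mid, kurt_hi))
```

With zero skewness and zero excess kurtosis, the function reduces to the Gaussian VaR $-\mu_P - \sigma_P z_\varepsilon$.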

The first two classes in Section 1.2.1, as well as the modified VaR and CVaR here, require an estimate of the asset-return coskewness and cokurtosis matrices in (1.35)–(1.36). Thus, they are highly sensitive to estimation risk. Indeed, the numbers of distinct parameters in $M_3(X)$ and $M_4(X)$ are $\binom{N+2}{3} = O(N^3)$ and $\binom{N+3}{4} = O(N^4)$, respectively (Boudt et al. 2015). This is the reason why Brandt et al. (2009, p.3418) state that “[...] extending the traditional approach beyond first and second moments, when the investor’s utility function is not quadratic, is practically impossible because it requires modeling not only the conditional skewness and kurtosis of each stock but also the numerous high-order cross-moments.” However, Martellini and Ziemann (2010) and Boudt et al. (2018) introduce shrinkage estimators of coskewness and cokurtosis matrices, of the same form as the shrinkage estimators of the covariance matrix in (1.24), and show that they improve the out-of-sample performance compared to the sample estimators.
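To give an order of magnitude for these parameter counts, the short sketch below evaluates the binomial coefficients for a few values of $N$, alongside the $N(N+1)/2$ distinct entries of the covariance matrix for comparison. The function names are hypothetical.

```python
from math import comb

def n_distinct_covariance(n):
    """Distinct entries of the N x N covariance matrix."""
    return comb(n + 1, 2)

def n_distinct_coskewness(n):
    """Distinct entries of the coskewness matrix M3(X)."""
    return comb(n + 2, 3)

def n_distinct_cokurtosis(n):
    """Distinct entries of the cokurtosis matrix M4(X)."""
    return comb(n + 3, 4)

for n in (10, 50, 100):
    print(n, n_distinct_covariance(n),
          n_distinct_coskewness(n), n_distinct_cokurtosis(n))
```

For $N = 100$ assets, this gives 5050 covariance parameters, 171700 coskewness parameters and 4421275 cokurtosis parameters.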

1.2.3 Indirect approaches

To bypass the estimation-risk issue, a fourth class consists in carefully crafting strategies that do not require estimating higher moments (or other non-mean-variance criteria such as downside risk), but still indirectly improve them. This is clearly a challenging task and, to our knowledge, the only example in this class is Kim et al. (2014) who “develop an approach that can control higher moments of portfolio returns within the mean–variance framework without directly imposing third and fourth moment terms in the formulation” (p.155). They achieve this via a robust-optimization formulation, where they assume an uncertainty set made of $I$ sample estimators of $\mu$ and $\Sigma$: $\mathcal{U} := \{ (\hat{\mu}_1, \hat{\Sigma}_1), \dots, (\hat{\mu}_I, \hat{\Sigma}_I) \}$. They compute the portfolio maximizing the expected quadratic utility in the worst case in the uncertainty set. That is,

\[ \max_{w \in \mathcal{W}} \ \min_{(\hat{\mu}, \hat{\Sigma}) \in \mathcal{U}} \ w'\hat{\mu} - \frac{\lambda}{2}\, w'\hat{\Sigma} w. \tag{1.49} \]

They show theoretically that, as the number of estimators $I$ and sample size $T$ tend to infinity, this formulation favors solutions with more skewness and less kurtosis than the true mean-variance-efficient portfolio. The underlying intuition is the following. Let $f_i(w) := w'\hat{\mu}_i - \frac{\lambda}{2} w'\hat{\Sigma}_i w = \hat{\mu}_{P,i} - \frac{\lambda}{2} \hat{\sigma}^2_{P,i}$. The covariance $\mathrm{Cov}(\hat{\mu}_{P,i}, \hat{\sigma}^2_{P,i})$ is equal to the skewness $m_{3,P}/T$. Thus, the more skewness increases, the more $\hat{\mu}_{P,i}$ and $-\hat{\sigma}^2_{P,i}$ move in opposite directions, which creates a diversification effect that makes $f_i(w)$ less dispersed. This ultimately improves the worst-case solution in (1.49). Similarly, the worst-case solution gets better when the variance of $f_i(w)$ decreases, and we have that the variance of $\hat{\sigma}^2_{P,i}$ is $m_{4,P} - \sigma_P^4$. Thus, problem (1.49) favors portfolios with low kurtosis as well.

1.3 Risk-parity approaches

In Sections 1.1 and 1.2, we discussed the mean-variance approach to portfolio selection, and how it has been extended to account for higher-moment preferences of investors. We also discussed how estimation risk associated with asset mean returns severely impacts the out-of-sample performance of such portfolios. The disappointing out-of-sample performance of mean-variance portfolios led researchers and practitioners to focus on so-called risk-based portfolios that do not require an estimate of asset mean returns and aim to be better diversified. The most prominent risk-based portfolios are the minimum-variance portfolio (Section 1.1.2), maximum-diversification portfolios (Choueifaty and Coignard 2008), asset-risk-parity portfolios (Maillard et al. 2010), and factor-risk-parity portfolios (Meucci 2009). Reviews of risk-based portfolios can be found in Clarke et al. (2013) and Koumou and Dionne (2019).

It must be noted however that although these risk-based portfolios ignore mean returns in their computation, they are still associated with an implied mean return. In particular, as explained by Ardia and Boudt (2015a), this implied mean return is often better in terms of stability and forecast precision, which can explain why those strategies outperform mean-variance portfolios out of sample.

In this section, we focus on risk-parity portfolios; see the textbook of Roncalli (2013) for a detailed treatment. Section 1.3.1 deals with asset-risk parity, and Section 1.3.2 with factor-risk parity. Although the risk-parity framework is heuristic and does not have clear theoretical foundations (see Section 1.3.3), it is very popular among practitioners and is used by some of the largest hedge funds in the world, such as AQR and Bridgewater.

1.3.1 Asset-risk parity

The original approach to risk parity introduced by Qian (2005, 2006) is to find the portfolio for which each asset contributes equally to the portfolio-return volatility, and thus, is most diversified. Under this definition of a well-diversified portfolio, mean-variance portfolios are often not well diversified. For example, given two asset returns with volatilities $(\sigma_{X_1}, \sigma_{X_2}) = (0.15, 0.30)$ and correlation $\rho = 0.20$, the minimum-variance portfolio is such that the first asset contributes 86% of the portfolio-return volatility.

The first rigorous academic study on the asset-variance-parity portfolio is the paper of Maillard et al. (2010). Because volatility is a positive-homogeneous risk measure, that is $\sigma_{aP} = |a|\sigma_P$, one can compute the marginal contribution of each asset to the portfolio-return volatility as

\[ \sigma_{X_i|P} := w_i \frac{\partial \sigma_P}{\partial w_i} = \frac{w_i (\Sigma w)_i}{\sigma_P}. \tag{1.50} \]

We have from Euler’s homogeneous-function theorem that

\[ \sum_{i=1}^{N} \sigma_{X_i|P} = \sigma_P. \tag{1.51} \]
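The decomposition (1.50)–(1.51) can be checked on the two-asset example given earlier in this section, with volatilities (0.15, 0.30) and correlation 0.20. The sketch below computes the minimum-variance weights, the volatility contributions and their shares; the helper name `risk_contributions` is an assumption introduced for illustration.

```python
import numpy as np

def risk_contributions(w, cov):
    """Contributions w_i (cov w)_i / sigma_P of each asset, as in (1.50)."""
    vol = np.sqrt(w @ cov @ w)
    return w * (cov @ w) / vol

# Two-asset example: volatilities 15% and 30%, correlation 0.20.
vols = np.array([0.15, 0.30])
rho = 0.20
cov = np.array([[vols[0]**2,              rho * vols[0] * vols[1]],
                [rho * vols[0] * vols[1], vols[1]**2]])

# Minimum-variance weights, proportional to cov^{-1} 1.
w_mv = np.linalg.solve(cov, np.ones(2))
w_mv /= w_mv.sum()

contrib = risk_contributions(w_mv, cov)
vol_mv = np.sqrt(w_mv @ cov @ w_mv)
shares = contrib / vol_mv

print("weights:", w_mv)
print("contributions sum to volatility:", contrib.sum(), vol_mv)
print("shares of volatility:", shares)
```

The contributions add up to the portfolio volatility, as stated by Euler's theorem, and the first asset accounts for about 86% of it. For the minimum-variance portfolio the shares coincide with the weights, because $(\Sigma w)_i$ is the same for all assets at the optimum.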

To compute the portfolio for which all marginal contributions are equal, Maillard et al. (2010) propose to solve

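\[ \min_{w \in \mathcal{W}} \ \sum_{i=1}^{N} \sum_{j=1}^{N} \left( w_i (\Sigma w)_i - w_j (\Sigma w)_j \right)^2 \quad \text{subject to } w_i \ge 0 \ \forall i. \tag{1.52} \]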

As they show, the asset-variance-parity portfolio is unique under the long-only constraint, and its volatility lies between those of the minimum-variance and equally weighted portfolios. Using a single-factor model, Clarke et al. (2013) show that the asset-variance-parity portfolio invests in all assets, in contrast to the minimum-variance portfolio that is very concentrated in the low-risk assets, and thus, is associated with a large proportion of zero weights. Many empirical tests have been carried out, typically across several asset classes (such as equities and bonds), and show that this strategy tends to perform quite well out of sample in terms of Sharpe ratio and turnover; see, for example, Anderson et al. (2012) and Asness et al. (2012).

When short-selling is allowed, Bai et al. (2016) show that the asset-variance-parity portfolio is no longer unique and that there exist $2^{N-1}$ different solutions, which are obtained by solving the optimization problem

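\[ \min_{w \in \mathbb{R}^N} \ \sigma_P^2 - \sum_{i=1}^{N} \ln(\beta_i w_i) \quad \text{subject to } \beta_i w_i > 0 \ \forall i \tag{1.53} \]

for all sign vectors $\beta = (\pm 1, \dots, \pm 1) \in \mathbb{R}^N$, and then normalizing the solution (noting that the sign vectors $\beta$ and $-\beta$ lead to the same solution).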
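Problem (1.53) is convex once the signs are fixed, so it can be solved with a basic damped Newton method. The sketch below is one possible implementation under stated assumptions and not the algorithm of the cited papers: the substitution $x_i = \beta_i w_i > 0$, the starting point, the backtracking rule and the function name `risk_parity_weights` are all choices made for this illustration. At the optimum, the first-order condition $2 x_i (\tilde{\Sigma} x)_i = 1$ holds for all $i$, which is exactly the equal-contribution property.

```python
import numpy as np

def risk_parity_weights(cov, signs=None, tol=1e-10, max_iter=200):
    """Solve (1.53) for a given sign vector by damped Newton iterations."""
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    beta = np.ones(n) if signs is None else np.asarray(signs, dtype=float)
    cov_b = cov * np.outer(beta, beta)     # covariance of sign-flipped assets
    x = 1.0 / np.sqrt(np.diag(cov_b))      # strictly positive starting point

    def objective(v):
        return v @ cov_b @ v - np.sum(np.log(v))

    for _ in range(max_iter):
        grad = 2.0 * cov_b @ x - 1.0 / x
        if np.linalg.norm(grad) < tol:
            break
        hess = 2.0 * cov_b + np.diag(1.0 / x**2)
        step = np.linalg.solve(hess, -grad)
        # Backtracking: stay in the positive orthant and decrease the objective.
        t = 1.0
        while t > 1e-16 and (
            np.any(x + t * step <= 0)
            or objective(x + t * step) > objective(x) + 1e-4 * t * (grad @ step)
        ):
            t *= 0.5
        x = x + t * step

    y = beta * x                           # undo the sign change
    return y / y.sum()                     # normalize so that weights sum to one

# Illustrative three-asset covariance matrix (made-up numbers).
cov3 = np.array([[0.0225, 0.0090, 0.0030],
                 [0.0090, 0.0900, 0.0120],
                 [0.0030, 0.0120, 0.0400]])

w_long = risk_parity_weights(cov3)
print("long-only solution:", w_long)
print("variance contributions:", w_long * (cov3 @ w_long))

w_short = risk_parity_weights(cov3, signs=[1, 1, -1])
print("solution with a short position:", w_short)
print("variance contributions:", w_short * (cov3 @ w_short))
```

Calling the function with all signs equal to one returns the long-only solution, while other sign vectors give the additional solutions that arise when short-selling is allowed.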

Finally, several authors have studied asset-risk-parity portfolios using higher-moment risk measures such as the VaR or CVaR; see Boudt et al. (2013), Baitinger et al. (2017) and Haugh et al. (2017).

One important drawback of the asset-risk-parity framework, however, is that the asset returns are correlated with one another, and thus, the portfolio whose risk is equally spread among the assets may not be well diversified after all. For example, consider an economy driven by two uncorrelated risk factors and an investment set of five assets, of which four are mostly exposed to the first risk factor while the fifth asset is mostly exposed to the second risk factor. Then, the asset-risk-parity portfolio is very concentrated on the first risk factor, and thus, is poorly diversified. As Boudt and Peeters (2013, p.60) put it, “when those assets are very dependent on underlying common risk factors, the portfolio risk may effectively be very concentrated in a few underlying risk factors.” This calls for a refinement of the risk-parity approach, where the portfolio-return risk is equally spread among uncorrelated factors rather than correlated assets.

1.3.2 Factor-risk parity

The factor-risk-parity portfolio is the portfolio for which the uncorrelated factors underlying the asset returns contribute equally to the portfolio-return risk. This can be achieved either via observable economic factors (Boudt and Peeters 2013, Roncalli and Weisang 2015) or non-observable statistical factors. We focus on the latter approach, introduced by Meucci (2009).

Meucci (2009) considers the variance as risk measure and takes as uncorrelated factors the principal components (PCs) of asset returns, defined as

\[ Y^{\star} := \Lambda_N^{-1/2} V_N' X, \tag{1.54} \]

where $V_N$ is the matrix whose columns are the $N$ eigenvectors of $\Sigma$ and $\Lambda_N$ is the diagonal matrix made of the corresponding (strictly positive) eigenvalues. One can observe that the portfolio return $P = w'X$ can be rewritten as a linear combination of the PCs as follows:

\[ P = \tilde{w}' Y^{\star} \quad \text{and} \quad \tilde{w} := \Lambda_N^{1/2} V_N' w, \tag{1.55} \]

where $\tilde{w}$ are the PC exposures. Because the PCs form an orthonormal basis ($\Sigma_{Y^{\star}} = I_N$), the portfolio-return variance is given by

\[ \sigma_P^2 = \sum_{i=1}^{N} \tilde{w}_i^2. \tag{1.56} \]

Thus, the PC-variance-parity portfolio is defined as the portfolio for which $\tilde{w}_i^2 = \tilde{w}_j^2$ for all $i, j = 1, \dots, N$. In other words, it is obtained for the PC exposures $\tilde{w} = c\,\mathbf{1}^{\pm}$ with $c \in \mathbb{R}_0$ and $\mathbf{1}^{\pm}$ an $N$-dimensional vector of $\pm 1$’s. As shown by Deguest et al. (2013), this means from (1.55) that there are $2^{N-1}$ PC-variance-parity portfolios given by the set

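\[ \mathcal{W}_{PC} := \left\{ w \in \mathcal{W} \ \middle|\ w = \frac{V_N \Lambda_N^{-1/2} \mathbf{1}^{\pm}}{\mathbf{1}' V_N \Lambda_N^{-1/2} \mathbf{1}^{\pm}} \right\}. \tag{1.57} \]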
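The set (1.57) is easy to enumerate for a small number of assets. The sketch below computes the PC-variance-parity portfolio for each sign vector from an eigendecomposition of $\Sigma$, checks that the squared PC exposures in (1.55) are equal, and identifies the sign vector $\mathrm{sign}(\Lambda_N^{-1/2} V_N' \mathbf{1})$, which maximizes the normalizing constant in (1.57) and hence gives the lowest variance. Function names and the covariance matrix are illustrative assumptions.

```python
import numpy as np
from itertools import product

def pc_variance_parity(cov, signs):
    """Portfolio in (1.57) for a given vector of +/-1 signs."""
    eigval, eigvec = np.linalg.eigh(cov)   # columns of eigvec are eigenvectors
    raw = eigvec @ (np.asarray(signs, dtype=float) / np.sqrt(eigval))
    return raw / raw.sum()

def pc_exposures(cov, w):
    """PC exposures Lambda^{1/2} V' w from (1.55)."""
    eigval, eigvec = np.linalg.eigh(cov)
    return np.sqrt(eigval) * (eigvec.T @ w)

def min_variance_signs(cov):
    """Sign vector sign(Lambda^{-1/2} V' 1)."""
    eigval, eigvec = np.linalg.eigh(cov)
    return np.sign((eigvec.T @ np.ones(cov.shape[0])) / np.sqrt(eigval))

# Illustrative covariance matrix (made-up numbers).
cov_pc = np.array([[0.0225, 0.0090, 0.0030],
                   [0.0090, 0.0900, 0.0120],
                   [0.0030, 0.0120, 0.0400]])

# Enumerate all sign vectors; s and -s give the same portfolio.
variances = {}
for signs in product([-1.0, 1.0], repeat=3):
    w = pc_variance_parity(cov_pc, signs)
    variances[signs] = w @ cov_pc @ w

best_signs = min_variance_signs(cov_pc)
w_best = pc_variance_parity(cov_pc, best_signs)
exposures = pc_exposures(cov_pc, w_best)

print("squared PC exposures:", exposures**2)
print("variance:", w_best @ cov_pc @ w_best)
print("smallest variance over all sign vectors:", min(variances.values()))
```

The sign of each eigenvector returned by the decomposition is arbitrary, which is harmless here because flipping an eigenvector simply relabels the corresponding entry of the sign vector.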

Several studies extended Meucci (2009)’s seminal work. Deguest et al. (2013) study the theoretical properties of the PC-variance-parity portfolios, showing in particular that the minimum-variance PC-variance-parity portfolio is obtained for $\mathbf{1}^{\pm} = \mathrm{sign}(\Lambda_N^{-1/2} V_N' \mathbf{1})$. Meucci et al. (2015) propose an alternative to the PCs based on the uncorrelated factors that track another correlated set of factors, such as the original asset returns, as closely as possible. Concerning empirical performance, Lohre et al. (2012, 2014) document a satisfactory out-of-sample performance in terms of Sharpe ratio and diversification on the S&P500 index and global asset class indices, respectively. Poddig and Unger (2012) find on simulated and empirical data that this portfolio is very unstable over time: “As the PCA portfolio is based on all principal components, but the components are decreasingly responsible for the randomness in the market, change in the input parameters induces a drastic change in the last factor and thus in the asset weights” (p.384).

1.3.3 Criticisms

The PC-variance-parity approach, and the risk-parity framework in general, is largely heuristic. Not much is known about the theoretical properties of such portfolios, and in which dimensions they can be useful with regard to portfolio performance. It is fair to say that these portfolios have been introduced as an alternative to classical mean-variance-efficient portfolios in the hope that they can outperform them out of sample. On risk-based approaches, Lee (2011) puts it as follows: “While proponents of these portfolios may not construct them, nor explicitly claim them to be mean-variance efficient, there is no doubt that their performance metrics reveal their preference for more portfolio efficiency than less. At the very least, one may interpret that these risk-based approaches were built to approximate a mean-variance efficient world, for various reasons such as to mitigate the well-known problem of error maximization or other problems yet to be clarified” (p.13). This explains why these alternative portfolio strategies, although potentially appealing in practice, remain dubious for many academics: “Since performance goals, such as maximizing the Sharpe ratio, are not incorporated into their objective function, some of these risk-based portfolios [...] remain heuristic in nature. Without a clearly defined objective, we struggle to understand what problems these portfolios are built to solve” (Lee 2011, p.13). Indeed, investors ultimately care about the distribution of their portfolio returns, and these preferences can be captured via a utility function for example. In contrast, diversification is not a goal in itself.

1.4 Information-theoretic approaches

In this thesis, we often rely on concepts coming from information theory. Born with the seminal work of Shannon (1948), information theory provides a probabilistic framework to quantify the amount of information embedded in a random variable. It provides a rich set of tools to be used in problems dealing with uncertainty and, as such, has found many applications in finance. We will detail the relevant concepts when needed in the next chapters. Let us just introduce here the Shannon entropy (Shannon 1948), or simply entropy, defined as

\[ H(X) = H[p_X] := -\sum_{i=1}^{N} p_i \ln p_i \tag{1.58} \]

if $X$ has a discrete distribution $p_X = (p_1, \dots, p_N)$, and

\[ H(X) = H[f_X] := -\int_{\Omega_X} f_X(x) \ln f_X(x)\, dx \tag{1.59} \]

if $X$ has a continuous density $f_X$, where $\Omega_X := \{x \mid f_X(x) > 0\}$ is the support set of $X$. Entropy measures the amount of information conveyed on average by the observation of a random variable. In turn, it also measures the uncertainty of $X$. Indeed, the higher the entropy, the less information we have about the random variable prior to observing it, and thus, the more uncertain it is. Mathematically, this is captured by the fact that $H[p_X]$ is maximized when all $p_i$’s are equal to $1/N$, and that $H[f_X]$ is maximized when $f_X$ is a uniform density in case $X$ has bounded support. Also, when $X$ is Gaussian, $H(X) = H[\mathcal{N}(\mu_X, \sigma_X^2)] = \frac{1}{2} \ln(2\pi e \sigma_X^2)$ has a one-to-one mapping with the variance.
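These properties can be verified numerically. The sketch below evaluates the discrete entropy (1.58) for a uniform and a non-uniform distribution, and compares the closed-form Gaussian entropy with a simple Riemann-sum approximation of (1.59). The function names, the grid and the chosen parameters are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Discrete Shannon entropy (1.58), with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gaussian_entropy(sigma):
    """Closed-form entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Discrete case: the uniform distribution attains the maximum ln N.
uniform = np.full(4, 0.25)
skewed = np.array([0.7, 0.1, 0.1, 0.1])
print(shannon_entropy(uniform), np.log(4))
print(shannon_entropy(skewed))

# Continuous case: Riemann-sum approximation of (1.59) for a Gaussian density.
sigma = 0.2
x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-0.5 * (x / sigma)**2) / (sigma * np.sqrt(2 * np.pi))
numerical = -np.sum(f * np.log(f)) * dx
print(numerical, gaussian_entropy(sigma))
```

Note that the differential entropy of a continuous variable can be negative, as is the case here for small $\sigma_X$, whereas the discrete entropy is always non-negative.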

Information-theoretic measures have been used in many financial applications; see Golan and Maasoumi (2008) and Zhou et al. (2013) for reviews. Let us mention three different examples to get a flavour of how diverse the applications can be. First, Brody et al. (2007) introduce an option-pricing model in which the risk-neutral density of the underlying asset at maturity is obtained as a maximum-Rényi-entropy distribution under moment constraints. Rényi entropy is an extension of Shannon entropy; see Chapter 2. This extends the classical maximum-entropy principle of Jaynes (1957) based on Shannon entropy. Using Rényi entropy allows them to obtain a power-law distribution that is more in line with observed distributions than the exponential distribution found by maximizing Shannon entropy. Second, Jiang et al. (2018) introduce a measure of stock-return asymmetry based on an entropy-type distance measure between the left and right tails of the distribution. They find that it captures asymmetry in a more powerful way than the skewness. Third, Ormos and Zibriczky (2014) introduce an analog of the CAPM asset-pricing model using Shannon and quadratic entropies, and find that entropy provides a higher explanatory power than the beta.

In portfolio selection, the literature is mainly limited to employing entropy as a penalty term besides a more standard objective function. One considers the portfolio weights as discrete probabilities, and uses their Shannon entropy as a penalty term to shrink them toward the equally weighted portfolio; see Bouchaud et al. (1997), Bera and Park (2008), Usta and Kantar (2011), Huang (2012), Yu et al. (2014) and Pola (2016). For example, Bera and Park (2008) propose to solve the maximum-entropy optimization problem

$$\max_{w \in \mathcal{W}} H[w] \quad \text{subject to} \quad \mu_P \geq \mu_0,\ \sigma_P \leq \sigma_0,\ w_i \geq 0\ \forall i. \tag{1.60}$$

Carmichael et al. (2015) propose a generalization of this approach in which one maximizes the portfolio Rao's quadratic entropy, $H_D[w] := w' D w$, where $D$ is a dissimilarity matrix. As they show, this general framework encompasses many existing diversification approaches.

There are surprisingly few proposals in the literature on portfolio selection to rely on the entropy of the portfolio returns, rather than portfolio weights. Yet, entropy provides a natural way of measuring the uncertainty, and thus in some sense the risk, associated with the portfolio returns. Philippatos and Wilson (1972) were the first to propose using Shannon entropy instead of the variance and construct mean-entropy portfolios. However, their only finding is that, empirically, the mean-variance and mean-entropy efficient frontiers are similar. More recently, Dionisio et al. (2006) and Vermorken et al. (2012) recognize that Shannon entropy provides an appealing higher-moment measure that captures uncertainty and diversification, but they do not propose to use it as an objective function. To the best of our knowledge, the only authors who put forward Shannon entropy as an objective function for portfolio selection are Kirchner and Zunckel (2011) and Flores et al. (2017). Kirchner and Zunckel (2011) only propose it implicitly by noticing that the Shannon entropy of a portfolio of two Gaussian asset returns with equal variances is minimized for equal weights. Flores et al. (2017) do not propose entropy directly as an objective function but, instead, propose a maximum-diversification portfolio as in Choueifaty and Coignard (2008) but using exponential Shannon entropy instead of volatility as risk measure. Their diversification measure, noted $DD^\star$, is of the form

$$DD^\star := \frac{\exp\left(\sum_{i=1}^{N} w_i H(X_i)\right) - \exp\left(H\left(\sum_{i=1}^{N} w_i X_i\right)\right)}{\exp\left(\sum_{i=1}^{N} w_i H(X_i)\right)}. \tag{1.61}$$

They show that the maximum-$DD^\star$ portfolio favors skewness and disfavors kurtosis.
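To make the measure in (1.61) concrete, the sketch below evaluates it under the simplifying assumption that asset returns are jointly Gaussian, so that every entropy is available in closed form. This assumption, the function names and the covariance matrices are illustrative and are not taken from Flores et al. (2017).

```python
import numpy as np


def gaussian_entropy(sigma):
    """Shannon entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)


def dd_star_gaussian(weights, cov):
    """Measure (1.61) when asset returns are jointly Gaussian (an assumption)."""
    weights = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    asset_sigmas = np.sqrt(np.diag(cov))
    portfolio_sigma = np.sqrt(weights @ cov @ weights)
    # exp of the weighted average of the individual asset entropies
    weighted_term = np.exp(np.sum(weights * gaussian_entropy(asset_sigmas)))
    # exp of the entropy of the portfolio return, itself Gaussian here
    portfolio_term = np.exp(gaussian_entropy(portfolio_sigma))
    return float((weighted_term - portfolio_term) / weighted_term)


w = np.array([0.5, 0.5])
dd_independent = dd_star_gaussian(w, np.diag([0.04, 0.04]))
dd_comonotonic = dd_star_gaussian(w, np.array([[0.04, 0.04], [0.04, 0.04]]))
```

In this Gaussian setting the measure reduces to one minus the ratio of the portfolio volatility to the weighted geometric mean of asset volatilities, so it is zero for two perfectly correlated identical assets and positive once the assets diversify each other.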

Overall, the portfolio-return entropy H(P ) has been proposed as a risk or uncertainty criterion based on intuitively appealing grounds, such as its information-theoretic roots and the fact that it captures higher moments. However, not much is known about its properties in a risk-measurement and portfolio-selection context.

1.5 Thesis contributions

Having reviewed the relevant theory and literature on portfolio selection, we provide in this section a brief description of the contributions of this thesis, covered in Chapters 2 to 5. In all four chapters, we introduce portfolio strategies that account for the higher moments of portfolio returns. As discussed in Section 1.2, this is an important challenge in portfolio selection. Each strategy accounts for higher moments in a different way, but they have in common that they rely on information theory to accomplish this.

In Chapter 2 (Minimum Rényi entropy portfolios), we follow the recent literature on the use of entropy in portfolio selection (Section 1.4) and introduce portfolios that minimize the portfolio-return exponential Rényi entropy. The latter is an extension of Shannon entropy featuring a free parameter α ∈ [0, ∞]. Our contribution is divided into two main parts. First, we study the properties of exponential Rényi entropy as a risk measure. We show that it is closely related to the so-called class of deviation risk measures, but that, in general, it is not subadditive. We also provide arguments for why it is better to restrict the Rényi parameter to α ∈ [0, 1]. Second, we study the minimum-Rényi-entropy portfolio. We make explicit the relationship between Rényi entropy and higher moments via a second-order Gram-Charlier expansion, and show with this expansion that minimizing the exponential Rényi entropy can actually favor portfolios with worse higher moments than minimizing the variance. Thus, although entropy is an intuitively appealing measure of risk, it can be undesirable with respect to higher moments. Finally, we derive a robust estimator of Rényi entropy to deal with estimation risk and show empirically that, compared to the minimum-variance portfolio, the minimum Rényi entropy portfolio improves the Sharpe ratio but deteriorates the higher moments and turnover.

In Chapter 3 (Optimal portfolio diversification via independent component analysis), we study the factor-risk-parity approach of Section 1.3.2 and legitimize it as a well-founded higher-moment strategy. Our contribution is divided into three main parts. First, we show that factor-risk parity based on principal components is problematic because they are merely uncorrelated, rather than independent, and because they are an arbitrary choice as any rotation of the PCs remains uncorrelated and thus valid for factor-risk parity. Actually, as we show, any portfolio is a factor-variance-parity portfolio for some rotation of the PCs. Second, we propose to resolve these issues by relying on the independent components, which are the rotation of the PCs that is maximally independent and is found by minimizing the sum of the Shannon entropies of the factors. This solves the arbitrariness of the decorrelation criterion and, as we show, provides a natural indirect approach to reduce the portfolio-return kurtosis. Third, we study the empirical performance of portfolios that shrink minimum-risk portfolios toward factor-risk-parity portfolios, using variance and modified VaR as risk measures. We find that portfolios based on independent components largely outperform those based on principal components, and perform well in terms of Sharpe ratio, higher moments and turnover.

In Chapter 4 (Robust portfolio selection using sparse estimation of comoment tensors), we exploit the natural sparsity of comoment tensors associated with independent factors to obtain robust estimates of moment-based portfolios. Our contribution is divided into three main parts. First, we show that projecting the $N$ asset returns onto the basis formed by the first $K$ principal components reduces dimension: the number of parameters in the $k$th comoment tensor of asset returns goes from the order $N^k$ to $K^k$. However, the growth remains exponential. Second, to really address the curse of dimensionality, we project the asset returns onto the independent components. By neglecting their remaining higher-moment dependence, the number of parameters becomes linear in $K$. Third, we study the out-of-sample performance of minimum-variance and minimum-MVaR portfolios, and find that our sparse estimates drastically improve performance and reduce turnover compared to sample and PCA-based estimates. Moreover, our sparse estimate of the minimum-MVaR portfolio succeeds in dominating the minimum-variance strategy, in contrast with non-sparse estimates.

In Chapter 5 (Portfolio selection: A target-distribution approach), we propose a framework in which investors' preferences are captured via a target-return distribution, which provides a simple and intuitive way of specifying higher-moment preferences. Our contribution is divided into three main parts. First, we introduce the minimum-divergence portfolio, which is the portfolio whose return distribution is as close as possible to the target-return distribution, as measured by the Kullback-Leibler divergence (relative entropy). Second, we study the particular case of the generalized-normal target-return distribution. We propose to match the first two target-return moments to those of a reference portfolio, chosen on the mean-variance efficient frontier. In this setup, we show that we recover Markowitz's efficient frontier when asset returns are Gaussian, and when targeting a Dirac-delta distribution. We also show that the objective function admits a natural decomposition in the case of a Gaussian target return, in which the higher-moment preferences are captured via the Shannon entropy of the standardized portfolio return. The latter shrinks the reference portfolio toward a portfolio with higher moments close to the Gaussian. Third, we provide a closed-form estimator of the objective function and show empirically that, compared to popular robust mean-variance reference portfolios, our minimum-divergence portfolios exhibit comparable mean-variance trade-offs but feature substantially less tail risk.

Chapter 2

Minimum Rényi entropy portfolios

Abstract

Accounting for the non-normality of asset returns remains one of the main challenges in portfolio optimization. In this chapter, we tackle this problem by assessing the risk of the portfolio through the “amount of randomness” conveyed by its returns. We achieve this using an objective function that relies on the exponential of Rényi entropy, an information-theoretic criterion that precisely quantifies the uncertainty embedded in a distribution, accounting for higher moments. Compared to Shannon entropy, Rényi entropy features a parameter that can be tuned to adjust the notion of uncertainty. A Gram-Charlier expansion shows that it controls the relative contributions of the central (variance) and tail (kurtosis) parts of the distribution in the measure. We further rely on a non-parametric estimator of the exponential Rényi entropy that extends a robust sample-spacings estimator initially designed for Shannon entropy. A portfolio-selection application shows that one must be careful in using Rényi entropy as an objective function because its minimization can yield portfolios with less skewness and more kurtosis than the minimum-variance portfolio.

Reference

This chapter is partially based on: Lassance, N., & Vrins, F. (2019). Minimum Rényi entropy portfolios. Annals of Operations Research, https://doi.org/10.1007/s10479-019-03364-2.

2.1 Introduction

As explained in Section 1.1.2, the minimum-variance portfolio performs very well out of sample in terms of Sharpe ratio because it does not require estimates of asset mean returns. At the same time, however, investors display non-negligible higher-moment preferences, which calls for the design of portfolio strategies that account for higher moments, and not only for the variance. As seen in Section 1.4, recent studies in portfolio selection hint that entropy might provide an appealing way of measuring the uncertainty embedded in the whole portfolio-return distribution, thereby accounting for all moments. Still, no one so far has proposed using entropy as an objective function for portfolio selection, and its properties as a risk measure have not been studied either.

In this chapter, we propose minimizing the portfolio-return uncertainty measured via the exponential of Rényi entropy, estimated within a robust m-spacings framework. The optimal portfolio so obtained is called the minimum Rényi entropy portfolio, and provides the minimum-risk portfolio in the sense of information theory. It coincides with the minimum-variance portfolio when asset returns are Gaussian but, otherwise, also accounts for higher moments. Rényi entropy is an extension of Shannon entropy that features a parameter α ∈ [0, ∞] that, as we show, allows one to trade off the contributions of the central and tail parts of the distribution in the measure. We argue in particular in favor of setting α ∈ [0, 1] as, then, Rényi entropy has natural connections with measures of spread (as shown by the extended Chebyshev's inequality of Campbell 1966) and with a minimum-variance-kurtosis objective (as shown by a first-order Gram-Charlier expansion of the measure). The empirical results support that choice as well.

The chapter is organized as follows. Section 2.2 introduces the notions of Shannon and Rényi entropy. Section 2.3 explores the theoretical properties of the exponential Rényi entropy and makes the link with the notion of risk. Section 2.4 follows with the minimum Rényi entropy portfolio and its connections with higher moments. Section 2.5 derives a robust m-spacings estimator of the measure and studies its consistency and robustness. We design and perform an empirical out-of-sample performance study of the proposed method in Section 2.6. Minimum Rényi entropy portfolios are shown to outperform standard minimum-variance portfolios in terms of Sharpe ratio, but are associated with more tail risk as measured by skewness and kurtosis, as well as more turnover. Section 2.7 concludes. The Appendix at the end of the chapter contains proofs for all results, as well as supplementary materials.

2.2 The notion of entropy

Entropy, a central notion in information theory, encompasses many different definitions; see Amigo et al. (2018) for a review. Here, we review the ones used in this chapter and the whole thesis: Shannon entropy and Rényi entropy.

2.2.1 Shannon entropy

Information theory started with Shannon (1948) who proposed a mathematical formula to quantify the amount of information conveyed by a (discrete) distribution, now called Shannon entropy or simply entropy. Shannon (1948) gave an axiomatic derivation of entropy, showing that it is the only measure fulfilling three desirable properties of an information measure. Here, we prefer to rely on a more intuitive derivation, which goes as follows. Given a random variable $X$ with discrete probability distribution $p_X = (p_1, \dots, p_N)$, the amount of information conveyed by the occurrence of an event with probability $p_i$ can be defined as $I(p_i) := -\ln p_i$. We have that $I(1) = 0$ and $I(0) = +\infty$, which corresponds to the intuitive idea that observing a very unlikely event brings more information than a very likely one. Then, the Shannon entropy of $X$ is defined as the average amount of information conveyed by the distribution $p_X$; that is,

$$H(X) = H[p_X] := -\sum_{i=1}^{N} p_i \ln p_i. \tag{2.1}$$

In the case where $X$ is a continuous random variable with density $f_X$ and support set $\Omega_X$, the Shannon entropy is similarly defined as

$$H(X) = H[f_X] := -\int_{\Omega_X} f_X(x) \ln f_X(x)\,dx, \tag{2.2}$$

provided the integral exists.

As explained in Section 1.4, entropy is closely connected with uncertainty, which makes it an intuitively appealing alternative to the variance as portfolio-selection criterion. For more details on Shannon entropy, we refer the reader to the textbook of Cover and Thomas (2006).

2.2.2 Rényi entropy

Rényi (1961) proposed to relax one of the axioms considered by Shannon (1948) to derive his notion of entropy. This results in a generalized measure of entropy that recovers Shannon entropy as a special case, known as Rényi entropy.

To derive this measure, we follow the algebraic approach of Erdogmus and Xu (2010) rather than the more abstract axiomatic approach of Rényi (1961). The idea is that, in (2.1), one does not have to use a linear operator to average the $I(p_i)$. Instead, the average amount of information conveyed by the discrete random variable $X$ can be defined more generally via the so-called Kolmogorov–Nagumo average (Kolmogorov 1930) defined as

$$g^{-1}\left(\sum_{i=1}^{N} p_i\, g(I(p_i))\right),$$

where $g$ is any continuous, strictly monotone function. Then, when applying the desired property that entropy must be additive for independent events, the class of possible functions $g$ is restricted to two cases: $g(x) = cx$, leading to Shannon entropy, and $g(x) = c e^{(1-\alpha)x}$, leading to Rényi entropy:

$$H_\alpha(X) = H_\alpha[p_X] := \frac{1}{1-\alpha} \ln\left(\sum_{i=1}^{N} p_i^\alpha\right), \tag{2.3}$$

where $\alpha \in [0, \infty]$, $\alpha \neq 1$. In case $X$ is a continuous random variable, Rényi entropy is similarly defined as

$$H_\alpha(X) = H_\alpha[f_X] := \frac{1}{1-\alpha} \ln \int_{\Omega_X} f_X^\alpha(x)\,dx, \tag{2.4}$$

and Shannon entropy is recovered for $\alpha \to 1$:

$$H_1(X) := \lim_{\alpha \to 1} H_\alpha(X) = H(X). \tag{2.5}$$
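The discrete definition (2.3) and the limit (2.5) are easy to illustrate numerically. The sketch below is only an illustration: the function name and the probability vector are hypothetical choices, and the Shannon case is handled by an explicit branch at α = 1.

```python
import numpy as np


def renyi_entropy(p, alpha):
    """Discrete Renyi entropy (2.3); alpha = 1 falls back to Shannon entropy (2.1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if alpha == 1.0:
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p**alpha)) / (1.0 - alpha))


# Illustrative distribution on N = 4 outcomes.
p = np.array([0.5, 0.25, 0.125, 0.125])

h_shannon = renyi_entropy(p, 1.0)
h_near_one = renyi_entropy(p, 1.0 + 1e-6)  # should be close to Shannon entropy
h_zero = renyi_entropy(p, 0.0)             # ln of the number of outcomes
h_two = renyi_entropy(p, 2.0)
```

For this vector, the values decrease as α goes from 0 to 1 to 2, and the entropy at α close to one is numerically indistinguishable from the Shannon entropy.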

Focusing on the continuous definition of Rényi entropy in (2.4) that is of main interest in this thesis, $H_\alpha(X)$ satisfies the following important properties (Koski and Persson 1992, Johnson and Vignat 2007, Pham et al. 2008):

(i) For a fixed variance, it is maximized for the Student t distribution when $1/3 < \alpha < 1$, the Gaussian distribution when $\alpha = 1$ and a truncated Student t distribution when $\alpha > 1$;

(ii) It is translation invariant:

$$H_\alpha(X + c) = H_\alpha(X); \tag{2.6}$$

(iii) It has the scaling property

$$H_\alpha(cX) = H_\alpha(X) + \ln |c|; \tag{2.7}$$

(iv) It is non-increasing and continuous in $\alpha$;

(v) It has finite lower bound: $H_\alpha(X) > -\infty$.
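Properties (ii) to (iv) can be checked numerically on a specific density. The sketch below uses the logistic distribution purely as a convenient smooth example (an assumption, not a distribution discussed in the text) and evaluates the integral in (2.4) with `scipy.integrate.quad`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import logistic


def renyi_entropy_density(pdf, alpha):
    """Continuous Renyi entropy (2.4) by numerical integration, for alpha != 1."""
    integral, _ = quad(lambda x: pdf(x) ** alpha, -np.inf, np.inf)
    return float(np.log(integral) / (1.0 - alpha))


alpha = 0.5
c, shift = 3.0, 5.0  # illustrative scaling and translation constants

h_x = renyi_entropy_density(logistic(loc=0.0, scale=1.0).pdf, alpha)
h_shifted = renyi_entropy_density(logistic(loc=shift, scale=1.0).pdf, alpha)  # X + shift
h_scaled = renyi_entropy_density(logistic(loc=0.0, scale=c).pdf, alpha)       # c * X
h_alpha_two = renyi_entropy_density(logistic(loc=0.0, scale=1.0).pdf, 2.0)
```

Up to integration error, the shifted entropy equals the original one, the scaled entropy exceeds it by ln c, and the value at α = 2 does not exceed the value at α = 0.5.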

2.3 Rényi entropy and risk measurement

In this section, we study the properties of Rényi entropy as a risk measure. We show that the exponential Rényi entropy is closely connected to the class of deviation risk measures. We take a deep look at the subadditivity property and show that, although it can violate subadditivity in some extreme scenarios, the exponential Rényi entropy can be considered subadditive in a portfolio-selection context. A discussion of the impact of Rényi's α parameter closes the section, arguing in particular that setting α ∈ [0, 1] is more natural.

2.3.1 Connection with deviation risk measures

The exponential of Rényi entropy satisfies more natural risk properties than its non-exponential counterpart.

Definition 2.1 (Exponential Rényi entropy). The exponential Rényi entropy is defined as

$$H_\alpha^{\exp}(X) := \exp(H_\alpha(X)) = \left(\int_{\Omega_X} f_X^\alpha(x)\,dx\right)^{\frac{1}{1-\alpha}}, \tag{2.8}$$

with the exponential Shannon entropy as special case:

$$H_1^{\exp}(X) := \lim_{\alpha \to 1} H_\alpha^{\exp}(X) = \exp\left(-\int_{\Omega_X} f_X(x) \ln f_X(x)\,dx\right). \tag{2.9}$$

This quantity was first introduced by Campbell (1966) who studied its relevance as a measure of spread of a distribution for $\alpha \in [0, 1]$. We come back to the link between $H_\alpha^{\exp}$ and measures of spread in Section 2.3.4. In this chapter, we apply this measure to the construction of minimum-risk portfolios; see Section 2.4.

As it turns out, the exponential Rényi entropy has properties similar to risk measures that belong to the class of deviation risk measures, introduced by Rockafellar et al. (2006).

Definition 2.2 (Deviation risk measure). A deviation risk measure is any operator $D$ satisfying:

(i) Positivity: $D(X) > 0$ for all non-constant $X$, and $D(X) = 0$ for any constant $X$;

(ii) Positive-homogeneity: $D(cX) = cD(X)\ \forall c > 0$;

(iii) Translation-invariance: $D(X + c) = D(X)\ \forall c \in \mathbb{R}$;

(iv) Subadditivity: $D(X + Y) \leq D(X) + D(Y)$.

As we show in the next proposition, $H_\alpha^{\exp}$ fulfills properties (i), (ii) and (iii). The subadditivity property (iv) is dealt with in the next section.

Proposition 2.1. The exponential Rényi entropy $H_\alpha^{\exp}$ fulfills the positivity, positive-homogeneity and translation-invariance properties of deviation risk measures.

2.3.2 The subadditivity property

In their recent paper on diversification based on Shannon entropy (see Section 1.4), Flores et al. (2017) claim that the exponential Shannon entropy is subadditive; that is, for any random variables X and Y , it would hold that

$$H_1^{\exp}(X + Y) \leq H_1^{\exp}(X) + H_1^{\exp}(Y).$$

In the next theorem, we show via three counterexamples that this claim is not justified in general.¹

Theorem 2.1. The exponential Rényi entropy $H_\alpha^{\exp}$ is not subadditive.

Proof. In Appendix 2.8.1.2, we give three analytical counterexamples to subadditivity for $\alpha = 1$ using pairs of random variables $(X, Y)$: i) a pair of independent one-sided (Lévy) random variables, ii) a pair of independent two-sided bimodal (Gaussian mixture) random variables and iii) a pair of comonotonic random variables.

In addition to these three analytical counterexamples, we provide below a numerical counterexample for a pair of independent Student t random variables.

Example 2.1. As we now show numerically, the exponential Shannon entropy $H_1^{\exp}$ is not subadditive for a pair $(X, Y)$ of i.i.d. Student t random variables with $\nu < 1$ degrees of freedom. From Zografos and Nadarajah (2003), we have that

$$H_1^{\exp}(X) = H_1^{\exp}(Y) = \sqrt{\nu}\, B\left(\frac{\nu}{2}, \frac{1}{2}\right) \exp\left(\frac{\nu+1}{2}\left(\psi\left(\frac{\nu+1}{2}\right) - \psi\left(\frac{\nu}{2}\right)\right)\right), \tag{2.10}$$

where $\psi(x) := \frac{d}{dx} \ln \Gamma(x)$ is the digamma function and $B(x, y) := \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$ is the Beta function. The density $f_{X+Y}$ is not known in closed form but can be evaluated numerically by convolution, which allows us to evaluate $H_1^{\exp}(X + Y)$ numerically. Denoting $\tilde{H}_1^{\exp}(X + Y)$ the numerical approximation using the MATLAB built-in function integrate, Figure 2.1 shows how $\tilde{H}_1^{\exp}(X + Y) - (H_1^{\exp}(X) + H_1^{\exp}(Y))$ evolves as a function of $\nu \in [0.5, 2]$. The figure shows that subadditivity is violated whenever $\nu < 1$. It is in fact easy to show that $H_1^{\exp}(X + Y) = H_1^{\exp}(X) + H_1^{\exp}(Y)$ when $\nu = 1$. Indeed, in that case, $X$ and $Y$ are two i.i.d. Cauchy(0, 1) random variables, which, being a stable law, means that $X + Y \sim$ Cauchy(0, 2). The additivity of the $H_1^{\exp}$ operator then follows because, for $Z \sim$ Cauchy($\mu$, $\sigma$), $H_1^{\exp}(Z) = 4\pi\sigma$.

¹ The authors are grateful to Yuri Salazar Flores for discussions on the provided counterexamples.

Figure 2.1: The exponential Shannon entropy violates subadditivity for two i.i.d. Student t random variables with ν < 1

[Figure: $\tilde{H}_1^{\exp}(X + Y) - (H_1^{\exp}(X) + H_1^{\exp}(Y))$ plotted against $\nu \in [0.5, 2]$.]

Notes. The figure depicts, from Example 2.1, the quantity $\tilde{H}_1^{\exp}(X + Y) - (H_1^{\exp}(X) + H_1^{\exp}(Y))$ for a pair $(X, Y)$ of i.i.d. Student t random variables with $\nu$ degrees of freedom. $\tilde{H}_1^{\exp}(X + Y)$ is the estimate of $H_1^{\exp}(X + Y)$ obtained via convolution and Matlab integrate function. The figure shows that subadditivity is violated whenever $\nu < 1$.
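The closed-form ingredients of Example 2.1 can be sketched in a few lines. The code below does not reproduce the convolution step or Figure 2.1; it only evaluates the Student t expression from the example, compares it with the entropy reported by `scipy.stats.t`, and checks the Cauchy additivity at ν = 1. Function names are illustrative.

```python
import numpy as np
from scipy.special import beta, digamma
from scipy.stats import t as student_t


def exp_shannon_entropy_student(nu):
    """Exponential Shannon entropy of a standard Student t, as in (2.10)."""
    half = 0.5 * (nu + 1.0)
    return float(
        np.sqrt(nu)
        * beta(0.5 * nu, 0.5)
        * np.exp(half * (digamma(half) - digamma(0.5 * nu)))
    )


def exp_shannon_entropy_cauchy(scale):
    """Exponential Shannon entropy of a Cauchy law with the given scale."""
    return 4.0 * np.pi * scale


nu = 3.0
closed_form = exp_shannon_entropy_student(nu)
from_scipy = float(np.exp(student_t(df=nu).entropy()))

# nu = 1: X, Y i.i.d. Cauchy(0, 1), so X + Y is Cauchy(0, 2).
entropy_of_sum = exp_shannon_entropy_cauchy(2.0)
sum_of_entropies = 2.0 * exp_shannon_entropy_cauchy(1.0)
```

At ν = 1 the two sides coincide, which is the boundary case between the subadditive region (ν > 1) and the region where subadditivity fails (ν < 1) reported in the example.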

Theorem 2.1 and Example 2.1 show that the exponential Rényi entropy is not subadditive. However, it is worth stressing that the provided counterexamples are atypical in portfolio-selection applications. The Lévy distribution and Student t distribution with ν < 1 depict extreme fat tails: none of the moments are defined. In practice, asset returns exhibit much thinner tails (Cont 2001). Similarly, multimodal distributions and comonotonicity are behaviors that rarely arise in portfolio selection.

In fact, just like the Value-at-Risk is subadditive in most practical situations (Danielsson et al. 2013), the exponential Rényi entropy can be reasonably assumed subadditive in the specific context of portfolio selection. For instance, the exponential Rényi entropy is subadditive for Gaussian distributions. Indeed, we have that (Koski and Persson 1992)

$$H_\alpha^{\exp}[\mathcal{N}(\mu, \sigma)] = \sigma \sqrt{2\pi\alpha^{1/(\alpha-1)}}, \tag{2.11}$$

and the standard deviation is subadditive (Artzner et al. 1999).
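The Gaussian case can be verified numerically. The sketch below compares the closed form, read here as σ times the square root of 2πα^{1/(α−1)}, with a direct numerical evaluation of definition (2.8), and then checks subadditivity for a correlated Gaussian pair. The parameter values and function names are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def exp_renyi_gaussian(sigma, alpha):
    """Closed form (2.11) for a Gaussian with standard deviation sigma, alpha != 1."""
    return float(sigma * np.sqrt(2.0 * np.pi * alpha ** (1.0 / (alpha - 1.0))))


def exp_renyi_numeric(pdf, alpha):
    """Definition (2.8) evaluated by numerical integration, alpha != 1."""
    integral, _ = quad(lambda x: pdf(x) ** alpha, -np.inf, np.inf)
    return float(integral ** (1.0 / (1.0 - alpha)))


alpha = 0.5
sigma_x, sigma_y, rho = 0.3, 0.2, 0.4  # illustrative values

closed = exp_renyi_gaussian(sigma_x, alpha)
numeric = exp_renyi_numeric(norm(scale=sigma_x).pdf, alpha)

# X + Y is Gaussian with this standard deviation when (X, Y) is jointly Gaussian.
sigma_sum = np.sqrt(sigma_x**2 + sigma_y**2 + 2.0 * rho * sigma_x * sigma_y)
lhs = exp_renyi_gaussian(sigma_sum, alpha)
rhs = exp_renyi_gaussian(sigma_x, alpha) + exp_renyi_gaussian(sigma_y, alpha)
```

Because the measure is proportional to σ for a fixed α, the inequality between `lhs` and `rhs` is just the subadditivity of the standard deviation.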

The Gaussian case is however quite restrictive because asset returns are typically not well described by the Gaussian distribution; see Section 1.2. A more appealing candidate to model the fat tails observed in asset returns is the general class of elliptical distributions, which has many applications in finance and portfolio selection; see, for instance, Chen et al. (2011) and Kan and Zhou (2017). The Gaussian distribution considered above is a special instance of the class of elliptical distributions, which also comprises more realistic distributions to model asset returns such as the Student t and Laplace distributions. As the next proposition shows, the exponential Rényi entropy is subadditive when the pair (X,Y) is jointly distributed according to an elliptical distribution. This provides a broader and more realistic sufficient condition for subadditivity than the Gaussian setting.

Proposition 2.2. Let $(X, Y) \sim El(\mu, \Sigma, g_2)$ be elliptically distributed; that is,

$$f_{X,Y}(x) = |\det(\Sigma)|^{-1/2}\, g_2\left((x - \mu)' \Sigma^{-1} (x - \mu)\right), \tag{2.12}$$

where $g_2$ is a non-negative density generator function and $\Sigma$ is the scaling matrix of $(X, Y)$. Then, $H_\alpha^{\exp}$ is subadditive for the pair $(X, Y)$.

Remark 2.1. Proposition 2.2 can be extended to any dimension. That is, if $X \sim El(\mu, \Sigma, g_N)$, then $H_\alpha^{\exp}\left(\sum_{i=1}^{N} X_i\right) \leq \sum_{i=1}^{N} H_\alpha^{\exp}(X_i)$. Combined with the positive-homogeneity property, this means that $H_\alpha^{\exp}$ is subadditive at the portfolio level; that is, for positive portfolio weights $w$, we have $H_\alpha^{\exp}\left(\sum_{i=1}^{N} w_i X_i\right) \leq \sum_{i=1}^{N} w_i H_\alpha^{\exp}(X_i)$.

2.3.3 Exponential Rényi entropy as a flexible risk measure

This section explains how the parameter α allows one to tune the relative contributions of the central and tail parts of the distribution, leading to different definitions of risk. This makes Rényi entropy more flexible than the standard Shannon entropy.

To show this, consider the two extreme cases $\alpha = 0$ and $\alpha = \infty$. From Koski and Persson (1992), we have that $H_\alpha^{\exp}(X)$ measures the spread of $X$ when $\alpha = 0$, while $H_\alpha^{\exp}(X)$ is given by the inverse of the supremum of $f_X$ when $\alpha = \infty$. Specifically, we have that $H_0^{\exp}(X) := \lim_{\alpha \downarrow 0} H_\alpha^{\exp}(X)$ and $H_\infty^{\exp}(X) := \lim_{\alpha \to \infty} H_\alpha^{\exp}(X)$ are given by

$$H_0^{\exp}(X) = L(\Omega_X), \tag{2.13}$$

$$H_\infty^{\exp}(X) = 1/\sup f_X, \tag{2.14}$$

where $L(\Omega_X)$ is the Lebesgue measure of the support set of $X$.

These two extreme cases show that varying α modifies the way we measure entropy, that is uncertainty, and so the risk. Taking α = 0 amounts to measuring risk by the support of the distribution, while taking α = ∞ amounts to measuring risk by the maximal probability. By minimizing the portfolio-return entropy, as we propose in the next section, one can therefore minimize the density range on the x-axis with α = 0 or maximize the density range on the y-axis with α = ∞. $H_0^{\exp}$ focuses only on extreme values (low entropy = low distance between extreme values). $H_\infty^{\exp}$ focuses only on the most likely outcomes (low entropy = high maximal probability), and so, in the symmetric unimodal case, on the center of the distribution.

It follows that taking α too large may not be desirable because $H_\alpha^{\exp}$ will barely be affected by tail events, which is the criticism that is made about the variance. Conversely, by decreasing α, we assign more similar “weight” to all events, thus increasing the relative importance of tail events compared to events around the mode.
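The two limits in (2.13) and (2.14) can be illustrated on a density for which the integral in (2.8) is available in closed form. The sketch below uses the density f(x) = 2x on [0, 1], which is an assumption chosen only for illustration: its support has length 1, its supremum is 2, and the integral of f^α equals 2^α/(α + 1).

```python
import numpy as np


def exp_renyi_triangular(alpha):
    """Exponential Renyi entropy of f(x) = 2x on [0, 1], for alpha != 1.

    Uses the closed form int_0^1 (2x)^alpha dx = 2^alpha / (alpha + 1),
    evaluated in logs for numerical stability at very large alpha.
    """
    log_integral = alpha * np.log(2.0) - np.log1p(alpha)
    return float(np.exp(log_integral / (1.0 - alpha)))


support_length = 1.0       # Lebesgue measure of the support [0, 1]
inverse_sup_density = 0.5  # 1 / sup f, with sup f = 2

near_zero = exp_renyi_triangular(1e-8)
very_large = exp_renyi_triangular(1e8)
intermediate = [exp_renyi_triangular(a) for a in (0.5, 2.0, 10.0)]
```

As α moves from 0 toward infinity, the measure moves monotonically from the length of the support to the inverse of the peak of the density, which is the trade-off between tail and central events described above.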

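Example 2.2. To illustrate the effect of $\alpha$, Figure 2.2 depicts how the exponential Rényi entropy of a Student t random variable $X \sim T(\nu)$ evolves with $\nu$ for different values of $\alpha$. From Zografos and Nadarajah (2005), it is given by

$$H_\alpha^{\exp}(X) = (\pi\nu)^{\frac{1}{2}} \left(\frac{\Gamma\left(\frac{\nu+1}{2}\right)^{\alpha}}{\Gamma\left(\frac{\nu}{2}\right)^{\alpha}} \, \frac{\Gamma\left(\frac{\alpha(\nu+1)}{2} - \frac{1}{2}\right)}{\Gamma\left(\frac{\alpha(\nu+1)}{2}\right)}\right)^{\frac{1}{1-\alpha}}. \tag{2.15}$$

The figure shows that, when going from $\alpha = 2$ to $\alpha = 0.4$, the sensitivity to the increase of tail uncertainty (to the decrease of $\nu$) is indeed increasingly visible.²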
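The Student t expression of Example 2.2 can be checked against a direct numerical evaluation of definition (2.8). The sketch below implements the closed form as obtained by integrating the α-th power of the standard Student t density, which requires α(ν + 1) > 1; the function names and the chosen values of ν and α are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln
from scipy.stats import t as student_t


def exp_renyi_student(nu, alpha):
    """Exponential Renyi entropy of a standard Student t (alpha != 1)."""
    if alpha * (nu + 1.0) <= 1.0:
        raise ValueError("the integral of f^alpha requires alpha * (nu + 1) > 1")
    log_bracket = (
        alpha * (gammaln(0.5 * (nu + 1.0)) - gammaln(0.5 * nu))
        + gammaln(0.5 * (alpha * (nu + 1.0) - 1.0))
        - gammaln(0.5 * alpha * (nu + 1.0))
    )
    return float(np.sqrt(np.pi * nu) * np.exp(log_bracket / (1.0 - alpha)))


def exp_renyi_numeric(pdf, alpha):
    """Definition (2.8) evaluated by numerical integration, alpha != 1."""
    integral, _ = quad(lambda x: pdf(x) ** alpha, -np.inf, np.inf)
    return float(integral ** (1.0 / (1.0 - alpha)))


closed = exp_renyi_student(5.0, 0.5)
numeric = exp_renyi_numeric(student_t(df=5.0).pdf, 0.5)

# Sensitivity to fatter tails: ratio of the measure at nu = 3 versus nu = 20.
ratio_low_alpha = exp_renyi_student(3.0, 0.4) / exp_renyi_student(20.0, 0.4)
ratio_high_alpha = exp_renyi_student(3.0, 2.0) / exp_renyi_student(20.0, 2.0)
```

The ratio between the fat-tailed case (ν = 3) and the thin-tailed case (ν = 20) is larger for α = 0.4 than for α = 2, in line with the sensitivity described in the example.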

Note however at this stage that even though decreasing α increases the sensitivity to tail events, nothing guarantees that the exponential Rényi entropy with a low α has the right preferences with respect to higher moments.

² The fact that decreasing α increases the sensitivity to changes in the distribution uncertainty can easily be checked in the Gaussian case; from (2.11), we have that $\frac{\partial^2 H_\alpha^{\exp}[\mathcal{N}(\mu, \sigma)]}{\partial\alpha\,\partial\sigma} < 0$.

Figure 2.2: Decreasing α increases the sensitivity to tail uncertainty of the exponential Rényi entropy

[Figure: $H_\alpha^{\exp}(X)$ plotted against $\nu \in [3, 20]$ for α = 0.4, 0.5, 0.7, 1 and 2.]

Notes. The figure depicts, from Example 2.2, the exponential Rényi entropy of the Student t distribution in (2.15) as a function of the number of degrees of freedom ν for different values of α.

2.3.4 Appeal of the case α ∈ [0, 1]

The previous section showed how decreasing the value of α makes the exponential Rényi entropy increasingly more affected by tail events. In this section, we show more specifically that setting α ∈ [0, 1] is more natural. We list three arguments.

1. For $\alpha \in [0, 1]$, $H_\alpha^{\exp}$ has close connections with measures of spread. By minimizing the variance, investors ensure that most of the portfolio-return distribution is concentrated in some small interval around the mean. This is established by Chebyshev's inequality which, given the set $A_k := \{x \in \Omega_X \mid |x - \mu_X| \geq k\}$, says that

$$\mathbb{P}(X \in A_k) \leq \frac{\sigma_X^2}{k^2}.$$

Similarly, for $\alpha \in [0, 1]$, a small value of $H_\alpha^{\exp}(X)$ entails that most of the probability distribution of $X$ is concentrated on a set of small Lebesgue measure. This is determined by Campbell (1966)'s extended Chebyshev's inequality: given $\alpha \in [0, 1]$ and $A'_k = \{x \in \Omega_X \mid f_X(x) \leq k\}$, the following inequality holds:

$$\mathbb{P}(X \in A'_k) \leq \left(k H_\alpha^{\exp}(X)\right)^{1-\alpha}. \tag{2.16}$$

This inequality is more general than Chebyshev's inequality because it does not only deal with the absolute deviation around the mean. Instead, it measures the spread in terms of the size of the set on which most of the probability density is situated. For a unimodal $X$, (2.16) tells us that if $H_\alpha^{\exp}(X)$ is small, the probability that $X$ is concentrated on a small interval around its mode is close to one.

2. A second argument in favor of setting α ∈ [0, 1] is related to the Gram-Charlier expansion of Rényi entropy derived in Section 2.4.2. The first-order expansion will show that, when α ∈ [0, 1], the coefficient in front of the kurtosis of X is positive (and instead negative for α > 1) and so that, to a first-order approximation, a decrease in kurtosis decreases the Rényi entropy.

3. The empirical results of Section 2.6.2 show that the minimum Rényi entropy portfolio performs largely better in terms of Sharpe ratio and turnover when α ∈ [0, 1].
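The extended Chebyshev inequality (2.16) used in the first argument can be checked on a Gaussian example. The sketch below reads the bound as the product of k and the exponential Rényi entropy, raised to the power 1 − α, which is the form obtained by bounding the integral of f_X over the low-density set by k^{1−α} times the integral of f_X^α. The Gaussian choice, the values of k and α, and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm


def exp_renyi_gaussian(sigma, alpha):
    """Exponential Renyi entropy of a Gaussian with std sigma, alpha != 1."""
    return float(sigma * np.sqrt(2.0 * np.pi * alpha ** (1.0 / (alpha - 1.0))))


def low_density_probability(k, sigma=1.0):
    """P(f_X(X) <= k) for a zero-mean Gaussian X with standard deviation sigma."""
    peak = norm.pdf(0.0, scale=sigma)
    if k >= peak:
        return 1.0
    # f_X(x) <= k if and only if |x| >= x_k
    x_k = sigma * np.sqrt(-2.0 * np.log(k / peak))
    return float(2.0 * norm.sf(x_k, scale=sigma))


def campbell_bound(k, alpha, sigma=1.0):
    """Right-hand side of the extended Chebyshev inequality."""
    return float((k * exp_renyi_gaussian(sigma, alpha)) ** (1.0 - alpha))


checks = [
    (k, alpha, low_density_probability(k), campbell_bound(k, alpha))
    for k in (0.01, 0.05, 0.2)
    for alpha in (0.25, 0.5, 0.75)
]
```

Each tuple in `checks` pairs the exact probability of the low-density set with its bound. The bound is informative only when k times the entropy is well below one, that is, when the density threshold is small relative to the spread of the distribution.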

2.4 Rényi entropy and portfolio selection

Given the theoretical properties of $H_\alpha^{\exp}$ discussed in Section 2.3, we propose to use this measure as an objective function to design investment strategies. In particular, given the difficulties associated with estimating asset mean returns (see Section 1.1.2), we propose to construct a minimum-risk portfolio, called the minimum Rényi entropy (MRE) portfolio, which minimizes the portfolio-return exponential Rényi entropy.

2.4.1 Definition

Definition 2.3 (MRE portfolio). The MRE portfolio for a given $\alpha$ is defined as

$$w_{MRE} := \operatorname*{argmin}_{w \in \mathcal{W}} H_\alpha^{\exp}(P). \tag{2.17}$$

Note that, being affected by higher moments that are not convex in the portfolio weights (Jurczenko and Maillet 2006), problem (2.17) may not necessarily be convex.

2.4.2 Entropy and higher moments: A Gram-Charlier expansion

Traditional minimum-risk portfolios are built from specific moments of the portfolio return, typically the variance, leading to the minimum-variance portfolio, and possibly higher moments; see Section 1.2. In the baseline case of Gaussian asset returns, the MRE portfolio coincides with the minimum-variance portfolio because, from (2.11), there is a one-to-one mapping between $H_\alpha^{\exp}(X)$ and the variance $\sigma_X^2$ for $X$ being Gaussian. In a more general setting however, the MRE portfolio does not coincide with the minimum-variance portfolio because it accounts for the uncertainty coming from higher moments. To see this, we derive a second-order Gram-Charlier expansion of Rényi entropy in the next theorem.

Figure 2.3: Coefficients of the second-order Gram-Charlier expansion of Rényi entropy

[Figure: higher-moment coefficients $k_1(\alpha)$, $k_2(\alpha)$ and $k_3(\alpha)$ plotted against $\alpha \in [0.25, 5]$.]

Notes. The figure depicts the coefficients $k_1(\alpha)$, $k_2(\alpha)$ and $k_3(\alpha)$ in the second-order Gram-Charlier expansion of Rényi entropy as a function of $\alpha$. The expression for the expansion and the coefficients are displayed in Equations (2.18)–(2.19).

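Theorem 2.2. The second-order Gram-Charlier expansion of $H_\alpha(X)$ is given by

$$H_\alpha^{GC}(X) := H_\alpha\big(\mathcal{N}(0, \sigma_X)\big) + k_1(\alpha)\,\kappa_X + k_2(\alpha)\,\zeta_X^2 + k_3(\alpha)\,\kappa_X^2, \tag{2.18}$$

where

$$k_1(\alpha) = \frac{1-\alpha}{8\alpha}, \qquad k_2(\alpha) = -\frac{3\alpha^2 - 6\alpha + 5}{24\,\alpha^{3/2}}, \qquad k_3(\alpha) = -\frac{3\alpha^4 - 12\alpha^3 + 42\alpha^2 - 60\alpha + 35}{384\,\alpha^{5/2}}. \tag{2.19}$$

In particular, the expansion for Shannon entropy is

$$H_1^{GC}(X) = H_1\big(\mathcal{N}(0, \sigma_X)\big) - \frac{1}{12}\,\zeta_X^2 - \frac{1}{48}\,\kappa_X^2. \tag{2.20}$$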

The three coefficients k1(α), k2(α) and k3(α) are depicted in Figure 2.3. Note that the expansion for Shannon entropy in (2.20) recovers a well-known result from information theory; see Hyvärinen et al. (2001, p.115) for example.
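The coefficients in (2.19) are simple to evaluate numerically. The following is a minimal sketch; the function name `gc_coefficients` is introduced here for illustration only. It evaluates the three coefficients and can be used to check that the Shannon case α = 1 gives the constants appearing in (2.20).

```python
def gc_coefficients(alpha):
    """Return (k1, k2, k3) from Equation (2.19) for a given alpha > 0."""
    k1 = (1 - alpha) / (8 * alpha)
    k2 = -(3 * alpha**2 - 6 * alpha + 5) / (24 * alpha**1.5)
    k3 = -(3 * alpha**4 - 12 * alpha**3 + 42 * alpha**2
           - 60 * alpha + 35) / (384 * alpha**2.5)
    return k1, k2, k3


# Shannon case: k1 vanishes, k2 and k3 are the constants of (2.20).
k1_shannon, k2_shannon, k3_shannon = gc_coefficients(1.0)

# Coefficients on a small grid of alpha values, as plotted in Figure 2.3.
coefficient_grid = {a: gc_coefficients(a) for a in (0.25, 0.5, 1.0, 2.0, 5.0)}
```

Evaluating the grid shows the sign pattern used in the discussion: k1 changes sign at α = 1, whereas k2 and k3 stay negative.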

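Let us first focus on the first-order Gram-Charlier expansion of Rényi entropy. In that case, because $H_\alpha\big(\mathcal{N}(0, \sigma_P)\big) = H_\alpha\big(\mathcal{N}(0, 1)\big) + \ln \sigma_P$, we have from (2.18)–(2.19) that the criterion to minimize is, up to a multiplicative constant that does not depend on the portfolio weights, approximately

$$H_\alpha^{\exp}(P) \approx \sigma_P \exp\big(k_1(\alpha)\,\kappa_P\big) = \sigma_P \exp\left(\frac{1-\alpha}{8\alpha}\,\kappa_P\right).$$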

We make three interesting observations. First, minimizing Shannon entropy is, to a first-order approximation, equivalent to minimizing variance, because k1(α = 1) = 0. Thus, the ability to control for kurtosis is a notable advantage of Rényi entropy over Shannon entropy. Second, skewness does not appear in the first-order expansion because Rényi entropy is a symmetric measure of uncertainty; it does not distinguish between the left and right tails. Only the skewness squared appears in (2.18). Third, setting α ≤ 1 is more natural with respect to investors' preferences for kurtosis, as noted in Section 2.3.4.

When α < 1, k1(α) > 0 and the MRE portfolio is, to a first-order approximation, similar to a minimum-variance-kurtosis portfolio. As noted by Martellini and Ziemann (2010), such a strategy is well performing because estimators for even moments are less noisy than estimators for odd moments. Thus, in line with Section 2.3.3, by playing with α one trades off the minimization of the central (variance) and tail (kurtosis) uncertainty; that is, of the first two even moments. This is illustrated in the next example.

Example 2.3. Consider N = 2 independent asset returns X1 ⊥ X2 that follow zero-mean non-standardized Student t distributions with parameters $(\sigma_{X_1}, \nu_{X_1}) = (0.3, 10)$ and $(\sigma_{X_2}, \nu_{X_2}) = (0.2, 6)$. In Figure 2.4, we depict how $H_\alpha^{\exp}(P)$, $\sigma_P$ and $\kappa_P$ evolve with respect to the portfolio weight on the first asset, w1. The figure shows that when α is high, the MRE portfolio is close to the minimum-variance portfolio w1 = 0.35, because $\sigma_{X_1} > \sigma_{X_2}$ and mostly central events matter when α is high. However, the more α decreases, the more important is the impact of the fatter tails of X2 ($\nu_{X_2} < \nu_{X_1}$), and so the more the MRE portfolio approaches the minimum-kurtosis portfolio w1 = 0.57.
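The two benchmark weights in Example 2.3 can be approximated with closed-form moments. The sketch below assumes that σ is the scale parameter of the non-standardized Student t, so that the variance is σ²ν/(ν − 2) and the excess kurtosis is 6/(ν − 4); the function name and the grid search are illustrative choices, not the procedure used for Figure 2.4. Under these assumptions the variance minimizer lands at about 0.35 and the kurtosis minimizer at about 0.56, in the vicinity of the weights quoted in the example.

```python
def portfolio_sigma_kurtosis(w1, scale=(0.3, 0.2), nu=(10, 6)):
    """Standard deviation and excess kurtosis of w1*X1 + (1-w1)*X2.

    Assumes independent zero-mean Student t returns with scale parameters
    `scale` and degrees of freedom `nu` (all nu > 4).
    """
    weights = (w1, 1.0 - w1)
    variances = [s**2 * n / (n - 2) for s, n in zip(scale, nu)]
    kurtoses = [6 / (n - 4) for n in nu]
    port_var = sum(w**2 * v for w, v in zip(weights, variances))
    # Fourth cumulants (excess kurtosis times variance squared) add up
    # across independent components.
    port_k4 = sum(w**4 * k * v**2
                  for w, k, v in zip(weights, kurtoses, variances))
    return port_var**0.5, port_k4 / port_var**2


grid = [i / 1000 for i in range(1001)]
w_min_variance = min(grid, key=lambda w1: portfolio_sigma_kurtosis(w1)[0])
w_min_kurtosis = min(grid, key=lambda w1: portfolio_sigma_kurtosis(w1)[1])
```

The MRE weight for a given α would lie between these two values in the first-order approximation, moving toward `w_min_kurtosis` as α decreases.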

Let us now look at the second-order terms in the Gram-Charlier expansion. The coefficients k2(α) and k3(α) are negative for all α, and thus, the effect of the two additional terms $k_2(\alpha)\zeta_P^2$ and $k_3(\alpha)\kappa_P^2$ is to drive the portfolio-return skewness and kurtosis away from the Gaussian ones. This is because, for a fixed mean and variance, the Shannon entropy is maximized by the Gaussian distribution; see Section 2.2.2. Thus, while several authors put forward entropy as an intuitively appealing measure of risk for financial and portfolio-selection applications (see Section 1.4), our Gram-Charlier expansion shows that it can yield portfolios with worse higher moments than those of the minimum-variance portfolio. In particular, an increase in kurtosis decreases the second-order Gram-Charlier expansion of the portfolio-return Rényi entropy under the following condition:

Figure 2.4: The minimum Rényi entropy portfolio balances variance and kurtosis minimization

[Figure: curves of $\sigma_P$, $\kappa_P$ and $H_\alpha^{\exp}(P)$ for α ∈ {0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.8, 1.5}, plotted against the portfolio weight w1 on the horizontal axis (from 0 to 1).]

Notes. The figure depicts, from Example 2.3, the portfolio-return standard deviation $\sigma_P$, excess kurtosis $\kappa_P$ and exponential Rényi entropy $H_\alpha^{\exp}(P)$ as a function of the portfolio weight on the first asset, w1. The investment set is composed of two independent asset returns X1 ⊥ X2 that follow a zero-mean non-standardized Student t distribution with parameters $(\sigma_{X_1}, \nu_{X_1}) = (0.3, 10)$ and $(\sigma_{X_2}, \nu_{X_2}) = (0.2, 6)$.

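$$\frac{\partial H_\alpha^{GC}(P)}{\partial \kappa_P} \le 0 \quad \text{if} \quad \kappa_P \ge -\frac{1}{2}\,\frac{k_1(\alpha)}{k_3(\alpha)} = \frac{24\,(1-\alpha)\,\alpha^{3/2}}{3\alpha^4 - 12\alpha^3 + 42\alpha^2 - 60\alpha + 35}. \tag{2.21}$$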

For α ∈ [0, 1], the lower bound in (2.21) takes values between zero (for α = 0 and α = 1) and 0.415 (for α = 0.714). Thus, the lower bound is easily crossed. This issue is similar to the one documented by Cavenaile and Lejeune (2012), who find that the modified VaR in (1.48), defined as the second-order Gram-Charlier expansion of the VaR, is positively affected by an increase in kurtosis when the confidence level ε > 4.16%.
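The range of the lower bound in (2.21) can be checked with a short grid search. This is only a sketch: the function name `kurtosis_threshold` and the grid resolution are illustrative choices.

```python
def kurtosis_threshold(alpha):
    """Lower bound on excess kurtosis in (2.21): -k1(alpha) / (2 * k3(alpha))."""
    poly = (3 * alpha**4 - 12 * alpha**3 + 42 * alpha**2
            - 60 * alpha + 35)
    return 24 * (1 - alpha) * alpha**1.5 / poly


# Search alpha in [0, 1] for the largest value of the bound.
alphas = [i / 1000 for i in range(1001)]
alpha_star = max(alphas, key=kurtosis_threshold)
max_threshold = kurtosis_threshold(alpha_star)
```

The bound vanishes at both ends of the interval and peaks in between, which is why any portfolio with an excess kurtosis above roughly 0.4 already satisfies the condition for every α ∈ [0, 1].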

Nonetheless, the Gram-Charlier expansion is a crude estimator of the true Rényi entropy, especially when α is low. Thus, for the true Rényi entropy, the effect of kurtosis may be different. To assess this, we replicate Figure 2.2 but scale the random variable X so that it has a unit variance for each value of ν. Figure 2.5 shows that when α is large, a decrease of ν, and thus an increase in kurtosis, decreases $H_\alpha^{\exp}(X)$, as predicted by the Gram-Charlier expansion. However, when α is low enough, a decrease of ν systematically increases $H_\alpha^{\exp}(X)$, in contrast to what the Gram-Charlier expansion tells us.

Intuitively, for α ∈ [0, 1], the exponential Rényi entropy is a measure of spread of the distribution as seen in Section 2.3.4. Everything else being equal, an increase in kurtosis increases the spread in the tails, but also makes the distribution more peaked in the center, which reduces the spread around the mode. If α is not low enough, the lower spread around the mode compensates for the higher spread in the tails, and an increase in kurtosis decreases the Rényi entropy.

Figure 2.5: Impact of kurtosis on the exponential Rényi entropy

[Figure: curves of $H_\alpha^{\exp}(X)$ for α ∈ {0.4, 0.5, 0.7, 1, 2}, plotted against the number of degrees of freedom ν on the horizontal axis (from 3 to 20).]

Notes. The figure depicts the exponential Rényi entropy of the Student t distribution in (2.15) as a function of the number of degrees of freedom ν for different values of α. Compared to Figure 2.2, the random variable X is scaled so as to keep a unit variance for each value of ν.

When finding the MRE portfolio however, a change in w impacts all moments at the same time, and it is not clear whether the solution will yield favorable higher moments. The empirical study in Section 2.6 suggests that it does not. Indeed, we observe that the minimum-variance portfolio has a systematically higher skewness and lower kurtosis than the MRE portfolio for α ∈ [0, 1]. This confirms that entropy can be an undesirable risk measure with respect to higher moments, challenging the existing literature.

2.5 Robust m-spacings estimation

In this section, we explain how, given a finite sample of size T, one can estimate the exponential Rényi entropy $H_\alpha^{\exp}(X)$ in a robust way. In particular, we propose to use an estimator based on m-spacings in Section 2.5.1, and we discuss its properties in terms of consistency and robustness in Section 2.5.2.

2.5.1 Motivation and expression for the m-spacings estimator

To avoid making assumptions on the portfolio-return distribution, we look for a non-parametric estimator of entropy. There exists substantial research on non-parametric estimation of Shannon entropy; see Beirlant et al. (1997) and Hlavackova-Schindler et al. (2007) for reviews.

A natural way of estimating entropy is the plug-in estimate where a density estimator is plugged into the integral defining the entropy. One could for example choose the well-known kernel density estimator (Parzen 1962). However, this estimator is known to be very sensitive to the bandwidth parameter, which can yield issues of stability for our portfolio-selection context.³ Instead, m-spacings estimation of entropy is more reliable: Wachowiak et al. (2005) show that such estimators “are robust and accurate, compare favorably to the popular Parzen window method for estimating entropies, and, in many cases, require fewer computations than Parzen methods” (p.95). Therefore, we rely on a robust m-spacings estimator of Rényi entropy that extends the m-spacings estimator of Shannon entropy studied by van Es (1992) and Learned-Miller and Fisher (2003). The latter describe it as a “consistent, rapidly converging and computationally efficient estimator of entropy which is robust to outliers” (p.1271).

Appendix 2.8.1.5 gives a detailed derivation of the estimator, of which we only report the final expression in the next proposition for conciseness.

Proposition 2.3. Given T i.i.d. observations of X, the m-spacings estimator of $H_\alpha^{\exp}(X)$ is given by

$$\widehat{H}_\alpha^{\exp}(m, T) := \left( \frac{1}{T-m} \sum_{i=1}^{T-m} \left( \frac{T+1}{m} \left( X^{(i+m:T)} - X^{(i:T)} \right) \right)^{1-\alpha} \right)^{\frac{1}{1-\alpha}}, \tag{2.22}$$

where $X^{(1:T)} \le X^{(2:T)} \le \dots \le X^{(T:T)}$ are the order statistics (the observations sorted in increasing order) and m ∈ [1, T − 1] is an integer.

The parameter m is a free parameter of great importance: increasing its value reduces the variance of the estimator by grouping more order statistics in each spacing $X^{(i+m:T)} - X^{(i:T)}$. Consequently, it determines the robustness of the estimator, and in turn of the MRE portfolio.
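The estimator in (2.22) translates almost line by line into code. The sketch below uses only the Python standard library; the function name `mspacing_exp_renyi` and the deterministic Gaussian sample are illustrative choices. It assumes there are no tied observations at lag m, since a zero spacing would break the logarithm at α = 1 and the negative powers for α > 1.

```python
import math
from statistics import NormalDist


def mspacing_exp_renyi(sample, alpha, m):
    """m-spacings estimate of the exponential Renyi entropy, Equation (2.22)."""
    x = sorted(sample)
    T = len(x)
    if not 1 <= m <= T - 1:
        raise ValueError("m must be an integer in [1, T - 1]")
    # Scaled m-spacings (T + 1) / m * (X^(i+m:T) - X^(i:T)), i = 1, ..., T - m.
    spacings = [(T + 1) / m * (x[i + m] - x[i]) for i in range(T - m)]
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1: geometric mean of the scaled spacings.
        return math.exp(sum(math.log(s) for s in spacings) / len(spacings))
    mean_power = sum(s ** (1 - alpha) for s in spacings) / len(spacings)
    return mean_power ** (1 / (1 - alpha))


# Illustrative sample: Gaussian quantiles with standard deviation 0.2.
T = 250
sample = [NormalDist(0.0, 0.2).inv_cdf(i / (T + 1)) for i in range(1, T + 1)]
estimate = mspacing_exp_renyi(sample, alpha=0.5, m=25)
```

Because the estimator is a power mean of scaled spacings, multiplying every observation by a positive constant multiplies the estimate by the same constant, in line with the positive homogeneity of the exponential Rényi entropy.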

Let us look at the two special cases α = 0 and α = 1. Taking the limit where α → 1, we recover the exponential of the estimator of van Es (1992) and Learned-Miller and Fisher (2003):

$$\widehat{H}_1^{\exp}(m, T) := \exp\left( \frac{1}{T-m} \sum_{i=1}^{T-m} \ln\left( \frac{T+1}{m} \left( X^{(i+m:T)} - X^{(i:T)} \right) \right) \right). \tag{2.23}$$

³ When applied to our empirical data in Section 2.6, the kernel density estimator of Rényi entropy of Erdogmus and Xu (2010) with Gaussian kernel achieves a worse risk-adjusted performance than the m-spacings estimator considered here for a wide range of values of the bandwidth parameter.

When α = 0, the exponential Rényi entropy of X is the Lebesgue measure of the support set $\Omega_X$; see Equation (2.13). Plugging α = 0 in (2.22), the sum telescopes and the m-spacings estimate collapses to the sum of the m largest values minus the sum of the m lowest values (up to a multiplicative constant):

$$\widehat{H}_0^{\exp}(m, T) = \frac{T+1}{m(T-m)} \sum_{i=1}^{m} \left( X^{(T-m+i:T)} - X^{(i:T)} \right). \tag{2.24}$$
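The telescoping step behind (2.24) can be spelled out as follows. This is a worked restatement of the argument, with the summation index j introduced here for convenience.

```latex
\begin{align*}
\sum_{i=1}^{T-m} \left( X^{(i+m:T)} - X^{(i:T)} \right)
  &= \sum_{j=m+1}^{T} X^{(j:T)} - \sum_{i=1}^{T-m} X^{(i:T)} \\
  &= \sum_{j=T-m+1}^{T} X^{(j:T)} - \sum_{i=1}^{m} X^{(i:T)}
   = \sum_{i=1}^{m} \left( X^{(T-m+i:T)} - X^{(i:T)} \right),
\end{align*}
```

where the second equality removes from both sums the order statistics that they have in common. Multiplying by $(T+1)/\big(m(T-m)\big)$ gives (2.24).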

2.5.2 Properties of the m-spacings estimator

The m-spacings estimation of entropy has attracted much research (Beirlant et al. 1997) and dates a while back; see, for instance, Vasicek (1976). However, it has been considered mainly for Shannon entropy and, even for this specific case, only asymptotic behavior has been studied. In the general case α ≠ 1, consistency has not been established. In this section, we first discuss the properties of our estimator in terms of asymptotic bias. Second, we assess the impact of the parameter m on the robustness of the estimator.

2.5.2.1 Asymptotic bias

Let us first consider the case α = 1 and denote $\widehat{H}_1(m, T) := \ln \widehat{H}_1^{\exp}(m, T)$. van Es (1992) proved that $\widehat{H}_1(m, T)$ is asymptotically biased but, interestingly, that the bias only depends on the fixed value of m and not on the density $f_X$:

$$\widehat{H}_1(m, T) - H_1(X) \to \psi(m) - \ln m \quad \text{a.s.} \tag{2.25}$$

This bias is always negative, is an increasing function of m and vanishes when m → ∞. Equation (2.25) means that we can simply subtract the bias to get a consistent estimator. Getting back to the exponential case, this means that

$$m\, e^{-\psi(m)}\, \widehat{H}_1^{\exp}(m, T) \to H_1^{\exp}(X) \quad \text{a.s.} \tag{2.26}$$
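The size of the bias in (2.25) and the correction factor in (2.26) can be tabulated with the standard library alone. The sketch below reads ψ as the digamma function, which is an assumption about the notation; for a positive integer m it uses the identity ψ(m) = −γ + Σ_{k=1}^{m−1} 1/k, where γ is the Euler-Mascheroni constant. All function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(m):
    """Digamma function at a positive integer m."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))


def asymptotic_bias(m):
    """Asymptotic bias psi(m) - ln(m) of the log-estimator in (2.25)."""
    return digamma_int(m) - math.log(m)


def bias_correction_factor(m):
    """Multiplicative factor m * exp(-psi(m)) applied in (2.26)."""
    return m * math.exp(-digamma_int(m))


biases = {m: asymptotic_bias(m) for m in (1, 5, 10, 24, 50)}
factors = {m: bias_correction_factor(m) for m in biases}
```

The tabulated biases are negative and shrink toward zero as m grows, so the correction factor is above one and approaches one for large m.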

Interestingly, because the asymptotic bias depends only on m when α = 1, using the bias-corrected estimator in (2.26) or the biased estimator in (2.23) is equivalent when searching the portfolio weights that minimize the portfolio-return entropy in (2.17). Ideally, we would want the same result to hold for all α; that is, the asymptotic bias of $\widehat{H}_\alpha(m, T)$ to depend only on α and m. However, our first attempts at such a proof suggest that this is a difficult problem. A partial result is provided in the next proposition, which shows that the bias of the estimator is the same for X and $X^\star := (X - a)/b$; that is, the bias is invariant to shifting and scaling X.

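Proposition 2.4. Let $X^\star = (X - a)/b$, let $\widehat{H}_\alpha(X; m, T) := \ln \widehat{H}_\alpha^{\exp}(X; m, T)$ be the m-spacings estimator of the Rényi entropy of X, and let $B\big(\widehat{H}_\alpha(X; m, T)\big) := \mathbb{E}\big[\widehat{H}_\alpha(X; m, T)\big] - H_\alpha(X)$ be the bias of this estimator. Then, it holds that

$$B\big(\widehat{H}_\alpha(X; m, T)\big) = B\big(\widehat{H}_\alpha(X^\star; m, T)\big). \tag{2.27}$$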

2.5.2.2 Robustness to outliers

The parameter m acts as a smoothing parameter that controls the variance of the entropy estimator. In this section, we show that increasing m makes the m-spacings estimator more robust, in the sense that it is less sensitive to outliers, which is crucial with regards to the out-of-sample performance of the MRE portfolio. Robustness to outliers means that a small perturbation from the true return distribution yields only a small change in the value of the estimate.

In assessing the robustness of an estimator, the empirical influence function represents a useful tool; see Hampel et al. (1986). Given an estimator $\hat\theta(X_1, \dots, X_T)$ of a quantity θ based on a sample from X of size T, $\mathrm{EIF}_{\hat\theta}(\hat r)$ measures the sensitivity of the estimator $\hat\theta$ to the addition of a supplementary observation $\hat r$ in the sample:

$$\mathrm{EIF}_{\hat\theta}(\hat r) := (T + 1)\left( \hat\theta(X_1, \dots, X_T, \hat r) - \hat\theta(X_1, \dots, X_T) \right). \tag{2.28}$$

Intuitively, the lower is $\mathrm{EIF}_{\hat\theta}(\hat r)$, the more robust is the estimator $\hat\theta$. In Figure 2.6, we depict the empirical influence function of the m-spacings estimator, $\mathrm{EIF}_{\widehat{H}_\alpha^{\exp}}(\hat r)$, for T = 250 observations drawn from $X \sim \mathcal{N}(\mu_X = 0, \sigma_X = 0.2)$. Following Hampel et al. (1986), we do not draw the observations randomly but instead select them as $X_i = \mu_X + \sigma_X \Phi^{-1}\big(\tfrac{i}{T+1}\big)$, i = 1, ..., T. This eliminates the random-sample variability. We consider $\hat r \in [-5\sigma_X, 5\sigma_X]$ and report the results for α = 0.5; other values of α give similar insights. One can observe, as desired, that the empirical influence function decreases with m for large enough values of $\hat r$; that is, for outliers.
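The experiment described above can be reproduced in a few lines. The sketch below is not the code behind Figure 2.6: it restates the estimator of (2.22) for α ≠ 1 so that the block is self-contained, builds the deterministic Gaussian sample from the quantile formula, and evaluates (2.28) at the outlier $\hat r = 5\sigma_X$ for several values of m. Function and variable names are illustrative.

```python
from statistics import NormalDist


def mspacing_exp_renyi(sample, alpha, m):
    """m-spacings estimate of the exponential Renyi entropy (alpha != 1)."""
    x = sorted(sample)
    T = len(x)
    spacings = [(T + 1) / m * (x[i + m] - x[i]) for i in range(T - m)]
    mean_power = sum(s ** (1 - alpha) for s in spacings) / len(spacings)
    return mean_power ** (1 / (1 - alpha))


def empirical_influence(sample, r_hat, alpha, m):
    """Empirical influence function (2.28) of the m-spacings estimator."""
    T = len(sample)
    with_outlier = mspacing_exp_renyi(sample + [r_hat], alpha, m)
    without_outlier = mspacing_exp_renyi(sample, alpha, m)
    return (T + 1) * (with_outlier - without_outlier)


# Deterministic sample X_i = mu + sigma * Phi^{-1}(i / (T + 1)).
T, mu, sigma = 250, 0.0, 0.2
sample = [mu + sigma * NormalDist().inv_cdf(i / (T + 1)) for i in range(1, T + 1)]

# Influence of an outlier at 5 standard deviations for several m.
eif_at_outlier = {
    m: empirical_influence(sample, 5 * sigma, alpha=0.5, m=m)
    for m in (5, 10, 25, 50)
}
```

Comparing the entry for m = 5 with the entry for m = 50 gives a quick numerical check of the claim that a larger m dampens the effect of an outlier.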

Figure 2.6: Impact of the parameter m on the robustness to outliers of the m-spacings estimator of the exponential Rényi entropy

[Figure: curves of $\mathrm{EIF}_{\widehat{H}_\alpha^{\exp}}(\hat r)$ for m ∈ {5, 10, 25, 50}, plotted against the supplementary observation $\hat r$ on the horizontal axis (from $-5\sigma_X$ to $5\sigma_X$).]

Notes. The figure depicts the empirical influence function (defined in (2.28)) of the m-spacings estimator of the exponential Rényi entropy, $\mathrm{EIF}_{\widehat{H}_\alpha^{\exp}}(\hat r)$, for α = 0.5 and different values of m as a function of the supplementary observation $\hat r$. We draw T = 250 observations from a Gaussian random variable $X \sim \mathcal{N}(\mu_X = 0, \sigma_X = 0.2)$ by setting $X_i = \mu_X + \sigma_X \Phi^{-1}\big(\tfrac{i}{T+1}\big)$ and we consider $\hat r \in [-5\sigma_X, 5\sigma_X]$.

Table 2.1: List of datasets considered in the empirical study of Chapter 2

| Datasets | Abb. | N | Time period |
|---|---|---|---|
| 6 Fama-French portfolios of firms sorted by size and book-to-market | 6BTM | 6 | 1963 - 2016 |
| 25 Fama-French portfolios of firms sorted by size and book-to-market | 25BTM | 25 | 1963 - 2016 |
| 6 Fama-French portfolios of firms sorted by size and momentum | 6Mom | 6 | 1963 - 2016 |
| 25 Fama-French portfolios of firms sorted by size and momentum | 25Mom | 25 | 1963 - 2016 |
| 10 industry portfolios representing the US stock market | 10Ind | 10 | 1963 - 2016 |
| 17 industry portfolios representing the US stock market | 17Ind | 17 | 1963 - 2016 |

Notes. The table reports the six datasets that we use in the empirical study of Section 2.6. All datasets are made of monthly value-weighted arithmetic returns. Source: Kenneth French library. The time period goes precisely from 07/1963 to 06/2016.

2.6 Out-of-sample performance

We now study the out-of-sample performance of the MRE portfolio compared to the minimum-variance portfolio strategy. We present the methodology in Section 2.6.1 and discuss the results in Section 2.6.2.

2.6.1 Data and methodology

We rely on six monthly return equity datasets from the Kenneth French library that are listed in Table 2.1. These are extensively used as benchmarks in the literature to compare portfolio strategies; see, for instance, DeMiguel et al. (2009a,b), Behr et al. (2013) and Ardia et al. (2017).

Figure 2.7: Impact of the parameter m on the stability of the minimum Rényi entropy portfolio

[Figure: four panels (M-portfolio, MRE portfolio with m = 2, MRE portfolio with m = 8, MRE portfolio with m = 24), each showing the portfolio weights of the ten assets X1 to X10 (vertical axis, from -1 to 1) against the rolling window (horizontal axis, from 1 to 43).]

Notes. The figure depicts, for the 10Ind dataset, the time evolution of the portfolio weights for the M-portfolio of DeMiguel and Nogales (2009) and the MRE portfolios with α = 0.5 and m ∈ {2, 8, 24}. The weights are unconstrained; that is, w is restricted to W in (1.2). The rolling windows on the x-axis are one-year windows ranging from mid 1973 to mid 2016.

We compare the MRE portfolio with six values of α ∈ {0.3, 0.5, 0.7, 1, 1.5, 2} to five different minimum-variance (MV) portfolios: the sample MV portfolio, the three MV portfolios based on the shrinkage estimators of the covariance matrix of Ledoit and Wolf (2003, 2004a, 2004b), and the M-portfolio (MP) of DeMiguel and Nogales (2009). See Section 1.1.3 for details. As noted in Section 2.4.1, the MRE portfolio problem is in general not convex. Thus, we rely on the Matlab GlobalSearch optimizer devised by Ugray et al. (2007). We observe on the different datasets that rerunning the optimization with different starting points yields essentially indistinguishable solutions, which suggests that non-convexity is not a major issue.

Regarding the selection of the parameter m for the m-spacings estimator, we noted in Section 2.5.2.2 that it is of great importance as it determines the robustness of the MRE portfolio. To illustrate this, Figure 2.7 reports, for the 10Ind dataset, the time evolution of the MRE portfolio weights for α = 0.5 and m ∈ {2, 8, 24}. The portfolio weights of the M-portfolio are also reported for comparison. The figure shows that increasing m improves the stability of the portfolio weights. We select m via the simple rule-of-thumb $m = T^{2/3}$, which works well on the considered datasets. As explained below, we use an estimation window of ten years, and thus, $m = 120^{2/3} \approx 24$. We actually observe that once m is high enough, the results display only a very minor sensitivity to the specific value of m that is chosen. Thus, one can select m = 24 without fearing that a different but close value would yield dissimilar results.⁴

Regarding portfolio-weight constraints, it is useful to restrict the solution space so as to improve stability and out-of-sample performance, particularly when using low-frequency data such as monthly returns here (see Section 1.1.3). Therefore, we compute the different portfolios subject to the global variance-based constraint devised by Levy and Levy (2014) in (1.30). Using the 48 U.S. industry portfolios dataset, they observe improved out-of-sample Sharpe ratios compared to several robust portfolios for a wide range of thresholds δ. In particular, their results are stable for δ between 10% and 25%. We select δ at the higher end, δ = 25%, because too low values make it difficult to distinguish between the different portfolios as they are too close to the equally weighted portfolio, obtained for δ = 0.

To evaluate the out-of-sample performance of the different portfolios, we employ a rolling-horizon methodology. At a given time t, we estimate the different portfolio policies using the past ten years of monthly returns, a sample size of T = 120. We keep these portfolio weights constant for the next year and compute the corresponding out-of-sample monthly portfolio returns. After one year, we roll forward the estimation window by one year and repeat this procedure. A one-year rolling window, as in Behr et al. (2013) for example, is not too frequent, which is important to limit turnover (Carroll et al. 2017). To account for monthly asset-return changes, the portfolio weights are rebalanced every month to keep the relative allocation to each asset at the desired value. The data collected cover the period 07/1963 to 06/2016, and thus, we obtain 43 years of out-of-sample monthly portfolio returns.
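The rolling-horizon scheme can be sketched as an index generator. The code below is an illustration only: the function name, the month-index convention (end-exclusive ranges counted from 07/1963) and the default arguments are assumptions introduced here. It also evaluates the rule of thumb m = T^{2/3} stated earlier in this section for the estimation sample size T = 120.

```python
def rolling_windows(n_months, estimation=120, holding=12):
    """Return (est_start, est_end, hold_end) month-index triples.

    Weights are estimated on months [est_start, est_end) and held,
    with monthly rebalancing, over months [est_end, hold_end).
    """
    windows = []
    start = 0
    while start + estimation + holding <= n_months:
        est_end = start + estimation
        windows.append((start, est_end, est_end + holding))
        start += holding  # roll the estimation window forward by one year
    return windows


n_months = 53 * 12                 # 07/1963 to 06/2016
windows = rolling_windows(n_months)
n_out_of_sample_years = len(windows)

# Rule-of-thumb spacing parameter for an estimation sample of 120 months.
m_rule = round(120 ** (2 / 3))
```

Each triple marks one estimation sample and the year over which the resulting weights are evaluated, so the number of triples equals the number of out-of-sample years.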

Finally, we consider the following performance criteria. In terms of mean-variance trade-off, we report the annualized Sharpe ratio $\sqrt{12}\,SR_P$. In terms of higher moments, we report the monthly skewness and excess kurtosis, $\zeta_P$ and $\kappa_P$, respectively. In terms of portfolio-weight stability and transaction costs, we report the monthly portfolio turnover of DeMiguel et al. (2009b), defined as

$$\text{Turnover} := \frac{1}{12R - 1} \sum_{t=1}^{12R-1} \sum_{i=1}^{N} \left| w_{i,t+1} - w_{i,t^+} \right|, \tag{2.29}$$

where R = 43 is the number of rolling windows, 12R is the total number of months, $w_{i,t+1}$ is the desired weight of asset i at time t + 1, and $w_{i,t^+}$ is its weight before rebalancing at t + 1.

⁴ Specifically, the results in Table 2.2 for the case m = 18 and m = 35 are very similar to the results we obtain with m = 24. The only non-negligible changes are in terms of turnover, which is higher for m = 18 and slightly lower for m = 35.
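The turnover in (2.29) can be computed from the sequence of desired weights and the realized asset returns. The sketch below makes one assumption that is not spelled out in the text: the weight before rebalancing is obtained by letting the previous desired weights drift with the asset returns and renormalizing. Function names and the toy inputs are illustrative.

```python
def drifted_weights(weights, returns):
    """Weights just before rebalancing, after one period of asset returns."""
    grown = [w * (1.0 + r) for w, r in zip(weights, returns)]
    total = sum(grown)
    return [g / total for g in grown]


def turnover(target_weights, asset_returns):
    """Average of sum_i |w_{i,t+1} - w_{i,t+}| over consecutive months.

    target_weights[t] holds the desired weights at month t and
    asset_returns[t] the asset returns realized between t and t + 1.
    """
    n = len(target_weights)
    total = 0.0
    for t in range(n - 1):
        before = drifted_weights(target_weights[t], asset_returns[t])
        total += sum(abs(w_new - w_old)
                     for w_new, w_old in zip(target_weights[t + 1], before))
    return total / (n - 1)


# Toy example: constant 50/50 target, first asset gains 10% each month.
targets = [[0.5, 0.5]] * 4
returns = [[0.10, 0.0]] * 4
toy_turnover = turnover(targets, returns)

# If all assets earn the same return, weights do not drift.
flat_turnover = turnover(targets, [[0.02, 0.02]] * 4)
```

Even with constant desired weights, the turnover is positive whenever asset returns differ, which is why monthly rebalancing contributes to the reported figures.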

2.6.2 Results

We now discuss the out-of-sample results obtained with the methodology in Section 2.6.1. The results are reported in Table 2.2.

First, we compare the performance of the six MRE portfolios. In terms of Sharpe ratio, the best performances are achieved for α ∈ [0, 1]. Moreover, selecting a low value of α substantially reduces the portfolio turnover. On average across the six datasets, α = 2 yields a monthly turnover of 86.7%, versus 32.7% for α = 0.3. Thus, the Sharpe ratio and turnover statistics show that it is largely preferable to select α ∈ [0, 1]. Further, the Sharpe ratio remains very stable with respect to α ∈ [0, 1] whereas the turnover continues to decrease as α is lowered, and thus, one is better off selecting α close to zero. In terms of skewness and excess kurtosis however, the table shows that decreasing α decreases skewness and increases excess kurtosis. Thus, by selecting a value of α close to zero as we advise above, the portfolio-return tail risk deteriorates.⁵ This contradicts the conclusions one could draw from the crude first-order Gram-Charlier expansion of Rényi entropy derived in Section 2.4.2.

Second, we compare the performance of the MRE and MV portfolios. In terms of Sharpe ratio, the MRE portfolios outperform the MV portfolios for α ∈ [0, 1]. For example, when α = 0.3, the MRE portfolio outperforms all five MV portfolios for all datasets. Still, note that the improvement remains modest: on average across the six datasets, the annualized Sharpe ratio goes from around 0.890 for the MV portfolios to 0.911 for the MRE portfolio with α = 0.3. Moreover, the MRE portfolio requires more turnover than the MV portfolios: 32.7% for α = 0.3 compared to approximately 28% for the MV portfolios, on average. This is because the MRE portfolio is sensitive to higher moments, and thus, is more sensitive to estimation risk than the MV portfolio. In Appendix 2.8.2, we show that the improvement in Sharpe ratio comes mostly from a better mean return, and that it is robust to the inclusion of transaction costs. Finally, whereas the original hope in minimizing Rényi entropy rather than variance was to improve higher moments, the table shows that the opposite happens. On average over the six datasets, the MRE portfolios have less skewness than the MV portfolios for all values of α and more kurtosis for all α < 1. The kurtosis of the MRE portfolio can be brought below that of the MV portfolios by choosing α ≥ 1. However, as discussed above, high values of α yield an unacceptable amount of turnover and, for α = 1.5 and 2, a lower Sharpe ratio than the MV portfolio.

⁵ For completeness, we have checked the results for α ∈ {0.05, 0.1, 0.2} as well. Compared to α = 0.3, we observe that the turnover slightly decreases, the Sharpe ratio remains nearly identical and the skewness and excess kurtosis become slightly worse.

Table 2.2: Out-of-sample performance of minimum-variance and minimum Rényi entropy portfolios

In each panel, the first six columns are the MRE portfolios and the last five columns are the MV portfolios.

Sharpe ratio

| Dataset | α = 0.3 | α = 0.5 | α = 0.7 | α = 1 | α = 1.5 | α = 2 | $\hat\Sigma$ | $\hat\Sigma_{CC}$ | $\hat\Sigma_{SF}$ | $\hat\Sigma_{I}$ | MP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6BTM | 0.844 | 0.845 | 0.846 | 0.841 | 0.832 | 0.815 | 0.835 | 0.821 | 0.833 | 0.836 | 0.841 |
| 25BTM | 0.985 | 0.980 | 0.973 | 0.961 | 0.937 | 0.906 | 0.953 | 0.913 | 0.951 | 0.957 | 0.956 |
| 6Mom | 0.768 | 0.763 | 0.753 | 0.750 | 0.716 | 0.700 | 0.738 | 0.741 | 0.738 | 0.737 | 0.731 |
| 25Mom | 0.936 | 0.948 | 0.957 | 0.960 | 0.954 | 0.928 | 0.902 | 0.920 | 0.904 | 0.903 | 0.916 |
| 10Ind | 0.995 | 1.001 | 1.014 | 1.013 | 0.971 | 0.936 | 0.977 | 0.970 | 0.983 | 0.973 | 0.976 |
| 17Ind | 0.938 | 0.947 | 0.936 | 0.965 | 0.905 | 0.891 | 0.936 | 0.924 | 0.935 | 0.935 | 0.918 |
| Average | 0.911 | 0.914 | 0.913 | 0.915 | 0.886 | 0.863 | 0.890 | 0.882 | 0.891 | 0.890 | 0.890 |

Skewness

| Dataset | α = 0.3 | α = 0.5 | α = 0.7 | α = 1 | α = 1.5 | α = 2 | $\hat\Sigma$ | $\hat\Sigma_{CC}$ | $\hat\Sigma_{SF}$ | $\hat\Sigma_{I}$ | MP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6BTM | -0.591 | -0.601 | -0.612 | -0.622 | -0.643 | -0.621 | -0.520 | -0.497 | -0.520 | -0.523 | -0.590 |
| 25BTM | -0.533 | -0.538 | -0.568 | -0.566 | -0.609 | -0.631 | -0.447 | -0.308 | -0.430 | -0.447 | -0.532 |
| 6Mom | -0.451 | -0.448 | -0.439 | -0.425 | -0.397 | -0.365 | -0.418 | -0.415 | -0.416 | -0.417 | -0.403 |
| 25Mom | -0.536 | -0.521 | -0.505 | -0.491 | -0.436 | -0.372 | -0.491 | -0.452 | -0.478 | -0.494 | -0.513 |
| 10Ind | -0.208 | -0.226 | -0.243 | -0.219 | -0.212 | -0.187 | -0.192 | -0.173 | -0.182 | -0.195 | -0.176 |
| 17Ind | 0.037 | 0.049 | 0.056 | 0.040 | 0.080 | 0.072 | 0.118 | 0.167 | 0.145 | 0.117 | 0.078 |
| Average | -0.380 | -0.381 | -0.385 | -0.381 | -0.369 | -0.351 | -0.325 | -0.280 | -0.314 | -0.327 | -0.356 |

Excess kurtosis

| Dataset | α = 0.3 | α = 0.5 | α = 0.7 | α = 1 | α = 1.5 | α = 2 | $\hat\Sigma$ | $\hat\Sigma_{CC}$ | $\hat\Sigma_{SF}$ | $\hat\Sigma_{I}$ | MP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6BTM | 2.561 | 2.586 | 2.612 | 2.645 | 2.723 | 2.623 | 2.375 | 2.292 | 2.378 | 2.391 | 2.540 |
| 25BTM | 2.557 | 2.617 | 2.701 | 2.537 | 2.445 | 2.641 | 2.243 | 1.907 | 2.182 | 2.253 | 2.571 |
| 6Mom | 2.120 | 2.092 | 2.026 | 1.986 | 1.824 | 1.717 | 2.048 | 2.004 | 2.043 | 2.050 | 2.010 |
| 25Mom | 2.780 | 2.783 | 2.763 | 2.718 | 2.507 | 2.313 | 2.524 | 2.472 | 2.507 | 2.532 | 2.678 |
| 10Ind | 1.355 | 1.292 | 1.280 | 1.207 | 1.101 | 1.216 | 1.298 | 1.251 | 1.277 | 1.300 | 1.232 |
| 17Ind | 2.614 | 2.452 | 2.423 | 2.107 | 2.669 | 2.411 | 3.102 | 3.337 | 3.224 | 3.083 | 2.701 |
| Average | 2.331 | 2.304 | 2.301 | 2.200 | 2.212 | 2.153 | 2.265 | 2.210 | 2.269 | 2.268 | 2.289 |

Turnover

| Dataset | α = 0.3 | α = 0.5 | α = 0.7 | α = 1 | α = 1.5 | α = 2 | $\hat\Sigma$ | $\hat\Sigma_{CC}$ | $\hat\Sigma_{SF}$ | $\hat\Sigma_{I}$ | MP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6BTM | 0.166 | 0.164 | 0.164 | 0.172 | 0.219 | 0.250 | 0.149 | 0.148 | 0.148 | 0.144 | 0.151 |
| 25BTM | 0.454 | 0.497 | 0.588 | 0.945 | 1.324 | 1.514 | 0.394 | 0.389 | 0.387 | 0.389 | 0.433 |
| 6Mom | 0.148 | 0.152 | 0.165 | 0.178 | 0.233 | 0.268 | 0.152 | 0.157 | 0.151 | 0.151 | 0.162 |
| 25Mom | 0.383 | 0.391 | 0.484 | 0.599 | 0.845 | 1.165 | 0.350 | 0.348 | 0.342 | 0.348 | 0.363 |
| 10Ind | 0.328 | 0.324 | 0.353 | 0.416 | 0.567 | 0.744 | 0.252 | 0.246 | 0.244 | 0.253 | 0.295 |
| 17Ind | 0.481 | 0.530 | 0.648 | 0.790 | 0.994 | 1.260 | 0.381 | 0.357 | 0.362 | 0.381 | 0.417 |
| Average | 0.327 | 0.343 | 0.400 | 0.517 | 0.697 | 0.867 | 0.280 | 0.274 | 0.272 | 0.277 | 0.303 |

Notes. The portfolios are computed with the methodology presented in Section 2.6.1. The minimum-variance (MV) portfolios are: i) the sample MV portfolio ($\hat\Sigma$), ii) the three MV portfolios based on the shrinkage estimators of the covariance matrix of Ledoit and Wolf (2003, 2004a, 2004b) based on a constant-correlation target ($\hat\Sigma_{CC}$), single-factor target ($\hat\Sigma_{SF}$) and multiple-of-identity-matrix target ($\hat\Sigma_{I}$), iii) the M-portfolio (MP) of DeMiguel and Nogales (2009).

Putting everything together, we conclude that investing in the minimum Rényi entropy portfolio instead of the minimum-variance portfolio can help improve the Sharpe ratio but at the cost of a deterioration in higher moments and turnover.

2.7 Conclusion

Building upon the literature, reviewed in Section 1.4, that applies entropy and information theory in finance and portfolio selection, the goal of this chapter was to study the exponential Rényi entropy for the purpose of risk measurement and higher-moment portfolio selection.

Regarding risk measurement, we have shown that the exponential Rényi entropy is closely connected with the class of deviation risk measures but that, in contrast with the recent claim of Flores et al. (2017), it is not subadditive in general (Theorem 2.1). However, we have demonstrated that it is subadditive if the asset returns are jointly elliptical.

Regarding higher-moment portfolio selection, we have derived a second-order Gram-Charlier expansion of Rényi entropy (Theorem 2.2) and this approximation gave first evidence that the minimum Rényi entropy portfolio can be undesirable in terms of skewness and kurtosis. This has manifested in real data as well: the empirical study has shown that the minimum-variance portfolio has more skewness and less kurtosis than the minimum Rényi entropy portfolio.

Therefore, while entropy can sound like an appealing criterion to deal with higher moments, this chapter indicates that bluntly minimizing the portfolio-return entropy is not the best way to proceed. The fact that minimizing entropy can actually worsen higher moments reveals a potentially more appealing way of using entropy: what if we aim to minimize the portfolio-return variance but maximize the entropy of the standardized (unit-variance) portfolio return? Interestingly, a similar strategy appears in Chapter 5 when we consider the portfolio whose return distribution is as close as possible, in terms of Kullback-Leibler divergence, to a Gaussian distribution. In that case, we show that the objective function aims to fit the mean and variance of the Gaussian distribution and to maximize the Shannon entropy of the standardized portfolio return. Because the Gaussian distribution is the most-entropic distribution for a fixed variance, this amounts to shrinking the higher moments toward those of the Gaussian; that is, to shrinking the skewness and excess kurtosis toward zero. Such a strategy will be shown to perform very well in terms of tail risk and outperform the minimum-variance portfolio.

Before this, we turn to another strategy in Chapter 3, which consists in diversifying the portfolio-return risk across maximally independent factors, found as the uncorrelated factors with minimum Shannon entropy. As we demonstrate, this diversification strategy helps to naturally reduce the portfolio-return kurtosis.

2.8 Appendix

2.8.1 Proofs of results

2.8.1.1 Proposition 2.1

Proof. From the properties of Rényi entropy listed in Section 2.2.2, we directly have that $H_\alpha^{\exp}$ fulfills the positive-homogeneity and translation-invariance properties. Regarding positivity, $H_\alpha^{\exp}(X)$ is strictly positive if X is non-constant from the positivity of the density $f_X$. To see that it is null if X is constant, let us compute $H_\alpha^{\exp}(k)$ where k is a constant by computing the limit of $H_\alpha^{\exp}(k + cX)$ as c tends to zero for a given random variable X of finite entropy:

$$H_\alpha^{\exp}(k) = \lim_{c \to 0} H_\alpha^{\exp}(k + cX) = \lim_{c \to 0} |c|\, H_\alpha^{\exp}(X) = 0.$$

2.8.1.2 Theorem 2.1

In this section, we report the three counterexamples to subadditivity in Theorem 2.1. The counterexamples are provided for the special case α = 1.

Counterexample 2.1. $H_1^{\exp}$ is not subadditive for a pair (X, Y) of independent Lévy-distributed random variables.

Proof. The density of Z ∼ Lévy(µ, σ) is given by

$$f_Z(x) = \sqrt{\frac{\sigma}{2\pi}}\; \frac{e^{-\frac{\sigma}{2(x-\mu)}}}{(x-\mu)^{3/2}}, \tag{2.30}$$

and is strictly positive for x > µ. Its exponential Shannon entropy is known in closed form (Zografos and Nadarajah 2003):

$$H_1^{\exp}(Z) = 4\sigma\sqrt{\pi}\, e^{\frac{1+2\gamma}{2}}, \tag{2.31}$$

where γ ≈ 0.577 is the Euler-Mascheroni constant. We denote the parameters of X, Y as $(\mu_X, \sigma_X)$ and $(\mu_Y, \sigma_Y)$, with $\sigma_X, \sigma_Y > 0$. As Lévy is a stable law, the sum X + Y is again Lévy-distributed, with parameters $\mu_{X+Y} = \mu_X + \mu_Y$ and $\sigma_{X+Y} = \sigma_X + \sigma_Y + 2\sqrt{\sigma_X \sigma_Y}$. Subadditivity is thus equivalent to $\sigma_{X+Y} \le \sigma_X + \sigma_Y \Leftrightarrow 2\sqrt{\sigma_X \sigma_Y} \le 0$, which never holds when $\sigma_X, \sigma_Y > 0$.
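The size of the violation can be made explicit. Since the exponential Shannon entropy in (2.31) is proportional to the scale parameter, the ratio between the entropy of the sum and the sum of the entropies does not depend on the proportionality constant. The following worked step is added for illustration; the final bound uses the inequality between the arithmetic and geometric means.

```latex
\begin{equation*}
\frac{H_1^{\exp}(X+Y)}{H_1^{\exp}(X) + H_1^{\exp}(Y)}
  = \frac{\sigma_X + \sigma_Y + 2\sqrt{\sigma_X \sigma_Y}}{\sigma_X + \sigma_Y}
  = 1 + \frac{2\sqrt{\sigma_X \sigma_Y}}{\sigma_X + \sigma_Y} \in (1, 2],
\end{equation*}
```

with the upper value 2 attained when $\sigma_X = \sigma_Y$. In that case, the exponential Shannon entropy of the sum is twice the sum of the individual entropies.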

Counterexample 2.2. Consider $(Z_X, Z_Y)$ a pair of independent standard Gaussian variables and $(U_X, U_Y)$ a pair of independent Bernoulli variables of parameter 1/2, independent from both $Z_X, Z_Y$. Define $X := (2\mu_X U_X - \mu_X) + \sigma Z_X$ and $Y := (2\mu_Y U_Y - \mu_Y) + \sigma Z_Y$ with constants $\mu_X, \mu_Y$ and σ > 0. Then, $H_1^{\exp}$ is not subadditive for the pair (X, Y) whenever, for example, $(\mu_X, \mu_Y) = (1, 2)$ and σ < 0.3918.

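Proof. Denoting by $\phi(x)$ the standard Gaussian density, the marginal densities of $X, Y$ are given by the Gaussian mixtures

\[
\begin{aligned}
f_X(x) &= \frac{1}{2\sigma}\left[\phi\!\left(\frac{x+\mu_X}{\sigma}\right) + \phi\!\left(\frac{x-\mu_X}{\sigma}\right)\right], \\
f_Y(x) &= \frac{1}{2\sigma}\left[\phi\!\left(\frac{x+\mu_Y}{\sigma}\right) + \phi\!\left(\frac{x-\mu_Y}{\sigma}\right)\right].
\end{aligned} \tag{2.32}
\]

One can show (Pham and Vrins 2005) that the density of $X + Y$ is

\[ f_{X+Y}(x) = \frac{\phi\!\left(\frac{x+\mu_X+\mu_Y}{\sigma\sqrt{2}}\right) + \phi\!\left(\frac{x+\mu_X-\mu_Y}{\sigma\sqrt{2}}\right) + \phi\!\left(\frac{x-\mu_X+\mu_Y}{\sigma\sqrt{2}}\right) + \phi\!\left(\frac{x-\mu_X-\mu_Y}{\sigma\sqrt{2}}\right)}{4\sigma\sqrt{2}}. \tag{2.33} \]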

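Vrins et al. (2007) show that for a random variable $Z$ with Gaussian mixture density

\[ f_Z(x) = \sum_{k=1}^{K} \frac{\pi_k}{\sigma_k}\, \phi\!\left(\frac{x-\mu_k}{\sigma_k}\right), \tag{2.34} \]

where the $\pi_k$'s are positive weights summing to 1, $H_1(Z)$ can be bounded below and above. More explicitly,

\[ \underline{H}_1(Z) \le H_1(Z) \le \overline{H}_1(Z), \tag{2.35} \]

with

\[
\begin{aligned}
\overline{H}_1(Z) &:= \sum_{k=1}^{K} \frac{\pi_k}{2} \ln\!\left(2\pi e \sigma_k^2\right) + h[\pi], \\
\underline{H}_1(Z) &:= \overline{H}_1(Z) - \sum_{k=1}^{K} \pi_k \epsilon_k' - \sum_{k=1}^{K} \pi_k \left(\ln\frac{s}{\pi_k s_k} + 1\right)\epsilon_k,
\end{aligned} \tag{2.36}
\]

where $s_k := \max_x \frac{1}{\sigma_k}\phi\!\left(\frac{x-\mu_k}{\sigma_k}\right) = \left(\sqrt{2\pi}\,\sigma_k\right)^{-1}$ and $s := \max_k s_k$. Rearranging the $\mu_k$ by increasing order and defining $d_k := \min(\mu_k - \mu_{k-1}, \mu_{k+1} - \mu_k)$ with $\mu_0 := -\infty$, $\mu_{K+1} := \infty$ by convention, we have

\[ \epsilon_k := 2\Phi\!\left(-\frac{d_k}{2\sigma_k}\right) \quad \text{and} \quad \epsilon_k' := \frac{\epsilon_k}{2} + \frac{d_k}{2\sqrt{2\pi}\,\sigma_k}\, e^{-\left(\frac{d_k}{2\sqrt{2}\,\sigma_k}\right)^2}. \tag{2.37} \]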

Using these lower and upper bounds, the exponential Shannon entropy $H_1^{\exp}$ fails to be subadditive for the pair $(X, Y)$ if

\[ \exp\!\left(\underline{H}_1(X + Y)\right) > \exp\!\left(\overline{H}_1(X)\right) + \exp\!\left(\overline{H}_1(Y)\right). \tag{2.38} \]

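Indeed, the left-hand side is a lower bound to $H_1^{\exp}(X + Y)$ while the right-hand side is an upper bound to $H_1^{\exp}(X) + H_1^{\exp}(Y)$. Setting $(\mu_X, \mu_Y) = (\mu, 2\mu)$, the bounds in (2.36) applied to the densities in (2.32)-(2.33) become

\[
\begin{aligned}
\overline{H}_1(X) &= \overline{H}_1(Y) = \ln\!\left(2\sigma\sqrt{2\pi e}\right), \\
\underline{H}_1(X + Y) &= \ln\!\left(8\sigma\sqrt{\pi e}\right) - 2\Phi\!\left(-\frac{\mu}{\sqrt{2}\,\sigma}\right)\left(\frac{3}{2} + \ln 4\right) - \frac{\mu}{2\sigma\sqrt{\pi}}\, e^{-(\mu/2\sigma)^2},
\end{aligned} \tag{2.39}
\]

from which we find, for example, that, setting $\mu = 1$, (2.38) holds if $0 < \sigma < 0.3918$.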
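The threshold on $\sigma$ can be recovered numerically from the two bounds. The following sketch evaluates both sides of (2.38) for $(\mu_X, \mu_Y) = (\mu, 2\mu)$ and locates the sign change by bisection. It assumes the argument of $\Phi$ is negative, as (2.37) implies; the function names and the bisection bracket are hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard Gaussian cdf

def upper_bound_marginal(sigma):
    """Upper bound on H_1(X) = H_1(Y): two components, weights 1/2, std sigma."""
    return math.log(2.0 * sigma * math.sqrt(2.0 * math.pi * math.e))

def lower_bound_sum(mu, sigma):
    """Lower bound on H_1(X + Y): four components, weights 1/4, std sigma * sqrt(2)."""
    upper = math.log(8.0 * sigma * math.sqrt(math.pi * math.e))
    eps = 2.0 * PHI(-mu / (math.sqrt(2.0) * sigma))
    tail = mu / (2.0 * sigma * math.sqrt(math.pi)) * math.exp(-((mu / (2.0 * sigma)) ** 2))
    return upper - eps * (1.5 + math.log(4.0)) - tail

def violation_gap(mu, sigma):
    """Left-hand side minus right-hand side of (2.38); positive means a violation."""
    return math.exp(lower_bound_sum(mu, sigma)) - 2.0 * math.exp(upper_bound_marginal(sigma))

def threshold(mu=1.0, lo=0.2, hi=0.6, iterations=60):
    """Bisection on sigma, assuming the gap is positive at lo and negative at hi."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if violation_gap(mu, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigma_star = threshold()
print(sigma_star)
```

Since the condition relies on bounds rather than on the exact entropies, it is sufficient but not necessary: subadditivity may also fail for somewhat larger values of $\sigma$.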

Counterexample 2.3. $H_1^{\exp}$ is not subadditive for a comonotonic pair $(X, Y)$, with $Y = F(X)$, $F'(X) \sim \text{Exp}(1)$.

Proof. Two random variables $X, Y$ are comonotonic when $Y$ can be written as $F(X)$ with $F$ a strictly increasing continuous function. Denote $G(x) := x + F(x)$, which is also strictly increasing and so invertible, and denote $H(x)$ its inverse. Then, the cumulative distribution function of $X + Y = G(X)$ is given by

\[ F_{X+Y}(x) = \mathbb{P}(G(X) \le x) = \mathbb{P}(X \le H(x)) = F_X(H(x)), \]

and its density reads

\[ f_{X+Y}(x) = \frac{f_X(H(x))}{G'(H(x))}. \]

As a result, $H_1^{\exp}(X + Y)$ becomes

\[ H_1^{\exp}(X + Y) = \exp\!\left(-\int \frac{f_X(H(x))}{G'(H(x))} \ln \frac{f_X(H(x))}{G'(H(x))}\, dx\right). \]

A change of variable $z = H(x)$ and algebraic manipulations lead to

\[ H_1^{\exp}(X + Y) = H_1^{\exp}(X)\, \exp\!\left(\mathbb{E}\left[\ln(1 + F'(X))\right]\right). \]

Based on a similar reasoning, we can show that

\[ H_1^{\exp}(Y) = H_1^{\exp}(X)\, \exp\!\left(\mathbb{E}\left[\ln F'(X)\right]\right). \]

Consequently, subadditivity is violated if

\[ \exp\!\left(\mathbb{E}\left[\ln(1 + F'(X))\right]\right) > 1 + \exp\!\left(\mathbb{E}\left[\ln F'(X)\right]\right). \tag{2.40} \]

Let us now show an example satisfying (2.40). Because $F'(X)$ must be a positive random variable ($F$ is strictly increasing), we consider $F'(X) \sim \text{Exp}(1)$. Define $W := \ln(1 + F'(X))$ and $Z := \ln F'(X)$. Inequality (2.40) reduces to

\[ e^{\mathbb{E}(W)} > 1 + e^{\mathbb{E}(Z)}. \tag{2.41} \]

One can find that the densities of $W$ and $Z$ are given by

\[ f_W(x) = e^{1 + x - e^x} \quad (x \ge 0), \tag{2.42} \]

\[ f_Z(x) = e^{x - e^x} \quad (x \in \mathbb{R}). \tag{2.43} \]

We can now compute the expectations. From (2.42), $\mathbb{E}(W)$ is given by the following integral:

\[ \mathbb{E}(W) = e \int_0^\infty x\, e^{x - e^x}\, dx. \]

A change of variable $z = e^x$ and integration by parts yields

\[ \mathbb{E}(W) = e\,\Gamma(0, 1) = 0.596, \]

where $\Gamma(s, x) := \int_x^\infty t^{s-1} e^{-t}\, dt$ is the upper incomplete Gamma function. A similar derivation yields $\mathbb{E}(Z) = -\gamma = -0.577$; that is, minus the Euler-Mascheroni constant.

Finally, we have in agreement with (2.41) that

\[ e^{\mathbb{E}(W)} = e^{e\,\Gamma(0,1)} = 1.815 > 1.561 = 1 + e^{-\gamma} = 1 + e^{\mathbb{E}(Z)}, \]

hence providing a counterexample to subadditivity.
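The two expectations can be cross-checked by direct numerical integration of the densities (2.42) and (2.43). The sketch below uses a composite Simpson rule on truncated intervals; the truncation points and the number of sub-intervals are arbitrary illustrative choices, not part of the original derivation.

```python
import math

def simpson(f, a, b, n=20_000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def density_w(x):
    """Density of W = ln(1 + F'(X)) in (2.42), valid for x >= 0."""
    return math.exp(1.0 + x - math.exp(x))

def density_z(x):
    """Density of Z = ln F'(X) in (2.43), valid on the real line."""
    return math.exp(x - math.exp(x))

# Both integrands are negligible outside the truncated ranges used here.
expected_w = simpson(lambda x: x * density_w(x), 0.0, 6.0)
expected_z = simpson(lambda x: x * density_z(x), -40.0, 6.0)

lhs = math.exp(expected_w)        # e^{E(W)}
rhs = 1.0 + math.exp(expected_z)  # 1 + e^{E(Z)}
print(expected_w, expected_z, lhs, rhs)
```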

2.8.1.3 Proposition 2.2

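Proof. We set $\mu = 0$ without loss of generality as $H_\alpha^{\exp}$ is translation invariant. The proof relies on the convolution properties of elliptical distributions; see Fang and Zhang (1990). In particular, any linear combination of an elliptical random vector remains elliptical, meaning that we can write

\[ X \sim El(\sigma_X^2, g_1), \quad Y \sim El(\sigma_Y^2, g_1), \quad X + Y \sim El(\mathbf{1}'\Sigma\mathbf{1}, g_1), \]

where $\mathbf{1}'\Sigma\mathbf{1} = \sigma_X^2 + \sigma_Y^2 + 2\rho\sigma_X\sigma_Y$. Moreover, any elliptical random vector $X$ can be written as $X = \mu + AY$, where $AA' = \Sigma$ and $Y$ has a spherical distribution; that is, an elliptical distribution with $\Sigma = I_N$, the identity matrix. Applied to our case, this means that we can write $X = \sigma_X Z$, $Y = \sigma_Y Z$ and $X + Y = \sqrt{\mathbf{1}'\Sigma\mathbf{1}}\, Z$ for $Z \sim El(0, 1, g_1)$. Finally, the subadditivity condition for $H_\alpha^{\exp}$ reduces to

\[
\begin{aligned}
& H_\alpha^{\exp}(X + Y) \le H_\alpha^{\exp}(X) + H_\alpha^{\exp}(Y) \\
\Leftrightarrow\ & \sqrt{\mathbf{1}'\Sigma\mathbf{1}}\, H_\alpha^{\exp}(Z) \le \sigma_X H_\alpha^{\exp}(Z) + \sigma_Y H_\alpha^{\exp}(Z) \\
\Leftrightarrow\ & \sqrt{\sigma_X^2 + \sigma_Y^2 + 2\rho\sigma_X\sigma_Y} \le \sigma_X + \sigma_Y,
\end{aligned}
\]

which is satisfied for any $\rho \in [-1, 1]$: squaring both sides gives $2\rho\sigma_X\sigma_Y \le 2\sigma_X\sigma_Y$, that is, $\rho \le 1$.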

2.8.1.4 Theorem 2.2

Proof. Let $X^\star := (X - \mu_X)/\sigma_X$ be the standardized copy of $X$. The second-order Gram-Charlier expansion of the density of $X^\star$ is defined as

\[ f^{GC}_{X^\star}(x) := \phi(x)\left[1 + \frac{H_3(x)}{3!}\,\zeta_X + \frac{H_4(x)}{4!}\,\kappa_X\right]. \tag{2.44} \]

The proof relies on special properties of the Hermite polynomials $H_i$. Those are defined in relation with the derivatives of the standard Gaussian pdf $\phi$:

\[ \frac{\partial^i \phi(x)}{\partial x^i} = (-1)^i H_i(x)\phi(x). \]

The first four polynomials are given by $H_1(x) = x$, $H_2(x) = x^2 - 1$, $H_3(x) = x^3 - 3x$ and $H_4(x) = x^4 - 6x^2 + 3$. They form an orthogonal system in the sense that

\[ \int H_i(x) H_j(x) \phi(x)\, dx = \begin{cases} i! & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases} \]

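To find the Gram-Charlier expansion of $H_\alpha(X)$, we want to have a similar system for the $\alpha$th power of $\phi$. One can check that $\phi^\alpha(x) = k\,\phi(\sqrt{\alpha}\,x)$ with $k = (2\pi)^{\frac{1-\alpha}{2}}$ and that

\[ \frac{\partial^i \phi(\sqrt{\alpha}\,x)}{\partial x^i} = (-1)^i \tilde{H}_i(x)\,\phi(\sqrt{\alpha}\,x), \]

with $\tilde{H}_1(x) = \alpha x$, $\tilde{H}_2(x) = \alpha^2 x^2 - \alpha$, $\tilde{H}_3(x) = \alpha^3 x^3 - 3\alpha^2 x$ and $\tilde{H}_4(x) = \alpha^4 x^4 - 6\alpha^3 x^2 + 3\alpha^2$. Hence, in relation with the original polynomials $H_i$, $\phi(\sqrt{\alpha}\,x)$ forms the system

\[ \int H_i(x) H_j(x) \phi(\sqrt{\alpha}\,x)\, dx = \begin{cases} C_i(\alpha) & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases} \]

Following algebraic manipulations, the first four coefficients $C_i(\alpha)$ are given by

\[
\begin{aligned}
C_1(\alpha) &= \alpha^{-3/2}, \\
C_2(\alpha) &= \frac{(\alpha - 2)\alpha + 3}{\alpha^{5/2}}, \\
C_3(\alpha) &= \frac{3\left(3(\alpha - 2)\alpha + 5\right)}{\alpha^{7/2}}, \\
C_4(\alpha) &= \frac{3\left(3\alpha^4 - 12\alpha^3 + 42\alpha^2 - 60\alpha + 35\right)}{\alpha^{9/2}}.
\end{aligned}
\]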
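The closed forms for the diagonal coefficients $C_i(\alpha)$ can be compared with a direct numerical evaluation of $\int H_i(x)^2 \phi(\sqrt{\alpha}\,x)\, dx$. The sketch below is a plain quadrature check with hypothetical helper names; the integration range and grid size are illustrative choices.

```python
import math

def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

HERMITE = {
    1: lambda x: x,
    2: lambda x: x ** 2 - 1.0,
    3: lambda x: x ** 3 - 3.0 * x,
    4: lambda x: x ** 4 - 6.0 * x ** 2 + 3.0,
}

def simpson(f, a, b, n=4000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def c_numeric(i, alpha):
    """Numerical value of the integral of H_i(x)^2 * phi(sqrt(alpha) * x)."""
    bound = 12.0 / math.sqrt(alpha)  # the integrand is negligible beyond this point
    root = math.sqrt(alpha)
    return simpson(lambda x: HERMITE[i](x) ** 2 * phi(root * x), -bound, bound)

def c_closed_form(i, a):
    """Closed-form coefficients C_i(alpha) as listed in the text."""
    if i == 1:
        return a ** -1.5
    if i == 2:
        return ((a - 2.0) * a + 3.0) / a ** 2.5
    if i == 3:
        return 3.0 * (3.0 * (a - 2.0) * a + 5.0) / a ** 3.5
    if i == 4:
        return 3.0 * (3.0 * a ** 4 - 12.0 * a ** 3 + 42.0 * a ** 2 - 60.0 * a + 35.0) / a ** 4.5
    raise ValueError("only i = 1, ..., 4 are covered")

for alpha in (0.3, 0.7, 1.0):
    print(alpha, [(c_numeric(i, alpha), c_closed_form(i, alpha)) for i in range(1, 5)])
```

For $\alpha = 1$ the coefficients reduce to $C_i(1) = i!$, which recovers the standard normalization of the Hermite polynomials.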

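Let us now first derive the Gram-Charlier expansion of

\[ I_\alpha(X^\star) := \int_{-\infty}^{+\infty} \left(f^{GC}_{X^\star}(x)\right)^\alpha dx. \tag{2.45} \]

Using the results above and the second-order Taylor series expansion $(1 + \varepsilon)^\alpha \approx 1 + \alpha\varepsilon + \frac{\alpha(\alpha-1)}{2}\varepsilon^2$, $I_\alpha(X^\star)$ approximates as

\[ I_\alpha(X^\star) \approx I_\alpha\big(\mathcal{N}(0,1)\big) + \frac{3k\alpha(\alpha-1)^2}{4!\,\alpha^{5/2}}\,\kappa_X + \frac{k\alpha(\alpha-1)C_3(\alpha)}{2 \times 3!^2}\,\zeta_X^2 + \frac{k\alpha(\alpha-1)C_4(\alpha)}{2 \times 4!^2}\,\kappa_X^2, \]

where $I_\alpha\big(\mathcal{N}(0,1)\big) = \sqrt{(2\pi)^{1-\alpha}/\alpha}$. Note that there is no $\zeta_X$ term left because

\[ \int \phi(\sqrt{\alpha}\,x) H_3(x)\, dx = 0. \]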

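Now, to finish, we need to get back to $H_\alpha(X) = \frac{1}{1-\alpha}\ln I_\alpha(X)$. We apply the Taylor series expansion

\[ \ln\!\left(I_\alpha\big(\mathcal{N}(0,1)\big) + \varepsilon\right) \approx \ln I_\alpha\big(\mathcal{N}(0,1)\big) + I_\alpha\big(\mathcal{N}(0,1)\big)^{-1}\varepsilon, \]

which finally gives the result in (2.18)–(2.19), noting that $H(X) = H(X^\star) + \ln \sigma_X$.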

2.8.1.5 Proposition 2.3

Consider i.i.d. copies $X^1, X^2, \dots, X^T$ of the random variable $X$. We denote $X^{(1:T)} \le X^{(2:T)} \le \cdots \le X^{(T:T)}$ the corresponding order statistics and define the associated $m$-spacings ($1 \le m < T$) as the sequence of non-negative differences $X^{(i+m:T)} - X^{(i:T)}$, $1 \le i \le T - m$.

In a first step, we build a 1-spacing estimator of $H_\alpha^{\exp}(X)$ because the case $m = 1$ has a natural relation to a sample-spacings estimator of the density $f_X$.

Recall that the order statistics $Y^{(1:T)}, \dots, Y^{(T:T)}$ of a uniform $U(0,1)$ random variable $Y$ follow a Beta distribution (Arnold et al. 1992). In particular,

\[ \mathbb{E}\left[Y^{(i:T)}\right] = \frac{i}{T+1}. \]

Let us now map $X^1, \dots, X^T$ through $F_X$ to obtain $T$ i.i.d. $U(0,1)$ random variables $Y^i := F_X(X^i)$. Obviously, the sequence $F_X(X^{(1:T)}), \dots, F_X(X^{(T:T)})$ agrees with the order statistics $Y^{(1:T)}, \dots, Y^{(T:T)}$, leading to:

\[ \mathbb{E}\left[Y^{(i:T)}\right] = \mathbb{E}\left[F_X\!\left(X^{(i:T)}\right)\right] = \mathbb{P}\!\left(X \le X^{(i:T)}\right) = \frac{i}{T+1}. \]

Hence, the expected probability mass between two order statistics $X^{(i:T)} \le X^{(i+1:T)}$ is

\[ \mathbb{E}\left(\int_{X^{(i:T)}}^{X^{(i+1:T)}} f_X(x)\, dx\right) = \frac{1}{T+1}. \tag{2.46} \]

One can use the key observation in (2.46) to obtain an estimator $\hat{f}_X$ of $f_X$ given the $T$ order statistics. Indeed, one can approximate $f_X$ between two successive order statistics $X^{(i:T)}, X^{(i+1:T)}$ by a constant $k_i$ such that the corresponding probability mass

\[ \int_{X^{(i:T)}}^{X^{(i+1:T)}} \hat{f}_X(x)\, dx = \int_{X^{(i:T)}}^{X^{(i+1:T)}} k_i\, dx = k_i\left(X^{(i+1:T)} - X^{(i:T)}\right) \]

agrees with the expected probability mass in (2.46). Denoting $X^{(0:T)} := \inf X$ and $X^{(T+1:T)} := \sup X$, this yields

\[ k_i = \frac{1}{(T+1)\left(X^{(i+1:T)} - X^{(i:T)}\right)} \]

for $X^{(i:T)} < x \le X^{(i+1:T)}$. As the $T + 1$ spacings form a partition of $\left(X^{(0:T)}, X^{(T+1:T)}\right]$, one can approximate the density $f_X$ by

\[ \hat{f}_X(x) = \sum_{i=0}^{T} \mathbb{1}_{\left\{X^{(i:T)} < x \le X^{(i+1:T)}\right\}}\, k_i. \tag{2.47} \]

This estimator corresponds to the histogram composed of $T + 1$ bins with bounds $[X^{(i:T)}, X^{(i+1:T)}]$, $0 \le i \le T$, and with height such that the area of each bin is equal to $1/(T + 1)$.

From this density estimator, one can derive a 1-spacing plug-in estimator of $H_\alpha^{\exp}$ as follows.

Lemma 2.1. Let the density $f_X$ be approximated by (2.47). Then, the 1-spacing plug-in estimator of $H_\alpha^{\exp}(X)$ is given by

\[ \left(\frac{1}{T+1}\sum_{i=0}^{T}\left((T+1)\left(X^{(i+1:T)} - X^{(i:T)}\right)\right)^{1-\alpha}\right)^{\frac{1}{1-\alpha}}. \tag{2.48} \]

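Proof. The 1-spacing plug-in estimator of $\int_{\Omega_X} f_X^\alpha(x)\, dx$ becomes

\[
\begin{aligned}
\int_{\Omega_X} \hat{f}_X^\alpha(x)\, dx &= \sum_{i=0}^{T} \int_{X^{(i:T)}}^{X^{(i+1:T)}} \hat{f}_X^\alpha(x)\, dx \\
&= \sum_{i=0}^{T} \frac{1}{\left((T+1)\left(X^{(i+1:T)} - X^{(i:T)}\right)\right)^{\alpha}} \int_{X^{(i:T)}}^{X^{(i+1:T)}} dx \\
&= \frac{1}{T+1}\sum_{i=0}^{T} \left((T+1)\left(X^{(i+1:T)} - X^{(i:T)}\right)\right)^{1-\alpha},
\end{aligned}
\]

resulting in (2.48).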

The estimator in (2.48) cannot be used as such because, in general, we do not know $X^{(0:T)}$ and $X^{(T+1:T)}$; that is, the support set of $X$. Following Learned-Miller and Fisher (2003), we therefore disregard the values below $X^{(1:T)}$ and above $X^{(T:T)}$, and compensate this by a factor $(T + 1)/(T - 1)$. This compensation factor comes from the fact that the expectation of the cdf differences

\[ \mathbb{E}\left[F_X\!\left(X^{(T:T)}\right) - F_X\!\left(X^{(1:T)}\right)\right] = (T - 1)/(T + 1) \]

for all cdf $F_X$; see Vrins (2007, p.251). This yields the final 1-spacing approximation

\[ \widehat{H}_\alpha^{\exp}(1, T) := \left(\frac{1}{T-1}\sum_{i=1}^{T-1}\left((T+1)\left(X^{(i+1:T)} - X^{(i:T)}\right)\right)^{1-\alpha}\right)^{\frac{1}{1-\alpha}}. \]

As detailed by Learned-Miller and Fisher (2003) in the specific case of Shannon entropy, the 1-spacing estimator suffers from a high variance. To reduce the asymptotic variance, one can consider an $m$-spacings estimator where the $m$-spacings overlap. The counterpart of $k_i$ in that case becomes

\[ k_i(m) := \frac{m}{(T+1)\left(X^{(i+m:T)} - X^{(i:T)}\right)} \]

for $X^{(i:T)} < x \le X^{(i+m:T)}$. However, because the $m$-spacings overlap, they do not form a partition of $\left(X^{(0:T)}, X^{(T+1:T)}\right]$ anymore (the same $x$ can fall in more than one $m$-spacing), and thus, we lose the correspondence with the density estimator as a weighted sum of indicators in (2.47). Still, from the definition of $k_i(m)$, we can consider this extension of $\widehat{H}_\alpha^{\exp}(1, T)$:

\[ \widehat{H}_\alpha^{\exp}(m, T) := \left(\frac{1}{T-m}\sum_{i=1}^{T-m}\left(\frac{T+1}{m}\left(X^{(i+m:T)} - X^{(i:T)}\right)\right)^{1-\alpha}\right)^{\frac{1}{1-\alpha}}, \]

which corresponds to (2.22). Taking the limit $\alpha \to 1$, we recover the original estimator of van Es (1992) and Learned-Miller and Fisher (2003) in (2.23).
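The $m$-spacings estimator above translates directly into a few lines of code. The sketch below is a minimal, unoptimized version with a hypothetical function name. It treats $\alpha = 1$ as the geometric-mean limit of the power mean, and it assumes the sample has no ties, since a zero spacing breaks the logarithm and any negative power.

```python
import math

def exp_renyi_entropy_mspacing(sample, alpha, m=1):
    """m-spacings estimate of the exponential Renyi entropy of order alpha."""
    x = sorted(sample)
    T = len(x)
    if not 1 <= m < T:
        raise ValueError("m must satisfy 1 <= m < T")
    # Scaled m-spacings (T + 1) / m * (X^(i+m:T) - X^(i:T)) for i = 1, ..., T - m.
    scaled = [(T + 1) / m * (x[i + m] - x[i]) for i in range(T - m)]
    if alpha == 1:
        # Limit alpha -> 1: the power mean of order 0 is the geometric mean.
        return math.exp(sum(math.log(s) for s in scaled) / len(scaled))
    power_mean = sum(s ** (1.0 - alpha) for s in scaled) / len(scaled)
    return power_mean ** (1.0 / (1.0 - alpha))

# Deterministic illustration: the expected uniform order statistics i / (T + 1).
grid = [i / 100 for i in range(1, 100)]
estimate_uniform = exp_renyi_entropy_mspacing(grid, alpha=0.3, m=5)
estimate_scaled = exp_renyi_entropy_mspacing([2.0 + 3.0 * v for v in grid], alpha=0.3, m=5)
print(estimate_uniform, estimate_scaled)
```

On this evenly spaced grid every scaled spacing equals one, so the estimate matches the exponential entropy of the $U(0,1)$ distribution, and an affine transformation of the data rescales the estimate by the absolute value of the slope.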

2.8.1.6 Proposition 2.4

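Proof. From Royston (1982), the density of $X^{(r:T)}$, the $r$th order statistic of $X$, is given by

\[ f_{X^{(r:T)}}(x) = \frac{T!}{(r-1)!\,(T-r)!}\,\left(F_X(x)\right)^{r-1}\left(1 - F_X(x)\right)^{T-r} f_X(x). \tag{2.49} \]

As we have $F_X(x) = F_{X^\star}\!\left(\frac{x-a}{b}\right)$ and $f_X(x) = f_{X^\star}\!\left(\frac{x-a}{b}\right)/|b|$, we find from (2.49) that the densities of $X^{(r:T)}$ and $X^{\star(r:T)}$ are related to one another according to

\[ f_{X^{(r:T)}}(x) = \frac{1}{|b|}\, f_{X^{\star(r:T)}}\!\left(\frac{x-a}{b}\right), \]

meaning that

\[ X^{(r:T)} = a + bX^{\star(r:T)}. \tag{2.50} \]

Replacing (2.50) in the expression of $\widehat{H}_\alpha(X; m, T)$, we can write

\[ \widehat{H}_\alpha(X; m, T) = \ln|b| + \widehat{H}_\alpha(X^\star; m, T). \tag{2.51} \]

Moreover, as $H_\alpha^{\exp}(X) = |b|\, H_\alpha^{\exp}(X^\star)$, we have $\ln|b| = H_\alpha(X) - H_\alpha(X^\star)$. Replacing this in (2.51) yields

\[
\begin{aligned}
& \widehat{H}_\alpha(X; m, T) = H_\alpha(X) - H_\alpha(X^\star) + \widehat{H}_\alpha(X^\star; m, T) \\
\Leftrightarrow\ & \widehat{H}_\alpha(X; m, T) - H_\alpha(X) = \widehat{H}_\alpha(X^\star; m, T) - H_\alpha(X^\star) \\
\Leftrightarrow\ & B\!\left(\widehat{H}_\alpha(X; m, T)\right) = B\!\left(\widehat{H}_\alpha(X^\star; m, T)\right),
\end{aligned}
\]

which completes the proof.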

2.8.2 Additional empirical results

In Table 2.3, we report two additional empirical results to complete the discussion related to Table 2.2. We focus on the MRE portfolio with $\alpha = 0.3$, which achieves the lowest turnover among all considered MRE portfolios, and the MV portfolio with $\widehat{\Sigma}_{SF}$ as covariance-matrix estimate, which achieves the largest Sharpe ratio and lowest turnover among all considered MV portfolios. First, we report the mean and volatility separately. As the table shows, the slight improvement in Sharpe ratio allowed by the MRE strategy comes from a slightly larger mean return, combined with a similar volatility. Second, given that the MRE portfolio exhibits more turnover than the MV portfolio, we evaluate the level of transaction costs for which the improvement in Sharpe ratio disappears. Specifically, following DeMiguel et al. (2009a, p.1930), the portfolio return at time $t + 1$, $P_{t+1}$, is adjusted as follows:

\[ P_{t+1} \leftarrow (1 + P_{t+1})\left(1 - c \times \sum_{i=1}^{N} \left|w_{i,t+1} - w_{i,t^+}\right|\right) - 1. \tag{2.52} \]

Table 2.3: Mean, volatility and break-even transaction cost for the minimum-variance and minimum Rényi entropy portfolios

| Dataset | Mean return (MRE) | Mean return (MV) | Volatility (MRE) | Volatility (MV) | Sharpe ratio (MRE) | Sharpe ratio (MV) | Break-even cost (in bps) |
|---|---|---|---|---|---|---|---|
| 6BTM | 0.126 | 0.124 | 0.149 | 0.149 | 0.843 | 0.833 | 990 |
| 25BTM | 0.136 | 0.132 | 0.138 | 0.139 | 0.985 | 0.951 | 642 |
| 6Mom | 0.117 | 0.112 | 0.152 | 0.152 | 0.768 | 0.738 | / |
| 25Mom | 0.127 | 0.124 | 0.136 | 0.137 | 0.936 | 0.904 | 931 |
| 10Ind | 0.120 | 0.119 | 0.120 | 0.121 | 0.995 | 0.983 | 165 |
| 17Ind | 0.120 | 0.120 | 0.128 | 0.128 | 0.938 | 0.935 | 34 |

Notes. The portfolios are computed with the methodology presented in Section 2.6.1. The mean return, volatility and Sharpe ratio are annualized. For the minimum Rényi entropy (MRE) portfolio, we fix $\alpha = 0.3$. For the minimum-variance (MV) portfolio, we take the shrinkage estimator of the covariance matrix of Ledoit and Wolf (2003) based on a single-factor target. The break-even cost is the value of the proportional transaction cost $c$ in (2.52) for which the MRE and MV portfolios have the same transaction-cost-adjusted Sharpe ratios.

We then compute the value of the proportional transaction cost $c$ (in basis points) for which the transaction-cost-adjusted Sharpe ratios of the MRE and MV portfolios are equal. As Table 2.3 shows, the value of $c$ is very large, except for the 17Ind dataset for which $c = 34$ bps. Thus, the Sharpe-ratio improvement is robust to the inclusion of transaction costs. The reason is that, as shown in Table 2.2, the MRE portfolio with $\alpha = 0.3$ does not have a much larger turnover than the MV portfolio and, moreover, we recompute the portfolio strategies only once per year. Note that, for the 6Mom dataset, there is no break-even solution for $c$ because, as Table 2.2 shows, the MRE portfolio has a larger Sharpe ratio and lower turnover than the MV portfolio.

Chapter 3

Optimal portfolio diversification via independent component analysis

Abstract A popular approach to enhance portfolio diversification is to use the factor-risk-parity portfolio, which is usually defined as the portfolio whose return variance is spread equally across the principal components (PCs) of asset returns. Although PCs are unique and useful for dimension reduction, they are an arbitrary choice because any rotation of the PCs remains uncorrelated. This is problematic because we demonstrate that any portfolio is the factor-variance-parity portfolio corresponding to a specific rotation of the PCs. To overcome this problem, we rely on the factor-risk-parity portfolio based on the independent components (ICs), which are the rotation of the PCs that is maximally independent. We demonstrate that using the IC-variance-parity portfolio helps to reduce the return kurtosis. We also show how to exploit the near independence of the ICs to parsimoniously estimate the factor-risk-parity portfolio based on modified Value-at-Risk. Finally, we empirically demonstrate that portfolios based on ICs outperform those based on PCs, and the minimum-variance portfolio.

Reference This chapter is partially based on: Lassance, N., DeMiguel, V., & Vrins, F. (2019). Optimal portfolio diversification via independent component analysis. London Business School working paper. Available on SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3285156.

3.1 Introduction

As discussed in Sections 1.1 and 1.3, the large estimation risk associated with traditional mean-variance portfolios has led researchers to consider alternative portfolio strategies that often aim to optimize portfolio diversification. In particular, mean-variance portfolios estimated from historical return data are usually not well diversified in terms of portfolio weights (Green and Hollifield 1992), and thus, researchers have proposed incorporating portfolio-weight diversification as an explicit objective. For instance, one can shrink the mean-variance portfolio toward the equally weighted portfolio, which is the most diversified portfolio in terms of weights (DeMiguel et al. 2009b, Tu and Zhou 2011). However, the equally weighted portfolio is not optimally diversified in terms of risk when some of the assets are riskier than others.

For this reason, investment managers often prefer to diversify their portfolios in terms of asset risk contributions rather than in terms of asset weights. This leads to the so-called asset-risk-parity portfolio, which is the portfolio whose risk is spread equally across the different assets; see Section 1.3.1. However, the asset-risk-parity portfolio is not optimally diversified in terms of risk unless the asset returns are uncorrelated.

To address this weakness of the asset-risk-parity portfolio, researchers have turned to the factor-risk-parity portfolio, which is the portfolio whose risk is equally spread among a set of uncorrelated factors; see Section 1.3.2. To implement this portfolio, one needs to select a specific set of uncorrelated factors. The standard choice introduced by Meucci (2009) is to rely on the principal components (PCs) of asset returns, provided by principal component analysis (PCA).

Although the PCs are useful for dimension reduction (Campbell et al. 1997, Xu et al. 2016), they are problematic for factor-risk parity for two main reasons.

1. Relying on the PCs is arbitrary because once dimension has been reduced, any further rotation of the PCs remains uncorrelated, and thus, is equally valid for factor-risk-parity purposes. This is worrying because as we demonstrate, any portfolio is a factor-variance-parity portfolio for factors corresponding to a specific rotation of the PCs.

2. The PCs are not optimal because they are merely uncorrelated, rather than independent. Correlation is insufficient, in particular, when asset returns are not Gaussian. For instance, Poon et al. (2004, p.583) argue that “[correlation] assumes a linear relationship and a multivariate Gaussian distribution, which might lead to a significant underestimation of the risk from joint extreme events.” Thus, just as relying on assets is suboptimal for risk parity because they are correlated, relying on PCs is suboptimal because they may feature higher-moment dependencies.

To overcome these two challenges, we propose using the factor-risk-parity portfolio based on maximally independent factors instead of merely uncorrelated ones. This is achieved via independent component analysis (ICA), an extension of PCA that identifies the rotation of the PCs that results in factors as independent as possible (Hyvärinen et al. 2001). Independence is particularly relevant in portfolio selection because asset-return distributions often display fat tails and non-linear dependencies; see Section 1.2. The ICs overcome the two aforementioned challenges related to PCA: they discriminate among all the existing uncorrelated factors and they enhance diversification by accounting for higher-moment dependencies. In addition, we show that the ICs allow one to parsimoniously deal with higher-moment risk measures.

Our contribution is fourfold. First, we demonstrate that the arbitrariness of the PCs is worrying in the context of factor-risk parity. To do this, we first provide a closed-form expression for the factor-variance-parity portfolio corresponding to a generic rotation of the PCs. We then use this closed-form expression to show that any portfolio is a factor-variance-parity portfolio corresponding to a specific rotation of the PCs. Thus, the decorrelation criterion is not sufficient to discriminate among all the possible portfolios and, although they are unique, the PCs are an arbitrary choice because any rotation of them remains uncorrelated.

Second, to overcome the difficulties associated with the PC-risk-parity portfolio, we propose relying instead on the independent components. We theoretically demonstrate that, assuming the asset returns are given by a linear combination of K independent factors, using the IC-variance-parity portfolio helps to reduce the return kurtosis. To do this, we first characterize the minimum and maximum excess kurtosis achievable by any portfolio. Then, we show that the excess kurtosis of the IC-variance-parity portfolio return is 1/K times the arithmetic mean of the excess kurtosis of the ICs, and that it is often close to the minimum achievable excess kurtosis. Thus, diversifying the portfolio-return variance across ICs provides a natural way of reducing the portfolio-return tail risk.
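The $1/K$ scaling can be illustrated with the additivity of cumulants under independence. The sketch below assumes exactly independent factors with unit variance and a portfolio return written as a linear combination of them with given exposures; the function name and the numbers are hypothetical.

```python
def combination_excess_kurtosis(exposures, factor_excess_kurtosis):
    """Excess kurtosis of sum_i theta_i * Y_i for independent unit-variance factors Y_i."""
    variance = sum(theta ** 2 for theta in exposures)
    fourth_cumulant = sum(
        theta ** 4 * kappa for theta, kappa in zip(exposures, factor_excess_kurtosis)
    )
    return fourth_cumulant / variance ** 2

# Hypothetical example with K = 4 factors and equal variance contributions.
kappas = [3.0, 6.0, 1.0, 2.0]
K = len(kappas)
parity_exposures = [0.5] * K  # theta_i^2 = 1 / K, so the total variance equals 1
parity_kurtosis = combination_excess_kurtosis(parity_exposures, kappas)
mean_kurtosis_over_k = (sum(kappas) / K) / K
print(parity_kurtosis, mean_kurtosis_over_k)
```

Under variance parity each exposure satisfies $\theta_i^2 = \sigma_P^2 / K$, so the fourth cumulant is $(\sigma_P^4 / K^2) \sum_i \kappa_i$, and dividing by $\sigma_P^4$ gives $1/K$ times the mean factor excess kurtosis.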

Third, we show that the ICs provide a parsimonious way to estimate higher moments of portfolio returns. Indeed, although decorrelation suffices to decompose the portfolio-return variance as a sum of factor variance exposures, independence is required to obtain a similar decomposition for higher moments. For the case where the ICs are sufficiently independent, the higher comoments of ICs are negligible, and this greatly reduces the number of parameters to be estimated. We exploit this fact to show how to parsimoniously estimate an IC-risk-parity portfolio based on the modified VaR.

Fourth, we evaluate the out-of-sample performance of the shrinkage portfolios obtained by combining minimum-risk portfolios and factor-risk-parity portfolios. We use the linear shrinkage of Kan and Zhou (2007) and Tu and Zhou (2011) with the shrinkage intensity estimated via cross-validation (Hastie et al. 2001). We find that the portfolios based on ICs substantially outperform those based on PCs in terms of Sharpe ratio and tail risk. In addition, IC-based portfolios are more stable than PC-based ones as demonstrated by their lower turnover. The IC-based shrinkage portfolios also outperform the traditional minimum-risk portfolios.

Our work is connected to four streams of literature. First, the literature on risk-parity portfolios, and particularly on factor-risk-parity portfolios; see Section 1.3. We extend the previous studies using PCs for factor-risk parity (Meucci 2009, Deguest et al. 2013, Lohre et al. 2012, 2014) by allowing for dimension reduction, emphasizing the importance of the selection of factors, proposing an alternative to the PCs based on ICA and considering higher-moment risk measures. Most papers on factor-risk parity are published in practitioner journals, which is a reflection of the lack of clear theoretical foundations for this framework; see Section 1.3.3. We stress this even further by showing that relying on the PCs as factors is arbitrary in the sense that any rotation of the PCs remains uncorrelated and yields a different factor-risk-parity portfolio. In contrast, we show that the IC-variance-parity portfolio provides a natural way of reducing the portfolio-return kurtosis. Thus, the factor-risk-parity portfolio based on ICs is a theoretically well-founded higher-moment portfolio strategy.

Second, this chapter contributes to the literature on higher-moment portfolio selection by showing theoretically and empirically that IC-risk-parity portfolios help improve tail risk. Many existing higher-moment portfolios are obtained by a direct approach that finds a portfolio on the higher-moment efficient surface, and thus, requires estimating the high-dimensional higher-comoment matrices; see Section 1.2.1. In contrast, the IC-risk-parity portfolio is an indirect approach because it does not optimize the portfolio-return higher moments but still helps to improve them. It belongs to the fourth higher-moment approach in Section 1.2.3. Consequently, it is much more parsimonious because, in addition to the covariance matrix, one just needs to estimate the rotation matrix determining the ICs (made of $K(K - 1)/2$ distinct entries only) and, for the modified VaR risk measure, the marginal skewness and kurtosis of the ICs.

Third, this chapter is related to the literature on portfolio selection under parameter uncertainty (see Section 1.1.3) and, in particular, to approaches that use shrinkage estimators of portfolio weights. We introduce shrinkage portfolios that shrink the sample minimum-risk portfolio toward the IC-risk-parity portfolio, using variance and modified VaR as risk measures. In contrast to DeMiguel et al. (2009b) and Tu and Zhou (2011) who shrink the sample mean-variance portfolio toward the naively diversified equally weighted portfolio, we shrink the sample minimum-variance portfolio toward a portfolio with equally weighted exposures on the ICs.

Fourth, this chapter is related to the recent literature that applies independent component analysis to factor analysis of financial time series; see, for instance, Chen et al. (2010), García-Ferrer et al. (2012) and Fabozzi et al. (2016). To the best of our knowledge, the only other papers in which ICA is applied in a portfolio-selection context are Chen et al. (2007), who rely on ICA for the computation of the Value-at-Risk of portfolio returns, and Hitaj et al. (2015), who propose maximizing the expected CARA utility of an investor, estimated parsimoniously by assuming the ICs are independent. Our work is fundamentally different because we focus on the use of ICs to define factor-risk-parity portfolios that are optimally diversified.

The chapter is organized as follows. Section 3.2 reviews PCA and ICA. Section 3.3 shows that factor decorrelation is not a discriminant criterion for factor-variance parity. Section 3.4 shows that factor-variance parity based on ICs helps reduce the portfolio-return kurtosis. Section 3.5 considers the IC-risk-parity approach with modified VaR as risk measure. Section 3.6 evaluates the out-of-sample empirical performance of the proposed shrinkage portfolios. Section 3.7 concludes. The Appendix at the end of the chapter contains proofs for all results, as well as supplementary materials.

3.2 PCA versus ICA: from decorrelation to independence

In this section, we first review principal component analysis (PCA), a standard dimension-reduction technique that generates a specific basis of uncorrelated factors. We argue that, although the principal components (PCs) are unique, any rotation of them remains uncorrelated, and thus, the PCs are an arbitrary choice for factor-risk parity. We then introduce independent component analysis (ICA), which extends PCA by rotating the PCs to make them as independent as possible. This allows one to discriminate among all bases of uncorrelated factors.

3.2.1 Principal component analysis

The covariance matrix of asset returns, $\Sigma$, can be diagonalized as $\Sigma = V_N \Lambda_N V_N'$, where the $N$ columns of $V_N$ are the eigenvectors, and $\Lambda_N := \text{diag}(\lambda_1, \dots, \lambda_N)$ is the diagonal matrix made of the (strictly positive) eigenvalues sorted in decreasing order. This factorization is at the heart of PCA; see, for instance, Section 6.4 of Campbell et al. (1997). To simplify the notation, we denote $\Lambda$ the $K \times K$ diagonal matrix containing the $K$ largest eigenvalues, and $V$ the $N \times K$ matrix whose columns contain the corresponding eigenvectors.

Definition 3.1 (PCA). The PC basis is the orthonormal basis formed by the columns of $V\Lambda^{-1/2}$; that is, the first $K$ standardized eigenvectors of $\Sigma$. The principal components (PCs) are the projection of the asset returns $X$ onto the PC basis:

\[ Y^\star := \Lambda^{-1/2} V' X. \tag{3.1} \]

PCA is useful for dimension reduction because the first few PCs explain most of the variance in asset returns. For example, Kozak et al. (2019) show that a few principal components are sufficient to explain the cross section of stock returns. Once dimension has been reduced, however, it is the fact that PCs are uncorrelated that makes them interesting for factor-risk parity, which spreads the portfolio-return risk across a set of uncorrelated factors. Yet there is no sound motivation behind this particular choice because any rotation of the PCs remains uncorrelated, and therefore is as suitable for factor-risk parity as the PCs. To see this, let $SO(K)$ be the special orthogonal group of order $K$; that is, the set of $K \times K$ rotation matrices $R$ such that $RR' = I_K$ and $\det(R) = 1$. Then, for any $R \in SO(K)$, the factors $Y = RY^\star$ remain uncorrelated:

\[ \Sigma_Y = R\,\Sigma_{Y^\star} R' = R I_K R' = RR' = I_K. \]
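The invariance of decorrelation under rotation can be made concrete in a two-asset setting, where the eigendecomposition is available in closed form. The sketch below uses a hypothetical covariance matrix and hypothetical helper names; it builds the loadings $\Lambda^{-1/2}V'$ of (3.1), applies an arbitrary rotation, and computes the factor covariance in both cases.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def pc_loadings(cov):
    """Return Lambda^{-1/2} V' for a 2 x 2 positive-definite covariance matrix."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    v = rotation(0.5 * math.atan2(2.0 * b, a - d))  # columns are the eigenvectors
    centre, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    eigenvalues = [centre + radius, centre - radius]  # sorted in decreasing order
    v_t = transpose(v)
    return [[v_t[i][j] / math.sqrt(eigenvalues[i]) for j in range(2)] for i in range(2)]

def factor_covariance(loadings, cov):
    """Covariance matrix of the factors Y = loadings * X."""
    return matmul(matmul(loadings, cov), transpose(loadings))

sigma = [[4.0, 1.2], [1.2, 1.0]]            # hypothetical asset covariance matrix
w_pc = pc_loadings(sigma)                    # loadings of the PCs
w_rot = matmul(rotation(0.7), w_pc)          # loadings of an arbitrary rotation of the PCs
cov_pc = factor_covariance(w_pc, sigma)
cov_rot = factor_covariance(w_rot, sigma)
print(cov_pc, cov_rot)
```

Both factor covariance matrices equal the identity up to rounding, although the two sets of loadings differ, which is the indeterminacy discussed in this section.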

Hence, as far as factor-risk parity is concerned, there is no reason to believe a priori that the $K$ PCs would be superior to any of their rotations. Using the PCs amounts to making the particular choice of uncorrelated factors associated with $R = I_K$, a decision that looks arbitrary.

Therefore, there is an indeterminacy in the standard formulation of factor-risk parity because it does not motivate which orthonormal basis one must consider when performing risk parity. This would not be a problem if performing risk parity over any orthonormal basis led to the same portfolio. However, as we show in Section 3.3.2, any portfolio is the factor-variance-parity portfolio associated with a particular rotation of the PCs. This suggests that one should think carefully about which set of uncorrelated factors, among the infinitely many rotations of the PCs, one should consider for factor-risk parity.

3.2.2 Independent component analysis

To resolve the arbitrariness inherent in the decorrelation criterion and discriminate among all the orthonormal bases, one can look for the orthonormal basis that gives factors as independent as possible. Contrary to the PC basis, this maximally independent basis is not arbitrary and, as we show in Section 3.4.2, it enhances higher-moment diversification when used for factor-risk parity.

Independence is a much stronger requirement than decorrelation; it requires decorrelation for every (non-linear) transformation. In particular, independence not only requires the covariance matrix to be diagonal, but all the higher comoments must vanish as well. A standard criterion to measure the “distance to independence” of a random vector is mutual information.

Definition 3.2 (Mutual information). The mutual information of $Y$ is the Kullback-Leibler divergence between the joint density and the product density of $Y$:

\[ I(Y) := \left\langle f_Y, \prod_{i=1}^{K} f_{Y_i} \right\rangle = \int_y f_Y(y) \ln \frac{f_Y(y)}{\prod_{i=1}^{K} f_{Y_i}(y_i)}\, dy. \tag{3.2} \]

Mutual information is always non-negative, and it is zero if and only if the components of Y are mutually independent; see Cover and Thomas (2006, p.253). Thus, one can extend the notion of orthonormal basis to that of IC basis by using mutual information instead of linear correlation as dependence measure.
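A discrete analogue makes these two properties easy to see. The sketch below replaces the densities in (3.2) with a joint probability table for two variables, which is a simplification made only for illustration; the function name and the two tables are hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint pmf given as a 2-D list."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # cells with zero probability contribute nothing
                total += p * math.log(p / (row_marginal[i] * col_marginal[j]))
    return total

# Joint pmf equal to the product of its marginals (0.4, 0.6) and (0.3, 0.7).
independent = [[0.12, 0.28], [0.18, 0.42]]
# Joint pmf in which the second variable is a copy of the first.
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The first table gives zero mutual information up to rounding, while the second gives $\ln 2$, the entropy of a fair binary variable.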

Definition 3.3 (ICA). The independent components (ICs) are the rotation of the PCs that minimize the mutual information:

\[ Y^\dagger := R^\dagger Y^\star, \quad R^\dagger \in \mathcal{R}_I, \tag{3.3} \]

where $\mathcal{R}_I$ is the set

\[ \mathcal{R}_I := \underset{R \in SO(K)}{\text{argmin}}\ I(RY^\star). \tag{3.4} \]

The corresponding basis, formed by the columns of $V\Lambda^{-1/2}R^{\dagger\prime}$, is the IC basis.

Note that mutual information is invariant with respect to a change of sign and permutation of the factors. Therefore, just like the set of PCs, the set of ICs is determined only up to a change of sign and permutation. However, as we show in Section 3.3, this indeterminacy is irrelevant for factor-risk-parity purposes. Moreover, in general, the ICs are not perfectly independent; they are the maximally independent factors found by a linear transformation of the asset returns.

In what follows, we assume that at most one of the K principal components is Gaussian. This is a mild assumption because it holds in the data. In particular, we find that for the equity-return data used in Section 3.6, the PCs have significantly positive excess kurtosis, and thus, are far from being Gaussian. The reason we require this assumption is that, if more than one PC is Gaussian, any rotation of the Gaussian PCs remains jointly Gaussian and uncorrelated—hence, mutually independent—and is thus an IC basis. Therefore, in this case, the IC basis is not unique anymore, even up to a change of sign and permutation. This is because higher-moment information is needed to discriminate the maximally independent factors, and thus ICA is not useful in the purely Gaussian case. If the non-Gaussianity assumption is violated, the ICs are arbitrary like PCs, but only in the subset of purely Gaussian PCs.

The IC basis is computed using independent component analysis (ICA), a well-known technique in machine learning; see Hyvärinen et al. (2001) and Vrins (2007). In general, the rotation matrix $R^\dagger$ associated with the ICs is found via adaptive techniques such as gradient descent. To that aim, we rely on the FastICA algorithm of Hyvärinen (1999), a robust algorithm whose optimization criterion is derived from the mutual information and that provides $R^\dagger$ in a fraction of a second. Details about this algorithm are provided in Appendix 3.8.2 at the end of the chapter.¹ It is interesting to note that the mutual-information criterion in (3.2) is closely connected with Shannon entropy. Indeed, it is easy to show that (Cover and Thomas 2006, p.251)

$$ I(Y) = \sum_{i=1}^K H(Y_i) - H(Y), $$

where $H(X)$ is the Shannon entropy of $X$ in (2.2). Because $H(AX) = H(X) + \ln|\det(A)|$ for any square matrix $A$, and $|\det(R)| = 1$ for any rotation matrix $R$, the mutual information to be minimized in (3.4) becomes $I(RY^\star) = \sum_{i=1}^K H(R_i Y^\star) - H(Y^\star)$, where $R_i$ is the $i$th row of $R$. The second term does not depend on $R$, and thus, finding $R \in SO(K)$ minimizing $I(RY^\star)$ is equivalent to minimizing the sum of the Shannon entropies of the factors:

$$ \mathcal{R}_I = \operatorname*{argmin}_{R \in SO(K)} \left\{ I(RY^\star) \right\} = \operatorname*{argmin}_{R \in SO(K)} \left\{ \sum_{i=1}^K H(R_i Y^\star) \right\}. \tag{3.5} $$

¹ Specific ICA algorithms have been designed to efficiently tackle high-dimensional problems. This is for instance the case of the recent PICARD algorithm of Ablin et al. (2018), which is a quasi-Newton algorithm that learns the curvature from past iterations.

Because the Gaussian distribution is the most-entropic distribution for a fixed variance, this means that the ICs are the least-Gaussian factors. This is quite intuitive: the central limit theorem roughly tells us that “mixing Gaussianizes”, and thus, the independent factors can be demixed from the asset returns by maximizing the departure from Gaussianity.
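The "least-Gaussian rotation" idea can be sketched on toy data. The snippet below is not FastICA and does not estimate mutual information: as a stand-in, it uses a simple kurtosis-based contrast (the sum of squared sample excess kurtoses) and a grid search over the rotation angle. The two sources, the sample size, the angle `theta_true` and all function names are hypothetical choices made for illustration, and the synthetic "PCs" are built directly as a rotation of unit-variance independent sources rather than by an actual PCA.

```python
import math
import random


def rotation(theta):
    """Rotation matrix R(theta) as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def rotate(R, data):
    """Apply the 2x2 matrix R to each observation (y1, y2)."""
    return [(R[0][0] * a + R[0][1] * b, R[1][0] * a + R[1][1] * b) for a, b in data]


def excess_kurtosis(x):
    """Sample excess kurtosis of a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2 - 3.0


def contrast(data):
    """Sum of squared excess kurtoses: larger means further from Gaussian."""
    return sum(excess_kurtosis(col) ** 2 for col in zip(*data))


rng = random.Random(0)
n = 2000
# Two independent, unit-variance, non-Gaussian toy sources (hypothetical).
sources = [(rng.uniform(-math.sqrt(3), math.sqrt(3)), rng.choice((-1.0, 1.0)))
           for _ in range(n)]
theta_true = 0.35
# Synthetic "PCs": a rotation of the sources, which keeps them uncorrelated.
pcs = rotate(rotation(-theta_true), sources)

# Grid search over [0, pi/2); the contrast is periodic with period pi/2.
grid = [k * (math.pi / 2) / 90 for k in range(90)]
theta_hat = max(grid, key=lambda t: contrast(rotate(rotation(t), pcs)))
print(f"true angle {theta_true:.3f}, recovered angle {theta_hat:.3f} (mod pi/2)")
```

The angle is only identified modulo $\pi/2$, which mirrors the sign-and-permutation indeterminacy of the ICs.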

3.2.3 Numerical illustration

In Examples 3.1 and 3.2 below, we illustrate the arbitrariness of the decorrelation criterion for selecting a basis of factors, in contrast to the independence (mutual-information) criterion. We focus on the case $K = N = 2$ for graphical illustration purposes. We parametrize the rotation matrix $R \in SO(2)$ as a function of the rotation angle $\theta$:

$$ R = R(\theta) := \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \in SO(2). \tag{3.6} $$

The examples are illustrated with polar plots, in which the standard Cartesian coordinates $(x, y)$ are replaced by polar coordinates $(\theta, r)$, where $\theta$ is the angle in (3.6) and $r$ is the value taken by the variable of interest, measured by the radial distance from the origin $(\theta, 0)$.
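A minimal sketch of why the correlation criterion cannot discriminate among rotations: since the PCs have identity covariance, the covariance of $Y = R(\theta)Y^\star$ is $R(\theta)R(\theta)' = I_2$ for every angle. The helper names below are ours.

```python
import math


def rotation(theta):
    """Rotation matrix R(theta) from (3.6), as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def covariance_after_rotation(theta):
    """Covariance R I R' of Y = R(theta) Y*, given identity covariance of Y*."""
    R = rotation(theta)
    return [[sum(R[i][k] * R[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


for theta in (0.0, 0.35, math.pi / 3):
    cov = covariance_after_rotation(theta)
    print(round(theta, 3), [[round(x, 12) for x in row] for row in cov])
```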

Example 3.1 (Independent non-Gaussian asset returns). Consider two assets with returns independently distributed as Student $t$ distributions: $X_1 \sim T(\nu_1 = 3) \perp X_2 \sim T(\nu_2 = 20)$. Because the two asset returns are uncorrelated and $X_1$ has a higher variance than $X_2$, the PCs coincide with the standardized asset returns; that is, $Y^\star = \Lambda^{-1/2}X$. The left plot of Figure 3.1 depicts how the relative mutual information² and correlation of the factors $Y = R(\theta)Y^\star$, obtained by rotating the PCs, depend on the rotation angle $\theta$. Observe that the correlation is zero for all $\theta$, but that the relative mutual information is zero only when $\theta = k\pi/2$, $k \in \mathbb{Z}$; that is, the ICs $Y^\dagger$ coincide with the (standardized) asset returns up to sign and permutation. Thus, while the decorrelation criterion is not discriminating, the independence criterion results in a unique factor basis.

² The relative mutual information is a standardized version of the mutual information taking values in $[0, 1]$. It is defined as $\bar{I}(Y) := K I(Y) / \sum_{i=1}^K H(Y_i)$, where $H(Y_i)$ is the Shannon entropy of $Y_i$.

Figure 3.1: Mutual information versus correlation. (a) Independent non-Gaussian asset returns. (b) Dependent non-Gaussian asset returns. [Two polar plots; angular axis $\theta$, radial axis $r$ with gridlines at 0, 0.05 and 0.1.]

Notes. The figure depicts the relative mutual information (in solid blue) and correlation (in dashed red) of the $K = N = 2$ factors $Y$ obtained as a rotation of the principal components, $Y = R(\theta)Y^\star$. The left figure considers the case with independent non-Gaussian asset returns and the right figure considers the case with dependent non-Gaussian asset returns, corresponding to Examples 3.1 and 3.2, respectively. The plots are in polar coordinates $(\theta, r)$, where $\theta$ is the angle determining the rotation matrix $R(\theta)$ defined in (3.6) and $r$ is the radial distance from the origin $(\theta, 0)$. The principal components correspond to $\theta = 0$.

Example 3.2 (Dependent non-Gaussian asset returns). Consider now a joint distribution of two asset returns obtained by mixing, via a Gaussian copula of correlation $\rho = 0.2$, two independent Student $t$ random variables with degrees of freedom $(\nu_1, \nu_2) = (3, 20)$. The PCs $Y^\star$ are obtained by rotating the two asset returns by an angle of $-0.17$ radians; that is, $V = R(-0.17)$. The right plot of Figure 3.1 depicts how the relative mutual information and correlation of the factors $Y = R(\theta)Y^\star$, obtained by rotating the PCs, depend on the rotation angle $\theta$. Independence is never achieved because the asset returns are nonlinear functions of random variables with non-Gaussian distributions. Nevertheless, the mutual information varies with $\theta$ and this allows one to discriminate between the continuum of uncorrelated factors. In particular, we find that the ICs are obtained for $\theta^\dagger = 0.35 + k\pi/2 \neq 0$, $k \in \mathbb{Z}$, and thus, the ICs $Y^\dagger = R(\theta^\dagger)Y^\star$ differ from the PCs $Y^\star$.

Finally, to gauge the practical relevance of the ICs, we use three empirical datasets to evaluate the gain in independence that can be achieved by moving from the PCs to the ICs. We consider the three equity-return datasets from Kenneth French's library that we use in Section 3.6: six size and book-to-market portfolios (6BTM), six size and operating-profitability portfolios (6Prof) and 30 U.S. industry portfolios (30Ind). Figure 3.2 shows that the gain in independence, measured by $I(Y^\dagger) - I(Y^\star)$, is particularly substantial around the 2007-2008 financial crisis, a period characterized by pronounced tail risk and extreme-value dependence of asset returns; that is, by pronounced non-Gaussianity.

Figure 3.2: Mutual information of principal versus independent components. [Line plot; x-axis: window midyear, from 06/1980 to 2015; y-axis: $I(Y^\dagger) - I(Y^\star)$, from $-0.3$ to $0$.]

Notes. The figure depicts the time evolution of the difference in mutual information between the $K = 2$ independent and principal components, $I(Y^\dagger) - I(Y^\star)$, for three datasets: 6BTM in solid blue, 6Prof in dashed red and 30Ind in dotted green. The mutual-information difference is computed every six months using an estimation window containing the previous five years of daily asset returns. The x-axis labels indicate the window midyear.

3.3 Factor-variance parity via uncorrelated factors

The factor-variance-parity (FVP) portfolio is the portfolio for which each uncorrelated factor contributes equally to its return variance. In Section 3.5, we tackle the case of higher-moment risk measures. The original proposal of Meucci (2009) is to diversify the portfolio-return variance on the PC basis. As discussed in Section 3.2.1, this choice appears arbitrary because the only motivation for using the PCs is that they are uncorrelated, but any rotation of the PCs remains uncorrelated, and is thus equally valid for FVP purposes. We divide this section into two parts. First, we derive the FVP portfolio for an arbitrary rotation of the PCs. Second, we formally establish the arbitrariness of the decorrelation criterion by showing that any portfolio is a FVP portfolio on a set of factors obtained by rotating the PCs.

3.3.1 Derivation of the factor-variance-parity portfolios

We consider below factors given by an arbitrary rotation of the $K$ principal components, $Y = RY^\star$. Contrary to Meucci (2009) and Deguest et al. (2013), we do not restrict ourselves to the full-dimensional case $K = N$. Allowing for $K \le N$ reduces dimension and alleviates the impact of estimation error (Hastie et al. 2001).

To proceed, we rewrite the portfolio return as a linear combination of the factors $Y$. Defining $\hat{P} := \tilde{w}(R)'Y$ with $\tilde{w}(R)$ the factor exposures, we have that $\hat{P} = \tilde{w}(R)'R\Lambda^{-1/2}V'X$, and thus, $w$ is retrieved from the factor exposures $\tilde{w}(R)$ as

$$ w = V\Lambda^{-1/2}R'\tilde{w}(R). \tag{3.7} $$

Taking the pseudoinverse leads to the following definition of $\tilde{w}(R)$:

$$ \tilde{w}(R) := R\Lambda^{1/2}V'w. \tag{3.8} $$

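Consistent with the requirement $w \in \mathcal{W}$ in (1.2), the factor exposures $\tilde{w}(R)$ belong to

$$ \widetilde{\mathcal{W}}(R) := \left\{ \tilde{w}(R) \in \mathbb{R}^K \;\middle|\; V\Lambda^{-1/2}R'\tilde{w}(R) \in \mathcal{W} \right\}. \tag{3.9} $$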

We have that $\hat{P} = w'VV'X$. Thus, $\hat{P} \neq P$, except when no dimension reduction is performed ($K = N$). We call $\hat{P}$ the reduced portfolio return.

Observe now that because $\Sigma_Y = I_K$, the reduced-portfolio-return variance becomes

$$ \sigma_{\hat{P}}^2 = \sum_{i=1}^K \tilde{w}_i(R)^2 = \|\tilde{w}(R)\|^2, \tag{3.10} $$

which allows us to define the FVP portfolio as follows.

Definition 3.4 (Factor-variance-parity portfolio). A factor-variance-parity (FVP) portfolio associated with the factors $Y = RY^\star$ is a portfolio for which $\tilde{w}_i(R)^2 = \tilde{w}_j(R)^2$ for all $i, j = 1, \ldots, K$.

Note that variance parity is achieved by allocating equal squared exposures because the factors are standardized to have unit variance. In the next proposition, we give an analytical expression for the FVP portfolios.

Proposition 3.1. Given a rotation matrix $R \in SO(K)$, the following holds:

(i) There are $2^{K-1}$ factor-variance-parity portfolios with respect to the factors $Y = RY^\star$, given by the set

$$ \mathcal{W}_{FVP}(R) := \left\{ w \in \mathcal{W} \;\middle|\; w = \frac{V\Lambda^{-1/2}R'\mathbf{1}_K^{\pm}}{\mathbf{1}'V\Lambda^{-1/2}R'\mathbf{1}_K^{\pm}} \right\}, \tag{3.11} $$

where $\mathbf{1}_K^{\pm}$ is any $K$-dimensional vector of $\pm 1$'s;

(ii) The set of portfolios $\mathcal{W}_{FVP}(R)$ is invariant to a change of sign and permutation of rows of $R$: $\mathcal{W}_{FVP}(R) = \mathcal{W}_{FVP}(BR)$ for all rotation matrices $B$ of the form $B = I_K^{\pm}\Pi$ with $I_K^{\pm}$ a diagonal matrix of $\pm 1$'s and $\Pi$ a permutation matrix;

(iii) The factor-variance-parity portfolio with minimum return variance $\sigma_P^2$ is unique and obtained by considering the particular case $\mathbf{1}_K^{\pm} \leftarrow \mathbf{1}_K^{MV}$ in (3.11), where

$$ \mathbf{1}_K^{MV} := \operatorname{sign}(R\Lambda^{-1/2}V'\mathbf{1}). \tag{3.12} $$

Point (i) shows that there are $2^{K-1}$ FVP portfolios. This is a natural consequence of the fact that a portfolio is a FVP portfolio if all the squared factor exposures are equal; their signs are irrelevant. This holds for any choice of factor basis, including the PC and IC bases. Point (ii) shows that the sign and order of the uncorrelated factors considered is irrelevant for factor-variance parity. Point (iii) shows that there is a single FVP portfolio minimizing the portfolio-return variance, which allows us to choose one of the $2^{K-1}$ FVP portfolios.
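The three points of Proposition 3.1 can be checked on a toy case. The sketch below assumes $K = N = 2$ uncorrelated assets, so that $V = I$ and $\Lambda = \operatorname{diag}(\sigma_1^2, \sigma_2^2)$; the volatilities, the angle and all function names are hypothetical. It enumerates the sign vectors in (3.11), keeps the distinct portfolios, and compares the minimum-variance one with the sign rule in (3.12).

```python
import itertools
import math


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def fvp_portfolios(sigmas, theta):
    """Map each distinct FVP weight vector from (3.11) to a sign vector generating it.

    Assumes K = N = 2 uncorrelated assets: V = I and Lambda = diag(sigma_i^2).
    """
    R = rotation(theta)
    found = {}
    for signs in itertools.product((1.0, -1.0), repeat=2):
        # i-th entry of Lambda^{-1/2} R' 1_K^{+-}
        num = [sum(R[k][i] * signs[k] for k in range(2)) / sigmas[i] for i in range(2)]
        total = sum(num)
        weights = tuple(round(x / total, 12) for x in num)
        found.setdefault(weights, signs)  # s and -s give the same portfolio
    return found


def variance(weights, sigmas):
    """Portfolio-return variance for uncorrelated assets."""
    return sum((w * s) ** 2 for w, s in zip(weights, sigmas))


def min_variance_signs(sigmas, theta):
    """Sign vector from (3.12): sign(R Lambda^{-1/2} V' 1) with V = I."""
    R = rotation(theta)
    return tuple(math.copysign(1.0, sum(R[k][i] / sigmas[i] for i in range(2)))
                 for k in range(2))


sigmas, theta = (0.2, 0.1), 0.3
portfolios = fvp_portfolios(sigmas, theta)
for w, signs in portfolios.items():
    print("weights", w, "signs", signs, "variance", round(variance(w, sigmas), 6))
print("signs from (3.12):", min_variance_signs(sigmas, theta))
```

With two factors there are four sign vectors but only $2^{K-1} = 2$ distinct portfolios, because a sign vector and its negative yield the same normalized weights.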

3.3.2 Arbitrariness of the decorrelation criterion

As we now show, the decorrelation criterion is arbitrary for FVP purposes because any portfolio is a FVP portfolio for factors corresponding to some rotation of the PCs.

Theorem 3.1. When $K = N$, given any portfolio $w \in \mathcal{W}$ there exists a rotation matrix $R \in SO(K)$ for which $w$ is a FVP portfolio; that is, for which $w \in \mathcal{W}_{FVP}(R)$.

Note that this theorem does not depend on the multiplicity of solutions in (3.11); it holds for any choice of vector $\mathbf{1}_K^{\pm}$. Thus, although decorrelation is sufficient to decompose the portfolio-return variance as a sum of individual factor contributions in (3.10), decorrelation alone is not sufficient to discriminate among all the possible portfolios. When $K < N$, it still holds that the FVP portfolio depends on the rotation $R$ chosen, and thus, that decorrelation is not a discriminant criterion.

Choosing $R = I_K$ leads to a unique set of PC-variance-parity (PCVP) portfolios, which consists of taking the PCs as factors, but this constitutes an arbitrary choice. As explained above, the only motivation to use PCs for factor-risk parity is that they are uncorrelated, but there are infinitely many rotations of the PCs sharing this feature, which, as shown in Theorem 3.1, lead to different FVP portfolios. It is therefore natural to ask ourselves whether the PCs provide the best factors for FVP purposes. We answer this question in Section 3.4.2 in the context of higher moments. Specifically, we show that diversifying the portfolio-return variance across ICs helps to reduce kurtosis, but that diversifying it across merely uncorrelated factors, such as the PCs, does not.

The example below illustrates the result in Theorem 3.1 for the case K = N = 2.

Example 3.3 (Minimum-variance portfolio and FVP). Let us write the minimum-variance portfolio as a FVP portfolio for $K = N = 2$ (with $\mathbf{1}_2^{\pm} = \mathbf{1}$). We need to look for $R = R(\theta)$ such that $V\Lambda^{-1/2}R(\theta)'\mathbf{1} = c\Sigma^{-1}\mathbf{1} = cV\Lambda^{-1}V'\mathbf{1}$ for some $c \in \mathbb{R}_0$. This amounts to showing that $R(\theta)'\mathbf{1} = c\Lambda^{-1/2}V'\mathbf{1}$, which, for $c = \sqrt{2}/\|\Lambda^{-1/2}V'\mathbf{1}\|$, becomes

$$ R(\theta)'\mathbf{1} = \frac{\sqrt{2}}{\|y\|}\, y, \quad y := \Lambda^{-1/2}V'\mathbf{1}. $$

The angle $\theta$ must be such that $\cos\theta = \frac{y_1 + y_2}{\sqrt{2(y_1^2 + y_2^2)}}$ and $\sin\theta = \frac{y_1 - y_2}{\sqrt{2(y_1^2 + y_2^2)}}$, and always exists because $\left(\frac{y_1 + y_2}{\sqrt{2(y_1^2 + y_2^2)}}\right)^2 + \left(\frac{y_1 - y_2}{\sqrt{2(y_1^2 + y_2^2)}}\right)^2 = 1$ and $\frac{y_1 + y_2}{\sqrt{2(y_1^2 + y_2^2)}} \in [-1, 1]$. For example, in the special case where $X_1$ and $X_2$ are uncorrelated and $\sigma_{X_1} > \sigma_{X_2}$, we have that $\Sigma = \Lambda$, and $\theta$ is given by

$$ \theta = \arccos\left( \frac{\sigma_{X_1} + \sigma_{X_2}}{\sqrt{2\left(\sigma_{X_1}^2 + \sigma_{X_2}^2\right)}} \right). $$

Example 3.4 (Arbitrariness of decorrelation for FVP). We generate $T = 1000$ observations from the asset-return distribution considered in Example 3.2. Proposition 3.1 shows that there are two different FVP portfolios for each rotation of the PCs $Y = R(\theta)Y^\star$ when $K = 2$. The left plot of Figure 3.3 depicts the weight on the first asset, $w_1$, of the two FVP portfolios as a function of the rotation angle $\theta$. In line with Theorem 3.1, $w_1$ can take any real value. In other words, any portfolio $w = (w_1, 1 - w_1)'$ is a FVP portfolio for some $\theta$. However, if for each $\theta$ we choose among the two FVP portfolios the one with the lowest portfolio-return variance, then the optimal weight on the first asset is no longer unbounded as a function of $\theta$. This is illustrated in the right plot of Figure 3.3. Nonetheless, the solution remains arbitrary with respect to $\theta$: any portfolio allocating a weight to the first asset between $-0.25$ and $0.7$ is a FVP portfolio with minimum return variance.

Figure 3.3: Arbitrariness of the factor-variance-parity portfolio. (a) Weight on first asset of the two FVP portfolios. (b) Weight on first asset of FVP portfolio with minimum return variance. [Two polar plots; angular axis $\theta$, radial axis $r = w_1$.]

Notes. The figures are generated for the $N = 2$ asset returns in Example 3.4. We consider $K = 2$ factors $Y$ given by some rotation of the principal components (PCs), $Y = R(\theta)Y^\star$. From Proposition 3.1, there exist two factor-variance-parity (FVP) portfolios for any rotation of the PCs. The left figure depicts the weight on the first asset return, $w_1$, of the two FVP portfolios (in solid blue and dashed red, respectively) as a function of the rotation angle $\theta$. The right figure depicts the weight on the first asset return, $w_1$, of the FVP portfolio with minimum return variance as a function of the rotation angle $\theta$. The plots are in polar coordinates $(\theta, r)$, where $\theta$ is the angle determining the rotation matrix $R(\theta)$ defined in (3.6) and $r = w_1$ is the radial distance from the origin $(\theta, 0)$. The PCs correspond to $\theta = 0$.
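Theorem 3.1 can be illustrated constructively in a toy setting. The sketch below assumes $K = N = 2$ uncorrelated assets (so $V = I$ and $\Lambda = \operatorname{diag}(\sigma_1^2, \sigma_2^2)$) and the sign vector $(1, 1)$; the volatilities, target portfolios and function names are hypothetical. For any target weights it solves for an angle $\theta$ such that $R(\theta)'\mathbf{1}$ is proportional to $\Lambda^{1/2}w$, then recovers the target from (3.11).

```python
import math


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def angle_for_portfolio(w, sigmas):
    """Angle theta for which w is the FVP portfolio with sign vector (1, 1).

    Assumes V = I and Lambda = diag(sigma_i^2), so the PC exposures are
    u = Lambda^{1/2} w. R(theta)' 1 = (cos + sin, cos - sin) must be
    proportional to u, which gives tan(theta) = (u1 - u2) / (u1 + u2).
    """
    u1, u2 = sigmas[0] * w[0], sigmas[1] * w[1]
    return math.atan2(u1 - u2, u1 + u2)


def fvp_weights(theta, sigmas):
    """FVP portfolio from (3.11) with sign vector (1, 1) and V = I."""
    R = rotation(theta)
    num = [(R[0][i] + R[1][i]) / sigmas[i] for i in range(2)]
    total = sum(num)
    return [x / total for x in num]


sigmas = (0.2, 0.1)
targets = [(0.2, 0.8), (0.5, 0.5), (-3.0, 4.0), (10.0, -9.0)]
for w in targets:
    theta = angle_for_portfolio(w, sigmas)
    recovered = fvp_weights(theta, sigmas)
    print(w, "theta =", round(theta, 4), "recovered =", [round(x, 6) for x in recovered])
```

The first target, $(0.2, 0.8)$, is the minimum-variance portfolio for these volatilities, since its weights are proportional to $1/\sigma_i^2$.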

3.4 Higher-moment diversification via ICA

In Section 3.3, we showed that decorrelation is not a discriminant criterion for factor-variance parity; any portfolio is a FVP portfolio for some rotation of the PCs. In contrast, relying on the ICs resolves this indeterminacy. Moreover, as we show in this section, performing FVP with respect to the ICs provides a natural way of reducing the kurtosis of portfolio returns. In contrast, the PCs are exclusively based on the asset-return covariance matrix, and thus, completely ignore higher-moment dependence in asset returns.

3.4.1 IC-variance-parity portfolios

The IC basis is obtained by finding the rotation matrix $R^\dagger$ in (3.3) such that the factors $Y^\dagger = R^\dagger Y^\star$ are as independent as possible, according to the mutual-information criterion. Thus, the IC-variance-parity portfolios are particular FVP portfolios.

Definition 3.5 (IC-variance-parity portfolios). The set of $2^{K-1}$ IC-variance-parity (ICVP) portfolios is given by $\mathcal{W}_{FVP}(R^\dagger)$ in (3.11).

From Proposition 3.1, the set of ICVP portfolios is unaffected by the sign-and-permutation indeterminacy of the ICs. Thus, under the identifiability assumption that at most one of the PCs is Gaussian (see Section 3.2.2), relying on the ICVP portfolios addresses the arbitrariness inherent in the decorrelation criterion.

3.4.2 Kurtosis diversification

We now show that using ICs for factor-variance parity is better than using merely uncorrelated factors such as the PCs because, as the following theorem shows, it helps to reduce the kurtosis of portfolio returns.

Theorem 3.2. Let the $N$ asset returns be given by a linear mixture of $K$ independent factors with positive excess kurtosis. Let the arithmetic and harmonic means be $A(x_1, \ldots, x_K) := \frac{1}{K}\sum_{i=1}^K x_i$ and $H(x_1, \ldots, x_K) := K\left(\sum_{i=1}^K 1/x_i\right)^{-1}$, respectively. Then, the following holds:

(i) The minimum and maximum return excess kurtosis achievable by any portfolio are

$$ \kappa_{\min} = \frac{1}{K} H\left(\kappa_{Y_1^\dagger}, \ldots, \kappa_{Y_K^\dagger}\right), \tag{3.13} $$

$$ \kappa_{\max} = \max\left(\kappa_{Y_1^\dagger}, \ldots, \kappa_{Y_K^\dagger}\right), \tag{3.14} $$

where $\kappa_{Y_i^\dagger}$ is the excess kurtosis of the $i$th independent component.

(ii) The return excess kurtoses of the $2^{K-1}$ IC-variance-parity portfolios are identical and equal to the arithmetic mean of the excess kurtoses of the ICs divided by $K$,

$$ \kappa_{IC} = \frac{1}{K} A\left(\kappa_{Y_1^\dagger}, \ldots, \kappa_{Y_K^\dagger}\right). \tag{3.15} $$

(iii) The excess kurtosis of any FVP portfolio return, other than an ICVP one, can take any possible value in $[\kappa_{\min}, \kappa_{\max}]$ as a function of $R^\dagger$.

The intuition behind Theorem 3.2 comes from the central limit theorem, which tells us that the asymptotic distribution of a sum of standardized independent random variables is Gaussian, and therefore has zero excess kurtosis. Because the ICVP portfolio return is an equally weighted sum (up to a change of sign) of maximally independent factors, it is expected to have an excess kurtosis close to zero. In contrast, the PCs are not independent, and thus, for the PC-variance-parity portfolios the above Gaussianization effect does not apply.

There are two important implications from Theorem 3.2.

1. Concomitantly with diversifying the portfolio-return variance over independent factors, one obtains a kurtosis that is close to the minimum one. In particular, the excess kurtosis of the IC-variance-parity portfolios is equal to the minimum one, $\kappa_{IC} = \kappa_{\min}$, if the excess kurtoses of all ICs are equal. Otherwise, it remains close to the minimum excess kurtosis as long as the excess kurtoses of the ICs are not very different from one another.³ Note that the ICVP portfolios share the same kurtosis because the kurtosis of a weighted sum of independent factors only depends on the squared weights. This is no longer true for non-independent factors.

2. By diversifying the portfolio-return variance over non-independent factors, we are not guaranteed to obtain a low kurtosis in general. Specifically, depending on the rotation matrix $R^\dagger$ over which we have no control, the FVP portfolio return can have any kurtosis in the range $[\kappa_{\min}, \kappa_{\max}]$. This is true in particular for the PCVP portfolios and is a fundamental difference with the kurtosis of the ICVP portfolios that does not depend on $R^\dagger$.

Thus, the ICVP portfolio strategy offers a natural way of reducing the kurtosis of portfolio returns, which is important for investors concerned about tail risk. Moreover, the ICVP portfolio accounts for kurtosis in a parsimonious way because, in addition to the asset-return covariance matrix, one does not need to estimate the high-dimensional cokurtosis matrix, but merely the ICA rotation matrix $R^\dagger$ made of $K(K - 1)/2$ distinct parameters only.

This reduction in portfolio-return risk is meaningful for investors given the large risk premium reflected by the cross-section of stock returns for being exposed to kurtosis and downside risk; see Dittmar (2002) and Ang et al. (2006). In particular, although it is true that kurtosis is a symmetric risk measure and thus also penalizes upside gains, contrary to skewness for example, minimizing kurtosis is still relevant because Ang et al. (2006) find that the downside-risk premium is much more significant than the upside-risk premium.
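The quantities in Theorem 3.2 are simple to evaluate. The sketch below uses exact rational arithmetic; the function names are ours. It relies on the fact that, for independent unit-variance factors, the excess kurtosis of a portfolio with squared exposures $b_i$ is $\sum_i b_i^2 \kappa_{Y_i^\dagger} / (\sum_i b_i)^2$, so that squared exposures proportional to $1/\kappa_{Y_i^\dagger}$ attain $\kappa_{\min}$ and equal squared exposures give $\kappa_{IC}$.

```python
from fractions import Fraction


def kurtosis_bounds(ic_kurtoses):
    """Return (kappa_min, kappa_IC, kappa_max) from (3.13)-(3.15)."""
    k = len(ic_kurtoses)
    kappas = [Fraction(x) for x in ic_kurtoses]
    harmonic = k / sum(1 / x for x in kappas)
    arithmetic = sum(kappas) / k
    return harmonic / k, arithmetic / k, max(kappas)


def portfolio_excess_kurtosis(squared_exposures, ic_kurtoses):
    """Excess kurtosis of a weighted sum of independent unit-variance factors."""
    b = [Fraction(x) for x in squared_exposures]
    weighted = sum(bi ** 2 * Fraction(k) for bi, k in zip(b, ic_kurtoses))
    return weighted / sum(b) ** 2


for kappas in ([3, 5, 7], [3, 6]):
    k_min, k_ic, k_max = kurtosis_bounds(kappas)
    print(kappas, "min:", k_min, "ICVP:", k_ic, "max:", k_max)
```

For IC excess kurtoses $(3, 5, 7)$ this gives $\kappa_{\min} = 105/71$ and $\kappa_{IC} = 5/3$, and for $(3, 6)$ it gives $\kappa_{\min} = 2$ and $\kappa_{IC} = 9/4$.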

³ For example, taking $K = 3$ and $\left(\kappa_{Y_1^\dagger}, \kappa_{Y_2^\dagger}, \kappa_{Y_3^\dagger}\right) = (3, 5, 7)$, the minimum portfolio-return kurtosis is given by $\kappa_{\min} = 105/71 = 1.48$, versus $\kappa_{IC} = 5/3 = 1.67$ for the ICVP portfolios.

Figure 3.4: Excess kurtosis of PC- and IC-variance-parity portfolios. [Polar plot; angular axis $\theta^\dagger$, radial axis $r$ with gridlines from 2 to 6.]

Notes. The figure is an illustration of Proposition 3.2 with $\left(\kappa_{Y_1^\dagger}, \kappa_{Y_2^\dagger}\right) = (3, 6)$. It depicts, as a function of the rotation angle $\theta^\dagger$ determining the ICA rotation matrix $R^\dagger = R(\theta^\dagger)$, the excess kurtosis of the minimum-kurtosis portfolio ($\kappa_{\min} = 2$ in solid green), of the two ICVP portfolios ($\kappa_{IC} = 9/4$ in solid blue) and of one of the two PCVP portfolios ($\kappa_{PC}$ with the plus sign in (3.16) in dashed red). The plot is in polar coordinates $(\theta^\dagger, r)$, where $r$ is the radial distance from the origin $(\theta^\dagger, 0)$.

The following example illustrates Theorem 3.2 for the case K = 2.

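Example 3.5 (Kurtosis of ICVP versus PCVP portfolios). Let $K = 2$ and $R^\dagger = R(\theta^\dagger)$. Then, the excess kurtosis of the two PCVP portfolios is given by (see Appendix 3.8.1.4 for a proof)

$$ \kappa_{PC} \in \left\{ \frac{\pm 4\sin(2\theta^\dagger)\left(\kappa_{Y_1^\dagger} - \kappa_{Y_2^\dagger}\right) + \left(3 - \cos(4\theta^\dagger)\right)\left(\kappa_{Y_1^\dagger} + \kappa_{Y_2^\dagger}\right)}{8} \right\}. \tag{3.16} $$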

In line with point (iii) of Theorem 3.2, it is easy to check that $\kappa_{PC}$ spans the range $[\kappa_{\min}, \kappa_{\max}]$ as a function of $\theta^\dagger$. This is shown in Figure 3.4, which depicts $\kappa_{\min}$, $\kappa_{IC}$ and $\kappa_{PC}$ (with the plus sign in (3.16)) as a function of $\theta^\dagger$ for $\left(\kappa_{Y_1^\dagger}, \kappa_{Y_2^\dagger}\right) = (3, 6)$. By chance, the kurtosis of the PCVP portfolio could be lower than the kurtosis of the ICVP portfolio, but then only slightly because $\kappa_{IC} = 9/4$ is very close to $\kappa_{\min} = 2$. In contrast, it can also be much higher because $\kappa_{PC}$ takes values in the range $[\kappa_{\min}, \kappa_{\max}] = [2, 6]$.
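The closed form in (3.16) can be cross-checked numerically. The sketch below assumes independent unit-variance ICs and the convention $Y^\star = R(\theta^\dagger)'Y^\dagger$; the function names are ours. Which sign in (3.16) corresponds to which of the two PCVP portfolios depends on that convention, so the two computations are compared as unordered pairs.

```python
import math


def kappa_pc(theta, k1, k2, sign=+1):
    """Excess kurtosis of a PCVP portfolio from (3.16), for K = 2."""
    return (sign * 4 * math.sin(2 * theta) * (k1 - k2)
            + (3 - math.cos(4 * theta)) * (k1 + k2)) / 8


def kappa_direct(theta, k1, k2, sign=+1):
    """Excess kurtosis of Y1* + sign * Y2* from its loadings on the ICs.

    With Y1* = c Y1 + s Y2 and Y2* = -s Y1 + c Y2 (c = cos, s = sin), the
    portfolio loads (c - sign * s) on the first IC and (s + sign * c) on the second.
    """
    c, s = math.cos(theta), math.sin(theta)
    a1, a2 = c - sign * s, s + sign * c
    return (a1 ** 4 * k1 + a2 ** 4 * k2) / (a1 ** 2 + a2 ** 2) ** 2


k1, k2 = 3, 6
grid = [k * math.pi / 1000 for k in range(1000)]
values = [kappa_pc(t, k1, k2) for t in grid]
print("range of kappa_PC over the grid:", round(min(values), 4), round(max(values), 4))
```

For $(\kappa_{Y_1^\dagger}, \kappa_{Y_2^\dagger}) = (3, 6)$ the grid values range from about 2 to 6, in line with the interval $[\kappa_{\min}, \kappa_{\max}]$ stated in the example.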

3.4.3 Data-driven shrinkage portfolio

In this section, we introduce the portfolio-selection strategy that we propose to implement, which is a shrinkage of the minimum-variance portfolio toward the minimum-variance ICVP portfolio.

Theorem 3.2 supports the use of the minimum-variance ICVP (MV-ICVP) portfolio,

$$ w_{IC} := \frac{V\Lambda^{-1/2}R^{\dagger\prime}\mathbf{1}_K^{MV}}{\mathbf{1}'V\Lambda^{-1/2}R^{\dagger\prime}\mathbf{1}_K^{MV}} \quad \text{with} \quad \mathbf{1}_K^{MV} := \operatorname{sign}(R^\dagger\Lambda^{-1/2}V'\mathbf{1}), \tag{3.17} $$

out of the $2^{K-1}$ ICVP portfolios because, under the assumptions of Theorem 3.2, all the ICVP portfolios have the same return kurtosis but not the same return variance. Thus, the MV-ICVP portfolio helps reduce variance without increasing kurtosis.

Remark 3.1. Under the assumptions of Theorem 3.2, all the $2^{K-1}$ ICVP portfolios have the same return kurtosis, but not the same return skewness. Indeed, the return skewness of the ICVP portfolio corresponding to the sign vector $\mathbf{1}_K^{\pm}$ is given by

$$ \zeta_P = \frac{\sum_{i=1}^K \zeta_{Y_i^\dagger} \left(\mathbf{1}_K^{\pm}\right)_i}{K^{3/2}}, \tag{3.18} $$

where $\left(\mathbf{1}_K^{\pm}\right)_i$ is the $i$th entry of $\mathbf{1}_K^{\pm}$. Thus, the skewness varies with the sign of the IC exposures and the maximum-skewness ICVP portfolio is obtained for $\left(\mathbf{1}_K^{\pm}\right)_i = \operatorname{sign}\left(\zeta_{Y_i^\dagger}\right)$. However, we have implemented the maximum-skewness ICVP portfolio on the datasets considered in Section 3.6 and observed that it often takes large positive and negative weights, resulting in a high skewness but an otherwise poor out-of-sample performance on the remaining criteria. For this reason, we focus on the minimum-variance ICVP portfolio.
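As a small numerical check of (3.18), the sketch below evaluates the skewness for every sign vector and confirms that the maximum is reached when each sign matches the sign of the corresponding IC skewness. The IC skewness values and the function name are hypothetical, and `signs` is taken to be the vector of signs of the IC exposures.

```python
import itertools


def icvp_skewness(ic_skewness, signs):
    """Skewness (3.18) of the ICVP portfolio whose IC exposures have the given signs."""
    k = len(ic_skewness)
    return sum(z * s for z, s in zip(ic_skewness, signs)) / k ** 1.5


ic_skewness = [-0.4, 0.1, -0.2]  # hypothetical IC skewness values
best = max(itertools.product((1, -1), repeat=3),
           key=lambda s: icvp_skewness(ic_skewness, s))
print("skewness-maximizing signs:", best)
print("maximum skewness:", round(icvp_skewness(ic_skewness, best), 4))
```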

The minimum-variance (MV) portfolio in (1.6), $w_{MV}$, has been widely studied by researchers and shown to exhibit favorable out-of-sample performance in terms of Sharpe ratio because it is the only portfolio on the efficient frontier that does not require any estimate of asset mean returns; see Section 1.1.2.

Next, we introduce a shrinkage portfolio that combines the MV portfolio $w_{MV}$ and the MV-ICVP portfolio $w_{IC}$.

Definition 3.6 (ICMV portfolio). The ICMV portfolio is defined as the shrinkage portfolio

$$ w_{ICMV} := (1 - \delta)w_{MV} + \delta w_{IC}, \tag{3.19} $$

where $\delta \in [0, 1]$ is a given shrinkage intensity.

We calibrate the shrinkage intensity δ in a data-driven way using the bootstrap methodology of 10-fold cross-validation that avoids assumptions on the asset-return distribution; see Section 3.6.1.
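The combination in (3.19) is a plain convex combination of two weight vectors. The sketch below only illustrates that step; it does not implement the cross-validated calibration of $\delta$, and the weight vectors and function name are hypothetical.

```python
def icmv_weights(w_mv, w_ic, delta):
    """Shrinkage portfolio (3.19): (1 - delta) * w_MV + delta * w_IC."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return [(1.0 - delta) * a + delta * b for a, b in zip(w_mv, w_ic)]


w_mv = [0.2, 0.8]    # hypothetical minimum-variance weights
w_ic = [0.49, 0.51]  # hypothetical MV-ICVP weights
for delta in (0.0, 0.5, 1.0):
    print(delta, icmv_weights(w_mv, w_ic, delta))
```

Because both inputs sum to one, so does the shrinkage portfolio for any $\delta$.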

The empirical results show that the ICMV portfolio achieves an appealing trade-off between the high out-of-sample Sharpe ratio of the MV portfolio and the low kurtosis of the MV-ICVP portfolio. In contrast, shrinking the MV portfolio toward the minimum-variance PCVP portfolio, which relies on the principal components, does not reduce the return kurtosis as much.

Shrinkage portfolios as in (3.19) have been considered by several authors in the literature. For instance, Kan and Zhou (2007) shrink the mean-variance portfolio toward the MV portfolio. Perhaps more closely related to our work are DeMiguel et al. (2009b) and Tu and Zhou (2011), who shrink the MV and mean-variance portfolios, respectively, toward the equally weighted portfolio. A distinguishing feature of our approach is that we shrink toward a portfolio for which the IC exposures are equally weighted.

3.5 Factor-risk parity with higher-moment risk measures

In Sections 3.3 and 3.4, we showed how to apply ICs to the traditional factor-risk-parity approach based on variance. In this section, we show how to apply ICs to diversify the modified VaR of portfolio returns in (1.48), which accounts for higher moments. In Appendix 3.8.3, we theoretically discuss the case where we diversify the kurtosis of portfolio returns across ICs.

Haugh et al. (2017) consider an asset-risk-parity portfolio under the VaR risk measure but, for tractability, they rely on the Gaussian VaR approximation. In contrast, the modified VaR accounts for skewness and kurtosis, which we parsimoniously incorporate by exploiting the maximally independent property of the ICs.

3.5.1 Parsimonious estimation of higher moments with ICs

We now show that relying on the ICs provides a natural way to parsimoniously estimate the portfolio-return higher moments. Consider a risk measure that depends on the portfolio-return third and fourth moments. Then, diversifying the portfolio-return risk with respect to $K$ uncorrelated factors $Y$ given by some rotation of the PCs, $Y = RY^\star$, requires estimating the third and fourth moments of the reduced portfolio return $\hat{P} = \tilde{w}(R)'Y$. From (1.38)–(1.39), this requires estimating the coskewness and cokurtosis matrices of $Y$, $M_3(Y)$ and $M_4(Y)$, which are together made of a large number of parameters (Boudt et al. 2015). Namely,

$$ \#(M_3(Y)) + \#(M_4(Y)) = \binom{K+2}{3} + \binom{K+3}{4} = \frac{K(K+1)(K+2)(K+7)}{24} = O(K^4). $$

To develop a more parsimonious estimation approach, we rely on the independent components $Y^\dagger = R^\dagger Y^\star$ and assume that they are independent. In this case, estimating the matrices $M_3(Y^\dagger)$ and $M_4(Y^\dagger)$ requires estimating only $2K = O(K)$ parameters, which correspond to the third and fourth central moments of the $K$ independent components.
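To give a sense of scale, the sketch below compares the number of distinct coskewness and cokurtosis entries with the $2K$ parameters needed under independence, and checks the closed form against the binomial expression. The function names are ours, and the values of $K$ are chosen for illustration.

```python
from math import comb


def comoment_parameters(k):
    """Distinct entries of the coskewness and cokurtosis matrices of k factors."""
    return comb(k + 2, 3) + comb(k + 3, 4)


def closed_form(k):
    """Closed form K(K+1)(K+2)(K+7)/24 for the same count."""
    return k * (k + 1) * (k + 2) * (k + 7) // 24


for k in (2, 6, 30):
    print(f"K = {k:2d}: full comoments = {comoment_parameters(k):6d}, "
          f"independent ICs = {2 * k}")
```

For $K = 30$ factors, the full comoment matrices contain 45,880 distinct entries, against 60 parameters under the independence assumption.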

Proposition 3.2. Let the $K$ ICs be independent and have finite fourth central moments. Then, the third and fourth central moments of the reduced portfolio return are given by

$$ m_{3,\hat{P}} = \sum_{i=1}^K \tilde{w}_i(R^\dagger)^3\, m_{3,Y_i^\dagger}, \tag{3.20} $$

$$ m_{4,\hat{P}} = \sum_{i=1}^K \tilde{w}_i(R^\dagger)^4\, m_{4,Y_i^\dagger} + 3\sum_{i=1}^K \sum_{j \neq i} \left(\tilde{w}_i(R^\dagger)\,\tilde{w}_j(R^\dagger)\right)^2, \tag{3.21} $$

where $m_{3,Y_i^\dagger}$ and $m_{4,Y_i^\dagger}$ are the third and fourth central moments of the $i$th IC.

In case the ICs are not truly independent, the expressions in (3.20) and (3.21) are biased approximations, but this bias is expected to be largely offset by the benefits in terms of estimation risk.⁴ In Chapter 4, we will use this parsimonious estimation procedure to obtain robust estimates of moment-based portfolios.
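Formulas (3.20) and (3.21) can be verified exactly on a toy case. The sketch below uses two independent discrete factors with zero mean and unit variance (hypothetical distributions chosen so that the arithmetic is exact) and compares the formulas with moments obtained by enumerating the joint distribution. All names are ours.

```python
from fractions import Fraction as F
from itertools import product

# Two independent toy ICs (hypothetical), each with zero mean and unit variance,
# given as lists of (value, probability).
ic1 = [(F(-1), F(1, 2)), (F(1), F(1, 2))]
ic2 = [(F(-1, 2), F(4, 5)), (F(2), F(1, 5))]


def central_moment(dist, order):
    """Central moment of a discrete distribution given as (value, probability) pairs."""
    mean = sum(v * p for v, p in dist)
    return sum((v - mean) ** order * p for v, p in dist)


def portfolio_moments_formula(exposures, ics):
    """Third and fourth central moments from (3.20) and (3.21)."""
    m3 = sum(a ** 3 * central_moment(d, 3) for a, d in zip(exposures, ics))
    m4 = sum(a ** 4 * central_moment(d, 4) for a, d in zip(exposures, ics))
    m4 += 3 * sum((a * b) ** 2
                  for i, a in enumerate(exposures)
                  for j, b in enumerate(exposures) if i != j)
    return m3, m4


def portfolio_moments_exact(exposures, ics):
    """Same moments by enumerating the joint distribution of the independent ICs."""
    joint = []
    for combo in product(*ics):
        value = sum(a * v for a, (v, _) in zip(exposures, combo))
        prob = F(1)
        for _, p in combo:
            prob *= p
        joint.append((value, prob))
    return central_moment(joint, 3), central_moment(joint, 4)


exposures = [F(3, 5), F(-4, 5)]
print("formula:", portfolio_moments_formula(exposures, [ic1, ic2]))
print("exact:  ", portfolio_moments_exact(exposures, [ic1, ic2]))
```

The factor 3 in the cross term of (3.21) relies on the ICs having unit variance, which holds here by construction.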

3.5.2 IC-risk parity with modified Value-at-Risk

Proposition 3.2 allows one to develop parsimonious estimators of factor-risk-parity portfolios with higher-moment risk measures. We show how to do this for an approximation of the popular VaR risk measure. The difficulty in using the standard VaR is that contrary to the central moments in (3.20) and (3.21), the quantile of a linear combination of independent random variables is not an affine function of the individual quantiles.

⁴ For example, Martellini and Ziemann (2010) introduce estimators of the asset-return coskewness and cokurtosis matrices that shrink the sample estimator toward a target estimator obtained under the assumption of a constant correlation or a single-factor model. Empirically, they find that the shrinkage intensity minimizing the mean-squared estimation error is around 80-90% on average for the cokurtosis matrix, and nearly 100% for the coskewness matrix. This shows that higher-comoment matrices are highly tainted by estimation error, supporting our parsimonious estimation approach.

To circumvent this difficulty, it is useful to employ the Cornish-Fisher expansion of the reduced-portfolio-return quantile at confidence level ε (Cornish and Fisher 1938),

$$Q_{\hat P}(\varepsilon, r) := \mu_{\hat P} + \sigma_{\hat P}\Big(z_\varepsilon + \sum_{i=1}^{r} p_i(z_\varepsilon)\Big), \tag{3.22}$$

where $z_\varepsilon$ is the standard Gaussian quantile and the $p_i$'s are polynomials whose coefficients depend on the standardized moments of $\hat P$. The exact value of the quantile is given by $Q_{\hat P}(\varepsilon, \infty)$, which requires estimating all the reduced-portfolio-return moments. To avoid doing this, it is common to truncate the infinite sum and choose $r < \infty$. Choosing $r = 0$ corresponds to the Gaussian VaR, studied by Alexander and Baptista (2002) and Haugh et al. (2017). We choose $r = 2$, which allows us to capture the skewness and kurtosis.$^{5}$

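This corresponds to the modified VaR of Favre and Galeano (2002), $\mathrm{MVaR}_{\hat P}(\varepsilon)$ in (1.48):

$$\mathrm{MVaR}_{\hat P}(\varepsilon) := -Q_{\hat P}(\varepsilon, 2) = -\mu_{\hat P} - \sigma_{\hat P}\big(z_\varepsilon + p_1(z_\varepsilon) + p_2(z_\varepsilon)\big), \tag{3.23}$$

where $p_1(x) = \frac{1}{6}(x^2 - 1)\,\zeta_{\hat P}$ and $p_2(x) = \frac{1}{24}(x^3 - 3x)\,\kappa_{\hat P} - \frac{1}{36}(2x^3 - 5x)\,\zeta_{\hat P}^2$.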

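Even though the MVaR is only an approximation of the VaR, it is a sensible higher-moment criterion because, for $\varepsilon = 1\%$ used in the empirical analysis, we have that

$$\frac{\partial \mathrm{MVaR}_{\hat P}(\varepsilon)}{\partial \zeta_{\hat P}} \le 0 \quad \text{if} \quad \zeta_{\hat P} \ge \frac{3(z_\varepsilon^2 - 1)}{2z_\varepsilon^3 - 5z_\varepsilon} = -0.9769,$$

a mild lower bound, and

$$\frac{\partial \mathrm{MVaR}_{\hat P}(\varepsilon)}{\partial \kappa_{\hat P}} = \frac{1}{24}\,(3z_\varepsilon - z_\varepsilon^3)\,\sigma_{\hat P} \ge 0.$$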
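The modified VaR in (3.23) and the skewness threshold above can be evaluated directly from the four portfolio moments. The following is a minimal sketch, not the code used in the thesis; the function names `modified_var` and `skewness_threshold` are hypothetical, and the Gaussian quantile is taken from Python's standard library.

```python
from statistics import NormalDist


def modified_var(mu, sigma, skew, exkurt, eps=0.01):
    """Modified VaR (3.23): minus the Cornish-Fisher quantile with r = 2."""
    z = NormalDist().inv_cdf(eps)  # standard Gaussian quantile z_eps
    p1 = (z**2 - 1) * skew / 6
    p2 = (z**3 - 3 * z) * exkurt / 24 - (2 * z**3 - 5 * z) * skew**2 / 36
    return -mu - sigma * (z + p1 + p2)


def skewness_threshold(eps=0.01):
    """Skewness level above which the MVaR is decreasing in skewness."""
    z = NormalDist().inv_cdf(eps)
    return 3 * (z**2 - 1) / (2 * z**3 - 5 * z)


# With zero skewness and zero excess kurtosis, the MVaR is the Gaussian VaR.
gaussian_var = modified_var(mu=0.0, sigma=1.0, skew=0.0, exkurt=0.0)
```

Setting `skew` and `exkurt` to zero recovers the case $r = 0$, which makes the sketch easy to sanity-check against a Gaussian quantile table.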

The Cornish-Fisher approximation is useful in our context because it relates the reduced-portfolio-return quantile to its moments, which can easily be differentiated with respect to the IC exposures $\tilde w(R^\dagger)$. This allows an easy computation of the MVaR contribution of each IC. Specifically, because MVaR is positive homogeneous, it can be decomposed as

$$\mathrm{MVaR}_{\hat P}(\varepsilon) = \sum_{i=1}^{K} \mathrm{MVaR}_{Y_i^\dagger|\hat P}(\varepsilon) := \sum_{i=1}^{K} \tilde w_i(R^\dagger)\,\frac{\partial \mathrm{MVaR}_{\hat P}(\varepsilon)}{\partial \tilde w_i(R^\dagger)}. \tag{3.24}$$
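The Euler decomposition in (3.24) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the thesis implementation: the ICs are taken to be independent with unit variance, so that the reduced-portfolio variance is the sum of squared exposures, the higher moments follow Proposition 3.2, and the partial derivatives are approximated by central finite differences instead of a closed form. All names and input values are hypothetical.

```python
from statistics import NormalDist


def reduced_moments(w, m3_ic, m4_ic):
    """Variance and third/fourth central moments of the reduced portfolio
    return, for independent unit-variance ICs (Proposition 3.2)."""
    k = len(w)
    var = sum(x**2 for x in w)
    m3 = sum(x**3 * m for x, m in zip(w, m3_ic))
    m4 = sum(x**4 * m for x, m in zip(w, m4_ic))
    m4 += 3 * sum((w[i] * w[j]) ** 2
                  for i in range(k) for j in range(k) if j != i)
    return var, m3, m4


def mvar(w, mu_ic, m3_ic, m4_ic, eps=0.01):
    """Modified VaR (3.23) of the reduced portfolio return."""
    z = NormalDist().inv_cdf(eps)
    var, m3, m4 = reduced_moments(w, m3_ic, m4_ic)
    sigma = var ** 0.5
    skew, exkurt = m3 / sigma**3, m4 / sigma**4 - 3
    p1 = (z**2 - 1) * skew / 6
    p2 = (z**3 - 3 * z) * exkurt / 24 - (2 * z**3 - 5 * z) * skew**2 / 36
    mu = sum(x * m for x, m in zip(w, mu_ic))
    return -mu - sigma * (z + p1 + p2)


def mvar_contributions(w, mu_ic, m3_ic, m4_ic, eps=0.01, h=1e-6):
    """Euler contributions w_i * dMVaR/dw_i, by central finite differences."""
    contributions = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        slope = (mvar(up, mu_ic, m3_ic, m4_ic, eps)
                 - mvar(down, mu_ic, m3_ic, m4_ic, eps)) / (2 * h)
        contributions.append(w[i] * slope)
    return contributions


# Hypothetical IC exposures and IC moments, chosen only for illustration.
exposures = [0.6, 0.3, -0.4]
ic_means = [0.05, 0.02, -0.01]
ic_m3 = [-0.5, 0.2, 0.1]
ic_m4 = [6.0, 4.5, 9.0]
parts = mvar_contributions(exposures, ic_means, ic_m3, ic_m4)
total = mvar(exposures, ic_means, ic_m3, ic_m4)
```

Because the MVaR is homogeneous of degree one in the exposures, the entries of `parts` add up to `total` up to the finite-difference error.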

$^{5}$Baillie and Bollerslev (1992) provide evidence that incorporating additional terms barely improves the approximation, while it increases the number of moments to estimate.

Boudt et al. (2008) provide closed-form expressions for the MVaR contributions in the case of asset returns. We extend their results to the case of MVaR contributions of factors and in particular, of ICs. As shown in Proposition 3.2, we can do this parsimoniously by assuming the ICs are independent.

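Proposition 3.3. Let the $K$ ICs be independent and have finite fourth central moments. Then, the MVaR contribution of each IC in (3.24) is given by

$$\mathrm{MVaR}_{Y_i^\dagger|\hat P}(\varepsilon) = \tilde w_i(R^\dagger) \times \bigg[ -\mu_{Y_i^\dagger} - \frac{\partial_i \sigma_{\hat P}^2}{2\sigma_{\hat P}}\big(z_\varepsilon + p_1(z_\varepsilon) + p_2(z_\varepsilon)\big) + \sigma_{\hat P}\Big( -p_1(z_\varepsilon)\,\partial_i \zeta_{\hat P} - \frac{1}{24}(z_\varepsilon^3 - 3z_\varepsilon)\,\partial_i \kappa_{\hat P} + \frac{1}{18}(2z_\varepsilon^3 - 5z_\varepsilon)\,\zeta_{\hat P}\,\partial_i \zeta_{\hat P} \Big) \bigg], \tag{3.25}$$

where

$$\partial_i \zeta_{\hat P} = \frac{2\sigma_{\hat P}^3\,\partial_i m_{3,\hat P} - 3\,m_{3,\hat P}\,\sigma_{\hat P}\,\partial_i \sigma_{\hat P}^2}{2\sigma_{\hat P}^6}, \tag{3.26}$$

$$\partial_i \kappa_{\hat P} = \frac{\sigma_{\hat P}^2\,\partial_i m_{4,\hat P} - 2\,m_{4,\hat P}\,\partial_i \sigma_{\hat P}^2}{\sigma_{\hat P}^6}, \tag{3.27}$$

with $\sigma_{\hat P}^2$, $m_{3,\hat P}$ and $m_{4,\hat P}$ given by (3.10), (3.20) and (3.21), respectively.$^{6}$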

Equipped with Proposition 3.3, we define the IC-MVaR-parity portfolios.

Definition 3.7 (IC-MVaR-parity portfolios). An IC-MVaR-parity portfolio is a portfolio for which $\mathrm{MVaR}_{Y_i^\dagger|\hat P}(\varepsilon) = \mathrm{MVaR}_{Y_j^\dagger|\hat P}(\varepsilon)$ for all $i, j = 1, \dots, K$.

Because there may be multiple IC-MVaR-parity portfolios, similar to the variance case in Proposition 3.1, we consider the IC-MVaR-parity portfolio that minimizes MVaR, which we estimate numerically by solving the following optimization problem:

$$\hat w_{IC}(\varepsilon) := \operatorname*{argmin}_{w \in W}\ \mathrm{MVaR}_P(\varepsilon) \quad \text{subject to} \quad \Big(\sum_{i=1}^{K} \%\mathrm{MVaR}_{Y_i^\dagger|\hat P}(\varepsilon)^2\Big)^{-1} \ge K - \epsilon, \tag{3.28}$$

where $\epsilon$ is a tolerance parameter set to $\epsilon = 5 \times 10^{-4}$ and

$$\%\mathrm{MVaR}_{Y_i^\dagger|\hat P}(\varepsilon) := \frac{\mathrm{MVaR}_{Y_i^\dagger|\hat P}(\varepsilon)}{\sum_{j=1}^{K} \mathrm{MVaR}_{Y_j^\dagger|\hat P}(\varepsilon)} \tag{3.29}$$

is the relative (percentage) MVaR contribution of each IC. The constraint in (3.28) is used because, given a discrete probability distribution $p$, the inverse Herfindahl index $\big(\sum_{i=1}^{K} p_i^2\big)^{-1}$ achieves a maximum value of $K$ when $p_i = 1/K$ for all $i$.

$^{6}$We assume that the confidence level $\varepsilon$ is low enough for the MVaR contributions to be all positive. This assumption typically holds for $\varepsilon = 5\%$ or below.

Finally, we propose to implement the following shrinkage portfolio as counterpart of the ICMV portfolio in Definition 3.6.

Definition 3.8 (ICMVaR portfolio). The ICMVaR portfolio is the shrinkage portfolio

$$w_{ICMVaR}(\varepsilon) := (1 - \delta)\,w_{MMVaR}(\varepsilon) + \delta\,\hat w_{IC}(\varepsilon), \tag{3.30}$$

where $\delta \in [0, 1]$ is a given shrinkage intensity, $w_{MMVaR}(\varepsilon)$ is the minimum-MVaR portfolio and $\hat w_{IC}(\varepsilon)$ is the IC-MVaR-parity portfolio that minimizes MVaR, computed in (3.28).
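The parity constraint in (3.28) and the shrinkage rule in (3.30) are both simple to evaluate once the MVaR contributions and the two component portfolios are available. The sketch below is illustrative only; the helper names are hypothetical, and the inputs are assumed to be computed elsewhere (for instance by a numerical optimizer).

```python
def relative_contributions(contributions):
    """Percentage MVaR contributions (3.29); assumes a positive total."""
    total = sum(contributions)
    return [c / total for c in contributions]


def inverse_herfindahl(shares):
    """Inverse Herfindahl index; equals K when all K shares are 1/K."""
    return 1.0 / sum(s**2 for s in shares)


def satisfies_parity(contributions, tol=5e-4):
    """Constraint in (3.28): inverse Herfindahl index at least K - tol."""
    shares = relative_contributions(contributions)
    return inverse_herfindahl(shares) >= len(shares) - tol


def shrink(w_min_risk, w_parity, delta):
    """Shrinkage portfolio (3.30): convex combination of two portfolios."""
    return [(1 - delta) * a + delta * b for a, b in zip(w_min_risk, w_parity)]


# Hypothetical two-asset example with shrinkage intensity 0.25.
combined = shrink([0.7, 0.3], [0.4, 0.6], delta=0.25)
```

Since both component portfolios are fully invested, the combined weights still sum to one for any $\delta \in [0, 1]$.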

3.6 Out-of-sample performance

We now evaluate the out-of-sample performance of the proposed shrinkage portfolios. In Section 3.6.1 we discuss the empirical data and methodology employed, in Section 3.6.2 we discuss the calibration of parameters, and in Section 3.6.3 we discuss the results.

3.6.1 Data and methodology

We consider three equity-return datasets from Kenneth French's data library: six size and book-to-market portfolios (6BTM), six size and operating-profitability portfolios (6Prof) and 30 U.S. industry portfolios (30Ind). We download value-weighted daily portfolio returns from 1978 to 2017. We use daily return data because larger sample sizes produce better estimates.

We consider six portfolio policies.$^{7}$ First, the minimum-variance asset-variance-parity portfolio (AVP).$^{8}$ Second, the minimum-variance portfolio (MV). Third and fourth, the shrinkage portfolios obtained by combining the MV portfolio and the minimum-variance FVP portfolio based on the ICs (ICMV) and PCs (PCMV). Fifth, the minimum-MVaR portfolio (MMVaR). Sixth, the shrinkage portfolio obtained by combining the MMVaR portfolio and the minimum-MVaR IC-MVaR-parity portfolio (ICMVaR).$^{9}$ We fix the MVaR confidence level at $\varepsilon = 1\%$.$^{10}$ All portfolios are computed using sample estimators.

$^{7}$In Appendix 3.8.4 at the end of the chapter, we also compute the MV, ICMV and PCMV portfolios under the no-short-selling constraint. We find that the conclusions are qualitatively similar, although the performance is worse than that of the unconstrained portfolios because, with daily data, preventing short-selling hurts the out-of-sample performance (Jagannathan and Ma 2003).

$^{8}$As explained in Section 1.3.1, Bai et al. (2016) show that there are $2^{N-1}$ asset-variance-parity portfolios and provide an optimization program in (1.53) to compute them. For 6BTM and 6Prof, we compute all $2^{5}$ solutions and choose the one with minimum return variance. For 30Ind, it is infeasible to compute all $2^{29}$ solutions, and thus, we rely instead on the heuristic algorithm of Bai et al. (2016, p.7) that provides a solution close to the minimum-variance asset-variance-parity portfolio with much less computational work.

To alleviate the impact of estimation error, we reduce the problem dimension and keep only the first $K \le N$ principal components. We select $K$ using the minimum-average-partial-correlation method of Velicer (1976). Let $R(K) := R_X - VV'$ be the correlation matrix of $X$ after the effect of the first $K$ principal components has been partialed out. Then, we select the $K \in [2, N - 1]$ that minimizes $\sum_{i=1}^{N} \sum_{j \neq i} R_{ij}(K)^2$.

To evaluate the out-of-sample performance of the different portfolios, we employ a rolling-horizon methodology as for the minimum Rényi entropy portfolio in Section 2.6.1. We use a rolling window of six months and an estimation window of five years (sample size T = 1260). Thus, we obtain 35 years of out-of-sample daily portfolio returns.
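The rolling-horizon scheme can be made concrete by listing the index ranges of each estimation window and of the out-of-sample period that follows it. The sketch below assumes 252 trading days per year, so that six months correspond to 126 days and 40 years to 10080 observations; these round numbers and the function name `rolling_windows` are assumptions for illustration, and the actual sample length differs slightly.

```python
def rolling_windows(n_obs, est_window=1260, step=126):
    """Return (est_start, est_end, oos_end) triples with exclusive ends.

    Each portfolio is estimated on [est_start, est_end) and held over
    the following out-of-sample period [est_end, oos_end).
    """
    windows = []
    start = 0
    while start + est_window < n_obs:
        est_end = start + est_window
        oos_end = min(est_end + step, n_obs)
        windows.append((start, est_end, oos_end))
        start += step
    return windows


# 40 years of daily data under the 252-days-per-year assumption.
windows = rolling_windows(n_obs=40 * 252)
oos_days = sum(oos_end - est_end for _, est_end, oos_end in windows)
```

Under these assumptions the scheme produces 70 windows and 35 years of out-of-sample days, in line with the figures reported in the text.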

We compare the out-of-sample performance of the different portfolios in terms of mean-variance trade-off, tail risk and portfolio stability. For mean-variance trade-off, we report the annualized sample mean, volatility and Sharpe ratio. For tail risk, we report the daily sample skewness, excess kurtosis, 1% MVaR and 1% modified Sharpe ratio, which is the ratio between the mean and the MVaR of the portfolio return as in Gregoriou and Gueyie (2003). For portfolio stability, we report the daily portfolio turnover in (2.29) (with 252 observations per year).

To test the statistical significance of the difference between the Sharpe ratio or the modified Sharpe ratio of two given portfolios, we use circular-bootstrapping methods that are well suited for returns that have fat tails and non-stationarity. We compute two-sided p-values with the method proposed by Ledoit and Wolf (2008) for the Sharpe ratio and Ardia and Boudt (2015b) for the modified Sharpe ratio. As in DeMiguel et al. (2009a), we use $B = 1000$ bootstrap resamples and block size $b = 5$.$^{11}$ The stars in Table 3.1, *, ** and ***, mean that the two-sided p-value is less than 20%, 10% and 5%, respectively. We compare the following pairs of portfolios: PCMV vs MV (stars in superscript next to PCMV quantity), ICMV vs MV (stars in superscript next to ICMV quantity), ICMV vs PCMV (stars in subscript next to ICMV quantity) and ICMVaR vs MMVaR (stars in superscript next to ICMVaR quantity).

$^{9}$We use the Matlab function fmincon with the GlobalSearch option to compute the MMVaR and ICMVaR portfolios, and we find the ICs with the FastICA Matlab package available on Aapo Hyvärinen's website: www.cs.helsinki.fi/u/ahyvarin/papers/fastica.shtml.

$^{10}$Two common choices in the literature are $\varepsilon = 1\%$ and $5\%$. We choose $\varepsilon = 1\%$ because Cavenaile and Lejeune (2012) show that setting $\varepsilon > 4.16\%$ is inconsistent with investors' negative preferences for kurtosis.

$^{11}$The idea behind the circular bootstrap is to wrap time-series data in a circle; that is, to assume $X_{T+1} = X_1$, $X_{T+2} = X_2$ and so on. Then, one randomly (uniformly) chooses a sequence of $b$ consecutive observations on this circle, and repeats this $B$ times. We employ the R package PeerPerformance to compute the bootstrapped p-values.
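The resampling step of the circular bootstrap can be sketched as follows. This is a generic illustration of circular block resampling, in which each resample is assembled from blocks of $b$ consecutive observations drawn on the circle; it is not the PeerPerformance implementation and does not include the test statistics of Ledoit and Wolf (2008) or Ardia and Boudt (2015b). The function names and the seed are hypothetical.

```python
import random


def circular_block_indices(n_obs, block_size, rng):
    """Indices of one resample built from wrapped blocks of consecutive days."""
    indices = []
    while len(indices) < n_obs:
        start = rng.randrange(n_obs)  # uniform starting point on the circle
        indices.extend((start + k) % n_obs for k in range(block_size))
    return indices[:n_obs]  # trim the last block if it overshoots


def bootstrap_resamples(series, block_size=5, n_resamples=1000, seed=0):
    """Generate circular-block-bootstrap resamples of a return series."""
    rng = random.Random(seed)
    n_obs = len(series)
    return [[series[i] for i in circular_block_indices(n_obs, block_size, rng)]
            for _ in range(n_resamples)]
```

Keeping blocks of consecutive observations preserves short-range dependence in the returns, which a plain observation-by-observation resampling would destroy.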

We calibrate the shrinkage intensity δ via 10-fold cross-validation; see Chapter 7 in Hastie et al. (2001). As discussed in Section 3.4.2, the FVP portfolio based on the ICs is expected to improve upon the MV portfolio in terms of kurtosis. Thus, we use as calibration criterion the modified Sharpe ratio, which accounts for higher moments.$^{12}$
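The cross-validation loop for δ can be sketched generically: for each value of δ on a grid from 0 to 1 in steps of 0.01, each of the 10 intervals is left out in turn, the portfolio is computed on the remaining data, and the criterion is evaluated on the pooled out-of-sample returns. In the sketch below, `policy` and `criterion` are hypothetical callables supplied by the user (in our setting, the shrinkage portfolio and the modified Sharpe ratio), and the toy inputs at the end are for illustration only.

```python
def cross_validate_delta(data, policy, criterion, n_folds=10, grid_size=101):
    """Pick the shrinkage intensity maximizing a cross-validated criterion.

    data      : list of daily return vectors in the estimation window
    policy    : callable (training_data, delta) -> portfolio weights
    criterion : callable (list of out-of-sample returns) -> score
    """
    fold_len = len(data) // n_folds
    best_delta, best_score = None, float("-inf")
    for step in range(grid_size):
        delta = step / (grid_size - 1)  # 0, 0.01, ..., 1 by default
        oos_returns = []
        for fold in range(n_folds):
            lo, hi = fold * fold_len, (fold + 1) * fold_len
            training = data[:lo] + data[hi:]  # remove one interval
            weights = policy(training, delta)
            oos_returns += [sum(w * r for w, r in zip(weights, day))
                            for day in data[lo:hi]]
        score = criterion(oos_returns)
        if score > best_score:
            best_delta, best_score = delta, score
    return best_delta


# Toy illustration: two assets, a policy that ignores the training data,
# and the mean return as a stand-in criterion.
toy_days = [[0.01, 0.03], [-0.02, 0.00]] * 10
toy_delta = cross_validate_delta(
    toy_days,
    policy=lambda training, delta: [1 - delta, delta],
    criterion=lambda returns: sum(returns) / len(returns),
)
```

In this toy case the second asset always earns more than the first, so the selected intensity sits at the upper end of the grid; with a risk-adjusted criterion the optimum is typically interior.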

3.6.2 Calibration of K and δ

Let us discuss the results from the calibration of the parameters K and δ. As one can observe in Table 3.1, the calibration method of Velicer (1976) systematically chooses K = 2 for the 6BTM and 6Prof datasets. This is sensible because they are driven by two dimensions: size and book-to-market or size and operating profitability. For the 30Ind dataset, which is a higher-dimensional dataset, the average K over the 70 rolling windows is 2.73, with a maximum of K = 5 in three windows. Note that these PCs are far from being Gaussian, supporting the identifiability assumption in Section 3.2.2 that at most one of the PCs is Gaussian. Specifically, the Gaussianity test of Jarque and Bera (1980) systematically rejects the Gaussianity of all K principal components for all estimation windows and datasets, with very low p-values.

Regarding the shrinkage intensity δ, calibrated via the 10-fold cross-validated modified Sharpe ratio, we observe in Table 3.1 that it is higher on average for the ICMV portfolio than for the PCMV portfolio. This indicates that the ICs provide a better FVP target portfolio than the PCs with regard to higher moments. For the ICMVaR portfolio, the optimal δ is quite low because the MMVaR portfolio already minimizes the denominator of the modified Sharpe ratio.

3.6.3 Results

Table 3.1 reports the out-of-sample performance of the six portfolio policies for the three datasets. First, we observe that the AVP portfolio does not compare favorably to the MV portfolio. For 6BTM it is outperformed in terms of Sharpe ratio and modified Sharpe ratio, for 6Prof it performs very similarly, and for 30Ind it improves tail risk but has a much worse Sharpe ratio and turnover. Thus, diversifying the portfolio-return variance across assets is not helpful.

$^{12}$The 10-fold cross-validation method works as follows. First, set δ = 0. Then, divide the estimation window into 10 distinct intervals of equal length. Select one interval, remove it from the estimation window, and use the remaining data to compute the portfolio policy. Use these portfolio weights to compute out-of-sample returns on the interval that was removed. After repeating this procedure for each of the 10 intervals, compute the modified Sharpe ratio on all the out-of-sample returns. Increase δ by 0.01 and repeat the process above. Finally, when δ = 1 is reached, select the value of δ associated with the maximum modified Sharpe ratio.

Table 3.1: Out-of-sample performance of factor-risk-parity portfolios (1/2)

6BTM dataset (average K = 2)

| | AVP | MV | ICMV | PCMV | MMVaR | ICMVaR |
|---|---|---|---|---|---|---|
| Mean | 19.92% (2.31%) | 19.75% (2.19%) | 21.39% (2.28%) | 20.84% (2.19%) | 23.50% (2.56%) | 23.93% (2.52%) |
| Volatility | 13.81% (0.35%) | 13.36% (0.34%) | 13.41% (0.27%) | 13.60% (0.32%) | 15.26% (0.26%) | 15.14% (0.25%) |
| Sharpe ratio | 1.44 (0.18) | 1.48 (0.18) | 1.59$^{***}$ (0.18) | 1.53 (0.17) | 1.54 (0.17) | 1.58$^{*}$ (0.18) |
| Skewness | -0.42 (0.51) | -0.32 (0.53) | -0.33 (0.31) | -0.23 (0.50) | -0.04 (0.27) | -0.16 (0.20) |
| Excess kurtosis | 21.39 (4.65) | 20.51 (4.95) | 11.53 (2.40) | 18.53 (4.60) | 8.72 (2.36) | 7.22 (1.39) |
| 1% modified Value-at-Risk | 6.51% (1.05%) | 6.08% (0.99%) | 4.33% (0.50%) | 5.75% (0.95%) | 4.13% (0.48%) | 3.84% (0.33%) |
| Modified Sharpe ratio ($\times 10^{2}$) | 1.21 (0.29) | 1.29 (0.29) | 1.96$^{**}_{**}$ (0.34) | 1.44$^{**}$ (0.33) | 2.26 (0.39) | 2.47$^{*}$ (0.38) |
| Portfolio turnover | 3.05% | 2.47% | 2.79% | 3.29% | 5.75% | 5.75% |
| Average shrinkage intensity δ | \ | 0.00 | 0.63 | 0.59 | 0.00 | 0.24 |

6Prof dataset (average K = 2)

| | AVP | MV | ICMV | PCMV | MMVaR | ICMVaR |
|---|---|---|---|---|---|---|
| Mean | 17.78% (2.41%) | 17.53% (2.42%) | 20.41% (2.57%) | 17.83% (2.46%) | 22.38% (2.69%) | 22.55% (2.54%) |
| Volatility | 14.13% (0.37%) | 13.98% (0.36%) | 14.83% (0.32%) | 14.41% (0.36%) | 16.14% (0.34%) | 15.60% (0.31%) |
| Sharpe ratio (a) | 1.26 (0.17) | 1.25 (0.17) | 1.38$^{***}_{***}$ (0.17) | 1.24 (0.17) | 1.39 (0.17) | 1.45 (0.18) |
| Skewness | -0.04 (0.64) | -0.03 (0.62) | -0.35 (0.42) | -0.04 (0.59) | -0.03 (0.46) | -0.25 (0.34) |
| Excess kurtosis | 22.09 (7.52) | 22.55 (7.98) | 14.86 (4.11) | 20.06 (6.90) | 14.66 (5.02) | 10.99 (3.12) |
| 1% modified Value-at-Risk | 6.63% (1.31%) | 6.64% (1.37%) | 5.53% (0.97%) | 6.32% (1.25%) | 5.78% (1.04%) | 4.88% (0.74%) |
| Modified Sharpe ratio ($\times 10^{2}$) (a) | 1.06 (0.31) | 1.05 (0.30) | 1.46$^{*}_{*}$ (0.37) | 1.12 (0.28) | 1.54 (0.37) | 1.83 (0.38) |
| Portfolio turnover | 3.23% | 2.44% | 2.93% | 3.09% | 5.16% | 5.00% |
| Average shrinkage intensity δ | \ | 0.00 | 0.71 | 0.39 | 0.00 | 0.23 |

(a) The source shows one further superscript $^{*}$ in this row, attached to either the PCMV or the ICMVaR figure.

We then compare the performance of the ICMV and PCMV portfolios to test whether the shrinkage portfolios based on ICs outperform those based on PCs. The results confirm that portfolios based on the ICs have superior out-of-sample performance. In terms of Sharpe ratio, the ICMV portfolio systematically outperforms the PCMV portfolio. The gains come mostly from a higher portfolio mean return. Regarding tail risk, the ICMV portfolio also outperforms the PCMV portfolio in terms of kurtosis, modified VaR and modified Sharpe ratio. This result highlights that diversifying variance with respect to ICs helps to reduce tail risk, in line with the predictions of Theorem 3.2. The improvement in Sharpe ratio is statistically significant for 6Prof and 30Ind, and the improvement in modified Sharpe ratio for 6BTM and 6Prof. Moreover, the ICMV portfolio has a lower turnover, which makes its implementation less costly.

Table 3.1: Out-of-sample performance of factor-risk-parity portfolios (2/2)

30Ind dataset (average K = 2.73)

| | AVP | MV | ICMV | PCMV | MMVaR | ICMVaR |
|---|---|---|---|---|---|---|
| Mean | 9.66% (2.31%) | 11.18% (1.95%) | 15.14% (2.49%) | 10.58% (2.19%) | 10.22% (2.27%) | 11.70% (2.28%) |
| Volatility | 13.34% (0.27%) | 11.71% (0.31%) | 14.60% (0.29%) | 12.92% (0.30%) | 13.29% (0.34%) | 14.14% (0.33%) |
| Sharpe ratio (a) | 0.72 (0.17) | 0.95 (0.17) | 1.04$_{**}$ (0.17) | 0.82 (0.16) | 0.77 (0.17) | 0.83 (0.17) |
| Skewness | -0.47 (0.45) | -0.74 (0.76) | -0.49 (0.42) | -0.56 (0.58) | -0.97 (0.75) | -0.78 (0.67) |
| Excess kurtosis | 13.45 (5.67) | 23.99 (12.56) | 12.95 (5.75) | 16.96 (8.76) | 21.59 (14.13) | 17.46 (11.20) |
| 1% modified Value-at-Risk | 4.78% (1.27%) | 6.06% (2.30%) | 5.11% (1.36%) | 5.32% (1.85%) | 6.43% (2.80%) | 5.97% (2.54%) |
| Modified Sharpe ratio ($\times 10^{2}$) | 0.80 (0.36) | 0.73 (0.46) | 1.18 (0.43) | 0.79 (0.43) | 0.63 (0.55) | 0.78 (0.56) |
| Portfolio turnover | 6.75% | 2.45% | 2.52% | 2.53% | 4.35% | 3.99% |
| Average shrinkage intensity δ | \ | 0.00 | 0.42 | 0.27 | 0.00 | 0.22 |

(a) The source shows one further superscript $^{*}$ in this row, attached to either the PCMV or the ICMVaR figure.

Notes. This table reports the out-of-sample performance of the six portfolio policies on the three datasets following the methodology in Section 3.6.1. The portfolio mean return, volatility and Sharpe ratio are annualized, while all other performance criteria are reported in daily terms. The shrinkage intensity δ is computed via 10-fold cross-validation, using the modified Sharpe ratio as the calibration criterion. The number of PCs, K, is selected via the minimum-average-partial-correlation method of Velicer (1976). Bold figures indicate the best strategy for each criterion. Figures in parentheses indicate the standard errors, computed via non-parametric bootstrap with Matlab bootstat function and B = 1000 bootstrap resamples. We evaluate the statistical significance of the difference in Sharpe ratio and the modified Sharpe ratio by computing the two-sided p-values via the circular bootstrapping methodology of Ledoit and Wolf (2008) for the Sharpe ratio and Ardia and Boudt (2015b) for the modified Sharpe ratio. The stars *, ** and *** mean that the p-value is less than 20%, 10% and 5%, respectively. We compare the following pairs of portfolios: PCMV vs MV (stars in superscript next to PCMV quantity), ICMV vs MV (stars in superscript next to ICMV quantity), ICMV vs PCMV (stars in subscript next to ICMV quantity) and ICMVaR vs MMVaR (stars in superscript next to ICMVaR quantity).

Comparing the ICMV and MV portfolios, we observe that there are clear benefits from shrinking the MV portfolio toward the minimum-variance IC-variance-parity portfolio. Naturally, the MV portfolio achieves a lower volatility, but the ICMV portfolio has a substantially higher mean return because the modified-Sharpe-ratio calibration criterion favors a higher mean return in the numerator. This results in a Sharpe ratio that is systematically larger for the ICMV portfolio. In terms of tail risk, the ICMV portfolio also outperforms the MV portfolio in terms of kurtosis, modified VaR and modified Sharpe ratio, as expected. The improvement in the Sharpe ratio and the modified Sharpe ratio is statistically significant for 6BTM and 6Prof, but not for 30Ind. This is to be expected because the optimal shrinkage intensity δ is, on average, lower for 30Ind. Lastly, the superior performance of the ICMV portfolio is achieved with a turnover that is only slightly higher than that of the MV portfolio, and thus, remains appealing in practice.

We then turn to the two portfolios based on the modified VaR risk measure. Just like for the variance risk measure, the results show that there are substantial benefits in shrinking the minimum-modified-VaR portfolio (MMVaR) toward a portfolio that is diversified with respect to ICs. Indeed, the resulting shrinkage portfolio (ICMVaR) has a superior Sharpe ratio and modified Sharpe ratio on all datasets. Notably, even though the MMVaR portfolio aims to minimize the modified VaR, the ICMVaR portfolio achieves a lower MVaR than the MMVaR portfolio out of sample. This indicates that it is less sensitive to estimation risk and that the parsimonious estimation procedure introduced in Section 3.5.1 works well. This is confirmed by the lower turnover of the ICMVaR portfolio compared to the MMVaR portfolio.

3.7 Conclusion

Factor-risk parity is an effective approach to portfolio diversification. Compared to asset-risk parity, it has the advantage of diversifying risk across uncorrelated factors, which improves diversification. However, we show that for any portfolio there is a basis of uncorrelated factors for which the portfolio is a factor-variance-parity portfolio (Theorem 3.1). Thus, this framework is arbitrary when decorrelation is the only criterion governing the choice of factor basis. The basis formed by the first principal components of asset returns is not more meaningful because, once dimension has been reduced, any further rotation of the principal components yields an equally uncorrelated basis. To circumvent these concerns, we propose using the independent components, which are the factors with least dependence. They offer three benefits: they discriminate among all the factor-risk-parity portfolios, they help to reduce the portfolio-return kurtosis (Theorem 3.2), and they provide a parsimonious way to deal with higher-moment risk measures. Empirically, we find that shrinking the minimum-risk portfolio toward the IC-risk-parity portfolio helps to improve the out-of-sample performance in terms of mean-variance and tail risk.

Although the introduced IC-risk-parity approach works well empirically and has been shown theoretically to be a natural way of reducing the portfolio-return kurtosis, it does not explicitly depend on the investor's preferences. This contrasts with the standard expected-utility approach favored by academics in which the investor's preferences are captured via a utility function.

In Chapter 4, we consider more standard moment-based portfolios, such as those based on expected utility via a Taylor series expansion, and exploit the parsimonious estimation procedure of Section 3.5.1 to estimate them in a robust way. Then, in Chapter 5, we consider an approach that is similar in spirit to expected utility (that is, capturing all investors' preferences via a single function), but instead of a utility function we specify those preferences via a target-return distribution, which is a more intuitive specification when higher moments matter.

3.8 Appendix

This appendix contains four sections. Section 3.8.1 gives proofs of all results in the chapter. Section 3.8.2 describes the FastICA algorithm. Section 3.8.3 studies the theoretical properties of IC-kurtosis-parity portfolios. Section 3.8.4 discusses the performance of the long-only factor-variance-parity portfolios.

3.8.1 Proofs of results

3.8.1.1 Proposition 3.1

Proof. Let us prove the results one by one:

(i) From Definition 3.4, the FVP portfolios are obtained for $\tilde w_i(R)^2 = c$ for all $i$ and some constant $c \in \mathbb{R}_0$ that ensures that $\tilde w(R) \in \widetilde{W}(R)$. This can be rewritten as $\tilde w(R) = c\,1_K^{\pm}$, which, from (3.7), translates into portfolio weights given by $w = c\,V\Lambda^{-1/2}R'1_K^{\pm}$. After normalization, $c$ vanishes and the final expression for the FVP portfolios in (3.11) is obtained. There are $2^{K}$ different vectors $1_K^{\pm}$, which, combined with the normalization constraint, means that there are $2^{K-1}$ FVP portfolios.

(ii) The set of FVP portfolios $W_{FVP}(R)$ is invariant to a change of sign and permutation because, from (3.8), changing the sign or permuting the factors $Y$ changes the factor exposures $\tilde w(R)$ in the same way, so that the reduced portfolio return remains identical.

(iii) Given some vector $1_K^{\pm}$, the variance of the FVP portfolio return is given by

$$\sigma_P^2 = \frac{(1_K^{\pm})'R\Lambda^{-1/2}V'\,V_N\Lambda_N V_N'\,V\Lambda^{-1/2}R'1_K^{\pm}}{\big(1'V\Lambda^{-1/2}R'1_K^{\pm}\big)^2}, \tag{3.31}$$

where $V_N$ and $\Lambda_N$ are the full $N \times N$ eigenvectors and eigenvalues matrices. $V'V_N$ is given by the first $K$ rows of $I_N$, and thus $V'V_N\Lambda_N$ is made of the first $K$ rows of $\Lambda_N$. Multiplying $V'V_N\Lambda_N$ by $V_N'V$, the first $K$ columns of $I_N$, thus gives

$$V'V_N\Lambda_N V_N'V = \Lambda.$$

As a result, (3.31) simplifies to

$$\sigma_P^2 = \frac{(1_K^{\pm})'1_K^{\pm}}{\big(1'V\Lambda^{-1/2}R'1_K^{\pm}\big)^2} = \frac{K}{\big(1'V\Lambda^{-1/2}R'1_K^{\pm}\big)^2}. \tag{3.32}$$

Clearly, the vector $1_K^{\pm}$ that minimizes (3.32) is unique and corresponds to

$$1_K^{\pm} = \mathrm{sign}\big((1'V\Lambda^{-1/2}R')'\big) = \mathrm{sign}\big(R\Lambda^{-1/2}V'1\big) =: 1_K^{MV}.$$

3.8.1.2 Theorem 3.1

Proof. To prove the theorem, we need to show that given any portfolio $w \in W$, there exists $R \in SO(N)$ for which $w$ can be written in the form

$$w = \frac{V\Lambda^{-1/2}R'1^{\pm}}{1'V\Lambda^{-1/2}R'1^{\pm}}. \tag{3.33}$$

This amounts to showing that given any non-zero vector $v \in \mathbb{R}^N$, there exists $R \in SO(N)$ for which

$$R\Lambda^{1/2}V'v = c\,1^{\pm} \tag{3.34}$$

holds for some $c \in \mathbb{R}_0$. The corresponding portfolio $w$ is then obtained by normalizing $v$. We have that $\|1^{\pm}\| = \sqrt{N}$, and the scaling coefficient $c$ given by

$$c = \frac{\|\Lambda^{1/2}V'v\|}{\sqrt{N}}$$

means that (3.34) can be written as

$$Rx = y, \tag{3.35}$$

where $x := \Lambda^{1/2}V'v / \|\Lambda^{1/2}V'v\|$ and $y := 1^{\pm}/\sqrt{N}$ are unit-norm vectors. The claim then follows from the fact that $SO(N)$ acts transitively on the set of unit-norm $N$-dimensional vectors. More explicitly, given two arbitrary unit-norm vectors $x, y \in \mathbb{R}^N$, there exists a rotation matrix $R \in SO(N)$ such that $Rx = y$. To see this, notice that a rotation around the origin is merely a specific direct isometry. But every isometry in $\mathbb{R}^N$ can be written as a composite of at most $N + 1$ reflections; see Theorem A.1.5 of Pressley (2010, p.387). Actually, $R$ can be built as a sequence of two reflections: the first reflection $S$ is done with respect to the hyperplane orthogonal to $x + y$, and maps $x$ to $-y$. The second reflection $Q$ (for instance, the one with respect to the hyperplane orthogonal to $y$) maps $-y$ to $y$. Finally, because a reflection is an indirect isometry, the composition of these two reflections is a direct isometry, and $R = QS$ is the rotation matrix mapping $x$ to $y = Rx$.
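The two-reflection construction in the proof can be verified numerically. The sketch below builds $S$ and $Q$ as Householder reflections and forms $R = QS$; the choice of $Q$ as the reflection with respect to the hyperplane orthogonal to $y$ is one convenient option, the helper names are hypothetical, and the construction assumes $x \neq -y$ so that $x + y$ is non-zero.

```python
import math


def reflection(normal):
    """Reflection across the hyperplane orthogonal to `normal`."""
    n = len(normal)
    norm_sq = sum(v * v for v in normal)
    return [[(1.0 if i == j else 0.0) - 2.0 * normal[i] * normal[j] / norm_sq
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, v):
    """Product of a matrix with a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]


def rotation_mapping(x, y):
    """Rotation R = QS with Rx = y, for unit vectors x and y with x != -y."""
    s = reflection([xi + yi for xi, yi in zip(x, y)])  # maps x to -y
    q = reflection(y)                                   # maps -y to y
    return matmul(q, s)


# Example with N = 3: map a unit vector x onto y = 1^{+-} / sqrt(N).
x = [0.6, 0.0, 0.8]
y = [1 / math.sqrt(3), -1 / math.sqrt(3), 1 / math.sqrt(3)]
R = rotation_mapping(x, y)
```

The product of two reflections has determinant $(-1)^2 = 1$ and is orthogonal, so `R` is indeed a rotation.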

3.8.1.3 Theorem 3.2

Proof. This theorem assumes that the asset returns are given by the linear-mixture model

$$X = AS, \tag{3.36}$$

where $A$ is a full-rank $N \times K$ matrix, $I(S) = 0$ and $\kappa_{S_i} > 0$ for all $i$. In this setting, we have from Comon (1994) that the ICs $Y^\dagger$ are independent and correspond to $S$, ignoring the irrelevant sign-and-permutation indeterminacy. As a result, we have that $A = V\Lambda^{1/2}R^{\dagger\prime}$ and the IC exposures are $\tilde w(R^\dagger) = A'w$. Thus, the reduced portfolio return $\hat P$ corresponds to the portfolio return $P$ in this setup:

$$\hat P = \tilde w(R^\dagger)'Y^\dagger = w'AY^\dagger = w'AS = w'X = P. \tag{3.37}$$

This result is important for the proofs below. For ease of exposition, we denote by $\kappa_i := \kappa_{Y_i^\dagger}$ the excess kurtosis of the $i$th IC. Let us prove the results one by one:

(i) The excess kurtosis of the portfolio return $P = \hat P = \tilde w(R^\dagger)'Y^\dagger$ is found from the excess kurtosis of a weighted sum of independent random variables:

$$\kappa_P = \frac{\sum_{i=1}^{K} \kappa_i\,\tilde w_i(R^\dagger)^4}{\Big(\sum_{i=1}^{K} \tilde w_i(R^\dagger)^2\Big)^2} = \frac{\sum_{i=1}^{K} \kappa_i z_i^2}{\Big(\sum_{i=1}^{K} z_i\Big)^2}, \tag{3.38}$$

where $z_i := \tilde w_i(R^\dagger)^2 \ge 0$. We begin by noting two things. First, the sum $\sum_{i=1}^{K} z_i \neq 0$, otherwise the portfolio-return variance $\sigma_P^2$ vanishes. Second, the vector of $z_i$'s can only be identified up to a multiplicative constant as, from (3.38), it results in the same excess kurtosis $\kappa_P$. Therefore, without loss of generality, we normalize the $z_i$'s so that $\sum_{i=1}^{K} z_i = 1$. In turn, the portfolio-return excess kurtosis simplifies to

$$\kappa_P = \sum_{i=1}^{K} \kappa_i z_i^2 = z'\mathbf{K}z, \tag{3.39}$$

where $z = (z_1, \dots, z_K)'$ and $\mathbf{K} := \mathrm{diag}(\kappa_1, \dots, \kappa_K)$. Let us now find the global minimum and maximum of the excess kurtosis $\kappa_P$.

• To find the global minimum of the excess kurtosis $\kappa_P$, we need to solve the following quadratic optimization program:

$$\min_z\ z'\mathbf{K}z \quad \text{subject to} \quad 1_K'z = 1,\ z \ge 0. \tag{3.40}$$

This is a convex problem with unique solution similar to that of the minimum-variance portfolio: $z = \mathbf{K}^{-1}1_K / (1_K'\mathbf{K}^{-1}1_K)$. As desired, $z$ is positive as we assume that the ICs have strictly positive excess kurtosis. The obtained minimum excess kurtosis $\kappa_P$ corresponds indeed to $\kappa_{\min}$ in (3.13):

$$\min \kappa_P = \frac{1}{1_K'\mathbf{K}^{-1}1_K} = \frac{1}{\sum_{i=1}^{K} 1/\kappa_i} = \frac{1}{K}\,H(\kappa_1, \dots, \kappa_K) =: \kappa_{\min}.$$

• To find the maxima of the excess kurtosis $\kappa_P$, we need to solve the following quadratic optimization program:

$$\max_z\ z'\mathbf{K}z \quad \text{subject to} \quad 1_K'z = 1,\ z \ge 0. \tag{3.41}$$

Given that $z'\mathbf{K}z$ is a convex function in $z$, all maxima correspond to corner solutions of the type $z_i = 1$ and $z_j = 0$ for all $j \neq i$, with excess kurtosis $\kappa_P = \kappa_i$. Therefore, the obtained global maximum corresponds indeed to $\kappa_{\max}$ in (3.14):

$$\max \kappa_P = \max(\kappa_1, \dots, \kappa_K) =: \kappa_{\max}.$$

(ii) The ICVP portfolio return is given by $P = \hat P = c\,(1_K^{\pm})'Y^\dagger$ and the resulting excess kurtosis in (3.38) is independent of $c$ and the particular vector $1_K^{\pm}$ chosen, and is given by

$$\kappa_P = \frac{\sum_{i=1}^{K} \kappa_i}{K^2} = \frac{1}{K}\,A(\kappa_1, \dots, \kappa_K) =: \kappa_{IC}.$$

(iii) Consider the factors $Y$ given by an arbitrary rotation $R$ of the PCs, $Y = RY^\star$. Our purpose is to show that the FVP portfolio associated with $Y$ depends on the mixing matrix $A$ associated with the linear-mixture model $X = AS$ in (3.36), except in the case corresponding to the rotation associated with the IC basis; that is, except when $R \in \mathcal{R}_I$ in (3.4). To show this, we first need to show that any portfolio return $P$ is a FVP portfolio return for some rotation of the ICs. That is, that there exists $\tilde R \in SO(K)$ such that

$$P = \hat P = \tilde w(R^\dagger)'Y^\dagger = c\,(1_K^{\pm})'\tilde R\,Y^\dagger \tag{3.42}$$

holds for some $c \in \mathbb{R}_0$. This is equivalent to showing that $\tilde R A'w = c\,1_K^{\pm}$, which holds by the same argument as in Theorem 3.1 by setting $c = \|\tilde R A'w\| / \sqrt{K}$.

Suppose now that the mixture model in the problem at hand is of the form $A = R_1 D R'\tilde R$, where $R_1$ and $D$ are arbitrary rotation and full-rank diagonal matrices, respectively. That is, $A$ is such that $R^\dagger = \tilde R'R$. Then, it is easy to see that the factors $Y$ correspond to $Y = \tilde R\,Y^\dagger$. Therefore, in this specific case, one obtains from (3.42):

$$P = c\,(1_K^{\pm})'\tilde R\,Y^\dagger = c\,(1_K^{\pm})'\tilde R\tilde R'Y = c\,(1_K^{\pm})'Y.$$

In other words, $P$ is a FVP portfolio return associated with the factors $Y$. Therefore, depending on the ICA rotation matrix $R^\dagger$ hidden behind the asset-return generating process $A$, the FVP portfolio return associated with the factors $Y$ could be any portfolio return and, consequently, could have any attainable kurtosis value. This is a fundamental difference with the ICVP portfolio returns, which are independent of $A$. Indeed, only the $P$ corresponding to the special case $\tilde R = I_K$ (up to a change of sign and permutation of the rows) are ICVP portfolio returns, and all of those $2^{K-1}$ portfolio returns have the same excess kurtosis in (3.15).
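The three kurtosis levels derived in parts (i) and (ii) of the proof can be illustrated on a small example. The sketch below evaluates (3.38) for a given vector of IC exposures and compares it with $\kappa_{\min}$, $\kappa_{IC}$ and $\kappa_{\max}$; the function names and the IC excess kurtosis values are hypothetical.

```python
def portfolio_excess_kurtosis(exposures, ic_kurtosis):
    """Excess kurtosis (3.38) of a weighted sum of independent ICs."""
    num = sum(k * w**4 for w, k in zip(exposures, ic_kurtosis))
    den = sum(w**2 for w in exposures) ** 2
    return num / den


def kurtosis_levels(ic_kurtosis):
    """Return (kappa_min, kappa_IC, kappa_max) for positive IC kurtoses."""
    n = len(ic_kurtosis)
    k_min = 1.0 / sum(1.0 / k for k in ic_kurtosis)  # harmonic mean over K
    k_ic = sum(ic_kurtosis) / n**2                   # arithmetic mean over K
    k_max = max(ic_kurtosis)
    return k_min, k_ic, k_max


# Hypothetical excess kurtoses of K = 3 independent components.
kappas = [2.0, 4.0, 8.0]
k_min, k_ic, k_max = kurtosis_levels(kappas)

# Squared exposures proportional to 1 / kappa_i attain the minimum.
min_exposures = [(1.0 / k) ** 0.5 for k in kappas]
# Equal absolute exposures (any signs) give the ICVP kurtosis.
icvp_exposures = [1.0, -1.0, 1.0]
```

Since the harmonic mean never exceeds the arithmetic mean, $\kappa_{\min} \le \kappa_{IC} \le \kappa_{\max}$ whenever all IC excess kurtoses are positive.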

3.8.1.4 Excess kurtosis of PCVP portfolios in Example 3.5

Proof. We give the proof for the signs of PC exposures equal to $1_2^{\pm} = 1_2$. The proof is similar for the other choice of signs, $1_2^{\pm} = (1, -1)'$. Under the assumptions of Theorem 3.2, the PCVP portfolio return is given by

$$P = c\,1_2'R(\theta^\dagger)'Y^\dagger = c\,(\cos\theta^\dagger - \sin\theta^\dagger)\,Y_1^\dagger + c\,(\cos\theta^\dagger + \sin\theta^\dagger)\,Y_2^\dagger,$$

where $c \in \mathbb{R}_0$ is an irrelevant constant. Then, from (3.38), the return excess kurtosis of the PCVP portfolio is given by

$$\kappa_P = \frac{(\cos\theta^\dagger - \sin\theta^\dagger)^4\,\kappa_{Y_1^\dagger} + (\cos\theta^\dagger + \sin\theta^\dagger)^4\,\kappa_{Y_2^\dagger}}{\big((\cos\theta^\dagger - \sin\theta^\dagger)^2 + (\cos\theta^\dagger + \sin\theta^\dagger)^2\big)^2},$$

which after algebraic manipulations yields the final expression with minus sign in (3.16).

3.8.1.5 Proposition 3.2

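Proof. To prove the proposition, we use the well-known fact that cumulants are additive for independent random variables. The third and fourth cumulants of the reduced portfolio return $\hat P = \tilde w(R^\dagger)'Y^\dagger$ correspond to $m_{3,\hat P}$ and $m_{4,\hat P} - 3\sigma_{\hat P}^4$, respectively. Therefore, under the independence of the ICs, we directly find the third central moment of $\hat P$:

$$m_{3,\hat P} = \sum_{i=1}^{K} m_{3,\tilde w_i(R^\dagger)Y_i^\dagger} = \sum_{i=1}^{K} \tilde w_i(R^\dagger)^3\, m_{3,Y_i^\dagger}.$$

Similarly, the fourth central moment of $\hat P$ is found as

$$\begin{aligned}
& m_{4,\hat P} - 3\sigma_{\hat P}^4 = \sum_{i=1}^{K} \Big( m_{4,\tilde w_i(R^\dagger)Y_i^\dagger} - 3\sigma^4_{\tilde w_i(R^\dagger)Y_i^\dagger} \Big) \\
\Leftrightarrow\ & m_{4,\hat P} = \sum_{i=1}^{K} \tilde w_i(R^\dagger)^4\, m_{4,Y_i^\dagger} - 3\sum_{i=1}^{K} \tilde w_i(R^\dagger)^4 + 3\Big(\sum_{i=1}^{K} \tilde w_i(R^\dagger)^2\Big)^2 \\
\Leftrightarrow\ & m_{4,\hat P} = \sum_{i=1}^{K} \tilde w_i(R^\dagger)^4\, m_{4,Y_i^\dagger} - 3\sum_{i=1}^{K} \tilde w_i(R^\dagger)^4 + 3\sum_{i=1}^{K}\sum_{j=1}^{K} \big(\tilde w_i(R^\dagger)\,\tilde w_j(R^\dagger)\big)^2 \\
\Leftrightarrow\ & m_{4,\hat P} = \sum_{i=1}^{K} \tilde w_i(R^\dagger)^4\, m_{4,Y_i^\dagger} + 3\sum_{i=1}^{K}\sum_{j\neq i} \big(\tilde w_i(R^\dagger)\,\tilde w_j(R^\dagger)\big)^2.
\end{aligned}$$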
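The identities in the proof can be checked exactly on a toy example. The sketch below assumes three independent ICs with small discrete distributions, standardized to zero mean and unit variance (the unit variance is what makes the $3\sigma^4$ terms reduce to sums of exposures), and compares the moments from (3.20) and (3.21) with those obtained by enumerating the joint distribution. The distributions, exposures and function names are hypothetical.

```python
from itertools import product


def standardize(values, probs):
    """Rescale a discrete distribution to zero mean and unit variance."""
    mean = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    return [(v - mean) / var**0.5 for v in values]


def central_moment(values, probs, order):
    """Central moment of a discrete distribution."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** order for v, p in zip(values, probs))


def moments_by_formula(w, ics):
    """Third and fourth central moments from (3.20) and (3.21)."""
    k = len(w)
    m3 = sum(wi**3 * central_moment(v, p, 3) for wi, (v, p) in zip(w, ics))
    m4 = sum(wi**4 * central_moment(v, p, 4) for wi, (v, p) in zip(w, ics))
    m4 += 3 * sum((w[i] * w[j]) ** 2
                  for i in range(k) for j in range(k) if j != i)
    return m3, m4


def moments_by_enumeration(w, ics):
    """Same moments from the exact joint distribution of independent ICs."""
    outcomes, weights = [], []
    for combo in product(*[list(zip(v, p)) for v, p in ics]):
        outcomes.append(sum(wi * val for wi, (val, _) in zip(w, combo)))
        prob = 1.0
        for _, p in combo:
            prob *= p
        weights.append(prob)
    return (central_moment(outcomes, weights, 3),
            central_moment(outcomes, weights, 4))


# Hypothetical skewed, fat-tailed discrete ICs and IC exposures.
raw = [([-1.0, 0.0, 3.0], [0.3, 0.5, 0.2]),
       ([-2.0, 1.0], [0.25, 0.75]),
       ([-1.0, 0.0, 1.0, 4.0], [0.2, 0.4, 0.3, 0.1])]
ics = [(standardize(v, p), p) for v, p in raw]
w = [0.5, -0.3, 0.8]
m3_formula, m4_formula = moments_by_formula(w, ics)
m3_exact, m4_exact = moments_by_enumeration(w, ics)
```

The formula needs only $2K$ IC moments, whereas the enumeration touches every joint outcome; the two agree up to floating-point error when the ICs are truly independent.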

3.8.1.6 Proposition 3.3

Proof. Boudt et al. (2008) show that the MVaR contribution of the $i$th asset $X_i$ is given by

$$\mathrm{MVaR}_{X_i|P}(\varepsilon) := w_i \times \bigg[ -\mu_{X_i} - \frac{\partial_i \sigma_P^2}{2\sigma_P}\big(z_\varepsilon + p_1(z_\varepsilon) + p_2(z_\varepsilon)\big) + \sigma_P\Big( -p_1(z_\varepsilon)\,\partial_i \zeta_P - \frac{1}{24}(z_\varepsilon^3 - 3z_\varepsilon)\,\partial_i \kappa_P + \frac{1}{18}(2z_\varepsilon^3 - 5z_\varepsilon)\,\zeta_P\,\partial_i \zeta_P \Big) \bigg]. \tag{3.43}$$

The MVaR contribution of the $i$th IC $Y_i^\dagger$ is equivalent to (3.43) but with respect to the reduced portfolio return $\hat P = \tilde w(R^\dagger)'Y^\dagger$ and under the assumption that the ICs are independent.

3.8.2 FastICA algorithm

The theory supporting the foundations of FastICA is built upon the independence assumption; that is, the ICs are truly independent ($I(Y^\dagger) = 0$). Thus, we present the FastICA algorithm under this assumption, keeping in mind that its practical application does not require this assumption to hold. In the non-independent case, the algorithm still delivers a special basis, associated with the linear combination of asset returns that minimizes mutual information. Most results below are covered in more detail in the textbook of Hyvärinen et al. (2001).

Computing the ICs requires finding the rotation matrix such that the output factors have zero mutual information. In contrast with PCA, ICA needs to be achieved in an adaptive way, using gradient-descent techniques for instance. This triggers computational issues since mutual information in (3.2) requires the joint distribution of the factors $Y$, which would need to be estimated at every step of the algorithm. Fortunately, as explained in Section 3.2.2, minimizing the mutual information is equivalent to minimizing the sum of the Shannon entropies of factors, $\sum_{i=1}^{K} H(R_i Y^\star)$, a criterion that only features the marginal densities of $Y_1, \ldots, Y_K$.

Another standard formulation of mutual information consists of using negentropy, defined as

$$J(X) := H[\mathcal{N}(0,1)] - H(X). \tag{3.44}$$

Because this criterion is positive and vanishes if and only if $X \sim \mathcal{N}(0,1)$ whenever $X$ is a standard random variable with support on the real line, it is easy to see that maximizing $\sum_{i=1}^{K} J(R_i Y^\star)$ is equivalent to minimizing $\sum_{i=1}^{K} H(R_i Y^\star)$ over $R \in SO(K)$. This provides another intuitive interpretation of ICA in terms of non-Gaussianity. From the central limit theorem, it is known that the asymptotic distribution of a mixture of independent random variables is Gaussian. The negentropy formulation suggests that the independent (assumed non-Gaussian) factors can be recovered by maximizing a non-Gaussianity measure (here, negentropy) applied to every output $Y_i = R_i Y^\star$.

Interestingly, it has been proven, by relying on a truncated development of the density around the standard Gaussian density using a set of orthogonal basis functions (a generalization of the Gram-Charlier expansion, but using more general functions than simple monomials), that many “non-Gaussianity measures” can serve as a criterion to perform ICA. The fact that they correspond to a crude negentropy estimator is not an issue since it is possible to show that replacing $J(X)$ by

$$\hat{J}_m(X) := \sum_{i=1}^{m} c_i \left( E^2[G_i(\mathcal{N}(0,1))] - E^2[G_i(X)] \right),$$

for a set of functions $G_i$ satisfying some technical conditions, leads to a criterion $\sum_{i=1}^{K} \hat{J}(R_i Y^\star)$ with stationary points on the set $\{R \in SO(K) \mid I(RY^\star) = 0\}$. In fact, it can even be shown that the restriction to $m = 1$ (and $c_1 = 1$), leading to

$$\hat{J}(X) := \hat{J}_1(X) = E^2[G(\mathcal{N}(0,1))] - E^2[G(X)], \tag{3.45}$$

is enough to perform ICA if all ICs have a positive excess kurtosis, in the sense that the stationary points of the above criterion include the ICs provided that $G$ satisfies some conditions. For instance, taking $G(x) = x^4$ corresponds to using the kurtosis as non-Gaussianity measure. Even though it is arguably a crude estimator of negentropy, it is theoretically proven that maximizing the sum of the squares of the output kurtoses allows one to recover the ICs; see Delfosse and Loubaton (1995). In practice however, neither the criterion corresponding to the actual negentropy in (3.44) nor that associated with the kurtosis is an appealing choice. Indeed, the first one requires the estimation of the pdf of each factor $Y_i$ at every iteration, and the second suffers from obvious robustness issues when estimated from samples. When all ICs have a positive excess kurtosis, it has been suggested to use $\hat{J}$ in (3.45) with $G(x) := \log\cosh x$. This function $G$ satisfies the technical conditions ensuring that the gradient of $\sum_{i=1}^{K} \hat{J}(R_i Y^\star)$ vanishes when $RY^\star = Y^\dagger$ (up to a change of sign and permutation, as usual), and leads to a very robust algorithm. This choice corresponds to the default setup of the FastICA algorithm developed by Hyvärinen (1999), the most popular algorithm to perform ICA. This approach has been successfully used in many applications, from the analysis of hyperspectral data to biomedical engineering, and thus, is used in the empirical part of this chapter.
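To make the criterion (3.45) concrete, the sketch below evaluates it with G(x) = log cosh x for two unit-variance densities: the standard Gaussian, for which it vanishes by construction, and a Laplace density, which has positive excess kurtosis. This is only an illustration under stated assumptions, not the FastICA implementation: expectations are approximated by a midpoint rule on a truncated grid, and the function names, the integration range and the choice of the Laplace density are ours.

```python
import math

def log_cosh(x):
    # Numerically stable evaluation of log(cosh(x)).
    ax = abs(x)
    return ax + math.log1p(math.exp(-2.0 * ax)) - math.log(2.0)

def expectation(g, pdf, lo=-40.0, hi=40.0, cells=80_000):
    # Midpoint-rule approximation of E[g(X)] for a density `pdf` (assumed grid).
    h = (hi - lo) / cells
    total = 0.0
    for i in range(cells):
        x = lo + (i + 0.5) * h
        total += g(x) * pdf(x)
    return total * h

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def laplace_pdf(x):
    # Laplace density rescaled to unit variance (excess kurtosis equal to 3).
    b = 1.0 / math.sqrt(2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)

def contrast(pdf, g=log_cosh):
    # Criterion (3.45): E^2[G(N(0,1))] - E^2[G(X)].
    return expectation(g, gaussian_pdf) ** 2 - expectation(g, pdf) ** 2

j_gauss = contrast(gaussian_pdf)
j_laplace = contrast(laplace_pdf)
```

In this setting the heavy-tailed density yields a strictly positive value, in line with the restriction to ICs with positive excess kurtosis made in the text.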

Finally, one may wonder whether the choice of the criterion used to perform ICA affects the existence of potential spurious local optima, that is, local optima that do not correspond to the ICs. The choice of the criterion does indeed matter. For instance, it has been shown that the actual negentropy criterion exhibits spurious minima when the independent factors are strongly multimodal; see Vrins and Verleysen (2005). In contrast, the kurtosis criterion is, in theory, free from spurious solutions: all stationary points of the criterion agree with the recovery of the ICs (Delfosse and Loubaton 1995). This result is however purely theoretical, and is of little practical interest given the large estimation errors resulting from the fourth power entering the kurtosis definition. The criterion optimized in FastICA is very appealing in this respect: choosing a smooth non-quadratic function $G$ that does not grow too fast, like $G(x) = \ln\cosh x$, gives a robust algorithm, a feature that has been extensively studied with the help of simulated data. For further details about spurious solutions of ICA algorithms, we refer to Vrins (2007).

3.8.3 Theoretical properties of IC-kurtosis-parity portfolios

In this section, we study the theoretical properties of the IC-kurtosis-parity (ICKP) portfolios, which are the portfolios for which each of the $K$ independent components contributes equally to the excess kurtosis of the portfolio returns. As for the modified VaR studied in Section 3.5, we assume that the ICs are independent. We also assume that all ICs have a positive excess kurtosis, $\kappa_{Y_i^\dagger} > 0$ for all $i$.

Because the ICs are assumed independent, we have from (3.38) that the excess kurtosis of the reduced portfolio return $\hat{P} = \tilde{w}(R^\dagger)' Y^\dagger$ is given by

$$\kappa_{\hat{P}} = \frac{\sum_{i=1}^{K} \kappa_{Y_i^\dagger}\, \tilde{w}_i(R^\dagger)^4}{\left(\sum_{i=1}^{K} \tilde{w}_i(R^\dagger)^2\right)^2}, \tag{3.46}$$

and thus, the ICKP portfolios are obtained for $\kappa_{Y_i^\dagger}\, \tilde{w}_i(R^\dagger)^4 = c$ for all $i$ and some constant $c \in \mathbb{R}_0$. Because the excess kurtoses of the ICs are assumed positive, this gives the IC exposures

$$\tilde{w}(R^\dagger) = c\,\mathcal{K}^{-1/4}\, 1_K^{\pm} \quad \text{with} \quad \mathcal{K} := \mathrm{diag}\left(\kappa_{Y_1^\dagger}, \ldots, \kappa_{Y_K^\dagger}\right). \tag{3.47}$$

In turn, just like for the variance risk measure, there are $2^{K-1}$ ICKP portfolios, given by the set

$$W_{ICKP} := \left\{ w \in W \;\middle|\; w = \frac{V \Lambda^{-1/2} (R^\dagger)' \mathcal{K}^{-1/4}\, 1_K^{\pm}}{1' V \Lambda^{-1/2} (R^\dagger)' \mathcal{K}^{-1/4}\, 1_K^{\pm}} \right\}. \tag{3.48}$$

Moreover, the ICKP portfolio with minimum variance corresponds to the following choice of signs: $1_K^{\pm} = \mathrm{sign}\left(\mathcal{K}^{-1/4} R^\dagger \Lambda^{-1/2} V' 1\right)$.

Finally, under the assumptions of Theorem 3.2, we find by plugging the condition $\kappa_{Y_i^\dagger}\, \tilde{w}_i(R^\dagger)^4 = c$ for all $i$ in (3.46) that the excess kurtosis of each of the $2^{K-1}$ ICKP portfolios is the same and is given by

$$\kappa_{ICK} = K \left( \sum_{i=1}^{K} \kappa_{Y_i^\dagger}^{-1/2} \right)^{-2}.$$

Moreover, the excess kurtosis of the ICKP portfolios, $\kappa_{ICK}$, is always lower than that of the ICVP portfolios in (3.15), $\kappa_{IC} = \sum_{i=1}^{K} \kappa_{Y_i^\dagger} / K^2$. To show this, let $\kappa := \left(\kappa_{Y_1^\dagger}, \ldots, \kappa_{Y_K^\dagger}\right)$. We have $\kappa_{IC} = \frac{1}{K} M_1(\kappa)$ and $\kappa_{ICK} = \frac{1}{K} M_{-1/2}(\kappa)$, where $M_p(x) := \left(\frac{1}{K}\sum_{i=1}^{K} x_i^p\right)^{1/p}$. The statement $\kappa_{ICK} \leq \kappa_{IC}$ then follows from the generalized-mean inequality, which states that $M_p(x)$ increases with $p$ for positive $x_i$'s. For example, for $K = 3$ and $\kappa = (3, 5, 7)$, we have $\kappa_{\min} = 1.48$, $\kappa_{ICK} = 1.53$ and $\kappa_{IC} = 1.67$.
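The numerical example can be reproduced with the short sketch below, which evaluates (3.46) for three sets of IC exposures with excess kurtoses (3, 5, 7) and K = 3. The helper names are ours, and the minimum-kurtosis exposures (squared exposures proportional to the inverse excess kurtoses) are an assumption obtained by minimizing (3.46) directly; they are used only to recover the quoted value of 1.48.

```python
def generalized_mean(x, p):
    # Power mean M_p(x) = ((1/K) * sum_i x_i^p)^(1/p), for p != 0 and positive x_i.
    return (sum(xi ** p for xi in x) / len(x)) ** (1.0 / p)

def portfolio_excess_kurtosis(kappa, exposures):
    # Equation (3.46) for independent, unit-variance ICs.
    num = sum(k * w ** 4 for k, w in zip(kappa, exposures))
    den = sum(w ** 2 for w in exposures) ** 2
    return num / den

kappa = [3.0, 5.0, 7.0]
K = len(kappa)

icvp = [1.0] * K                        # equal variance contributions
ickp = [k ** -0.25 for k in kappa]      # equal kurtosis contributions, as in (3.47)
min_kurt = [k ** -0.5 for k in kappa]   # assumed minimizer of (3.46)

kappa_ic = portfolio_excess_kurtosis(kappa, icvp)
kappa_ick = portfolio_excess_kurtosis(kappa, ickp)
kappa_min = portfolio_excess_kurtosis(kappa, min_kurt)

# The same quantities through the generalized-mean representation.
kappa_ic_mean = generalized_mean(kappa, 1.0) / K
kappa_ick_mean = generalized_mean(kappa, -0.5) / K
```

Because (3.46) is invariant to the scale and signs of the exposures, the constant c and the sign vector play no role in these values.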

3.8.4 Long-only factor-variance-parity portfolios

We now report the out-of-sample performance of factor-variance-parity (FVP) portfolios, based on PCA and ICA, under the no-short-selling constraint. Preventing short-selling is an important consideration in practice because many investors are not allowed to short assets and because it accounts for estimation risk (Jagannathan and Ma 2003).

Let us explain how we compute the long-only FVP portfolios. As shown by Roncalli and Weisang (2015), when short-selling is not allowed there may not exist a portfolio for which factor-variance parity is perfectly achieved. Therefore, we compute the long-only FVP portfolio in two steps. We consider here factors $Y$ given by some rotation of the PCs, $Y = RY^\star$.

1. We compute the maximum factor-variance diversification that can be achieved by computing the maximum inverse Herfindahl index under no short-selling:

$$\max_{w \in W} H[p(R)] \quad \text{subject to} \quad w_i \geq 0 \;\; \forall i,$$

where $p_i(R) := \tilde{w}_i(R)^2 / \|\tilde{w}(R)\|^2$ is the relative variance contribution of the ith factor $Y_i$ and $H[p] := \|p\|^{-2} \in [1, K]$ is the inverse Herfindahl index, which equals $K$ when $p_i(R) = 1/K$ for all $i$. This gives the maximum inverse Herfindahl index $H_{\max}$, which can be lower than $K$.

2. To be consistent with our proposal of relying on the FVP portfolio with minimum variance, we compute the long-only FVP portfolio as follows:

$$\min_{w \in W} w' \Sigma w \quad \text{subject to} \quad H[p(R)] \geq H_{\max} - \epsilon, \quad w_i \geq 0 \;\; \forall i,$$

where $\epsilon$ is a tolerance parameter set to $\epsilon = 5 \times 10^{-4}$.
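The diversification measure used in both steps can be illustrated with a few lines of Python. The sketch only computes the relative variance contributions and the inverse Herfindahl index for given factor exposures; the exposure vectors are hypothetical, and the constrained optimizations themselves, which require a nonlinear solver, are not reproduced here.

```python
def variance_contributions(exposures):
    # p_i = w_i^2 / ||w||^2: relative variance contribution of each factor.
    total = sum(w * w for w in exposures)
    return [w * w / total for w in exposures]

def inverse_herfindahl(p):
    # H[p] = ||p||^{-2}, which lies between 1 and K.
    return 1.0 / sum(pi * pi for pi in p)

# Hypothetical exposures to K = 4 factors.
h_equal = inverse_herfindahl(variance_contributions([0.5, -0.5, 0.5, 0.5]))
h_single = inverse_herfindahl(variance_contributions([1.0, 0.0, 0.0, 0.0]))
h_mixed = inverse_herfindahl(variance_contributions([1.0, 0.5, 0.25, 0.0]))
```

Equal absolute exposures attain the upper bound K whatever their signs, while a portfolio loading on a single factor attains the lower bound of 1.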

In Table 3.2, we report the out-of-sample performance of three portfolio strategies, using the same methodology as for the unconstrained portfolios in Section 3.6.1:

(i) MV: the long-only minimum-variance (MV) portfolio;

(ii) ICMV: the portfolio that is a shrinkage between the long-only MV portfolio and the long-only minimum-variance IC-variance-parity portfolio;

(iii) PCMV: the portfolio that is a shrinkage between the long-only MV portfolio and the long-only minimum-variance PC-variance-parity portfolio.

The shrinkage intensity is calibrated as for the unconstrained portfolios; that is, via 10-fold cross-validation using modified Sharpe ratio as calibration criterion.

The table shows that the results obtained for the long-only portfolios remain qualitatively similar to the results for the unconstrained portfolios in Table 3.1. Specifically, the ICMV portfolio outperforms the PCMV portfolio in terms of Sharpe ratio and modified Sharpe ratio for all three datasets and, compared to the MV portfolio, performs slightly worse in terms of Sharpe ratio but substantially better in terms of tail risk. Finally, the performance of long-only portfolios is systematically worse than that of unconstrained portfolios. The reason may be that we use daily data, and Jagannathan and Ma (2003) have shown that for daily data the no-short-selling constraint is likely to hurt performance. One benefit of long-only portfolios however is that they require less turnover.

Table 3.2: Out-of-sample performance of long-only factor-risk-parity portfolios

| Criterion | 6BTM MV | 6BTM ICMV | 6BTM PCMV | 6Prof MV | 6Prof ICMV | 6Prof PCMV | 30Ind MV | 30Ind ICMV | 30Ind PCMV |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 14.54% (2.57%) | 14.99% (2.66%) | 14.78% (2.80%) | 13.82% (2.57%) | 14.49% (2.67%) | 13.90% (2.56%) | 12.36% (2.22%) | 15.18% (2.87%) | 10.77% (2.32%) |
| Volatility | 14.82% (2.57%) | 15.62% (2.66%) | 16.07% (2.80%) | 15.18% (2.57%) | 16.91% (2.67%) | 15.24% (2.56%) | 13.12% (2.22%) | 16.91% (2.87%) | 13.66% (2.32%) |
| Sharpe ratio | 0.98 (0.17) | 0.96 (0.18) | 0.92 (0.18) | 0.91 (0.17) | 0.86 (0.17) | 0.91 (0.17) | 0.94 (0.17) | 0.90 (0.17) | 0.79 (0.17) |
| Skewness | -0.55 (0.39) | -0.72 (0.29) | -0.42 (0.32) | -0.46 (0.42) | -0.54 (0.35) | -0.45 (0.39) | -0.78 (0.71) | -0.20 (0.45) | -0.61 (0.61) |
| Excess kurtosis | 15.51 (3.25) | 13.00 (2.36) | 13.66 (2.37) | 15.32 (3.71) | 10.41 (2.42) | 15.06 (3.62) | 24.89 (10.81) | 16.75 (4.50) | 18.81 (8.62) |
| 1% modified Value-at-Risk | 5.77% (0.81%) | 5.55% (0.67%) | 5.77% (0.64%) | 5.84% (0.88%) | 5.32% (0.65%) | 5.80% (0.90%) | 6.97% (2.25%) | 6.73% (1.27%) | 6.01% (1.94%) |
| Modified Sharpe ratio (×10²) | 1.00 (0.24) | 1.07 (0.26) | 1.02 (0.24) | 0.94 (0.25) | 1.08 (0.24) | 0.95 (0.24) | 0.70 (0.35) | 0.90 (0.27) | 0.71 (0.35) |
| Portfolio turnover | 0.24% | 0.44% | 0.44% | 0.19% | 0.50% | 0.22% | 0.61% | 0.95% | 0.82% |
| Average shrinkage intensity δ | / | 0.36 | 0.14 | / | 0.52 | 0.10 | / | 0.37 | 0.40 |

Notes. This table reports the out-of-sample performance of factor-variance-parity portfolios under the no-short-selling constraint, following the methodology in Section 3.6.1 and Appendix 3.8.4. The portfolio mean return, volatility and Sharpe ratio are annualized, while all other performance criteria are reported in daily terms. The shrinkage intensity δ is computed via 10-fold cross-validation, using the modified Sharpe ratio as the calibration criterion. The number of PCs kept, K, is fixed via the minimum-average-partial-correlation method of Velicer (1976). Bold figures indicate the best strategy for each criterion and dataset. Figures in parentheses indicate the standard errors, computed via non-parametric bootstrap with Matlab bootstat function and B = 1000 bootstrap resamples.

Chapter 4

Robust portfolio selection using sparse estimation of comoment tensors

Abstract It is well known that estimation issues severely impact the performances of investment strategies. This becomes even more problematic when accounting for higher moments as the number of parameters to be estimated quickly explodes with the number of assets. In this paper, we address this issue by relying on specific factor models. Although useful to reduce the dimension of the problem, principal component analysis (PCA) is only a partial solution. In particular, it does not break the exponential law that links the number of parameters to the moment order. This issue is tackled by using a new robust portfolio-selection technique that relies on independent component analysis (ICA). By linearly projecting the asset returns onto a small set of maximally independent factors, we obtain a sparse approximation of the comoment tensors of asset returns. This drastically decreases the dimensionality of the problem and, as expected, leads to well-performing, low-turnover investment strategies that are computationally efficient.

Reference This chapter is partially based on: Lassance, N., & Vrins, F. (2019). Robust portfolio selection using sparse estimation of comoment tensors. Louvain Finance working paper. Available on SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3455400.

4.1 Introduction

As discussed in Section 1.1.2, robustness is one of the main concerns when it comes to the realized performance of optimal portfolio strategies. Portfolios that are too sensitive to the provided asset-return inputs are de facto disappointing in terms of out-of-sample performance, and require intensive rebalancing transactions. Robustness is particularly needed in high dimensions, because the number of (co)moments to estimate is considerable compared to the sample sizes generally available.

Various techniques have been introduced to improve the robustness of mean-variance portfolios; see Section 1.1.3 for a review. Beyond mean and variance, investors also display preferences for higher moments. As a result, higher-moment portfolio strategies receive increasing attention in the literature; see Section 1.2. However, very few studies are devoted to the analysis of robust higher-moment portfolios. One possible explanation for this can be found in Brandt et al. (2009, p.3418): “[...] extending the traditional approach beyond first and second moments, when the investor’s utility function is not quadratic, is practically impossible because it requires modeling not only the conditional skewness and kurtosis of each stock but also the numerous high-order cross-moments.” Nonetheless, some authors have tackled this problem via Bayesian estimation (Harvey et al. 2010), shrinkage estimation (Martellini and Ziemann 2010, Boudt et al. 2018) and factor models (Boudt et al. 2015, 2019).

In this paper, we introduce a robust portfolio-selection strategy that combines dimension reduction with a specific factor model. The estimation method put forward in this chapter is intuitive, computationally efficient and suited for any portfolio strategy based on moments of any order. We proceed in two steps.

1. In a first step, we show that reducing dimension by projecting the $N$ asset returns onto the basis formed by the first $K$ principal components (PCs) helps to reduce the number of parameters in comoment tensors. However, the PCs are simply uncorrelated; there is no reason why they would be independent, too. Therefore, many higher comoments remain to be estimated: the number of parameters in the kth higher-comoment tensor of PCs is of the order $K^k$.

2. In a second step, to address this problem, we assume a factor model in which the asset returns are a linear mixture of ICs and, by neglecting their remaining higher-moment dependence, we obtain sparse approximations of the comoment tensors of asset returns. In particular, the number of parameters in the kth higher-comoment tensor of ICs becomes linear in $K$ as only the marginal moments matter.

An empirical analysis on Fama-French industry datasets shows the usefulness of our approach in terms of out-of-sample performance and turnover, especially in high dimensions. In particular, our results put some of the conclusions of Martellini and Ziemann (2010) into perspective. Using sample, sparse and shrinkage comoment-tensor estimates, they conclude that, for monthly returns, there are no benefits in choosing a higher-moment portfolio strategy over the minimum-variance portfolio. In contrast, using monthly returns as well, we observe that our sparse low-dimensional estimation of tensors based on ICA succeeds in dominating the minimum-variance portfolio.

The proposed portfolio-selection strategy exploits the idea introduced in Chapter 3. In that chapter, we use ICA as a way to resolve the complete indeterminacy that affects the factor-risk-parity approach when relying solely on a decorrelation criterion for the choice of factors. In the special case where the risk measure used for factor-risk parity is impacted by higher moments, as studied in Section 3.5.1, we notice that working with the ICs as factors triggers an additional computational advantage. Indeed, the latter being maximally independent, one can dispense with estimating their comoments. In this chapter, we use the same trick, but in a different context, and for different purposes. Instead of considering risk-parity strategies, we deal with traditional minimum-risk portfolios. Our goal is not to diversify a given risk measure across ICs but to enhance the robustness of minimum-risk portfolios, in the same vein as shrinkage techniques. To do so, we first decompose the portfolio return as a weighted sum of ICs, neglect the comoments between the latter, and then solve the optimization problem. Compared to the traditional approach, the resulting objective function features a much smaller number of parameters, especially when working in high-dimensional spaces. In the context of this chapter, working with the ICs is only relevant when dealing with higher-moment risk measures; for the variance, working with the ICs or the PCs is equivalent. This stresses the difference with Chapter 3, in which we find that working with the ICs in factor-risk-parity strategies remains beneficial in terms of portfolio-return tail risk even when using variance as risk measure.

Our work is related to other approaches in the literature dealing with robust estimation of the higher comoments of portfolio returns. The most popular framework is the use of shrinkage estimators of comoment tensors, which are a combination of the sample estimate and a sparse target estimate. It has mainly been used for the covariance matrix; see Section 1.1.3. Analogs for coskewness and cokurtosis tensors have been developed by Martellini and Ziemann (2010) and Boudt et al. (2018). The goal of shrinkage estimators is similar to that of the sparse estimators introduced in this paper. However, shrinkage still requires the sample estimator as input, which is made of many parameters, and crucially depends on the non-trivial choice of target estimator. In contrast, our sparse tensor estimates only require determining the number $K$ of retained PCs.

More closely related to our work, factor models are also used to estimate comoment tensors in portfolio optimization. The vast majority of those studies concentrate on the estimation of the covariance matrix; see De Nard et al. (2019) for a recent review. Concerning higher-comoment tensors, Boudt et al. (2015) show that projecting the asset returns on $K$ factors helps to reduce the number of parameters. However, just like for the PCs above, the growth in the number of parameters remains of order $K^k$. Jondeau et al. (2018) introduce moment component analysis, an extension of PCA to higher moments. The moment components are the directions that account for most of the variations in selected comoment tensors, and can be used to obtain improved tensor estimates by ignoring the non-relevant directions. The factors considered in the above papers do not exhibit any special dependence structure; they are not uncorrelated or independent, in general. As a result, many higher comoments remain to be estimated. Instead, our approach precisely aims at working in a specific system of coordinates, in order to fully exploit the natural sparsity of the comoment tensors associated with independent factors. By doing so, we avoid estimating many comoments.

Other authors rely, just like us, on independent factor models in portfolio optimization. Chen et al. (2007) rely on ICA for forecasting portfolio Value-at-Risk (VaR). Ghalanos et al. (2015) introduce a dynamic autoregressive independent factor model for the same purpose. Both papers focus on the parametric density estimation of the independent factors via normal inverse Gaussian and generalized hyperbolic distributions, respectively. They both obtain more accurate VaR forecasts than via classical methods such as GARCH. Our focus in this chapter is not on VaR forecasting and density estimation, but on robust moment-based portfolios via sparse comoment tensors obtained with ICA. Moreover, both papers do not apply dimension reduction (K = N). Hitaj et al. (2015) rely on ICA to estimate the maximum-CRRA-utility portfolio. Again, their focus is on the density estimation of the ICs via several parametric distributions.

The closest paper to ours is the recent work of Boudt et al. (2019). They consider a latent independent factor model and, in doing so, obtain improved estimates of higher-comoment tensors called nearest comoment estimates. They apply their estimates to different portfolio strategies (minimum VaR, maximum CRRA expected utility and the shortage-function approach of Briec et al. (2007)) and find improvements compared to sample estimators. The main benefit of our approach compared to theirs is that ours is better suited to high-dimensional settings, where robustness is particularly important. Indeed, the estimation methodology in Boudt et al. (2019) is computationally intensive due to the use of a large-dimensional weight matrix. As they explain ($p$ being the number of assets): “A computational disadvantage of the proposed estimator is the use of a weight matrix of dimension equal to the number of sample moments, thus growing as $O(p^4)$. In the simulations and empirical applications, we therefore consider only settings of up to p = 10, which is still highly relevant to asset allocation [...].” In contrast, our approach based on ICA is computationally efficient even in large dimensions, for two reasons. First, all inputs needed to obtain the tensor estimates are easy to compute even for large $N$: we only need the eigendecomposition of the covariance matrix, the rotation matrix that delivers the ICs (very efficient numerical procedures are available to this end such as FastICA) and, finally, the marginal moments of the ICs. Second, as we show, dimension reduction also helps to reduce the dimension of the optimization problem: in our framework, the problem of finding the $N$ optimal portfolio weights simplifies to that of finding the $K$ optimal factor exposures.

The chapter is organized as follows. In Section 4.2, we show how dimension reduction via PCs helps to reduce the dimensionality of comoment tensors. In Section 4.3, we introduce our sparse estimation based on ICA. In Section 4.4, we evaluate the performance of the proposed portfolio strategies. Section 4.5 concludes. The Appendix at the end of the chapter contains proofs for all results, as well as supplementary materials.

4.2 Dimension reduction

In this chapter, we rely on centered asset returns because modeling the asset returns X as a linear combination of K factors Y as µ + AY does not help to improve the estimates of asset mean returns. Moreover, as explained in Section 1.1.2, including µ in the objective function often deteriorates the out-of-sample portfolio performance.

4.2.1 Curse of dimensionality

We consider any portfolio strategy that minimizes a function of (central) moments:

$$w_\psi := \operatorname*{argmin}_{w \in W} \psi\big(m_{q,P}(w), m_{r,P}(w), \ldots, m_{s,P}(w)\big), \tag{4.1}$$

where $2 \leq q \leq r \leq s$ are integers and $s$ is the maximum moment order considered. Notice that this framework is very general. For instance, it encompasses maximum-expected-utility portfolios as a special case, via a truncated Taylor series expansion as in Section 1.2 for example. The kth moment $m_{k,P}$ can be decomposed as

$$m_{k,P} = w' M_k(X) \bigotimes_{i=1}^{k-1} w, \tag{4.2}$$

where $\bigotimes_{i=1}^{k} w := w \otimes \cdots \otimes w$ with $k$ times $w$, and

$$M_k(X) := E\left[ X \bigotimes_{i=1}^{k-1} X' \right]. \tag{4.3}$$

$M_k(X)$ can be represented equivalently as an $N^k$-dimensional tensor, in which case $\otimes$ is the tensor product, or as an $N \times N^{k-1}$ matrix, in which case $\otimes$ is the Kronecker product. We follow the tensor representation in this chapter. Without specific assumptions on $X$, the cardinality$^1$ (the number of distinct parameters to estimate) of $M_k(X)$ is

$$\#(M_k(X)) = \binom{N + k - 1}{k} = O\big(N^k\big). \tag{4.4}$$

Thus, for portfolio strategies based on a large number of assets $N$ and/or on higher moments $k > 2$, the number of parameters impacting the kth moment of $P(w)$ quickly becomes unmanageable. This severely affects the robustness and out-of-sample performance of the portfolio $w_\psi$. In the sequel, we refer to this problem as the curse of dimensionality.
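The decomposition (4.2) can be checked numerically for k = 3 with the sketch below, which builds the sample analogue of the coskewness matrix in (4.3) in its N × N² Kronecker representation. The simulated Student-t returns, the sample size and the weights are assumptions made purely for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 4
X = rng.standard_t(df=6, size=(T, N))   # simulated returns, one row per date
X = X - X.mean(axis=0)                  # centered returns, as in this chapter
w = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical portfolio weights

# Sample analogue of (4.3) for k = 3: entry (a, i*N + j) is the sample mean
# of X_a * X_i * X_j.
XX = np.einsum("ti,tj->tij", X, X).reshape(T, N * N)
M3 = X.T @ XX / T

m3_tensor = w @ M3 @ np.kron(w, w)      # equation (4.2) with k = 3
m3_direct = np.mean((X @ w) ** 3)       # third central moment of P = w'X
```

The matrix M3 has N³ = 64 entries but, by symmetry, only 20 of them are distinct, which is the count given by (4.4) for N = 4 and k = 3.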

4.2.2 Reducing dimensionality via principal components

A first common approach to address the curse of dimensionality is to reduce dimension (Xu et al. 2016). More precisely, we approximate the asset returns $X$ by $\hat{X}$, a well-chosen linear mixture of $K \leq N$ standardized factors. Mathematically, we seek an approximation $\hat{X}$ of the form

$$\hat{X} = AY, \tag{4.5}$$

where $A$ is the $N \times K$ loading matrix and the $K$ factors $Y$ are uncorrelated and have unit variance, $M_2(Y) = I_K$. It is well known that the best approximation $\hat{X}$ of $X$ in the least-squares sense is given by principal component analysis (PCA). Adopting the same notation as in Chapter 3, $V \in \mathbb{R}^{N \times K}$ and $\Lambda \in \mathbb{R}^{K \times K}$ are the eigenvectors and eigenvalues matrices associated with the $K$ largest eigenvalues of $\Sigma$, and the optimal factors and loading matrix are given by

$$Y^\star := \Lambda^{-1/2} V' X \quad \text{and} \quad A := V \Lambda^{1/2}. \tag{4.6}$$

$^1$ For instance, the cardinality of an $N \times N$ covariance matrix is $\binom{N+1}{2} = \frac{(N+1)!}{(N+1-2)!\,2!} = N(N+1)/2$.

Our lower-dimensional representation of the asset returns simply becomes $\hat{X} = VV'X$. The approximation $\hat{P}$ of the portfolio return $P$ resulting from ignoring the last $N - K$ PCs is then obtained as

$$\hat{P} := w'\hat{X} = w' V \Lambda^{1/2} Y^\star = \tilde{w}' Y^\star, \tag{4.7}$$

where the PC exposures $\tilde{w}$ and portfolio weights $w$ are related according to (3.7):

$$w = V \Lambda^{-1/2} \tilde{w}. \tag{4.8}$$

Since $\Sigma$ is assumed to be of full rank, there is a loss of information resulting from the projection step. Thus, $\hat{P} = w'\hat{X}$ is different from $P = w'X$ unless $K = N$. We can now estimate the portfolio $w_\psi$ by finding the PC exposures $\tilde{w}$ optimizing the moments of $\hat{P}$, the best approximation of $P$ in the least-squares sense.
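The projection step in (4.5)-(4.8) can be sketched in a few lines of NumPy. The simulated returns, the dimensions and the PC exposures below are assumptions for illustration only; the sample covariance matrix stands in for Σ, and returns are stored with one row per date, so that matrix products are transposed relative to the column-vector notation of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, K = 2000, 6, 3
B = rng.normal(size=(N, N))
X = rng.normal(size=(T, N)) @ B.T          # simulated returns, one row per date
X = X - X.mean(axis=0)                     # centered returns

Sigma = X.T @ X / T                        # sample covariance matrix
eigval, eigvec = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
top = np.argsort(eigval)[::-1][:K]
Lam, V = eigval[top], eigvec[:, top]       # K largest eigenvalues and eigenvectors

Y_star = X @ V / np.sqrt(Lam)              # rows of Y* = Lambda^{-1/2} V' X, eq. (4.6)
A = V * np.sqrt(Lam)                       # A = V Lambda^{1/2}
X_hat = Y_star @ A.T                       # X_hat = A Y* = V V' X

w_tilde = np.array([0.5, -0.2, 0.3])       # hypothetical PC exposures
w = V @ (w_tilde / np.sqrt(Lam))           # portfolio weights, eq. (4.8)
P_hat = X_hat @ w                          # reduced portfolio return, eq. (4.7)
```

By construction the retained PCs have an identity sample covariance matrix, and the reduced portfolio return coincides with the exposure-weighted sum of the PCs.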

Definition 4.1 (PC-estimate). The PC-estimate of the portfolio $w_\psi$ in (4.1) is

$$\hat{w}_{\psi,K} := V \Lambda^{-1/2} \tilde{w}_{\psi,K}, \tag{4.9}$$

where

$$\tilde{w}_{\psi,K} := \operatorname*{argmin}_{\tilde{w} \in \widetilde{W}} \psi\big(m_{q,\hat{P}}, m_{r,\hat{P}}, \ldots, m_{s,\hat{P}}\big) \tag{4.10}$$

with $\hat{P} = \tilde{w}' Y^\star$ and

$$\widetilde{W} := \left\{ v \in \mathbb{R}^K \mid V \Lambda^{-1/2} v \in W \right\}. \tag{4.11}$$

Remark 4.1. When $K = 1$, the PC-estimate reduces to $\hat{w}_{\psi,1} = V / (1'V)$ and is independent of the function $\psi$. Thus, we restrict to $K \geq 2$ in the sequel.

The above problem can also be formulated as an optimization with respect to $w \in W$ directly as we also have $\hat{P} = w'\hat{X}$. However, problem (4.10) is computationally much more efficient as only $K$ variables are optimized instead of $N$. Thus, in contrast with classical methods used to obtain improved comoment-tensor estimates (see Section 4.1), we do not need to plug in high-dimensional asset-return comoment tensors to compute the optimal portfolio. Instead, only the comoment tensors of the $K$ PCs $Y^\star$ are needed in (4.10) to compute the moments of $\hat{P}$, which substantially speeds up the optimization.

It is worth noting that the PC-estimate $\hat{w}_{\psi,K}$ focuses on the $K$ directions with the largest eigenvalues. These are the most relevant and, moreover, are most precisely estimated, which is expected to improve the robustness of the portfolio. This will be shown empirically in Section 4.4: the portfolio with dimension reduction $\hat{w}_{\psi,K}$ performs better than $w_\psi$, and requires less turnover.

4.2.3 Approximation of comoment tensors

By disregarding the contribution of the N − K least relevant dimensions, the projection step allows us to reduce the dimensionality of the problem. In the next proposition, we formalize this result by showing how the comoment tensors of asset returns simplify when assuming the dimension-reduction factor model based on PCs in (4.5).

Proposition 4.1. The kth comoment tensor of the projected asset returns $\hat{X} = AY^\star$ is given by

$$M_k(\hat{X}) = A\, M_k(Y^\star) \bigotimes_{i=1}^{k-1} A', \tag{4.12}$$

and has the cardinality $\#(M_k(\hat{X})) = \#(A) + \#(M_k(Y^\star))$, where

$$\#(A) = KN - \frac{K(K-1)}{2}, \tag{4.13}$$

$\#(M_2(Y^\star)) = 0$ and, for $k \geq 3$,

$$\#(M_k(Y^\star)) = \binom{K + k - 1}{k} = O\big(K^k\big). \tag{4.14}$$

In the following corollary, we consider the minimum-variance portfolio $w_{MV}$ in (1.6) that is known to work very well in practice, precisely due to its remarkable robustness; see Section 1.1.2. It corresponds to solving (4.1) with $\psi(m_{2,P}) = m_{2,P} = w'\Sigma w$.

Corollary 4.1. The PC-estimate of the minimum-variance portfolio $w_{MV}$ is given by

$$\hat{w}_{MV,K} := \frac{\tilde{\Sigma}^{-1} 1}{1' \tilde{\Sigma}^{-1} 1}, \tag{4.15}$$

where $\tilde{\Sigma} := V \Lambda V'$, and $\#\big(\tilde{\Sigma}\big) = \#(A)$ in (4.13).
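A possible numerical reading of (4.15) is sketched below. Since the matrix V Λ V' has rank K, we assume here that its inverse is understood as the pseudo-inverse V Λ^{-1} V' whenever K < N; this interpretation, the function name and the simulated covariance matrix are ours and are not taken from the text.

```python
import numpy as np

def pc_min_variance(Sigma, K):
    # Weights proportional to V Lambda^{-1} V' 1, normalized to sum to one.
    # Assumption: the inverse in (4.15) is read as a pseudo-inverse when K < N.
    eigval, eigvec = np.linalg.eigh(Sigma)
    top = np.argsort(eigval)[::-1][:K]
    Lam, V = eigval[top], eigvec[:, top]
    ones = np.ones(Sigma.shape[0])
    raw = V @ ((V.T @ ones) / Lam)
    return raw / raw.sum()

rng = np.random.default_rng(2)
N = 5
B = rng.normal(size=(N, N))
Sigma = B @ B.T + 0.5 * np.eye(N)      # simulated positive-definite covariance

w_full = pc_min_variance(Sigma, K=N)   # no dimension reduction
w_reduced = pc_min_variance(Sigma, K=2)
w_one = pc_min_variance(Sigma, K=1)

w_classic = np.linalg.solve(Sigma, np.ones(N))
w_classic /= w_classic.sum()           # usual minimum-variance weights
```

With K = N the sketch recovers the usual minimum-variance weights, and with K = 1 it returns the first eigenvector rescaled to sum to one.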

The minimum-variance portfolio $w_{MV}$ corresponds to $K = N$, that is, amounts to replacing $\tilde{\Sigma}$ by $\Sigma = V_N \Lambda_N V_N'$ in (4.15). The above corollary shows that even when focusing on covariance-based strategies, dimension reduction helps to reduce the number of parameters needed to estimate $\Sigma$ as $\#\big(\tilde{\Sigma}\big)$ is much lower than $\#(\Sigma) = N(N+1)/2$ for $K$ small relative to $N$.

Estimates of the covariance matrix via PCs of the form $\tilde{\Sigma}$ are quite standard in factor modeling (De Nard et al. 2019), although a notable difference is that, in our approach, we do not try to estimate the covariance matrix of the idiosyncratic errors $\varepsilon = X - \hat{X}$. However, in contrast with most of the literature that concentrates on the covariance matrix, Proposition 4.1 is much more general: it explains how linear dimension reduction helps with respect to higher-comoment tensors, similarly to Boudt et al. (2015). Unfortunately, reducing the number of factors from $N$ to $K$ does not resolve the curse of dimensionality. Indeed, the kth moment of $\hat{P}$ is

$$m_{k,\hat{P}} = \tilde{w}' M_k(Y^\star) \bigotimes_{i=1}^{k-1} \tilde{w}, \tag{4.16}$$

and, for $k \geq 3$, $\#(M_k(Y^\star))$ is similar to $\#(M_k(X))$ in (4.4) with $N \leftarrow K$. In particular, it still grows exponentially with $k$, and thus can be large even for small $K$.

Example 4.1. Let $N = 30$ and $K = 5$. The PC-estimate of the minimum-variance portfolio $\hat{w}_{MV,K}$ relies on the estimation of $\sharp(\widetilde{\Sigma}) = 140$ parameters, which is much smaller than the 465 associated with $w_{MV}$. Yet, a strategy that would further feature the third and fourth moments requires the estimation of an extra $\sharp(M_3(Y^\star)) + \sharp(M_4(Y^\star)) = 35 + 70 = 105$ parameters that, moreover, are notoriously difficult to estimate (Martellini and Ziemann 2010).

4.3 Sparse higher-comoment tensors

As explained above, dimension reduction helps to decrease the number of parameters to be estimated in a given portfolio strategy but, in general, does not solve the curse of dimensionality. In particular, the number of parameters still grows with the number $K$ of factors to the power $s$, the highest moment order involved in the portfolio strategy. Interestingly however, exploiting a specific dependence structure among factors can help sparsify the higher-comoment tensors. In this section, we show that working with the specific linear transform of the PCs featuring the lowest dependence, namely the independent components, allows one to further lower in a substantial way the dimensionality underlying the portfolio-optimization problem, avoiding exponential growth.

4.3.1 Independent factor model

Let us assume that the centered asset returns can be explained by the following linear factor model:

$$X = A^\dagger S, \tag{4.17}$$

where the loading matrix $A^\dagger$ is of size $N \times K$ and the factors $S$ are standard and independent; that is, they have zero mean, covariance matrix $M_2(S) = I_K$ and satisfy $S_i \perp S_j$ for all $i \neq j$. Notice that this is an independent factor model: we do not merely require $S$ to be uncorrelated, but to be independent. This requirement implies that only marginal moments must be estimated in $M_k(S)$, resulting in much sparser estimates of the comoment tensors of asset returns than with the PCs, which are merely uncorrelated. This is captured in the next theorem.

Theorem 4.1. Under the independent factor model in (4.17), the $k$th comoment tensor of $X$ is

$$M_k(X) = A^\dagger M_k(S) \bigotimes_{i=1}^{k-1} (A^\dagger)', \tag{4.18}$$

and, for $k \geq 3$, $\sharp(M_k(X)) = \sharp(A^\dagger) + \sharp(M_k(S))$ with

$$\sharp(M_k(S)) = K(k - 3 + 1_{k=3}). \tag{4.19}$$

This theorem has a fundamental consequence: conditional upon $A^\dagger$, the cardinality of the tensor $M_k(X)$ grows only linearly with $K$. This is intuitive since, under our independent-factor assumption, the $k$th comoment tensor of $X$ solely depends on the marginal moments of the $K$ factors $S$ up to order $k$.

Remark 4.2. Equation (4.19) results from the fact that, because $S$ is mutually independent, the tensor $M_k(S)$ for $k \geq 3$ requires all marginal moments of order 3 to $k$ (those of order 2 are all equal to one), except those of order $k - 1$ as they appear in the terms $\mathbb{E}[S_i S_i \cdots S_i S_j] = m_{k-1,S_i} \mathbb{E}[S_j] = 0$. In practical portfolio settings, the tensors are generally used in successive order, and estimating all tensors $M_k(X)$ from order 2 to $s$ entails $\sharp(A^\dagger)$ plus $\sharp(M_3(S), \ldots, M_s(S)) = K(s - 2)$ parameters as the latter merely requires all marginal moments $m_{3,S_i}, \ldots, m_{s,S_i}$, $i = 1, \ldots, K$.

4.3.2 Approximation via independent component analysis

The dimension-reduction step introduced in Section 4.2 yields $M_k(X) \approx M_k(\hat{X})$ where $\hat{X} = AY^\star$ is an approximation of $X$ based on only $K \leq N$ factors. Of course, the PCs $Y^\star$ form just one set of uncorrelated factors, out of infinitely many others. Indeed, as explained in Section 3.2, any rotation $R$ of the latter yields a new set of factors, $RY^\star$, with identity covariance matrix and spanning the same subspace. Combining those factors according to the loading matrix $AR'$ yields $\hat{X}$:

$$(AR')(RY^\star) = AY^\star = \hat{X}.$$

An interesting choice is to consider the specific rotation of the PCs corresponding to the independent components (ICs). Recall from Section 3.2.2 that the ICs are given by choosing the rotation $R$ minimizing mutual information:

$$Y^\dagger := R^\dagger Y^\star \quad \text{with} \quad R^\dagger \in \operatorname*{argmin}_{R \in SO(K)} \left\{ I(RY^\star) \right\}. \tag{4.20}$$

Note that the ICs only lead to the minimum mutual information; there is no guarantee that $I(Y^\dagger) = 0$. However, a well-known identification theorem from signal processing (Comon 1994) states that if the factor model in (4.17) holds with at most one $S_i$ being Gaussian, we have that $Y^\dagger$ found using (4.20) agrees with $S$, up to an irrelevant change of sign and permutation. In other words, if $X$ can be written as a linear invertible transform of truly independent factors of which at most one is Gaussian, then the ICs will agree with those and therefore, will be independent as well. Otherwise, the ICs correspond to the maximally independent linear mixture of $X$. The particular representation

$$\hat{X} = A^\dagger Y^\dagger \quad \text{with} \quad A^\dagger = A(R^\dagger)' \tag{4.21}$$

is useful because, compared to the PCs $Y^\star$, the ICs $Y^\dagger$ have a special feature: they correspond to the least dependent uncorrelated factors. Therefore, we can assume that $M_k(Y^\dagger) \approx \hat{M}_k(Y^\dagger)$ where $\hat{M}_k(Y^\dagger)$ is defined as the tensor whose diagonal elements are given by those of $M_k(Y^\dagger)$ but the off-diagonal elements are replaced by their independent counterparts.² This is not unrealistic as $M_2(Y^\dagger)$ is already diagonal and the ICs $Y^\dagger$ are close to mutual independence. Therefore, we obtain the approximation

$$\hat{M}_k(\hat{X}) := A^\dagger \hat{M}_k(Y^\dagger) \bigotimes_{i=1}^{k-1} (A^\dagger)', \tag{4.22}$$

which enjoys the sparse cardinality of Theorem 4.1 with $\sharp(A^\dagger) = KN$ and $\sharp(\hat{M}_k(Y^\dagger))$ in (4.19).

² For $k = 3$, this corresponds to setting the $(i, i, i)$ entries of $\hat{M}_3(Y^\dagger)$ to the ones in $M_3(Y^\dagger)$, and all other entries to 0. For $k = 4$, this corresponds to setting the $(i, i, i, i)$ entries of $\hat{M}_4(Y^\dagger)$ to the ones in $M_4(Y^\dagger)$, the $(i, i, j, j)$ entries to 1, and all other entries to 0.
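To make the construction of the sparse factor tensors concrete, the following sketch builds their matrix unfoldings for $k = 3$ and $k = 4$ and maps them to the asset level with Kronecker products. The function names are hypothetical, the inputs are the marginal third and fourth moments of the ICs, and we assume that the "$(i, i, j, j)$ entries" of the cokurtosis tensor cover every ordering of two distinct index pairs.

```python
import numpy as np

def sparse_m3(third_moments):
    """K x K^2 unfolding of the sparse coskewness tensor of the factors."""
    K = len(third_moments)
    M3 = np.zeros((K, K, K))
    for i in range(K):
        M3[i, i, i] = third_moments[i]      # only marginal third moments survive
    return M3.reshape(K, K * K)

def sparse_m4(fourth_moments):
    """K x K^3 unfolding of the sparse cokurtosis tensor of the factors."""
    K = len(fourth_moments)
    M4 = np.zeros((K, K, K, K))
    for i in range(K):
        M4[i, i, i, i] = fourth_moments[i]
        for j in range(K):
            if j != i:
                # Unit variances give E[S_i^2 S_j^2] = 1 under independence;
                # assumption: this applies to every ordering of the two pairs.
                M4[i, i, j, j] = M4[i, j, i, j] = M4[i, j, j, i] = 1.0
    return M4.reshape(K, K ** 3)

def asset_tensor(A, M_factor, k):
    """Asset-level unfolding A M (A' kron ... kron A') with k - 1 Kronecker factors."""
    kron = A.T
    for _ in range(k - 2):
        kron = np.kron(kron, A.T)
    return A @ M_factor @ kron
```

For $N$ assets the Kronecker product has $K^{k-1} \times N^{k-1}$ entries, so this dense construction is meant for illustration with small $N$ and $K$ only.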

Remark 4.3. Obviously, $M_k(\hat{X}) = \hat{M}_k(\hat{X})$ whenever $I(Y^\dagger) = 0$: when the independence assumption perfectly holds, the tensors $M_k(\hat{X})$ are naturally sparse and agree with $\hat{M}_k(\hat{X})$. In practice, the ICs may not be perfectly independent, which leads to the approximation $M_k(\hat{X}) \approx \hat{M}_k(\hat{X})$. However, we do not assume that independence holds for an arbitrary set of uncorrelated factors such as the PCs; this would be arguably dubious. Indeed, the PCs are given by eigendecomposition of the covariance matrix of the asset returns, so that really nothing can be said a priori regarding their higher-moment dependence. Instead, we make the independence assumption when working with the very specific set of factors that best matches this hypothesis. This trick is pretty cheap (the only additional cost is the estimation of the rotation matrix $R^\dagger$ that features only $K(K - 1)/2$ free parameters), and substantially reduces the number of free parameters.
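The estimation of the rotation matrix can be illustrated with a compact fixed-point iteration in the spirit of FastICA. The sketch below is a simplified symmetric variant written for this illustration, not the routine detailed in Appendix 3.8.2: the function names, the random initialization and the stopping rule are assumptions, the input is assumed to be whitened already, and the output is orthogonal but its determinant may be $-1$ (a sign flip of one component, which is irrelevant here).

```python
import numpy as np

def _sym_decorrelate(W):
    """Return (W W')^{-1/2} W, the orthogonal matrix closest to W."""
    s, U = np.linalg.eigh(W @ W.T)
    return U @ np.diag(1.0 / np.sqrt(s)) @ U.T @ W

def fastica_rotation(Y, max_iter=200, tol=1e-8, seed=0):
    """Orthogonal K x K matrix R such that the columns of Y @ R.T are
    approximately independent, for whitened data Y of shape T x K."""
    T, K = Y.shape
    rng = np.random.default_rng(seed)
    R = _sym_decorrelate(rng.standard_normal((K, K)))
    for _ in range(max_iter):
        U = Y @ R.T                          # current component estimates
        g = np.tanh(U)                       # derivative of G(x) = ln cosh x
        g_prime = 1.0 - g ** 2
        R_new = g.T @ Y / T - np.diag(g_prime.mean(axis=0)) @ R
        R_new = _sym_decorrelate(R_new)
        # Rows are identified up to sign, hence the absolute values.
        converged = np.max(np.abs(np.abs(np.diag(R_new @ R.T)) - 1.0)) < tol
        R = R_new
        if converged:
            break
    return R
```

Applied to the whitened PCs, `Y @ R.T` then plays the role of the ICs $Y^\dagger$ in (4.20).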

4.3.3 Sparse estimate of optimal portfolio

Our portfolio strategy is thus made of four steps. First, the PCA-based projection yields the factors $Y^\star$. Second, we compute the matrix $R^\dagger$ via the FastICA algorithm of Hyvärinen (1999); see Appendix 3.8.2 for details. Third, we approximate the ICs as being actually independent: $M_k(Y^\dagger) \approx \hat{M}_k(Y^\dagger)$. Fourth, we compute the IC-estimate of the portfolio $w_\psi$ according to the following definition.

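Definition 4.2 (IC-estimate). The IC-estimate of the portfolio $w_\psi$ in (4.1) is

$$\hat{w}^\dagger_{\psi,K} := V \Lambda^{-1/2} (R^\dagger)' \tilde{w}^\dagger_{\psi,K}, \tag{4.23}$$

where

$$\tilde{w}^\dagger_{\psi,K} := \operatorname*{argmin}_{\tilde{w} \in \widetilde{\mathcal{W}}^\dagger} \psi\left(\hat{m}_{q,\hat{P}}, \hat{m}_{r,\hat{P}}, \ldots, \hat{m}_{s,\hat{P}}\right) \tag{4.24}$$

with

$$\widetilde{\mathcal{W}}^\dagger := \left\{ v \in \mathbb{R}^K \mid V \Lambda^{-1/2} (R^\dagger)' v \in \mathcal{W} \right\} \tag{4.25}$$

and $\hat{m}_{k,\hat{P}}$ the approximation of $m_{k,\hat{P}}$ obtained by replacing $M_k(Y^\dagger)$ by $\hat{M}_k(Y^\dagger)$ in (4.16).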
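Under the sparse tensors, the approximate portfolio moments reduce to simple sums over the marginal moments of the ICs. The sketch below spells this out for orders 2 to 4; the closed forms are our own expansion of the tensor contraction under the independence approximation (with every ordering of two distinct index pairs set to 1 in the cokurtosis tensor), and the function name is hypothetical.

```python
import numpy as np

def sparse_portfolio_moments(w_tilde, third_moments, fourth_moments):
    """Second to fourth portfolio moments for factor-space weights w_tilde,
    assuming zero-mean, unit-variance, mutually independent factors."""
    w = np.asarray(w_tilde, dtype=float)
    m3 = np.asarray(third_moments, dtype=float)
    m4 = np.asarray(fourth_moments, dtype=float)
    m2_p = np.sum(w ** 2)                    # unit variances, no cross terms
    m3_p = np.sum(w ** 3 * m3)               # only the (i, i, i) entries remain
    # sum_i w_i^4 m4_i + 3 sum_{i != j} w_i^2 w_j^2, rewritten with (sum_i w_i^2)^2
    m4_p = np.sum(w ** 4 * (m4 - 3.0)) + 3.0 * m2_p ** 2
    return m2_p, m3_p, m4_p
```

Each moment costs $O(K)$ operations, in line with the linear cardinality in (4.19).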

Figure 4.1: Gain of parameters for the estimation of higher moments via independent versus principal components

[Figure: difference in the number of parameters (vertical axis, 0 to 3000) as a function of $K$ (horizontal axis, 1 to 15), for $m_3(\hat{P})$ vs $\hat{m}_3(\hat{P})$ and $m_4(\hat{P})$ vs $\hat{m}_4(\hat{P})$.]

Notes. The figure depicts the difference between the number of parameters needed to estimate the third moment (blue) and fourth moment (red) of $\hat{P}$ via principal components ($m_3(\hat{P})$, $m_4(\hat{P})$) versus independent components ($\hat{m}_3(\hat{P})$, $\hat{m}_4(\hat{P})$). The expression for the difference is given in Proposition 4.2.

Because $\hat{M}_2(\hat{X}) = M_2(\hat{X}) = \widetilde{\Sigma} = AA'$, the IC-estimate of the minimum-variance portfolio coincides with the PC-estimate in Corollary 4.1. This is not surprising since the rotation transform does not impact the covariance matrix of the reconstructed asset returns. If higher moments are considered ($s > 2$), then the PC- and IC-estimates $\hat{w}_{\psi,K}$ and $\hat{w}^\dagger_{\psi,K}$ differ as soon as $I(Y^\dagger) > 0$. Specifically, we have the following proposition concerning the gain in parameters.

Proposition 4.2. Let $k \geq 3$. Then, the difference in the number of parameters needed to estimate $m_{k,\hat{P}}$ and $\hat{m}_{k,\hat{P}}$ is

$$\binom{K+k-1}{k} - K\left(\frac{K-1}{2} + k - 3 + 1_{k=3}\right). \tag{4.26}$$

Example 4.2. In Figure 4.1, we depict the difference in (4.26) for the third and fourth moments ($k \in \{3, 4\}$) as a function of $K$. It shows that the gain in parameters allowed by our sparse approximation is substantial and rises quickly with $K$. For example, the differences for the coskewness and cokurtosis tensors are respectively 20 and 55 for $K = 5$, versus 165 and 660 for $K = 10$.
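The difference in (4.26) is easy to evaluate directly. The helper below (the name `parameter_gain` is ours) is a small sketch that returns it for given $K$ and $k$, and can be used to recompute the values quoted in Example 4.2.

```python
from math import comb

def parameter_gain(K, k):
    """Difference (4.26) between the PC and IC parameter counts for order k >= 3."""
    if k < 3:
        raise ValueError("the formula is stated for k >= 3")
    pc_count = comb(K + k - 1, k)                       # factor tensor of the PCs
    rotation = K * (K - 1) // 2                         # free parameters of the rotation
    ic_count = rotation + K * (k - 3 + (1 if k == 3 else 0))
    return pc_count - ic_count
```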

In Table 4.1, we report a summarized comparison of the cardinality of the comoment tensors via i) full estimation $M_k(X)$, ii) PC estimation $M_k(\hat{X})$, and iii) IC estimation $\hat{M}_k(\hat{X})$. The values of $N$ and $K$ reflect those used empirically in Section 4.4.

Table 4.1: Comparison of cardinality of comoment tensors

(a) General expressions

| | $\sharp(M_2)$ | $\sharp(M_{k \geq 3})$ |
|---|---|---|
| Full estimation | $N(N+1)/2$ | $\binom{N+k-1}{k}$ |
| PC estimation | $NK - K(K-1)/2$ | $NK - K(K-1)/2 + \binom{K+k-1}{k}$ |
| IC estimation | $NK - K(K-1)/2$ | $K(N + k - 3 + 1_{k=3})$ |

(b) Particular cases

| | $\sharp(M_2)$ | $\sharp(M_3)$ | $\sharp(M_4)$ | $\sharp(M_2, M_3, M_4)$ |
|---|---|---|---|---|
| $(N, K) = (5, 3)$ | | | | |
| Full estimation | 15 | 35 | 70 | 120 |
| PC estimation | 12 | 22 | 27 | 37 |
| IC estimation | 12 | 18 | 18 | 21 |
| $(N, K) = (17, 6)$ | | | | |
| Full estimation | 153 | 969 | 4845 | 5967 |
| PC estimation | 87 | 143 | 213 | 269 |
| IC estimation | 87 | 108 | 108 | 114 |
| $(N, K) = (30, 10)$ | | | | |
| Full estimation | 465 | 4960 | 40920 | 46345 |
| PC estimation | 255 | 475 | 970 | 1190 |
| IC estimation | 255 | 310 | 310 | 320 |

Notes. The table reports the cardinality of the comoment tensors of asset returns in three separate cases. First, the full estimation $M_k(X)$. Second, the PC estimation $M_k(\hat{X})$. Third, the IC estimation $\hat{M}_k(\hat{X})$. Table (a) reports the general expressions, and Table (b) reports particular cases of $N$ and $K$. The values of $N$ correspond to the three datasets used in the empirical analysis. The values of $K$ correspond to the rounded average $K$ used over time for the three datasets; see Section 4.4.1 for details on the choice of $K$. The column $\sharp(M_2, M_3, M_4)$ in Table (b) is the cardinality implied by the estimation of the three tensors together, and is equal to or lower than the sum of the cardinality of each separate tensor because of redundancies.
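The joint cardinalities in the last column of Table (b) follow from counting the loading matrix only once. The sketch below recomputes them for the three estimation schemes; the function name and the dictionary keys are ours, and the formulas are those of Propositions 4.1, Theorem 4.1 and Remark 4.2.

```python
from math import comb

def joint_cardinality(N, K, s=4):
    """Number of parameters needed for all comoment tensors of order 2 to s >= 3."""
    if s < 3:
        raise ValueError("s must be at least 3")
    orders = range(3, s + 1)
    # Full estimation: one symmetric tensor per order, no shared parameters.
    full = N * (N + 1) // 2 + sum(comb(N + k - 1, k) for k in orders)
    # PC estimation: loading matrix A counted once, plus the factor tensors.
    pc = N * K - K * (K - 1) // 2 + sum(comb(K + k - 1, k) for k in orders)
    # IC estimation: rotated loading matrix (KN entries) plus K(s - 2) marginal moments.
    ic = N * K + K * (s - 2)
    return {"full": full, "pc": pc, "ic": ic}
```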

4.4 Empirical analysis

We now assess the out-of-sample performance of our PC- and IC-estimates of portfolio strategies. Section 4.4.1 details the methodology, Section 4.4.2 compares the independence of PCs versus ICs, and Section 4.4.3 discusses the results.

4.4.1 Data and methodology

In order to assess the effect of the dimensionality on portfolio performances, we use value-weighted monthly arithmetic returns from July 1979 to June 2019 for the 5, 17 and 30 Fama-French industry portfolios (5Ind, 17Ind, 30Ind). We rely on monthly returns to place ourselves in a situation where sample sizes are low, and thus, estimation risk is prominent. In Appendix 4.6.2, we evaluate the performance using daily returns.

We consider two portfolio strategies wψ, corresponding to two risk measures ψ. First, let us take the variance, associated with the objective function ψ(m2,P ) = m2,P . Indeed, recall from Corollary 4.1 that dimension reduction is helpful even in the minimum-variance setup. Second, to evaluate the performances of strategies featuring higher moments, we consider the Value-at-Risk (VaR) which is probably the most popular risk measure besides the variance. To express VaR in terms of moments, we rely on the modified VaR (MVaR) in (1.48). As in Section 3.6, we fix ε = 1%. To implement the minimum-variance and minimum-MVaR portfolios, we consider three estimation strategies, all computed using sample moment averages.

i) MV and MMVaR: estimation using full comoment tensors of asset returns.

ii) MVPC and MMVaRPC: PC-estimates $\hat{w}_{\psi,K}$ introduced in Definition 4.1. MVPC is known in closed form from Corollary 4.1.

iii) MMVaRIC: IC-estimate $\hat{w}^\dagger_{\psi,K}$ introduced in Definition 4.2. We only consider MVaR for the IC-estimate because, as noted in Section 4.3.3, the PC- and IC-estimates of the minimum-variance portfolio coincide.

We select $K \in \{2, \ldots, N\}$ as the lowest value of $K$ for which the first $K$ PCs retain 90% of the variance; that is, for which

$$\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{j=1}^{N} \lambda_j} > 90\%, \tag{4.27}$$

where $\lambda_i$ is the $i$th eigenvalue. This approach, which consists of selecting $K$ via a threshold on the percentage of total variance explained by the selected factors, is pretty standard in many applications of data processing (Everitt and Dunn 2010). Figure 4.2 depicts the time evolution of $K$ for the three datasets. It shows that the datasets are well diversified as the number of factors needed to recover 90% of the variance is quite large. Following Proposition 4.2, this is precisely the kind of situation where our sparse ICA-based tensor approximations can bring a substantial robustness gain compared to the simple PCA-based reduction of dimension.
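The selection rule (4.27) can be sketched in a few lines. The function name and its keyword arguments are hypothetical; the lower bound of two factors reflects the set $\{2, \ldots, N\}$ over which $K$ is selected.

```python
import numpy as np

def select_num_factors(eigenvalues, threshold=0.90, k_min=2):
    """Smallest K (at least k_min) whose first K eigenvalues explain strictly
    more than `threshold` of the total variance."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending order
    share = np.cumsum(lam) / lam.sum()
    K = int(np.argmax(share > threshold)) + 1                   # first index above threshold
    return max(K, k_min)
```

In a rolling-window exercise, the function would be applied to the eigenvalues of the sample covariance matrix of each estimation window.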

Figure 4.2: Time evolution of number of factors $K$

[Figure: three panels (5Ind, 17Ind, 30Ind) showing $K$ (vertical axis, 2 to 14) across the rolling windows (horizontal axis, 1 to 60).]

Notes. The figure depicts the number of principal components kept, $K$, over time for the three datasets. It is selected so that 90% of the variance is retained as in Equation (4.27). We use 60 rolling windows of ten years of monthly returns, rolled over by six months over time, covering the period July 1979 to June 2019.

To evaluate the out-of-sample performance of the portfolios, we employ a rolling-horizon methodology as in the previous chapters, using an estimation window of ten years (a sample size of $T = 120$ months) and a rolling window of six months. Thus, the 30-year out-of-sample period goes from July 1989 to June 2019. The returns are computed net of transaction costs as in DeMiguel et al. (2009a, p.1930) with a proportional transaction cost of $c = 50$ basis points. That is, the portfolio return at time $t + 1$, $P_{t+1}$, is adjusted as follows:

$$P_{t+1} \leftarrow (1 + P_{t+1}) \left( 1 - c \times \sum_{i=1}^{N} |w_{i,t+1} - w_{i,t^+}| \right) - 1, \tag{4.28}$$

where $w_{i,t+1}$ is the portfolio weight at time $t + 1$, and $w_{i,t^+}$ is the portfolio weight before rebalancing at $t + 1$ as in (2.29).
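The adjustment (4.28) amounts to a one-line computation once the weights before and after rebalancing are available. The sketch below uses a hypothetical function name and takes the pre-rebalancing weights as an input, since their construction in (2.29) is outside the scope of this section.

```python
import numpy as np

def net_return(gross_return, w_new, w_before, c=0.005):
    """Portfolio return net of proportional transaction costs, as in (4.28).

    c = 0.005 corresponds to 50 basis points; w_before holds the weights
    just before rebalancing and w_new the weights after rebalancing.
    """
    traded = np.abs(np.asarray(w_new, dtype=float) - np.asarray(w_before, dtype=float)).sum()
    return (1.0 + gross_return) * (1.0 - c * traded) - 1.0
```

For instance, a gross return of 2% with 20% of the portfolio traded yields $(1.02)(1 - 0.005 \times 0.2) - 1 = 1.898\%$.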

Finally, we compare the out-of-sample performance of the different portfolios in terms of mean-variance trade-off, tail risk and portfolio-weight stability using the same criteria as in Section 3.6.1. We compute p-values via the methodologies of Ledoit and Wolf (2008) for the Sharpe ratio and Ardia and Boudt (2015b) for the modified Sharpe ratio, as in Section 3.6.1. We compare the following pairs of portfolios: MVPC versus MV and MMVaRIC versus MMVaR. The stars *, ** and *** mean that the two-sided p-value is less than 20%, 10% and 5%.

4.4.2 Independence of PCs versus ICs

The sparse representation of the asset-return comoment tensors via ICA is based on the assumption that the ICs are independent. For the coskewness tensor, this amounts to disregarding the off-diagonal elements. Intuitively, the error resulting from this approximation is expected to be much smaller when considering the ICs as factors instead of the PCs. To illustrate this, let us first compare the coskewness tensor of the $K = 2$ PCs ($M_3(Y^\star)$) to that of the $K = 2$ ICs ($M_3(Y^\dagger)$)³ on the full out-of-sample period for the 30Ind dataset:

$$M_3(Y^\star) = \begin{pmatrix} \mathbf{-3.03} & -1.15 & -1.15 & -0.15 \\ -1.15 & -0.15 & -0.15 & \mathbf{-0.40} \end{pmatrix}, \qquad M_3(Y^\dagger) = \begin{pmatrix} \mathbf{-3.61} & 0.11 & 0.11 & 0.10 \\ 0.11 & 0.10 & 0.10 & \mathbf{-0.46} \end{pmatrix}.$$

As expected, the off-diagonal elements of $M_3(Y^\dagger)$ are much closer to zero than those of $M_3(Y^\star)$; it is therefore more natural to disregard the former than the latter. This is precisely what we do by choosing to rely on the ICs. Similar observations are made for the two other datasets and the cokurtosis tensors. Second, for the three datasets, we report in Figure 4.3 the time evolution of the non-linear correlation between the $K = 2$ PCs and ICs,

$$\rho_G(Y_1, Y_2) := \frac{\mathrm{Cov}(G(Y_1), G(Y_2))}{\sigma_{G(Y_1)} \sigma_{G(Y_2)}}, \tag{4.29}$$

for $G(x) = \ln \cosh x$. This choice is motivated by the fact that independence between two variables $(X, Y)$ amounts to decorrelation between any non-linear transforms of them, i.e. $\mathrm{Cov}(f(X), g(Y)) = 0$ for all functions $f, g$. Obviously, linear functions only capture linear dependence. Here, we measure the residual dependence among the PCs and ICs by computing the correlation of their $G$-transform, because $G$ is the non-linear mapping used in FastICA; see Appendix 3.8.2. We also depict the 95% confidence intervals estimated via bootstrap. For the PCs, $\rho_G$ is nearly always positive, and $\rho_G = 0$ is quite a few times outside the confidence interval. In contrast, for the ICs, $\rho_G$ remains close to zero without up or down trends, and $\rho_G = 0$ is nearly always in the confidence interval.
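The non-linear correlation in (4.29) is simply Pearson's correlation computed on the $G$-transformed factors. A minimal sketch, with a hypothetical function name, is given below; it assumes standardized inputs so that `np.cosh` does not overflow.

```python
import numpy as np

def nonlinear_corr(y1, y2):
    """Pearson correlation of G(y1) and G(y2) with G(x) = ln cosh x, as in (4.29)."""
    g1 = np.log(np.cosh(np.asarray(y1, dtype=float)))
    g2 = np.log(np.cosh(np.asarray(y2, dtype=float)))
    return np.corrcoef(g1, g2)[0, 1]
```

As an illustration of why the transform matters, take a factor and the same factor multiplied by an independent random sign: the two are uncorrelated, yet their $G$-transforms coincide because $G$ is even, so the non-linear correlation equals one.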

4.4.3 Results

The out-of-sample results obtained with the methodology of Section 4.4.1 are reported in Table 4.2. We make the following observations.

³ For $K = 2$, the third comoment (coskewness) tensor is of dimension $2 \times 2 \times 2$. For visualization purposes, we represent it in matrix form by stacking the two $2 \times 2$ subcomoment matrices columnwise. The two diagonal elements $(1, 1, 1)$ and $(2, 2, 2)$ are now the elements $(1, 1)$ and $(2, 4)$ in bold.

Table 4.2: Out-of-sample performance of estimation strategies

| 5Ind dataset | MV | MVPC | MMVaR | MMVaRPC | MMVaRIC |
|---|---|---|---|---|---|
| Mean | 11.10% | 11.02% | 11.12% | 11.57% | 10.89% |
| Volatility | 12.26% | 13.53% | 13.90% | 14.72% | 14.53% |
| Sharpe ratio | 0.91 | 0.81 | 0.80 | 0.79 | 0.75 |
| Skewness | -0.44 | -0.64 | -0.22 | -0.34 | -0.34 |
| Excess kurtosis | 0.43 | 1.39 | 0.06 | 0.28 | 0.78 |
| Modified VaR | 8.55% | 10.67% | 9.03% | 10.08% | 10.48% |
| Modified SR (×10) | 1.08 | 0.86* | 1.03 | 0.96 | 0.87 |
| Turnover | 7.84% | 5.27% | 18.80% | 16.08% | 8.99% |

| 17Ind dataset | MV | MVPC | MMVaR | MMVaRPC | MMVaRIC |
|---|---|---|---|---|---|
| Mean | 7.47% | 10.14% | 8.65% | 11.07% | 10.98% |
| Volatility | 11.66% | 11.76% | 17.18% | 14.92% | 12.31% |
| Sharpe ratio | 0.64 | 0.86** | 0.50 | 0.74 | 0.89*** |
| Skewness | -0.46 | -0.42 | 0.16 | 0.19 | -0.29 |
| Excess kurtosis | 0.48 | 0.66 | 3.09 | 1.52 | 0.52 |
| Modified VaR | 8.46% | 8.40% | 13.79% | 9.97% | 8.44% |
| Modified SR (×10) | 0.74 | 1.01** | 0.52 | 0.93 | 1.09** |
| Turnover | 17.59% | 10.51% | 48.42% | 30.81% | 16.81% |

| 30Ind dataset | MV | MVPC | MMVaR | MMVaRPC | MMVaRIC |
|---|---|---|---|---|---|
| Mean | 6.46% | 8.95% | -2.03% | 7.62% | 9.87% |
| Volatility | 12.66% | 11.71% | 28.29% | 14.36% | 12.00% |
| Sharpe ratio | 0.51 | 0.76 | -0.07 | 0.53 | 0.82*** |
| Skewness | -0.12 | -0.41 | 0.15 | -0.35 | -0.28 |
| Excess kurtosis | 1.14 | 1.36 | 6.72 | 1.39 | 0.64 |
| Modified VaR | 9.24% | 9.00% | 31.06% | 11.23% | 8.35% |
| Modified SR (×10) | 0.58 | 0.83 | -0.05 | 0.57 | 0.98*** |
| Turnover | 30.63% | 14.99% | 193.21% | 35.23% | 22.00% |

Notes. The table reports the out-of-sample performance of the five considered portfolio policies on the three datasets following the methodology in Section 4.4.1. The portfolio mean return, volatility and Sharpe ratio are annualized, while all other performance criteria are reported in monthly terms. All performance criteria are reported net of transaction costs, assuming a proportional transaction cost of $c = 50$ basis points in (4.28). The number of principal components, $K$, is selected so that 90% of the variance is retained. The out-of-sample period ranges from July 1989 to June 2019, and we use an estimation window of ten years and a rolling window of six months. We evaluate the statistical significance of the difference in Sharpe ratio and the modified Sharpe ratio by computing the two-sided p-values via the circular bootstrapping methodology of Ledoit and Wolf (2008) for the Sharpe ratio and Ardia and Boudt (2015b) for the modified Sharpe ratio. The stars *, ** and *** mean that the p-value is less than 20%, 10% and 5%. We compare the following pairs of portfolios: MVPC versus MV and MMVaRIC versus MMVaR.

Figure 4.3: Non-linear correlation of principal versus independent components

[Figure: six panels, with the PCs in the top row and the ICs in the bottom row, for the 5Ind, 17Ind and 30Ind datasets; the vertical axis shows the non-linear correlation (from -1 to 1) and the horizontal axis shows the rolling estimation windows (80-90 to 05-15).]

Notes. The figure depicts the non-linear correlation of the $K = 2$ principal and independent components over time, for the 5Ind, 17Ind and 30Ind datasets. The non-linear correlation is computed by transforming the factors with the function $G(x) = \ln \cosh x$, and computing Pearson's correlation on the transformed factors; see Equation (4.29). The solid blue lines depict the point estimates, and the dotted blue lines the 95% confidence intervals computed via bootstrap using Matlab bootci function with 1000 bootstrap resamples. We use 60 rolling windows of ten years of monthly returns, rolled over by six months over time, covering the period July 1979 to June 2019. These windows are depicted on the x-axis.

First, dimension reduction is not useful in terms of performance for small size datasets (5Ind), but is very useful for moderate and large size datasets (17Ind and 30Ind). Indeed, in terms of Sharpe ratio and modified Sharpe ratio, MVPC and MMVaRPC are outperformed by MV and MMVaR for 5Ind, while the contrary holds for 17Ind and 30Ind. This is not surprising as, for $N = 5$, there are not so many parameters to estimate, and thus dimension reduction brings no further benefit. For $N = 17$ and 30 however, the benefits are very substantial (see Table 4.1). Also, as far as the objective functions are concerned, MMVaRPC achieves a lower modified VaR than MMVaR for 17Ind and 30Ind, and MVPC achieves a similar volatility to MV for 17Ind, and a lower one for 30Ind. Finally, it is particularly striking to observe that the out-of-sample performance of MMVaR completely collapses for 30Ind: although it aims to minimize MVaR, its out-of-sample MVaR is 31.06%, compared with 11.23% for MMVaRPC. This shows that the curse of dimensionality is prominent for strategies accounting for higher moments.

Second, dimension reduction is always very useful in terms of turnover. For example, for the 30Ind dataset, the monthly turnover of MV is 30.63% versus 14.99% for MVPC, and is 193.21% for MMVaR versus 35.23% for MMVaRPC. This does not come as a surprise since a higher sparsity is expected to be beneficial in terms of stability.

Third, for MVaR as higher-moment risk measure, using the IC-estimate MMVaRIC rather than the PC-estimate MMVaRPC is not useful in terms of performance for small size datasets (5Ind), but is very useful for moderate and large size datasets (17Ind and 30Ind). Indeed, in terms of Sharpe ratio and modified Sharpe ratio, MMVaRIC is outperformed by MMVaRPC for 5Ind, while the contrary holds for 17Ind and 30Ind. As far as the goal of minimizing MVaR is concerned, MMVaRIC outperforms MMVaRPC: the MVaR is 8.44% versus 9.97% for 17Ind, and 8.35% versus 11.23% for 30Ind. This shows that considering (nearly) independent rather than merely uncorrelated factors enhances robustness and performance.

Figure 4.4: Boxplots of the MMVaR, MMVaRPC and MMVaRIC portfolios for the 17Ind dataset

[Figure: boxplots of the portfolio weights $w_1$ to $w_{17}$ (horizontal axis), with weights ranging from -1.5 to 1.5 (vertical axis).]

Notes. The figure depicts boxplots of portfolio weights, for the 17Ind dataset, of the MMVaR (left boxplots), MMVaRPC (middle boxplots) and MMVaRIC (right boxplots) portfolio strategies. The figure is constructed using Matlab boxplot function. The central line on each boxplot indicates the median, the box indicates the 25% and 75% quantiles, the whiskers extend to the most extreme data points that are not considered as outliers, and the outliers are represented with the + symbol.

Fourth, in addition to improving portfolio performance, our sparse IC estimates of comoment tensors also significantly reduce turnover; for all three datasets, MMVaRIC has a lower turnover than MMVaRPC. In Figure 4.4, we depict boxplots of portfolio weights of MMVaR, MMVaRPC and MMVaRIC for the 17Ind dataset. The figure shows that MMVaRIC is by far the most stable portfolio. Moreover, we have also observed that it entails fewer short positions over time.

Finally, our sparse tensor estimates based on ICA yield higher-moment portfolio strategies that outperform the minimum-variance strategy (except, again, for 5Ind). For 17Ind, MMVaRIC has a similar skewness, kurtosis and MVaR to MV and MVPC, but a much higher modified Sharpe ratio. For 30Ind, MMVaRIC is the strategy with the lowest kurtosis, lowest MVaR and highest modified Sharpe ratio. In particular, its MVaR is 8.35%, versus 9.24% for MV and 9.00% for MVPC. In contrast, the classical estimates (MMVaR) or PC estimates (MMVaRPC) yield portfolios that do not outperform the minimum-variance strategy: MMVaR and MMVaRPC have a higher MVaR than MV for all datasets. Thus, robust estimation procedures are needed to reap higher-moment benefits in portfolio optimization. This result is to be put in contrast with Martellini and Ziemann (2010) who find that, for monthly returns, there is no benefit in using a four-moment portfolio strategy over the minimum-variance portfolio, no matter whether one relies on sample, sparse or shrinkage estimators of comoment tensors. Instead, even though we use monthly returns, we find that higher-moment strategies can outperform the minimum-variance portfolio when using our sparse low-dimensional estimation of the comoment tensors based on ICA.

4.5 Conclusion

Although it is clear that investors care about higher-moment features such as asymmetry and fat tails, these are often dismissed in portfolio strategies. The reason has to be found in the dimensionality of the problem: in practice, the number of parameters to be estimated quickly explodes with the dimension of the investment universe and the order of the moments. This, in turn, deeply impacts the robustness of the corresponding strategies, and leads to disappointing out-of-sample performances. Several techniques have been proposed in the literature to deal with this problem, for example by using shrinkage or imposing a factor structure. Although successfully applied earlier to derive robust estimators of the covariance matrix, their extension to higher-comoment tensors is complex and can struggle to outperform simpler strategies ignoring higher moments for moderate sample sizes. By contrast, we derive a simple and elegant framework whose purpose is precisely to tackle the curse of dimensionality: we seek to enhance robustness by proposing a much sparser representation of the comoment tensors of asset returns. To achieve this goal, the asset returns are projected in a linear way onto a small set of maximally independent factors, found via independent component analysis. This procedure is proven to solve the curse of dimensionality impacting the classical estimates: by exploiting the near-independence structure of these specific factors, we drastically reduce the number of free parameters in the comoment tensors (Theorem 4.1). In particular, the latter is no longer given by a power law, but evolves linearly with the higher moments involved in the objective function. This makes our strategy very scalable and efficient, even when applied in high-dimensional investment universes. Using modified Value-at-Risk as higher-moment risk measure, we find empirically that the resulting portfolio enjoys remarkable out-of-sample performance and turnover.
In particular, it outperforms the classical non-sparse portfolio estimates when the number of assets is moderate or large. Moreover, it allows for improved higher moments compared to the minimum-variance portfolio, contrary to non-sparse estimates.

In this chapter, we have focused on moment-based portfolios, which are often used in the expected-utility framework via Taylor series expansion of the utility function; see Section 1.2. In Chapter 5, we revisit the expected-utility framework and propose an alternative specification of investors' preferences that is more natural in non-Gaussian settings. Namely, we capture investors' preferences via a distribution of returns, which is ultimately what investors care about.

4.6 Appendix

4.6.1 Proofs of results

A useful property for the proofs below is that, given a random vector $Z$, the $k$th comoment tensor of $AZ$ is

$$M_k(AZ) = A\, M_k(Z) \bigotimes_{i=1}^{k-1} A'. \tag{4.30}$$

This is straightforward to derive from the definition of comoment tensor in (4.3) and the fact that the tensor product $\otimes$ is associative:

$$M_k(AZ) = \mathbb{E}\left[ AZ \bigotimes_{i=1}^{k-1} (Z'A') \right] = A\, \mathbb{E}\left[ Z \bigotimes_{i=1}^{k-1} Z' \bigotimes_{i=1}^{k-1} A' \right] = A\, \mathbb{E}\left[ Z \bigotimes_{i=1}^{k-1} Z' \right] \bigotimes_{i=1}^{k-1} A',$$

which corresponds to (4.30).

4.6.1.1 Proposition 4.1

Proof. Equation (4.12) is a direct application of (4.30). The cardinality of $A = V \Lambda^{1/2}$ is $\sharp(A) = \sharp(V) + \sharp(\Lambda)$. Clearly, $\sharp(\Lambda) = K$. The matrix $V$ is made of $NK$ entries, of which we must withdraw $K$ normalization constraints on the columns and $\binom{K}{2} = K(K - 1)/2$ orthogonalization constraints (one for each pair of columns). This results in $\sharp(A)$ given by (4.13). The cardinality of $M_k(Y^\star)$ is obtained by plugging $N \leftarrow K$ in (4.4).
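The linear-transform property (4.30) on which this proof relies also holds exactly for sample comoments, which makes it easy to check numerically. The sketch below does so for $k = 3$; the function names are hypothetical and the tensor is stored as its $n \times n^2$ unfolding.

```python
import numpy as np

def comoment3(Z):
    """Sample third comoment tensor of a T x n data matrix, unfolded as n x n^2."""
    Zc = Z - Z.mean(axis=0)
    T, n = Zc.shape
    # Row t of `outer` holds the Kronecker product of z_t' with itself.
    outer = np.einsum("ti,tj->tij", Zc, Zc).reshape(T, n * n)
    return Zc.T @ outer / T

def linear_transform_gap(Z, A):
    """Largest absolute difference between both sides of (4.30) for k = 3."""
    lhs = comoment3(Z @ A.T)                          # tensor of the transformed data
    rhs = A @ comoment3(Z) @ np.kron(A.T, A.T)        # transformed tensor
    return np.max(np.abs(lhs - rhs))
```

The gap is zero up to floating-point error for any data matrix and any loading matrix of compatible size.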

4.6.1.2 Corollary 4.1

Proof. Given that $\hat{P} = w'\hat{X}$, the PC-estimate of the portfolio $w_\psi$ in Definition 4.1 is equivalently given by

$$\hat{w}_{\psi,K} = \operatorname*{argmin}_{w \in \mathcal{W}} \psi\left(m_{q,\hat{P}}, m_{r,\hat{P}}, \ldots, m_{s,\hat{P}}\right), \tag{4.31}$$

which, for the special case $\psi(m_{2,\hat{P}}) = m_{2,\hat{P}} = w' M_2(\hat{X}) w$ has the well-known solution

$$\hat{w}_{\psi,K} = \frac{M_2(\hat{X})^{-1} 1}{1' M_2(\hat{X})^{-1} 1}. \tag{4.32}$$

Finally, we have from (4.12) that $M_2(\hat{X}) = AA' = \widetilde{\Sigma}$.

4.6.1.3 Theorem 4.1

Proof. Equation (4.18) is an immediate consequence of (4.30). To count the number of distinct parameters in each higher-comoment tensor $M_k(S)$ for $k \geq 3$ knowing that the factors $S$ are mutually independent, let us start with some specific values of $k$. Note that, because of independence, the expectation of a product of several $S_i$'s simplifies to the product of the corresponding moments. For $k = 3$, the entries $(i, j, l)$ and $(i, i, j)$ are all zero because the mean of each factor is zero. Thus, only the $K$ entries $(i, i, i)$, equal to $m_{3,S_i}$, must be counted. For $k = 4$, the entries $(i, j, l, m)$ and $(i, i, i, j)$ are zero and the entries $(i, i, j, j)$ equal $m_{2,S_i} m_{2,S_j} = 1$. Thus, only the $K$ moments $m_{4,S_i}$ must be counted. For $k = 5$ however, one must count not only the $K$ moments $m_{5,S_i}$, but also the entries $(i, i, i, j, j)$ that equal $m_{3,S_i} m_{2,S_j} = m_{3,S_i}$. Thus, one must add the $K$ moments $m_{3,S_i}$. Note that the fourth moments $m_{4,S_i}$ are not counted because they appear in the terms $(i, i, i, i, j)$ that equal zero. By generalizing this, we therefore have that $\sharp(M_3(S)) = K$ and, for $k \geq 4$, $M_k(S)$ requires all marginal moments of order 3 to $k$, except those of order $k - 1$. Thus, $\sharp(M_k(S)) = K(k - 3)$ for $k \geq 4$, proving Equation (4.19).
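The counting argument of this proof can be cross-checked by brute force. The sketch below enumerates all index combinations of a $k$th-order tensor of $K$ independent, standardized factors and collects the marginal moments of order at least 3 that appear in a non-zero entry. The function name is ours, and the check assumes $K \geq 2$ so that off-diagonal entries exist.

```python
from collections import Counter
from itertools import combinations_with_replacement

def required_marginal_moments(K, k):
    """Set of (factor, order) pairs needed to fill the kth comoment tensor of
    K independent factors with zero mean and unit variance."""
    needed = set()
    for entry in combinations_with_replacement(range(K), k):
        counts = Counter(entry)
        if 1 in counts.values():
            continue                      # a factor appearing once makes the entry zero
        for factor, order in counts.items():
            if order >= 3:                # orders 0 and 2 contribute a factor of one
                needed.add((factor, order))
    return needed
```

For $K \geq 2$, the size of the returned set matches $K(k - 3 + 1_{k=3})$ in (4.19).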

4.6.1.4 Proposition 4.2

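Proof. We have that $m_{k,\hat{P}} = w' M_k(\hat{X}) \bigotimes_{i=1}^{k-1} w$ and $\hat{m}_{k,\hat{P}} = w' \hat{M}_k(\hat{X}) \bigotimes_{i=1}^{k-1} w$. Thus, the difference in the number of parameters is $\sharp\big(M_k(\hat{X})\big) - \sharp\big(\hat{M}_k(\hat{X})\big)$. From Proposition 4.1, the cardinality $\sharp\big(M_k(\hat{X})\big) = \sharp(A) + \sharp(M_k(Y^\star)) = \sharp(V) + \sharp(\Lambda) + \sharp(M_k(Y^\star))$. Similarly, from Theorem 4.1, the cardinality $\sharp\big(\hat{M}_k(\hat{X})\big) = \sharp(A^\dagger) + \sharp\big(\hat{M}_k(Y^\dagger)\big) = \sharp(V) + \sharp(\Lambda) + \sharp(R^\dagger) + \sharp\big(\hat{M}_k(Y^\dagger)\big)$. Thus, the difference becomes $\sharp(M_k(Y^\star)) - \big[\sharp(R^\dagger) + \sharp\big(\hat{M}_k(Y^\dagger)\big)\big]$, and simplifies to (4.26) from $\sharp(M_k(Y^\star))$ in (4.14), $\sharp\big(\hat{M}_k(Y^\dagger)\big)$ in (4.19), and $\sharp(R^\dagger) = K(K-1)/2$ for any $K \times K$ rotation matrix.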

4.6.2 Robustness test: daily returns

In Table 4.2, we report the out-of-sample performance using monthly returns, a setting in which estimation risk is prominent. In this appendix, we evaluate how the results change when using daily returns. Specifically, in Table 4.3, we repeat the empirical analysis of Section 4.4 for daily returns of the 30Ind dataset. We observe that using the 90%-retained-variance rule to select the number of factors, K, does not work well: it leads to portfolios with worse performance than, and turnover similar to, the classical estimation without dimension reduction. Thus, instead, we select K via the methodology of Velicer (1976) as used in Section 3.6.1.

Table 4.3: Robustness test: Out-of-sample performance for daily returns

| 30Ind dataset      | MV     | MVPC   | MMVaR  | MMVaRPC | MMVaRIC |
|--------------------|--------|--------|--------|---------|---------|
| Mean               | 7.14%  | 10.85% | 6.01%  | 13.59%  | 15.14%  |
| Volatility         | 12.36% | 15.15% | 14.24% | 19.19%  | 19.93%  |
| Sharpe ratio       | 0.58   | 0.72   | 0.42   | 0.71    | 0.76**  |
| Skewness           | 0.03   | -0.25  | -0.07  | -0.04   | 0.08    |
| Excess kurtosis    | 9.58   | 10.70  | 4.90   | 5.57    | 8.31    |
| Modified VaR       | 3.51%  | 4.72%  | 3.13%  | 4.37%   | 5.23%   |
| Modified SR (×100) | 0.81   | 0.91   | 0.76   | 1.23    | 1.15    |
| Turnover           | 2.07%  | 1.21%  | 3.89%  | 2.45%   | 2.89%   |

Notes. We repeat the empirical analysis of Table 4.2 using daily returns of the 30Ind dataset. Contrary to Table 4.2, the number of principal components, K, is selected via the methodology of Velicer (1976); see Section 3.6.1.

The table shows that our methodology continues to work well even for daily returns. MVPC has a larger Sharpe ratio and lower turnover than MV. Compared to MMVaR, MMVaRPC has a larger Sharpe ratio and modified Sharpe ratio, and a lower turnover. However, it has a larger tail risk than MMVaR, with more kurtosis and a larger modified VaR. Finally, there are no benefits in using the MMVaRIC strategy rather than MMVaRPC. Thus, when the sample size is large, the extra reduction in the number of parameters associated with using the ICs does not compensate for the extra bias it introduces. Overall, in fact, the classical estimates MV and MMVaR best achieve the desired objective function: MV has the lowest volatility and MMVaR has the lowest modified VaR. This suggests that, to improve the out-of-sample performance in cases where estimation risk is less prominent (large sample size in Table 4.3 or low number of assets for the 5Ind dataset in Table 4.2), it may be worth combining our sparse tensor estimates with the classical sample estimates, in the spirit of shrinkage estimators (see Section 1.1.3). This is further suggested in Section 6.2.1 on open questions for future research.

Chapter 5

Portfolio selection: A target-distribution approach

Abstract

We introduce the minimum-divergence portfolio, a strategy that accounts for higher moments in a simple and intuitive way. It is the portfolio whose return density is as close as possible to a target-return density that captures the investor’s preferences. We study target densities among the generalized-normal family, for which the objective function admits a natural decomposition. We propose to match the first two target-return moments to those of a reference portfolio, chosen on the mean-variance efficient frontier. The minimum-divergence portfolio is therefore expected to be close to the reference portfolio, but with higher return moments implicitly shrunk toward those of the target distribution. We demonstrate that we recover Markowitz’s efficient frontier when asset returns are Gaussian, and when a Dirac-delta target is taken. An extensive empirical study reveals that minimum-divergence portfolios exhibit mean-variance trade-offs comparable to those of the reference portfolios, but have substantially less tail risk, including in crisis periods. Moreover, our strategy is shown to outperform common higher-moment portfolios put forward in the literature.

Reference

This chapter is partially based on: Lassance, N., & Vrins, F. (2019). Portfolio selection: A target-distribution approach.

5.1 Introduction

According to the celebrated expected-utility framework, agents make decisions in an uncertain environment by maximizing the expectation of a utility function. Following the utility theorem of von Neumann and Morgenstern (1953), this framework provides a valid representation for agents with rational preferences. Embracing the expected-utility framework amounts to modeling agents’ preferences via a single function: the utility function.


The seminal mean-variance portfolio theory, reviewed in Section 1.1.1, is largely built upon the expected-utility framework. Investors who only care about the mean and variance of their portfolio returns should select their optimal portfolio on the mean-variance efficient frontier, and the latter is the result of an expected-quadratic-utility maximization problem. In this case, the quadratic utility embeds all the investor’s preferences, in terms of mean return and variance aversion. The mean-variance-efficient portfolio remains the baseline approach to portfolio selection. However, two main concerns have made researchers and practitioners alike less keen to rely on the utility paradigm.

The first concern comes from the large estimation risk associated with the mean-variance-efficient portfolio, which stems mainly from the well-known difficulty of estimating the mean return. Thus, although theoretically optimal in terms of quadratic utility, this strategy can be very disappointing in terms of out-of-sample performance; see Section 1.1.2. This explains why researchers and practitioners have turned to alternative strategies, such as the equally weighted portfolio, the minimum-variance portfolio, Bayesian portfolios and shrinkage portfolios, to cite a few; see Section 1.1.3. Those approaches no longer maximize the expected quadratic utility in sample, but are appealing due to their better out-of-sample performance. It is worth stressing that this departure from the expected-utility framework is by no means a rejection of the theory; it is a consequence arising from estimation considerations. An investor who is interested only in the first two return moments and who has perfect knowledge of the future mean and covariance matrix of asset returns should still go for the mean-variance-efficient portfolio.

The second concern is more problematic and is related to the fact that asset returns are not Gaussian. In this case, the quadratic utility is not appropriate anymore because investors pay attention to the whole shape of the portfolio-return distribution, which is no longer fully described by the first two moments; see Section 1.2. However, whereas the quadratic utility function is easy to specify, designing a utility function that adequately captures higher-moment preferences is arduous. In particular, as explained in Section 1.2, the standard approach based on cubic or quartic utility functions is not adequate as it yields portfolios very close to mean-variance ones. Thus, in the presence of higher moments, the utility function may not be the best way for investors to design optimal portfolios. As a result, many researchers going beyond the mean-variance-efficient portfolio depart, to some extent, from the expected-utility framework; see the second, third and fourth higher-moment approaches in Section 1.2. Overall, no consensus emerges from the literature on how to best manage the higher moments of portfolio returns, which is symptomatic of the difficulty of specifying the utility function adequately.

In this chapter, we propose a framework that reinstates the fundamental idea underlying the expected-utility framework, which is to capture all the investor’s preferences using a single, appropriate function of the portfolio returns. However, we consider instead a function that is expected to be more intuitive to investors than utility functions and more directly related to the backbone of investment strategies: the distribution of the portfolio returns itself. Specifically, the proposed approach aims to capture all the investor’s preferences in terms of a single function: the target-return density. If this density is attainable given the investment set, then the investor selects the corresponding portfolio. Otherwise, the optimal portfolio is the minimum-divergence portfolio, whose return density is as close as possible to the target-return density. This approach emphasizes that what matters for investors is, eventually, the portfolio-return distribution as a whole.

Our setting is related to Bera and Park (2008), who minimize the Kullback-Leibler (KL) divergence between the portfolio weights and a set of target weights specified by the investor, subject to moment constraints. We extend that idea by shrinking the portfolio-return distribution (not the portfolio weights) toward a target-return distribution. Chalabi and Würtz (2012) study the mathematical aspects of a similar problem via a dual representation of the optimization program. In this chapter, we focus primarily on the choice of target-return distribution and the empirical performance of minimum-divergence portfolios. Our proposal is also related in spirit to cost-efficient portfolios in derivatives pricing; see Dybvig (1988) and Bernard et al. (2014) for example. Given a payoff distribution specified by the investor at some time horizon, the cost-efficient portfolio is the lowest-cost dynamic strategy that replicates the given payoff, scenario by scenario. In this chapter, we are interested instead in finding the static strategy that best approaches a given return distribution (the target).

Our contribution is fourfold. The first contribution is methodological, and concerns the target-distribution approach to portfolio selection. This approach is fully general and depends on the target distribution preferred by the investor, just like the expected-utility approach depends on the utility function of the investor. In this chapter, we study the portfolio problem of investors who target a generalized-normal return distribution (Nadarajah 2005), a three-parameter symmetric distribution that extends the Gaussian one to allow for non-zero excess kurtosis. Although other choices can be made, this parametric family provides a sparse and sensible choice to depict the preferences of investors aiming to limit the asymmetry and leptokurticity of their portfolio returns. We rely on the Kullback-Leibler (KL) divergence, which is mathematically convenient and quantifies the amount of lost information (the increase of entropy) when approximating the portfolio-return density by the target-return density.

The second contribution concerns the theoretical investigation of the minimum-divergence portfolio in the above setting. First, we show that when asset returns are Gaussian and the target-return mean and variance are on the efficient frontier, the minimum-divergence portfolio is always mean-variance efficient. In the non-Gaussian case, the minimum-divergence portfolio is impacted by the portfolio-return higher moments, and thus, is no longer mean-variance efficient. Second, we study the case of a Gaussian target return and show that the objective function admits a natural decomposition into three separate terms. The first two terms measure the fit to the target-return mean and variance. The third term features the Shannon entropy of the standardized portfolio return and naturally shrinks the higher moments of portfolio returns toward those of the Gaussian, which is desirable in terms of downside risk. Third, we show that the mean-variance efficient frontier can be reconstructed via our approach by considering a set of Dirac-delta target-return densities.

The third contribution is to guide the investor regarding the choice of target-return distribution by linking it to the portfolio-return moments. Specifically, we introduce a reference-portfolio approach, whereby the target-return mean and variance match those of a mean-variance-efficient reference portfolio. This approach allows us to build upon the literature on (robust) mean-variance-efficient portfolios (see Section 1.1.3) and shrink them toward a portfolio whose higher moments reflect the target-return density chosen. To guide the reader, Figure 5.1 provides a schematic representation of the minimum-divergence-portfolio methodology.

The fourth contribution concerns the practical implementation of the minimum-divergence portfolio. We provide a closed-form estimate of the objective function based on a Gaussian-mixture estimator of the portfolio-return density and then assess the out-of-sample performance of the minimum-divergence portfolio based on six equity-return benchmark datasets. We rely on five popular minimum-variance and mean-variance reference portfolios (see Section 1.1.3 for details):

(i) the sample minimum-variance portfolio;

(ii) the minimum-variance portfolio using the shrinkage estimator of the covariance matrix of Ledoit and Wolf (2004a);

(iii) the norm-constrained minimum-variance portfolio of DeMiguel et al. (2009a);

(iv) the tangent portfolio based on the Bayes-Stein estimator of Jorion (1986);

(v) the optimal combination of the tangent and minimum-variance portfolios of Kan and Zhou (2007).

Figure 5.1: Schematic representation of the minimum-divergence portfolio

[Figure: flow diagram. The reference mean-variance portfolio $w_0$ determines $(\mu_T, \sigma_T^2)$; together with the higher-moment preferences and the target-return-distribution parametric family, this fixes the target density $f_T$. The asset returns $(X_1, \ldots, X_N)$ and $f_T$ enter the program $\operatorname{argmin}_{w \in W} \langle f_P(w)|f_T\rangle$, which yields $w_{f_T}$.]

Notes. The figure depicts a schematic representation of the minimum-divergence portfolio, which is computed in three main steps. First, collect the $N$ asset returns $X = (X_1, \ldots, X_N)'$. Second, choose the target-return density $f_T$. This is achieved by relying on a particular parametric family of distributions, such as the generalized-normal one in this chapter, and determining the parameters of the distribution. We advise to match the target-return mean and variance $(\mu_T, \sigma_T^2)$ to those of a mean-variance-efficient reference portfolio $w_0$. The higher-moment parameters are then determined depending on the investor’s higher-moment preferences, represented for example by the parameter $\gamma$ controlling kurtosis in the case of the generalized-normal target-return distribution. Third, the minimum-divergence portfolio $w_{f_T}$ is computed by minimizing the divergence $\langle f_P(w)|f_T\rangle$ between the portfolio-return density $f_P(w)$ and the target-return density $f_T$, given the portfolio-weight constraints in $W$. In this chapter, we rely on the Kullback-Leibler divergence.

We show that our minimum-divergence portfolios achieve a mean-variance trade-off as good as that of the reference portfolios but, as expected from the decomposition of the objective function, exhibit a substantial improvement in tail risk as measured by the skewness, kurtosis and modified VaR. In addition, our approach outperforms several aforementioned higher-moment portfolios (see Section 1.2). Compared to the minimum-modified-VaR portfolio of Favre and Galeano (2002), our strategy often generates a lower modified VaR out of sample. Our strategy also outperforms the portfolio of Briec et al. (2007) in terms of kurtosis and modified VaR. Finally, our results confirm that the Taylor series expansion of the CRRA utility function of Martellini and Ziemann (2010) gives portfolios that remain very close to those computed without the inclusion of higher moments. This contrasts with our strategy, as the minimum-divergence portfolio is greatly impacted by higher moments.

The chapter is organized as follows. Section 5.2 introduces the general formulation of the minimum-divergence portfolio. Sections 5.3 and 5.4 study the generalized-normal and Gaussian target-return distributions. Section 5.5 presents the reference-portfolio approach to fix the target-return mean and variance. Section 5.6 deals with the estimation of the minimum-divergence portfolio. Section 5.7 evaluates the out-of-sample performance of our portfolios. Section 5.8 concludes. The Appendix at the end of the chapter contains proofs for all results, as well as supplementary materials.

5.2 Minimum-divergence portfolio

In this section, we introduce the minimum-divergence portfolio in its general form, allowing for any choice of target-return distribution. We also discuss the choice of divergence measure.

5.2.1 General formulation

Consider a given target return T , with density fT , specified by the investor according to his preferences. We define the minimum-divergence portfolio as the portfolio whose return density fP minimizes a divergence with respect to the target-return density fT .

Definition 5.1 (Minimum-divergence portfolio). The minimum-divergence portfolio $w_{f_T}$ is the portfolio that minimizes the divergence between the portfolio-return density $f_P$ and the target-return density $f_T$, which we write as

$$w_{f_T} := \operatorname*{argmax}_{w \in W} \; C(w; f_T), \tag{5.1}$$

where

$$C(w; f_T) := -\langle f_P(w)|f_T\rangle, \tag{5.2}$$

with $\langle f_P|f_T\rangle$ a divergence measure between $f_P$ and $f_T$; that is, $\langle f_P|f_T\rangle \geq 0$ for all $f_P, f_T$, with equality if and only if $f_P = f_T$ almost everywhere.

Note that because $C(w; f_T)$ is an objective function, equalities in different specifications of $C(w; f_T)$ will hold up to additive or strictly positive multiplicative constants independent of $w$.

In its general form, the minimum-divergence portfolio $w_{f_T}$ is agnostic to the choice of target-return density $f_T$ and divergence measure $\langle f_P|f_T\rangle$. The choice of divergence measure is covered in Section 5.2.2, and the choice of target-return density in Sections 5.3 and 5.4.

5.2.2 The Kullback-Leibler divergence

Definition 5.1 leaves free the choice of the divergence measure. In practice, one can rely on different types of divergence measures proposed in the literature; see Ullah (1996) and Basseville (2013) for extensive reviews. The most common choice, which we also follow in this chapter, is the Kullback-Leibler (KL) divergence, also known as relative entropy (Kullback and Leibler 1951). The KL divergence between $f_P$ and $f_T$ is derived as follows. In this chapter, $P$ and $T$ always have the real line as support: $\Omega_P = \Omega_T = \mathbb{R}$.

Then, we can measure the amount of lost information when approximating $f_P$ by $f_T$ via the increase of Shannon entropy

$$-\int_{-\infty}^{+\infty} f_P(x) \ln f_T(x)\,dx - \left(-\int_{-\infty}^{+\infty} f_P(x) \ln f_P(x)\,dx\right), \tag{5.3}$$

where the first term is known as the cross-entropy. The quantity in (5.3) is the KL divergence, which can be rewritten as

$$\langle f_P|f_T\rangle := \int_{-\infty}^{+\infty} f_P(x) \ln\frac{f_P(x)}{f_T(x)}\,dx = \mathbb{E}\left[\ln\frac{f_P(P)}{f_T(P)}\right]. \tag{5.4}$$

It is a measure of divergence in the sense that $\langle f_P|f_T\rangle \geq 0$ for all $f_P, f_T$, and $\langle f_P|f_T\rangle = 0$ if and only if $f_P = f_T$ almost everywhere; see Cover and Thomas (2006, p.252). The KL divergence has been used in many financial applications; see Jiang et al. (2018) and references therein. A detailed analysis of the properties of the KL divergence can be found in van Erven and Harremoës (2014).¹
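As an illustration of (5.3) and (5.4), the following sketch evaluates the KL divergence numerically for two Gaussian densities and compares it with the well-known closed form for Gaussians. It is only a sanity check under simple assumptions: the two densities, the integration range and the trapezoidal rule are our own choices, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(g, lo=-15.0, hi=15.0, n=20000):
    """Composite trapezoidal rule on [lo, hi]; the range suits densities with scale near 1."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h


def cross_entropy(f_p, f_t):
    """First term of (5.3)."""
    return -integrate(lambda x: f_p(x) * math.log(f_t(x)))


def entropy(f_p):
    """Shannon entropy, i.e. the cross-entropy of f_p with itself."""
    return cross_entropy(f_p, f_p)


def kl_divergence(f_p, f_t):
    """Direct evaluation of (5.4)."""
    return integrate(lambda x: f_p(x) * math.log(f_p(x) / f_t(x)))


def kl_gaussians(mu_p, sigma_p, mu_t, sigma_t):
    """Closed-form KL divergence between two univariate Gaussians."""
    return (math.log(sigma_t / sigma_p)
            + (sigma_p**2 + (mu_p - mu_t)**2) / (2.0 * sigma_t**2) - 0.5)


f_p = lambda x: normal_pdf(x, 0.0, 1.0)   # illustrative portfolio-return density
f_t = lambda x: normal_pdf(x, 0.5, 1.5)   # illustrative target-return density

print("KL(P|T), numerical  :", kl_divergence(f_p, f_t))
print("KL(P|T), closed form:", kl_gaussians(0.0, 1.0, 0.5, 1.5))
print("cross-entropy - entropy:", cross_entropy(f_p, f_t) - entropy(f_p))
print("KL(T|P), numerical  :", kl_divergence(f_t, f_p))
```

The last line evaluates the converse direction, which in general differs from the first: the KL divergence is not symmetric.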

We rely on the KL divergence for two main reasons.

1. It has a natural information-theoretic interpretation: it measures the increase of entropy, and thus the amount of lost information, when approximating $f_P$ by $f_T$. In particular, entropy depends on higher moments, and thus, the KL divergence measures the distribution fit in a more general way than just comparing the first two moments of the portfolio and target return $P$ and $T$.

2. Because of the properties of the log function, the KL divergence is mathematically more tractable than other divergence measures, particularly when working with exponential-family densities. For example, as we show in Section 5.4, it naturally decomposes into three separate terms when targeting a Gaussian return density.

¹ The KL divergence is a particular instance in the class of f-divergences introduced by Ali and Silvey (1966) and Csiszar (1967), corresponding to $f(x) = x \ln x$.

One apparent drawback of the KL divergence is its asymmetry, as $\langle f_P|f_T\rangle \neq \langle f_T|f_P\rangle$ in general. The Gaussian target return studied in Section 5.4 shows that the direction $\langle f_P|f_T\rangle$ is clearly the most appropriate one: minimizing the converse direction amounts to fitting only the first two moments of the target return, while our chosen direction also controls for the portfolio-return higher moments thanks to the additional entropy term.

5.3 Targeting a generalized-normal return distribution

Having introduced the general target-distribution framework in Section 5.2, we now turn to the choice of target-return density $f_T$. In this chapter, we focus on the generalized-normal distribution developed by Nadarajah (2005). This distribution is symmetric and is a particular case of the skewed generalized t distribution, which is considered to provide a good fit to asset returns (Theodossiou 1998). It remains parsimonious as it is defined with three parameters controlling for the location, scale and kurtosis of the target return, respectively. There are several reasons why we decide to rely on this particular distribution:

• Compared to other well-known symmetric extensions of the Gaussian, such as the Student t distribution, the generalized-normal one allows for both positive and negative excess kurtosis, and so is more flexible in that respect.

• It represents a natural target-return candidate for investors who aim to limit the asymmetry and leptokurticity of their portfolio-return distribution. In particular, even though investors would ideally prefer a positive skewness, targeting a zero skewness makes sense because, in real applications, the skewness of portfolio returns is typically negative; see the empirical results of Section 5.7.

• Relying on a symmetric distribution limits the sensitivity of the minimum-divergence portfolio to the estimate of portfolio-return skewness. This is desirable because, as shown by Martellini and Ziemann (2010), skewness is much more difficult to estimate than volatility and kurtosis. In particular, some preliminary results obtained with the skew-normal distribution of Azzalini (1985) show that, out of sample, it leads to minimum-divergence portfolios with worse skewness than when relying on the generalized-normal distribution.

• It belongs to the exponential family of densities, which makes it analytically convenient when working with the KL divergence.

In this section, we consider the generalized-normal distribution in its general form. In Section 5.4, we focus more particularly on the Gaussian target-return distribution case.

5.3.1 Minimum-divergence portfolio under a generalized-normal target

Following Nadarajah (2005), the generalized-normal distribution is defined as follows.

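Definition 5.2 (Generalized-normal target return). The target return follows a generalized-normal distribution, $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$, if its density function is given by²

$$f_T(x; \alpha, \beta, \gamma) := \frac{\gamma\, 2^{-\frac{\gamma+1}{\gamma}}}{\beta\,\Gamma(1/\gamma)}\, e^{-\frac{1}{2}\left(\frac{|x-\alpha|}{\beta}\right)^{\gamma}}, \tag{5.5}$$

with $\alpha \in \mathbb{R}$ and $\beta, \gamma > 0$. The corresponding moments of $T$ are given by

$$\mu_T = \alpha, \quad \sigma_T^2 = \beta^2\, 4^{1/\gamma}\, \frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}, \quad \zeta_T = 0 \quad \text{and} \quad \kappa_T = \frac{\Gamma(5/\gamma)\Gamma(1/\gamma)}{\Gamma(3/\gamma)^2} - 3. \tag{5.6}$$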

Observe in particular that this distribution is symmetric and that its excess kurtosis is negative for γ > 2. The minimum excess kurtosis, obtained at the limit γ → ∞, is −6/5. The generalized-normal distribution retrieves the Laplace distribution for γ = 1, the Gaussian distribution for γ = 2 and the Uniform distribution on the interval [α − β, α + β] for γ → ∞. Figure 5.2 depicts the zero-mean unit-variance generalized-normal density for different γ. Using the KL divergence in (5.4), the objective function associated with the generalized-normal target-return distribution is stated in the next proposition.

Proposition 5.1. Let $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$. Then, the minimum-divergence portfolio $w_{f_T} = w_{\alpha,\beta,\gamma}$ is obtained by maximizing

$$C(w; \alpha, \beta, \gamma) = H(P) - \frac{1}{2\beta^{\gamma}}\, \mathbb{E}\left[|P - \alpha|^{\gamma}\right], \tag{5.7}$$

where $H(P)$ is the portfolio-return Shannon entropy in (2.2).

² Note that, in the original parametrization of Nadarajah (2005), there is no 1/2 term in the exponent of the exponential. We parametrize the density in this way to ensure that it corresponds to the Gaussian density for γ = 2.

Figure 5.2: Generalized-normal target-return density for different values of γ

[Figure: density $f_T(x; \alpha, \beta, \gamma)$ plotted against $x \in [-3, 3]$ for γ ∈ {0.7, 1, 2, 4, 10}.]

Notes. The figure depicts the generalized-normal target-return density defined in (5.5). The location and scale parameters (α, β) are computed such that the target return has zero mean and unit variance; that is, $(\mu_T, \sigma_T) = (0, 1)$. Several values for the kurtosis coefficient γ are considered: γ ∈ {0.7, 1, 2, 4, 10}. The higher the value of γ, the lower the target-return kurtosis.
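The density (5.5) and the moments (5.6) are straightforward to evaluate numerically. The sketch below is an illustration only, with hypothetical function names; it uses log-gamma functions for numerical stability and recovers the special cases mentioned in the text (Laplace for γ = 1, Gaussian for γ = 2, excess kurtosis approaching −6/5 for large γ).

```python
import math


def gn_pdf(x, alpha, beta, gamma):
    """Generalized-normal density (5.5), with the 1/2 factor in the exponent."""
    log_norm = (math.log(gamma) - (gamma + 1.0) / gamma * math.log(2.0)
                - math.log(beta) - math.lgamma(1.0 / gamma))
    return math.exp(log_norm - 0.5 * (abs(x - alpha) / beta) ** gamma)


def gn_variance(beta, gamma):
    """Variance in (5.6)."""
    return beta**2 * 4.0 ** (1.0 / gamma) * math.exp(
        math.lgamma(3.0 / gamma) - math.lgamma(1.0 / gamma))


def gn_excess_kurtosis(gamma):
    """Excess kurtosis in (5.6); it does not depend on alpha or beta."""
    return math.exp(math.lgamma(5.0 / gamma) + math.lgamma(1.0 / gamma)
                    - 2.0 * math.lgamma(3.0 / gamma)) - 3.0


def gn_beta(sigma_t, gamma):
    """Scale beta that yields a target volatility sigma_t (inverts the variance in (5.6))."""
    return sigma_t / math.sqrt(gn_variance(1.0, gamma))


for gamma in (0.7, 1.0, 2.0, 4.0, 10.0):
    beta = gn_beta(1.0, gamma)  # unit-variance target, as in Figure 5.2
    print(gamma, round(beta, 4), round(gn_pdf(0.0, 0.0, beta, gamma), 4),
          round(gn_excess_kurtosis(gamma), 4))
```

The helper `gn_beta` is the same mapping from $(\sigma_T, \gamma)$ to β that is needed whenever the target-return volatility, rather than β, is fixed by the investor.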

Without further assumption on the portfolio-return distribution, the expectation $\mathbb{E}[|P - \alpha|^{\gamma}]$ cannot be further simplified. We consider below the special case of Gaussian asset returns.

5.3.2 The case of Gaussian asset returns

In this section, we show that the minimum-divergence portfolio is mean-variance efficient for all γ provided that the asset returns are Gaussian, X ∼ N (µ, Σ), and that the target-return mean and variance are on the mean-variance efficient frontier. To show this result, we first need the following proposition.

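Proposition 5.2. Let $P \sim \mathcal{N}(\mu_P, \sigma_P)$ and $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$. Then, the objective function in (5.7) is maximized for

$$(\mu_P, \sigma_P) = (\mu_T, C(\gamma)\sigma_T), \tag{5.8}$$

where

$$C(\gamma) := \sqrt{\frac{\Gamma(1/\gamma)}{2\Gamma(3/\gamma)}} \left(\frac{\Gamma\left(\frac{\gamma+1}{2}\right)\gamma}{\pi^{1/2}}\right)^{-\frac{1}{\gamma}}. \tag{5.9}$$

$C(\gamma) \leq 1$, with equality if and only if γ = 2. Moreover, $\lim_{\gamma \to 0} C(\gamma) = \lim_{\gamma \to +\infty} C(\gamma) = 0$.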

The function C(γ) is depicted in Figure 5.3. We conclude from Proposition 5.2 that, when the portfolio return is Gaussian, its variance minimizing the KL divergence is always lower than the target-return one. Naturally, when targeting a Gaussian target return (γ = 2), the KL divergence is minimized for $(\mu_P, \sigma_P) = (\mu_T, \sigma_T) = (\alpha, \beta)$.

Figure 5.3: Function C(γ) as a function of γ

[Figure: C(γ) plotted against γ ∈ [0, 20]; vertical axis from 0 to 1.]

Notes. The figure depicts the function C(γ) in (5.9). When the portfolio return is Gaussian, $P \sim \mathcal{N}(\mu_P, \sigma_P)$, then, as we show in Proposition 5.2, the pair $(\mu_P, \sigma_P)$ minimizing the KL divergence with respect to $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$ is given by $(\mu_P, \sigma_P) = (\mu_T, C(\gamma)\sigma_T)$. Thus, $\sigma_P \leq \sigma_T$ for all γ and $\sigma_P = \sigma_T$ if and only if γ = 2.
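The function C(γ) in (5.9) can be evaluated directly, and Proposition 5.2 can be checked numerically. The sketch below is an illustration under the assumptions of the proposition (Gaussian portfolio return with $\mu_P = \mu_T$); the function names are hypothetical. It uses the standard result $\mathbb{E}|Z|^{\gamma} = 2^{\gamma/2}\Gamma((\gamma+1)/2)/\pi^{1/2}$ for a standard Gaussian $Z$, so that the objective (5.7) reduces, up to an additive constant, to a function of $\sigma_P$ only.

```python
import math


def c_gamma(gamma):
    """Function C(gamma) in (5.9), computed on the log scale for stability."""
    log_first = 0.5 * (math.lgamma(1.0 / gamma) - math.log(2.0)
                       - math.lgamma(3.0 / gamma))
    log_second = -(1.0 / gamma) * (math.log(gamma)
                                   + math.lgamma((gamma + 1.0) / 2.0)
                                   - 0.5 * math.log(math.pi))
    return math.exp(log_first + log_second)


def gaussian_objective(sigma_p, sigma_t, gamma):
    """Objective (5.7), up to an additive constant, for P ~ N(mu_T, sigma_p)."""
    # beta implied by the target volatility sigma_t, from the variance in (5.6)
    beta = sigma_t * 2.0 ** (-1.0 / gamma) * math.exp(
        0.5 * (math.lgamma(1.0 / gamma) - math.lgamma(3.0 / gamma)))
    # E|P - mu_T|^gamma for a centered Gaussian with volatility sigma_p
    abs_moment = (sigma_p**gamma * 2.0 ** (gamma / 2.0)
                  * math.gamma((gamma + 1.0) / 2.0) / math.sqrt(math.pi))
    return math.log(sigma_p) - abs_moment / (2.0 * beta**gamma)


sigma_t = 0.14  # illustrative target volatility
for gamma in (0.5, 1.0, 2.0, 4.0, 10.0):
    s_star = c_gamma(gamma) * sigma_t
    print(gamma, round(c_gamma(gamma), 4), round(s_star, 4),
          gaussian_objective(s_star, sigma_t, gamma))
```

Evaluating `gaussian_objective` slightly below and above `c_gamma(gamma) * sigma_t` gives smaller values, consistent with (5.8).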

However, there may not be a portfolio w attaining the optimal pair (µP , σP ) in (5.8). In particular, a sensible choice to select target-return distributions that are somehow compatible with the considered investment universe is to select (µT , σT ) on the efficient frontier: µT ≥ µMV and σT = σEF (µT ), where σEF is defined in (1.12). This idea is further developed in Section 5.5. In that case, except if γ = 2, the pair (µT ,C(γ)σT ) is located outside of the efficient frontier and is not achievable. Nonetheless, the portfolio wα,β,γ remains mean-variance efficient for all γ as formally stated in the next theorem.

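Theorem 5.1. Let $X \sim \mathcal{N}(\mu, \Sigma)$ and $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$ with $\mu_T \geq \mu_{MV}$ and $\sigma_T = \sigma_{EF}(\mu_T)$. Then, the minimum-divergence portfolio $w_{\alpha,\beta,\gamma}$ is mean-variance efficient for all γ, with $\mu_{MV} \leq \mu_P \leq \mu_T$ and $\sigma_P = \sigma_{EF}(\mu_P)$. Specifically, $\mu_P = \mu_T$ for γ = 2 and, otherwise,

$$\mu_P = \operatorname*{argmax}_{\mu_{MV} \leq \mu_P \leq \mu_T} \; \ln \sigma_P - \frac{2^{\frac{\gamma-2}{2}}\, \Gamma\left(\frac{\gamma+1}{2}\right)}{\pi^{1/2}} \left(\frac{\sigma_P}{\beta}\right)^{\gamma} {}_1F_1\left(-\frac{1}{2}\left(\frac{\mu_P - \mu_T}{\sigma_P}\right)^{2}; -\frac{\gamma}{2}, \frac{1}{2}\right), \tag{5.10}$$

with $\beta = \sqrt{\frac{\Gamma(1/\gamma)}{\Gamma(3/\gamma)}}\, 2^{-1/\gamma} \sigma_T$ and $\sigma_P = \sigma_{EF}(\mu_P)$, where

$${}_1F_1(x; c_1, c_2) := \frac{\Gamma(c_2)}{\Gamma(c_1)} \sum_{n=0}^{\infty} \frac{\Gamma(c_1 + n)}{\Gamma(c_2 + n)} \frac{x^n}{n!} \tag{5.11}$$

is the confluent hypergeometric function (see Slater 1972).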
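To make (5.10) and (5.11) concrete, the sketch below evaluates the series (5.11) by a term-by-term recursion and uses it to compute the Gaussian absolute moment that appears in the objective. It is a minimal illustration, not the implementation used for the results in this chapter: the function names are hypothetical, the plain series is only reliable for moderate values of $|x|$, and the link between the absolute moment of a Gaussian and ${}_1F_1$ is taken as a standard result.

```python
import math


def hyp1f1(x, c1, c2, tol=1e-15, max_terms=500):
    """Series (5.11), with the argument order (x; c1, c2) used in the text.

    The ratio Gamma(c1 + n) / Gamma(c1) is accumulated recursively, which
    also covers non-positive integer c1 (the series then terminates).
    """
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (c1 + n) / (c2 + n) * x / (n + 1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def gaussian_abs_moment(mean, sigma, gamma):
    """E|Y|^gamma for Y ~ N(mean, sigma^2), expressed with 1F1."""
    scale = (sigma**gamma * 2.0 ** (gamma / 2.0)
             * math.gamma((gamma + 1.0) / 2.0) / math.sqrt(math.pi))
    return scale * hyp1f1(-0.5 * (mean / sigma) ** 2, -gamma / 2.0, 0.5)


def objective_510(mu_p, sigma_p, mu_t, beta, gamma):
    """Function maximized in (5.10): ln(sigma_P) - E|P - mu_T|^gamma / (2 beta^gamma)."""
    return (math.log(sigma_p)
            - gaussian_abs_moment(mu_p - mu_t, sigma_p, gamma) / (2.0 * beta**gamma))


print(hyp1f1(0.7, 1.0, 2.0), (math.exp(0.7) - 1.0) / 0.7)
print(gaussian_abs_moment(0.5, 1.0, 2.0))  # second moment: sigma^2 + mean^2
```

In an application of Theorem 5.1, `objective_510` would be maximized over $\mu_P \in [\mu_{MV}, \mu_T]$ with `sigma_p` set to $\sigma_{EF}(\mu_P)$.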

Figure 5.4: Minimum-divergence portfolio with Gaussian asset returns

[Figure: mean-volatility plane, with $\sigma_P$ on the horizontal axis and $\mu_P$ on the vertical axis; the target pair $(\sigma_T, \mu_T)$ is marked, together with the minimum-divergence portfolios for γ ∈ {0.1, 0.25, 0.5, 1, 1.5, 2, 4, 10, 20, 50}.]

Notes. The figure depicts the results for the setting in Example 5.1. We consider N = 3 Gaussian asset returns with means $(\mu_{X_1}, \mu_{X_2}, \mu_{X_3}) = (0.02, 0.07, 0.15)$, volatilities $(\sigma_{X_1}, \sigma_{X_2}, \sigma_{X_3}) = (0.05, 0.15, 0.30)$, and pairwise correlations $(\rho_{12}, \rho_{13}, \rho_{23}) = (0, 0.3, 0.6)$. We target a generalized-normal distribution $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$ with $\mu_T = 0.08$ and $\sigma_T = \sigma_{EF}(\mu_T) = 0.14$. The blue curve depicts the mean-volatility hyperbola. The red crosses depict the minimum-divergence portfolios $w_{\alpha,\beta,\gamma}$ obtained for γ ranging between 0.1 and 50. In line with Theorem 5.1, $w_{\alpha,\beta,\gamma}$ is mean-variance efficient for all γ when asset returns are Gaussian and $(\mu_T, \sigma_T)$ is on the efficient frontier.

Theorem 5.1 shows that the minimum-divergence portfolio is mean-variance efficient when asset returns are Gaussian, as advocated by Markowitz. We give an illustration of Theorem 5.1 in the next example.

Example 5.1. Consider an investment set made of N = 3 Gaussian asset returns with means $(\mu_{X_1}, \mu_{X_2}, \mu_{X_3}) = (0.02, 0.07, 0.15)$, volatilities $(\sigma_{X_1}, \sigma_{X_2}, \sigma_{X_3}) = (0.05, 0.15, 0.30)$, and pairwise correlations $(\rho_{12}, \rho_{13}, \rho_{23}) = (0, 0.3, 0.6)$. We target a generalized-normal distribution $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$ with $\mu_T = 0.08$ and $\sigma_T = \sigma_{EF}(\mu_T) = 0.14$. In Figure 5.4, we depict the return mean and volatility of the minimum-divergence portfolios $w_{\alpha,\beta,\gamma}$ obtained for γ ranging between 0.1 and 50. The results are consistent with Proposition 5.2 and Theorem 5.1: $w_{\alpha,\beta,\gamma}$ is mean-variance efficient for all γ and, the more γ is distant from 2, the more $w_{\alpha,\beta,\gamma}$ approaches the minimum-variance portfolio $w_{MV}$.
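The inputs of Example 5.1 can be reproduced with a few lines of code. The sketch below rebuilds the covariance matrix and evaluates the mean-variance frontier at $\mu_T = 0.08$. It relies on assumptions that are not restated in this section: we take $\sigma_{EF}$ to be the usual closed-form frontier under the full-investment constraint only (weights summing to one, short sales allowed), which we assume corresponds to (1.12). All function and variable names are hypothetical.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Inputs of Example 5.1
mu = [0.02, 0.07, 0.15]
vol = [0.05, 0.15, 0.30]
rho = {(0, 1): 0.0, (0, 2): 0.3, (1, 2): 0.6}
cov = [[vol[i] * vol[j] * (1.0 if i == j else rho[(min(i, j), max(i, j))])
        for j in range(3)] for i in range(3)]

ones = [1.0] * 3
inv_ones = solve(cov, ones)  # Sigma^{-1} 1
inv_mu = solve(cov, mu)      # Sigma^{-1} mu
a, b, c = dot(mu, inv_mu), dot(mu, inv_ones), dot(ones, inv_ones)
d = a * c - b * b
mu_mv, sigma_mv = b / c, math.sqrt(1.0 / c)  # minimum-variance portfolio


def sigma_ef(target_mean):
    """Frontier volatility for a given mean (assumed form of sigma_EF)."""
    return math.sqrt(1.0 / c + c * (target_mean - mu_mv) ** 2 / d)


def frontier_weights(target_mean):
    """Weights of the frontier portfolio with the given mean."""
    return [((a - b * target_mean) * x + (c * target_mean - b) * y) / d
            for x, y in zip(inv_ones, inv_mu)]


print("mu_MV, sigma_MV:", mu_mv, sigma_mv)
print("sigma_EF(0.08):", sigma_ef(0.08))
print("weights at mu_T = 0.08:", frontier_weights(0.08))
```

Under these assumptions, the frontier volatility at $\mu_T = 0.08$ rounds to the value $\sigma_T = 0.14$ used in the example.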

Assuming Gaussian asset returns is often not realistic, but Theorem 5.1 shows that we recover Markowitz’s mean-variance-efficient portfolio by adopting a minimum-divergence-portfolio approach under this assumption. In the case of non-Gaussian asset returns, which is typically the case in real datasets, the minimum-divergence portfolio will no longer be mean-variance efficient if the KL divergence deems that portfolios outside of the efficient frontier have higher moments that compensate for a lesser fit to the target-return mean and volatility $(\mu_T, \sigma_T)$.

5.4 Targeting a Gaussian return distribution

We now study the particular case of the Gaussian target-return distribution, obtained as a generalized-normal distribution with γ = 2. This case is of particular interest because it provides insights about the minimum-divergence-portfolio framework. Indeed, the KL divergence admits a natural decomposition into three separate terms that measure the fit to the target-return mean, variance and higher moments, respectively. In addition, we investigate the case of a Gaussian target-return distribution with zero variance; that is, a Dirac-delta density. As we show, this allows us to recover the mean-variance efficient frontier via a novel parametrization.

5.4.1 Decomposition of the KL divergence

We consider a Gaussian target-return distribution $T \sim \mathcal{GN}(\alpha, \beta, 2) \sim \mathcal{N}(\alpha, \beta)$ with density

$$f_T(x; \alpha, \beta) = \frac{1}{\beta}\,\phi\!\left(\frac{x - \alpha}{\beta}\right), \tag{5.12}$$

where $\phi$ is the standard Gaussian density. This represents an appealing choice because in real applications the higher moments of asset returns are typically less desirable than those of the Gaussian, as indicated by their negative asymmetry and leptokurticity. In the next theorem, we show that the KL divergence decomposes into three separate terms under a Gaussian target.

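Theorem 5.2. Let $T \sim \mathcal{N}(\alpha, \beta)$. Then, the minimum-divergence portfolio $w_{f_T} = w_{\alpha,\beta}$ is obtained by maximizing

$$C(w; \alpha, \beta) = -\frac{1}{2}\left(\frac{\mu_P - \alpha}{\beta}\right)^2 + \frac{1}{2}\left(\ln \sigma_P^2 - \frac{\sigma_P^2}{\beta^2}\right) + H(P^\star), \tag{5.13}$$

where $H(P^\star) = H(P) - \ln \sigma_P$ is the Shannon entropy of the standardized portfolio return.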
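The decomposition can be checked with a short derivation. The sketch below assumes that the divergence being minimized is $D_{\mathrm{KL}}(f_P \,\|\, f_T) = \int f_P \ln(f_P / f_T)$, which is the direction suggested by the estimated objective in (5.21), and it drops additive constants that do not depend on $w$. The symbol $D_{\mathrm{KL}}$ is notation introduced here for illustration only.

```latex
\begin{align*}
-D_{\mathrm{KL}}(f_P \,\|\, f_T)
  &= H(P) + \mathbb{E}\left[\ln f_T(P)\right] \\
  &= H(P) - \ln\beta - \tfrac{1}{2}\ln(2\pi)
     - \frac{\mathbb{E}\left[(P-\alpha)^2\right]}{2\beta^2} \\
  &= H(P) - \frac{(\mu_P-\alpha)^2 + \sigma_P^2}{2\beta^2} + \text{const} \\
  &= -\frac{1}{2}\left(\frac{\mu_P-\alpha}{\beta}\right)^2
     + \frac{1}{2}\left(\ln\sigma_P^2 - \frac{\sigma_P^2}{\beta^2}\right)
     + H(P^\star) + \text{const}.
\end{align*}
```

The last step uses $H(P) = H(P^\star) + \ln\sigma_P = H(P^\star) + \tfrac{1}{2}\ln\sigma_P^2$, which is what separates the variance term from the standardized entropy.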

The first two terms in (5.13) measure the fit to the target-return mean and variance, respectively. The first term is maximized when $\mu_P = \alpha$, and the second term when $\sigma_P = \beta$. The third term, $H(P^\star)$, is invariant to both $\mu_P$ and $\sigma_P$. It accounts for the fact that we are not only trying to match the first two target-return moments, but the whole target-return distribution (assumed Gaussian in this section), which is a different problem. Because the Gaussian distribution has the largest entropy among all distributions with the same variance and defined on the same support, $H(P^\star)$ is maximized when $P^\star \sim \mathcal{N}(0, 1)$. It captures in a single term the whole uncertainty related to the higher moments of $P$. Therefore, the minimum-divergence objective function does not merely try to fit $(\mu_P, \sigma_P)$ to $(\mu_T, \sigma_T)$; it also favors portfolio returns with a large standardized entropy, that is, returns that tend to have a "Gaussian shape" or, put differently, higher cumulants close to zero. This can be seen by looking at the second-order Gram-Charlier expansion of Shannon entropy in (2.20):

$$H(P^\star) \approx H\big(\mathcal{N}(0, 1)\big) - \frac{\zeta_P^2}{12} - \frac{\kappa_P^2}{48}. \tag{5.14}$$

The right-hand side of (5.14) is maximized when the skewness and excess kurtosis of the portfolio return equal zero. We close this section with an illustration of Theorem 5.2.

Example 5.2. Consider N = 2 independent asset returns following a Student t distribution with parameters $(\mu_1, \sigma_1, \nu_1) = (0.05, 0.10, 5)$ and $(\mu_2, \sigma_2, \nu_2) = (0.15, 0.30, 10)$. The corresponding asset-return excess kurtoses are $\kappa_{X_1} = 6$ and $\kappa_{X_2} = 1$, respectively. The investor targets a Gaussian return distribution $T \sim \mathcal{N}(\alpha, \beta)$ whose mean and variance match those of the minimum-variance portfolio $w_{MV} = (0.87, 0.13)'$. In Figure 5.5, we depict the three terms in the objective function in Theorem 5.2 as a function of the weight on the first asset, $w_1$. In line with Theorem 5.1, the minimum-divergence portfolio would coincide with the minimum-variance portfolio if we ignored the entropy term. However, the minimum-variance portfolio is mostly concentrated on the first asset, which has the largest return kurtosis, and thus has a low entropy $H(P^\star)$. In contrast, the portfolio maximizing the entropy $H(P^\star)$, given by $w = (0.55, 0.45)'$, is less concentrated on the first asset. As a result, the minimum-divergence portfolio, given by $w_{\alpha,\beta} = (0.78, 0.22)'$, is a shrinkage of the minimum-variance portfolio toward the maximum-entropy portfolio, and thus has improved tail risk. Its excess kurtosis is 2.66, versus 4.57 for the minimum-variance portfolio. This is summarized in Table 5.1.
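The volatility and excess-kurtosis figures of this example can be reproduced with a few lines of Python. This is a minimal sketch under one assumption that is not stated explicitly in the text: the parameters $\sigma_i$ are read as Student t scale parameters, so that the asset variance is $\sigma_i^2 \nu_i / (\nu_i - 2)$. That reading is the one that matches the values reported in Table 5.1. The function names are hypothetical.

```python
import math

def t_variance(scale, nu):
    """Variance of a Student t return with scale `scale` and nu > 2 degrees of freedom."""
    return scale ** 2 * nu / (nu - 2)

def t_excess_kurtosis(nu):
    """Excess kurtosis of a Student t distribution with nu > 4 degrees of freedom."""
    return 6.0 / (nu - 4)

def portfolio_vol_and_kurtosis(weights, scales, nus):
    """Volatility and excess kurtosis of a portfolio of independent assets.

    With independent returns, variances add with weights w_i^2 and
    fourth cumulants add with weights w_i^4.
    """
    variances = [t_variance(s, nu) for s, nu in zip(scales, nus)]
    kurtoses = [t_excess_kurtosis(nu) for nu in nus]
    var_p = sum(w ** 2 * v for w, v in zip(weights, variances))
    cumulant4 = sum(w ** 4 * v ** 2 * k
                    for w, v, k in zip(weights, variances, kurtoses))
    return math.sqrt(var_p), cumulant4 / var_p ** 2

scales, nus = (0.10, 0.30), (5, 10)
for label, w in [("minimum-variance", (0.87, 0.13)),
                 ("maximum-entropy", (0.55, 0.45)),
                 ("minimum-divergence", (0.78, 0.22))]:
    vol, kurt = portfolio_vol_and_kurtosis(w, scales, nus)
    print(f"{label}: volatility {vol:.2%}, excess kurtosis {kurt:.2f}")
```

The entropy column of Table 5.1 is not reproduced here because it requires the full density of the portfolio return, not only its moments.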

5.4.2 The Dirac-delta target-return distribution

The decomposition of the KL divergence in Theorem 5.2 allows us to study the limit case where the investor targets a distribution with zero variance; that is, $\beta = \sigma_T \to 0$. This corresponds to a common definition of the Dirac-delta density centered at $\alpha$, denoted by $\delta_\alpha$. From (5.13), it is useful to observe that multiplying $C(w; \alpha, \beta)$ by $\beta^2$ gives the new objective function

$$\tilde{C}(w; \alpha, \beta) := \beta^2 C(w; \alpha, \beta) = -\frac{1}{2}\left((\mu_P - \alpha)^2 + \sigma_P^2\right) + \beta^2 H(P). \tag{5.15}$$

Table 5.1: Minimum-divergence portfolio for the Gaussian target return

Portfolio                                    w1       w2       σP       κP     H(P⋆)
Minimum-variance portfolio (wMV)             87.00%   13.00%   12.05%   4.55   1.39
Maximum-entropy portfolio (argmax H(P⋆))     55.00%   45.00%   16.68%   0.87   1.41
Minimum-divergence portfolio (wα,β)          78.00%   22.00%   12.48%   2.66   1.40

Notes. The table reports the portfolio-return volatility ($\sigma_P$), excess kurtosis ($\kappa_P$) and standardized entropy ($H(P^\star)$ in (5.13)) for three portfolios in the setting of Example 5.2. We consider two independent asset returns following a Student t distribution: $X_1 \sim t(\mu_1 = 0.05, \sigma_1 = 0.10, \nu_1 = 5)$ and $X_2 \sim t(\mu_2 = 0.15, \sigma_2 = 0.30, \nu_2 = 10)$. The mean and variance of the Gaussian target return $T \sim \mathcal{N}(\alpha, \beta)$ match those of the minimum-variance portfolio $w_{MV}$. The minimum-divergence portfolio $w_{f_T}$ is a shrinkage of the minimum-variance portfolio $w_{MV}$ toward the maximum-entropy portfolio, in line with Theorem 5.2.

Figure 5.5: Decomposition of the KL divergence for the Gaussian target return

[Figure: four panels plotted against $w_1 \in [-1, 2]$. The panels show the mean term $-\frac{1}{2}\left(\frac{\mu_P - \alpha}{\beta}\right)^2$ (maximum marked at $w_1 = 0.87$), the variance term $\frac{1}{2}\left(\ln \sigma_P^2 - \frac{\sigma_P^2}{\beta^2}\right)$ (maximum marked at $w_1 = 0.87$), the standardized entropy $H(P^\star)$ (maximum marked at $w_1 = 0.55$), and the full objective $C(w; \alpha, \beta, \gamma) = -\langle f_P | f_T \rangle$ (maximum marked at $w_1 = 0.78$).]

Notes. The figure depicts the three terms decomposing the (negative) KL divergence in the case of a Gaussian target-return distribution, as derived in Theorem 5.2. The setting is the one of Example 5.2, where we consider a portfolio of two independent asset returns following a Student t distribution with parameters $(\mu_1, \sigma_1, \nu_1) = (0.05, 0.10, 5)$ and $(\mu_2, \sigma_2, \nu_2) = (0.15, 0.30, 10)$. The mean and variance of the Gaussian target return $T \sim \mathcal{N}(\alpha, \beta)$ match those of the minimum-variance portfolio given by $w_{MV} = (0.87, 0.13)'$. The curves are drawn as a function of $w_1$, the portfolio weight on the first asset. The grey arrows indicate the location of the maximum for each curve.

Clearly, maximizing C(w; α, β) amounts to maximizing C˜(w; α, β) for every β > 0. However, C(w; α, β) diverges to infinity when β → 0. In order to deal with the Dirac-delta target return, we simply define the objective function

$$C(w; \alpha) = C(w; \delta_\alpha) := \lim_{\beta \to 0} \tilde{C}(w; \alpha, \beta), \tag{5.16}$$

which is finite. The resulting optimization program features only one real parameter: the target mean return $\alpha$.

Definition 5.3 (Dirac-delta minimum-divergence portfolio). Let $f_T = \delta_\alpha$. Then, the minimum-divergence portfolio $w_{\delta_\alpha} = w_\alpha$ is obtained by maximizing

$$C(w; \alpha) = -(\mu_P - \alpha)^2 - \sigma_P^2. \tag{5.17}$$

The objective function $C(w; \alpha)$ corresponds to a mean-square-error criterion (or bias-variance criterion), as maximizing it amounts to minimizing the expectation $\mathbb{E}[(P - \alpha)^2] = (\mu_P - \alpha)^2 + \sigma_P^2$. Maximizing $C(w; \alpha)$ via the Lagrangian method leads to the following analytical solution for $w_\alpha$.

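Proposition 5.3. The minimum-divergence portfolio $w_\alpha$ obtained by maximizing $C(w; \alpha)$ is given by

$$w_\alpha = A\left(\alpha\mu - \frac{\alpha\,\mathbf{1}'A\mu - 1}{\mathbf{1}'A\mathbf{1}}\,\mathbf{1}\right) \quad \text{with} \quad A = (\Sigma + \mu\mu')^{-1}. \tag{5.18}$$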

Moreover, as we formalize in the next theorem, any mean-variance portfolio $w_{\mu_0}$ in (1.8) can be recovered by considering a specific Dirac target-return density $\delta_\alpha$. Therefore, the mean-variance efficient frontier can be constructed from the minimum-divergence portfolio with a Dirac-delta target-return density $\delta_\alpha$ by varying $\alpha$.

Theorem 5.3. Let $g(z; c_1, c_2) := c_1 z + c_2$ and

$$a := \frac{\mathbf{1}'A\mathbf{1}}{(\mu'A\mu)(\mathbf{1}'A\mathbf{1}) - (\mathbf{1}'A\mu)^2}, \qquad b := -\frac{\mathbf{1}'A\mu}{(\mu'A\mu)(\mathbf{1}'A\mathbf{1}) - (\mathbf{1}'A\mu)^2}. \tag{5.19}$$

Then, the Dirac-delta minimum-divergence portfolio $w_\alpha$

(i) matches the minimum-variance portfolio $w_{MV}$ provided that $\alpha = g(\mu_{MV}; 1, 0)$,

(ii) matches the mean-variance portfolio $w_{\mu_0}$ provided that $\alpha = g(\mu_0; a, b)$,

(iii) is mean-variance efficient if $\alpha \geq \mu_{MV}$.

Let us illustrate this theorem using a simple example.

Example 5.3. Consider N = 3 asset returns with means, volatilities and pairwise correlations as in Example 5.1. We fix $\alpha = \mu_0 = 0.10$, and we report in Table 5.2 the minimum-variance portfolio $w_{MV}$, the Dirac-delta minimum-divergence portfolio $w_\alpha$ and the mean-variance portfolio $w_{\mu_0}$. As one can observe from (5.17), $w_\alpha$ achieves a trade-off between fitting the target mean return $\alpha$ and minimizing the portfolio-return variance.

Table 5.2: Dirac-delta minimum-divergence portfolio versus mean-variance portfolio

Portfolio   w1       w2       w3       µP       σP
wMV         90.00%   17.69%   -7.69%   1.88%    4.36%
wα          76.58%   21.94%   1.48%    3.29%    5.34%
wµ0         12.49%   42.20%   45.31%   10.00%   18.26%

Notes. The table reports the results related to the setting in Example 5.3. We consider N = 3 asset returns with means $(\mu_{X_1}, \mu_{X_2}, \mu_{X_3}) = (0.02, 0.07, 0.15)$, volatilities $(\sigma_{X_1}, \sigma_{X_2}, \sigma_{X_3}) = (0.05, 0.15, 0.30)$, and pairwise correlations $(\rho_{12}, \rho_{13}, \rho_{23}) = (0, 0.3, 0.6)$. We report the weights $w$ and the portfolio-return mean and volatility $(\mu_P, \sigma_P)$ for three different portfolios: the minimum-variance portfolio $w_{MV}$ in (1.6), the mean-variance portfolio $w_{\mu_0}$ in (1.8) with $\mu_0 = 0.10$, and the Dirac-delta minimum-divergence portfolio $w_\alpha$ in (5.18) with $\alpha = \mu_0$.

Thus, as the table shows, wα is located on the mean-variance efficient frontier between the minimum-variance and mean-variance portfolios.
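The closed form in (5.18) and the mapping $\alpha = g(\mu_0; a, b)$ in (5.19) are easy to evaluate numerically. The following is a minimal sketch with NumPy for the three-asset setting of Example 5.3; the function names are hypothetical, and the explicit matrix inverse is used only to mirror the notation of the proposition.

```python
import numpy as np

def dirac_delta_portfolio(mu, sigma, alpha):
    """Weights of (5.18), with A = (Sigma + mu mu')^{-1}."""
    ones = np.ones_like(mu)
    A = np.linalg.inv(sigma + np.outer(mu, mu))
    multiplier = (alpha * (ones @ A @ mu) - 1.0) / (ones @ A @ ones)
    return A @ (alpha * mu - multiplier * ones)

def alpha_for_target_mean(mu, sigma, mu0):
    """alpha = g(mu0; a, b) from (5.19), so that w_alpha has mean return mu0."""
    ones = np.ones_like(mu)
    A = np.linalg.inv(sigma + np.outer(mu, mu))
    p, q, r = mu @ A @ mu, ones @ A @ mu, ones @ A @ ones
    a = r / (p * r - q ** 2)
    b = -q / (p * r - q ** 2)
    return a * mu0 + b

# Setting of Example 5.3 (inputs as in Example 5.1).
mu = np.array([0.02, 0.07, 0.15])
vol = np.array([0.05, 0.15, 0.30])
corr = np.array([[1.0, 0.0, 0.3],
                 [0.0, 1.0, 0.6],
                 [0.3, 0.6, 1.0]])
sigma = corr * np.outer(vol, vol)

w_alpha = dirac_delta_portfolio(mu, sigma, alpha=0.10)
w_mu0 = dirac_delta_portfolio(mu, sigma, alpha_for_target_mean(mu, sigma, 0.10))

for name, w in [("w_alpha", w_alpha), ("w_mu0", w_mu0)]:
    print(name, np.round(w, 4), "mean", round(float(w @ mu), 4),
          "vol", round(float(np.sqrt(w @ sigma @ w)), 4))
```

Setting $\alpha$ directly to 0.10 gives the second row of Table 5.2, whereas passing $\alpha = g(0.10; a, b)$ recovers the mean-variance portfolio with a 10% mean return, as stated in part (ii) of Theorem 5.3.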

5.5 A reference-portfolio approach

In sections 5.3 and 5.4, we studied the minimum-divergence portfolio when choosing as target distribution the generalized-normal or the simple Gaussian. The generalized-normal distribution restricts the universe of distributions that the investor can choose from. However, the investor still needs to fix the parameters of this target distribution and, in particular, the location and scale parameters (α, β).

To guide this choice, we introduce a reference-portfolio approach: (α, β) are set such that the target-return mean and volatility (µT , σT ) match those of a given reference portfolio, denoted by w0. This approach ensures that there exists a portfolio w leading to a return P = P (w) with such moments, suggesting that the target-return density is reasonable given the investment set.

In particular, the set of pairs (µT , σT ) defining the mean-variance efficient frontier are worth considering, for two reasons.

1. It ensures that no other target-return distribution can dominate the chosen one in terms of mean and variance.

2. The mean-variance-efficient portfolio is unique, which ensures that there is a unique portfolio whose mean and volatility are given by $(\mu_T, \sigma_T)$, in contrast with pairs $(\mu_T, \sigma_T)$ located outside of the efficient frontier.

In essence, by searching for the portfolio whose return distribution is as close as possible to a generalized-normal distribution of which the mean and variance match those of the reference portfolio, we wish to obtain a minimum-divergence portfolio that inherits the appealing mean-variance trade-off of the reference portfolio, combined with improved higher moments reflecting the target return chosen. Recall that, even when considering a Gaussian target return, there is no reason for the minimum-divergence portfolio to agree with the reference portfolio $w_0$ because of the entropy term in (5.13), unless the asset returns are Gaussian, in which case $H(P^\star) = H[\mathcal{N}(0, 1)]$ is constant.

Mathematically, we select a reference portfolio $w_0$ and find the parameters $(\alpha, \beta)$ of the target return matching the return mean $w_0'\mu$ and variance $w_0'\Sigma w_0$ of the reference portfolio; that is,

$$\left(\alpha, \beta^2\right) \leftarrow \left(w_0'\mu,\; 4^{-1/\gamma}\,\frac{\Gamma(1/\gamma)}{\Gamma(3/\gamma)}\,w_0'\Sigma w_0\right). \tag{5.20}$$

The next example illustrates the proposed reference-portfolio approach.

Example 5.4. Consider N = 3 Student t asset returns with means of (0.02, 0.07, 0.15), volatilities of (0.10, 0.20, 0.30) and excess kurtosis of (8, 4, 0). We consider as reference portfolio $w_0$ the minimum-variance portfolio given by

$$w_0 = w_{MV} = (0.74, 0.18, 0.08)'.$$

Its return mean, volatility and excess kurtosis equal 3.85%, 8.68% and 3.05, respectively. We consider a generalized-normal target-return distribution $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$ whose first two moments match those of $w_{MV}$. We consider several values of $\gamma$ corresponding to a target-return excess kurtosis $\kappa_T \in \{0, -0.2, -0.5, -0.8, -1, -1.1\}$, which improves upon the excess kurtosis of the minimum-variance portfolio. We observe that the lower the kurtosis of the target return, the more the minimum-divergence portfolio invests in the third asset, which has the lowest kurtosis: $w_3 = 0.08$ when $\kappa_T = 0$ versus $w_3 = 0.30$ when $\kappa_T = -1.1$. Figure 5.6 compares the mean-variance and mean-excess-kurtosis trade-offs of the minimum-variance and minimum-divergence portfolios, $w_{MV}$ and $w_{\alpha,\beta,\gamma}$. As one can observe, decreasing $\kappa_T$ decreases the kurtosis of the minimum-divergence portfolio as well, at the cost of an increase in variance. Thus, the lower $\kappa_T$, the more $w_{\alpha,\beta,\gamma}$ departs from $w_0 = w_{MV}$. Note finally that because the asset returns are not Gaussian, Theorem 5.1 does not apply and the minimum-divergence portfolios $w_{\alpha,\beta,\gamma}$ are not mean-variance efficient, though they remain close to the mean-variance efficient frontier.
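The calibration in (5.20) and the link between the shape parameter $\gamma$ and the target excess kurtosis can be sketched as follows. The excess-kurtosis expression $\Gamma(1/\gamma)\Gamma(5/\gamma)/\Gamma(3/\gamma)^2 - 3$ is the one used for $\gamma = 4$ in Section 5.7.1. The bisection routine and all function names are illustrative assumptions, not the procedure used in the thesis.

```python
import math

def gn_scale_squared(variance, gamma):
    """beta^2 in (5.20): 4^(-1/gamma) * Gamma(1/gamma) / Gamma(3/gamma) * variance."""
    ratio = math.gamma(1.0 / gamma) / math.gamma(3.0 / gamma)
    return 4.0 ** (-1.0 / gamma) * ratio * variance

def gn_excess_kurtosis(gamma):
    """Excess kurtosis of a generalized-normal distribution with shape gamma."""
    g1, g3, g5 = (math.gamma(k / gamma) for k in (1.0, 3.0, 5.0))
    return g1 * g5 / g3 ** 2 - 3.0

def gamma_for_kurtosis(target, lo=2.0, hi=200.0, tol=1e-10):
    """Bisection for the shape in [lo, hi] whose excess kurtosis equals `target`.

    Assumes target lies between the kurtosis at `hi` (close to -1.2) and 0,
    and relies on the kurtosis decreasing as gamma increases.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gn_excess_kurtosis(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Reference moments of Example 5.4: mean 3.85%, volatility 8.68%.
alpha, variance = 0.0385, 0.0868 ** 2
for kappa_T in (-0.2, -0.5, -0.8, -1.0, -1.1):
    gamma = gamma_for_kurtosis(kappa_T)
    beta = math.sqrt(gn_scale_squared(variance, gamma))
    print(f"kappa_T={kappa_T:+.1f}  gamma={gamma:.3f}  alpha={alpha:.4f}  beta={beta:.4f}")
```

For $\gamma = 2$ the scale reduces to the reference volatility itself, which is the Gaussian case of Section 5.4.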

Figure 5.6: Illustration of the reference-portfolio approach

[Figure: two panels. Panel (a), "Mean-variance trade-off", plots $\mu_P$ against $\sigma_P$. Panel (b), "Mean-excess-kurtosis trade-off", plots $\mu_P$ against $\kappa_P$. Each panel shows $w_{MV}$ and the minimum-divergence portfolios for $\kappa_T = 0, -0.2, -0.5, -0.8, -1, -1.1$.]

Notes. The figures depict the results for the illustration of the reference-portfolio approach in Example 5.4. We consider N = 3 Student t asset returns with means of (0.02, 0.07, 0.15), volatilities of (0.10, 0.20, 0.30) and excess kurtosis of (8, 4, 0). We take as reference portfolio the minimum-variance portfolio: $w_0 = w_{MV} = (0.74, 0.18, 0.08)'$. We consider a generalized-normal target-return distribution, $T \sim \mathcal{GN}(\alpha, \beta, \gamma)$, where $(\alpha, \beta)$ match the mean and variance of $w_{MV}$, plotted as a black circle. We take several values of $\gamma$ corresponding to a target-return excess kurtosis $\kappa_T \in \{0, -0.2, -0.5, -0.8, -1, -1.1\}$. The figure compares the mean-variance and mean-excess-kurtosis trade-offs of the minimum-divergence and reference portfolios, $w_{\alpha,\beta,\gamma}$ and $w_0$, as a function of $\kappa_T$. The blue curves show the trade-off for the mean-variance portfolios $w_{\mu_0}$ for $\mu_0 \in [0, 0.15]$. The lower the $\kappa_T$, the more $w_{\alpha,\beta,\gamma}$ departs from $w_0 = w_{MV}$.

5.6 Estimation of the minimum-divergence portfolio

In this section, we discuss how one can estimate the minimum-divergence portfolio $w_{\alpha,\beta,\gamma}$ defined in Proposition 5.1 given a sample of historical asset returns $X_{i,t}$, with $i = 1, \dots, N$ and $t = 1, \dots, T$. Note that the sample size $T$ must not be confused with the target return $T$. We denote by $X_t$ the asset-return vector at time $t$ and, for a given portfolio-weight vector $w$, the portfolio return at time $t$ is denoted by $P_t = w'X_t$.

5.6.1 Estimation for the generalized-normal target return

When targeting a generalized-normal distribution T ∼ GN (α, β, γ), the objective function to maximize, C(w; α, β, γ), is given by (5.7). Given an asset-return sample Xi,t, one maximizes an estimate of the form

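$$\hat{C}(w; \gamma) := \hat{H}(P) - \frac{1}{2\hat{\beta}^\gamma}\,\hat{\mathbb{E}}\left[|P - \hat{\alpha}|^\gamma\right], \tag{5.21}$$

where $\hat{\alpha}$ and $\hat{\beta}$ are estimated from (5.20) via a given reference portfolio $w_0$. The latter must itself be estimated from the asset returns:

$$\hat{\alpha} \leftarrow \hat{w}_0'\hat{\mu} \quad \text{and} \quad \hat{\beta}^2 \leftarrow 4^{-1/\gamma}\,\frac{\Gamma(1/\gamma)}{\Gamma(3/\gamma)}\,\hat{w}_0'\hat{\Sigma}\hat{w}_0, \tag{5.22}$$

where $\hat{\mu}$ and $\hat{\Sigma}$ are the sample estimators of the mean and covariance matrix of asset returns in (1.13)–(1.14).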

Two elements remain to be estimated in $\hat{C}(w; \gamma)$: the portfolio-return Shannon entropy $H(P)$ and the expectation $\mathbb{E}[|P - \hat{\alpha}|^\gamma]$. This is the purpose of the next two sections.

5.6.2 Estimation of the portfolio-return entropy H(P )

As Shannon-entropy estimator, we rely on the m-spacings estimator of van Es (1992) and Learned-Miller and Fisher (2003) discussed in Section 2.5.1. As recalled from (2.23), it is given by

$$\hat{H}(m, T) := \frac{1}{T - m}\sum_{t=1}^{T-m} \ln\left(\frac{T + 1}{m}\left(P^{(t+m:T)} - P^{(t:T)}\right)\right), \tag{5.23}$$

where $P^{(1:T)} \leq P^{(2:T)} \leq \cdots \leq P^{(T:T)}$ are the portfolio-return order statistics. As in Section 2.6.1, we select $m = T^{2/3}$ to ensure a satisfying robustness of the estimator.
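A direct transcription of (5.23) takes only a few lines. The sketch below uses only the standard library; rounding $T^{2/3}$ to the nearest integer is an assumption, since the text does not say how a non-integer $m$ is handled, and the function name is hypothetical. Tied observations would produce a zero spacing and a failing logarithm, which this sketch does not guard against.

```python
import math

def m_spacings_entropy(sample, m=None):
    """m-spacings estimate of Shannon entropy, following (5.23)."""
    ordered = sorted(sample)
    T = len(ordered)
    if m is None:
        m = max(1, round(T ** (2.0 / 3.0)))  # assumed rounding of T^(2/3)
    total = 0.0
    for t in range(T - m):
        spacing = ordered[t + m] - ordered[t]
        total += math.log((T + 1) / m * spacing)
    return total / (T - m)

# Equally spaced points on [0, 1]: the estimate is close to 0,
# the entropy of the uniform distribution on the unit interval.
grid = [i / 100 for i in range(101)]
print(m_spacings_entropy(grid))
```

Because the estimator depends on the sample only through spacings, rescaling the returns by a constant $c > 0$ shifts the estimate by exactly $\ln c$, which is the property behind $H(P^\star) = H(P) - \ln \sigma_P$.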

5.6.3 Estimation of the expectation $\mathbb{E}[|P - \hat{\alpha}|^\gamma]$

Computing the expectation $\mathbb{E}[|P - \hat{\alpha}|^\gamma]$ requires the knowledge of the density $f_P$, which is unknown and thus must be estimated. To that aim, we rely on a Gaussian-mixture estimator of $f_P$, a common approach that has the advantage of leading to an analytical expression for the estimator $\hat{\mathbb{E}}[|P - \hat{\alpha}|^\gamma]$ linked to the confluent hypergeometric function ${}_1F_1$ in (5.11).

Finite-mixture distributions are one of the most common approaches in density estimation; see Everitt and Hand (1981). They provide an extension of the popular kernel density estimator (Parzen 1962) that, under mild assumptions, can be shown to converge at a faster rate than any other non-parametric density estimator (Wahba 1975). We rely on Gaussian kernels as mentioned above. This is not a major restriction given that the choice of kernel function is of minor importance with regards to the mean integrated squared error (Wand and Jones 1995).

Definition 5.4 (Gaussian-mixture density estimator). The Gaussian-mixture estimator of the portfolio-return density $f_P$ is given by

$$\hat{f}_P^{GME}(x) := \sum_{k=1}^{K} \frac{\pi_k}{h_k}\,\phi\!\left(\frac{x - m_k}{h_k}\right), \tag{5.24}$$

where $\pi_k \in [0, 1]$ for all $k$ and $\sum_{k=1}^{K} \pi_k = 1$. As a special case, the kernel density estimator (see footnote 3) is obtained for $\pi_k = 1/K$ for all $k$, $K = T$, and kernels centered on the observations $P_t$:

$$\hat{f}_P^{KDE}(x) := \frac{1}{T}\sum_{t=1}^{T} \frac{1}{h_t}\,\phi\!\left(\frac{x - P_t}{h_t}\right). \tag{5.25}$$

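Using the estimator in (5.24), the expectation $\mathbb{E}[|P - \hat{\alpha}|^\gamma]$ is estimated as

$$\hat{\mathbb{E}}\left[|P - \hat{\alpha}|^\gamma\right] := \int_{-\infty}^{+\infty} \hat{f}_P^{GME}(x)\,|x - \hat{\alpha}|^\gamma\,dx. \tag{5.26}$$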

As discussed above, this integral can be evaluated analytically.

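Proposition 5.4. $\hat{\mathbb{E}}[|P - \hat{\alpha}|^\gamma]$ in (5.26) is given by

$$\hat{\mathbb{E}}\left[|P - \hat{\alpha}|^\gamma\right] = \frac{2^{\gamma/2}\,\Gamma\!\left(\frac{\gamma + 1}{2}\right)}{\pi^{1/2}} \sum_{k=1}^{K} \pi_k h_k^\gamma\; {}_1F_1\!\left(-\frac{1}{2}\left(\frac{m_k - \hat{\alpha}}{h_k}\right)^2;\, -\frac{\gamma}{2},\, \frac{1}{2}\right). \tag{5.27}$$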
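The analytical expression can be evaluated without special libraries. The sketch below is one possible implementation under stated assumptions: it reads ${}_1F_1(x; a, b)$ as the Kummer function with argument $x$ and parameters $a$ and $b$ (the convention under which ${}_1F_1(x; -1, 1/2) = 1 - 2x$), and it evaluates the function through Kummer's transformation so that every series term is positive. The function names are hypothetical, and the series is only suitable for moderate standardized distances $|m_k - \hat{\alpha}|/h_k$; a library routine such as `scipy.special.hyp1f1` would be a natural substitute.

```python
import math

def kummer_nonpositive(a, b, z, max_terms=500):
    """Kummer function M(a; b; z) for z <= 0.

    Uses M(a; b; z) = exp(z) * M(b - a; b; -z). For b > a, b > 0 and z <= 0
    all terms of the transformed series are positive, avoiding cancellation.
    """
    x, term, total = -z, 1.0, 1.0
    for n in range(max_terms):
        term *= (b - a + n) * x / ((b + n) * (n + 1))
        total += term
        if term < 1e-17 * total:
            break
    return math.exp(z) * total

def mixture_abs_moment(weights, means, bandwidths, alpha, gamma):
    """Estimate of E|P - alpha|^gamma under a Gaussian mixture, as in (5.27)."""
    prefactor = (2.0 ** (gamma / 2.0) * math.gamma((gamma + 1.0) / 2.0)
                 / math.sqrt(math.pi))
    total = 0.0
    for pi_k, m_k, h_k in zip(weights, means, bandwidths):
        z = -0.5 * ((m_k - alpha) / h_k) ** 2
        total += pi_k * h_k ** gamma * kummer_nonpositive(-gamma / 2.0, 0.5, z)
    return prefactor * total

# Two-component illustration with made-up mixture parameters.
weights, means, bandwidths = [0.3, 0.7], [0.01, -0.02], [0.05, 0.10]
print(mixture_abs_moment(weights, means, bandwidths, alpha=0.005, gamma=2.0))
print(mixture_abs_moment(weights, means, bandwidths, alpha=0.005, gamma=4.0))
```

For $\gamma = 2$ the result equals $\sum_k \pi_k \left(h_k^2 + (m_k - \hat{\alpha})^2\right)$, and for $\gamma = 4$ it equals $\sum_k \pi_k \left(d_k^4 + 6 d_k^2 h_k^2 + 3 h_k^4\right)$ with $d_k = m_k - \hat{\alpha}$, which gives a convenient check of the implementation.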

In the empirical study, we rely on the kernel density estimator $\hat{f}_P^{KDE}$ in (5.25) with a constant bandwidth, $h_t = h$ for all $t$. We compute this bandwidth based on the robust ratio-quantile-ranges approach of Zhang and Wang (2008); see Appendix 5.9.2 for more details. Note that the bandwidth is set based on the equally weighted portfolio $w_{EW} = \mathbf{1}/N$ to avoid a dependence on the portfolio weights of the form $h = h(w)$, which would unpredictably affect the minimum-divergence portfolio.

3 For kernel density estimation, the bandwidth is commonly taken as constant over the sample points, that is $h_t = h$ for all $t$, but varying bandwidths are also a popular approach introduced by Terrell and Scott (1992).

5.6.4 Estimation for the Gaussian target return

The counterpart of the expectation estimator in Proposition 5.4 for the Gaussian target return (γ = 2) is immediate given the property 1F1(x; −1, 1/2) = 1 − 2x. Interestingly, when using the kernel density estimator in (5.25) and a bandwidth function independent of the portfolio weights w, which corresponds to the setting used in the empirical study, the objective function no longer depends on the bandwidth function chosen. This is shown in the corollary below.

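Corollary 5.1. Let $\gamma = 2$, $\hat{f}_P = \hat{f}_P^{KDE}$ and $h_t$ be independent from the portfolio weights $w$. Then, the objective function $\hat{C}(w) := \hat{C}(w; 2)$ reduces to

$$\hat{C}(w) = \hat{H}(m, T) - \frac{(\hat{\mu}_P - \hat{\alpha})^2 + \hat{\sigma}_P^2}{2\hat{\beta}^2}, \tag{5.28}$$

where $\hat{\mu}_P := w'\hat{\mu}$ and $\hat{\sigma}_P^2 := w'\hat{\Sigma}w$.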
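The reduced objective in (5.28) can be sketched as a small function of the portfolio-return sample. This is an illustration under assumptions: the entropy estimator is passed in as a callable (hypothetical parameter `entropy_estimator`), and the variance uses the $T - 1$ divisor, which may differ from the convention in (1.14).

```python
import math
import statistics

def gaussian_target_objective(portfolio_returns, alpha_hat, beta_hat,
                              entropy_estimator):
    """Value of the Gaussian-target objective in (5.28) for one weight vector.

    `portfolio_returns` is the series P_t = w'X_t, so its sample mean and
    variance equal w'mu_hat and w'Sigma_hat w.
    """
    mu_p = statistics.fmean(portfolio_returns)
    var_p = statistics.variance(portfolio_returns)  # T - 1 divisor (assumed)
    penalty = ((mu_p - alpha_hat) ** 2 + var_p) / (2.0 * beta_hat ** 2)
    return entropy_estimator(portfolio_returns) - penalty

# Made-up return series standing in for the reference portfolio.
reference_returns = [0.010, -0.020, 0.030, 0.000, 0.015]
alpha_hat = statistics.fmean(reference_returns)
beta_hat = math.sqrt(statistics.variance(reference_returns))

# Placeholder entropy value; in practice this would be an estimator like (5.23).
value = gaussian_target_objective(reference_returns, alpha_hat, beta_hat,
                                  entropy_estimator=lambda sample: 1.40)
print(value)
```

When the weights equal those of the reference portfolio, the moment penalty equals exactly one half, so any difference in objective value between candidate portfolios with matching moments comes from the entropy term alone.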

5.7 Out-of-sample performance

We now proceed with an assessment of the out-of-sample performance of the minimum- divergence portfolio strategy. In Section 5.7.1, we discuss the data and methodology employed. In Section 5.7.2, we explain which portfolio strategies we report in this section. In Section 5.7.3, we report and discuss the results. A discussion on portfolio-weight diversification is provided in Appendix 5.9.4.

5.7.1 Data and methodology

All the datasets and portfolio strategies considered in the empirical study, along with their abbreviations, are listed in tables 5.3 and 5.4, respectively. As explained in Section 5.7.2, we do not report the out-of-sample performance of all considered portfolio strategies in this section. The full results are reported in Appendix 5.9.3.

Datasets. We consider six equity-return datasets from Kenneth French’s library. We download value-weighted daily arithmetic returns from 1978 to 2018.

Table 5.3: List of datasets considered in the empirical study of Chapter 5

Datasets                                                                      Abb.     N    Time period
6 Fama-French portfolios of firms sorted by size and book-to-market           6BTM     6    1978-2018
25 Fama-French portfolios of firms sorted by size and book-to-market          25BTM    25   1978-2018
6 Fama-French portfolios of firms sorted by size and operating profitability  6Prof    6    1978-2018
25 Fama-French portfolios of firms sorted by size and operating profitability 25Prof   25   1978-2018
10 industry portfolios representing the US stock market                       10Ind    10   1978-2018
30 industry portfolios representing the US stock market                       30Ind    30   1978-2018

Notes. The table reports the six datasets that we use in the empirical study of Section 5.7. All datasets are made of daily value-weighted arithmetic returns. Source: Kenneth French library. The time period goes precisely from 01/1978 to 12/2018.

Minimum-divergence portfolios. The proposed portfolio strategy is the minimum-divergence portfolio based on a generalized-normal target return. We rely on two values for $\gamma$. First, $\gamma = 2$, corresponding to the Gaussian target return studied in Section 5.4. Second, $\gamma = 4$, which corresponds to thinner tails than the Gaussian as it gives a target-return excess kurtosis of $\kappa_T = \Gamma(1/4)\Gamma(5/4)/\Gamma(3/4)^2 - 3 = -0.81$.

Reference portfolios. As proposed in Section 5.5, we fix the mean and variance of the target-return distribution via mean-variance-efficient reference portfolios $w_0$. We rely on five estimators $\hat{w}_0$ of minimum-variance or mean-variance portfolios that have been shown to provide robust and satisfying out-of-sample mean-variance trade-offs; see Section 1.1.3. Among the minimum-variance portfolios, we rely on

(i) the sample estimator of the covariance matrix;

(ii) the shrinkage estimator of the covariance matrix developed by Ledoit and Wolf (2004a) using as target covariance matrix a multiple of the identity matrix;

(iii) the 2-norm-constrained portfolio of DeMiguel et al. (2009a) with the 2-norm threshold set via 10-fold cross-validation using variance as calibration criterion (see footnote 4).

Among the mean-variance portfolios, we rely on

(iv) the Bayes-Stein estimator of the mean-return vector developed by Jorion (1986);

(v) the combination of the sample mean-variance and minimum-variance portfolios of Kan and Zhou (2007) minimizing the out-of-sample expected-quadratic-utility loss (see footnote 5).

4 DeMiguel et al. (2009a) propose several other versions of norm-constrained minimum-variance portfolios. We restrict to the 2-norm-constrained portfolio because Kourtis et al. (2012) find, for similar datasets, that the 2-norm portfolio dominates the 1-norm and A-norm portfolios in terms of Sharpe ratio.

5 As explained in Section 1.1.3, the original portfolio of Kan and Zhou (2007) is a three-fund rule that also allows for investment in the risk-free asset. Because we rely solely on risky assets here, we rely on the normalized Kan–Zhou estimator where the portfolio weights sum to one, as proposed by Frahm and Memmel (2010).

Table 5.4: List of portfolio strategies considered in the empirical study of Chapter 5

Abbreviation   Description

Reference portfolios from the literature
MV             Sample minimum-variance portfolio
LW             Minimum-variance portfolio based on the shrinkage estimator of the covariance matrix of Ledoit and Wolf (2004a)
MV2n           2-norm-constrained minimum-variance portfolio of DeMiguel et al. (2009a) using variance as 10-fold cross-validation criterion
MSR            Maximum-Sharpe-ratio (tangent) portfolio based on the Bayes-Stein estimator of the mean-return vector of Jorion (1986)
KZ             Combination of sample minimum-variance and tangent portfolios of Kan and Zhou (2007)

Higher-moment portfolios from the literature
CRRAsam        Maximum-CRRA-utility portfolio with a fourth-order Taylor series expansion as in Martellini and Ziemann (2010) using sample moment estimators
CRRArob        Maximum-CRRA-utility portfolio with a fourth-order Taylor series expansion as in Martellini and Ziemann (2010) using shrinkage moment estimators
MVKsam         Mean-variance-kurtosis portfolio based on Briec et al. (2007) using an equally weighted benchmark portfolio and sample moment estimators
MVKrob         Mean-variance-kurtosis portfolio based on Briec et al. (2007) using an equally weighted benchmark portfolio and shrinkage moment estimators
MVaRsam        Minimum-VaR portfolio based on the modified VaR of Favre and Galeano (2002) using sample moment estimators
MVaRrob        Minimum-VaR portfolio based on the modified VaR of Favre and Galeano (2002) using shrinkage moment estimators

Minimum-divergence portfolios developed in this chapter
G2MV           Gaussian target return (γ = 2) and MV as reference portfolio
G4MV           Generalized-normal target return (γ = 4) and MV as reference portfolio
G2LW           Gaussian target return (γ = 2) and LW as reference portfolio
G4LW           Generalized-normal target return (γ = 4) and LW as reference portfolio
G2MV2n         Gaussian target return (γ = 2) and MV2n as reference portfolio
G4MV2n         Generalized-normal target return (γ = 4) and MV2n as reference portfolio
G2MSR          Gaussian target return (γ = 2) and MSR as reference portfolio
G4MSR          Generalized-normal target return (γ = 4) and MSR as reference portfolio
G2KZ           Gaussian target return (γ = 2) and KZ as reference portfolio
G4KZ           Generalized-normal target return (γ = 4) and KZ as reference portfolio

Notes. The table reports all the portfolio strategies considered in the empirical study of Section 5.7. Details regarding their definition and computation are provided in Section 5.7.1.

Higher-moment portfolios. Our minimum-divergence portfolio strategy aims to improve upon the above reference portfolios in terms of higher moments. Therefore, we compare our approach with other existing higher-moment portfolio strategies. We consider three strategies, listed in Table 5.4 and discussed in Section 1.2.

1. The first strategy is the maximum-CRRA-utility portfolio of Martellini and Ziemann (2010) using the Taylor series expansion in (1.43). As proposed by Martellini and Ziemann (2010), we assume a constant mean-return vector $\mu$ to neutralize its large estimation risk. Thus, we compute the portfolio solving

$$\min_{w \in \mathcal{W}} \; \frac{\lambda}{2}\,w'\Sigma w - \frac{\lambda(\lambda + 1)}{6}\,w'M_3(w \otimes w) + \frac{\lambda(\lambda + 1)(\lambda + 2)}{24}\,w'M_4(w \otimes w \otimes w). \tag{5.29}$$

We consider three arbitrary values for the risk-aversion coefficient: $\lambda \in \{10, 50, 200\}$.

2. The second strategy is the one of Briec et al. (2007) in (1.44), which looks for the portfolio that gives the best improvements in moments relative to a benchmark portfolio $w_b$. Briec et al. (2007) focus on the third moment, but their approach can be extended to the fourth moment as in Jurczenko et al. (2006). We focus on the fourth moment, which is more consistent with our choice of a symmetric target-return distribution (see footnote 6). Thus, given a benchmark portfolio $w_b$, we solve

$$\begin{aligned} \max_{\delta \in [0,1],\, w \in \mathcal{W}} \quad & \delta \\ \text{subject to} \quad & w'\mu \geq w_b'\mu, \\ & w'\Sigma w \leq w_b'\Sigma w_b\,(1 - \delta), \\ & w'M_4(X)(w \otimes w \otimes w) \leq w_b'M_4(X)(w_b \otimes w_b \otimes w_b)\,(1 - \delta). \end{aligned} \tag{5.30}$$

This portfolio is identical to $w_b$ if $w_b$ is mean-variance efficient to begin with, and thus, the choice of $w_b$ is less flexible than the choice of reference portfolio $w_0$ in our approach. As in Boudt et al. (2018), we take the equally weighted portfolio as benchmark: $w_b = \mathbf{1}/N$.

3. The third strategy is the minimum-MVaR portfolio of Favre and Galeano (2002), considered as well in Chapters 3 and 4 (see footnote 7):

$$\min_{w \in \mathcal{W}} \; \mathrm{MVaR}_P(\varepsilon), \tag{5.31}$$

where $\mathrm{MVaR}_P(\varepsilon)$ is defined in (1.48). As for the maximum-CRRA-utility portfolio above, we ignore the portfolio-mean-return component. We select $\varepsilon = 1\%$ as in Chapters 3 and 4.

Estimating these three higher-moment portfolios requires estimating the asset-return covariance, coskewness and cokurtosis matrices: $\Sigma$, $M_3(X)$ and $M_4(X)$. We rely on two sets of estimators: sample estimators on the one hand, and shrinkage estimators where the target matrix assumes i.i.d. asset returns on the other hand. For the covariance matrix, this corresponds to the shrinkage estimator of Ledoit and Wolf (2004a). Analogs for the coskewness and cokurtosis matrices are available in the R package PerformanceAnalytics based on the works of Martellini and Ziemann (2010) and Boudt et al. (2018). For the mean-variance-kurtosis portfolio of Briec et al. (2007) in (5.30), we estimate the mean-return vector $\mu$ with the sample estimator, as in Boudt et al. (2018).

6 The conclusions obtained with the original mean-variance-skewness portfolio of Briec et al. (2007) in (1.44) are consistent with those of the mean-variance-kurtosis formulation that we consider in (5.30).

7 We have also considered the minimum-CVaR portfolio of Alexander and Baptista (2004) using the modified CVaR of Boudt et al. (2008). We do not report these results because it performs consistently less well than the minimum-MVaR portfolio, in line with Lim et al. (2011).

Optimization. For the reference portfolios, only the 2-norm-constrained minimum-variance portfolio is not available in closed-form, and is solved via the Matlab fmincon local optimizer. Regarding the three higher-moment portfolios, we solve the optimization programs in (5.29)–(5.30)–(5.31) with fmincon, using the GlobalSearch option for (5.29) and (5.31) that are non-convex problems. Finally, because the KL divergence is not a convex function of the portfolio weights, we also solve the minimum-divergence portfolios via GlobalSearch.

Rebalancing methodology. As in the previous chapters, we evaluate the out-of-sample performance of the different portfolios via a rolling-horizon methodology. We use a rolling window of six months and an estimation window of five years (sample size T = 1260 days). Thus, we obtain 36 years of out-of-sample daily portfolio returns.
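The rolling-horizon scheme can be sketched as an index generator. The sketch assumes 252 trading days per year, so that the five-year estimation window is 1260 days and the six-month window is read as a rebalancing step of 126 days; the function name and the end-exclusive index convention are illustrative choices.

```python
def rolling_windows(n_obs, estimation_window=1260, step=126):
    """Yield (est_start, est_end, hold_end) index triples, end-exclusive.

    Returns in [est_start, est_end) are used to estimate the portfolio,
    which is then held over the out-of-sample days [est_end, hold_end).
    """
    start = 0
    while start + estimation_window < n_obs:
        est_end = start + estimation_window
        hold_end = min(est_end + step, n_obs)
        yield start, est_end, hold_end
        start += step

# 41 years of daily data (1978-2018) under the 252-day convention.
windows = list(rolling_windows(41 * 252))
out_of_sample_days = sum(hold_end - est_end for _, est_end, hold_end in windows)
print(len(windows), "rebalancing dates,", out_of_sample_days / 252, "out-of-sample years")
```

Under these conventions, the first five years are consumed by the initial estimation window and the remaining 36 years form the out-of-sample period.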

Performance criteria. We compare the out-of-sample performance of the different portfolios in terms of mean-variance trade-off, tail risk and portfolio-weight stability using the same criteria as in Chapters 3 and 4. To test the statistical significance of the mean-VaR trade-off improvement allowed by our minimum-divergence portfolios compared to the reference portfolios, we report two-sided p-values of the difference in modified Sharpe ratio as in Ardia and Boudt (2015b) (see Section 3.6.1). The stars ⋆, ⋆⋆ and ⋆⋆⋆ mean that the two-sided p-value is less than 20%, 10% and 5%, respectively.

5.7.2 Reported portfolio strategies

The full results related to all portfolio strategies in Table 5.4 are reported in Appendix 5.9.3 of the e-companion. We now explain which of those portfolio strategies we report in this section.

Concerning the minimum-divergence portfolios, we only report the results for γ = 2 because γ = 4 yields qualitatively similar conclusions.

Concerning the reference portfolios, we only report the results for MV and MSR. The two other minimum-variance strategies, LW and MV2n, perform similarly to MV. The other maximum-Sharpe-ratio strategy, KZ, performs similarly to MSR.

Concerning the higher-moment portfolios, we find that the minimum-modified-VaR portfolio is the best higher-moment strategy, and thus, we only report the results for this strategy in this section. In particular, the maximum-CRRA portfolio remains quite close to the minimum-variance portfolio, and the mean-variance-kurtosis portfolio of Briec et al. (2007) achieves a lesser improvement in kurtosis and modified VaR compared to the minimum-divergence portfolios. Also, we only report the results for the sample estimator, MVaRsam, as it outperforms the robust estimator, MVaRrob. In general, the only benefit of robust estimators over sample ones is that they achieve a lower turnover, in line with Olivares-Nadal and DeMiguel (2018).

5.7.3 Results

We summarize in Tables 5.5 and 5.6 the out-of-sample performance of the portfolio policies selected in Section 5.7.2 across the six datasets listed in Table 5.3. We divide the discussion into two parts. First, we discuss the performance over the full sample (Table 5.5). Second, we discuss the performance during the 2007-2009 financial crisis (Table 5.6).

5.7.3.1 Full sample

Comparison with the reference portfolios. We begin with a comparison between the Gaussian-based minimum-divergence portfolios (G2MV, G2MSR) and the reference portfolios based on which the mean and variance of the target-return distribution are fixed (MV, MSR).

First, let us look at the mean-variance trade-off. As desired, the portfolio-return mean and volatility of the minimum-divergence portfolio are close to those of the reference portfolio. The out-of-sample mean return attained by the minimum-divergence portfolio can either be higher or lower than the one of the reference portfolio without any clear trend. The out-of-sample volatility attained is always larger when MV is taken as reference portfolio, which is natural because it aims to minimize the portfolio-return variance. In contrast, the volatility is frequently lower than the one of the MSR reference portfolio that also aims to maximize the portfolio mean return. In terms of Sharpe ratio, the table shows that the minimum-divergence portfolios perform similarly to the reference portfolios. These results show that the minimum-divergence portfolio indeed inherits from the mean-variance trade-off of the reference portfolio, supporting the soundness of the reference-portfolio approach proposed in Section 5.5.

Table 5.5: Out-of-sample performance during the full sample

| Criterion | MV | G2MV | MSR | G2MSR | MVaRsam |
|---|---|---|---|---|---|
| 6BTM dataset | | | | | |
| Mean | 19.30% | 22.04% | 19.48% | 22.31% | 22.91% |
| Volatility | 13.40% | 14.78% | 20.92% | 18.79% | 15.20% |
| Sharpe ratio | 1.44 | 1.49 | 0.93 | 1.19 | 1.51 |
| Skewness | -0.36 | 0.05 | -0.37 | 0.40 | -0.04 |
| Excess kurtosis | 19.61 | 9.40 | 24.72 | 14.66 | 8.68 |
| 1% modified VaR | 5.94% | 4.09% | 10.90% | 6.31% | 4.11% |
| 1% modified SR | 1.29 | 2.14*** | 0.71 | 1.40*** | 2.21 |
| Turnover | 2.46% | 5.01% | 12.10% | 8.55% | 5.64% |
| 25BTM dataset | | | | | |
| Mean | 20.27% | 22.09% | 26.23% | 24.00% | 22.51% |
| Volatility | 11.09% | 11.79% | 16.36% | 15.49% | 12.11% |
| Sharpe ratio | 1.83 | 1.87 | 1.60 | 1.55 | 1.86 |
| Skewness | -0.53 | -0.35 | -0.08 | 0.16 | -0.40 |
| Excess kurtosis | 16.77 | 10.10 | 23.24 | 10.43 | 11.63 |
| 1% modified VaR | 4.48% | 3.55% | 7.95% | 4.43% | 3.94% |
| 1% modified SR | 1.80 | 2.47** | 1.31 | 2.15*** | 2.27 |
| Turnover | 4.42% | 7.49% | 12.67% | 13.02% | 7.31% |
| 6Prof dataset | | | | | |
| Mean | 17.43% | 20.53% | 22.68% | 25.78% | 21.72% |
| Volatility | 14.04% | 16.11% | 22.08% | 24.68% | 16.31% |
| Sharpe ratio | 1.24 | 1.27 | 1.03 | 1.04 | 1.33 |
| Skewness | -0.08 | -0.01 | -0.10 | 0.01 | -0.08 |
| Excess kurtosis | 22.44 | 16.36 | 12.52 | 14.82 | 15.30 |
| 1% modified VaR | 6.68% | 6.17% | 7.31% | 8.89% | 6.04% |
| 1% modified SR | 1.04 | 1.32*** | 1.23 | 1.15 | 1.43 |
| Turnover | 2.41% | 4.98% | 13.79% | 11.50% | 5.27% |
| 25Prof dataset | | | | | |
| Mean | 16.05% | 17.75% | 26.38% | 21.55% | 18.01% |
| Volatility | 12.11% | 13.20% | 29.41% | 29.04% | 13.20% |
| Sharpe ratio | 1.33 | 1.35 | 0.90 | 0.74 | 1.37 |
| Skewness | -0.39 | -0.33 | 0.84 | 0.26 | -0.31 |
| Excess kurtosis | 24.61 | 14.83 | 45.25 | 29.33 | 17.77 |
| 1% modified VaR | 6.27% | 4.92% | 22.17% | 16.31% | 5.48% |
| 1% modified SR | 1.02 | 1.43** | 0.47 | 0.52 | 1.31 |
| Turnover | 4.53% | 8.20% | 34.25% | 27.46% | 7.76% |
| 10Ind dataset | | | | | |
| Mean | 12.81% | 12.56% | 11.28% | 12.66% | 12.58% |
| Volatility | 12.51% | 14.12% | 22.36% | 21.76% | 14.28% |
| Sharpe ratio | 1.02 | 0.89 | 0.50 | 0.58 | 0.88 |
| Skewness | -0.33 | -0.30 | -1.12 | 1.08 | -0.53 |
| Excess kurtosis | 22.03 | 12.84 | 33.33 | 31.77 | 14.43 |
| 1% modified VaR | 6.00% | 4.85% | 14.70% | 11.62% | 5.33% |
| 1% modified SR | 0.85 | 1.03 | 0.30 | 0.43 | 0.94 |
| Turnover | 1.28% | 3.14% | 6.76% | 8.30% | 3.08% |
| 30Ind dataset | | | | | |
| Mean | 10.95% | 10.37% | 13.39% | 8.81% | 10.02% |
| Volatility | 11.72% | 13.01% | 24.64% | 22.49% | 13.26% |
| Sharpe ratio | 0.93 | 0.80 | 0.54 | 0.39 | 0.76 |
| Skewness | -0.74 | -0.66 | -0.51 | -0.25 | -0.98 |
| Excess kurtosis | 23.30 | 13.00 | 13.24 | 11.65 | 21.53 |
| 1% modified VaR | 5.95% | 4.62% | 8.80% | 7.34% | 6.41% |
| 1% modified SR | 0.73 | 0.89 | 0.60 | 0.48 | 0.62 |
| Turnover | 2.45% | 4.60% | 11.36% | 12.29% | 4.37% |

Notes. The table reports the out-of-sample portfolio performances of the portfolios considered in Section 5.7.2 during the full sample. The datasets and portfolio descriptions are provided in tables 5.3 and 5.4. The methodology is described in Section 5.7.1. We collect daily returns from 1978 to 2018, which, because we use an estimation window of five years, corresponds to an out-of-sample period going from 1983 to 2018. We evaluate the statistical significance of the difference in modified Sharpe ratio of G2MV versus MV and G2MSR versus MSR as in Ardia and Boudt (2015b). The stars *, ** and *** mean that the p-value is less than 20%, 10% and 5%.

Second, with regard to tail risk, we expect the minimum-divergence portfolio to improve upon the reference portfolio. Indeed, in addition to fitting its mean and variance, the minimum-divergence portfolio also aims to maximize the standardized entropy of portfolio returns H[P*] and, consequently, to produce a skewness and excess kurtosis closer to zero; see Theorem 5.2. The out-of-sample figures show that the performance is very satisfactory in this respect. In terms of skewness, we improve upon the reference portfolio in 11 out of 12 categories (two reference portfolios times six datasets). The skewness of the reference portfolios is almost always negative, while that of the minimum-divergence portfolio is positive in 6 out of 12 categories. In terms of kurtosis, the improvement is even more striking. We improve upon the reference portfolio in 11 out of 12 categories, and the average reduction in excess kurtosis is 31% over the 12 categories. As a result of this improvement in skewness and kurtosis, the modified VaR and modified Sharpe ratio almost always favor the minimum-divergence portfolio as well. In particular, the improvement in modified Sharpe ratio is often statistically significant. Overall, the gain in tail risk offered by the minimum-divergence strategy is thus substantial.

Third, let us look at turnover. One feature of minimum-variance portfolios is that, because they do not rely on estimates of asset mean returns, they require only little turnover over time (van Vliet 2018). This is confirmed by our results: the MV portfolio achieves a daily turnover of around 1-2% for the 6BTM, 6Prof and 10Ind datasets, and 2-4% for the 25BTM, 25Prof and 30Ind datasets. In comparison, it is natural that the G2MV portfolio, which also accounts for higher moments via the entropy term, faces a larger portfolio turnover: it is around 3-5% for the 6BTM, 6Prof and 10Ind datasets, and 5-8% for the 25BTM, 25Prof and 30Ind datasets. When looking at the MSR reference portfolio, the conclusion is different. Because it also aims to maximize the portfolio mean return, the MSR portfolio exhibits a larger turnover and, in comparison, the G2MSR portfolio has a turnover that is overall similar, not higher.

Finally, the table shows that the MV and G2MV portfolios largely outperform the MSR and G2MSR portfolios in terms of mean-variance trade-off, tail risk and turnover. This is consistent with the literature that shows that the minimum-variance portfolio often outperforms any other portfolio on the mean-variance efficient frontier out of sample because it does not rely on estimates of asset mean returns; see Section 1.1.2.

Comparison with the minimum-VaR portfolio. We now look at the MVaRsam portfolio, which minimizes the 1% modified VaR of portfolio returns, thus accounting for all four moments. Compared to the G2MV portfolio that performs consistently well in terms of Sharpe ratio and tail risk over the six datasets, it generates a higher Sharpe ratio for the 6BTM, 6Prof and 25Prof datasets, and a higher modified Sharpe ratio for the 6BTM and 6Prof datasets. The portfolio turnovers are very similar. Thus, it can compete with the minimum-divergence approach. However, the objective of minimizing the 1% modified VaR is, interestingly, better realized by our G2MV portfolio than the MVaRsam portfolio: the G2MV portfolio has a lower modified VaR on all datasets except 6Prof.

Overall, the results suggest that the minimum-divergence portfolio is a credible alternative to the mean-variance and higher-moment portfolios proposed in the literature, and one worth considering.

5.7.3.2 Financial crisis

It is important to evaluate how portfolios perform during crisis periods. Indeed, Massacci (2017) shows that tail risk and tail dependence in equity returns increase during periods of turmoil. As the proposed minimum-divergence portfolio strategy aims to improve the tail risk of portfolio returns, it is expected to outperform the MV and MSR reference portfolios during crises as well.

To evaluate this, we report the out-of-sample performance achieved during the global financial crisis, which, according to the “NBER based Recession Indicators for the United States from the Peak through the Trough”, goes from December 2007 to June 2009, a period of 397 trading days.

The results are reported in Table 5.6. We do not report the Sharpe ratio and modified Sharpe ratio performance criteria because they are more difficult to interpret when the portfolio mean return is negative. Let us first look at the MV and G2MV portfolios. The MV portfolio systematically achieves a negative mean return during the financial crisis. Except for the 10Ind dataset, the mean return of the G2MV portfolio is also negative but it is higher than the one of the MV portfolio for five of the six datasets. At the same time, moreover, G2MV achieves a reduced volatility: the improvement compared to MV is of 0.54 percentage points on average over the six datasets. This is worth underlining since the MV portfolio aims to minimize the portfolio-return variance. Thus, the G2MV portfolio is characterized by a superior mean-variance trade-off. In terms of higher moments, the G2MV portfolio features slightly less skewness but substantially less kurtosis. The combined effect of the four moments is that the G2MV portfolio outperforms the MV portfolio in terms of modified VaR for five of the six datasets. Those conclusions are qualitatively similar when looking at MSR as reference portfolio. This provides clear evidence that, during crisis periods, there are benefits in explicitly managing higher moments, rather than optimizing the mean and variance only. Finally, the G2MV portfolio behaves overall better than the MVaRsam portfolio. It provides a substantially larger mean on all datasets, less volatility on five of the six datasets, and a modified VaR reduced by 0.27 percentage points on average over the six datasets.

Table 5.6: Out-of-sample performance during the financial crisis

| Criterion | MV | G2MV | MSR | G2MSR | MVaRsam |
|---|---|---|---|---|---|
| 6BTM dataset | | | | | |
| Mean | -19.59% | -7.05% | 2.04% | 41.52% | -17.32% |
| Volatility | 34.62% | 31.53% | 64.59% | 50.83% | 32.77% |
| Skewness | 0.25 | 0.77 | -0.04 | 0.68 | 0.45 |
| Excess kurtosis | 5.93 | 5.76 | 1.52 | 2.33 | 4.64 |
| 1% modified VaR | 7.73% | 5.75% | 11.03% | 6.87% | 6.28% |
| 25BTM dataset | | | | | |
| Mean | -27.64% | -17.23% | 17.47% | 25.02% | -27.81% |
| Volatility | 27.09% | 25.72% | 48.27% | 37.66% | 27.70% |
| Skewness | 0.29 | 0.37 | 0.12 | 0.51 | 0.15 |
| Excess kurtosis | 5.47 | 4.29 | 3.38 | 1.75 | 4.96 |
| 1% modified VaR | 5.84% | 4.94% | 9.12% | 5.28% | 5.99% |
| 6Prof dataset | | | | | |
| Mean | -13.92% | -4.75% | -44.41% | 31.32% | -9.68% |
| Volatility | 36.02% | 38.99% | 56.61% | 63.93% | 39.67% |
| Skewness | 0.46 | 0.40 | 0.11 | 0.25 | 0.38 |
| Excess kurtosis | 7.75 | 7.15 | 1.46 | 2.05 | 5.99 |
| 1% modified VaR | 8.50% | 8.96% | 9.38% | 10.34% | 8.53% |
| 25Prof dataset | | | | | |
| Mean | -32.73% | -33.88% | 67.92% | -32.97% | -36.23% |
| Volatility | 34.16% | 32.62% | 104.73% | 84.10% | 34.40% |
| Skewness | 0.22 | 0.09 | 0.31 | 0.27 | 0.08 |
| Excess kurtosis | 5.21 | 5.09 | 3.08 | 3.75 | 5.41 |
| 1% modified VaR | 7.38% | 7.22% | 18.10% | 15.91% | 7.80% |
| 10Ind dataset | | | | | |
| Mean | -2.44% | 5.34% | -27.08% | 14.48% | 2.73% |
| Volatility | 25.01% | 25.17% | 42.00% | 29.96% | 24.90% |
| Skewness | 1.19 | 0.85 | 0.37 | 0.15 | 0.71 |
| Excess kurtosis | 11.13 | 6.92 | 4.29 | 2.78 | 5.87 |
| 1% modified VaR | 5.55% | 4.81% | 8.05% | 5.34% | 4.68% |
| 30Ind dataset | | | | | |
| Mean | -17.75% | -15.00% | -27.29% | -20.38% | -17.18% |
| Volatility | 23.02% | 22.64% | 57.02% | 45.17% | 22.87% |
| Skewness | 0.65 | 0.07 | -0.05 | 0.41 | 0.15 |
| Excess kurtosis | 6.13 | 2.53 | 2.13 | 3.30 | 2.86 |
| 1% modified VaR | 4.61% | 4.14% | 10.38% | 7.85% | 4.22% |

Notes. The table reports the out-of-sample portfolio performances of the portfolios considered in Section 5.7.2 during the financial crisis. We follow the “NBER based Recession Indicators for the United States from the Peak through the Trough” and look at the period December 2007 to June 2009, a period of 397 trading days. The results are discussed in Section 5.7.3.2. The datasets and portfolio descriptions are provided in tables 5.3 and 5.4, respectively. The methodology is described in Section 5.7.1.

5.8 Conclusion

Since Markowitz (1952), the expected-utility framework has largely dominated the scientific production dealing with portfolio theory. There are good reasons for this. For instance, it leads to a clear (and sometimes simple) optimization problem and, at the same time, has a sound economic interpretation in terms of investors’ preferences, specified via the utility function. Surprisingly or not, portfolio managers and investors rarely reason in terms of utility functions. Instead, they often think in terms of return statistics like mean, dispersion, asymmetry or tail risk; that is, in terms of return distribution. Of course, the expected-utility framework is intrinsically linked to the portfolio-return distribution: both depend on the asset-return distribution and the portfolio weights. Yet, the link between the portfolio delivered by maximizing the expected utility on the one hand, and the shape of its return distribution on the other hand, remains difficult to grasp for investors. This explains why, in practice, many investment policies directly rely on alternative performance indicators derived from the return distribution, such as moments or downside-risk statistics like the Value-at-Risk.

In this chapter, we introduced a new theoretical framework to construct portfolios. It is similar in spirit to the expected-utility framework in the sense that both aim to maximize a function of portfolio returns, fully accounting for the investor’s preferences. However, instead of using a utility function, the investor now uses a target-return density whose shape reflects his preferences, and then invests in the portfolio whose return distribution is as close as possible to the target-return distribution given the considered investment universe. This second approach is expected to look more natural and intuitive to practitioners, and thus, we believe, provides an interesting way to reconcile the academic theory with the industry viewpoints.

To operationalize this approach, we have focused in the chapter on the generalized-normal target-return distribution, with a mean and variance matching those of a mean-variance-efficient reference portfolio. In this setup, we have demonstrated that the minimum-divergence portfolio is located on Markowitz’s efficient frontier if the asset returns are Gaussian (Theorem 5.1). Otherwise, it can move away from the efficient frontier so as to better fit the target-return higher moments. Moreover, when targeting a Gaussian return distribution, the objective function has a clear economic decomposition (Theorem 5.2). The first two terms control the fit to the target-return mean and variance, respectively. The third term collapses to the Shannon entropy of the standardized portfolio return, and acts as a natural penalty that “shrinks” the shape of the portfolio-return distribution toward that of a Gaussian. Another important result is that Markowitz’s efficient frontier is recovered by considering a set of Dirac-delta target-return densities (Theorem 5.3). The soundness of the proposed approach is corroborated by an extensive empirical study. For six datasets and five reference portfolios, our minimum-divergence portfolios display a mean-variance trade-off similar to that of the reference portfolios, combined with substantially less tail risk, including in crisis periods.

Hence, the minimum-divergence portfolio is a framework that is theoretically and economically sound while, at the same time, remaining simple, clear and intuitive. Moreover, it exhibits appealing performance, which can be explained from the underlying theory. Therefore, we believe that the minimum-divergence portfolio is a strategy that provides an interesting response to current challenges in the field. For this reason, it would be worth considering in both academic and industry-driven portfolio-management studies yet to come.

5.9 Appendix

This appendix contains four sections. Section 5.9.1 gives proofs of all results in the chapter. Section 5.9.2 details the bandwidth-selection method. Section 5.9.3 reports and discusses the results of all considered portfolio strategies. Section 5.9.4 discusses the connection between Shannon entropy and portfolio-weight diversification.

5.9.1 Proofs of results

5.9.1.1 Proposition 5.1

Proof. When using the KL divergence in (5.4), the objective function can be decomposed as

\[ C(w; f_T) = H(P) + \mathbb{E}[\ln f_T(P)]. \]

For the generalized-normal target-return density $f_T(x; \alpha, \beta, \gamma)$, this becomes

\[ C(w; f_T) = H(P) + \ln\!\left(\frac{2^{-\frac{\gamma+1}{\gamma}}\,\gamma}{\beta\,\Gamma(1/\gamma)}\right) - \frac{1}{2\beta^{\gamma}}\,\mathbb{E}\big[|P-\alpha|^{\gamma}\big]. \]

Discarding the constant middle term completes the proof.
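The decomposition above implies a specific parameterization of the generalized-normal density, with kernel exp(-|x - α|^γ / (2β^γ)). The sketch below encodes that reading and checks numerically that the implied normalizing constant is the right one and that γ = 2 recovers a Gaussian with standard deviation β. The function names and the trapezoidal integration grid are illustrative assumptions.

```python
import math
from statistics import NormalDist

def gen_normal_logpdf(x, alpha, beta, gamma):
    """Log-density implied by the decomposition of C(w; f_T) above."""
    log_const = (-(gamma + 1) / gamma * math.log(2) + math.log(gamma)
                 - math.log(beta) - math.lgamma(1 / gamma))
    return log_const - abs(x - alpha) ** gamma / (2 * beta ** gamma)

def integrate_density(alpha, beta, gamma, half_width=12.0, n=20001):
    """Trapezoidal integral of the density over alpha +/- half_width * beta.

    The truncation is adequate for gamma >= 2; heavier tails (small gamma)
    would need a wider grid.
    """
    lo = alpha - half_width * beta
    dx = 2 * half_width * beta / (n - 1)
    ys = [math.exp(gen_normal_logpdf(lo + i * dx, alpha, beta, gamma))
          for i in range(n)]
    return dx * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

mass_gaussian = integrate_density(0.0, 1.0, 2.0)   # should be close to 1
mass_gamma4 = integrate_density(0.1, 0.5, 4.0)     # should be close to 1
```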

5.9.1.2 Proposition 5.2

Proof. To prove the main statement, we make use of two preliminary results. First, from (2.6)–(2.7), the portfolio-return entropy $H(P)$ is shift-invariant but increases linearly with the log-volatility:

\[ H(P) = H(P^{\star}) + \ln \sigma_P \quad \text{with} \quad P^{\star} := \frac{P - \mu_P}{\sigma_P}. \tag{5.32} \]

For $P \sim \mathcal{N}(\mu_P, \sigma_P)$, $H(P^{\star}) = H[\mathcal{N}(0, 1)] = \frac{1}{2} \ln 2\pi e$ is a constant that can be discarded in the objective function. Second, for a generic random variable $Z \sim \mathcal{N}(\mu, \sigma)$ and $p > -1$ a real number, the $p$th absolute moment of $Z$ is given by (Winkelbauer 2014)

\[ \mathbb{E}[|Z|^{p}] = \frac{\sigma^{p}\, 2^{p/2}\, \Gamma\!\left(\frac{p+1}{2}\right)}{\pi^{1/2}}\; {}_1F_1\!\left(-\frac{1}{2}\left(\frac{\mu}{\sigma}\right)^{2};\, -\frac{p}{2},\, \frac{1}{2}\right), \tag{5.33} \]

where

\[ {}_1F_1(x; c_1, c_2) := \frac{\Gamma(c_2)}{\Gamma(c_1)} \sum_{n=0}^{\infty} \frac{\Gamma(c_1 + n)}{\Gamma(c_2 + n)}\, \frac{x^{n}}{n!} \tag{5.34} \]

is the confluent hypergeometric function (see Slater 1972).

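Combining these two properties, the KL divergence in (5.7) reduces to

\[ \langle f_P | f_T \rangle = -\ln \sigma_P + \frac{2^{\frac{\gamma-2}{2}}\, \Gamma\!\left(\frac{\gamma+1}{2}\right)}{\pi^{1/2}} \left(\frac{\sigma_P}{\beta}\right)^{\gamma} {}_1F_1\!\left(-\frac{1}{2}\left(\frac{\mu_P - \alpha}{\sigma_P}\right)^{2};\, -\frac{\gamma}{2},\, \frac{1}{2}\right). \tag{5.35} \]

To prove the proposition, we need to show that $\left.\frac{\partial \langle f_P | f_T \rangle}{\partial \mu_P}\right|_{\mu_P = \alpha} = 0$ and, then, that $\left.\frac{\partial \langle f_P | f_T \rangle}{\partial \sigma_P}\right|_{\mu_P = \alpha} = 0$ for $\sigma_P = C(\gamma)\sigma_T$.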

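The first result is verified because

\[ \frac{\partial\, {}_1F_1\!\left(-\frac{1}{2}\left(\frac{\mu_P - \alpha}{\sigma_P}\right)^{2};\, -\frac{\gamma}{2},\, \frac{1}{2}\right)}{\partial \mu_P} = \gamma\; {}_1F_1\!\left(-\frac{1}{2}\left(\frac{\mu_P - \alpha}{\sigma_P}\right)^{2};\, 1 - \frac{\gamma}{2},\, \frac{3}{2}\right) \frac{\mu_P - \alpha}{\sigma_P^{2}} = 0 \quad \text{for } \mu_P = \alpha. \tag{5.36} \]

Then, because ${}_1F_1\!\left(0; -\frac{\gamma}{2}, \frac{1}{2}\right) = 1$ for all $\gamma$, the KL divergence $\langle f_P | f_T \rangle$ in (5.35) becomes, for $\mu_P = \alpha$,

\[ \langle f_P | f_T \rangle = -\ln \sigma_P + \frac{2^{\frac{\gamma-2}{2}}\, \Gamma\!\left(\frac{\gamma+1}{2}\right)}{\pi^{1/2}} \left(\frac{\sigma_P}{\beta}\right)^{\gamma}. \tag{5.37} \]

It is then straightforward to check that $\left.\frac{\partial \langle f_P | f_T \rangle}{\partial \sigma_P}\right|_{\mu_P = \alpha} = 0$ for

\[ \sigma_P = \left(\frac{\gamma\, 2^{\frac{\gamma-2}{2}}\, \Gamma\!\left(\frac{\gamma+1}{2}\right)}{\pi^{1/2}}\right)^{-\frac{1}{\gamma}} \beta = C(\gamma)\,\sigma_T. \]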
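The last step can be verified numerically. The sketch below evaluates the expression in (5.37) and the stationary volatility expressed in terms of β; it does not compute C(γ) itself, which also requires the relation between β and σ_T given in the main text. Function names are illustrative assumptions.

```python
import math

def _a(gamma):
    """Coefficient 2^{(gamma-2)/2} Gamma((gamma+1)/2) / sqrt(pi) in (5.37)."""
    return (2 ** ((gamma - 2) / 2) * math.gamma((gamma + 1) / 2)
            / math.sqrt(math.pi))

def kl_at_target_mean(sigma_p, beta, gamma):
    """Right-hand side of (5.37), i.e. the KL divergence when mu_P = alpha."""
    return -math.log(sigma_p) + _a(gamma) * (sigma_p / beta) ** gamma

def stationary_sigma(beta, gamma):
    """Root of d/d(sigma_P) of (5.37): (gamma * a)^(-1/gamma) * beta."""
    return (gamma * _a(gamma)) ** (-1 / gamma) * beta

sigma_gaussian = stationary_sigma(0.2, 2.0)   # Gaussian target: equals beta
sigma_gamma4 = stationary_sigma(1.0, 4.0)
```

Since (5.37) is strictly convex in σ_P, the stationary point is the unique minimizer of the divergence for a given β and γ.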

5.9.1.3 Theorem 5.1

Proof. First, we show that the optimal $\sigma_P$ will necessarily be lower than $\sigma_T$. This is straightforward because, for $\sigma_P \geq \sigma_T$, the target mean return $\mu_T$ is achievable and, from (5.36), is optimal for any $\sigma_P$. Moreover, given $\mu_P = \mu_T$, one can check from (5.37) that $\partial C(w; \alpha, \beta, \gamma) / \partial \sigma_P \leq 0$ for $\sigma_P \geq \sigma_T / C(\gamma) \geq \sigma_T$. Overall, this means that any pair $(\mu_P, \sigma_P)$ with $\sigma_P > \sigma_T$ is dominated by the achievable pair $(\mu_T, \sigma_T)$. Thus, the optimal $\sigma_P$ cannot be higher than $\sigma_T$.

Second, we show that for any $\sigma_P \leq \sigma_T$, it is always favorable to choose $\mu_P$ as large as possible; that is, on the efficient frontier. This holds because, for $\sigma_P \leq \sigma_T$, $\mu_P$ is always lower than $\alpha$ and, following (5.36), $\partial C(w; \alpha, \beta, \gamma) / \partial \mu_P > 0$ for $\mu_P < \alpha$. Thus, whatever the optimal $\sigma_P$ given the parameters $(\mu_T, \sigma_T, \gamma)$, the optimal $\mu_P$ is such that $(\mu_P, \sigma_P)$ is located on the efficient frontier. Clearly, it follows as well that $\mu_{MV} \leq \mu_P \leq \mu_T$. The optimal value of $\mu_P$, given by solving (5.10), is found from the expression of the KL divergence for the Gaussian portfolio return in (5.35).

5.9.1.4 Theorem 5.2

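Proof. Plugging $\gamma = 2$ in the objective function in (5.7) gives

\[ C(w; \alpha, \beta, 2) = H(P) - \frac{1}{2\beta^{2}}\, \mathbb{E}\big[(P - \alpha)^{2}\big] = H(P) - \frac{1}{2}\left(\frac{\mu_P - \alpha}{\beta}\right)^{2} - \frac{\sigma_P^{2}}{2\beta^{2}}. \]

The theorem results from the property in (5.32) that $H(P) = H(P^{\star}) + \ln \sigma_P$.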
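Written out in full, and using only (5.32) together with the identity E[(P − α)²] = (µ_P − α)² + σ_P², the two steps of the proof combine as follows. The grouping of terms in the comment after the display is a reading of the result, not a quotation of Theorem 5.2.

```latex
\begin{align*}
C(w;\alpha,\beta,2)
  &= H(P) - \frac{1}{2\beta^{2}}\,\mathbb{E}\big[(P-\alpha)^{2}\big] \\
  &= H(P) - \frac{1}{2}\Big(\frac{\mu_P-\alpha}{\beta}\Big)^{2}
          - \frac{\sigma_P^{2}}{2\beta^{2}} \\
  &= -\frac{1}{2}\Big(\frac{\mu_P-\alpha}{\beta}\Big)^{2}
     + \Big(\ln\sigma_P - \frac{\sigma_P^{2}}{2\beta^{2}}\Big)
     + H(P^{\star}).
\end{align*}
```

The first term is maximized at µ_P = α, the bracketed term is maximized at σ_P = β (its derivative 1/σ_P − σ_P/β² vanishes there), and the last term is the entropy of the standardized portfolio return.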

5.9.1.5 Proposition 5.3

Proof. The Lagrangian associated to the objective function $C(w; \alpha)$ is given by

\[ L = -(w'\mu - \alpha)^{2} - w'\Sigma w - \tau(\mathbf{1}'w - 1), \tag{5.38} \]

where $\tau$ is the Lagrange multiplier associated to the full-investment constraint $\mathbf{1}'w = 1$. Setting the derivative of $L$ with respect to $w$ to zero gives, after some developments,

\[ w = (\Sigma + \mu\mu')^{-1}\left(\alpha\mu - \frac{\tau}{2}\,\mathbf{1}\right). \tag{5.39} \]

Replacing this expression in the condition $\mathbf{1}'w = 1$ then leads to the following value for the Lagrange multiplier $\tau$:

\[ \tau = \frac{2\left(\alpha\,\mathbf{1}'(\Sigma + \mu\mu')^{-1}\mu - 1\right)}{\mathbf{1}'(\Sigma + \mu\mu')^{-1}\mathbf{1}}. \tag{5.40} \]

Combining (5.39) and (5.40) results in the solution in (5.18).
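The closed form obtained by combining (5.39) and (5.40) is easy to check numerically. In the sketch below, the function names and the three-asset inputs are hypothetical; the objective is taken to be C(w; α) = −(w′µ − α)² − w′Σw, as in the Lagrangian.

```python
import numpy as np

def dirac_target_weights(mu, sigma, alpha):
    """Weights from (5.39)-(5.40) for a target mean return alpha."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ones = np.ones_like(mu)
    second_moment = sigma + np.outer(mu, mu)        # Sigma + mu mu'
    a_mu = np.linalg.solve(second_moment, mu)       # (Sigma + mu mu')^{-1} mu
    a_one = np.linalg.solve(second_moment, ones)    # (Sigma + mu mu')^{-1} 1
    half_tau = (alpha * (ones @ a_mu) - 1.0) / (ones @ a_one)  # tau / 2
    return alpha * a_mu - half_tau * a_one

def objective(w, mu, sigma, alpha):
    """C(w; alpha) = -(w'mu - alpha)^2 - w'Sigma w."""
    return -(w @ mu - alpha) ** 2 - w @ sigma @ w

# Hypothetical inputs, for illustration only.
mu = np.array([0.05, 0.08, 0.12])
sigma = np.array([[0.040, 0.006, 0.004],
                  [0.006, 0.090, 0.010],
                  [0.004, 0.010, 0.160]])
alpha = 0.10
weights = dirac_target_weights(mu, sigma, alpha)
```

Because the objective is strictly concave and the constraint is linear, any perturbation of `weights` that keeps the weights summing to one should lower the objective.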

5.9.1.6 Theorem 5.3

Proof. We first show that, given any target mean return α, the minimum-divergence portfolio wα in (5.18) is a mean-variance portfolio. We prove this statement by contradiction. Suppose that the portfolio wα is a maximizer of the objective function C(w; α), but that it is not a mean-variance portfolio. Then, we can find another portfolio with the same mean return having a lower variance. However, it is easy to see that this portfolio is associated with a larger objective function, meaning that our original portfolio wα was not a maximizer of the objective function, which is a contradiction. The contradiction disappears if and only if wα is a mean-variance portfolio to begin with.

We can now prove the results one by one:

(i) When plugging $\alpha = \mu_{MV}$ in $C(w; \alpha)$, choosing $w_\alpha = w_{MV}$ makes the first term $-(\mu_P - \mu_{MV})^{2}$ go to zero, and the second term $-\sigma_P^{2}$ is maximized by definition of the minimum-variance portfolio. Thus, the portfolio $w_{MV}$ is the maximizer of the objective function when $\alpha = \mu_{MV}$.

(ii) As proved above, the minimum-divergence portfolio $w_\alpha$ is a mean-variance portfolio for any $\alpha$. Thus, for all $\alpha$, there exists $\mu_0$ for which we can write $w_\alpha = w_{\mu_0}$. To find $\alpha$ as a function of $\mu_0$, we use the fact that, by definition, the mean return $w_{\mu_0}'\mu$, and thus $w_\alpha'\mu$, is $\mu_0$. From (5.18), this gives the equation

\[ \mu' A \left(\alpha\mu - \frac{(\alpha\,\mathbf{1}'A\mu - 1)\,\mathbf{1}}{\mathbf{1}'A\mathbf{1}}\right) = \mu_0. \]

The result is then obtained by isolating $\alpha$.

(iii) This holds because $w_{\mu_0}$ is mean-variance efficient if and only if $\mu_0 \geq \mu_{MV}$.
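The step "isolating α" in (ii) can be made explicit. The display below is a sketch under two assumptions that the proof does not restate: that A is the symmetric matrix (Σ + µµ′)⁻¹ appearing in (5.39), and that the shorthand scalars a, b and c, introduced here for illustration only, are as defined in the first line. The result should agree with the expression stated in Theorem 5.3 up to rearrangement.

```latex
\begin{align*}
a &:= \mu' A \mu, \qquad
b := \mathbf{1}' A \mu = \mu' A \mathbf{1}, \qquad
c := \mathbf{1}' A \mathbf{1}, \\
\mu_0 &= \alpha a - \frac{(\alpha b - 1)\,b}{c}
       = \alpha\,\frac{ac - b^{2}}{c} + \frac{b}{c}, \\
\alpha &= \frac{c\,\mu_0 - b}{ac - b^{2}}.
\end{align*}
```

When A is positive definite and µ is not proportional to the vector of ones, the Cauchy-Schwarz inequality gives ac − b² > 0, so α is an increasing affine function of µ0.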

5.9.1.7 Proposition 5.4

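Proof. To prove the proposition, we need to evaluate the integral in (5.26). By definition of $\hat{f}_P^{GME}$ in (5.24) and applying Fubini’s theorem, this integral becomes

\begin{align*}
\int_{-\infty}^{+\infty} |x - \hat{\alpha}|^{\gamma} \sum_{k=1}^{K} \frac{\pi_k}{h_k}\, \phi\!\left(\frac{x - m_k}{h_k}\right) dx
&= \sum_{k=1}^{K} \int_{-\infty}^{+\infty} |x - \hat{\alpha}|^{\gamma}\, \frac{\pi_k}{h_k}\, \phi\!\left(\frac{x - m_k}{h_k}\right) dx \\
&= \sum_{k=1}^{K} \pi_k\, \mathbb{E}\big[|Z - \hat{\alpha}|^{\gamma}\big] \quad \text{where } Z \sim \mathcal{N}(m_k, h_k) \\
&= \sum_{k=1}^{K} \pi_k\, \mathbb{E}\big[|Z|^{\gamma}\big] \quad \text{where } Z \sim \mathcal{N}(m_k - \hat{\alpha}, h_k).
\end{align*}

The proposition follows from the absolute moment of a Gaussian random variable in (5.33).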

5.9.1.8 Corollary 5.1

Proof. From the properties of the ${}_1F_1$ function, we have that ${}_1F_1(x; -1, 1/2) = 1 - 2x$. Thus, the estimated expectation in (5.27) becomes, for $\gamma = 2$,

\[ \hat{\mathbb{E}}\big[(P - \hat{\alpha})^{2}\big] = \sum_{k=1}^{K} \pi_k h_k^{2} \left(1 + \left(\frac{m_k - \hat{\alpha}}{h_k}\right)^{2}\right). \]

Combined with the definition of $\hat{C}(w; \gamma)$ in (5.21), the objective function $\hat{C}(w)$ becomes

\[ \hat{C}(w) = \widehat{H}(m, T) - \frac{1}{2\hat{\beta}^{2}} \sum_{k=1}^{K} \pi_k \left(h_k^{2} + (m_k - \hat{\alpha})^{2}\right). \]

Considering the density estimator $\hat{f}_P^{KDE}$, this simplifies to

\[ \hat{C}(w) = \widehat{H}(m, T) - \frac{1}{2\hat{\beta}^{2}} \left(\frac{1}{T} \sum_{t=1}^{T} (P_t - \hat{\alpha})^{2} + \frac{1}{T} \sum_{t=1}^{T} h_t^{2}\right). \tag{5.41} \]

For $h_t$ independent from $w$, the second sum in (5.41) is a constant that can be discarded in the objective function. Moreover, $\frac{1}{T} \sum_{t=1}^{T} (P_t - \hat{\alpha})^{2}$ is the sample estimator of the expectation $\mathbb{E}[(P - \hat{\alpha})^{2}] = (\mu_P - \hat{\alpha})^{2} + \sigma_P^{2}$, giving the final result.
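The γ = 2 simplification can be checked numerically. The sketch below compares the closed-form second moment of a Gaussian mixture with a brute-force integral, then specializes to a kernel density estimate with equal weights 1/T, centers P_t and a common bandwidth. All function names and the toy sample are illustrative assumptions.

```python
from statistics import NormalDist

def mixture_second_moment(weights, means, bandwidths, alpha):
    """Sum_k pi_k (h_k^2 + (m_k - alpha)^2), the gamma = 2 case."""
    return sum(p * (h * h + (m - alpha) ** 2)
               for p, m, h in zip(weights, means, bandwidths))

def kde_second_moment(sample, bandwidth, alpha):
    """Bracket in (5.41) for a common bandwidth: sample term plus h^2."""
    T = len(sample)
    return sum((x - alpha) ** 2 for x in sample) / T + bandwidth ** 2

def numeric_second_moment(weights, means, bandwidths, alpha,
                          lo=-0.1, hi=0.1, n=4001):
    """Trapezoidal integral of (x - alpha)^2 times the mixture density."""
    comps = [NormalDist(m, h) for m, h in zip(means, bandwidths)]
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        density = sum(p * c.pdf(x) for p, c in zip(weights, comps))
        edge = 0.5 if i in (0, n - 1) else 1.0
        total += edge * (x - alpha) ** 2 * density
    return total * dx

sample = [-0.02, 0.01, 0.03]   # toy portfolio returns
h, alpha_hat = 0.01, 0.005
pi = [1 / len(sample)] * len(sample)
closed_form = mixture_second_moment(pi, sample, [h] * len(sample), alpha_hat)
```

The bandwidth only enters through the additive h² term, which is why it drops out of the optimization when it does not depend on w.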

5.9.2 Choice of kernel bandwidth

One common choice to set the kernel bandwidth h is Silverman (1986)’s rule-of-thumb. However, it is well known that this bandwidth tends to oversmooth the density. Another class of bandwidths are those found via cross-validation; see Duin (1976) and Rudemo (1982). While more accurate, such bandwidths are known to lack robustness and be highly variable, thus often undersmoothing the density. We rely on a more recent approach proposed by Zhang and Wang (2008), who devise a bandwidth based on quantile ratios that is both robust and accurate. It is defined as

\[ h = \left(\frac{4}{3T}\right)^{1/5} \widehat{Q}_{0.3}, \tag{5.42} \]

where $\widehat{Q}_{0.3}$ is the 30% quantile of the ratio quantile ranges

\[ \widehat{RQR}_t = \frac{P_{(\langle t+m \rangle : T)} - P_{(\langle t-m \rangle : T)}}{\Phi^{-1}(q_t) - \Phi^{-1}(p_t)}, \quad q_t = \frac{\langle t+m \rangle - 0.5}{N}, \quad p_t = \frac{\langle t-m \rangle - 0.5}{N}, \quad t = 1, \ldots, T, \tag{5.43} \]

where $m = T^{1/2}$, and $\langle x \rangle = x$ if $1 \leq x \leq T$, $1$ if $x < 1$, and $T$ if $x > T$. To avoid that h varies with the portfolio weights w, which would unpredictably affect the minimum-divergence portfolio obtained, we actually compute the bandwidth in (5.42)–(5.43) based on the equally weighted portfolio $P_t = \mathbf{1}'X_t / N$.
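One possible reading of (5.42)–(5.43) is sketched below. It is not Zhang and Wang's reference implementation, and several choices are assumptions made for the sketch: the denominators of q_t and p_t are taken to be the sample size T, m is rounded to the nearest integer, the index clamping is applied inside q_t and p_t as well as in the order statistics, and the 30% quantile uses linear interpolation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def quantile_ratio_bandwidth(sample):
    """Bandwidth h = (4 / (3T))^(1/5) * Q_0.3 from ratio quantile ranges."""
    p_sorted = sorted(sample)
    T = len(p_sorted)
    if T < 2:
        raise ValueError("need at least two observations")
    m = max(1, round(math.sqrt(T)))          # assumed integer version of T^(1/2)
    inv_cdf = NormalDist().inv_cdf

    def clamp(i):
        # Keep order-statistic indices inside 1..T.
        return min(max(i, 1), T)

    ratios = []
    for t in range(1, T + 1):
        hi, lo = clamp(t + m), clamp(t - m)
        numerator = p_sorted[hi - 1] - p_sorted[lo - 1]
        # Assumption: plotting positions use the sample size T.
        denominator = inv_cdf((hi - 0.5) / T) - inv_cdf((lo - 0.5) / T)
        ratios.append(numerator / denominator)

    # 30% quantile by linear interpolation between order statistics.
    ratios.sort()
    pos = 0.3 * (T - 1)
    k = int(math.floor(pos))
    frac = pos - k
    q30 = ratios[k] + frac * (ratios[min(k + 1, T - 1)] - ratios[k])
    return (4 / (3 * T)) ** 0.2 * q30

# Sanity check on exact standard-normal scores: every ratio equals one,
# so the bandwidth reduces to (4 / (3T))^(1/5).
T_demo = 400
scores = [NormalDist().inv_cdf((i - 0.5) / T_demo) for i in range(1, T_demo + 1)]
h_demo = quantile_ratio_bandwidth(scores)
```

In the setting of the chapter, `sample` would be the return series of the equally weighted portfolio, so that the bandwidth does not depend on the weights being optimized.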

5.9.3 Out-of-sample performance of all considered portfolio strategies

In Table 5.7, we report the out-of-sample performance for all the considered portfolio strategies that are listed in Table 5.4. We discuss these figures below.

5.9.3.1 Comparison with reference portfolios

We focus here on the Gaussian target-return distribution, corresponding to γ = 2. The results for γ = 4 are discussed in Section 5.9.3.3.

First, let us look at the mean-variance trade-off. As desired, the portfolio-return mean and volatility of the minimum-divergence portfolio are close to those of the reference portfolio. The out-of-sample mean return attained by the minimum-divergence portfolio can either be higher or lower than the one of the reference portfolio without any clear trend. The out-of-sample volatility attained is always larger than the reference portfolio

Table 5.7: Out-of-sample performance of all considered portfolios (1/3)

6BTM dataset

| Strategy | Mean | Volatility | Sharpe ratio | Skewness | Excess kurtosis | Modified 1% VaR | Modified 1% SR | Turnover |
|---|---|---|---|---|---|---|---|---|
| Reference portfolios and minimum-divergence portfolios | | | | | | | | |
| MV | 19.30% | 13.40% | 1.44 | -0.36 | 19.61 | 5.94% | 1.29 | 2.46% |
| G2MV | 22.04% | 14.78% | 1.49 | 0.05 | 9.40 | 4.09% | 2.14 | 5.01% |
| G4MV | 22.45% | 15.15% | 1.48 | -0.11 | 9.50 | 4.32% | 2.06 | 4.76% |
| LW | 18.41% | 13.37% | 1.38 | -0.38 | 20.45 | 6.10% | 1.20 | 1.83% |
| G2LW | 22.51% | 15.15% | 1.49 | 0.07 | 9.14 | 4.12% | 2.17 | 5.42% |
| G4LW | 22.50% | 15.20% | 1.48 | -0.11 | 9.37 | 4.31% | 2.07 | 4.80% |
| MV2n | 18.58% | 13.91% | 1.34 | -0.53 | 21.39 | 6.60% | 1.12 | 1.92% |
| G2MV2n | 22.46% | 15.22% | 1.48 | 0.08 | 8.76 | 4.05% | 2.20 | 5.44% |
| G4MV2n | 22.39% | 15.24% | 1.47 | -0.09 | 9.20 | 4.27% | 2.08 | 4.81% |
| MSR | 19.48% | 20.92% | 0.93 | -0.37 | 24.72 | 10.90% | 0.71 | 12.10% |
| G2MSR | 22.31% | 18.79% | 1.19 | 0.40 | 14.66 | 6.31% | 1.40 | 8.55% |
| G4MSR | 22.06% | 17.79% | 1.24 | 0.25 | 12.79 | 5.63% | 1.55 | 7.06% |
| KZ | 19.48% | 19.29% | 1.01 | -0.39 | 21.05 | 9.01% | 0.86 | 11.40% |
| G2KZ | 23.56% | 19.04% | 1.24 | 0.31 | 11.35 | 5.57% | 1.68 | 8.92% |
| G4KZ | 22.24% | 17.85% | 1.25 | 0.14 | 10.37 | 5.13% | 1.72 | 7.11% |
| Higher-moment portfolios | | | | | | | | |
| CRRAsam, λ = 10 | 19.46% | 13.37% | 1.45 | -0.34 | 19.58 | 5.91% | 1.31 | 2.53% |
| CRRAsam, λ = 50 | 20.20% | 13.66% | 1.48 | -0.29 | 17.48 | 5.59% | 1.43 | 3.00% |
| CRRAsam, λ = 200 | 21.52% | 14.46% | 1.49 | -0.20 | 13.13 | 4.95% | 1.72 | 3.90% |
| CRRArob, λ = 10 | 17.96% | 13.60% | 1.32 | -0.42 | 20.32 | 6.20% | 1.15 | 1.74% |
| CRRArob, λ = 50 | 17.59% | 14.24% | 1.24 | -0.45 | 18.79 | 6.19% | 1.13 | 1.42% |
| CRRArob, λ = 200 | 16.76% | 14.86% | 1.13 | -0.47 | 16.29 | 5.92% | 1.12 | 1.14% |
| MVKsam | 19.49% | 13.71% | 1.42 | -0.50 | 19.42 | 6.09% | 1.27 | 2.61% |
| MVKrob | 18.01% | 14.36% | 1.25 | -0.52 | 18.47 | 6.19% | 1.15 | 1.61% |
| MVaRsam | 22.91% | 15.20% | 1.51 | -0.04 | 8.68 | 4.11% | 2.21 | 5.64% |
| MVaRrob | 16.06% | 16.06% | 1.00 | -0.43 | 12.01 | 5.38% | 1.18 | 1.14% |

25BTM dataset

| Strategy | Mean | Volatility | Sharpe ratio | Skewness | Excess kurtosis | Modified 1% VaR | Modified 1% SR | Turnover |
|---|---|---|---|---|---|---|---|---|
| Reference portfolios and minimum-divergence portfolios | | | | | | | | |
| MV | 20.27% | 11.09% | 1.83 | -0.53 | 16.77 | 4.48% | 1.80 | 4.42% |
| G2MV | 22.09% | 11.79% | 1.87 | -0.35 | 10.10 | 3.55% | 2.47 | 7.49% |
| G4MV | 22.26% | 11.75% | 1.89 | -0.34 | 11.16 | 3.72% | 2.38 | 7.14% |
| LW | 19.62% | 11.03% | 1.78 | -0.55 | 17.47 | 4.58% | 1.70 | 3.84% |
| G2LW | 22.23% | 11.79% | 1.89 | -0.37 | 9.53 | 3.46% | 2.55 | 7.50% |
| G4LW | 22.35% | 11.80% | 1.89 | -0.33 | 10.99 | 3.70% | 2.40 | 7.24% |
| MV2n | 20.33% | 11.00% | 1.85 | -0.56 | 15.76 | 4.29% | 1.88 | 4.22% |
| G2MV2n | 21.97% | 11.78% | 1.86 | -0.37 | 9.47 | 3.45% | 2.53 | 7.60% |
| G4MV2n | 22.25% | 11.76% | 1.89 | -0.34 | 11.10 | 3.71% | 2.38 | 7.20% |
| MSR | 26.23% | 16.36% | 1.60 | -0.08 | 23.24 | 7.95% | 1.31 | 12.67% |
| G2MSR | 24.00% | 15.49% | 1.55 | 0.16 | 10.43 | 4.43% | 2.15 | 13.02% |
| G4MSR | 23.89% | 15.48% | 1.54 | -0.22 | 14.44 | 5.61% | 1.69 | 13.68% |
| KZ | 24.74% | 13.20% | 1.87 | -0.31 | 13.04 | 4.53% | 2.17 | 8.38% |
| G2KZ | 22.77% | 13.34% | 1.71 | -0.09 | 9.07 | 3.70% | 2.44 | 10.32% |
| G4KZ | 22.52% | 13.05% | 1.73 | -0.33 | 10.43 | 3.99% | 2.24 | 10.02% |
| Higher-moment portfolios | | | | | | | | |
| CRRAsam, λ = 10 | 20.33% | 11.08% | 1.84 | -0.53 | 16.68 | 4.46% | 1.81 | 4.47% |
| CRRAsam, λ = 50 | 20.69% | 11.14% | 1.86 | -0.50 | 16.10 | 4.39% | 1.87 | 4.81% |
| CRRAsam, λ = 200 | 21.67% | 11.41% | 1.90 | -0.42 | 14.15 | 4.14% | 2.08 | 5.69% |
| CRRArob, λ = 10 | 19.57% | 11.00% | 1.78 | -0.54 | 17.21 | 4.52% | 1.72 | 3.84% |
| CRRArob, λ = 50 | 18.97% | 10.93% | 1.74 | -0.49 | 17.23 | 4.49% | 1.68 | 3.81% |
| CRRArob, λ = 200 | 20.18% | 11.16% | 1.81 | -0.52 | 16.37 | 4.44% | 1.80 | 4.49% |
| MVKsam | 20.41% | 10.99% | 1.86 | -0.60 | 14.51 | 4.09% | 1.98 | 4.43% |
| MVKrob | 19.43% | 11.00% | 1.77 | -0.62 | 14.86 | 4.16% | 1.85 | 3.72% |
| MVaRsam | 22.51% | 12.11% | 1.86 | -0.40 | 11.63 | 3.94% | 2.27 | 7.31% |
| MVaRrob | 20.27% | 11.09% | 1.83 | -0.53 | 16.77 | 4.48% | 1.80 | 4.42% |

Table 5.7: Out-of-sample performance of all considered portfolios (2/3)

6Prof dataset

| Strategy | Mean | Volatility | Sharpe ratio | Skewness | Excess kurtosis | Modified 1% VaR | Modified 1% SR | Turnover |
|---|---|---|---|---|---|---|---|---|
| Reference portfolios and minimum-divergence portfolios | | | | | | | | |
| MV | 17.43% | 14.04% | 1.24 | -0.08 | 22.44 | 6.68% | 1.04 | 2.41% |
| G2MV | 20.53% | 16.11% | 1.27 | -0.01 | 16.36 | 6.17% | 1.32 | 4.98% |
| G4MV | 21.44% | 15.94% | 1.35 | -0.01 | 17.19 | 6.29% | 1.35 | 4.34% |
| LW | 16.63% | 14.10% | 1.18 | -0.14 | 22.31 | 6.72% | 0.98 | 1.74% |
| G2LW | 20.15% | 16.33% | 1.23 | -0.04 | 15.58 | 6.09% | 1.31 | 5.37% |
| G4LW | 21.50% | 16.01% | 1.34 | -0.01 | 16.95 | 6.26% | 1.36 | 4.46% |
| MV2n | 15.99% | 14.46% | 1.11 | -0.28 | 21.03 | 6.69% | 0.95 | 1.79% |
| G2MV2n | 20.72% | 16.53% | 1.25 | -0.03 | 15.53 | 6.14% | 1.34 | 5.69% |
| G4MV2n | 21.77% | 16.11% | 1.35 | 0.00 | 16.49 | 6.19% | 1.40 | 4.72% |
| MSR | 22.68% | 22.08% | 1.03 | -0.10 | 12.52 | 7.31% | 1.23 | 13.79% |
| G2MSR | 25.78% | 24.68% | 1.04 | 0.01 | 14.82 | 8.89% | 1.15 | 11.50% |
| G4MSR | 24.79% | 22.34% | 1.11 | -0.02 | 13.64 | 7.68% | 1.28 | 8.86% |
| KZ | 21.81% | 20.90% | 1.04 | -0.09 | 11.54 | 6.62% | 1.31 | 12.00% |
| G2KZ | 25.37% | 23.20% | 1.09 | 0.04 | 13.29 | 7.80% | 1.29 | 10.20% |
| G4KZ | 23.60% | 20.93% | 1.13 | -0.01 | 13.74 | 7.22% | 1.30 | 8.76% |
| Higher-moment portfolios | | | | | | | | |
| CRRAsam, λ = 10 | 17.66% | 14.03% | 1.26 | -0.09 | 22.25 | 6.64% | 1.06 | 2.50% |
| CRRAsam, λ = 50 | 18.78% | 14.31% | 1.31 | -0.09 | 20.77 | 6.46% | 1.15 | 2.83% |
| CRRAsam, λ = 200 | 20.35% | 15.12% | 1.35 | -0.07 | 18.09 | 6.21% | 1.30 | 3.54% |
| CRRArob, λ = 10 | 16.44% | 14.28% | 1.15 | -0.17 | 21.00 | 6.54% | 1.00 | 1.69% |
| CRRArob, λ = 50 | 16.50% | 14.82% | 1.11 | -0.24 | 18.30 | 6.24% | 1.05 | 1.44% |
| CRRArob, λ = 200 | 16.05% | 15.32% | 1.05 | -0.33 | 16.59 | 6.12% | 1.04 | 1.11% |
| MVKsam | 17.04% | 14.53% | 1.17 | -0.10 | 19.78 | 6.36% | 1.06 | 2.64% |
| MVKrob | 16.31% | 14.99% | 1.09 | -0.16 | 18.16 | 6.24% | 1.04 | 1.68% |
| MVaRsam | 21.72% | 16.31% | 1.33 | -0.08 | 15.30 | 6.04% | 1.43 | 5.27% |
| MVaRrob | 15.94% | 16.40% | 0.97 | -0.18 | 15.83 | 6.28% | 1.01 | 1.24% |

25Prof dataset

| Strategy | Mean | Volatility | Sharpe ratio | Skewness | Excess kurtosis | Modified 1% VaR | Modified 1% SR | Turnover |
|---|---|---|---|---|---|---|---|---|
| Reference portfolios and minimum-divergence portfolios | | | | | | | | |
| MV | 16.05% | 12.11% | 1.33 | -0.39 | 24.61 | 6.27% | 1.02 | 4.53% |
| G2MV | 17.75% | 13.20% | 1.35 | -0.33 | 14.83 | 4.92% | 1.43 | 8.20% |
| G4MV | 17.82% | 13.35% | 1.34 | -0.37 | 15.45 | 5.11% | 1.38 | 7.83% |
| LW | 15.88% | 12.01% | 1.32 | -0.40 | 25.66 | 6.41% | 0.98 | 3.94% |
| G2LW | 17.71% | 13.33% | 1.33 | -0.38 | 15.10 | 5.04% | 1.40 | 8.37% |
| G4LW | 17.59% | 13.15% | 1.34 | -0.36 | 16.01 | 5.13% | 1.36 | 7.78% |
| MV2n | 16.02% | 12.12% | 1.32 | -0.34 | 25.23 | 6.37% | 1.00 | 4.34% |
| G2MV2n | 17.34% | 13.43% | 1.29 | -0.35 | 16.35 | 5.31% | 1.30 | 8.28% |
| G4MV2n | 17.20% | 13.26% | 1.30 | -0.34 | 15.55 | 5.09% | 1.34 | 8.22% |
| MSR | 26.38% | 29.41% | 0.90 | 0.84 | 45.25 | 22.17% | 0.47 | 34.25% |
| G2MSR | 21.55% | 29.04% | 0.74 | 0.26 | 29.33 | 16.31% | 0.52 | 27.46% |
| G4MSR | 19.56% | 25.64% | 0.76 | -0.10 | 20.94 | 11.70% | 0.66 | 24.53% |
| KZ | 20.86% | 15.72% | 1.33 | 0.00 | 20.31 | 6.92% | 1.20 | 10.37% |
| G2KZ | 19.23% | 15.95% | 1.21 | -0.33 | 11.81 | 5.24% | 1.46 | 11.96% |
| G4KZ | 19.19% | 15.59% | 1.23 | -0.32 | 14.08 | 5.64% | 1.35 | 11.63% |
| Higher-moment portfolios | | | | | | | | |
| CRRAsam, λ = 10 | 16.05% | 12.08% | 1.33 | -0.39 | 24.56 | 6.25% | 1.02 | 4.59% |
| CRRAsam, λ = 50 | 16.31% | 12.17% | 1.34 | -0.40 | 23.44 | 6.10% | 1.06 | 4.92% |
| CRRAsam, λ = 200 | 17.07% | 12.57% | 1.36 | -0.37 | 20.18 | 5.68% | 1.19 | 5.87% |
| CRRArob, λ = 10 | 15.98% | 11.98% | 1.33 | -0.39 | 25.19 | 6.31% | 1.01 | 3.96% |
| CRRArob, λ = 50 | 15.93% | 12.07% | 1.32 | -0.41 | 25.62 | 6.44% | 0.98 | 3.98% |
| CRRArob, λ = 200 | 16.07% | 12.66% | 1.27 | -0.48 | 23.67 | 6.42% | 0.99 | 4.41% |
| MVKsam | 16.25% | 12.17% | 1.33 | -0.44 | 23.84 | 6.19% | 1.04 | 4.68% |
| MVKrob | 16.15% | 12.27% | 1.32 | -0.44 | 23.25 | 6.13% | 1.05 | 3.97% |
| MVaRsam | 18.01% | 13.20% | 1.37 | -0.31 | 17.77 | 5.48% | 1.31 | 7.76% |
| MVaRrob | 16.05% | 12.11% | 1.33 | -0.39 | 24.61 | 6.27% | 1.02 | 4.53% |

Table 5.7: Out-of-sample performance of all considered portfolios (3/3)

Portfolio | Mean | Volatility | Sharpe ratio | Skewness | Excess kurtosis | Modified 1% VaR | Modified 1% SR | Turnover

10Ind dataset
Reference portfolios and minimum-divergence portfolios
MV | 12.81% | 12.51% | 1.02 | -0.33 | 22.03 | 6.00% | 0.85 | 1.28%
G2MV | 12.56% | 14.12% | 0.89 | -0.30 | 12.84 | 4.85% | 1.03 | 3.14%
G4MV | 12.52% | 14.10% | 0.89 | -0.28 | 12.62 | 4.79% | 1.04 | 2.97%
LW | 13.14% | 12.47% | 1.05 | -0.34 | 22.47 | 6.06% | 0.86 | 1.12%
G2LW | 12.59% | 14.14% | 0.89 | -0.31 | 13.14 | 4.93% | 1.01 | 3.14%
G4LW | 12.49% | 14.12% | 0.88 | -0.28 | 12.55 | 4.78% | 1.04 | 2.98%
MV2n | 12.78% | 12.52% | 1.02 | -0.30 | 21.21 | 5.84% | 0.87 | 1.25%
G2MV2n | 12.48% | 14.16% | 0.88 | -0.28 | 12.76 | 4.84% | 1.02 | 3.15%
G4MV2n | 12.52% | 14.12% | 0.89 | -0.27 | 12.51 | 4.77% | 1.04 | 2.98%
MSR | 11.28% | 22.36% | 0.50 | -1.12 | 33.33 | 14.70% | 0.30 | 6.76%
G2MSR | 12.66% | 21.76% | 0.58 | 1.08 | 31.77 | 11.62% | 0.43 | 8.30%
G4MSR | 12.80% | 20.13% | 0.64 | 1.31 | 43.77 | 13.83% | 0.37 | 7.18%
KZ | 12.66% | 24.82% | 0.51 | 0.01 | 31.75 | 15.18% | 0.33 | 8.95%
G2KZ | 8.79% | 24.52% | 0.36 | 0.43 | 21.32 | 10.66% | 0.33 | 10.08%
G4KZ | 12.95% | 22.46% | 0.58 | 0.99 | 28.56 | 11.14% | 0.46 | 8.91%
Higher-moment portfolios
CRRAsam, λ = 10 | 12.90% | 12.52% | 1.03 | -0.34 | 21.71 | 5.95% | 0.86 | 1.32%
CRRAsam, λ = 50 | 12.96% | 12.71% | 1.02 | -0.40 | 20.89 | 5.91% | 0.87 | 1.53%
CRRAsam, λ = 200 | 12.59% | 13.26% | 0.95 | -0.42 | 17.64 | 5.54% | 0.90 | 2.09%
CRRArob, λ = 10 | 13.11% | 12.46% | 1.05 | -0.35 | 22.71 | 6.11% | 0.85 | 1.10%
CRRArob, λ = 50 | 13.11% | 12.48% | 1.05 | -0.39 | 24.16 | 6.40% | 0.81 | 1.02%
CRRArob, λ = 200 | 13.02% | 12.90% | 1.01 | -0.46 | 24.54 | 6.71% | 0.77 | 0.93%
MVKsam | 12.54% | 12.98% | 0.97 | -0.38 | 20.46 | 5.94% | 0.84 | 1.51%
MVKrob | 13.37% | 13.20% | 1.01 | -0.38 | 18.96 | 5.75% | 0.92 | 1.33%
MVaRsam | 12.58% | 14.28% | 0.88 | -0.53 | 14.43 | 5.33% | 0.94 | 3.08%
MVaRrob | 12.53% | 14.21% | 0.88 | -0.53 | 20.70 | 6.62% | 0.75 | 1.28%

30Ind dataset
Reference portfolios and minimum-divergence portfolios
MV | 10.95% | 11.72% | 0.93 | -0.74 | 23.30 | 5.95% | 0.73 | 2.45%
G2MV | 10.37% | 13.01% | 0.80 | -0.66 | 13.00 | 4.62% | 0.89 | 4.60%
G4MV | 10.42% | 13.04% | 0.80 | -0.60 | 12.07 | 4.44% | 0.93 | 4.60%
LW | 11.23% | 11.69% | 0.96 | -0.75 | 23.84 | 6.02% | 0.74 | 2.26%
G2LW | 10.61% | 13.00% | 0.82 | -0.58 | 11.66 | 4.34% | 0.97 | 4.65%
G4LW | 10.03% | 12.85% | 0.78 | -0.63 | 12.72 | 4.51% | 0.88 | 4.48%
MV2n | 11.10% | 11.67% | 0.95 | -0.72 | 23.56 | 5.96% | 0.74 | 2.41%
G2MV2n | 10.31% | 12.98% | 0.79 | -0.62 | 12.14 | 4.44% | 0.92 | 4.63%
G4MV2n | 10.37% | 13.28% | 0.78 | -0.57 | 11.36 | 4.38% | 0.94 | 4.81%
MSR | 13.39% | 24.64% | 0.54 | -0.51 | 13.24 | 8.80% | 0.60 | 11.36%
G2MSR | 8.81% | 22.49% | 0.39 | -0.25 | 11.65 | 7.34% | 0.48 | 12.29%
G4MSR | 9.82% | 21.80% | 0.45 | -0.09 | 12.19 | 7.16% | 0.54 | 12.74%
KZ | 7.01% | 22.98% | 0.31 | 0.35 | 25.35 | 11.47% | 0.24 | 11.48%
G2KZ | 12.32% | 22.02% | 0.56 | -0.41 | 16.40 | 8.83% | 0.55 | 11.23%
G4KZ | 13.44% | 21.28% | 0.63 | -0.17 | 15.77 | 8.16% | 0.65 | 11.45%
Higher-moment portfolios
CRRAsam, λ = 10 | 11.00% | 11.73% | 0.94 | -0.75 | 23.35 | 5.96% | 0.73 | 2.47%
CRRAsam, λ = 50 | 11.01% | 11.85% | 0.93 | -0.80 | 23.25 | 6.01% | 0.73 | 2.64%
CRRAsam, λ = 200 | 10.67% | 12.27% | 0.87 | -0.83 | 20.59 | 5.75% | 0.74 | 3.22%
CRRArob, λ = 10 | 11.22% | 11.74% | 0.96 | -0.74 | 22.88 | 5.88% | 0.76 | 2.29%
CRRArob, λ = 50 | 11.30% | 11.68% | 0.97 | -0.70 | 24.24 | 6.08% | 0.74 | 2.16%
CRRArob, λ = 200 | 11.14% | 12.04% | 0.93 | -0.48 | 22.52 | 5.92% | 0.75 | 1.81%
MVKsam | 11.71% | 12.13% | 0.96 | -0.66 | 21.98 | 5.91% | 0.79 | 2.63%
MVKrob | 11.89% | 12.15% | 0.98 | -0.66 | 21.75 | 5.87% | 0.80 | 2.44%
MVaRsam | 10.02% | 13.26% | 0.76 | -0.98 | 21.53 | 6.41% | 0.62 | 4.37%
MVaRrob | 12.13% | 13.56% | 0.89 | -0.28 | 19.15 | 5.92% | 0.81 | 1.91%

for the MV, LW and MV2n portfolio policies, which is natural because they aim to minimize the portfolio-return variance. In contrast, the volatility is frequently lower than the one of the reference portfolio for the MSR and KZ portfolio policies that also aim to maximize the portfolio mean return. The Sharpe ratio shows that the minimum-divergence portfolios perform similarly to the reference portfolios: they outperform in 17 out of 30 categories (five reference portfolios times six datasets).

Second, with regard to tail risk, the minimum-divergence portfolios have a skewness and excess kurtosis closer to zero than the reference portfolios, as desired. In terms of skewness, we improve upon the reference portfolio in 26 out of 30 categories. The skewness of the reference portfolios is almost systematically negative, while that of the minimum-divergence portfolios can sometimes be positive. In terms of kurtosis, the improvement is even more significant: we improve upon the reference portfolio in 28 out of 30 categories, and the average reduction in excess kurtosis is 35% over the 30 categories. As a result of this improvement in skewness and kurtosis, the modified VaR and modified Sharpe ratio almost always favor the minimum-divergence portfolio as well.
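To make the link between higher moments and the modified VaR concrete, the sketch below applies the Cornish-Fisher expansion to the moments reported for the MV portfolio on the 6Prof dataset. It rests on assumptions that are not stated in this section: that the modified VaR is the usual Cornish-Fisher VaR, that the reported mean and volatility are annualized with 252 trading days, and that the VaR is a daily figure. The function name `modified_var` is illustrative. Under these assumptions the computation gives a value of about 6.68%, in line with the table, and the ratio of the daily mean to the modified VaR matches the reported modified Sharpe ratio up to a factor of 100.

```python
from math import sqrt
from statistics import NormalDist

TRADING_DAYS = 252  # assumption: annualization convention


def modified_var(mean, vol, skew, excess_kurt, level=0.01):
    """Cornish-Fisher (modified) VaR for a return with the given moments."""
    z = NormalDist().inv_cdf(level)  # Gaussian quantile at the chosen level
    z_cf = (z
            + (z**2 - 1) * skew / 6
            + (z**3 - 3 * z) * excess_kurt / 24
            - (2 * z**3 - 5 * z) * skew**2 / 36)
    return -(mean + z_cf * vol)


# MV portfolio, 6Prof dataset: annualized mean 17.43%, volatility 14.04%.
mean_d = 0.1743 / TRADING_DAYS
vol_d = 0.1404 / sqrt(TRADING_DAYS)

mvar_mv = modified_var(mean_d, vol_d, skew=-0.08, excess_kurt=22.44)
mvar_gaussian = modified_var(mean_d, vol_d, skew=0.0, excess_kurt=0.0)
msr_mv = mean_d / mvar_mv  # daily mean per unit of modified VaR

print(f"modified 1% VaR: {mvar_mv:.2%}")
print(f"Gaussian 1% VaR: {mvar_gaussian:.2%}")
print(f"mean / modified VaR: {msr_mv:.4f}")
```

With the same mean and volatility, setting the skewness and excess kurtosis to zero gives a much smaller VaR, which is why a reduction in excess kurtosis translates so directly into a lower modified VaR.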

Third, in terms of turnover, the conclusions are similar to those made in Section 5.7.3. For the minimum-variance reference portfolios (MV, LW, MV2n), the minimum-divergence portfolios have a higher turnover. For the maximum-Sharpe-ratio reference portfolios (MSR, KZ), the minimum-divergence portfolios have a similar turnover.

5.9.3.2 Comparison with higher-moment portfolios

In Section 5.7.3, we discussed the performance of the MVaRsam portfolio. In this section, we look at the performance of the two other higher-moment portfolio strategies: CRRAsam and MVKsam. Again, we do not discuss the robust counterparts, CRRArob and MVKrob, because they do not perform as well except in terms of turnover.

First, we look at the CRRAsam portfolio. This four-moment utility approach does not seem to be the best candidate to improve higher moments. Indeed, even for values of the risk-aversion coefficient as extreme as λ = 200, this portfolio remains quite close to the MV portfolio, which disregards the portfolio-return skewness and kurtosis in the objective function. This is consistent with Cremers et al. (2005) and Markowitz (2014), as explained in Section 1.2.1. In particular, the CRRAsam portfolio is systematically outperformed by the G2MV portfolio in terms of modified VaR and modified Sharpe ratio for all values of λ, with a similar level of Sharpe ratio except for the 10Ind and 30Ind datasets.

Second, we look at the MVKsam portfolio, which looks for the best improvements in mean, variance and kurtosis starting from the equally weighted portfolio. The performance of this portfolio is disappointing in terms of higher moments. Indeed, even though its objective is to improve kurtosis, this portfolio improves kurtosis, modified VaR and modified Sharpe ratio less than the minimum-divergence portfolios do. Its only benefit is that it achieves a turnover of the same magnitude as the sample minimum-variance portfolio, which is lower than that of the minimum-divergence portfolios.

5.9.3.3 Results for the generalized-normal target with γ = 4

In this section, we assess the performance of minimum-divergence portfolios using a generalized-normal target-return distribution with γ = 4, which corresponds to thinner tails than the Gaussian (γ = 2). Thus, we expect the resulting minimum-divergence portfolio to penalize a misfit in the tails more strongly than for the Gaussian target return. Indeed, the objective function is given by $C(w; f_T) = H[P] + \mathbb{E}[\ln f_T(P)]$, and the integrand associated with the expectation, given by $f_P(z) \ln f_T(z)$, can be highly negative if, in the target-return tails, $f_P(z)$ is large relative to $f_T(z)$, which can easily happen when the target-return tails decay rapidly. Such a misfit in the tails can be reduced by decreasing the portfolio-return kurtosis but also, as seen in Theorem 5.2, its variance.
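The tail penalty described above can be illustrated with a small sketch that evaluates the term E[ln f_T(P)] in closed form. It assumes the standard generalized-normal density f_T(z) = γ / (2aΓ(1/γ)) exp(-(|z - μ|/a)^γ), with scale a chosen so that the target has a given standard deviation; this parametrization and the helper names `gn_scale` and `cross_entropy_term` are assumptions made for illustration and may differ from the thesis's own. Under that density, the term only depends on the portfolio return through E|P - μ|^γ, that is, the variance for γ = 2 and the fourth central moment for γ = 4.

```python
from math import gamma as gamma_fn, log, sqrt


def gn_scale(sigma, shape):
    """Scale of a generalized normal with standard deviation sigma."""
    return sigma * sqrt(gamma_fn(1 / shape) / gamma_fn(3 / shape))


def cross_entropy_term(shape, sigma_t, abs_moment):
    """E[ln f_T(P)] for a generalized-normal target f_T.

    abs_moment is E|P - mu_T|^shape under the portfolio-return density.
    """
    scale = gn_scale(sigma_t, shape)
    log_norm = log(shape / (2 * scale * gamma_fn(1 / shape)))
    return log_norm - abs_moment / scale**shape


# Two hypothetical unit-variance portfolio returns centered on the target
# mean: one with Gaussian kurtosis (3) and one with fat tails (kurtosis 9).
results = {}
for label, fourth_moment in [("Gaussian", 3.0), ("fat-tailed", 9.0)]:
    g2 = cross_entropy_term(2, 1.0, abs_moment=1.0)  # E[P^2] = 1 in both cases
    g4 = cross_entropy_term(4, 1.0, abs_moment=fourth_moment)
    results[label] = (g2, g4)
    print(f"{label:10s}  gamma = 2: {g2:.3f}   gamma = 4: {g4:.3f}")
```

In this sketch the γ = 2 term is identical for both return distributions, since it only sees the variance, whereas the γ = 4 term is markedly lower for the fat-tailed return. Because the fourth central moment equals the kurtosis times the squared variance, it can be lowered through either quantity, which is consistent with the remark that the misfit can be reduced via the kurtosis or the variance.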

In terms of mean-variance trade-off, it is interesting to observe that γ = 4 tends to produce a lower portfolio-return volatility than γ = 2, with a reduction of 0.55 percentage points on average over the 30 categories. At the same time, the portfolio mean return remains similar, and thus, the Sharpe ratio increases by 4.5% on average over the 30 categories compared to γ = 2. In terms of higher moments, however, the skewness remains similar but the kurtosis is worsened: it is increased by 0.73 on average. The combined effect of the reduction in volatility and the increase in kurtosis means that the 1% modified VaR and 1% modified Sharpe ratio remain similar to those obtained for γ = 2. These results show that, for the datasets considered, the fit to the target-return tails is better realized via a lower volatility than a lower kurtosis. Finally, because setting γ = 4 favors a reduction in volatility rather than kurtosis, it is associated with less rebalancing: the daily portfolio turnover is reduced by 0.61 percentage points on average over the 30 categories.

It is important to note, though, that these differences are not substantial, and thus, the sensitivity to the particular value of γ chosen remains reasonable.

5.9.4 Entropy and diversification

In this section, we provide a discussion on the relationship between higher moments, as captured by Shannon entropy, and portfolio-weight diversification. Several studies provide evidence that increased portfolio-weight diversification can actually deteriorate the portfolio-return skewness and kurtosis; see Simkowitz and Beedles (1978), Conine and Tamarkin (1981), Tang and Choi (1998) and Desmoulins-Lebeault and Kharoubi-Rakotomalala (2012). The latter find that “[...] the increasing number of stocks in the portfolio, far from diversifying away the negative asymmetry of returns, actually amplifies skewness” (p.1991), and that “[...] an increase in portfolio size first has a very positive effect, diminishing the kurtosis brutally to reach a minimum for a size of three or four stocks. However, after this minimum, the kurtosis re-increases, first rapidly and then more slowly [...]” (p.1992). Thus, it may be rational for investors with preferences for higher skewness and lower kurtosis to hold more concentrated portfolios than what a simple mean-variance analysis would advise.

In this section, we evaluate whether this stance, opposite to the traditional point of view in investment management, is reflected by our empirical results. We restrict to the Gaussian target return (γ = 2) and the MV and MSR reference portfolios, which is sufficient to make our point. The G2MV and G2MSR portfolios are less diversified than the MV and MSR portfolios if the entropy of standardized portfolio returns, $H(P^\star)$, is increased (that is, if the portfolio-return distribution is closer to a Gaussian distribution) by taking more concentrated positions than the reference portfolio. The relationship between the standardized entropy $H(P^\star)$ and portfolio-weight diversification depends on the asset-return multivariate distribution. Indeed, as we first show in Section 5.9.4.1, the standardized entropy $H(P^\star)$ actually favors portfolio-weight diversification when assuming i.i.d. asset returns (except for possibly different means). However, the i.i.d. assumption is far from being realistic in practice. Instead, as we then show in Section 5.9.4.2, the empirical results point to the reverse relationship, in line with the above literature.

5.9.4.1 The case of i.i.d. asset returns

In this section, we show that the standardized entropy term $H(P^\star)$, appearing in the objective function related to the Gaussian target return in (5.13), favors portfolio-weight diversification when assuming i.i.d. asset returns. This can be intuitively understood from the core idea of the central limit theorem that “mixing Gaussianizes”. This idea is relatively similar to blind source separation where sources are unmixed by maximizing departure from Gaussianity; see Hyvärinen et al. (2001) and Appendix 3.8.2. As shown in Artstein et al. (2004), mixing more and more signals increases Gaussianity when using entropy as a Gaussianity measure, even when normalizing for the number of signals. Specifically, the Shannon entropy of $\frac{1}{\sqrt{N}} \sum_{i=1}^{N} X_i$, where the $X_i$ are assumed i.i.d. (except for possibly different means), is a non-decreasing sequence in $N$:

$$H\left( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} X_i \right) \geq H\left( \frac{1}{\sqrt{N-1}} \sum_{i=1}^{N-1} X_i \right). \tag{5.44}$$

Using this result, we arrive at the following proposition.

Proposition 5.5. Let $X_1, \dots, X_N$ be i.i.d. asset returns, except with possibly different means. Consider the following sequence of portfolios going from most concentrated to most diversified: $w^1 = (1, 0, \dots, 0)'$, $w^2 = (1/2, 1/2, 0, \dots, 0)'$, ..., $w^N = (1/N, \dots, 1/N)'$. Let $P_i^\star$ be the standardized return of the $i$th portfolio $w^i$ in this sequence. Then, we have that

$$H(P_N^\star) \geq H(P_{N-1}^\star) \geq \dots \geq H(P_1^\star), \tag{5.45}$$

with equality if and only if the asset returns are Gaussian.

Proof. Consider the standardized portfolio returns $P_{N-1}^\star$ and $P_N^\star$ in the sequence. Let $\sigma^2$ be the identical variance of each asset return $X_i$. By definition of the portfolio $w^{N-1}$, $P_{N-1}^\star$ is given by

$$P_{N-1}^\star = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} X_i}{\sqrt{\frac{1}{N-1} \sigma^2}} = \frac{1}{\sigma \sqrt{N-1}} \sum_{i=1}^{N-1} X_i.$$

Similarly, we find that

$$P_N^\star = \frac{1}{\sigma \sqrt{N}} \sum_{i=1}^{N} X_i.$$

From (5.44), we easily find that $H(P_N^\star) \geq H(P_{N-1}^\star)$. The sequence of inequalities in (5.45) follows by induction. It remains to show that all the entropies in (5.45) are equal if and only if the i.i.d. asset returns are Gaussian. Let us compute the entropy of $P_k^\star$ for an arbitrary $k$, knowing that the entropy of a Gaussian random variable $Z \sim \mathcal{N}(\mu, \sigma)$ is $H(Z) = \ln \sqrt{2\pi e \sigma^2}$:

$$H(P_k^\star) = H\left( \sum_{i=1}^{k} X_i \right) - \frac{1}{2} \ln(k\sigma^2) = \frac{1}{2} \ln(2\pi e) + \frac{1}{2} \ln(k\sigma^2) - \frac{1}{2} \ln(k\sigma^2) = \ln \sqrt{2\pi e},$$

which is a constant irrespective of $k$, thus proving the if part of the statement. We have equalities only if the asset returns are Gaussian because the entropy power inequality holds with equality only for Gaussian distributions (Dembo et al. 1991) and because Artstein et al. (2004)’s result hinges on the entropy power inequality.

Proposition 5.5 shows that the standardized entropy term $H(P^\star)$ in (5.13), which captures the degree of Gaussianity of the portfolio return, is positively affected by increased portfolio-weight diversification when asset returns are i.i.d.
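The monotonicity in (5.45) can be checked on a case where the entropy is available in closed form. The sketch below assumes, purely for illustration, i.i.d. unit-rate exponential asset returns, which are not meant to be realistic. The sum of k such returns follows a Gamma(k, 1) distribution with entropy k + ln Γ(k) + (1 - k)ψ(k), and standardizing by the standard deviation √k subtracts ½ ln k. The function names are illustrative.

```python
from math import e, lgamma, log, pi

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Digamma function evaluated at a positive integer k."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def standardized_entropy(k):
    """Shannon entropy of the standardized sum of k i.i.d. Exp(1) returns.

    The sum is Gamma(k, 1) with variance k, so dividing by sqrt(k)
    shifts the entropy by -0.5 * ln(k).
    """
    gamma_entropy = k + lgamma(k) + (1 - k) * digamma_int(k)
    return gamma_entropy - 0.5 * log(k)


gaussian_value = 0.5 * log(2 * pi * e)  # ln sqrt(2 pi e)
entropies = [standardized_entropy(k) for k in range(1, 11)]
for k, h in enumerate(entropies, start=1):
    print(f"k = {k:2d}   H(P_k*) = {h:.4f}")
print(f"Gaussian value: {gaussian_value:.4f}")
```

In this example the standardized entropy rises with the number of equally weighted assets and stays below the Gaussian value ln √(2πe), as the proposition predicts for non-Gaussian i.i.d. returns.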

5.9.4.2 Empirical point of view

To evaluate the link between entropy and portfolio-weight diversification empirically, we measure the portfolio-weight concentration of each portfolio strategy as

$$\frac{1}{RN} \sum_{r=1}^{R} \sum_{i=1}^{N} \left| w_{r,i} - 1/N \right|, \tag{5.46}$$

where $R = 72$ is the number of rolling windows and $w_{r,i}$ is the portfolio weight of asset $i$ in the $r$th rebalancing period.
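As a minimal sketch of the concentration measure in (5.46), the function below averages the absolute distance to 1/N over all periods and assets. The function name and the example weights are hypothetical and only serve to show the computation.

```python
def weight_concentration(weights):
    """Average absolute distance to 1/N over R periods and N assets.

    weights[r][i] is the weight of asset i in rebalancing period r.
    """
    n_periods = len(weights)
    n_assets = len(weights[0])
    total = sum(abs(w - 1.0 / n_assets) for row in weights for w in row)
    return total / (n_periods * n_assets)


# Hypothetical weights for R = 2 periods and N = 4 assets.
equally_weighted = [[0.25, 0.25, 0.25, 0.25]] * 2
long_short = [[1.5, -0.5, 0.5, -0.5], [1.0, 0.0, 0.0, 0.0]]

print(weight_concentration(equally_weighted))
print(weight_concentration(long_short))
```

The measure is zero for the equally weighted portfolio and grows as the positions become more extreme, including short positions.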

The results are reported in Table 5.8. Whether the minimum-divergence portfolio is more concentrated depends on the reference portfolio chosen. On the one hand, the MSR and G2MSR portfolios display a similar level of portfolio-weight concentration. This can be explained by the fact that the MSR portfolio often takes quite large positions because of its sensitivity to the mean-return vector, and thus, the KL divergence does not deem it favorable to take even more extreme positions to improve higher moments. On the other hand, when relying on MV as the reference portfolio, one can clearly see that the standardized-entropy term shrinks the MV portfolio toward a portfolio that is more concentrated. To illustrate this, we depict in Figure 5.7 boxplots of portfolio weights for the MV and G2MV portfolios on the 6BTM dataset. The boxplots show the distribution of portfolio weights over the 72 rebalancing periods. One can indeed see that the median portfolio weights are closer to 1/N for the MV portfolio than for the G2MV portfolio.

Thus, we confirm that portfolio-weight diversification, which is often seen as favorable in a mean-variance context to combat estimation risk (Bera and Park 2008, DeMiguel et al. 2009b, Tu and Zhou 2011), may not be advisable when taking higher moments (Shannon entropy) into account. To approach a Gaussian return distribution, more concentration seems instead necessary, contrary to what the central limit theorem, which relies on the independence assumption, may suggest at first sight.

Table 5.8: Portfolio-weight concentration of minimum-divergence versus reference portfolios

Portfolio | 6BTM | 25BTM | 6Prof | 25Prof | 10Ind | 30Ind | Average
MV | 0.65 | 0.22 | 0.77 | 0.23 | 0.22 | 0.10 | 0.36
G2MV | 0.98 | 0.29 | 1.03 | 0.32 | 0.33 | 0.14 | 0.51
MSR | 1.13 | 0.35 | 1.47 | 0.50 | 0.36 | 0.19 | 0.67
G2MSR | 1.14 | 0.37 | 1.44 | 0.48 | 0.46 | 0.21 | 0.68

Notes. The table reports the results for the portfolio-weight-diversification analysis of Appendix 5.9.4.2. We restrict to the MV and MSR reference portfolios and the Gaussian target return. We report the portfolio-weight concentration of these portfolios for the six datasets with the concentration measure in (5.46). The higher its value, the more the portfolio is distant from the equally weighted portfolio on average over the 72 rebalancing periods.

Figure 5.7: Boxplots of portfolio weights for the MV and G2MV portfolios

[Figure: boxplots of the portfolio weights w1 to w6; the vertical axis ranges from -4 to 4, with a reference line at 1/N.]

Notes. The figure depicts boxplots of portfolio weights, for the 6BTM dataset, for the MV reference portfolio (red left boxplots) and the resulting minimum-divergence portfolio with a Gaussian target return, G2MV (blue right boxplots); see Appendix 5.9.4.2. The figure is constructed using the Matlab boxplot function. The central line on each boxplot indicates the median, the box indicates the 25% and 75% quantiles, the whiskers extend to the most extreme data points that are not considered as outliers, and the outliers are represented with the + symbol.

Chapter 6

Conclusion

The purpose of this thesis is to propose several routes to address some of the challenges encountered in the field of portfolio selection, particularly related to the inclusion of higher-moment risk in portfolio-selection strategies. After reviewing the relevant literature in Chapter 1, we made four original proposals in Chapters 2 to 5. In this last chapter, we conclude with a summary of the main results in Section 6.1 as well as some open questions and challenges for future research in Section 6.2.

6.1 Summary of main results

Much of the research in portfolio selection remains grounded in the original mean-variance framework of Markowitz (1952). There are good reasons for that: it is intuitively clear for investors, it is mathematically tractable, and it enjoys the large body of statistics research related to the robust estimation of the first two moments of the multivariate Gaussian distribution. The purpose of this thesis is to go beyond the mean-variance framework and account for higher moments too. This is of fundamental importance given that investors are highly concerned about the downside risk they are exposed to when investing their capital. For example, Ang et al. (2006) show that the cross section of stock returns reflects a downside-risk premium of around 6% a year. Meanwhile, the distribution of asset returns is typically negatively skewed and leptokurtic, and it is thus dangerous to ignore higher-moment risk when judging the merits of asset allocations.

To carry out this task, our overarching framework lies in information theory and, in particular, one of its most central concepts introduced by Shannon (1948): entropy. Intuitively, entropy measures the amount of information conveyed on average by the observation of a random variable. Thus, the lower the entropy, the less information we get when observing the random variable, and the less “surprised” we are. Given that investors dislike uncertainty, wouldn’t they want to be least surprised by the future realizations of their portfolio return? This is what we investigate in Chapter 2, in which we study the properties of portfolios with minimum Rényi entropy. Although we show that the exponential Rényi entropy has close connections with the class of deviation risk measures of Rockafellar et al. (2016), its intuitive appeal with respect to higher-moment risk does not hold either theoretically, which we show via a Gram-Charlier expansion, or empirically. This is actually not surprising: given a fixed variance, (Shannon) entropy is maximized by the Gaussian distribution as this distribution is the result of adding many independent random variables. Thus, by minimizing entropy, one aims to get a distribution as far away from the Gaussian as possible, and this is often best realized by taking a lower skewness and a higher kurtosis than the minimum-variance portfolio.
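The maximum-entropy property of the Gaussian can be illustrated with closed-form Shannon entropies of a few unit-variance distributions. The choice of the Laplace and uniform distributions as comparison points is an assumption made for illustration only.

```python
from math import e, log, pi, sqrt

# Closed-form Shannon entropies of three distributions with unit variance.
entropy_unit_variance = {
    "Gaussian": 0.5 * log(2 * pi * e),      # ln sqrt(2 pi e sigma^2), sigma = 1
    "Laplace": 1 + log(2 * (1 / sqrt(2))),  # 1 + ln(2b), variance 2 b^2 = 1
    "uniform": log(sqrt(12)),               # ln(width), variance width^2 / 12 = 1
}

for name, h in sorted(entropy_unit_variance.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {h:.4f}")
```

Both the fat-tailed Laplace distribution and the thin-tailed uniform distribution have a lower entropy than the Gaussian, so minimizing entropy at a given variance moves the distribution away from Gaussianity without by itself guaranteeing thinner tails.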

Interestingly, entropy provides a natural way for retrieving independent factors from a mixture of them, a problem known as blind source separation and solved by independent component analysis (ICA); see Hyvärinen et al. (2001). Indeed, from the central limit theorem, a sum of independent random variables eventually approaches the Gaussian distribution, which has the maximal entropy as noted above. Inverting this phenomenon, one can retrieve independent factors from a mixture of them by minimizing entropy, as explained in Section 3.2.2. In portfolio selection, asset returns are in general not a mixture of independent factors, but ICA can still be used to retrieve the maximally independent factors underlying the asset returns. In turn, ICA provides a natural way of diversifying portfolios: wouldn’t investors want their risk to be diversified over maximally independent factors rather than any other choice of factors? We investigate this proposal in Chapter 3. This idea of diversifying risk over factors is known as factor-risk parity and was introduced by Meucci (2009). However, Meucci relied on principal components as factors, which are merely uncorrelated. Our first contribution in this chapter is to show that relying on principal components is arbitrary because any portfolio is a factor-variance-parity portfolio for some rotation of the principal components. Going from decorrelation to maximum independence solves this arbitrariness. Moreover, our second contribution is to show that the factor-variance-parity portfolio based on independent components is associated with a low return kurtosis, which is desirable in terms of downside risk, contrary to the minimum-entropy portfolio put forward in Chapter 2. Empirically, using variance and modified Value-at-Risk as risk measures, we find that shrinking the minimum-risk portfolio toward the IC-risk-parity portfolio is an investment strategy that performs impressively well in terms of Sharpe ratio and tail risk, while keeping a turnover similar to that of the minimum-risk portfolio.

Actually, independent components have a further important benefit in portfolio selection: they ease the estimation of the higher moments of portfolio returns. Indeed, being maximally independent, it is reasonable to neglect the estimation of their higher comoments, which drastically reduces the number of parameters involved in higher-moment portfolio strategies. This trick was proposed in Chapter 3 for factor-risk-parity purposes. In Chapter 4, we exploit the same trick in the more standard context where investors aim to minimize their risk, expressed in terms of moments. In particular, estimating the $k$th moment of the portfolio return requires estimating a number of parameters of the order $N^k$, where $N$ is the number of assets. Projecting the asset returns onto the first $K$ principal components yields an order of $K^k$, which still grows exponentially with the moment order $k$. With independent components, however, the number of parameters becomes linear in $K$. Compared to the literature, the main benefit of the proposed estimation strategy is that it is particularly well suited for high-dimensional settings. Indeed, efficient algorithms exist to perform ICA and, as we show, our dimension-reduction framework reduces the problem of finding $N$ optimal portfolio weights to that of finding only $K$ optimal factor exposures. Empirically, using variance and modified Value-at-Risk as risk measures, we find that compared to classical estimates, our sparse estimates based on ICA lead to portfolios with a much better out-of-sample performance and a much lower turnover, particularly in high dimension.
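The orders of magnitude above can be made concrete by counting parameters. The sketch below assumes that one counts the distinct entries of the symmetric comoment tensor of order k, which number C(N + k - 1, k) and are of order N^k, and that only the K marginal moments are kept for independent components. The values N = 100 and K = 10 and the function name are hypothetical.

```python
from math import comb


def n_comoments(n_series, order):
    """Distinct entries of the order-k comoment tensor of n_series returns."""
    return comb(n_series + order - 1, order)


N, K = 100, 10  # hypothetical numbers of assets and of retained factors
counts = {}
for k in (2, 3, 4):
    full = n_comoments(N, k)  # comoments of the N asset returns
    pca = n_comoments(K, k)   # comoments of K principal components
    ica = K                   # marginal moments of K independent components
    counts[k] = (full, pca, ica)
    print(f"k = {k}: assets {full:>9,d}   PCs {pca:>4,d}   ICs {ica}")
```

Under these assumptions, the fourth-order count drops from several million for the assets to a few hundred for the principal components, and to K for the independent components.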

As the previous chapters and the literature show, investors’ preferences are generally represented via different statistics of portfolio returns such as mean, variance, higher moments or downside risk. The classical way of summarizing those preferences is via the utility function, which associates a value to each possible realization of the portfolio return. Maximizing expected utility is supposed to fully represent the investor’s objective. The issue is that, in a non-Gaussian setting where higher moments cannot be ignored, choosing a sensible utility function is arduous. Actually, many common choices barely differ from Markowitz’s mean-variance portfolio, and many papers depart from expected utility in their higher-moment strategies. The purpose of Chapter 5 is then to capture investors’ preferences in a way that, we believe, is much more natural and intuitive than the utility function, and that truly accounts for higher-moment preferences. The way we do this is via a target-return distribution that the investor aims to approach. The optimal portfolio for the investor is then the minimum-divergence portfolio, whose return density is as close as possible to the target-return one. Once again, entropy allows us to operationalize this problem. Indeed, a natural way of measuring the divergence between the target-return and portfolio-return distributions is by measuring the increase of entropy, that is, the amount of lost information, when approximating the portfolio-return density via the target-return density. This is known as the Kullback-Leibler divergence (Kullback and Leibler 1951). We analyze this setting in the case where the target-return mean and variance are located on Markowitz’s efficient frontier, and the overall shape of the target-return density is governed by the generalized-normal distribution of Nadarajah (2005). We show theoretically that we recover mean-variance-efficient portfolios in two cases. First, when asset returns are Gaussian, which is a reassuring result. Second, when targeting a Dirac-delta distribution, as in that case the objective function amounts to fitting the target-return mean and minimizing the portfolio-return variance. Further, when targeting a Gaussian distribution, the fit to the higher moments is controlled via the entropy of standardized portfolio returns, which, contrary to the proposal in Chapter 2, we now aim to maximize in order to obtain a skewness and excess kurtosis close to zero. Finally, the empirical results are very supportive of the proposed minimum-divergence portfolio strategy. Taking several state-of-the-art mean-variance-efficient portfolios to fix the mean and variance of the generalized-normal distribution, we obtain portfolios with Sharpe ratios that are just as good, but with improved skewness, excess kurtosis and downside risk.

6.2 Open questions for future research

The research conducted in this thesis raises questions that would be interesting to tackle in the future. Here are those that seem especially relevant to us. In Section 6.2.1, we list specific questions linked to our four main chapters. In Section 6.2.2, we take a step back and discuss two broader challenges.

6.2.1 Specific questions

On “Minimum Rényi entropy portfolios”

• We show in Proposition 2.1 that the exponential Rényi entropy is in general not subadditive. In particular, our counterexamples show that subadditivity can be violated for very fat-tailed distributions such as the Lévy distribution or the Student t distribution with degrees of freedom ν < 1. If one embraces entropy as a measure of uncertainty, this means that diversification can actually increase uncertainty, and thus, is undesirable. This raises the question: what are the benefits of diversification for fat-tailed asset returns? On the one hand, some may argue that our counterexamples are too extreme (none of the moments exist) and that diversification remains desirable for the observed asset-return distributions. On the other hand, several papers show that diversification can have an adverse effect on the skewness and kurtosis of portfolio returns; see Simkowitz and Beedles (1978), Conine and Tamarkin (1981), Tang and Choi (1998) and Desmoulins-Lebeault and Kharoubi-Rakotomalala (2012). The latter find that “[...] the increasing number of stocks in the portfolio, far from diversifying away the negative asymmetry of returns, actually amplifies skewness” (p.1991), and that “[...] an increase in portfolio size first has a very positive effect, diminishing the kurtosis brutally to reach a minimum for a size of three or four stocks. However, after this minimum, the kurtosis re-increases, first rapidly and then more slowly [...]” (p.1992). Thus, the consideration of higher moments challenges the desirability of subadditive risk measures as they may yield portfolios that are too diversified with respect to higher moments.

• In Section 2.5.2.1, we explain that van Es (1992) proved that the asymptotic bias of the m-spacings estimator of Shannon entropy in (2.23) is constant and only depends on m, not on the underlying density. Does this hold as well for the m-spacings estimator of Rényi entropy in (2.22)? This is an important problem given the numerous data-driven applications of entropy, such as signal processing and independent component analysis. Hegde et al. (2005) derive an m-spacings estimator of Rényi entropy that is similar to ours and note: “While we leave the calculation of this bias to an extended journal version of this paper, in many practical applications, such as ICA or other entropy-based optimal signal processing scenarios, this bias does not affect the solution, since it is independent of the true data distribution [...]” (p.136). The extended journal version never came and the problem remains open. We have performed simulations using different distributions. For α close to one and α higher than one, we observe that the estimator converges quite fast and that the bias for a very large sample size T is very similar for all distributions and several values of m. However, for values of α close to zero, the convergence of the estimator becomes very slow, in line with the finding of Acharya et al. (2015) for discrete distributions. In that case, numerical simulations are not informative.

• Could generalizations of Rényi entropy provide better risk criteria than Rényi entropy? For example, Koski and Persson (1992) propose the following generalized form of exponential entropy, which depends on parameters α and β:

$$H^{\exp}_{\alpha,\beta} := \left( \frac{\int_{\Omega_X} f_X^{\alpha}(x)\,dx}{\int_{\Omega_X} f_X^{\beta}(x)\,dx} \right)^{\frac{1}{\beta-\alpha}}. \tag{6.1}$$

It recovers Rényi entropy $H^{\exp}_{\alpha}$ as $\beta \to 1$. What is the impact of β on higher moments? Can we obtain more desirable effects on kurtosis by playing with β?

• The exponential Rényi entropy is a symmetric measure of uncertainty, just like the standard deviation. Thus, it is not ideal for portfolio-selection applications as investors treat gains and losses differently. One straightforward way to address this is, inspired by the semivariance of Markowitz (1959), to integrate the entropy function only on negative returns. The semi exponential Rényi entropy would then be defined as

$$SH^{\exp}_{\alpha} := \left( \int_{\inf X}^{0} f_X^{\alpha}(x)\,dx \right)^{\frac{1}{1-\alpha}}. \tag{6.2}$$

It would be worth investigating the properties of the semi entropy in (6.2) to see whether it keeps the properties of the original exponential Rényi entropy.
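The two candidate measures in (6.1) and (6.2) are straightforward to evaluate numerically for a given density. The sketch below is an illustration only: it assumes a standard Gaussian density, a simple midpoint rule, a truncation of the infinite support at ±40, and α = 0.5, and the function names are hypothetical. It checks that (6.1) with β = 1 matches the closed-form exponential Rényi entropy of the Gaussian, √(2π) α^(-1/(2(1-α))), and evaluates the semi version on negative returns.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, sigma=1.0):
    return exp(-0.5 * (x / sigma) ** 2) / (sigma * sqrt(2 * pi))


def integral_of_power(pdf, power, lower, upper, steps=20_000):
    """Midpoint-rule approximation of the integral of pdf(x)**power."""
    h = (upper - lower) / steps
    return h * sum(pdf(lower + (i + 0.5) * h) ** power for i in range(steps))


def generalized_exp_entropy(pdf, alpha, beta, lower, upper):
    """(int f^alpha / int f^beta)^(1 / (beta - alpha)), following (6.1)."""
    num = integral_of_power(pdf, alpha, lower, upper)
    den = integral_of_power(pdf, beta, lower, upper)
    return (num / den) ** (1.0 / (beta - alpha))


def semi_exp_renyi_entropy(pdf, alpha, lower):
    """(int of f^alpha over negative returns)^(1 / (1 - alpha)), following (6.2)."""
    return integral_of_power(pdf, alpha, lower, 0.0) ** (1.0 / (1.0 - alpha))


ALPHA, BOUND = 0.5, 40.0  # BOUND stands in for the infinite support
renyi = generalized_exp_entropy(gaussian_pdf, ALPHA, 1.0, -BOUND, BOUND)
closed_form = sqrt(2 * pi) * ALPHA ** (-1.0 / (2 * (1 - ALPHA)))
semi = semi_exp_renyi_entropy(gaussian_pdf, ALPHA, -BOUND)

print(f"beta = 1 in (6.1): {renyi:.4f}   closed form: {closed_form:.4f}")
print(f"semi entropy (6.2): {semi:.4f}")
```

For a density that is symmetric around zero, the integral over negative returns is half of the full integral, so in this example the semi entropy equals the full one multiplied by 2^(-1/(1-α)). The informative cases for the open question are therefore asymmetric densities.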

On “Optimal portfolio diversification via independent component analysis”

• Robustness is crucial to ensure a solid out-of-sample portfolio performance. Therefore, robust PCA and ICA approaches, and their impact on the out-of-sample performance and stability of factor-risk-parity portfolios, are worth investigating. At the dimension-reduction step, the PCs are used to transform the N asset returns into K uncorrelated factors. Compared with PCs found from the eigendecomposition of the sample covariance matrix, we expect that relying on shrinkage estimators of the covariance matrix would generate a more robust orthogonal decomposition, especially in high dimension. This is one possible alternative among numerous robust PCA algorithms; see Hallin et al. (2014). At the ICA step, FastICA with robust entropy estimation already displays a satisfying sensitivity to outliers, but more sophisticated robust ICA algorithms have been developed; see, for instance, the RobustICA algorithm of Zarzoso and Comon (2010).

• ICA relies on the assumption that the observed data, here the asset returns, have been generated by a linear transformation of independent source factors. Extensions have been proposed to deal with nonlinear transformations; see the founding works of Hyvärinen and Pajunen (1999) and Taleb and Jutten (1999). Nonlinear ICA would retrieve factors that are more in line with the market reality, where many factors can blend together in complex ways to produce the observed time series of asset returns. In its general form, nonlinear ICA amounts to finding factors Y obtained via a transformation $\phi : \mathbb{R}^N \to \mathbb{R}^K$ of the asset returns X,

\[ Y = \phi(X), \tag{6.3} \]

such that the components of Y are mutually independent. However, this is a highly ill-posed problem: “A fundamental characteristic of the nonlinear ICA problem is that in the general case, solutions always exist, and they are highly nonunique. One reason for this is that if x and y are two independent random variables, any of their functions f(x) and g(y) are also independent” (Hyvärinen et al. 2001, p. 316). Thus, without restrictions on the function ϕ, many functions can lead to perfect independence. One well-known approach that addresses this non-uniqueness issue is the post-nonlinear mixture of Taleb and Jutten (1999). It assumes that the transformation is of the form

\[ Y_i = \phi_i\left( \sum_{j=1}^{N} a_{ij} X_j \right). \tag{6.4} \]

In this setup, Taleb and Jutten (1999) show that the factors are independent and well identified, up to a change of sign and permutation as in the linear case. Another possibility would be to reduce dimension from N to K via nonlinear PCA, and then find the maximally independent rotation of the PCs via linear ICA. However, in such a case, the ICs are not guaranteed to be independent. In any case, it is worth investigating the benefits of nonlinear transformations with regard to the performance of IC-risk-parity portfolios. On the one hand, diversifying risk across truly independent factors would enhance diversification. On the other hand, the estimation of the nonlinear transformation is much more complex than that of the linear one, and thus the diversification benefits may not materialize out of sample. In addition, nonlinear ICA is often computationally demanding, which restricts its applicability to the real-world high-dimensional datasets used in finance.
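The dimension-reduction step discussed in the first bullet of this list can be sketched as follows: the sample covariance matrix is replaced by a shrinkage estimator before the eigendecomposition. The diagonal shrinkage target, the fixed intensity `delta = 0.3`, the simulated returns and the function names are all assumptions chosen for illustration; a data-driven intensity would normally be used in practice.

```python
import numpy as np

def shrunk_covariance(returns, delta):
    """Convex combination of the sample covariance and its diagonal.

    Both the target and the fixed intensity `delta` are illustrative choices.
    """
    sample_cov = np.cov(returns, rowvar=False)
    target = np.diag(np.diag(sample_cov))
    return (1.0 - delta) * sample_cov + delta * target

def principal_components(cov, n_components):
    """Leading eigenvalues and eigenvectors of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

# Simulated data: T observations of N asset returns, reduced to K factors.
rng = np.random.default_rng(0)
T, N, K = 60, 20, 3
returns = 0.01 * rng.standard_normal((T, N)) @ rng.standard_normal((N, N))

sigma_shrunk = shrunk_covariance(returns, delta=0.3)
eigvals, V = principal_components(sigma_shrunk, K)

# Factors obtained by projecting demeaned returns on the K leading PCs.
factors = (returns - returns.mean(axis=0)) @ V
```

Note that shrinking toward a multiple of the identity matrix would leave the eigenvectors of the sample covariance matrix unchanged and only alter the eigenvalues, which is why a diagonal target is used in this sketch.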

On “Robust portfolio selection using sparse estimation of comoment tensors”

• Our sparse estimation strategy focuses on the estimation of the moments of portfolio returns. However, one can exploit the near independence of the ICs more generally to estimate any objective function that depends on the distribution of the portfolio returns. Indeed, the reduced portfolio return, $\hat{P} = w' A^{\dagger} Y^{\dagger}$, is nothing but a linear combination of the ICs. Thus, assuming the latter are independent, one can compute the density $f_{\hat{P}}$ via convolution of the densities $f_{Y_i^{\dagger}}$, or via the inverse Fourier transform of the product of the characteristic functions of the weighted ICs. This would generalize the proposed approach to objective functions that are not directly functions of the moments, such as the quantiles. However, it requires estimating the marginal densities of the ICs, which is not an easy task when sample sizes are not large. This might worsen the robustness of the approach.

• When using variance as the risk measure, we obtain a closed-form solution in (4.15) for the estimate of the minimum-variance portfolio. It amounts to estimating the covariance matrix Σ by $\tilde{\Sigma} = V \Lambda V'$, which depends on the number of PCs, K. Assuming that the eigenvectors and eigenvalues are estimated via sample averages and that asset returns are Gaussian, could we use the statistical properties of $\tilde{\Sigma}$ to derive an analytical expression for the value of K for which the portfolio $w_{MV,K} = \tilde{\Sigma}^{-1} 1 / (1' \tilde{\Sigma}^{-1} 1)$ has the minimum return variance under the true distribution of asset returns? This would be in a similar vein to DeMiguel et al. (2013) who, assuming asset returns are Gaussian, derive analytical expressions for the optimal shrinkage intensity between the sample estimate of the optimal portfolio and a target estimate, using several calibration criteria such as minimum variance. Achieving this would require studying the distributional properties of the eigendecomposition of a Wishart-distributed covariance matrix; see Takemura and Sheena (2005) for example. Choosing K in such a way should further reduce risk compared to common statistical criteria, such as the percentage of retained variance used in Section 4.4.1, that are disconnected from the portfolio-selection application.

• As we show in the empirical study of Section 4.4, our sparse comoment tensor estimator based on ICA improves upon the classical sample estimator, except when the number of assets is low. This suggests that it may be worth using the tensor estimate $\hat{\hat{M}}_k(X)$ in (4.21) as a shrinkage target, and relying on the shrinkage estimator

\[ M_{k,\delta}(X) := (1-\delta)\, M_k(X) + \delta\, \hat{\hat{M}}_k(X). \tag{6.5} \]

In doing so, we would balance the unbiasedness of the sample estimator $M_k(X)$ and the sparsity and robustness of $\hat{\hat{M}}_k(X)$. This should improve performance, particularly when the number of assets is low relative to the sample size, in which case the optimal shrinkage intensity δ is expected to be quite small. The shrinkage target $\hat{\hat{M}}_k(X)$ is related to those used in Ledoit and Wolf (2004a) and Boudt et al. (2018), who respectively shrink the sample covariance and coskewness matrices toward a sparse target that assumes the asset returns are independent. By assuming that the ICs are independent, rather than the asset returns, our target is expected to be less biased.
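The convolution idea from the first bullet of this list can be illustrated numerically. The sketch below computes the density of a weighted sum of independent components by repeated discrete convolution on a regular grid. Gaussian marginals are assumed purely so that the result can be compared with a known closed form; the weights, scales, grid and function names are hypothetical, and in an application the marginal densities would have to be estimated from data.

```python
import numpy as np

def normal_pdf(x, sigma):
    """Density of N(0, sigma^2)."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def density_of_weighted_sum(pdfs, coefs, x):
    """Density of sum_i coefs[i] * Y_i for independent Y_i with densities pdfs[i].

    x is a regular grid; the returned grid grows with each convolution.
    """
    dx = x[1] - x[0]
    density = None
    start = 0.0
    for pdf, c in zip(pdfs, coefs):
        scaled = pdf(x / c) / abs(c)  # density of c * Y_i evaluated on x
        if density is None:
            density = scaled
        else:
            # Riemann-sum approximation of the convolution integral
            density = np.convolve(density, scaled, mode="full") * dx
        start += x[0]
    grid = start + dx * np.arange(density.size)
    return grid, density

sigmas = [1.0, 2.0, 0.5]   # assumed scales of three independent components
coefs = [0.5, -0.3, 0.8]   # assumed loadings of the portfolio on the components
pdfs = [lambda x, s=s: normal_pdf(x, s) for s in sigmas]

x = np.linspace(-15.0, 15.0, 3001)
grid, dens = density_of_weighted_sum(pdfs, coefs, x)

# Closed form available in the Gaussian case only: the variances add up.
total_sigma = np.sqrt(sum((c * s) ** 2 for c, s in zip(coefs, sigmas)))
closed_form = normal_pdf(grid, total_sigma)
```

Once the density of the reduced portfolio return is available on a grid, quantities that are not functions of the moments, such as quantiles, can be read off its cumulative sum.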

On “Portfolio selection: A target-distribution approach”

• Our focus in this chapter is the generalized-normal distribution for the target return, which is a symmetric distribution. It is worth investigating the case of a target-return distribution with non-zero skewness, both theoretically and with regard to the realized skewness on empirical data. For example, a popular skewed distribution used in many financial applications is the skew-normal distribution of Azzalini (1985), T ∼ SN(α, β, ξ), with density function

\[ f_T(x) = \frac{2}{\beta}\, \varphi\left( \frac{x-\alpha}{\beta} \right) \Phi\left( \xi\, \frac{x-\alpha}{\beta} \right). \tag{6.6} \]

The parameter ξ ∈ ℝ controls the skewness: ξ > 0 implies positive skewness and ξ < 0 implies negative skewness. For this target-return distribution, the objective function to maximize, C(w; α, β, ξ), becomes

\[ C(w; \alpha, \beta, \xi) = C(w; \alpha, \beta) + \mathbb{E}\left[ \ln \Phi\left( \xi\, \frac{P-\alpha}{\beta} \right) \right], \tag{6.7} \]

where C(w; α, β) is the objective function for the Gaussian target return in (5.13). The second term in (6.7) is however analytically intractable, even in the special case where the portfolio return is Gaussian. Can we approximate this term via some sort of expansion to study it analytically? Alternatively, can we use another skewed target-return distribution for which the objective function behaves more nicely, for example among the exponential family? The shifted log-normal distribution (Chen et al. 2018), which belongs to the exponential family, could be an appealing candidate.

• We rely on the Kullback-Leibler divergence because it is very popular and mathematically convenient. However, many other divergence measures could be employed, such as the Rényi divergence (Rényi 1961), defined as

\[ \langle f_P | f_T \rangle_{\alpha} = \frac{1}{\alpha-1} \ln \mathbb{E}\left[ \left( \frac{f_P(P)}{f_T(P)} \right)^{\alpha-1} \right]. \tag{6.8} \]

From the shape of the bivariate function $(x/y)^{\alpha-1}$, one can see that decreasing α increases the relative contribution of the regions where $f_P < f_T$. Thus, if the investor wants to avoid portfolio returns that have more probability mass in the tails than the target return, he would be advised to select a low value of α. Although the Rényi divergence is less convenient than the KL divergence for obtaining mathematical results, it is still worth investigating the impact of the parameter α on the tail-risk behavior of the minimum-divergence portfolio. Preliminary empirical results show that using the Rényi divergence with a low value of α indeed leads to a lower kurtosis than the KL divergence.

• The minimum-divergence portfolio is similar to the minimum-divergence estimation procedure used in statistics; see the textbook of Basu et al. (2011), and Toma and Leoni-Aubin (2015) for an application in portfolio selection. The principle of this estimation procedure is to estimate some vector of parameters θ by minimizing the divergence between the assumed parametric distribution $f_\theta$ and the empirical distribution $\hat{f}_T$ given a sample of size T. The maximum-likelihood estimator of θ corresponds in particular to using the KL divergence, and maximum likelihood is known to be non-robust to outliers. In contrast, using a divergence measure other than the KL divergence can yield estimators that are more robust. For example, Basu et al. (1998) consider the density-power divergence

\[ \langle f_X | f_Y \rangle_{\beta} = \int \left( f_Y^{1+\beta}(x) - \left( 1 + \frac{1}{\beta} \right) f_X(x)\, f_Y^{\beta}(x) + \frac{1}{\beta}\, f_X^{1+\beta}(x) \right) dx \tag{6.9} \]

with β > 0. It recovers the KL divergence for β → 0. As they show, by taking β > 0, one can strike an appealing trade-off between robustness and efficiency of the minimum-divergence estimator of θ. Specifically, the higher the β, the more the estimator associates low weights to regions where the two distributions $f_\theta$ and $\hat{f}_T$ differ markedly, hence increasing robustness to outliers. In particular, Basu et al. (1998, p. 550) find that estimators with β higher than but close to zero (such as 0.2) “[...] have strong robustness properties with little loss in asymptotic efficiency relative to maximum likelihood [...].” It would be interesting to study whether this phenomenon also applies to the minimum-divergence portfolio, that is, whether using a divergence measure as in (6.9) with β > 0 could improve robustness and turnover compared to using the KL divergence obtained for β → 0.

• We propose to minimize the divergence between the portfolio-return distribution and a target-return distribution whose mean and variance match a mean-variance-efficient reference portfolio. Intuitively, this amounts to shrinking the reference portfolio toward a portfolio whose higher moments are closer to those of the target distribution. Can we make this intuition more formal? Can the minimum-divergence portfolio be equivalently written as a shrinkage portfolio? Alternatively, as shrinkage and Bayesian estimation are closely connected (DeMiguel et al. 2009a), can we show that the minimum-divergence portfolio corresponds to a Bayesian procedure with a specific prior? Such results would help to understand the properties of our minimum-divergence portfolio strategy.
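To make the role of α in the Rényi divergence (6.8) more concrete, the sketch below evaluates it by numerical integration, using the equivalent form $\frac{1}{\alpha-1} \ln \int f_P^{\alpha}(x) f_T^{1-\alpha}(x)\,dx$. The two Gaussian densities with equal variance, the integration grid and the function names are assumptions made for illustration; this case is chosen only because the divergence then has the simple closed form $\alpha(\mu_P-\mu_T)^2/(2\sigma^2)$, which allows the code to be checked.

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return -0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def _trapezoid(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def renyi_divergence(logpdf_p, logpdf_t, alpha, x):
    """Renyi divergence of order alpha (alpha != 1) between f_P and f_T.

    The integrand f_P^alpha * f_T^(1 - alpha) is built in the log domain
    to avoid 0/0 problems where both densities underflow.
    """
    log_integrand = alpha * logpdf_p(x) + (1.0 - alpha) * logpdf_t(x)
    return float(np.log(_trapezoid(np.exp(log_integrand), x)) / (alpha - 1.0))

def kl_divergence(logpdf_p, logpdf_t, x):
    """Kullback-Leibler divergence, the limit of the above as alpha -> 1."""
    lp, lt = logpdf_p(x), logpdf_t(x)
    return _trapezoid(np.exp(lp) * (lp - lt), x)

# Hypothetical portfolio-return and target-return densities.
logpdf_p = lambda x: normal_logpdf(x, 0.0, 1.0)
logpdf_t = lambda x: normal_logpdf(x, 1.0, 1.0)
x = np.linspace(-30.0, 30.0, 60001)

d_half = renyi_divergence(logpdf_p, logpdf_t, 0.5, x)
d_near_one = renyi_divergence(logpdf_p, logpdf_t, 0.999, x)
d_two = renyi_divergence(logpdf_p, logpdf_t, 2.0, x)
kl = kl_divergence(logpdf_p, logpdf_t, x)
```

In this example the divergence increases with α and approaches the KL divergence as α → 1. For orders above one and densities with different variances the integral may diverge, so the admissible range of α should be checked case by case.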

6.2.2 General questions

The above specific questions are directly linked to the contributions of our four main chapters. In addition, the writing of this thesis made us aware of more general questions and challenges that are relevant for the broad field of portfolio selection and information theory. We discuss two of them here. The first one concerns independent component analysis. The second one concerns the notion of diversification in portfolio selection.

6.2.2.1 Properties of independent components

The theory underlying ICA has mainly been developed under the “independence assumption”. That is, the observed data consist of a linear transform of independent sources as in (4.17). In that case, one can be guaranteed to be able to retrieve the original sources. For instance, Delfosse and Loubaton (1995) show that every local optimum of their kurtosis-based objective function corresponds to the extraction of one source factor. The same property applies to the minimum-range approach of Vrins et al. (2007) when dealing with bounded sources. These are nice theoretical results, but they hold only when the independence assumption is met. This assumption is justified for the original signal-processing applications of ICA, such as blind source separation, where the observed signals are known to correspond to a mixture of distinct physical processes.

However, ICA is increasingly applied in social sciences, such as economics and finance, in which case the independence assumption is generally violated. The independent components are then no longer independent but, instead, correspond to minimum-mutual-information factors or, equivalently, to most-non-Gaussian factors. This is true in particular for our applications of ICA in Chapters 3 and 4. The problem is that very little is known about the properties of the ICA objective function, such as mutual information, when the independence assumption does not hold. What are the properties of minimum-mutual-information factors? Under which conditions are they unique? Under which conditions does the mutual-information criterion have no spurious local minima? In FastICA, how does the function G impact the properties of the factors obtained when independence is not met? Resolving these blind spots may accelerate the adoption of ICA, which, outside of signal processing, is much less frequently used than PCA, for example.

6.2.2.2 The notion of diversification

In portfolio selection, there is a striking mismatch between how central the notion of diversification is on the one hand, and the lack of consensus on its meaning on the other hand. We discuss here two important questions concerning diversification, namely: what is diversification, and why is it useful?

• What is diversification? As reviewed by Koumou and Dionne (2019), many diversification measures have been put forward in the literature, and they measure diversification in different ways. In Markowitz’s mean-variance theory, diversification is synonymous with variance reduction. It can also be measured in terms of portfolio weights, quantifying how far from equally weighted they are. Or, it can quantify the effect of correlation on risk reduction. As reviewed in Section 1.3 and Chapter 3, a popular way of interpreting diversification is also in terms of risk contribution of assets or factors. This shows that diversification is far from being a well-defined concept. Instead, it encompasses several distinct notions, which all seem intuitively justified on their own. To add some structure, it would be worth formalizing the properties that a “good” diversification measure should fulfill. Koumou and Dionne (2019) mark a first step in that direction. They propose a list of nine axioms that define coherent diversification measures and show that the classes of diversification measures developed by Embrechts et al. (2009) and Tasche (2008), which depend on the selected risk measure ϱ, are coherent given some technical conditions on ϱ. This is a good step forward, but they do not study diversification measures based on risk contributions, which are very popular.

• Why is diversification useful? Investors ultimately care about different facets of their portfolio-return distribution, such as the moments. These preferences can be captured via a utility function or a target-return distribution as proposed in Chapter 5. Suppose now that the investor has determined his optimal portfolio according to his preferences but that, according to some diversification measure, this portfolio is not well diversified. Why would the investor select another, more diversified portfolio instead of the optimal one? He would do so if and only if he had a preference for more diversification, which would mean that the chosen utility function or target-return distribution was not fully representative of his preferences. This raises the question: in which dimension is diversification useful? As mentioned in Section 1.3.3, it can be desirable with respect to out-of-sample performance because well-diversified portfolios can be less sensitive to estimation risk. Ignoring these practical concerns, it is more difficult to say why, theoretically, diversification would be desirable. In particular, how does diversification impact the distribution of portfolio returns? Naturally, the answer to this question depends on how diversification is measured. Theorem 3.2 in Chapter 3 brings a partial response: if diversification is measured in terms of the variance contributions of independent components, an increase in diversification reduces the portfolio-return kurtosis (under some assumptions).
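To illustrate how differently diversification can be quantified, the sketch below computes one weight-based and one risk-contribution-based measure. Both use the exponential of the Shannon entropy of a vector of shares, which is only one possible choice among many and is an assumption of this example, as are the function names and the numerical inputs; it is not claimed to be the measure used in the preceding chapters.

```python
import numpy as np

def variance_contributions(exposures, factor_variances):
    """Shares of portfolio variance attributed to uncorrelated factors.

    With uncorrelated factors, portfolio variance is the sum of
    exposure_i^2 * variance_i, so the shares are well defined and sum to one.
    """
    contrib = np.asarray(exposures, float) ** 2 * np.asarray(factor_variances, float)
    return contrib / contrib.sum()

def effective_number(shares):
    """exp(Shannon entropy) of a vector of non-negative shares summing to one."""
    p = np.asarray(shares, float)
    p = p[p > 0]  # zero shares contribute nothing to the entropy
    return float(np.exp(-np.sum(p * np.log(p))))

# Weight-based view: an equally weighted portfolio of four assets.
n_weights = effective_number([0.25, 0.25, 0.25, 0.25])

# Risk-based view: same exposures, but one factor is far more volatile.
shares = variance_contributions([0.25, 0.25, 0.25, 0.25], [0.09, 0.01, 0.01, 0.01])
n_risk = effective_number(shares)

# Fully concentrated case: all variance comes from a single factor.
n_single = effective_number(variance_contributions([1.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
```

The same portfolio is perfectly diversified according to the weight-based measure but much less so according to the risk-contribution measure, which is one concrete instance of the lack of consensus discussed above.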

These questions are of central importance as diversification is ubiquitous in the portfolio strategies implemented by practitioners, and in the regulations written by risk-management committees such as the Basel committee. Nonetheless, diversification-based portfolio strategies have so far been largely dismissed by academics. In our humble opinion, this represents an important gap to bridge in future academic research.

References

Ablin, P., Cardoso, J., & Gramfort, A. (2018). Faster independent component analysis by preconditioning with Hessian approximations. IEEE Transactions on Signal Processing, 66(15):4040–4049.
Acerbi, C., & Tasche, D. (2003). Expected shortfall: A natural coherent alternative to Value at Risk. Economic Notes, 31(2):379–388.
Acharya, J., Orlitsky, A., Suresh, A., & Tyagi, H. (2015). The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, San Diego, 1855–1869.
Alexander, G., & Baptista, A. (2002). Economic implications of using a mean-VaR model for portfolio selection: A comparison with mean-variance analysis. Journal of Economic Dynamics & Control, 26(7-8):1159–1193.
Alexander, G., & Baptista, A. (2004). A comparison of VaR and CVaR constraints on portfolio selection with the mean-variance model. Management Science, 50:1261–1273.
Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society Series B (Methodological), 18(1):131–142.
Amigo, J., Balogh, S., & Hernandez, S. (2018). A brief review of generalized entropies. Entropy, 20(11):813.
Anderson, R., Bianchi, S., & Goldberg, L. (2012). Will my risk parity strategy outperform? Financial Analysts Journal, 68:75–93.
Ang, A., Bekaert, G., & Liu, J. (2005). Why stocks may disappoint. Journal of Financial Economics, 76(3):471–508.
Ang, A., Chen, J., & Yu, H. (2006). Downside risk. Review of Financial Studies, 19(4):1191–1239.
Ardia, D., Bolliger, G., Boudt, K., & Gagnon-Fleury, J. (2017). The impact of covariance misspecification in risk-based portfolios. Annals of Operations Research, 254(1-2):1–16.
Ardia, D., & Boudt, K. (2015a). Implied expected returns and the choice of a mean–variance efficient portfolio proxy. Journal of Portfolio Management, 41(4):68–81.
Ardia, D., & Boudt, K. (2015b). Testing equality of modified Sharpe ratios. Finance Research Letters, 13:97–104.


Arnold, B., Balakrishnan, N., & Nagaraja, H. (1992). A First Course in Order Statistics. New York: Wiley.
Artstein, S., Ball, K. M., Barthe, F., & Naor, A. (2004). Solution of Shannon’s problem on the monotonicity of entropy. Journal of the American Mathematical Society, 17(4):975–982.
Artzner, P., Delbaen, F., Eber, J., & Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9(3):203–228.
Asness, C., Frazzini, A., & Pedersen, L. (2012). Leverage aversion and risk parity. Financial Analysts Journal, 68:47–59.
Athayde, G., & Flores, R. (2004). Finding a maximum skewness portfolio: A general solution to three-moments portfolio choice. Journal of Economic Dynamics and Control, 28(7):1335–1352.
Avramov, D., & Zhou, G. (2010). Bayesian portfolio analysis. Annual Review of Financial Economics, 2:25–47.
Azzalini, A. (1985). A class of distributions which includes the Normal ones. Scandinavian Journal of Statistics, 12(2):171–178.
Bai, X., Scheinberg, K., & Tutuncu, R. (2016). Least-squares approach to risk parity in portfolio selection. Quantitative Finance, 16(3):357–376.
Baillie, R., & Bollerslev, T. (1992). Prediction in dynamic models with time-dependent conditional variances. Journal of Econometrics, 52(1–2):91–113.
Baitinger, E., Dragosch, A., & Topalova, A. (2017). Extending the risk parity approach to higher moments: Is there any value added? Journal of Portfolio Management, 43(2):24–36.
Basak, S., & Shapiro, A. (2001). Value-at-risk-based risk management: Optimal policies and asset prices. Review of Financial Studies, 14(2):371–405.
Basseville, M. (2013). Divergence measures for statistical data processing – An annotated bibliography. Signal Processing, 93(4):621–633.
Basu, A., Harris, I., Hjort, N., & Jones, M. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85:549–59.
Basu, A., Shioya, H., & Park, C. (2011). Statistical Inference: The Minimum Distance Approach. Boca Raton: Chapman & Hall.
Behr, P., Guettler, A., & Miebs, F. (2013). On portfolio optimization: Imposing the right constraints. Journal of Banking and Finance, 37:1232–1242.
Beirlant, J., Dudewicz, E., Györfi, L., & van der Meulen, E. (1997). Non-parametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39.
Bera, A., & Park, S. (2008). Optimal portfolio diversification using maximum entropy principle. Econometric Reviews, 27:484–512.
Berger, J. (1978). Minimax estimation of a multivariate normal mean under polynomial loss. Journal of Multivariate Analysis, 8(2):173–180.
Bernard, C., Boyle, P., & Vanduffel, S. (2014). Explicit representation of cost-efficient strategies. Finance, 35(2):5–55.
Bertsimas, D., & Shioda, R. (2009). Algorithm for cardinality-constrained quadratic optimization. Computational Optimization and Applications, 43(1):1–22.
Best, M., & Grauer, R. (1991). On the sensitivity of mean–variance-efficient portfolios to changes in asset means: Some analytical and computational results. Review of Financial Studies, 4(2):315–342.
Bodnar, T., Parolya, N., & Schmid, W. (2013). On the equivalence of quadratic optimization problems commonly used in portfolio theory. European Journal of Operational Research, 229(3):637–644.
Bodnar, T., Mazur, S., & Okhrin, Y. (2017). Bayesian estimation of the global minimum variance portfolio. European Journal of Operational Research, 256(1):292–307.
Bouchaud, J., Potters, M., & Aguilar, J. (1997). Missing information and asset allocation. Available on arXiv: arxiv.org/pdf/cond-mat/9707042.pdf.
Boudt, K., Carl, P., & Peterson, B. (2013). Asset allocation with conditional value-at-risk budgets. Journal of Risk, 15(3):39–68.
Boudt, K., Cornilly, D., & Verdonck, T. (2018). A coskewness shrinkage approach for estimating the skewness of linear combinations of random variables. Journal of Financial Econometrics, 22:1–23.
Boudt, K., Cornilly, D., & Verdonck, T. (2019). Nearest comoment estimation with unobserved factors. Journal of Econometrics, forthcoming.
Boudt, K., Lu, W., & Peeters, B. (2015). Higher order comoments of multifactor models and asset allocation. Finance Research Letters, 13:225–233.
Boudt, K., & Peeters, B. (2013). Asset allocation with risk factors. Quantitative Finance Letters, 1(1):60–65.
Boudt, K., Peterson, B., & Croux, C. (2008). Estimation and decomposition of downside risk for portfolios with non-normal returns. Journal of Risk, 11(2):79–103.

Brandt, M., Santa-Clara, P., & Valkanov, R. (2009). Parametric portfolio policies: Exploiting characteristics in the cross section of equity returns. Review of Financial Studies, 22(9):3411–3447.
Briec, W., Kerstens, K., & Jokung, O. (2007). Mean-variance-skewness portfolio performance gauging: A general shortage function and dual approach. Management Science, 53(1):135–149.
Briec, W., Kerstens, K., & Van de Woestyne, I. (2013). Portfolio selection with skewness: A comparison of methods and a generalized one fund result. European Journal of Operational Research, 230(2):412–421.
Brody, D., Buckley, I., & Constantinou, I. (2007). Option price calibration from Rényi entropy. Physics Letters A, 366:298–307.
Campbell, L. (1966). Exponential entropy as a measure of extent of a distribution. Z. Wahrsch., 5:217–225.
Campbell, R., Huisman, R., & Koedijk, K. (2001). Optimal portfolio selection in a Value-at-Risk framework. Journal of Banking and Finance, 25:1789–1804.
Campbell, J., Lo, A., & MacKinlay, A. (1997). The Econometrics of Financial Markets. New Jersey: Princeton University Press.
Carmichael, B., Koumou, G., & Moran, K. (2015). Unifying portfolio diversification measures using Rao’s quadratic entropy. CIRANO technical report.
Carroll, R., Conlon, T., Cotter, J., & Salvador, E. (2017). Asset allocation with correlation: A composite trade-off. European Journal of Operational Research, 262(3):1164–1180.
Cavenaile, L., & Lejeune, T. (2012). A note on the use of modified value-at-risk. Journal of Alternative Investments, 14(4):79–83.
Chalabi, Y., & Würtz, D. (2012). Portfolio optimization based on divergence measures. MPRA working paper, 43332.
Chan, L., Karceski, J., & Lakonishok, J. (1999). On portfolio optimization: Forecasting covariances and choosing the risk model. Review of Financial Studies, 12(5):937–974.
Chen, Y., Härdle, W., & Spokoiny, V. (2007). Portfolio value at risk based on independent component analysis. Journal of Computational and Applied Mathematics, 205(1):594–607.
Chen, Y., Härdle, W., & Spokoiny, V. (2010). GHICA—Risk analysis with GH distributions and independent components. Journal of Empirical Finance, 17(2):255–269.
Chen, L., He, S., & Zhang, S. (2011). When all risk-adjusted performance measures are the same: In praise of the Sharpe ratio. Quantitative Finance, 11(10):1439–1447.

Chen, R., Hsieh, P., & Huang, J. (2018). It is time to shift log-normal. Journal of Derivatives, 25(3):89–103.
Chopra, V., & Ziemba, W. (1993). The effect of errors in means, variances, and covariances on optimal portfolio choice. Journal of Portfolio Management, 19(2):6–11.
Choueifaty, Y., & Coignard, Y. (2008). Toward maximum diversification. Journal of Portfolio Management, 35(1):40–51.
Clarke, R., De Silva, H., & Thorley, S. (2013). Risk parity, maximum diversification, and minimum variance: An analytic perspective. Journal of Portfolio Management, 39(3):39–53.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3):287–314.
Conine, T. E., & Tamarkin, M. J. (1981). On diversification given asymmetry in returns. Journal of Finance, 36(5):1143–1155.
Cont, R. (2001). Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1:223–236.
Cornish, E., & Fisher, R. (1938). Moments and cumulants in the specification of distributions. Review of the International Statistical Institute, 5(4):307–320.
Cover, T., & Thomas, J. (2006). Elements of Information Theory (2nd ed.). New Jersey: John Wiley & Sons.
Cremers, J., Kritzman, M., & Page, S. (2005). Optimal hedge fund allocations—Do higher moments matter? Journal of Portfolio Management, 31(3):70–81.
Csiszar, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar., 2:299–318.
Dahlquist, M., Farago, A., & Tédongap, R. (2017). Asymmetries and portfolio choice. Review of Financial Studies, 30(2):667–702.
Daníelsson, J., Jorgensen, B., Samorodnitsky, G., Sarma, M., & de Vries, C. (2013). Fat tails, VaR and subadditivity. Journal of Econometrics, 172:283–291.
Das, S., & Uppal, R. (2004). Systemic risk and international portfolio choice. Journal of Finance, 59(6):2809–2834.
Deguest, R., Martellini, L., & Meucci, A. (2013). Risk parity and beyond—From asset allocation to risk allocation decisions. EDHEC working paper.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45(1):59–83.

Dembo, A., Cover, T. M., & Thomas, J. A. (1991). Information theoretic inequalities. IEEE Transactions on Information Theory, 37(6):1501–1518.
DeMiguel, V., Garlappi, L., Nogales, F., & Uppal, R. (2009a). A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5):798–812.
DeMiguel, V., Garlappi, L., & Uppal, R. (2009b). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? Review of Financial Studies, 22(5):1915–1953.
DeMiguel, V., Martin-Utrera, A., & Nogales, F. (2013). Size matters: Optimal calibration of shrinkage estimators for portfolio selection. Journal of Banking & Finance, 37(8):3018–3034.
DeMiguel, V., & Nogales, F. (2009). Portfolio selection with robust estimation. Operations Research, 55(5):798–812.
De Nard, G., Ledoit, O., & Wolf, M. (2019). Factor models for portfolio selection in large dimensions: The good, the better and the ugly. Journal of Financial Econometrics, 33.
Desmoulins-Lebeault, F., & Kharoubi-Rakotomalala, C. (2012). Non-Gaussian diversification: When size matters. Journal of Banking & Finance, 36(7):1987–1996.
Dionisio, A., Menezes, R., & Mendes, A. (2006). An econophysics approach to analyse uncertainty in financial markets: An application to the Portuguese stock market. The European Physical Journal B, 50:161–164.
Dittmar, R. (2002). Nonlinear pricing kernels, kurtosis preference, and evidence from the cross section of equity returns. Journal of Finance, 57(1):369–403.
Duin, R. P. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25(11):1175–1179.
Dybvig, P. (1988). Inefficient dynamic portfolio strategies or how to throw away a million dollars in the stock market. Review of Financial Studies, 1(1):67–88.
El Ghaoui, L., Oks, M., & Oustry, F. (2003). Worst-case Value-At-Risk and robust portfolio optimization: A conic programming approach. Operations Research, 51(4):543–556.
Embrechts, P., Furrer, H., & Kaufmann, R. (2009). Different kinds of risk. In Andersen, T., Davis, R., Kreiss, J., & Mikosch, T. Handbook of Financial Time Series, 729–751, Berlin: Springer.
Erdogmus, D., & Xu, D. (2010). Rényi’s entropy, divergence and their nonparametric estimators. In Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives, Information Science and Statistics, 47–102, New York: Springer.
van Erven, T., & Harremoës, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820.
van Es, B. (1992). Estimating functionals related to a density by a class of statistics based on spacings. Scandinavian Journal of Statistics, 19(1):61–72.
Everitt, B., & Dunn, G. (2010). Applied Multivariate Data Analysis (2nd ed.). New Jersey: Wiley.
Everitt, B., & Hand, D. (1981). Finite Mixture Distributions. London: Chapman and Hall.
Fabozzi, F., Giacometti, R., & Tsuchida, N. (2016). Factor decomposition of the Eurozone sovereign CDS spreads. Journal of International Money and Finance, 65:1–23.
Fabozzi, F., Huang, D., & Zhou, G. (2010). Robust portfolios: Contributions from operations research and finance. Annals of Operations Research, 176(1):191–220.
Fama, E. (1963). Mandelbrot and the stable Paretian hypothesis. Journal of Business, 36(4):420–429.
Fan, J., Liao, Y., & Liu, H. (2016). An overview of the estimation of large covariance and precision matrices. Econometrics Journal, 19(1):C1–C32.
Fang, K., & Zhang, Y. (1990). Generalized Multivariate Analysis. New York: Springer.
Favre, L., & Galeano, J. (2002). Mean-modified value-at-risk optimization with hedge funds. Journal of Alternative Investments, 5(2):21–25.
Flores, Y., Bianchi, R., Drew, M., & Trück, S. (2017). The diversification delta: A different perspective. Journal of Portfolio Management, 43(4):112–124.
Frahm, G., & Memmel, C. (2010). Dominating estimators for minimum-variance portfolios. Journal of Econometrics, 159(2):289–302.
García-Ferrer, A., González-Prieto, E., & Peña, D. (2012). A conditionally heteroskedastic independent factor model with an application to financial stock returns. International Journal of Forecasting, 28(1):70–93.
Ghalanos, A., Rossi, E., & Urga, G. (2015). Independent factor autoregressive conditional density model. Econometric Reviews, 45:594–616.
Golan, A., & Maasoumi, E. (2008). Information theoretic and entropy methods: An overview. Econometrics Review, 27:317–328.
Gormsen, N., & Jensen, C. (2019). Higher-moment risk. Available on SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3069617.
Green, J., Hand, J., & Zhang, X. (2013). The supraview of return predictive signals. Review of Accounting Studies, 18:692–730.

Gregoriou, G., & Gueyie, J. (2003). Risk-adjusted performance of funds of hedge funds using a modified Sharpe ratio. Journal of Wealth Management, 6(3):77–83. Guidolin, M., & Timmermann, A. (2008). International asset allocation under regime switching, skew, and kurtosis preferences. Review of Financial Studies, 21(2):889–935. Hallin, M., Paindaveine, D., & Verdebout, T. (2014). Efficient R-estimation of principal and common principal components. Journal of the American Statistical Association, 109(507):1071–1083. Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley. Hanoch, G., & Levy, H. (1970). Efficient portfolio selection with quadratic and cubic utility. Journal of Business, 43(2):181–189. Harvey, C., Liechty, J., Liechty, M., & M¨uller,P. (2010). Portfolio selection with higher moments. Quantitative Finance, 10(5):469–485. Harvey, C., & Siddique, A. (2000). Conditional skewness in asset pricing tests. Journal of Finance, 55(3):1263–1295. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Prediction, Inference and Data Mining (2nd ed.). New York: Springer. Haugh, M., Iyengar, G., & Song, I. (2017). A generalized risk budgeting approach to portfolio construction. Journal of Computational Finance, 21(2):29–60. Hegde, A., Lan, T., & Erdogmus, D. (2005). Order statistics based estimator for R´enyi entropy. IEEE Workshop on Machine Learning for Signal Processing, 335–339. Hitaj, A., Mercuri, L., & Rroji, E. (2015). Portfolio selection with independent component analysis. Finance Research Letters, 15:146–159. Hlavackova-Schindler K., Palus, M., Vejmelka, M., & Bhattacharya, J. (2007). Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441:1–46. Huang, X. (2012). An entropy method for diversified fuzzy portfolio selection. International Journal of Fuzzy Systems, 14(1):160–165. Huber, P. 
(1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101. Huber, P. (2004). Robust Statistics. New York: John Wiley & Sons. Hyv¨arinen,A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634. References 207

Hyv¨arinen,A., Karhunen, J., & Oja, E. (2001). Independent Component Analysis. New York: Wiley. Hyv¨arinen,A., & Pajunen, P. (1999). Nonlinear independent component analysis: Exis- tence and uniqueness results. Neural Networks, 12(3):429–439. Ingersoll, J. (1975). Multidimensional security pricing. Journal of Financial and Quanti- tative Analysis, 10(5):785–798. Jagannathan, R., & Ma, T. (2003). Risk reduction in large portfolios: Why imposing the wrong constraints helps. Journal of Finance, 58(4):1651–1684. James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, LeCam, L., & Neyman, J. 361-379. Berkeley: University of California Press. Jarque, C., & Bera, A. (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6(3):255–259. Jaynes, E. (1957). Information theory and statistical mechanics. Physical Review, 106(4):620–630. Jean, W. (1971). The extension of portfolio analysis to three or more parameters. Journal of Financial and Quantitative Analysis, 6(1):505–515. Jiang, L., Wu, K., & Zhou, G. (2018). Asymmetry in stock comovements: An entropy approach. Journal of Financial and Quantitative Analysis, 53(4):1479–1507. Johnson, O., & Vignat, C. (2007). Some results concerning maximum R´enyi entropy distributions. Annales de l’Institut Henri-Poincar´e(B) Probab. Statist., 43(3):339–351. Jondeau, E., Jurczenko, E., & Rockinger, M. (2018). Moment component analysis: An il- lustration with international stock markets. Journal of Business & Economic Statistics, 36(4):576–598. Jondeau, E., & Rockinger, M. (2006). Optimal portfolio allocation under higher moments. European Financial Management, 12(1):29–55. Jorion, P. (1986). Bayes-Stein estimation for portfolio analysis. Journal of Financial and Quantitative Analysis, 21(3):279–292. Jurczenko, E., & Maillet, B. (2006). 
Multi-Moment Asset Allocation and Pricing Models. West Sussex: Wiley. Kan, R., & Zhou, G. (2007). Optimal portfolio choice with parameter uncertainty. Journal of Financial and Quantitative Analysis, 42(3):621–656. Kan, R., & Zhou, G. (2017). Modeling non-normality using multivariate t: Implications for asset pricing. China Finance Review International, 7:2–32. References 208

Kim, W., Fabozzi, F., Cheridito, P., & Fox, C. (2014). Controlling portfolio skewness and kurtosis without directly optimizing third and fourth moments. Economics Letters, 122:154–158. Kirchner, U., & Zunckel, C. (2011). Measuring portfolio diversification. Available on arXiv: arxiv.org/pdf/1102.4722.pdf. Kolm, P., T¨ut¨unc¨u,R., & Fabozzi, F. (2014). 60 years of portfolio optimization: Practical challenges and current trends. European Journal of Operational Research, 234(2):356– 371. Kolmogorov, A. (1930). On the notion of mean. In Tikhomirov, V. (1991). Selected Works of A. N. Kolmogorov, Kluwer Academic Publishers, 144–146. Koski, T., & Persson, L. (1992). Some properties of generalized exponential entropies with application to data compression. Information Sciences, 62:103–132. Koumou, G., & Dionne, G. (2019). Coherent diversification measures in portfolio theory: An axiomatic foundation. Available on SSRN: https://papers.ssrn.com/sol3/pa pers.cfm?abstract_id=3351423. Kourtis, A., Dotsis, G., & Markellos, R. (2012). Parameter uncertainty in portfolio selection: Shrinking the inverse covariance matrix. Journal of Banking & Finance, 36(9):2522–2531. Kozak, S., Nagel, S., & Santosh, S. (2019). Shrinking the cross section. Journal of Financial Economics, forthcoming. Kullback, S., & Leibler, R. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86. Lai, T. (1991). Portfolio selection with skewness: A multiple-objective approach. Review of Quantitative Finance and Accounting, 1:293–305. Lassance, N., DeMiguel, V., & Vrins, F. (2019). Optimal portfolio diversification via independent component analysis. London Business School working paper. Available on SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3285156. Lassance, N., & Vrins, F. (2019a). Minimum R´enyi entropy portfolios. Annals of Operations Research, https://doi.org/10.1007/s10479-019-03364-2. Lassance, N., & Vrins, F. (2019b). 
Robust portfolio selection using sparse estimation of comoment tensors. Louvain Finance working paper. Available on SSRN: https: //papers.ssrn.com/sol3/papers.cfm?abstract_id=3455400. Lassance, N., & Vrins, F. (2019c). Portfolio selection: A target-distribution approach. References 209

Learned-Miller, E., & Fisher, J. (2003). ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295. Ledoit, O., & Wolf, M. (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603–621. Ledoit, O., & Wolf, M. (2004a). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411. Ledoit, O., & Wolf, M. (2004b). Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management, 30(4):110–119. Ledoit, O., & Wolf, M. (2008). Robust performance hypothesis testing with the Sharpe ratio. Journal of Empirical Finance, 15(5):850–859. Ledoit, O. and Wolf, M. (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. Review of Financial Studies, 30:4349–4388. Lee, W. (2011). Risk-based asset allocation: A new answer to an old question? Journal of Portfolio Management, 37(4):11–28. Levy, H. (1969). A utility function depending on the first three moments. Journal of Finance, 24(4):715–719. Levy, H., & Levy, M. (2014). The benefits of differential variance-based constraints in portfolio optimization. European Journal of Operational Research, 234:372–381. Levy, H., & Markowitz, H. (1979). Approximating expected utility by a function of mean and variance. American Economic Review, 69:308–317. Lim, A., Shanthikumar, J., & Vahn, G. (2011). Conditional value-at-risk in portfolio optimization: Coherent but fragile. Operations Research Letters, 39(3):163–171. Lohre, H., Neugebauer, U., & Zimmer, C. (2012). Diversified risk parity strategies for equity portfolio selection. Journal of Investing, 21(3):111–128. Lohre, H., Opfer, H., & Orszag, G. (2014). Diversifying risk parity. Journal of risk, 16(5):53–79. Lwin, K., Qu, R., & MacCarthy, B. (2017). Mean-VaR portfolio optimization: A non- parametric approach. 
European Journal of Operational Research, 260(2):751–766. Maillard, S., Roncalli, T., & Teiletche, J. (2010). The properties of equally weighted risk contribution portfolios. Journal of Portfolio Management, 36(4):60–70. Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of Business, 36(4):417–492. Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1):77–91. References 210

Markowitz, H. (1959). Portfolio selection: Efficient diversification of investment. New York: John Wiley & Sons. Markowitz, H. (1999). The early history of portfolio theory: 1600–1960. Financial Analysts Journal, 55(4):5–16. Markowitz, H. (2014). Mean–variance approximations to expected utility. European Journal of Operational Research, 234(2):346–355. Martellini, L., & Ziemann, V. (2010). Improved estimates of higher-order comoments and implications for portfolio selection. Review of Financial Studies, 23(4):1467–1502. Massacci, D. (2017) Tail risk dynamics in stock returns: Links to the macroeconomy and global markets connectedness. Management Science, 63(9):2773–3145. Merton, R. (1971). Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3(4):373–413. Merton, R. (1972). An analytic derivation of the efficient portfolio frontier. Journal of Financial and Quantitative Analysis, 7(4):1851–1872. Merton, R. (1980). On estimating the expected return on the market: An exploratory investigation. Journal of Financial Economics, 8(4):323–361. Meucci, A. (2009). Managing diversification. Risk, 22(5):74–79. Meucci, A., Santangelo, A., & Deguest, R. (2015). Risk budgeting and diversification based on optimized uncorrelated factors. Risk, 11(29):70–75. Michaud, R. (1989). The Markowitz optimization enigma: Is optimized optimal? Financial Analysts Journal, 45(1):31–42. Nadarajah, S. (2005). A generalized normal distribution. Journal of Applied Statistics, 32(7):685–694. von Neumann, J., & Morgenstern, O. (1953). Theory of Games and Economic Behavior. Princeton University Press: New Jersey. Olivares-Nadal, A., & DeMiguel, V. (2018). Technical note—A robust perspective on transaction costs in portfolio optimization. Operations Research, 66(3):733–739. Ormos, M., & Zibriczky, D. (2014). Entropy-based financial asset pricing. Plos One, 9(12), e115742. Parzen, E. (1962). On estimation of a probability density function and mode. 
Annals of Mathematical Statistics, 33(3):1065–1076. Pham, D., & Vrins, F. (2005). Local minima of information-theoretic criteria in blind source separation. IEEE Signal Processing Letters, 12(11):788–791. References 211

Pham, D., Vrins, F., & Verleysen, M. (2008). On the risk of using R´enyi’s entropy for blind source separation. IEEE Transactions on Signal Processing, 56(10):4611–4620. Philippatos, G., & Wilson, C. (1972). Entropy, market risk, and the selection of efficient portfolios. Applied Economics, 4(3):209–220. Poddig, T., & Unger, A. (2012). On the robustness of risk-based asset allocations. Financial Markets and Portfolio Management, 26(3):369–401. Pola, G. (2016). On entropy and portfolio diversification. Journal of Asset Management, 17(4):218–228. Poon, S., Rockinger, M., & Tawn, J. (2004). Extreme value dependence in financial markets: Diagnostics, models, and financial implications. Review of Financial Studies, 17(2):581–610. Pressley, A. (2010). Elementary Differential Geometry. London: Springer. Qian, E. (2005). Risk parity portfolios: Efficient portfolios through true diversification. Panagora. Qian, E. (2006), On the financial interpretation of risk contributions: Risk budgets do add up. Journal of Investment Management, 4:1–11. R´enyi, A. (1961). On measures of entropy and information. Fourth Berkeley Symposium on Mathematical Statistics and Probability, 547–561. Rockafellar, R., Uryasev, S., & Zabarankin, M. (2006). Generalized deviations in risk analysis. Finance and Stochastics, 10:51–74. Roncalli, T. (2013). Introduction to Risk Parity and Budgeting. Boca Raton: Chapman & Hall, CRC Financial Mathematics Series. Roncalli, T., & Weisang, G. (2015). Risk parity portfolios with risk factors. Quantitative Finance, 16(3):377–388. Roy, A. (1952). Safety first and the holding of assets. Econometrica, 20(3):431–449. Royston, J. (1982). Expected normal order statistics (exact and approximate). Journal of the Royal Statistical Society Series C (Applied Statistics), 31(2):161–165. Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scan- dinavian Journal of Statistics, 9(2):65–78. Schreiner, J. (1980). 
Portfolio revision: A turnover-constrained approach. Financial Management, 9(1):67–75. Scott, R., & Horvath, P. (1980). On the direction of preference for moments of higher order than the variance. Journal of Finance, 35(4):915–919. References 212

Shannon, C. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27(3):379–423. Sharpe, W. (1963). A simplified model for portfolio analysis. Management Science, 9(2):277–293. Sharpe, W. (1994). The Sharpe Ratio. Journal of Portfolio Management, 21(1):49–58. Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall. Simkowitz, M., & Beedles, W. (1978). Diversification in a three-moment world. Journal of Financial and Quantitative Analysis, 13(5):927–941. Simkowitz, M., & Beedles, W. (1980). Asymmetric stable distributed security returns. Journal of the American Statistical Association, 75(370):306–312. Slater, L. (1972). Confluent hypergeometric functions. In Abramowitz, M., & Stegun, I. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover Publications. Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the 3rd Berkeley Symposium on Probability and Statistics, 1:197–206, Berkeley: University of California Press. Takemura, A., & Sheena, Y. (2005). Distribution of eigenvalues and eigenvectors of Wishart matrix when the population eigenvalues are infinitely dispersed and its application to minimax estimation of covariance matrix. Journal of Multivariate Analysis, 94(2):271–299. Taleb, A., & Jutten, C. (1999). Source separation in post-nonlinear mixtures. IEEE Transactions on Signal Processing, 47(10):2807–2820. Tang, G., & Choi, D. (1998). Impact of diversification on the distribution of stock returns: International evidence. Journal of Economics and Finance, 22(2-3):119–127. Tasche, D. (2008). Capital allocation to business units and sub-portfolios: the Euler principle. In Resti, A. Pillar II in the New Basel Accord: The Challenge of Economic Capital, Risk Books. Terrell, D., & Scott, D. (1992). Variable kernel density estimation. Annals of Statistics, 20(3):1236–1265. Theodossiou, P. 
(1998). Financial data and the skewed generalized t distribution. Man- agement Science, 44(12):1650–1661. Toma, A., & Leoni-Aubin, S. (2015). Robust portfolio optimization Using pseudodistances. Plos one, 10(10), e0140546. References 213

Tu, J., & Zhou, G. (2011). Markowitz meets Talmud: A combination of sophisticated and naive diversification strategies. Journal of Financial Economics, 99(1):204–215. Ugray, Z., Lasdon, L., Plummer, J., Glover, F., Kelly, J., & Mart´ı,R. (2007). Scatter search and local NLP solvers: A multistart framework for global optimization. INFORMS Journal on Computing, 19(3):328–340. Ullah, A. (1996). Entropy, divergence and distance measures with econometric applications. Journal of Statistical Planning and Inference, 49(1):137–162. Usta, I., & Kantar, Y. (2011). Mean-variance-skewness-entropy measures: A multi- objective approach for portfolio selection. Entropy, 13:117–133. Vanderveken, R. (2019). Bayesian approach to portfolio selection. Master’s thesis, UCLou- vain. Vasicek, O. (1976). A test for normality based on entropy. Journal of the Royal Statistical Society Series B (Methodological), 38(1):54–59. Vaz-de Melo, B., & Camara, R. (2003). Robust modeling of multivariate financial data. Coppead Working Paper Series 355, Federal University at Rio de Janeiro, Rio de Janeiro, Brazil. Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41:321–327. Vermorken, M., Medda, F., & Schroder, T. (2012). The diversification delta: A higher- moment measure for portfolio diversification. Journal of Portfolio Management, 39(1):67–74. van Vliet, P. (2018) Low volatility needs little trading. Journal of Portfolio Management, 44(3):33–42. Vrins, F. (2007). Contrast Properties of Entropic Criteria for Blind Source Separation: A Unifying Framework based on Information-Theoretic Inequalities. Louvain-la-Neuve: Presses Universitaires de Louvain. Vrins, F., Lee J., & Verleysen, M. (2007). A minimum-range approach to blind extraction of bounded sources. IEEE Transactions on Neural Networks and Learning Systems, 18(3):809–822. Vrins, F., Pham, D., & Verleysen, M. (2007). 
Mixing and non-mixing local minima of the entropy contrast for blind source separation. IEEE Transactions on Information Theory, 53(3):1030–1042. Vrins, F., & Verleysen, M. (2005). On the entropy minimization of a linear mixture of variables for source separation. Signal Processing, 85(5):1029–1044. References 214

Wachowiak, M., Smolikova, R., Tourassi, G., & Elmaghraby, A. (2005). Estimation of generalized entropies with sample spacing. Pattern Analysis and Applications, 8:95–101. Wahba, G. (1975). Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation. Annals of Statistics, 3(1):15–29. Wand, M., & Jones, M. (1995). Kernel Smoothing. London: Chapman & Hall. Winkelbauer, A. (2014). Moments and absolute moments of the normal distribution. Available on arXiv: https://arxiv.org/abs/1209.4340. Xu, H., Caraminis, C., & Mannor, S. (2016). Statistical optimization in high dimensions. Operations Research, 64(4):958–979. Yu, J., Lee, W., & Chiou, W. (2014). Diversified portfolios with different entropy measures. Applied Mathematics and Computation, 241:47–63. Zarzoso, V., & Comon, P. (2010). Robust independent component analysis by itera- tive maximization of the kurtosis contrast with algebraic optimal step size. IEEE Transactions on Neural Networks, 21(2):248–261. Zhang, J., & Wang, X. (2008). Robust normal reference bandwidth for kernel density estimation. Statistica Neerlandica, 63(1):13–23. Zhou, R., Cai, R., & Tong, G. (2013). Applications of entropy in finance: A review. Entropy, 15:4909–4931. Zografos, K., & Nadarajah, S. (2003). Formulas for R´enyi information and related measures for univariate distributions. Information Sciences, 155(1-2):119–138. Zografos, K., & Nadarajah, S. (2005). Expressions for R´enyi and Shannon entropies for multivariate distributions. Statistics & Probability Letters, 71(1):71–84. Louvain School of Management Doctoral Thesis

Information-theoretic approaches to portfolio selection

Nathan LASSANCE

Ever since Harry Markowitz introduced modern portfolio theory in 1952, a plethora of papers have been written on the mean-variance investment problem. However, because asset returns are non-Gaussian, the mean and variance are insufficient to represent their full distribution, which also depends on higher moments. Higher-moment portfolio selection is more complex: a smaller literature has been dedicated to this problem, and no consensus has emerged on how investors should allocate their wealth when higher moments cannot be ignored. Among the proposed alternatives, researchers have recently considered information theory, and entropy in particular, as a new framework to tackle this problem. Entropy provides an appealing criterion because it measures the amount of randomness embedded in a random variable from the shape of its density function, thus accounting for all moments. The application of information theory to portfolio selection is, however, still nascent, and much remains to be explored. In this thesis, we therefore aim to explore the portfolio-selection problem from an information-theoretic angle, accounting for higher moments.

Nathan Lassance obtained bachelor’s and master’s degrees in Business Engineering, and a CEMS master’s degree in International Management, from the Louvain School of Management, UCLouvain, in Belgium. He is currently a research fellow at the Fonds de la Recherche Scientifique (F.R.S.-FNRS). During his PhD studies between 2016 and 2020, he was a visiting researcher at the London Business School, presented at several national and international conferences, and published part of his research in two scientific journals. His research focuses on portfolio selection, information theory, and risk management.

UNIVERSITÉ CATHOLIQUE DE LOUVAIN www.uclouvain.be