sustainability

Article A Sustainable Quantitative Selection Strategy Based on Dynamic Factor Adjustment

Yi Fu 1, Shuai Cao 1 and Tao Pang 2,* 1 School of and Business, Shanghai Normal University, Shanghai 200234, China; [email protected] (Y.F.); [email protected] (S.C.) 2 Department of Mathematics, North Carolina State University, Raleigh, NC 27695-8205, USA * Correspondence: [email protected]

 Received: 7 April 2020; Accepted: 11 May 2020; Published: 13 May 2020 

Abstract: In this paper, we consider a sustainable quantitative stock selection strategy using some machine learning techniques. In particular, we use a random forest model to dynamically select factors for the training set in each period to ensure that the factors that can be selected in each period are the optimal factors in the current period. At the same time, the classification probability prediction (CPP) of stock returns is performed. Historical back-testing using Chinese stock market data shows that the proposed CPP quantitative stock selection strategy performs better than the traditional machine learning stock selection methods, and it can outperform the market index over the same period in most back-testing periods. Moreover, this strategy is sustainable in all market conditions, such as a bull market, a bear market, or a volatile market.

Keywords: stock selection; machine learning; classification probability prediction; back-testing

1. Introduction

In modern investing, algorithmic trading is getting more and more attention from individual and institutional traders. “Algorithmic trading is a method of executing orders using automated pre-programmed trading instructions accounting for variables such as time, price, and volume” (https://en.wikipedia.org/wiki/Algorithmic_trading). It considers market observable variables such as time, price, and volume, and sends instructions to the market based on a preset algorithm. On the one hand, algorithmic trading spares traders from repeatedly observing the market and manually sending trading instructions; on the other hand, it prevents traders’ decisions from being disturbed by subjective emotions. According to a May 2019 report from Research and Markets, “The researchers forecast the global algorithmic trading market size to grow from USD 11.1 billion in 2019 to USD 18.8 billion by 2024, at a CAGR of 11.1% during 2019–2024. The major growth drivers of the algorithmic trading market include the increasing demand for fast and effective order execution, and reducing transaction costs” (https://www.researchandmarkets.com/reports/4770543/).

With the development of new technologies such as machine learning, current algorithmic trading includes not only the automatic sending of transaction instructions, but also automatic decision-making by the algorithm on transaction time, transaction objects, and number of transactions. Quantitative stock selection, as an important part of algorithmic trading, focuses on using various algorithms to select stock combinations in order to achieve a benchmark return rate. Quantitative stock selection is a popular academic research area. Fama and French (1993) [1], Lakonishok (1994) [2], and Song (1994) [3] established a linear model of stock excess returns, and proposed that the excess returns can be well explained by current stock prices, book value of equity, and .
Compared with the classic linear multi-factor models, the machine learning model pays more attention to the prediction ability of the model. It can capture more detailed market signals and obtain more stable excess returns by constructing a nonlinear relationship between the prediction target and the factors. Patel et al. (2015) [4] studied and compared the performance of four prediction models: artificial neural network (ANN), support vector machine (SVM), random forest (RF), and naive Bayes. Their results show that the overall performance of the random forest model is better than that of the other three prediction models. Liu et al. (2017) [5] proposed a convolutional neural network and long short-term memory (CNN-LSTM) model to analyze the quantitative strategy of stock selection. In their study, the CNN-LSTM neural network model could be successfully applied to the formulation of quantitative strategies and achieved better returns than basic strategies and benchmark indexes. Li and Zhang (2018) [6] used the XGBoost model to establish a dynamic weighted multi-factor stock selection strategy. They used the XGBoost machine learning method to predict the information coefficients (ICs) of various factors. The empirical results showed that the XGBoost model is effective in predicting the ICs, and that the dynamic weights based on the XGBoost model can improve the performance of multi-factor stock selection strategies. Yang and Chen (2019) [7] combined stock forecasting and stock selection to form a new hybrid stock selection method. Based on a research sample of the A-share stock market in China, they showed that the novel hybrid method is superior to the traditional methods in market returns. Chen and Ge (2019) [8] studied stock price movement prediction based on LSTM networks, and compared the attention LSTM (AttLSTM) model with the LSTM model. Their results verify the effectiveness of the attention mechanism in the LSTM-based prediction method.

Sustainability 2020, 12, 3978; doi:10.3390/su12103978 www.mdpi.com/journal/sustainability
Although much work on quantitative models and processes has been done, there are still some areas that can be improved. First, in the setting of prediction targets, previous studies often used the stock return, or whether the price goes up or down, as the prediction target. However, the return rate often contains noise, and a two-class target (up or down) discards much of the available information. Second, in factor selection, previous studies often selected factors statically, but factors are usually valid only for a certain period of time. Therefore, the strategy design needs to select factors dynamically, e.g., eliminating failed ones and introducing effective ones. In this paper, we propose a sustainable quantitative stock selection strategy that uses RF to estimate the importance of the factors on the training set of each period and adjusts the factors dynamically. The factors are sorted in descending order of importance, and the cumulative importance of the selected factors must reach 80%, which ensures that the factors selected in each period are the most important ones. Then, we use the XGBoost or RF model to classify each stock into five fixed ranges. For each yield range, we sort the stocks in descending order of probability, and take the top 20 most likely stocks into the stock pool for purchase. We call this a classification probability of prediction (CPP) strategy. The back-testing results from November 2013 to December 2019 show that the stock selection strategy of the XGBoost or RF CPP method can significantly outperform the China Securities Index 300 (CSI 300). Moreover, we find that the XGBoost CPP method performs better than the RF CPP method in terms of returns. Finally, the proposed strategy is a sustainable investment strategy in the sense that it works well over a long time period that consists of bear market, bull market, and volatile market periods.

2. The Basic Idea of CPP Quantitative Stock Selection Strategy Design

The general steps of the CPP quantitative stock selection strategy design were as follows (see also Figure 1). The first step was to use all stocks in the Chinese A-share market (excluding special-treatment “ST” stocks and new stocks listed for less than 60 days) as the stock pool, and to classify the stocks based on their monthly rate of return. In particular, we classified each stock into five ranges (see Table 1). We considered nine broad categories: quality factors, fundamental factors, emotional factors, growth factors, risk factors, stock factors, momentum factors, technical factors, and style factors. Then, we selected 45 factors from the 9 categories as the initial factor pool. The factors in this article came from JoinQuant’s factor library. Table 2 shows the 45 factors in the model factor pool of this article. These factors were dynamically screened into the model by the random forest (RF) model.
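The labeling in the first step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `return_range` is hypothetical, and the treatment of returns that fall exactly on a boundary (10%, 5%, 0%, −10%) is an assumption, since Table 1 does not state which side is inclusive.

```python
def return_range(monthly_return: float) -> int:
    """Map a monthly rate of return (0.07 means 7%) to a range label 1..5.

    Boundary handling is an assumption: a value exactly on a threshold
    is assigned to the lower-return range.
    """
    if monthly_return > 0.10:
        return 1  # Range 1: above 10%
    if monthly_return > 0.05:
        return 2  # Range 2: 5-10%
    if monthly_return > 0.0:
        return 3  # Range 3: 0-5%
    if monthly_return > -0.10:
        return 4  # Range 4: -10-0%
    return 5      # Range 5: -10% or less


# Made-up monthly returns, one per range
labels = [return_range(r) for r in (0.12, 0.07, 0.01, -0.03, -0.15)]
print(labels)
```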

Figure 1. Flow diagram of factor dynamic adjustment and classification probability of prediction (CPP) quantitative stock selection strategy.

Table 1. Range criteria (monthly rate of return).

Range | Range 1 | Range 2 | Range 3 | Range 4 | Range 5
Criteria | above 10% | 5–10% | 0–5% | −10–0% | −10% or less

Table 2. Factors list.

No. | Classification | Factor
1 | Quality factor | net_profit_to_total_operate_revenue_ttm
2 | Quality factor | DEGM
3 | Quality factor | roe_ttm
4 | Quality factor | GMI
5 | Quality factor | ACCA
6 | Fundamental factor | financial_liability
7 | Fundamental factor | cash_flow_to_price_ratio
8 | Fundamental factor | market_cap
9 | Fundamental factor | net_profit_ttm
10 | Fundamental factor | EBIT
11 | Emotional factor | VOL20
12 | Emotional factor | DAVOL20
13 | Emotional factor | VOSC
14 | Emotional factor | VMACD
15 | Emotional factor | ATR14
16 | Growth factor | PEG
17 | Growth factor | net_profit_growth_rate
18 | Growth factor | operating_revenue_growth_rate
19 | Growth factor | net_asset_growth_rate
20 | Growth factor | net_operate_cashflow_growth_rate
21 | Risk factor | Variance20
22 | Risk factor | sharpe_ratio_20
23 | Risk factor | Kurtosis20
24 | Risk factor | Skewness20
25 | Risk factor | sharpe_ratio_60
26 | Stock factor | net_asset_per_share
27 | Stock factor | net_operate_cash_flow_per_share
28 | Stock factor | eps_ttm
29 | Stock factor | retained_earnings_per_share
30 | Stock factor | cashflow_per_share_ttm
31 | Momentum factor | ROC20
32 | Momentum factor | Volume1M
33 | Momentum factor | TRIX10
34 | Momentum factor | Price1M
35 | Momentum factor | PLRC12
36 | Technical factor | MAC20
37 | Technical factor | boll_down
38 | Technical factor | boll_up
39 | Technical factor | MFI14
40 | Style factor | size
41 | Style factor | beta
42 | Style factor | momentum
43 | Style factor | book_to_price_ratio
44 | Style factor | liquidity
45 | Style factor | growth

In the second step, the training and test sets were constructed by recombining the factors and yield intervals of each period. In particular, the factors of period i−3 were combined with the monthly rate of return for period i−2, the factors of period i−2 were combined with the monthly rate of return for period i−1, and the factors of period i−1 were combined with the monthly rate of return of period i. All of these together were combined to construct the training set of period i. The factors of period i and the monthly rate of return of period i+1 constructed the test set of period i. See Figure 2 for an illustration.
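The pairing of factors with next-period returns in the second step can be sketched as below. The data layout (dictionaries keyed by period and stock) and the name `build_sets` are assumptions made for illustration; the paper does not describe its data structures.

```python
def build_sets(factors, labels, i):
    """Build the training and test sets of period i.

    factors[t]: dict mapping stock -> factor vector observed in period t
    labels[t]:  dict mapping stock -> return-range label realized in period t
    (both layouts are hypothetical)
    """
    train = []
    # Factors of periods i-3, i-2, i-1 are paired with the following period's label.
    for t in (i - 3, i - 2, i - 1):
        for stock, x in factors[t].items():
            if stock in labels[t + 1]:
                train.append((x, labels[t + 1][stock]))
    # Test set: factors of period i paired with the label of period i+1.
    test = [(x, labels[i + 1][stock])
            for stock, x in factors[i].items() if stock in labels[i + 1]]
    return train, test


# Toy data: two stocks, periods 0..5, made-up factor vectors and labels
factors = {t: {"A": [t, 1.0], "B": [t, 2.0]} for t in range(5)}
labels = {t: {"A": 1, "B": 5} for t in range(6)}
train, test = build_sets(factors, labels, i=3)
print(len(train), len(test))
```

In live trading the label of period i+1 is not yet known when the period-i factors are scored, so the test labels serve only for evaluation after the fact.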

Figure 2. Training data and test data construction.

In the third step, we used an RF model to predict the importance of factors for each training set, and sorted the importance in descending order. We chose the most important factors to ensure that the cumulative importance of the selected factors reached 80%. As the factors had their own validity periods, the IC values of the factors in different periods were not completely unchanged. As shown in Figures 3 and 4, the IC values of the factors ATR14 and EBIT have changed in different periods. Therefore, the factors applicable to different time periods are also different. For this reason, we used dynamic factor selection to select the most important factors in the current period and improve the accuracy of stock selection.
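The 80% cumulative-importance rule of the third step can be sketched as follows. The function name `select_factors` and the importance scores are made up for illustration; in practice the scores might come from a fitted random forest (for example, the `feature_importances_` attribute of a scikit-learn forest), although the paper does not say which implementation was used.

```python
def select_factors(importances, threshold=0.80):
    """Keep the top-ranked factors until their cumulative importance share reaches threshold."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total
        if cumulative >= threshold:
            break
    return selected


# Made-up importance scores for five factors from the factor pool
importances = {"liquidity": 0.30, "size": 0.25, "VOSC": 0.20,
               "Price1M": 0.15, "EBIT": 0.10}
print(select_factors(importances))
```

With these toy scores the cumulative shares are 0.30, 0.55, 0.75, and 0.90, so the first four factors are kept and EBIT is dropped for this period.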

Figure 3. IC in ATR 14 (Data Source: JoinQuant platform).


Figure 4. IC in EBIT (Data Source: JoinQuant platform).

The fourth step was to use the XGBoost CPP method to predict the classification (the previous month’s factors predict the monthly yield range), and to classify each stock into five yield ranges based on the factors dynamically selected in the third step. The stocks classified into the target yield range were sorted in descending order of probability, and the top 20 stocks with the highest probability were taken into the buying stock pool. On the last trading day of each month, the position was adjusted: the stocks that were not in the buying stock pool were sold, and new stocks in the buying stock pool were bought. Then, we looped into the training set for the next period.

The CPP quantitative stock selection strategy with dynamic factor adjustment has some obvious advantages. The core of quantitative investment is the model, and the core of the model is the factor. This is particularly true in the neutral strategy with huge market capacity. Therefore, how to find a stable and effective factor becomes the first step in developing a mature, profitable quantitative strategy. The random forest (RF) model is an ensemble learning method for classification, regression, and other tasks (https://en.wikipedia.org/wiki/Random_forest). The RF model can not only effectively correct the overfitting problem of the decision tree model, but also give the importance of each input variable. In 1995, Ho proposed the RF algorithm [9], and some scholars extended the algorithm and conducted subsequent research (see, e.g., Breiman [10] and Lin and Jeon [11]). In this paper, we used the RF model to predict the importance of the factors in the training set, and ranked the factors in descending order of importance. Then, we selected the top-ranked factors until their cumulative importance reached 80%, ensuring that the factors in each period were the optimal choices. By doing that, we enhanced the impacts of the factors.

To the best of our knowledge, most quantitative stock selection strategies based on machine learning use a regression method to predict the future return of each stock, and then buy the stocks with high predicted returns. The regression-based stock selection method seems to be more accurate than the multi-class probability prediction stock selection method, but its fault tolerance is relatively low. Once a prediction error occurs, it has a greater impact on the overall return. Moreover, the noise in the yield is usually large, and the probability of regression errors is usually high. Therefore, it is easy to cause a large maximum drawdown. The proposed multi-class probability prediction stock selection strategy does not select the stocks with the highest predicted return rate; instead, once the expected return range is determined, it selects the stocks with the highest probability of falling in that range. Although some of the benefits are sacrificed in this way, the accuracy rate and fault tolerance rate are both improved, and with the increase of the accuracy rate, some of the sacrificed benefits are also made up.
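The fourth step, selecting the most probable stocks in the target range and rebalancing monthly, can be sketched as below. The names `build_stock_pool` and `rebalance`, the probability layout, and the toy numbers are all hypothetical; any classifier that outputs per-class probabilities could supply the input.

```python
def build_stock_pool(class_probs, target_class=0, top_n=20):
    """Pick the top_n stocks predicted in the target class, ranked by that class's probability.

    class_probs: dict mapping stock -> list of five class probabilities
    target_class=0 corresponds to Range 1 (the highest-return range).
    """
    candidates = []
    for stock, probs in class_probs.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == target_class:
            candidates.append((probs[target_class], stock))
    candidates.sort(reverse=True)  # highest probability first
    return [stock for _, stock in candidates[:top_n]]


def rebalance(current_holdings, stock_pool):
    """Return (stocks to sell, stocks to buy) on the rebalancing day."""
    pool, held = set(stock_pool), set(current_holdings)
    return sorted(held - pool), sorted(pool - held)


# Made-up class probabilities for four stocks
class_probs = {
    "A": [0.60, 0.10, 0.10, 0.10, 0.10],
    "B": [0.20, 0.50, 0.10, 0.10, 0.10],
    "C": [0.80, 0.05, 0.05, 0.05, 0.05],
    "D": [0.40, 0.30, 0.10, 0.10, 0.10],
}
pool = build_stock_pool(class_probs, top_n=2)
to_sell, to_buy = rebalance(["A", "B"], pool)
print(pool, to_sell, to_buy)
```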

3. Back-test Analysis of CPP Quantitative Stock Selection Strategy

In this section, we conduct 74 back-testing analyses of market data from November 2013 to December 2019. The data came from the JoinQuant quantitative investment platform. The goal of the stock selection was to achieve a high return, and we did not limit the investment strategies to any particular investment style. Therefore, it was natural to use the overall market return as the benchmark. In this paper, we chose the CSI 300 index as the benchmark.

3.1. Dynamic Factor Adjustment Analysis

Among the 45 factors, the style category was most likely to be selected (see Table 3). The liquidity factor (liquidity) had a probability of being selected as high as 98.65%. The market value factor (size) was selected with probability 94.59% and the beta factor (beta) was selected with probability 68.92%. There were three growth type factors in the top ten factors, where the net asset growth rate (net_asset_growth_rate) had a selection probability of 95.95%, the net profit growth rate (net_profit_growth_rate) had a selection probability of 79.73%, and the price-earnings (P/E) ratio relative to the earnings growth ratio (PEG) had a selection probability of 71.62%. There were two risk type factors in the top ten. In particular, the 20-day annualized return variance (Variance20) was selected with a probability of 95.95%, and the 20-day Sharpe ratio (sharpe_ratio_20) was selected with a probability of 74.32%. Finally, there was one emotion factor and one momentum factor among the top ten factors, where the trading volume shock (VOSC) was selected with a probability of 93.24%, and Price1M was selected with a probability of 90.54%.

Table 3. Top 10 factors by selection probability.

Rank | Factor | Selected Times | Total Times | Selected Probability
1 | liquidity | 73 | 74 | 98.65%
2 | Variance20 | 71 | 74 | 95.95%
3 | net_asset_growth_rate | 71 | 74 | 95.95%
4 | size | 70 | 74 | 94.59%
5 | VOSC | 69 | 74 | 93.24%
6 | Price1M | 67 | 74 | 90.54%
7 | net_profit_growth_rate | 59 | 74 | 79.73%
8 | sharpe_ratio_20 | 55 | 74 | 74.32%
9 | PEG | 53 | 74 | 71.62%
10 | beta | 51 | 74 | 68.92%

The market value factor considered here is not the same as the traditional market value factor. It refers to the natural logarithm of the company’s total market value. The formula of the liquidity factor is given by:

$$\text{Liquidity Factor} = 0.35 \times \text{STOM} + 0.35 \times \text{STOQ} + 0.3 \times \text{STOA}, \tag{1}$$

where STOM is the stock turnover rate in one month, given by the logarithm of the sum of stock turnover rates in the past 21 days; STOQ is the average turnover rate in the past three months, given by the logarithm of the average STOM in the past three months; and STOA is the average turnover rate in the past 12 months, given by the logarithm of the average STOM in the past 12 months. The formula for the net asset growth rate is given by:

$$\text{Net asset growth rate} = \frac{\text{shareholder equity for the current quarter}}{\text{shareholder equity before the third quarter}} - 1. \tag{2}$$
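Equations (1) and (2) can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, STOQ and STOA are taken as precomputed inputs because their exact averaging convention is not spelled out beyond the text above, and the reference-quarter equity in Equation (2) is passed in as given.

```python
import math


def stom(daily_turnover):
    """STOM: logarithm of the sum of the turnover rates of the past 21 trading days."""
    return math.log(sum(daily_turnover[-21:]))


def liquidity_factor(stom_value, stoq_value, stoa_value):
    """Equation (1): weighted sum of the three turnover measures."""
    return 0.35 * stom_value + 0.35 * stoq_value + 0.3 * stoa_value


def net_asset_growth_rate(equity_current, equity_reference):
    """Equation (2): shareholder equity relative to the reference quarter, minus 1."""
    return equity_current / equity_reference - 1.0


# Made-up inputs: 1% daily turnover for 21 days, equity growing from 100 to 110
monthly = stom([0.01] * 21)
print(monthly)
print(liquidity_factor(monthly, monthly, monthly))
print(net_asset_growth_rate(110.0, 100.0))
```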

3.2. Back-testing Revenue

In this section, we compare and analyze the returns under different back-testing settings. See Table 4 for the parameter settings.

Table 4. Parameter settings for strategy back-testing.

Item | Detail
Object of transaction | all stocks after screening (excluding ST shares, new shares, secondary shares, and stocks suspended within 20 days)
Returns of the benchmark | index gains for the CSI 300
Time of back-testing | 1 November 2013 (Fri.) to 31 December 2019 (Tue.)
Days of back-testing | 1507 trading days
Data sources | JoinQuant quantitative investment platform
Initial funding | 10 million
Overnight or not | yes
Stop-loss method | RSRS stop loss
Number of positions | 20 stocks
Adjustment frequency | one month
 | 0.2%
Commission charge | 0.03% commission when buying; 0.03% commission plus 0.1% stamp duty when selling; minimum commission of 5 yuan per transaction
Software language | Python
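The commission rule in Table 4 can be written out as a small function. This is a sketch under assumptions: the name `trade_cost` is hypothetical, the 5 yuan minimum is assumed to apply to the commission part only (not to the stamp duty), and the 0.2% entry of Table 4 is not modeled here.

```python
def trade_cost(amount, side, commission=0.0003, stamp_duty=0.001, min_commission=5.0):
    """Cost in yuan of one transaction of the given traded amount.

    Buying:  0.03% commission, at least 5 yuan.
    Selling: 0.03% commission (at least 5 yuan) plus 0.1% stamp duty.
    """
    fee = max(amount * commission, min_commission)
    if side == "sell":
        fee += amount * stamp_duty
    return fee


print(trade_cost(100_000, "buy"))   # commission above the minimum
print(trade_cost(10_000, "sell"))   # minimum commission plus stamp duty
```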

3.2.1. XGBoost Classification Prediction and XGBoost Regression Prediction

In 2015, the XGBoost model was proposed by Chen et al. [12], which is optimized for fast parallel tree construction. “It has gained much popularity and attention recently as the algorithm of choice for many winning teams of machine learning competitions” (https://en.wikipedia.org/wiki/XGBoost). Because of the XGBoost model’s good performance, we chose the XGBoost model to predict the stock’s return rate. The core model of this paper is the XGBoost multi-class prediction model, and the model parameters are shown in Table 5. We used the XGBoost multi-class prediction model to perform back-testing from November 2013 to December 2019. A total of 74 class predictions were carried out. The comprehensive evaluation of the prediction is shown in Table 6. Among them, accuracy, sensitivity C1, and precision C1 are defined similarly to those for two-class classification. The specific formulas are given by Equations (3)–(5), where x_ij is given in Table 7.

$$\text{accuracy} = \frac{\sum_{i=1}^{5} x_{ii}}{\sum_{i=1}^{5}\sum_{j=1}^{5} x_{ij}} \tag{3}$$

$$\text{sensitivity C1} = \frac{x_{11}}{\sum_{i=1}^{5} x_{i1}} \tag{4}$$

$$\text{precision C1} = \frac{x_{11}}{\sum_{j=1}^{5} x_{1j}} \tag{5}$$

Table 5. Main parameters of XGBoost classification.

Parameter | Value
max_depth | 10
learning_rate | 0.1
n_estimators | 500
min_child_weight | 5
colsample_bytree | 0.7
reg_lambda | 0.4
scale_pos_weight | 0.8
subsample | 0.8

Table 6. XGBoost Multi-class prediction evaluation.

Item | Mean | Stdev | Max | Min
accuracy | 51.7% | 7.9% | 67.0% | 37.5%
sensitivity C1 | 75.4% | 7.8% | 87.7% | 59.3%
precision C1 | 62.1% | 10.3% | 78.2% | 41.2%

Table 7. Confusion matrix of the five-class model (rows: predicted condition; columns: true condition).

Predicted \ True | Category1 | Category2 | Category3 | Category4 | Category5
Category1 | x11 | x12 | x13 | x14 | x15
Category2 | x21 | x22 | x23 | x24 | x25
Category3 | x31 | x32 | x33 | x34 | x35
Category4 | x41 | x42 | x43 | x44 | x45
Category5 | x51 | x52 | x53 | x54 | x55
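As a worked illustration of Equations (3)–(5), the following sketch computes the three indicators from a confusion matrix laid out as in Table 7 (rows are predicted categories, columns are true categories). The function name `c1_metrics` and the matrix entries are made up and do not come from the paper's back-test.

```python
def c1_metrics(x):
    """x[i][j]: number of stocks predicted in category i+1 whose true category is j+1."""
    n = len(x)
    total = sum(sum(row) for row in x)
    accuracy = sum(x[i][i] for i in range(n)) / total              # Equation (3)
    sensitivity_c1 = x[0][0] / sum(x[i][0] for i in range(n))      # Equation (4): column 1
    precision_c1 = x[0][0] / sum(x[0])                             # Equation (5): row 1
    return accuracy, sensitivity_c1, precision_c1


# Made-up confusion matrix for illustration
x = [
    [30, 5, 3, 1, 1],
    [5, 20, 5, 3, 2],
    [3, 5, 25, 5, 2],
    [1, 3, 5, 20, 6],
    [1, 2, 2, 6, 15],
]
accuracy, sensitivity_c1, precision_c1 = c1_metrics(x)
print(accuracy, sensitivity_c1, precision_c1)
```

For this toy matrix, 110 of 176 stocks lie on the diagonal, and both the first row and the first column sum to 40, of which 30 are correct.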

The stock selection criterion is to hold stocks that are predicted to be in the first category and are ranked in the top 20 in probability. Therefore, sensitivity C1 and precision C1 are more important for evaluating the prediction ability. Among them, sensitivity C1 represents the proportion of stocks that can be correctly predicted in the first category of stocks, and precision C1 represents the proportion of stocks that are truly in the first category. In the 74 predictions, the mean value of sensitivity C1 was 75.4% and the was 7.8%; the mean value of precision C1 was 62.1% and the standard deviation was 10.3%. The average accuracy of the 74 predictions was 51.7% and the standard deviation was 7.9%. Although the overall accuracy was not very high, this indicator had little effect on the overall performance in terms of back-testing returns. We believe that the precision C1 indicator is the most important of the three indicators. The higher value of this indicator indicates that the model can screen out high-yield stocks with a high probability. Next, the comparison between XGBoost classification prediction and XGBoost regression prediction was performed. In XGBoost classification prediction, we used the XGBoost model to predict the return rate range of each period of the back-testing stage; that is, to carry out multi-class prediction. In XGBoost regression prediction (parameters are given in Table8), we predicted the return rate value of each period of the back-testing stage, that is, regression the yield, and holding the 20 stocks with the highest predicted returns. Both methods use the RSRS index (relative strength of resistance support) stop-loss module to stop the loss. Sustainability 2020, 12, x FOR PEER REVIEW 8 of 12 is the most important of the three indicators. The higher value of this indicator indicates that the model can screen out high-yield stocks with a high probability. 
Next, the comparison between XGBoost classification prediction and XGBoost regression prediction was performed. In XGBoost classification prediction, we used the XGBoost model to predict the return rate range of each period of the back-testing stage; that is, to carry out multi-class prediction. In XGBoost regression prediction (parameters are given in Table 8), we predicted the Sustainability 2020, 12, 3978 8 of 12 return rate value of each period of the back-testing stage, that is, regression the yield, and holding the 20 stocks with the highest predicted returns. Both methods use the RSRS index (relative strength of resistance support) stop-lossTable module 8. Main to parameters stop the loss. of XGBoost regression.

Table 8. Main parameters of XGBoost regression.

| Parameter | Value |
| --- | --- |
| max_depth | 10 |
| learning_rate | 0.3 |
| gamma | 0.1 |
| min_child_weight | 3 |
| colsample_bytree | 0.7 |
| lambda | 3 |
| subsample | 0.5 |
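To make the two selection rules concrete, the sketch below stores the Table 8 values in a dictionary and ranks stocks under each rule. It assumes the predictions are already available as plain Python dictionaries; the stock codes, the probabilities, and the helper names are hypothetical. The dictionary keys follow the table; in XGBoost's scikit-learn wrapper the L2 regularization term is passed as `reg_lambda` rather than `lambda`.

```python
# Table 8 values, keyed as in the table (XGBoost native parameter names).
XGB_REGRESSION_PARAMS = {
    "max_depth": 10,
    "learning_rate": 0.3,
    "gamma": 0.1,
    "min_child_weight": 3,
    "colsample_bytree": 0.7,
    "lambda": 3,
    "subsample": 0.5,
}


def top_n_by_predicted_return(predicted_returns, n=20):
    """Regression rule: hold the n stocks with the highest predicted return."""
    ranked = sorted(predicted_returns.items(), key=lambda kv: kv[1], reverse=True)
    return [code for code, _ in ranked[:n]]


def top_n_by_class_probability(class_probs, target_class=0, n=20):
    """Classification rule: keep stocks whose most likely class is the
    target class, then rank them by the probability of that class."""
    candidates = []
    for code, probs in class_probs.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == target_class:
            candidates.append((code, probs[target_class]))
    candidates.sort(key=lambda kv: kv[1], reverse=True)
    return [code for code, _ in candidates[:n]]


# Hypothetical predictions for four stocks.
predicted_returns = {"A": 0.05, "B": -0.01, "C": 0.08, "D": 0.02}
class_probs = {
    "A": [0.60, 0.30, 0.10],
    "B": [0.20, 0.50, 0.30],
    "C": [0.90, 0.05, 0.05],
    "D": [0.45, 0.35, 0.20],
}
by_return = top_n_by_predicted_return(predicted_returns, n=2)
by_probability = top_n_by_class_probability(class_probs, target_class=0, n=2)
```

The classification rule can return fewer than n stocks when few stocks are predicted to fall in the target class, whereas the regression rule always fills the portfolio.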

As shown in Figure 5 and Table 9, the quantitative stock selection strategy based on XGBoost multi-class prediction performed much better than the CSI 300 Index in the back-testing interval from November 2013 to December 2019. In terms of the annualized yield, Sharpe ratio, maximum drawdown, and Calmar ratio, the XGBoost multi-class prediction method also performed significantly better than the strategies based on XGBoost regression and XGBoost two-class classification over the same period. Therefore, we believe that the quantitative stock selection strategy based on XGBoost multi-class probability prediction has the better back-testing performance.

Figure 5. Back-testing earning chart of regression and classification stock selection.

Table 9. Stock selection strategy back-testing indicators of XGBoost-regression and classification.

| Model and Stop-Loss | XGBoost-Regression + RSRS | XGBoost-Classification + RSRS | XGBoost-Dichotomy + RSRS |
| --- | --- | --- | --- |
| Annual yield rate | 0.26 | 0.57 | 0.36 |
| Accumulated yield rate | 3.65 | 7.76 | 4.91 |
| Annualized Volatility | 0.25 | 0.23 | 0.24 |
| Sharpe Ratio | 0.62 | 2.21 | 1.42 |
| Calmar Ratio | 0.90 | 2.71 | 1.33 |
| Stability_of_timeseries | 0.71 | 0.84 | 0.57 |
| Maximum Drawdown | 0.29 | 0.21 | 0.27 |
| Sortino Ratio | 0.90 | 3.03 | 1.81 |
| Information Ratio | 0.96 | 1.92 | 1.05 |
| Alpha | 0.16 | 0.54 | 0.31 |
| Beta | 0.70 | 0.36 | 0.58 |
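Several of the indicators in Table 9 can be derived from a series of periodic strategy returns. The sketch below uses common textbook definitions; the annualization factor, the zero risk-free rate, and the function name are assumptions, since this section does not state the exact conventions used on the back-testing platform. The Calmar ratios in the table are consistent with annual yield divided by maximum drawdown (for example, 0.57 / 0.21 ≈ 2.71 for the classification column).

```python
import math


def performance_summary(period_returns, periods_per_year=252, risk_free_rate=0.0):
    """Summarize a return series with common back-testing indicators."""
    nav = 1.0            # net asset value, starting from 1
    peak = 1.0           # running maximum of the net asset value
    max_drawdown = 0.0   # largest peak-to-trough loss seen so far
    for r in period_returns:
        nav *= 1.0 + r
        peak = max(peak, nav)
        max_drawdown = max(max_drawdown, 1.0 - nav / peak)

    n = len(period_returns)
    years = n / periods_per_year
    annual_return = nav ** (1.0 / years) - 1.0

    mean = sum(period_returns) / n
    variance = sum((r - mean) ** 2 for r in period_returns) / (n - 1)
    annual_volatility = math.sqrt(variance * periods_per_year)

    sharpe = ((annual_return - risk_free_rate) / annual_volatility
              if annual_volatility else float("nan"))
    calmar = annual_return / max_drawdown if max_drawdown else float("inf")

    return {
        "accumulated_return": nav - 1.0,
        "annual_return": annual_return,
        "annual_volatility": annual_volatility,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
        "calmar": calmar,
    }


# Toy series of three periods treated as one year (hypothetical data).
summary = performance_summary([0.10, -0.50, 1.00], periods_per_year=3)
```

In the toy series the net asset value moves from 1.1 to 0.55 and back to 1.1, so the strategy ends the year up 10% after a 50% drawdown.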

3.2.2. Back-testing Revenue of Different Models

Next, to compare the combined back-testing effects of different models and stop-loss modules, we compared the performances of different combinations of the XGBoost and random forest (RF) decision-making models (parameters of the RF model are given in Table 10) with the RSRS (relative strength of resistance support) index stop-loss module and the MACD (moving average convergence divergence) stop-loss module. The back-testing results are given in Figure 6 and Table 11.
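As an illustration of how a MACD-based stop-loss signal might be generated, the sketch below computes the standard MACD line, signal line, and histogram from closing prices and flags the days on which the histogram turns negative. The 12/26/9 spans and the cross-below rule are conventional defaults assumed here; this section does not specify the parameters or the exit rule actually used, and the function names and prices are illustrative.

```python
def ema(values, span):
    """Exponential moving average seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) for a close-price series."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


def stop_loss_signals(closes):
    """Indices where the histogram crosses from non-negative to negative
    (assumed exit rule, not taken from the paper)."""
    _, _, hist = macd(closes)
    return [i for i in range(1, len(hist)) if hist[i - 1] >= 0 and hist[i] < 0]


# Hypothetical prices: a steady rise to a peak at index 60, then a steady fall.
closes = list(range(100, 160)) + list(range(160, 100, -1))
signals = stop_loss_signals(closes)
```

With this toy series the exit signal appears only after the price has peaked, which reflects the lag inherent in moving-average indicators.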

Table 10. Main parameters of random forest.

| Parameter | Value |
| --- | --- |
| max_depth | 5 |
| min_samples_leaf | 2 |
| n_estimators | 200 |
| min_samples_split | 2 |
| criterion | gini |

Figure 6. The back-testing for return rate.

Table 11. The back-testing index of different models.

| Model and Stop-Loss Index | XGBoost + RSRS | RF + RSRS | RF + MACD | XGBoost + MACD |
| --- | --- | --- | --- | --- |
| Annual yield rate | 0.57 | 0.41 | 0.37 | 0.28 |
| Accumulated yield rate | 7.76 | 5.63 | 4.96 | 3.84 |
| Annualized Volatility | 0.23 | 0.24 | 0.33 | 0.35 |
| Sharpe Ratio | 2.21 | 1.54 | 1.02 | 0.65 |
| Calmar Ratio | 2.71 | 1.86 | 0.71 | 0.51 |
| Stability_of_timeseries | 0.84 | 0.82 | 0.72 | 0.70 |
| Maximum Drawdown | 0.21 | 0.22 | 0.51 | 0.55 |
| Sortino Ratio | 3.03 | 2.57 | 1.01 | 0.92 |
| Information Ratio | 1.92 | 1.26 | 1.18 | 0.98 |
| Alpha | 0.54 | 0.41 | 0.24 | 0.17 |
| Beta | 0.36 | 0.48 | 0.83 | 0.79 |
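The benchmark-relative rows of Table 11 (alpha, beta, information ratio) and the Sortino ratio can be sketched from paired strategy and benchmark return series. The definitions below are common textbook forms with a zero risk-free rate, a zero downside target, and simple annualization; all of these are assumptions, as the exact formulas are not given in this section, and the return values are made up.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def beta_and_alpha(strategy, benchmark, periods_per_year=252):
    """Least-squares fit of strategy returns on benchmark returns."""
    ms, mb = _mean(strategy), _mean(benchmark)
    cov = sum((s - ms) * (b - mb) for s, b in zip(strategy, benchmark))
    var = sum((b - mb) ** 2 for b in benchmark)
    beta = cov / var
    alpha = (ms - beta * mb) * periods_per_year  # simple annualization
    return beta, alpha


def information_ratio(strategy, benchmark, periods_per_year=252):
    """Annualized mean active return divided by tracking error."""
    active = [s - b for s, b in zip(strategy, benchmark)]
    ma = _mean(active)
    tracking_error = math.sqrt(
        sum((a - ma) ** 2 for a in active) / (len(active) - 1)
    )
    return ma / tracking_error * math.sqrt(periods_per_year)


def sortino_ratio(returns, periods_per_year=252, target=0.0):
    """Annualized mean return over downside deviation below the target."""
    downside_dev = math.sqrt(_mean([min(r - target, 0.0) ** 2 for r in returns]))
    return (_mean(returns) - target) / downside_dev * math.sqrt(periods_per_year)


# Hypothetical returns: the strategy takes half the benchmark's exposure
# and adds a constant 0.1% per period.
benchmark = [0.01, -0.02, 0.03, 0.00]
strategy = [0.5 * b + 0.001 for b in benchmark]
beta, alpha = beta_and_alpha(strategy, benchmark)
ir = information_ratio(strategy, benchmark)
sortino = sortino_ratio(strategy)
```

A beta well below 1 together with a positive alpha, as in the XGBoost + RSRS column, indicates returns that depend less on market direction than the MACD combinations do.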

As shown in Figure 6 and Table 11, the back-testing return of the XGBoost model combined with the RSRS index stop-loss module was higher than that of the random forest model. This indicates that, under the timing given by the RSRS index stop-loss module, the XGBoost multi-class probability prediction is more accurate than the random forest model. However, under the timing given by the MACD stop-loss module, the return of the XGBoost model was lower than that of the random forest model. For the same machine learning model, the RSRS index stop-loss module performed significantly better than the MACD stop-loss module. Therefore, we chose the combination of the XGBoost model and the RSRS index stop-loss module as the main model of the CPP quantitative stock selection strategy.

For the CPP quantitative stock selection strategy proposed in this paper, the annualized return reached 57%, the Sharpe ratio was 2.21, the maximum drawdown was 21%, the Calmar ratio was 2.71, and the win rate was 63.5%. The return of the strategy reached its lowest value of −3.85% on 10 January 2014, and reached its highest point on 14 October 2019, when the cumulative gain of the strategy was 788.52%. Since 19 December 2013, the cumulative returns of the CPP quantitative stock selection strategy have been better than those of the CSI 300 Index over the same period.

3.2.3. CPP Quantitative Stock Selection Back-Testing Income

After determining that the main model is a combination of the XGBoost multi-class forecast and the RSRS index stop-loss module, we conducted back-testing over the interval from 1 November 2013 to 31 December 2019. The results are given in Figure 7 and Table 12.

Figure 7. Chart of CPP stock selection back-testing return rate.

Table 12. Excess returns of CPP quantitative stock selection strategy at different time periods.

| Period | State of Market | Excess Rate of Return |
| --- | --- | --- |
| 1 November 2013–31 August 2014 | volatile market | 30.29% |
| 1 September 2014–31 May 2015 | bull market | 94.40% |
| 1 June 2015–31 December 2015 | bear market | 80.58% |
| 1 January 2016–31 December 2019 | volatile market | 86.63% |
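One plausible reading of the excess rate of return in Table 12 is the strategy's return over a period minus the benchmark's return over the same period. The sketch below implements that reading; the definition is an assumption, as this section does not state the formula, and the net asset values are invented for illustration.

```python
def period_return(nav_by_date, start, end):
    """Simple return between two dates from a date -> net asset value map."""
    return nav_by_date[end] / nav_by_date[start] - 1.0


def excess_return(strategy_nav, benchmark_nav, start, end):
    """Strategy return minus benchmark return over the same period
    (assumed definition of excess return)."""
    return (period_return(strategy_nav, start, end)
            - period_return(benchmark_nav, start, end))


# Hypothetical net asset values at the boundaries of the first period.
strategy_nav = {"2013-11-01": 1.0, "2014-08-31": 1.4}
benchmark_nav = {"2013-11-01": 1.0, "2014-08-31": 1.1}
excess = excess_return(strategy_nav, benchmark_nav, "2013-11-01", "2014-08-31")
```

Under this definition the excess return is positive whenever the strategy beats the index over the period, including periods in which both lose money.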

In different periods of the market, the applicable strategies will be different, and it is difficult for a single strategy to perform well in all periods. The CPP quantitative stock selection strategy has different levels of excess returns in different time periods. As shown in Table 12 and Figure 7, from 1 November 2013 to 31 August 2014, a horizontal price movement period (volatile market) before the bull market, the CPP quantitative stock selection strategy achieved an excess yield of 30.29% over this 10-month period. From 1 September 2014 to 31 May 2015, during the bull market, the strategy achieved an excess return of 94.4%. From 1 June 2015 to 31 December 2015, after the stock market crashed sharply, the strategy achieved an excess return of 80.58%. From 1 January 2016 to 31 December 2019, another horizontal price movement period (volatile market), the strategy achieved an excess return of 86.63%. As we can see, the proposed CPP quantitative stock selection strategy is a sustainable investment strategy that works well over an extensive period covering bull market, bear market, and volatile market states.


4. Conclusions

In this paper, we used a random forest model to dynamically select factors for the training set in each period, so that the factors selected in each period were the optimal factors for that period. At the same time, the classification probability prediction (CPP) of stock returns was performed. This method takes into account the accuracy of return prediction while avoiding the interference of noise in the rate of return. Historical back-testing shows that the CPP quantitative stock selection strategy based on dynamic factor adjustment performs better than the traditional machine learning stock selection methods, and outperforms the CSI 300 Index over the same period in most back-testing periods. It is a sustainable investment strategy in the sense that, whether in a bull market, a bear market, or a volatile market, the CPP quantitative stock selection strategy based on dynamic factor adjustment achieves positive excess returns.

It should be noted that all the results in this article were derived from back-testing on historical data, and they may differ from the results of actual investments. Because we used historical data for back-testing, we did not consider the impact of market liquidity or the impact of this strategy on the decisions of other market participants. Therefore, there is no guarantee that the strategy works for real market investments. We are not responsible for any loss caused by implementing the strategy.

Author Contributions: Conceptualization, Y.F. and T.P.; writing (original draft preparation), Y.F., T.P. and S.C.; writing (review and editing), Y.F. and T.P.; software, S.C. All authors have read and agreed to the published version of the manuscript.

Funding: This research was partially supported by MOE (Ministry of Education in China) Youth Project of Humanities and Social Sciences (Project No. 17YJCZH044), MOE (Ministry of Education in China) Project of Humanities and Social Sciences (Project No. 18YJAZH127) and The 10th Key Discipline of Shanghai Normal University: Quantitative Economics.

Acknowledgments: Data and back-testing are based on JoinQuant Quantization Platform (https://www.joinquant.com/).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Fama, E.F.; French, K.R. Business Conditions and Expected Returns on Stocks and Bonds. J. Financ. Econ. 1989, 25, 23–49.
2. Lakonishok, J.; Shleifer, A.; Vishny, R.W. Contrarian Investment, Extrapolation, and Risk. J. Financ. 1994, 49, 1541–1578.
3. Song, F.M. A Two-Factor ARCH Model for Deposit-Institution Stock Returns. J. Money Credit Bank. 1994, 26, 323–340.
4. Patel, J.; Shah, S.; Thakkar, P.; Kotecha, K. Predicting Stock and Stock Price Index Movement Using Trend Deterministic Data Preparation and Machine Learning Techniques. Expert Syst. Appl. 2015, 42, 259–268.
5. Liu, S.; Zhang, C.; Ma, J. CNN-LSTM Neural Network Model for Quantitative Strategy Analysis in Stock Markets. In Proceedings of the International Conference on Neural Information Processing, Guangzhou, China, 14–18 November 2017; pp. 198–206.
6. Li, J.; Zhang, R. Dynamic Weighting Multi Factor Stock Selection Strategy Based on XGboost Machine Learning Algorithm. In Proceedings of the 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), Chongqing, China, 10–12 December 2018.
7. Yang, F.; Chen, Z.; Li, J.; Tang, L. A Novel Hybrid Stock Selection Method with Stock Prediction. Appl. Soft Comput. 2019, 80, 820–831.
8. Chen, S.; Ge, L. Exploring the attention mechanism in LSTM-based Hong Kong stock price movement prediction. Quant. Financ. 2019, 19, 1507–1515.
9. Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995.
10. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
11. Lin, Y.; Jeon, Y. Random Forests and Adaptive Nearest Neighbors. J. Am. Stat. Assoc. 2006, 101, 578–590.
12. Chen, T.; Guestrin, C. XGBoost: Reliable Large-Scale Tree Boosting System. In Proceedings of the 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 13–17.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).