Curse of Dimensionality for TSK Fuzzy Neural Networks: Explanation and Solutions

Yuqi Cui, Dongrui Wu and Yifan Xu
School of Artificial Intelligence and Automation
Huazhong University of Science and Technology
Wuhan, China
{yqcui, drwu, yfxu}@hust.edu.cn

arXiv:2102.04271v1 [cs.LG] 8 Feb 2021

Abstract—The Takagi-Sugeno-Kang (TSK) fuzzy system with Gaussian membership functions (MFs) is one of the most widely used fuzzy systems in machine learning. However, it usually has difficulty handling high-dimensional datasets. This paper explores why TSK fuzzy systems with Gaussian MFs may fail on high-dimensional inputs. After transforming defuzzification into an equivalent softmax function, we find that the poor performance is due to the saturation of softmax. We show that two defuzzification operations, LogTSK and HTSK, the latter of which is first proposed in this paper, can avoid the saturation. Experimental results on datasets with various dimensionalities validated our analysis and demonstrated the effectiveness of LogTSK and HTSK.

Index Terms—Mini-batch gradient descent, fuzzy neural network, high-dimensional TSK fuzzy system, HTSK, LogTSK

I. INTRODUCTION

Takagi-Sugeno-Kang (TSK) fuzzy systems [1] have achieved great success in numerous machine learning applications, including both classification and regression. Since a TSK fuzzy system is equivalent to a five-layer neural network [2], [3], it is also known as a TSK fuzzy neural network.

Fuzzy clustering [4]–[6] and evolutionary algorithms [7], [8] have been used to determine the parameters of TSK fuzzy systems on small datasets. However, their computational cost may be too high for big data. Inspired by its great success in deep learning [9]–[11], mini-batch gradient descent (MBGD) based optimization was recently proposed for training TSK fuzzy systems [12], [13].

Traditional optimization algorithms for TSK fuzzy systems use grid partition to divide the input space into different fuzzy regions, whose number grows exponentially with the input dimensionality. A more popular and flexible alternative is clustering-based partition, e.g., fuzzy c-means (FCM) [14], EWFCM [15], ESSC [4], [16] and SESSC [6], in which the fuzzy sets in different rules are independent and optimized separately.

Although the combination of MBGD-based optimization and clustering-based rule partition can handle the problem of optimizing antecedent parameters on high-dimensional datasets, TSK fuzzy systems still have difficulty achieving acceptable performance. The main reason is the curse of dimensionality, which affects all machine learning models. When the input dimensionality is high, the distances between data points become very similar [17]. TSK fuzzy systems usually use distance-based approaches to compute membership grades, so on high-dimensional datasets the fuzzy partitions may collapse. For instance, FCM is known to have trouble handling high-dimensional datasets, because the membership grades of different clusters become similar, causing all cluster centers to move to the center of gravity [18].

Most previous works used feature selection or dimensionality reduction to cope with high dimensionality. Model-agnostic feature selection or dimensionality reduction algorithms, such as Relief [19] and principal component analysis (PCA) [20], [21], can filter the features before feeding them into TSK models. Neural networks pre-trained on large datasets can also be used as feature extractors to generate high-level features with low dimensionality [22], [23].

There are also approaches that select the fuzzy sets in each rule, so that rules may have different numbers of antecedents. For instance, Alcala-Fdez et al. proposed an association-analysis-based algorithm to select the most representative patterns as rules [24]. Cózar et al. further improved it by proposing a local search algorithm to select the optimal fuzzy regions [25]. Xu et al. proposed to use the attribute weights learned by soft subspace fuzzy clustering to remove fuzzy sets with low weights, building a concise TSK fuzzy system [4]. However, few approaches directly train TSK models on high-dimensional datasets.

Our previous experiments found that when using MBGD-based optimization, the initialization of the standard deviation $\sigma$ of the Gaussian MFs is very important for high-dimensional datasets, and a larger $\sigma$ may lead to better performance. In this paper, we demonstrate that this improvement is due to the reduction of the saturation caused by the increase of dimensionality. Furthermore, we validate two convenient approaches to accommodate saturation.

Our main contributions are:

• To the best of our knowledge, we are the first to discover that the curse of dimensionality in TSK modeling is due to the saturation of the softmax function. As a result, there exists an upper bound on the number of rules that each input can fire. Furthermore, the loss landscape of a saturated TSK system is more rugged, leading to worse generalization.

• We demonstrate that the initialization of $\sigma$ should be correlated with the input dimensionality to avoid saturation. Based on this, we propose a high-dimensional TSK (HTSK) algorithm, which can be viewed as a new defuzzification operation or initialization strategy.

• We validate LogTSK [23] and our proposed HTSK on datasets with a large range of dimensionalities. The results indicate that HTSK and LogTSK not only avoid saturation, but are also more accurate and more robust than the vanilla TSK algorithm with simple initialization.

The remainder of this paper is organized as follows: Section II introduces TSK fuzzy systems and the saturation phenomenon on high-dimensional datasets. Section III introduces the details of LogTSK and our proposed HTSK. Section IV validates the performance of LogTSK and HTSK on datasets with various dimensionalities. Section V draws conclusions.

II. TRADITIONAL TSK FUZZY SYSTEMS ON HIGH-DIMENSIONAL DATASETS

This section introduces the details of the TSK fuzzy system with Gaussian MFs [26], the equivalence between defuzzification and the softmax function, and the saturation phenomenon of softmax on high-dimensional datasets.

A. TSK Fuzzy Systems

Let the training dataset be $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, in which $x_n = [x_{n,1}, \ldots, x_{n,D}]^T \in \mathbb{R}^D$ is a $D$-dimensional feature vector, and $y_n \in \{1, 2, \ldots, C\}$ can be the corresponding class label for a $C$-class classification problem, or $y_n \in \mathbb{R}$ for a regression problem.

Suppose a $D$-input single-output TSK fuzzy system has $R$ rules. The $r$-th rule can be represented as:

$$\mathrm{Rule}_r: \text{IF } x_1 \text{ is } X_{r,1} \text{ and } \cdots \text{ and } x_D \text{ is } X_{r,D}, \text{ Then } y_r(x) = b_{r,0} + \sum_{d=1}^{D} b_{r,d} x_d, \quad (1)$$

where $X_{r,d}$ ($r = 1, \ldots, R$; $d = 1, \ldots, D$) is the MF for the $d$-th attribute in the $r$-th rule, and $b_{r,d}$, $d = 0, \ldots, D$, are the consequent parameters. Note that here we only take single-output TSK fuzzy systems as an example, but the phenomenon and conclusion can also be extended to multi-output TSK systems.

Consider Gaussian MFs. The membership degree of $x_d$ on $X_{r,d}$ is:

$$\mu_{X_{r,d}}(x_d) = \exp\left(-\frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right), \quad (2)$$

where $m_{r,d}$ and $\sigma_{r,d}$ are the center and the standard deviation of the Gaussian MF $X_{r,d}$, respectively.

The final output of the TSK fuzzy system is:

$$y(x) = \frac{\sum_{r=1}^{R} f_r(x) y_r(x)}{\sum_{i=1}^{R} f_i(x)}, \quad (3)$$

where

$$f_r(x) = \prod_{d=1}^{D} \mu_{X_{r,d}}(x_d) \quad (4)$$

is the firing level of Rule $r$. We can also re-write (3) as:

$$y(x) = \sum_{r=1}^{R} \bar{f}_r(x) y_r(x), \quad (5)$$

where

$$\bar{f}_r(x) = \frac{f_r(x)}{\sum_{i=1}^{R} f_i(x)} \quad (6)$$

is the normalized firing level of Rule $r$. (5) is the defuzzification operation of TSK fuzzy systems.

In this paper, we use $k$-means clustering to initialize the antecedent parameters $m_{r,d}$, and MBGD to optimize the parameters $b_{r,d}$, $m_{r,d}$ and $\sigma_{r,d}$. More specifically, we run $k$-means clustering and assign the $R$ cluster centers to $m_{r,d}$ as the centers of the rules. We use different initializations of $\sigma_{r,d}$ to validate their influence on the performance of TSK models on high-dimensional datasets. He initialization [27] is used for the consequent parameters.

B. TSK Fuzzy Systems on High-Dimensional Datasets

When using Gaussian MFs and the product t-norm, we can re-write (6) as:

$$\bar{f}_r(x) = \frac{f_r(x)}{\sum_{i=1}^{R} f_i(x)} = \frac{\exp\left(-\sum_{d=1}^{D} \frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right)}{\sum_{i=1}^{R} \exp\left(-\sum_{d=1}^{D} \frac{(x_d - m_{i,d})^2}{2\sigma_{i,d}^2}\right)}. \quad (7)$$

Replacing $-\sum_{d=1}^{D} \frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}$ with $Z_r$, we can observe that $\bar{f}_r$ is a typical softmax function:

$$\bar{f}_r(x) = \frac{\exp(Z_r)}{\sum_{i=1}^{R} \exp(Z_i)}, \quad (8)$$

where $Z_r < 0$, $\forall x$. We can also show that, as the dimensionality increases, $Z_r$ decreases, which causes the saturation of softmax [28].

Let $Z = [Z_1, \ldots, Z_R]$ and $\bar{f} = [\bar{f}_1, \ldots, \bar{f}_R]$. In a three-rule TSK fuzzy system for a low-dimensional task, if $Z = [-0.1, -0.5, -0.3]$, then $\bar{f} = [0.4018, 0.2693, 0.3289]$. As the dimensionality increases, $Z$ may decrease to, for example, $[-10, -50, -30]$, and then $\bar{f} = [1, 4 \times 10^{-18}, 2 \times 10^{-9}]$, which means the final prediction is dominated by one rule. In other words, $\bar{f}$ in (8) with high-dimensional inputs tends to give a non-zero firing level only to the rule with the maximum $Z_r$.

In order to avoid numeric underflow, we compute the normalized firing level by a common trick:

$$\bar{f}_r(x) = \frac{\exp(Z_r - Z_{\max})}{\sum_{i=1}^{R} \exp(Z_i - Z_{\max})}, \quad (9)$$

where $Z_{\max} = \max(Z_1, \ldots, Z_R)$. In this paper, we consider that a rule is fired by $x$ when the corresponding normalized firing level $\bar{f}_r(x) > 10^{-4}$.
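The three-rule saturation example above can be reproduced numerically. The following sketch (ours, using numpy) evaluates the softmax of (8) for the two $Z$ vectors given in the text, using the max-subtraction trick of (9) for numerical stability:

```python
import numpy as np

def softmax(z):
    # Subtract the maximum, as in (9); this leaves the result of (8) unchanged
    # but prevents numeric underflow/overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Low-dimensional case: all three rules receive comparable weight.
z_low = np.array([-0.1, -0.5, -0.3])
print(softmax(z_low))   # ~[0.4018, 0.2693, 0.3289]

# High-dimensional case: |Z_r| grows with D and a single rule dominates.
z_high = np.array([-10.0, -50.0, -30.0])
print(softmax(z_high))  # ~[1, 4e-18, 2e-9]
```

With the threshold $\bar{f}_r(x) > 10^{-4}$, all three rules fire in the first case, but only one fires in the second.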
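The forward pass of (1)–(5), with the normalized firing levels computed in log space as in (8)–(9), can be sketched in a few lines of numpy. The variable names mirror the symbols in the text ($m_{r,d}$, $\sigma_{r,d}$, $b_{r,d}$), but the random toy parameters and data are our own illustration, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
R, D = 3, 5                        # number of rules, input dimensionality
m = rng.normal(size=(R, D))        # rule centers m_{r,d}
sigma = np.ones((R, D))            # MF standard deviations sigma_{r,d}
b = rng.normal(size=(R, D + 1))    # consequent parameters b_{r,0..D}

def tsk_forward(x):
    # Log firing levels: Z_r = -sum_d (x_d - m_{r,d})^2 / (2 sigma_{r,d}^2)
    Z = -np.sum((x - m) ** 2 / (2 * sigma ** 2), axis=1)
    # Normalized firing levels via the max-subtraction trick of (9)
    fbar = np.exp(Z - Z.max())
    fbar /= fbar.sum()
    # Rule consequents y_r(x) = b_{r,0} + sum_d b_{r,d} x_d, as in (1)
    y_r = b[:, 0] + b[:, 1:] @ x
    # Defuzzification (5): firing-level-weighted sum of rule outputs
    return np.sum(fbar * y_r), fbar

x = rng.normal(size=D)
y, fbar = tsk_forward(x)
print(y, (fbar > 1e-4).sum())      # prediction and number of fired rules
```

In a trainable implementation, `m` would be initialized by $k$-means cluster centers and `b` by He initialization, and all three parameter groups would be updated by MBGD, as described above.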
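To see the upper bound on fired rules emerge as $D$ grows, one can count, on random standardized inputs, how many rules exceed the $10^{-4}$ firing threshold at different dimensionalities. This is our own toy simulation under assumed settings ($\sigma_{r,d} = 1$, centers drawn from the data in place of $k$-means), not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def fired_rules(D, R=30, n=200, threshold=1e-4):
    """Average number of rules with normalized firing level > threshold."""
    X = rng.normal(size=(n, D))                 # standardized toy inputs
    m = X[rng.choice(n, size=R, replace=False)] # centers picked from the data
    counts = []
    for x in X:
        Z = -0.5 * np.sum((x - m) ** 2, axis=1) # Z_r with sigma_{r,d} = 1
        f = np.exp(Z - Z.max())                 # stable softmax, as in (9)
        f /= f.sum()
        counts.append((f > threshold).sum())
    return float(np.mean(counts))

for D in (5, 50, 500):
    print(D, fired_rules(D))
```

Because $|Z_r|$ is a sum of $D$ nonnegative terms, the gaps between the rules' $Z_r$ values widen with $D$, and the average number of fired rules shrinks toward one, which is the saturation phenomenon analyzed above.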