
An Effective Multi-Resolution Hierarchical Granular Representation based Classifier using General Fuzzy Min-Max Neural Network

Thanh Tung Khuat, Fang Chen, and Bogdan Gabrys, Senior Member, IEEE

T.T. Khuat (email: [email protected]) and B. Gabrys (email: [email protected]) are with the Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia. F. Chen (email: [email protected]) is with the Data Science Centre, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia.

arXiv:1905.12170v3 [cs.LG] 3 Dec 2019

Abstract—Motivated by the practical demands for simplification of data towards being consistent with human thinking and problem solving as well as tolerance of uncertainty, information granules are becoming important entities in data processing at different levels of data abstraction. This paper proposes a method to construct classifiers from multi-resolution hierarchical granular representations (MRHGRC) using hyperbox fuzzy sets. The proposed approach forms a series of granular inferences hierarchically through many levels of abstraction. An attractive characteristic of our classifier is that it can maintain a high accuracy in comparison to other fuzzy min-max models at a low degree of granularity based on reusing the knowledge learned from lower levels of abstraction. In addition, our approach can reduce the data size significantly as well as handle the uncertainty and incompleteness associated with data in real-world applications. The construction process of the classifier consists of two phases. The first phase is to formulate the model at the greatest level of granularity, while the later stage aims to reduce the complexity of the constructed model and deduce it from data at higher abstraction levels. Experimental analyses conducted comprehensively on both synthetic and real datasets indicated the efficiency of our method in terms of training time and predictive performance in comparison to other types of fuzzy min-max neural networks and common machine learning algorithms.

Index Terms—Information granules, granular computing, hyperbox, general fuzzy min-max neural network, classification, hierarchical granular representation.

I. INTRODUCTION

HIERARCHICAL problem solving, where the problems are analyzed in a variety of granularity degrees, is a typical characteristic of the human brain [1]. Inspired by this ability, granular computing was introduced. One of the critical features of granular computing is to model the data as high-level abstract structures and to tackle problems based on these representations similar to structured human thinking [2]. Information granules (IGs) [3] are underlying constructs of granular computing. They are abstract entities describing important properties of numeric data and formulating knowledge pieces from data at a higher abstraction level. They play a critical role in the concise description and abstraction of numeric data [4]. Information granules have also contributed to quantifying the limited numeric precision in data [5]. Utilizing information granules is one of the problem-solving methods based on decomposing a big problem into sub-tasks which can be solved individually. In the world of big data, one regularly departs from specific data entities and discovers general rules from data via encapsulation and abstraction. The use of information granules is meaningful when tackling the five Vs of big data [6], i.e., volume, variety, velocity, veracity, and value. The granulation process gathering similar data together contributes to reducing the data size, and so the volume issue is addressed. The information from many heterogeneous sources can be granulated into various granular constructs, and then several measures and rules for uniform representation are proposed to fuse base information granules, as shown in [7]. Hence, the data variety is addressed. Several studies constructed evolving information granules to adapt to the changes in the streams of data, as in [8]. The variations of information granules in a high-speed data stream assist in tackling the velocity problem of big data. The process of forming information granules is often associated with the removal of outliers and dealing with incomplete data [6]; thus the veracity of data is guaranteed. Finally, the multi-resolution hierarchical architecture of various granular levels can disregard some irrelevant features but highlight facets of interest [9]. In this way, the granular representation may help with cognitive demands and capabilities of different users.

A multi-dimensional hyperbox is a fundamental conceptual vehicle to represent information granules. Each fuzzy min-max hyperbox is determined by the minimum and maximum points and a fuzzy membership function. A classifier can be built from the hyperbox fuzzy sets along with an appropriate training algorithm. We can extract a rule set directly from hyperbox fuzzy sets or by using them in combination with other methods such as decision trees [10] to account for the predictive results. However, a limitation of hyperbox-based classifiers is that their accuracy at the low level of granularity (corresponding to large-sized hyperboxes) decreases. In contrast, classifiers at the high granularity level are more accurate, but the building process of classifiers at this level is time-consuming, and it is difficult to extract a rule set interpretable for predictive outcomes because of the high complexity of resulting models. Hence, it is desired to construct a simple classifier with high accuracy. In addition, we expect to observe the change in the predictive results at different data abstraction levels. This paper proposes a method of constructing a high-precision classifier at a high data abstraction level based on the knowledge learned from lower abstraction levels. On the basis of classification errors on the validation set, we can predict the change in the accuracy of the constructed classifier on unseen data, and we can select an abstraction level satisfying both acceptable accuracy and simple architecture of the resulting classifier. Furthermore, our method is likely to expand to large-sized datasets due to the capability of parallel execution during the constructing process of core hyperboxes at the highest level of granularity.

In our method, the algorithm starts with a relatively small value of the maximum hyperbox size (θ) to produce base hyperbox fuzzy sets, and then this threshold is increased in succeeding levels of abstraction whose inputs are the hyperbox fuzzy sets formed in the previous step. By using many hierarchical resolutions of granularity, the information captured in earlier steps is transferred to the classifier at the next level. Therefore, the classification accuracy is still maintained at an acceptable value when the resolution of training data is low.

Data generated from complex real-world applications frequently change over time, so the machine learning models used to predict behaviors of such systems need an efficient online learning capability. Many studies considered the online learning capability when building machine learning models, such as [11], [12], [13], [14], [15], [16], and [17]. The fuzzy min-max neural network proposed by Simpson [11] and many of its improved variants only work on input data in the form of points. In practice, due to the uncertainty and some abnormal behaviors in the systems, the input data include not only crisp points but also intervals. To address this problem, Gabrys and Bargiela [12] introduced the general fuzzy min-max (GFMM) neural network, which can handle both fuzzy and crisp input samples. By using hyperbox fuzzy sets for the input layer, this model can accept input patterns in the granular form and process data at a high-level abstract structure. As a result, our proposed method uses a similar mechanism to the general fuzzy min-max neural network to build a series of classifiers through different resolutions, where the small-sized hyperbox fuzzy sets generated in the previous step become the input to be handled at a higher level of abstraction (corresponding to a higher value of the allowable hyperbox size).

Going through different resolution degrees, the valuable information in the input data is fuzzified and reduced in size, but our method helps to preserve the amount of knowledge contained in the original datasets. This capability is illustrated via the slow decline in the classification accuracy. In some cases, the predictive accuracy increases at higher levels of abstraction because the noise existing in the detailed levels is eliminated.

Building on the principles of developing GFMM classifiers with good generalization performance discussed in [18], this paper employs different hierarchical representations of granular data with various hyperbox sizes to select a compact classifier with acceptable accuracy at a high level of abstraction. Hierarchical granular representations using consecutive maximum hyperbox sizes form a set of multi-resolution hyperbox-based models, which can be used to balance the trade-off between efficiency and simplicity of the classifiers. A model with high resolution corresponds to the use of a small value of the maximum hyperbox size, and vice versa. A choice of a suitable resolution level results in better predictive accuracy of the generated model. Our main contributions in this paper can be summarized as follows:
• We propose a new data classification model based on the multi-resolution of granular data representations in combination with the online learning ability of the general fuzzy min-max neural network.
• The proposed method is capable of reusing the learned knowledge from the highest granularity level to construct new classifiers at higher abstraction levels with a low trade-off between simplification and accuracy.
• The efficiency and running time of the general fuzzy min-max classifier are significantly enhanced in the proposed algorithm.
• Our classifier can perform on large-sized datasets because of its parallel execution ability.
• Comprehensive experiments are conducted on synthetic and real datasets to prove the effectiveness of the proposed method compared to other approaches and baselines.

The rest of this paper is organized as follows. Section II presents existing studies related to information granules as well as briefly describes the online learning version of the general fuzzy min-max neural network. Section III shows our proposed method to construct a classifier based on data granulation. Experimental configuration and results are presented in Section IV. Section V concludes the main findings and discusses some open directions for future works.

II. PRELIMINARIES

A. Related Work

There are many approaches to representing information granules [19]. Several typical methods include intervals [20], fuzzy sets [21], shadowed sets [22], and rough sets [23]. Our study only focuses on fuzzy sets and intervals. Therefore, related works only mention the granular representation using these two methods.

The existing studies on granular data representation have deployed a specific clustering technique to find representative prototypes, and then build information granules from these prototypes and optimize the constructed granular descriptors. The principle of justifiable granularity [24] has usually been utilized to optimize the construction of information granules from available experimental evidence. This principle aims to make a good balance between the coverage and specificity properties of the resulting granule concerning the available data. The coverage property relates to how much data is located inside the constructed information granule, whereas the specificity of a granule is quantified by the length of its interval such that the shorter the interval, the better the specificity. Pedrycz and Homenda [24] made a compromise between these two characteristics by finding the parameters to maximize the product of the coverage and specificity.
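To make the coverage-specificity compromise concrete, the short sketch below scores a one-dimensional interval granule built around a numeric prototype and picks the radius maximizing the product of the two criteria. It is only an illustration of the principle of justifiable granularity on synthetic data; the linear specificity measure, the data, and the function names are our own choices, not those of [24].

```python
import numpy as np

def coverage(data, a, b):
    """Fraction of the data falling inside the interval granule [a, b]."""
    return np.mean((data >= a) & (data <= b))

def specificity(a, b, lo, hi):
    """1 for a degenerate interval, 0 when [a, b] spans the whole data range."""
    return 1.0 - (b - a) / (hi - lo)

data = np.random.default_rng(0).normal(0.5, 0.1, 1000)
prototype, lo, hi = np.median(data), data.min(), data.max()

# pick the symmetric radius around the prototype that maximizes coverage * specificity
radii = np.linspace(0.0, (hi - lo) / 2, 200)
scores = [coverage(data, prototype - r, prototype + r) *
          specificity(prototype - r, prototype + r, lo, hi) for r in radii]
best = radii[int(np.argmax(scores))]
print(f"granule: [{prototype - best:.3f}, {prototype + best:.3f}]")
```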

Instead of just stopping at the numeric prototypes, partition matrices, or dendrograms for data clustering, Pedrycz and Bargiela [25] offered a concept of granular prototypes to capture more details of the structure of the data to be clustered. Granular prototypes were formed around the resulting numeric prototypes of clustering algorithms by using some degree of granularity. Information granularity is allocated to each numeric prototype to maximize the quality of the granulation-degranulation process of generated granules. This process was also built as an optimization problem steered by the coverage criteria, i.e., maximization of the original number of data included in the information granules after degranulation.

In [5], Pedrycz developed an idea of granular models derived from the establishment of optimal allocation of information granules. The authors gave motivation and plausible explanations in bringing the numeric models to the next abstraction levels to form granular models. In the realization flow of the general principles, Pedrycz et al. [4] introduced a holistic process of constructing information granules through a two-phase procedure in a general framework. The first phase focuses on formulating numeric prototypes using fuzzy C-means, and the second phase refines each resulting prototype to form a corresponding information granule by employing the principle of justifiable granularity.

When the problem becomes complicated, one regularly splits it into smaller sub-tasks and deals with each sub-task on a single level of granularity. These actions give rise to the appearance of multi-granularity computing, which aims to tackle problems from many levels of different IGs rather than just one optimal granular layer. Wang et al. [1] conducted a review of previous studies of granular computing and claimed that multi-granularity joint problem resolving is a valuable research direction to enhance the quality and efficiency of solutions based on using multiple levels of information granules rather than only one granular level. This is the primary motivation for our study to build suitable classifiers from many resolutions of granular data representations.

All the above methods of building information granules are based on clustering techniques and affected by a pre-determined parameter, i.e., the number of clusters. The resulting information granules are only a summarization of the original data at a higher abstraction level, and they did not use the class information in the constructing process of granules. The authors have not used the resulting granules to deal with classification problems either. Our work is different from these studies because we propose a method to build classifiers from various abstraction levels of data using hyperbox fuzzy sets while maintaining the reasonable stability of classification results. In addition, our method can learn useful information from data through an online approach and the continuous adjustment of the existing structure of the model.

In the case of formulating models in a non-stationary environment, it is essential to endow them with some mechanisms to deal with the dynamic environment. In [26], Sahel et al. assessed two adaptive methods to tackle data drifting problems, i.e., retraining models using evolving data and deploying incremental learning algorithms. Although these approaches improved the accuracy of classifiers compared to non-adaptive learners, the authors indicated a great demand for building robust techniques with high reliability for dynamic operating environments. To meet the continuous changes in data and the adaptation of the analytic system to this phenomenon, Peters and Weber [27] suggested a framework, namely Dynamic Clustering Cube, to classify dynamic granular clustering methods. Al-Hmouz et al. [8] introduced evolvable data models through the dynamic changing of temporal or spatial IGs. The number of information granules formed from prototypes of data can increase or decrease through merging or splitting existing granules according to the varying complexity of data streams. In addition to the ability to merge existing hyperboxes for the construction of granular hierarchies of hyperboxes, our proposed method also has the online learning ability, so it can be used to tackle classification problems in a dynamic environment.

Although this study is built based on the principles of GFMM classifiers, it differs from the GFMM neural network with adaptive hyperbox sizes [12]. In [12], the classifier was formed at the high abstraction level with large-sized hyperboxes, and then the process of traversing the entire training dataset was repeated many times to build additional hyperboxes at lower abstraction levels with the aim of covering patterns missed by the large-sized hyperboxes due to the contraction procedure. This operation can make the final classifier complex, with a large number of hyperboxes at different levels of granularity coexisting in a single classifier, and the overfitting phenomenon on the training set is more likely to happen. In contrast, our method begins with the construction process of the classifier at the highest resolution of training patterns with small-sized hyperboxes to capture detailed information and relationships among data points located in the vicinity of each other. After that, at higher levels of abstraction, we do not use the data points from the training set. Instead, we reuse the hyperboxes generated from the preceding step. For each input hyperbox, the membership values with the hyperboxes in the current step are computed, and if the highest membership degree is larger than a pre-defined threshold, the aggregation process is performed to form hyperboxes with higher abstraction degrees. Our research is also different from the approach presented in [28], where the incremental algorithm was employed to reduce the data complexity by creating small-sized hyperboxes, and then these hyperboxes were used as training inputs of an agglomerative learning algorithm with a higher abstraction level. The method in [28] only constructs the classifier with two abstraction levels, while our algorithm can generate a series of classifiers at hierarchical resolutions of abstraction levels. In addition, the agglomerative learning in [28] is time-consuming, especially on large-sized datasets. Therefore, when the number of hyperboxes generated by the incremental learning algorithm on large-sized training sets is large, the agglomerative learning algorithm takes a long time to formulate hyperboxes. On the contrary, our method takes advantage of the incremental learning ability to rapidly build classifiers through different levels of the hierarchical resolutions.

B. General Fuzzy Min-Max Neural Network

The general fuzzy min-max (GFMM) neural network [12] is a generalization and extension of the fuzzy min-max neural network (FMNN) [11]. It combines both classification and clustering in a unified framework and can deal with both fuzzy and crisp input samples. The architecture of the general fuzzy min-max neural network comprises three layers, i.e., an input layer, a hyperbox fuzzy set layer, and an output (class) layer, shown in Fig. 1.

[Fig. 1 diagram: 2·n input nodes x_1^l, ..., x_n^l and x_1^u, ..., x_n^u connected to hyperbox nodes B_1, ..., B_m through the minimum-point matrix V and the maximum-point matrix W, with the hyperbox nodes connected to class nodes c_0, c_1, ..., c_p through the binary matrix U.]

Fig. 1. The architecture of GFMM neural network.

The input layer contains 2 · n nodes, where n is the number of dimensions of the problem, to fit with the input sample X = [X^l, X^u], determined within the n-dimensional unit cube I^n. The first n nodes in the input layer are connected to the m nodes of the second layer, which contains hyperbox fuzzy sets, by the minimum points matrix V. The remaining n nodes are linked to the m nodes of the second layer by the maximum points matrix W. The values of the two matrices V and W are adjusted during the learning process. Each hyperbox B_i is defined by an ordered set: B_i = {X, V_i, W_i, b_i(X, V_i, W_i)}, where V_i ⊂ V, W_i ⊂ W are the minimum and maximum points of hyperbox B_i respectively, and b_i is the membership value of hyperbox B_i in the second layer; it is also the transfer function between input nodes in the first layer and hyperbox nodes in the second layer. The membership function b_i is computed using (1) [12].

b_i(X, V_i, W_i) = \min_{j=1\ldots n}\Big(\min\big([1 - f(x_j^u - w_{ij}, \gamma_j)],\ [1 - f(v_{ij} - x_j^l, \gamma_j)]\big)\Big)   (1)

where

f(r, \gamma) = \begin{cases} 1, & \text{if } r\gamma > 1 \\ r\gamma, & \text{if } 0 \le r\gamma \le 1 \\ 0, & \text{if } r\gamma < 0 \end{cases}

is the threshold function and γ = [γ_1, ..., γ_n] is a sensitivity parameter regulating the speed of decrease of the membership values.

Hyperboxes in the middle layer are fully connected to the third-layer nodes by a binary valued matrix U. The elements in the matrix U are computed as follows:

u_{ij} = \begin{cases} 1, & \text{if hyperbox } B_i \text{ represents class } c_j \\ 0, & \text{otherwise} \end{cases}   (2)

where B_i is the hyperbox of the second layer, and c_j is the jth node in the third layer. The output of each node c_j in the third layer is a membership degree to which the input pattern X fits within the class j. The transfer function of each node c_j among the p + 1 nodes belonging to the third layer is computed as:

c_j = \max_{i=1}^{m} b_i \cdot u_{ij}   (3)

Node c_0 is connected to all unlabeled hyperboxes of the middle layer. The values of the nodes in the output layer can be fuzzy if they are computed directly from (3), or crisp in the case that the node with the largest value is assigned to 1 and the others get a value of zero [12].

The incremental learning algorithm for the GFMM neural network to adjust the values of the two matrices V and W includes four steps, i.e., initialization, expansion, hyperbox overlap test, and contraction, in which the last three steps are repeated. In the initialization stage, each hyperbox B_i is initialized with V_i = 1 and W_i = 0. For each input sample X, the algorithm finds the hyperbox B_i with the highest membership value representing the same class as X to verify two expansion conditions, i.e., maximum allowable hyperbox size and class label compatibility, as shown in the supplemental file. If both criteria are met, the selected hyperbox is expanded. If no hyperbox meets the expansion conditions, a new hyperbox is created to accommodate the input data. Otherwise, if the hyperbox B_i was expanded in the prior step, it will be checked for an overlap with other hyperboxes B_k as follows. If the class of B_i is equal to zero, then the hyperbox B_i must be checked for overlap with all existing hyperboxes; otherwise, the overlap checking is only performed between B_i and hyperboxes B_k representing other classes. If an overlap occurs, a contraction process is carried out to remove the overlapping area by adjusting the sizes of the hyperboxes according to the dimension with the smallest overlapping value. Four cases of the overlap checking and contraction procedures were presented in detail in the supplemental file and can be found in [12].
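For illustration, the following sketch evaluates the membership function (1) and the class-node outputs (2)-(3) with NumPy. It is a reimplementation from the formulas above rather than the authors' released code; the array layout (one row per hyperbox) and the default γ = 1 are assumptions.

```python
import numpy as np

def threshold(r, gamma):
    """f(r, gamma) from (1): ramp function clipped to [0, 1]."""
    return np.minimum(1.0, np.maximum(0.0, r * gamma))

def membership(xl, xu, V, W, gamma=1.0):
    """b_i(X, V_i, W_i) in (1) for an input box X = [xl, xu] against all m
    hyperboxes with minimum points V (m x n) and maximum points W (m x n)."""
    upper = 1.0 - threshold(xu[None, :] - W, gamma)   # penalty for exceeding each max point
    lower = 1.0 - threshold(V - xl[None, :], gamma)   # penalty for falling below each min point
    return np.minimum(upper, lower).min(axis=1)       # min over the n dimensions

def class_outputs(b, U):
    """c_j = max_i b_i * u_ij in (3); U is the m x (p+1) binary class matrix of (2)."""
    return (b[:, None] * U).max(axis=0)

# toy example: two 2-D hyperboxes of classes 1 and 2, and one crisp input point
V = np.array([[0.1, 0.1], [0.6, 0.6]])
W = np.array([[0.3, 0.3], [0.9, 0.9]])
U = np.array([[0, 1, 0], [0, 0, 1]])   # columns: c_0 (unlabelled), c_1, c_2
x = np.array([0.32, 0.28])
print(class_outputs(membership(x, x, V, W), U))   # [0.   0.98 0.68]
```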

III. PROPOSED METHODS

A. Overview

The learning process of the proposed method consists of two phases. The first phase is to rapidly construct small-sized hyperbox fuzzy sets from similar input data points. This phase is performed in parallel on training data segments. The data in each fragment can be organized according to two modes. The first way is called the heterogeneous mode, which uses the data order read from the input file. The second mode is homogeneous, in which the data are sorted according to groups; each group contains data from the same class. The main purpose of the second phase is to decrease the complexity of the model by reconstructing the phase-1 hyperboxes with a higher abstraction level.
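The first phase lends itself to a simple data-parallel skeleton such as the sketch below, which distributes chunks of the training data to workers and switches between the heterogeneous and homogeneous orderings. The build_phase1_hyperboxes worker is a hypothetical placeholder for the modified GFMM procedure of phase 1, not code from the paper.

```python
import numpy as np
from multiprocessing import Pool

def build_phase1_hyperboxes(args):
    """Placeholder for the modified GFMM procedure F_j run on one worker."""
    X_chunk, y_chunk, theta0 = args
    return []   # would return the small hyperboxes built with maximum size theta0

def phase1(X, y, theta0=0.1, n_workers=4, homogeneous=False):
    if homogeneous:                       # group the samples by class label first
        order = np.argsort(y, kind="stable")
        X, y = X[order], y[order]
    chunks = [(Xc, yc, theta0) for Xc, yc in
              zip(np.array_split(X, n_workers), np.array_split(y, n_workers))]
    with Pool(n_workers) as pool:
        parts = pool.map(build_phase1_hyperboxes, chunks)
    return [box for part in parts for box in part]   # merged before pruning
```

On platforms that spawn processes, a call to phase1 should be placed under an `if __name__ == "__main__":` guard.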

[Fig. 2 diagram: in phase 1, the training data are loaded in chunks and distributed to workers; each worker constructs hyperbox fuzzy sets by incremental learning and pattern centroid building, using either the heterogeneous (no grouping) or the homogeneous (group by labels) ordering and the phase-1 maximum hyperbox size; the workers' hyperboxes are then merged, contained hyperboxes are deleted, and pruning produces the phase-1 hyperbox fuzzy sets. In phase 2, these hyperboxes are passed incrementally through expansion/creation of hyperboxes and overlap resolving, controlled by the maximum hyperbox size and the minimum membership threshold, to build the phase-2 hyperbox fuzzy sets.]

Fig. 2. Pipeline of the training process of the proposed method.

In the first step of the training process, the input samples are split into disjoint sets and are then distributed to different computational workers. On each worker, we build an independent general fuzzy min-max neural network. When all training samples are handled, all created hyperboxes at different workers are merged to form a single model. Hyperboxes completely included in other hyperboxes representing the same class are eliminated to reduce the redundancy and complexity of the final model. After combining hyperboxes, the pruning procedure needs to be applied to eliminate noise and low-quality hyperboxes. The resulting hyperboxes are called phase-1 hyperboxes.

However, phase-1 hyperboxes have small sizes, so the complexity of the system can be high. As a result, all these hyperboxes are put through phase 2 of the granulation process with a gradual increase in the maximum hyperbox sizes. At a larger value of the maximum hyperbox size, hyperboxes at a low level of abstraction will be reconstructed with a higher data abstraction degree. Previously generated hyperboxes are fetched one at a time, and they are aggregated with newly constructed hyperboxes at the current granular representation level based on a similarity threshold of the membership degree. This process is repeated for each input value of the maximum hyperbox sizes. The whole process of the proposed method is shown in Fig. 2. Based on the classification error of the resulting classifiers on the validation set, one can select an appropriate predictor satisfying both simplicity and precision. The following part provides the core concepts for both phases of our proposed method in the form of mathematical descriptions. The details of Algorithm 1 for phase 1 and Algorithm 2 corresponding to phase 2, as well as their implementation aspects, are shown in the supplemental material. The readers can refer to this document to find more about the free text descriptions, pseudo-codes, and implementation pipeline of the algorithms.

B. Formal Description

Consider a training set of N_Tr data vectors, X^(Tr) = {X_i^(Tr) : X_i^(Tr) ∈ R^n, i = 1, ..., N_Tr}, and the corresponding classes, C^(Tr) = {c_i^(Tr) : c_i^(Tr) ∈ N, i = 1, ..., N_Tr}; and a validation set of N_V data vectors, X^(V) = {X_i^(V) : X_i^(V) ∈ R^n, i = 1, ..., N_V}, and the corresponding classes, C^(V) = {c_i^(V) : c_i^(V) ∈ N, i = 1, ..., N_V}. The details of the method are formally described as follows.

1) Phase 1:

Let n_w be the number of workers to execute the hyperbox construction process in parallel. Let F_j(X_j^(Tr), C_j^(Tr), θ_0) be the procedure to construct hyperboxes on the jth worker with maximum hyperbox size θ_0 using training data {X_j^(Tr), C_j^(Tr)}. Procedure F_j is a modified fuzzy min-max neural network model which only creates new hyperboxes or expands existing hyperboxes. It accepts the overlapping regions among hyperboxes representing different classes, because we expect to rapidly capture similar samples and group them into specific clusters by small-sized hyperboxes without spending much time on the computationally expensive hyperbox overlap test and resolving steps. Instead, each hyperbox B_i is given a centroid G_i of the patterns contained in that hyperbox and a counter N_i to store the number of data samples covered by it, in addition to its maximum and minimum points. This information is used to classify data points located in the overlapping regions. When a new pattern X is presented to the classifier, the operation of building the pattern centroid for each hyperbox (line 12 and line 15 in Algorithm 1) is performed according to (4).

G_i^{new} = \frac{N_i \cdot G_i^{old} + X}{N_i + 1}   (4)

where G_i is the sample centroid of the hyperbox B_i, and N_i is the number of current samples included in B_i. Next, the number of samples is updated: N_i = N_i + 1. It is noted that G_i is the same as the first pattern covered by the hyperbox B_i when B_i is newly created.

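The centroid bookkeeping in (4) can be sketched as follows; the Hyperbox class and attribute names are illustrative, and the maximum-hyperbox-size check of procedure F_j is omitted for brevity.

```python
import numpy as np

class Hyperbox:
    """Minimal phase-1 bookkeeping: min/max points plus centroid G and counter N."""
    def __init__(self, x, label):
        self.V = np.array(x, dtype=float)   # minimum point
        self.W = np.array(x, dtype=float)   # maximum point
        self.label = label
        self.G = np.array(x, dtype=float)   # centroid equals the first covered pattern
        self.N = 1

    def add_pattern(self, x):
        """Expand the box to cover x and update the centroid with (4)."""
        x = np.asarray(x, dtype=float)
        self.V = np.minimum(self.V, x)
        self.W = np.maximum(self.W, x)
        self.G = (self.N * self.G + x) / (self.N + 1)
        self.N += 1

b = Hyperbox([0.20, 0.40], label=1)
b.add_pattern([0.30, 0.20])
print(b.V, b.W, b.G, b.N)   # [0.2 0.2] [0.3 0.4] [0.25 0.3 ] 2
```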
After the process of building hyperboxes in all workers finishes, the merging step is conducted (lines 19-23 in Algorithm 1), and it is mathematically represented as:

M = \bigcup_{j=1}^{n_w} \{B_i \mid B_i \in F_j(X_j^{(Tr)}, C_j^{(Tr)}, \theta_0)\}   (5)

where M is the model after performing the merging procedure. It is noted that hyperboxes contained in larger hyperboxes representing the same class are eliminated (line 24 in Algorithm 1), and the centroids of the larger hyperboxes are updated using (6).

G_1^{new} = \frac{N_1 \cdot G_1^{old} + N_2 \cdot G_2^{old}}{N_1 + N_2}   (6)

where G_1 and N_1 are the centroid and the number of samples of the larger sized hyperbox, and G_2 and N_2 are the centroid and the number of samples of the smaller sized hyperbox. The number of samples in the larger sized hyperbox is also updated: N_1 = N_1 + N_2. This whole process is similar to the construction of an ensemble classifier at the model level shown in [29].

The pruning step is performed after merging hyperboxes to remove noise and low-quality hyperboxes (line 26 in Algorithm 1). Mathematically, it is defined as:

H_0 = \begin{cases} M_1 = M \setminus \{B_k \mid A_k < \alpha \lor A_k = Nil\}, & \text{if } E_V(M_1) \le E_V(M_2) \\ M_2 = M \setminus \{B_k \mid A_k < \alpha\}, & \text{otherwise} \end{cases}   (7)

where H_0 is the final model of stage 1 after applying the pruning operation, E_V(M_i) is the classification error of the model M_i on the validation set {X^(V), C^(V)}, α is the minimum accuracy of each hyperbox to be retained, and A_k is the predictive accuracy of hyperbox B_k ∈ M on the validation set, defined as follows:

A_k = \frac{\sum_{j=1}^{S_k} R_{kj}}{\sum_{j=1}^{S_k} (R_{kj} + I_{kj})}   (8)

where S_k is the number of validation samples classified by hyperbox B_k, R_{kj} is the number of samples predicted correctly by B_k, and I_{kj} is the number of incorrectly predicted samples. If S_k = 0, then A_k = Nil.

The classification step of unseen samples using model H_0 is performed in the same way as in the GFMM neural network, with an exception in the case of many winning hyperboxes with the same maximum membership value. In such a case, we compute the Euclidean distance from the input sample X to the centroids G_i of the winning hyperboxes B_i using (9). If the input sample is a hyperbox, X is the coordinate of the center point of that hyperbox.

d(X, G_i) = \sqrt{\sum_{j=1}^{n} (x_j - G_{ij})^2}   (9)

The input pattern is then classified to the hyperbox B_i with the minimum value of d(X, G_i).

2) Phase 2:

Unlike phase 1, the input data in this phase are the hyperboxes generated in the previous step. The purpose of stage 2 is to reduce the complexity of the model by aggregating hyperboxes created at a higher resolution level of granular data representations. At the high level of data abstraction, the confusion among hyperboxes representing different classes needs to be removed. Therefore, the overlapping regions formed in phase 1 have to be resolved, and there is no overlap allowed in this phase. Phase 2 can be mathematically represented as:

H_H(\Theta, m_s) = \{H_i \mid H_i = G(H_{i-1}, \theta_i, m_s)\}, \ \forall i \in [1, |\Theta|], \ \theta_i \in \Theta   (10)

where H_H is a list of models H_i constructed through different levels of granularity represented by maximum hyperbox sizes θ_i, Θ is a list of maximum hyperbox sizes, |Θ| is the cardinality of Θ, m_s is the minimum membership degree of two aggregated hyperboxes, G is a procedure to construct the models in phase 2 (it uses the model at the previous step as input), and H_0 is the model built in phase 1. The aggregation rule of hyperboxes, G, is described as follows:

For each input hyperbox B_h in H_{i-1}, the membership values between B_h and all existing hyperboxes with the same class as B_h in H_i are computed. We select the winner hyperbox with the maximum membership degree with respect to B_h, denoted B_k, to aggregate with B_h. The following constraints are verified before conducting the aggregation:
• Maximum hyperbox size:

\max(w_{hj}, w_{kj}) - \min(v_{hj}, v_{kj}) \le \theta_i, \ \forall j \in [1, n]   (11)

• The minimum membership degree:

b(B_h, B_k) \ge m_s   (12)

• Overlap test. The new hyperbox aggregated from B_h and B_k does not overlap with any existing hyperboxes in H_i belonging to other classes.

If hyperbox B_k has not met all of the above conditions, the hyperbox with the next highest membership value is selected, and the process is repeated until the aggregation step occurs or no hyperbox candidate is left. If the input hyperbox cannot be merged with existing hyperboxes in H_i, it will be directly inserted into the current list of hyperboxes in H_i. After that, the overlap test operation between the newly inserted hyperbox and hyperboxes in H_i representing other classes is performed, and then the contraction process will be executed to resolve overlapping regions. The algorithm is iterated for all input hyperboxes in H_{i-1}.

The classification process for unseen patterns using the hyperboxes in phase 2 is realized as in the GFMM neural network. A detailed description of the implementation steps for the proposed method can be found in the supplemental material.
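The aggregation constraints (11) and (12) can be checked as in the sketch below. The membership and overlap tests are passed in as callables because their details (equation (1) and the four overlap cases) are given elsewhere; the function names are illustrative, not an API from the paper.

```python
import numpy as np

def can_aggregate(Vh, Wh, Vk, Wk, theta, ms, membership_fn, overlap_free_fn):
    """Return True if hyperboxes B_h and B_k may be merged at granularity level theta."""
    V_new = np.minimum(Vh, Vk)
    W_new = np.maximum(Wh, Wk)
    # (11): the merged box must not exceed theta in any dimension
    if np.any(W_new - V_new > theta):
        return False
    # (12): the membership degree between the two boxes must reach ms
    if membership_fn(Vh, Wh, Vk, Wk) < ms:
        return False
    # overlap test: the merged box must not overlap boxes of other classes
    return overlap_free_fn(V_new, W_new)

# toy usage with stand-in callables
ok = can_aggregate(np.array([0.10, 0.10]), np.array([0.20, 0.20]),
                   np.array([0.15, 0.20]), np.array([0.30, 0.35]),
                   theta=0.3, ms=0.4,
                   membership_fn=lambda *_: 1.0,        # stand-in for b(B_h, B_k)
                   overlap_free_fn=lambda V, W: True)   # stand-in for the overlap test
print(ok)   # True
```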

C. Missing Value Handling

The proposed method can deal with missing values since it inherits this characteristic from the general fuzzy min-max neural network, as shown in [30]. A missing feature x_j is assumed to be able to receive values from the whole range, and it is presented by a real-valued interval as follows: x_j^l = 1, x_j^u = 0. By this initialization, the membership value associated with the missing value will be one, and thus the missing value does not cause any influence on the membership function of the hyperbox. During the training process, only observed features are employed for the update of hyperbox minimum and maximum vertices, while missing variables are disregarded automatically. For the overlap test procedure in phase 2, only the hyperboxes which satisfy v_{ij} ≤ w_{ij} for all dimensions j ∈ [1, n] are verified for the undesired overlapping areas. The second change is related to the way of calculating the membership value for the process of hyperbox selection or the classification step of an input sample with missing values. Some hyperboxes' dimensions have not been set, so the membership function shown in (1) is changed to b_i(X, min(V_i, W_i), max(W_i, V_i)). With the use of this method, the training data uncertainty is represented in the classifier model.

IV. EXPERIMENTS

Marcia et al. [31] argued that data set selection poses a considerable impact on conclusions about the accuracy of learners, and then the authors advocated for considering the properties of the datasets in experiments. They indicated the importance of employing artificial data sets constructed based on previously defined characteristics. In these experiments, therefore, we considered two types of synthetic datasets with linear and non-linear class boundaries. We also changed the number of features, the number of samples, and the number of classes for the synthetic datasets to assess the variability in the performance of the proposed method. In practical applications, the data are usually not ideal, not following a standard distribution rule, and including noisy data. Therefore, we also carried out experiments on real datasets with diversity in the numbers of samples, features, and classes.

For medium-sized real datasets such as Letter, Magic, White wine quality, and Default of credit card clients, the density-preserving sampling (DPS) method [32] was used to separate the original datasets into training, validation, and test sets. For large-sized datasets, we used the hold-out method for splitting datasets, which is the simplest and the least computationally expensive approach to assessing the performance of classifiers, because more advanced resampling approaches are not essential for large amounts of data [32]. The classification model is then trained on the training dataset. The validation set is used for the pruning step and for evaluating the performance of the constructed classifier aiming to select a suitable model. The testing set is employed to assess the efficiency of the model on unseen data.

The experiments aim to answer the following questions:
• How is the classification accuracy of the predictor using multi-resolution hierarchical granular representations improved in comparison to the model using a fixed granulation value?
• How good is the performance of the proposed method compared with other types of fuzzy min-max neural networks and popular algorithms based on other data representations such as support vector machines, Naive Bayes, and decision tree?
• Whether we can obtain a classifier with high accuracy at high abstraction levels of granular representations?
• Whether we can rely on the performance of the model on validation sets to select a good model for unseen data, which satisfies both simplicity and accuracy?
• How good is the ability of handling missing values in datasets without requiring data imputation?
• How critical are the roles of the pruning process and the use of sample centroids?

The limitation of runtime for each execution is seven days. If an execution does not finish within seven days, it will be terminated, and the result is reported as N/A. In the experiments, we set up the parameters as follows: n_w = 4, α = 0.5, m_s = 0.4, γ = 1, because they gave the best results on a set of preliminary tests with validation sets for the parameter selection. All datasets are normalized to the range of [0, 1] because of the characteristic of the fuzzy min-max neural networks. Most of the datasets, except the SUSY dataset, utilized the value of 0.1 for θ_0 in phase 1, and Θ = {0.2, 0.3, 0.4, 0.5, 0.6} for different levels of granularity in phase 2. For the SUSY dataset, due to the complexity and the limitation of runtime for the proposed method and other compared types of fuzzy min-max neural networks, θ_0 = 0.3 was used for phase 1, and Θ = {0.4, 0.5, 0.6} was employed for phase 2. For Naive Bayes, we used the Gaussian Naive Bayes (GNB) algorithm for classification. The radial basis function (RBF) was used as a kernel function for the support vector machines (SVM). We used the default setting parameters in the scikit-learn library for Gaussian Naive Bayes, SVM, and decision tree (DT) in the experiments. The performance of the proposed method was also compared to other types of fuzzy min-max neural networks such as the original fuzzy min-max neural network (FMNN) [11], the enhanced fuzzy min-max neural network (EFMNN) [33], the enhanced fuzzy min-max neural network with a K-nearest hyperbox expansion rule (KNEFMNN) [34], and the general fuzzy min-max neural network (GFMMNN) [12]. These types of fuzzy min-max neural networks used the same pruning procedure as our proposed method.

Synthetic datasets in our experiments were generated by using Gaussian distribution functions, so Gaussian Naive Bayes and SVM with the RBF kernel, which use Gaussian distribution assumptions to classify data, will achieve nearly optimal error rates because they match perfectly with the underlying data distribution. Meanwhile, fuzzy min-max classifiers employ hyperboxes to cover the input data, thus they are not an optimal representation for the underlying data. Therefore, the accuracy of hyperbox-based classifiers on synthetic datasets cannot outperform the predictive accuracy of Gaussian NB or SVM with the RBF kernel. However, Gaussian NB is a linear classifier, and thus it only outputs highly accurate predictive results for datasets with a linear decision boundary. In contrast, decision tree, fuzzy min-max neural networks, and SVM with the RBF kernel are universal approximators, and they can deal effectively with both linear and non-linear classification problems.
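A minimal sketch of the baseline setup described above, assuming the scikit-learn defaults named in the text (Gaussian Naive Bayes, RBF-kernel SVM, and decision tree) and the [0, 1] rescaling; the helper name and the use of MinMaxScaler are illustrative choices, not a prescription from the paper.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler

def run_baselines(X_train, y_train, X_test):
    """Fit the three reference classifiers on [0, 1]-normalized data."""
    scaler = MinMaxScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    models = {
        "GNB": GaussianNB(),                 # Gaussian Naive Bayes
        "SVM": SVC(kernel="rbf"),            # RBF kernel, default C and gamma
        "DT": DecisionTreeClassifier(),      # default settings
    }
    return {name: m.fit(X_train, y_train).predict(X_test) for name, m in models.items()}
```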

All experiments were conducted on a computer with a Xeon 6150 2.7GHz CPU and 180GB RAM. We repeated each experiment five times to compute the average training time. The accuracy of the fuzzy min-max neural networks remains the same through different iterations because they only depend on the data presentation order, and we kept the same order of training samples during the experiments.

A. Performance of the Proposed Method on Synthetic Datasets

The first experiment was conducted on the synthetic datasets with the linear or non-linear boundary between classes. For each synthetic dataset, we generated a testing set containing 100,000 samples and a validation set with 10,000 instances using the same probability density function as the training sets.

1) Linear Boundary Datasets:

Increase the number of samples: We kept both the number of dimensions n = 2 and the number of classes C = 2 the same, and only the number of samples was changed to evaluate the impact of the number of patterns on the performance of classifiers. We used the Gaussian distribution to construct synthetic datasets as described in [35]. The means of the Gaussians of the two classes are given as follows: µ1 = [0, 0]^T, µ2 = [2.56, 0]^T, and the covariance matrices are as follows:

\Sigma_1 = \Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

With the use of these configuration parameters, training sets with different sizes (10K, 1M, and 5M samples) were formed. Fukunaga [35] indicated that the general Bayes error of the datasets formed from these settings is around 10%. We generated an equal number of samples for each class to remove the impact of the imbalanced class property on the performance of classifiers. Fig. 3 shows the change in the error rates of different fuzzy min-max classifiers on the testing synthetic linear boundary datasets with the different numbers of training patterns when the level of granularity (θ) changes. The other fuzzy min-max neural networks used a fixed value of θ to construct the model, while our method builds the model starting from the defined lowest value of θ (phase 1) to the current threshold. For example, the model at θ = 0.3 in our proposed method is constructed with θ = 0.1, θ = 0.2, and θ = 0.3.

It can be seen from Fig. 3 that the error rates of our method are lower than those of other fuzzy min-max classifiers, especially at high abstraction levels of granular representations. At high levels of abstraction (corresponding to high values of θ), the error rates of other classification models are relatively high, while our proposed classifier still maintains a low error rate, just a little higher than the error at a high-resolution level of granular data. The lowest error rates of the different classifiers on validation (E_V) and testing (E_T) sets, as well as the total training time for six levels of abstraction, are shown in Table I. Best results are highlighted in bold in each table. The training time reported in this paper consists of the time for reading training files and model construction.

TABLE I. THE LOWEST ERROR RATES AND TRAINING TIME OF CLASSIFIERS ON SYNTHETIC LINEAR BOUNDARY DATASETS WITH DIFFERENT NUMBERS OF SAMPLES (n = 2, C = 2)

N   | Algorithm  | min E_V | min E_T | θ_V | θ_T | Time (s)
10K | He-MRHGRC  | 10.25   | 10.467  | 0.1 | 0.1 | 1.1378
10K | Ho-MRHGRC  | 10.1    | 10.413  | 0.1 | 0.1 | 1.3215
10K | GFMM       | 11.54   | 11.639  | 0.1 | 0.1 | 8.6718
10K | FMNN       | 10.05   | 10.349  | 0.1 | 0.1 | 46.4789
10K | KNEFMNN    | 12.07   | 12.232  | 0.1 | 0.1 | 9.4459
10K | EFMNN      | 10.44   | 10.897  | 0.1 | 0.1 | 48.9892
10K | GNB        | 9.85    | 9.964   | -   | -   | 0.5218
10K | SVM        | 9.91    | 9.983   | -   | -   | 1.5468
10K | DT         | 15.33   | 14.861  | -   | -   | 0.5405
1M  | He-MRHGRC  | 10.31   | 10.386  | 0.3 | 0.3 | 20.0677
1M  | Ho-MRHGRC  | 10.24   | 10.401  | 0.1 | 0.1 | 16.0169
1M  | GFMM       | 11.47   | 11.783  | 0.1 | 0.1 | 405.4642
1M  | FMNN       | 10.98   | 11.439  | 0.2 | 0.2 | 13163.1404
1M  | KNEFMNN    | 10.36   | 10.594  | 0.1 | 0.1 | 413.8296
1M  | EFMNN      | 11.61   | 11.923  | 0.6 | 0.6 | 10845.1280
1M  | GNB        | 9.87    | 9.972   | -   | -   | 5.0133
1M  | SVM        | 9.86    | 9.978   | -   | -   | 21798.2803
1M  | DT         | 14.873  | 14.682  | -   | -   | 9.9318
5M  | He-MRHGRC  | 10.11   | 10.208  | 0.5 | 0.5 | 101.9312
5M  | Ho-MRHGRC  | 10.04   | 10.222  | 0.1 | 0.1 | 75.2254
5M  | GFMM       | 13.14   | 13.243  | 0.1 | 0.1 | 1949.2138
5M  | FMNN       | 12.68   | 12.751  | 0.6 | 0.6 | 92004.7253
5M  | KNEFMNN    | 17.31   | 17.267  | 0.1 | 0.1 | 1402.1173
5M  | EFMNN      | 12.89   | 13.032  | 0.1 | 0.1 | 41888.5296
5M  | GNB        | 9.88    | 9.976   | -   | -   | 22.9343
5M  | SVM        | N/A     | N/A     | -   | -   | N/A
5M  | DT         | 15.253  | 14.692  | -   | -   | 70.2041

We can see that the accuracy of our method on unseen data using the heterogeneous data distribution (He-MRHGRC) regularly outperforms the accuracy of the classifier built based on the homogeneous data distribution (Ho-MRHGRC) using large-sized training sets. It is also observed that our method is less affected by overfitting when increasing the number of training samples while keeping the same testing set. For other types of fuzzy min-max neural networks, their error rates frequently increase with the increase in training size because of overfitting. The total training time of our algorithm is lower than that of other types of fuzzy min-max classifiers since our proposed method executes the hyperbox building process at the lowest value of θ in parallel, and we accept the overlapping areas among hyperboxes representing different classes to rapidly capture the characteristics of sample points located near each other. The hyperbox overlap resolving step is only performed at higher abstraction levels with a smaller number of input hyperboxes.

Our proposed method also achieves better classification accuracy compared to the decision tree, but it cannot overcome the support vector machines and Gaussian Naive Bayes methods on synthetic linear boundary datasets. However, the training time of the support vector machines on large-sized datasets is costly, and it even becomes unacceptable on training sets with millions of patterns. The synthetic datasets were constructed based on the Gaussian distribution, so the Gaussian Naive Bayes method can reach the minimum error rate, but our approach can also obtain error rates relatively near these optimal error values. We can observe that the best performance of the He-MRHGRC is attained at quite high abstraction levels of granular representations because some noisy hyperboxes at high levels of granularity are eliminated at lower granulation levels. These results demonstrate the efficiency and scalability of our proposed approach.
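For reference, the linear-boundary data described above can be reproduced with a few lines of NumPy from the stated means and identity covariances; the helper name, random seed, and class labels are arbitrary.

```python
import numpy as np

def make_linear_boundary(n_samples, n_features=2, seed=0):
    """Two equal-sized Gaussian classes: mu1 = 0, mu2 = (2.56, 0, ..., 0), Sigma = I."""
    rng = np.random.default_rng(seed)
    half = n_samples // 2
    mu1 = np.zeros(n_features)
    mu2 = np.zeros(n_features)
    mu2[0] = 2.56
    X = np.vstack([rng.multivariate_normal(mu1, np.eye(n_features), half),
                   rng.multivariate_normal(mu2, np.eye(n_features), half)])
    y = np.repeat([0, 1], half)
    return X, y

X, y = make_linear_boundary(10_000)   # e.g., the 10K training set used in Table I
```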

[Fig. 3 plots: error rate (%) versus the maximum hyperbox size θ (0.1 to 0.6) for Heterogeneous MRHGRC, Homogeneous MRHGRC, GFMMNN, FMNN, KNEFMNN, and EFMNN on the linear boundary training sets: (a) 10K samples, (b) 1M samples, (c) 5M samples.]

Fig. 3. The error rate of classifiers on synthetic linear boundary datasets with the different number of samples.

TABLE II. THE LOWEST ERROR RATES AND TRAINING TIME OF CLASSIFIERS ON SYNTHETIC LINEAR BOUNDARY DATASETS WITH DIFFERENT NUMBERS OF CLASSES (N = 10K, n = 2)

C  | Algorithm  | min E_V | min E_T | θ_V | θ_T | Time (s)
2  | He-MRHGRC  | 10.25   | 10.467  | 0.1 | 0.1 | 1.1378
2  | Ho-MRHGRC  | 10.10   | 10.413  | 0.1 | 0.1 | 1.3215
2  | GFMM       | 11.54   | 11.639  | 0.1 | 0.1 | 8.6718
2  | FMNN       | 10.05   | 10.349  | 0.1 | 0.1 | 46.4789
2  | KNEFMNN    | 12.07   | 12.232  | 0.1 | 0.1 | 9.4459
2  | EFMNN      | 10.44   | 10.897  | 0.1 | 0.1 | 48.9892
2  | GNB        | 9.85    | 9.964   | -   | -   | 0.5218
2  | SVM        | 9.91    | 9.983   | -   | -   | 1.5468
2  | DT         | 15.33   | 14.861  | -   | -   | 0.5405
4  | He-MRHGRC  | 19.76   | 19.884  | 0.4 | 0.4 | 1.0754
4  | Ho-MRHGRC  | 19.97   | 21.135  | 0.1 | 0.1 | 1.5231
4  | GFMM       | 22.34   | 22.515  | 0.1 | 0.1 | 10.8844
4  | FMNN       | 20.00   | 20.350  | 0.1 | 0.1 | 65.7884
4  | KNEFMNN    | 20.54   | 20.258  | 0.1 | 0.1 | 12.5618
4  | EFMNN      | 21.75   | 21.736  | 0.1 | 0.1 | 55.1921
4  | GNB        | 19.35   | 19.075  | -   | -   | 0.5492
4  | SVM        | 19.34   | 19.082  | -   | -   | 1.6912
4  | DT         | 26.94   | 27.014  | -   | -   | 0.5703
16 | He-MRHGRC  | 30.11   | 30.996  | 0.1 | 0.4 | 1.2686
16 | Ho-MRHGRC  | 28.70   | 30.564  | 0.1 | 0.1 | 1.8852
16 | GFMM       | 32.66   | 33.415  | 0.1 | 0.1 | 18.0554
16 | FMNN       | 29.78   | 31.035  | 0.1 | 0.1 | 69.6761
16 | KNEFMNN    | 33.42   | 34.670  | 0.1 | 0.1 | 22.3418
16 | EFMNN      | 31.80   | 33.239  | 0.1 | 0.1 | 76.0920
16 | GNB        | 27.12   | 28.190  | -   | -   | 0.5764
16 | SVM        | 27.29   | 28.103  | -   | -   | 1.6455
16 | DT         | 38.813  | 39.644  | -   | -   | 0.6023

Increase the number of classes: The purpose of the experiment in this subsection is to evaluate the performance of the proposed method on multi-class datasets. We kept the number of dimensions n = 2 and the number of samples N = 10,000, and only changed the number of classes to form synthetic multi-class datasets with C ∈ {2, 4, 16}. The covariance matrices stay the same as in the case of changing the number of samples.

Fig. 4 shows the change in error rates of fuzzy min-max classifiers with different numbers of classes on the testing sets. It can be easily seen that the error rates of our method are the lowest compared to other fuzzy min-max neural networks on all multi-class synthetic datasets at high abstraction levels of granular representations. At high abstraction levels, the error rates of other fuzzy min-max neural networks increase rapidly, while the error rate of our classifier still maintains its stability. In addition, the error rates of our method also augment slowly, in contrast to the behaviors of the other considered types of fuzzy min-max neural networks, when increasing the abstraction level of granular representations. These facts demonstrate the efficiency of our proposed method on multi-class datasets. The lowest error rates of classifiers on validation and testing sets, as well as the total training time, are shown in Table II. It is observed that the predictive accuracy of our method outperforms all considered types of fuzzy min-max classifiers and the decision tree, but it cannot overcome the Gaussian Naive Bayes and support vector machine methods. The training time of our method is shorter than that of the other fuzzy min-max neural networks and support vector machines on the considered multi-class synthetic datasets.

Increase the number of features: To generate the multi-dimensional synthetic datasets with the number of samples N = 10K and the number of classes C = 2, we used similar settings as in the generation of datasets with different numbers of samples. The means of the classes are µ1 = [0, ..., 0]^T, µ2 = [2.56, 0, ..., 0]^T, and the covariance matrices are as follows:

\Sigma_1 = \Sigma_2 = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix}

The size of each expression corresponds to the number of dimensions n of the problem. Fukunaga [35] stated that the general Bayes error is 10%, and this Bayes error stays the same even when n changes.

Fig. 5 shows the change in the error rates with different levels of granularity on multi-dimensional synthetic datasets. In general, with a low number of dimensions, our method outperforms other fuzzy min-max neural networks. With high dimensionality and a small number of samples, the high levels of granularity result in high error rates, and the misclassification results drop considerably when the value of θ increases. The same trend also happens to the FMNN, whose accuracy at θ = 0.5 or θ = 0.6 is quite high. Apart from the FMNN on high dimensional datasets, our proposed method is better than the three other fuzzy min-max classifiers at high abstraction levels. Table III reports the lowest error rates of classifiers on validation and testing multi-dimensional sets as well as the total training time through six abstraction levels of granular representations.

[Fig. 4 plots: error rate (%) versus the maximum hyperbox size θ (0.1 to 0.6) for Heterogeneous MRHGRC, Homogeneous MRHGRC, GFMMNN, FMNN, KNEFMNN, and EFMNN on the 10K-sample linear boundary datasets with (a) 2 classes, (b) 4 classes, (c) 16 classes.]

Fig. 4. The error rate of classifiers on synthetic linear boundary datasets with the different number of classes.

[Fig. 5 plots: error rate (%) versus the maximum hyperbox size θ (0.1 to 0.6) for Heterogeneous MRHGRC, Homogeneous MRHGRC, GFMMNN, FMNN, KNEFMNN, and EFMNN on the 10K-sample, two-class linear boundary datasets with (a) 2 features, (b) 8 features, (c) 32 features.]

Fig. 5. The error rate of classifiers on synthetic linear boundary datasets with the different number of features.

TABLE III. THE LOWEST ERROR RATES AND TRAINING TIME OF CLASSIFIERS ON SYNTHETIC LINEAR BOUNDARY DATASETS WITH DIFFERENT NUMBERS OF FEATURES (N = 10K, C = 2)

n  | Algorithm  | min E_V | min E_T | θ_V | θ_T | Time (s)
2  | He-MRHGRC  | 10.250  | 10.467  | 0.1 | 0.1 | 1.1378
2  | Ho-MRHGRC  | 10.100  | 10.413  | 0.1 | 0.1 | 1.3215
2  | GFMM       | 11.540  | 11.639  | 0.1 | 0.1 | 8.6718
2  | FMNN       | 10.050  | 10.349  | 0.1 | 0.1 | 46.4789
2  | KNEFMNN    | 12.070  | 12.232  | 0.1 | 0.1 | 9.4459
2  | EFMNN      | 10.440  | 10.897  | 0.1 | 0.1 | 48.9892
2  | GNB        | 9.850   | 9.964   | -   | -   | 0.5218
2  | SVM        | 9.910   | 9.983   | -   | -   | 1.5468
2  | DT         | 15.330  | 14.861  | -   | -   | 0.5405
8  | He-MRHGRC  | 10.330  | 10.153  | 0.3 | 0.3 | 21.9131
8  | Ho-MRHGRC  | 10.460  | 10.201  | 0.3 | 0.3 | 23.0554
8  | GFMM       | 12.170  | 12.474  | 0.1 | 0.2 | 196.0682
8  | FMNN       | 10.250  | 10.360  | 0.6 | 0.6 | 302.8683
8  | KNEFMNN    | 12.720  | 12.844  | 0.1 | 0.1 | 618.2524
8  | EFMNN      | 11.300  | 10.907  | 0.4 | 0.4 | 579.3113
8  | GNB        | 9.940   | 9.919   | -   | -   | 0.5915
8  | SVM        | 9.980   | 9.927   | -   | -   | 2.0801
8  | DT         | 15.383  | 15.087  | -   | -   | 0.6769
32 | He-MRHGRC  | 11.070  | 10.995  | 0.5 | 0.5 | 226.3193
32 | Ho-MRHGRC  | 11.070  | 10.995  | 0.5 | 0.5 | 226.0611
32 | GFMM       | 12.390  | 12.625  | 0.3 | 0.3 | 847.6977
32 | FMNN       | 11.830  | 11.637  | 0.5 | 0.6 | 1113.6836
32 | KNEFMNN    | 17.410  | 18.395  | 0.1 | 0.4 | 837.9571
32 | EFMNN      | 13.890  | 13.766  | 0.4 | 0.4 | 1114.4976
32 | GNB        | 10.280  | 10.088  | -   | -   | 0.7154
32 | SVM        | 10.220  | 10.079  | -   | -   | 4.5937
32 | DT         | 15.400  | 15.201  | -   | -   | 1.0960

The training time of our method is much shorter than that of other types of fuzzy min-max neural networks. Generally, the performance of our proposed method overcomes the decision tree and other types of fuzzy min-max neural networks, but its predictive results cannot defeat the Gaussian Naive Bayes and support vector machines. It can be observed that the best performance on validation and testing sets is obtained at the same abstraction level of granular representations on all considered multi-dimensional datasets. This fact indicates that we can use the validation set to choose the best classifier at a given abstraction level among the constructed models through different granularity levels.

2) Non-linear Boundary: To generate non-linear boundary datasets, we set up the Gaussian means of the first class: µ1 = [−2, 1.5]^T, µ2 = [1.5, 1]^T, and the Gaussian means of the second class: µ3 = [−1.5, 3]^T, µ4 = [1.5, 2.5]^T. The covariance matrices for the first class, Σ1 and Σ2, and for the second class, Σ3 and Σ4, were established as follows:

\Sigma_1 = \begin{bmatrix} 0.5 & 0.05 \\ 0.05 & 0.4 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.5 & 0.05 \\ 0.05 & 0.3 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}, \quad \Sigma_4 = \begin{bmatrix} 0.5 & 0.05 \\ 0.05 & 0.2 \end{bmatrix}

The number of samples for each class was equal, and the generated samples were normalized to the range of [0, 1].

created only a testing set containing 100,000 samples and a validation set with 10,000 patterns. Three different training sets containing 10K, 100K, and 5M samples were used to train the classifiers. We aim to evaluate the predictive results of our method on the non-linear boundary data when changing the size of the training set.

TABLE IV
THE LOWEST ERROR RATES AND TRAINING TIME OF CLASSIFIERS ON SYNTHETIC NON-LINEAR BOUNDARY DATASETS WITH DIFFERENT NUMBERS OF SAMPLES (n = 2, C = 2)

N     Algorithm    min EV   min ET   θV    θT    Time (s)
10K   He-MRHGRC     9.950    9.836   0.2   0.2   0.9616
10K   Ho-MRHGRC     9.820    9.940   0.1   0.1   1.1070
10K   GFMM         10.200    9.787   0.4   0.5   10.5495
10K   FMNN          9.770    9.753   0.5   0.5   61.1130
10K   KNEFMNN       9.890    9.505   0.2   0.2   16.1099
10K   EFMNN         9.750    9.565   0.1   0.4   60.6073
10K   GNB          10.740   10.626   -     -     0.5218
10K   SVM           9.750    9.490   -     -     1.5565
10K   DT           14.107   13.831   -     -     0.5388
100K  He-MRHGRC    10.130    9.670   0.3   0.3   2.5310
100K  Ho-MRHGRC     9.910    9.412   0.1   0.1   2.3560
100K  GFMM         11.810   11.520   0.1   0.1   44.7778
100K  FMNN         10.880   10.575   0.1   0.1   588.4412
100K  KNEFMNN      12.470   11.836   0.1   0.1   42.9151
100K  EFMNN        11.020   10.992   0.1   0.1   485.7613
100K  GNB          10.830   10.702   -     -     0.9006
100K  SVM           9.650    9.338   -     -     93.4474
100K  DT           14.277   13.642   -     -     1.1767
5M    He-MRHGRC    10.370   10.306   0.1   0.6   91.7894
5M    Ho-MRHGRC     9.940    9.737   0.1   0.1   69.5106
5M    GFMM         15.260   14.730   0.1   0.1   1927.6191
5M    FMNN         13.160   13.243   0.1   0.1   53274.4387
5M    KNEFMNN      15.040   14.905   0.1   0.1   1551.5220
5M    EFMNN        15.660   15.907   0.2   0.2   54487.6978
5M    GNB          10.840   10.690   -     -     22.9849
5M    SVM          N/A      N/A      -     -     N/A
5M    DT           13.790   13.645   -     -     49.9919

Fig. 6 shows the changes in the error rates through different levels of granularity of the classifiers on the non-linear boundary datasets. It can be observed that the error rates of our proposed method trained on the large-sized non-linear boundary datasets are better than those of the other types of fuzzy min-max neural networks, especially at high abstraction levels of granular representations. While the other fuzzy min-max neural networks show an increase in the error rate as the value of θ grows, our method is capable of maintaining stable predictive results even at high abstraction levels. When the number of samples increases, the error rates of the other fuzzy min-max classifiers usually rise, whereas the error rate of our approach only fluctuates slightly. These results indicate that our approach may reduce the influence of overfitting because higher abstraction levels of granular data representations are constructed by reusing the knowledge learned at lower abstraction levels.

The best performance of our approach does not often occur at the smallest value of θ on these non-linear datasets. The results regarding accuracy on the validation and testing sets reported in Table IV confirm this statement. These figures also illustrate the effectiveness of the processing steps in phase 2. Unlike on the linear boundary datasets, our method overcomes Gaussian Naive Bayes to become, along with SVM, one of the two best classifiers among those considered. Although SVM outperformed our approach, its runtime on the large-sized datasets is much longer than that of our method. The training time of our algorithm is much shorter than that of the other types of fuzzy min-max neural networks and SVM, but it is still slower than the Gaussian Naive Bayes and decision tree techniques.

B. Performance of the Proposed Method on Real Datasets

Aiming to attain the fairest comparison, we used 12 datasets with diverse numbers of samples, dimensions, and classes. These datasets were taken from the LIBSVM [36], Kaggle [37], and UCI [38] repositories, and their properties are described in Table V. For the SUSY dataset, the last 500,000 patterns were used for the test set as in [39]. From the results on the synthetic datasets, we can see that the performance of the multi-resolution hierarchical granular representation based classifier using the heterogeneous data distribution technique is more stable than that utilizing the homogeneous distribution method. Therefore, the experiments in the rest of this paper were conducted for the heterogeneous classifier only.

Table VI shows the number of generated hyperboxes for the He-MRHGRC on the real datasets at different abstraction levels of granular representations. It can be seen that the number of hyperboxes at θ = 0.6 is significantly reduced in comparison to that at θ = 0.1. However, the error rates of the classifiers on the testing sets at θ = 0.6 do not change much compared to those at θ = 0.1. This fact is illustrated in Fig. 7 and the figures in the supplemental file. From these figures, it is observed that at high values of the maximum hyperbox size, such as θ = 0.5 and θ = 0.6, our classifier achieves the best performance compared to the other considered types of fuzzy min-max neural networks. We can also observe that the prediction accuracy of our method is usually much better than that of the other fuzzy min-max classifiers at most of the data granulation levels. The error rate of our classifier increases only slowly with the increase in the abstraction level of granules, and in some cases it even declines at a high abstraction level of granular representations. The best performance of the classifiers on the validation and testing sets, as well as the training time through six granularity levels, are reported in the supplemental file.

Although our method cannot achieve the best classification accuracy on all considered datasets, its performance is in the top two for all datasets. The Gaussian Naive Bayes classifier obtained the best predictive results on the synthetic linear boundary datasets, but it fell to the last position and became the worst classifier on the real datasets because the real datasets are highly non-linear. On datasets with highly non-linear decision boundaries, such as covtype, PhysioNet MIT-BIH Arrhythmia, and MiniBooNE, our proposed method still produces good predictive accuracy.

The training process of our method is much faster than that of the other types of fuzzy min-max neural networks on all considered datasets. Notably, on some large-sized complex datasets such as covtype and SUSY, the training time of the other fuzzy min-max classifiers is costly, but their accuracy is worse than that of our method, which takes less training time. Our approach is frequently faster than SVM and can deal with datasets containing millions of samples, on which the SVM approach could not complete training.
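Because the abstraction level that is best on the validation set is also (nearly) best on the testing set in these experiments, the final classifier can be picked by a simple selection loop over the models trained at the different granularity levels. The sketch below assumes scikit-learn-style model objects with a `predict` method; it illustrates only the selection step, not the authors' implementation.

```python
import numpy as np

def select_best_level(models_by_theta, X_val, y_val):
    """Return (theta, model, validation error) of the classifier with the
    lowest validation error among models trained at different maximum
    hyperbox sizes theta (e.g. 0.1, ..., 0.6)."""
    best_theta, best_model, best_err = None, None, np.inf
    for theta, model in sorted(models_by_theta.items()):
        err = float(np.mean(model.predict(X_val) != y_val))
        if err < best_err:
            best_theta, best_model, best_err = theta, model, err
    return best_theta, best_model, best_err
```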

Fig. 6. The error rate of classifiers on synthetic non-linear boundary datasets with different numbers of samples: (a) 10K samples, (b) 100K samples, (c) 5M samples. (Error rate versus θ for the heterogeneous and homogeneous MRHGRC, GFMMNN, FMNN, KNEFMNN, and EFMNN.)

TABLE V
THE REAL DATASETS AND THEIR STATISTICS

Dataset                          #Dimensions  #Classes  #Training samples  #Validation samples  #Testing samples  Source
Poker Hand                       10           10        25,010             50,000               950,000           LIBSVM
SensIT Vehicle                   100          3         68,970             9,852                19,706            LIBSVM
Skin NonSkin                     3            2         171,540            24,260               49,257            LIBSVM
Covtype                          54           7         406,709            58,095               116,208           LIBSVM
White wine quality               11           7         2,449              1,224                1,225             Kaggle
PhysioNet MIT-BIH Arrhythmia     187          5         74,421             13,133               21,892            Kaggle
MAGIC Gamma Telescope            10           2         11,887             3,567                3,566             UCI
Letter                           16           26        15,312             2,188                2,500             UCI
Default of credit card clients   23           2         18,750             5,625                5,625             UCI
MoCap Hand Postures              36           5         53,104             9,371                15,620            UCI
MiniBooNE                        50           2         91,044             12,877               26,143            UCI
SUSY                             18           2         4,400,000          100,000              500,000           UCI

TABLE VI
THE CHANGE IN THE NUMBER OF GENERATED HYPERBOXES THROUGH DIFFERENT LEVELS OF GRANULARITY OF THE PROPOSED METHOD

Dataset                          θ = 0.1   0.2     0.3     0.4     0.5     0.6
Skin NonSkin                     1012      248     127     85      64      51
Poker Hand                       11563     11414   10905   3776    2939    2610
Covtype                          94026     13560   5224    2391    1330    846
SensIT Vehicle                   5526      2139    1048    667     523     457
PhysioNet MIT-BIH Arrhythmia     60990     26420   15352   8689    5261    3241
White wine quality               1531      676     599     559     544     526
Default of credit card clients   2421      529     337     76      48      29
Letter                           9236      1677    952     646     595     556
MAGIC Gamma Telescope            1439      691     471     384     335     308
MiniBooNE                        444       104     24      10      6       6
SUSY                             -         -       26187   25867   16754   13017

Fig. 7. The error rate of classifiers on the Letter dataset through data abstraction levels (error rate versus θ for the heterogeneous MRHGRC, GFMMNN, FMNN, KNEFMNN, and EFMNN).

On many datasets, the best predictive results on the validation and testing sets were achieved at the same abstraction level of granular representations. In the cases where the best model on the validation set is at a different abstraction level from the best model on the testing set, the error rate obtained on the testing set when using the best classifier selected on the validation set is still close to the minimum error. These figures show that our proposed method is stable, and it can achieve a high predictive accuracy on both synthetic and real datasets.

C. The Vital Role of the Pruning Process and the Use of Sample Centroids

This experiment aims to assess the important roles of the pruning process and the use of sample centroids in the performance of the proposed method. The experimental results related to these issues are presented in Table VII. It is easily observed that the pruning step contributes to significantly reducing the number of generated hyperboxes, especially on the SensIT Vehicle, Default of credit card clients, and SUSY datasets.
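The pruning step removes poorly performing hyperboxes after training. The snippet below is only one plausible realisation of such a rule, assuming each validation sample has already been assigned to a winning hyperbox; the threshold and the exact criterion are illustrative assumptions, not the paper's precise algorithm.

```python
import numpy as np

def prune_hyperboxes(boxes, winner_idx, y_val, min_accuracy=0.5):
    """Keep only hyperboxes whose validation accuracy (over the samples they
    win) reaches `min_accuracy`; boxes that win no validation samples are
    also dropped in this illustrative version.

    boxes      : list of dicts such as {"V": min_point, "W": max_point, "label": c}
    winner_idx : (n_val,) index of the winning hyperbox for each validation sample
    y_val      : (n_val,) true class labels of the validation samples
    """
    kept = []
    for i, box in enumerate(boxes):
        mask = winner_idx == i
        if mask.any() and float(np.mean(y_val[mask] == box["label"])) >= min_accuracy:
            kept.append(box)
    return kept
```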

TABLE VII
THE ROLE OF THE PRUNING PROCESS AND THE USE OF SAMPLE CENTROIDS

Dataset                          Hyperboxes           Error rate (%)       Centroid-classified samples   Centroid-classified samples
                                 before / after       before / after       before pruning                after pruning
                                 pruning              pruning              (total / wrong)               (total / wrong)
Skin NonSkin                     1,358 / 1,012        0.1726 / 0.0974      1,509 / 73                    594 / 30
Poker Hand                       24,991 / 11,563      53.5951 / 49.8128    600,804 / 322,962             725,314 / 362,196
SensIT Vehicle                   61,391 / 5,526       23.6730 / 20.9073    2 / 1                         0 / 0
Default of credit card clients   9,256 / 2,421        22.3822 / 19.7689    662 / 291                     312 / 127
Covtype                          95,971 / 94,026      7.7335 / 7.5356      2,700 / 975                   2,213 / 783
PhysioNet MIT-BIH Arrhythmia     61,419 / 60,990      3.6589 / 3.5492      49 / 9                        48 / 8
MiniBooNE                        1,079 / 444          16.4289 / 13.9043    14,947 / 3,404                11,205 / 2,575
SUSY                             55,096 / 26,187      30.8548 / 28.3456    410,094 / 145,709             370,570 / 124,850
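The last columns of Table VII count the samples that fall into overlapping regions and are therefore classified by the Euclidean distance to the sample centroids stored with the hyperboxes. A minimal sketch of such a distance-based decision follows; the `centroid` and `label` fields are assumed bookkeeping for illustration, not the authors' exact data structures.

```python
import numpy as np

def classify_by_centroid(x, candidate_boxes):
    """Among hyperboxes that tie on membership but carry different class
    labels, predict the class of the hyperbox whose stored sample centroid
    is closest to x in Euclidean distance."""
    distances = [np.linalg.norm(x - box["centroid"]) for box in candidate_boxes]
    return candidate_boxes[int(np.argmin(distances))]["label"]
```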

When the poorly performing hyperboxes are removed, the accuracy of the model increases considerably. These figures indicate the critical role of the pruning process with regard to reducing the complexity of the model and enhancing the predictive performance.

We can also see that the use of sample centroids and the Euclidean distance can correctly predict from 50% to 95% of the samples located in the overlapping regions between different classes. The predictive accuracy depends on the distribution and complexity of the underlying data. With the use of sample centroids, we do not need to use the overlap test and contraction process in phase 1 at the highest level of granularity. This strategy accelerates the training process of the proposed method compared to the other types of fuzzy min-max neural networks, especially on large-sized datasets such as covtype or SUSY. These facts point to the effectiveness of the pruning process and the usage of sample centroids in improving the performance of our approach in terms of both accuracy and training time.

D. Ability to Handle Missing Values

This experiment was conducted on two datasets containing many missing values, i.e., the PhysioNet MIT-BIH Arrhythmia and MoCap Hand Postures datasets. The aim of this experiment is to demonstrate the ability of our method to handle missing values and preserve the uncertainty of the input data without any pre-processing steps. We also generated three other training datasets from the original data by replacing the missing values with the zero, mean, or median value of each feature. These values were then used to fill in the missing values of the corresponding features in the testing and validation sets. The obtained results are presented in Table VIII.

TABLE VIII
THE TRAINING TIME AND THE LOWEST ERROR RATES OF OUR METHOD ON THE DATASETS WITH MISSING VALUES

Dataset                                                    Training time (s)   min EV              min ET
Arrhythmia with missing values replaced by zero values     53,100.2895         3.0762 (θ = 0.1)    3.5492 (θ = 0.1)
Arrhythmia with missing values replaced by mean values     60,980.5110         2.6879 (θ = 0.1)    3.3848 (θ = 0.1)
Arrhythmia with missing values replaced by median values   60,570.4315         2.7031 (θ = 0.1)    3.2980 (θ = 0.2)
Arrhythmia with missing values retained                    58,188.8138         2.6955 (θ = 0.1)    3.1473 (θ = 0.1)
Postures with missing values replaced by zero values       5,845.9722          6.6482 (θ = 0.1)    7.7529 (θ = 0.4)
Postures with missing values replaced by mean values       5,343.0038          8.5370 (θ = 0.1)    9.7631 (θ = 0.3)
Postures with missing values replaced by median values     4,914.4475          8.4089 (θ = 0.1)    9.9936 (θ = 0.3)
Postures with missing values retained                      2,153.8121          14.5662 (θ = 0.4)   13.7900 (θ = 0.4)

The predictive accuracy of the classifier trained on the datasets with missing values retained cannot surpass that of the classifiers trained on the datasets imputed with the median, mean, or zero values. However, the training time is reduced, and the characteristics of the proposed method are still preserved: the accuracy of the classifier is maintained at high levels of abstraction, and its behavior is nearly the same on both validation and testing sets. The replacement of missing values by other values is usually biased and inflexible in real-world applications. The capability of learning directly from data with missing values ensures that the online learning property of the fuzzy min-max neural network is maintained on incomplete input data.
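The three imputed variants described above can be produced with a small helper. This is a sketch of the experimental protocol only (fill values are computed on the training data and reused for the validation and testing sets); the proposed classifier itself works on the incomplete data directly.

```python
import numpy as np

def impute_missing(train, val, test, strategy="mean"):
    """Replace NaNs per feature by zero, or by the feature's mean/median
    computed on the training set, and apply the same fill values to the
    validation and testing sets."""
    if strategy == "zero":
        fill = np.zeros(train.shape[1])
    elif strategy == "mean":
        fill = np.nanmean(train, axis=0)
    elif strategy == "median":
        fill = np.nanmedian(train, axis=0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    filled = []
    for X in (train, val, test):
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = fill[cols]
        filled.append(X)
    return filled
```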
E. Comparison to State-of-the-art Studies

The purpose of this section is to compare our method with recent studies of classification algorithms on large-sized datasets in physics and medical diagnostics. The first experiment was performed on the SUSY dataset to distinguish between a signal process producing super-symmetric particles and a background process. For this purpose, Baldi et al. [39] compared the performance of a deep neural network with boosted decision trees using the area under the curve (AUC) metric. In another study, Sakai et al. [40] evaluated different methods of AUC optimization in combination with support vector machines to enhance the efficiency of the final predictive model. The AUC values of these studies, along with that of our method, are reported in Table IX. It can be seen that our approach outperforms all approaches in Sakai's study, but it cannot outperform the deep learning methods and boosted trees on this dataset.

The second experiment was conducted on a medical dataset (PhysioNet MIT-BIH Arrhythmia) containing electrocardiogram (ECG) signals used for the classification of heartbeats. There are many studies on ECG heartbeat classification, such as a deep residual convolutional neural network [41], a 9-layer deep convolutional neural network trained on an augmented version of the original data [42], and combinations of a discrete wavelet transform with neural networks, SVM [43], and random forests [44]. The PhysioNet MIT-BIH Arrhythmia dataset contains many missing values, and the above studies used a zero padding mechanism for these values. Our method can handle the missing values directly without any imputation. The accuracy of our method on the datasets with missing values and with zero padding is shown in Table X, along with the results taken from the other studies. It is observed that our approach on the dataset including missing values outperforms all other methods considered. From these comparisons, we can conclude that our proposed method is highly competitive with other state-of-the-art studies published on real datasets.
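The AUC figures in Table IX can be reproduced for any classifier that outputs a score for the signal class; a minimal sketch using scikit-learn is shown below. Whether the hyperbox classifier exposes such a score (e.g. a membership degree or class probability) is an assumption made here for illustration.

```python
from sklearn.metrics import roc_auc_score

def auc_on_test(model, X_test, y_test):
    """Compute the area under the ROC curve from the model's score for the
    positive (signal) class on the held-out test split."""
    scores = model.predict_proba(X_test)[:, 1]   # assumed scoring interface
    return roc_auc_score(y_test, scores)
```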

TABLE IX
THE AUC VALUES OF THE PROPOSED METHOD AND OTHER METHODS ON THE SUSY DATASET

Method                                                              AUC
Boosted decision tree [39]                                          0.863
Deep neural network [39]                                            0.876
Deep neural network with dropout [39]                               0.879
Positive-negative and unlabeled data based AUC optimization [40]    0.647
Semi-supervised rankboost based AUC optimization [40]               0.709
Semi-supervised AUC-optimized logistic sigmoid [40]                 0.556
Optimum AUC with a generative model [40]                            0.577
He-MRHGRC (our method)                                              0.799

TABLE X
THE ACCURACY OF THE PROPOSED METHOD AND OTHER METHODS ON THE PhysioNet MIT-BIH Arrhythmia DATASET

Method                                                      Accuracy (%)
Deep residual convolutional neural network [41]             93.4
Augmentation + deep convolutional neural network [42]       93.5
Discrete wavelet transform + SVM [43]                       93.8
Discrete wavelet transform + NN [43]                        94.52
Discrete wavelet transform + Random Forest [44]             94.6
Our method on the dataset with missing values               96.85
Our method on the dataset with zero padding                 96.45

V. CONCLUSION AND FUTURE WORK

This paper presented a method to construct classification models based on multi-resolution hierarchical granular representations using hyperbox fuzzy sets. Our approach can maintain good classification accuracy at high abstraction levels with a low number of hyperboxes. The best classifier on the validation set usually produces the best predictive results on unseen data as well. One of the interesting characteristics of our method is its capability of handling missing values without the need for imputation. This property makes it flexible for real-world applications, where data incompleteness usually occurs. In general, our method outperformed the other typical types of fuzzy min-max neural networks that use the contraction process for dealing with overlapping regions, in terms of both accuracy and training time. Furthermore, our proposed technique can be scaled to large-sized datasets based on the parallel execution of the hyperbox building process at the highest level of granularity to form core hyperboxes from sample points rapidly. These hyperboxes are then refined at higher abstraction levels to reduce the complexity and maintain consistent predictive performance.

The patterns located in the overlapping regions are currently classified using the Euclidean distance to the sample centroids. Future work will focus on deploying a probability estimation measure to deal with these samples. The predictive results of the proposed method depend on the order of presentation of the training patterns because it is based on the online learning ability of the general fuzzy min-max neural network. In addition, the proposed method is sensitive to noise and outliers as well. In real-world applications, noisy data are frequently encountered; thus, they can lead to serious stability issues. Therefore, outlier detection and noise removal are essential issues which need to be tackled in future work. Furthermore, we also intend to combine hyperboxes generated at different levels of granularity to build an optimal ensemble model for pattern recognition.

ACKNOWLEDGMENT

T.T. Khuat acknowledges FEIT-UTS for awarding his PhD scholarships (IRS and FEIT scholarships).

REFERENCES

[1] G. Wang, J. Yang, and J. Xu, "Granular computing: from granularity optimization to multi-granularity joint problem solving," Granular Computing, vol. 2, no. 3, pp. 105–120, 2017.
[2] J. A. Morente-Molinera, J. Mezei, C. Carlsson, and E. Herrera-Viedma, "Improving supervised learning classification methods using multigranular linguistic modeling and fuzzy entropy," IEEE Transactions on Fuzzy Systems, vol. 25, no. 5, pp. 1078–1089, 2017.
[3] L. A. Zadeh, "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic," Fuzzy Sets and Systems, vol. 90, no. 2, pp. 111–127, 1997.
[4] W. Pedrycz, G. Succi, A. Sillitti, and J. Iljazi, "Data description: A general framework of information granules," Knowledge-Based Systems, vol. 80, pp. 98–108, 2015.
[5] W. Pedrycz, "Allocation of information granularity in optimization and decision-making models: Towards building the foundations of granular computing," European Journal of Operational Research, vol. 232, no. 1, pp. 137–145, 2014.
[6] J. Xu, G. Wang, T. Li, and W. Pedrycz, "Local-density-based optimal granulation and manifold information granule description," IEEE Transactions on Cybernetics, vol. 48, no. 10, pp. 2795–2808, 2018.
[7] W. Xu and J. Yu, "A novel approach to information fusion in multi-source datasets: A granular computing viewpoint," Information Sciences, vol. 378, pp. 410–423, 2017.
[8] R. Al-Hmouz, W. Pedrycz, A. S. Balamash, and A. Morfeq, "Granular description of data in a non-stationary environment," Soft Computing, vol. 22, no. 2, pp. 523–540, 2018.
[9] C. P. Chen and C.-Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: A survey on big data," Information Sciences, vol. 275, pp. 314–347, 2014.
[10] T. T. Khuat, D. Ruta, and B. Gabrys, "Hyperbox based machine learning algorithms: A comprehensive survey," CoRR, vol. abs/1901.11303, 2019.
[11] P. K. Simpson, "Fuzzy min-max neural networks. I. Classification," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 776–786, 1992.
[12] B. Gabrys and A. Bargiela, "General fuzzy min-max neural network for clustering and classification," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 769–783, 2000.
[13] J. de Jesús Rubio, "SOFMLS: Online self-organizing fuzzy modified least-squares network," IEEE Transactions on Fuzzy Systems, vol. 17, no. 6, pp. 1296–1309, 2009.
[14] X.-M. Zhang and Q.-L. Han, "State estimation for static neural networks with time-varying delays based on an improved reciprocally convex inequality," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 1376–1381, 2017.
[15] J. de Jesús Rubio, "A method with neural networks for the classification of fruits and vegetables," Soft Computing, vol. 21, no. 23, pp. 7207–7220, 2017.
[16] M.-Y. Cheng, D. Prayogo, and Y.-W. Wu, "Prediction of permanent deformation in asphalt pavements using a novel symbiotic organisms search–least squares support vector regression," Neural Computing and Applications, pp. 1–13, 2018.
[17] J. d. J. Rubio, D. R. Cruz, I. Elias, G. Ochoa, R. Balcazar, and A. Aguilar, "ANFIS system for classification of brain signals," Journal of Intelligent & Fuzzy Systems, no. Preprint, pp. 1–9, 2019.
[18] B. Gabrys, "Learning hybrid neuro-fuzzy classifier models from data: to combine or not to combine?" Fuzzy Sets and Systems, vol. 147, no. 1, pp. 39–56, 2004.
[19] W. Pedrycz, "Granular computing for data analytics: A manifesto of human-centric computing," IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 6, pp. 1025–1034, 2018.
[20] R. E. Moore, R. B. Kearfott, and M. J. Cloud, Introduction to Interval Analysis. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2009.
[21] L. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
[22] W. Pedrycz, "Interpretation of clusters in the framework of shadowed sets," Pattern Recognition Letters, vol. 26, no. 15, pp. 2439–2449, 2005.
[23] Z. Pawlak and A. Skowron, "Rough sets and boolean reasoning," Information Sciences, vol. 177, no. 1, pp. 41–73, 2007.
[24] W. Pedrycz and W. Homenda, "Building the fundamentals of granular computing: A principle of justifiable granularity," Applied Soft Computing, vol. 13, no. 10, pp. 4209–4218, 2013.
[25] W. Pedrycz and A. Bargiela, "An optimization of allocation of information granularity in the interpretation of data structures: Toward granular fuzzy clustering," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 3, pp. 582–590, 2012.
[26] Z. Sahel, A. Bouchachia, B. Gabrys, and P. Rogers, "Adaptive mechanisms for classification problems with drifting data," in Knowledge-Based Intelligent Information and Engineering Systems, B. Apolloni, R. J. Howlett, and L. Jain, Eds. Springer Berlin Heidelberg, 2007, pp. 419–426.
[27] G. Peters and R. Weber, "DCC: a framework for dynamic granular clustering," Granular Computing, vol. 1, no. 1, pp. 1–11, 2016.
[28] B. Gabrys, "Agglomerative learning algorithms for general fuzzy min-max neural network," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 32, no. 1, pp. 67–82, 2002.
[29] ——, "Combining neuro-fuzzy classifiers for improved generalisation and reliability," in Proceedings of the 2002 International Joint Conference on Neural Networks, vol. 3, 2002, pp. 2410–2415.
[30] ——, "Neuro-fuzzy approach to processing inputs with missing values in pattern recognition problems," International Journal of Approximate Reasoning, vol. 30, no. 3, pp. 149–179, 2002.
[31] N. Macia, E. Bernado-Mansilla, A. Orriols-Puig, and T. K. Ho, "Learner excellence biased by data set selection: A case for data characterisation and artificial data sets," Pattern Recognition, vol. 46, no. 3, pp. 1054–1066, 2013.
[32] M. Budka and B. Gabrys, "Density-preserving sampling: Robust and efficient alternative to cross-validation for error estimation," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 22–34, 2013.
[33] M. F. Mohammed and C. P. Lim, "An enhanced fuzzy min-max neural network for pattern classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 3, pp. 417–429, 2015.
[34] M. F. Mohammed and C. P. Lim, "Improving the fuzzy min-max neural network with a k-nearest hyperbox expansion rule for pattern classification," Applied Soft Computing, vol. 52, pp. 135–145, 2017.
[35] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego, CA, USA: Academic Press Professional, Inc., 1990.
[36] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011.
[37] Kaggle, "Kaggle datasets," 2019. [Online]. Available: https://www.kaggle.com/datasets
[38] D. Dua and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[39] P. Baldi, P. Sadowski, and D. Whiteson, "Searching for exotic particles in high-energy physics with deep learning," Nature Communications, vol. 5, p. 4308, 2014.
[40] T. Sakai, G. Niu, and M. Sugiyama, "Semi-supervised AUC optimization based on positive-unlabeled learning," Machine Learning, vol. 107, no. 4, pp. 767–794, 2018.
[41] M. Kachuee, S. Fazeli, and M. Sarrafzadeh, "ECG heartbeat classification: A deep transferable representation," in IEEE International Conference on Healthcare Informatics (ICHI), 2018, pp. 443–444.
[42] U. R. Acharya, S. L. Oh, Y. Hagiwara, J. H. Tan, M. Adam, A. Gertych, and R. S. Tan, "A deep convolutional neural network model to classify heartbeats," Computers in Biology and Medicine, vol. 89, pp. 389–396, 2017.
[43] R. J. Martis, U. R. Acharya, C. M. Lim, K. M. Mandana, A. K. Ray, and C. Chakraborty, "Application of higher order cumulant features for cardiac health diagnosis using ECG signals," International Journal of Neural Systems, vol. 23, no. 04, p. 1350014, 2013.
[44] T. Li and M. Zhou, "ECG classification using wavelet packet entropy and random forests," Entropy, vol. 18, no. 8, 2016.

Thanh Tung Khuat received the B.E. degree in Software Engineering from the University of Science and Technology, Danang, Vietnam, in 2014. Currently, he is working towards the Ph.D. degree at the Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia. His research interests include machine learning, fuzzy systems, knowledge discovery, evolutionary computation, intelligent optimization techniques, and applications in software engineering. He has authored and co-authored over 20 peer-reviewed publications in the areas of machine learning and computational intelligence.

Fang Chen is the Executive Director Data Science and a Distinguished Professor with the University of Technology Sydney, Ultimo, NSW, Australia. She is a thought leader in AI and data science. She has created many world-class AI innovations while working with Jiaotong University, Intel, Motorola, NICTA, and CSIRO, and has helped governments and industries utilise data to significantly increase productivity, safety, and customer satisfaction. Through these impactful successes, she has gained many recognitions, such as the ITS Australia National Award 2014 and 2015, and the NSW iAwards 2017. She is the NSW Water Professional of the Year 2016 and received the National and NSW Research and Innovation Awards from the Australian Water Association. She was the recipient of the "Brian Shackle Award" 2017 for the most outstanding contribution with international impact in the field of human interaction with computers and information technology. She is the recipient of the "Oscar" of Australian science, the Australian Museum Eureka Prize 2018 for Excellence in Data Science. She has 280 publications and 30 patents in 8 countries.

Bogdan Gabrys received the M.Sc. degree in electronics and telecommunication from Silesian Technical University, Gliwice, Poland, in 1994, and the Ph.D. degree in computer science from Nottingham Trent University, Nottingham, U.K., in 1998. Over the last 25 years, he has been working at various universities and research and development departments of commercial institutions. He is currently a Professor of Data Science and a Director of the Advanced Analytics Institute at the University of Technology Sydney, Sydney, Australia. His research activities have concentrated on the areas of data science, complex adaptive systems, computational intelligence, machine learning, predictive analytics, and their diverse applications. He has published over 180 research papers, chaired conferences, workshops, and special sessions, and been on the programme committees of a large number of international conferences with data science, computational intelligence, and machine learning themes. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE), a Member of the IEEE Computational Intelligence Society, and a Fellow of the Higher Education Academy (HEA) in the UK. He is frequently invited to give keynote and plenary talks at international conferences and lectures at internationally leading research centres and commercial research labs. More details can be found at: http://bogdan-gabrys.com