3rd International Conference on Automation, Control, Engineering and Computer Science (ACECS-2016), 20-22 March 2016, Hammamet, Tunisia

Dimension Reduction Based on Scattering Matrices and Classification Using Fisher's Linear Discriminant

Nasar Aldian Ambark Shashoa #1, Salah Mohamed Naas #2, Abdurrezag S. Elmezughi #2
#1 Electrical and Electronics Engineering Department, Azzaytuna University, Tarhuna, Libya
#2 Computer Engineering Department, Azzaytuna University, Tarhuna, Libya
[email protected], [email protected], [email protected]

Abstract— This paper presents feature extraction based on scattering matrices for classification using Fisher's linear discriminant. The recursive least squares identification algorithm is used to estimate the parameters of an autoregressive (ARX) model. Next, dimension reduction based on scattering matrices is applied in order to obtain good classification performance. The classification between the two classes is then carried out with Fisher's linear discriminant. Simulation results illustrate the usefulness of the proposed procedure.

Keywords— Scattering Matrices; Fault Diagnosis; Feature Selection; Fisher's Linear Discriminant.

I. INTRODUCTION

Feature selection is the process of selecting a small number of highly predictive features out of a large set of candidate attributes that may be largely irrelevant or redundant. It plays a fundamental role in pattern recognition, data mining and, more generally, machine learning tasks [1], e.g., facilitating data interpretation, reducing measurement and storage requirements, increasing processing speed and improving generalization performance. System identification is an important approach to modelling dynamical systems and has been used in many areas such as chemical processes and signal processing [2]. Fault detection and isolation (FDI) using analytical redundancy methods is currently the subject of extensive research, and numerous surveys can be found. The analytical methods compare real process data with data obtained from mathematical models of the system. The most popular analytical redundancy techniques are parameter estimation, parity relations and observer-based approaches. In the parameter estimation method, the target system is considered as a continuous-variable dynamic system with an input U and an output Y; the detection method generates parameter estimates θ̂, which are called features [3]. Although all of these techniques are well suited to fault detection, one of the most relevant techniques for diagnosis is supervised classification. It is not unusual to view fault diagnosis as a classification task whose objective is to assign new observations to one of the existing classes. Many methods have been developed for supervised classification, such as the Fisher discriminant. Nevertheless, classification (diagnosis) is a hard task when the number of parameters is large. Indeed, it is not uncommon for a process to be described by a large number of parameters, where not all parameters are of equal informative value, such that it is possible to describe the behavior of the process well enough using a smaller set of parameters. Therefore, a selection of the informative variables should be carried out in order to increase the accuracy of the classification. The paper is structured as follows: in Section 2, parameter estimation using the recursive least squares algorithm is presented; in Section 3, the dimension reduction technique based on scattering matrices and separability criteria is derived; Section 4 provides a detailed description of the linear discriminant function; Section 5 presents the simulation results; Section 6 contains the conclusion.

II. MODEL IDENTIFICATION OF THE AUTOREGRESSIVE MODEL

The basic step in the identification procedure is the choice of a suitable type of model. The general linear (autoregressive) model takes the form

A(z^{-1}) y(k) = B(z^{-1}) u(k) + n(k)    (1)

where

A(z^{-1}) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_{na} z^{-na}
B(z^{-1}) = b_1 z^{-1} + b_2 z^{-2} + ... + b_{nb} z^{-nb}    (2)

are polynomials in the backward shift operator z^{-1}, defined by z^{-i} y(k) = y(k-i). Here y(k), u(k) and n(k) are the sequences of the system output, the measurable input and the stochastic input (noise), respectively, while the constants a_i and b_j represent the system parameters.

Fig. 1 Model structure of the ARX system (block diagram: u(k) passed through B(z)/A(z) to produce y(k))
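For illustration, the difference equation implied by (1)-(2) can be simulated directly. The following minimal Python sketch uses hypothetical coefficients and function names chosen for this example only; it is not part of the original paper.

    import numpy as np

    def simulate_arx(a, b, u, noise):
        """Simulate A(z^-1) y(k) = B(z^-1) u(k) + n(k) with A monic.
        a = [a1, ..., a_na], b = [b1, ..., b_nb] are the coefficients of z^-1, z^-2, ...
        """
        na, nb = len(a), len(b)
        y = np.zeros(len(u))
        for k in range(len(u)):
            ar = sum(a[i] * y[k - 1 - i] for i in range(na) if k - 1 - i >= 0)
            xb = sum(b[j] * u[k - 1 - j] for j in range(nb) if k - 1 - j >= 0)
            # y(k) = -sum_i a_i y(k-i) + sum_j b_j u(k-j) + n(k)
            y[k] = -ar + xb + noise[k]
        return y

    # hypothetical second-order example
    rng = np.random.default_rng(0)
    u = rng.normal(0.0, 1.0, 500)
    n = rng.normal(0.0, 0.1, 500)
    y = simulate_arx([0.3, 0.5], [1.0, 0.4], u, n)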

The recursive parameter estimation algorithms are based on the analysis of the input and output signals of the process to be identified, and can be used to estimate the parameters of the ARX model [4]. The algorithm can be written in the following form. Consider a linear, dynamic, time-invariant, discrete-time system, which can be represented by

y(k) = -\sum_{i=1}^{na} a_i y(k-i) + \sum_{i=1}^{nb} b_i u(k-i) + n(k)    (3)

Equation (3) can be written in the linear regression form

y(k) = Z^T(k) \theta + n(k)    (4)

where

\theta^T = [a_1 ... a_{na}, b_1 ... b_{nb}]    (5)

represents the unknown parameter vector,

Z^T(k) = [-y(k-1) ... -y(k-na), u(k-1) ... u(k-nb)]    (6)

represents the vector of measurable input and output samples (the information vector), and the residual n(k) is introduced as

n(k) = y(k) - Z^T(k) \theta    (7)

In many practical cases, parameter estimation must take place concurrently with the system's operation. This parameter estimation problem is called on-line identification, and its methodology usually leads to a recursive procedure applied at every new measurement (data entry). For this reason it is also called the recursive least-squares (RLS) estimate, or recursive identification [5]. The recursive algorithm is given by the following theorem. Suppose that θ̂(k-1) is the estimate of the system parameters after k-1 data entries. Then the estimate θ̂(k) after k data entries, with k = 1, 2, 3, ..., is given by

\hat{\theta}(k) = \hat{\theta}(k-1) + \gamma(k) [y(k) - Z^T(k) \hat{\theta}(k-1)]    (8)

The correcting vector is given by

\gamma(k) = P(k) Z(k) = P(k-1) Z(k) [Z^T(k) P(k-1) Z(k) + 1]^{-1}    (9)

and the matrix P(k) is calculated from the recursive formula

P(k) = [I - \gamma(k) Z^T(k)] P(k-1)    (10)

with the initial conditions

P(0) = I,  \hat{\theta}(0) = 0    (11)

III. DIMENSION REDUCTION TECHNIQUES

In pattern classification, dimension reduction of the estimated parameters is often unavoidable. It is not uncommon for a process to be described by a large number of parameters, where not all parameters are of equal informative value, such that it is possible to describe the behavior of the process well enough using a smaller set of parameters. Numerous dimension reduction techniques have been developed. They largely seek a suitable transformation matrix A_{n×m}, where n is the initial vector dimension and m is the desired dimension (m < n), providing an appropriate projection

Y_{m×1} = A_{n×m}^T X_{n×1}    (12)

of the initial measurement vectors X (in the present case, the parameter vector of the identified model) onto reduced-dimension vectors Y, which need not have a physical meaning in the general case. The major task can therefore be summarized as follows: given a number of features, how can one select the most important of them so as to reduce their number while retaining as much of their class-discriminatory information as possible? This procedure is known as feature selection or reduction [5]. There are many types of dimension reduction methods; scattering matrices and separability criteria were selected for this work.

A. SCATTERING MATRICES AND SEPARABILITY CRITERIA

Let X be an n-dimensional random vector. Then X can be represented without error by the summation of n linearly independent vectors as

X = \sum_{i=1}^{n} y_i \phi_i = \Phi Y    (13)

where

\Phi = [\phi_1 . . . \phi_n]    (14)

and

Y = [y_1 . . . y_n]^T    (15)

The matrix Φ is deterministic and is made up of n linearly independent column vectors; thus |Φ| ≠ 0. We may assume that the columns of Φ form an orthonormal set, that is,
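A minimal recursive least-squares sketch of the update (8)-(11) is given below in Python. It is an illustrative implementation, not the authors' code: the function name rls_arx and the choice na = nb = 2 are assumptions, the arrays u and y can come from a simulation such as the sketch in Section II, and P(0) is initialized as a large multiple of the identity, which is a common practical choice.

    import numpy as np

    def rls_arx(y, u, na=2, nb=2, alpha=1e4):
        """Recursive least squares for the ARX regression y(k) = Z^T(k) theta + n(k)."""
        n_par = na + nb
        theta = np.zeros(n_par)                    # theta_hat(0) = 0, eq. (11)
        P = alpha * np.eye(n_par)                  # P(0) = alpha * I (large alpha)
        history = []
        for k in range(max(na, nb), len(y)):
            # information vector Z(k) = [-y(k-1) ... -y(k-na), u(k-1) ... u(k-nb)]^T, eq. (6)
            Z = np.concatenate([-y[k - na:k][::-1], u[k - nb:k][::-1]])
            Pz = P @ Z
            gamma = Pz / (1.0 + Z @ Pz)                        # correcting vector, eq. (9)
            theta = theta + gamma * (y[k] - Z @ theta)         # parameter update, eq. (8)
            P = (np.eye(n_par) - np.outer(gamma, Z)) @ P       # covariance update, eq. (10)
            history.append(theta.copy())
        return theta, np.array(history)

The rows of history trace how the estimates evolve over time, which is the kind of trajectory plotted later in Fig. 3.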

T 1 for i  j feature, say , is deleted, the mean-square error increases by     (16) i i j 0 for i  j  i . Therefore, the feature with the smallest eigenvalue should be deleted first, and so on. If the eigenvalues are indexed as

We may call φ_i the i-th feature or feature vector, and y_i the i-th component of the sample in the feature (or mapped) space. We should aim to select features leading to a large between-class distance and a small within-class variance in the feature vector space. This means that the features should take distant values in the different classes and closely located values within the same class. In discriminant analysis, within-class, between-class and mixture scatter matrices are used to formulate criteria of class separability. The within-class scatter matrix shows the scatter of the samples around their respective class expected vectors, and is expressed by

S_w = \sum_{i=1}^{L} P_i E\{(X - M_i)(X - M_i)^T | \omega_i\} = \sum_{i=1}^{L} P_i \Sigma_i    (17)

where P_i is the a priori probability of class ω_i, that is, P_i = n_i / N, and n_i is the number of samples in class ω_i out of a total of N samples [7]. On the other hand, the between-class scatter matrix is the scatter of the expected vectors around the mixture mean:

S_b = \sum_{i=1}^{L} P_i (M_i - M_0)(M_i - M_0)^T    (18)

where M_0 represents the expected vector of the mixture distribution and is given by

M_0 = E\{X\} = \sum_{i=1}^{L} P_i M_i    (19)

The mixture scatter matrix is the covariance matrix of all samples regardless of their class assignments, and is defined by

S_m = E\{(X - M_0)(X - M_0)^T\} = S_w + S_b    (20)

The covariance matrix of X is then taken as

\Sigma_X = S_w^{-1} S_m = I + S_w^{-1} S_b    (21)

In the context of pattern recognition, the coefficients y_1, ..., y_n in the expansion are viewed as feature values representing the observed vector X in the feature space. The feature space has several attractive properties, which we can list [8]:

1. The effectiveness of each feature, in terms of representing X, is determined by its corresponding eigenvalue. If a feature, say φ_i, is deleted, the mean-square error increases by λ_i. Therefore, the feature with the smallest eigenvalue should be deleted first, and so on. If the eigenvalues are indexed in descending order, λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0, the features should be ordered in the same manner. The feature values are mutually uncorrelated, that is, the covariance matrix of Y is diagonal. This follows because

\Lambda_Y = \Phi^T \Sigma_X \Phi = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_n)    (22)

2. The set of m eigenvectors of Σ_X which correspond to the m largest eigenvalues minimizes the mean-square error over all choices of m orthonormal basis vectors. In this case the covariance matrix of the reduced feature vector is

\Lambda_Y^{(m \times m)} = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_m)    (23)

and the transformation matrix is

\Phi = [\phi_1 . . . \phi_m]_{n \times m}    (24)

Then, the reduced-dimension vectors are

X_{m \times 1} = \Phi_{n \times m}^T X_{n \times 1}    (25)
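As an illustrative sketch (not the authors' code), the scatter matrices (17)-(20) and the reduction (25) can be computed in Python as follows. X1 and X2 are assumed to be arrays of estimated parameter vectors (one row per estimate) from the two classes, and the projection directions are taken as the leading eigenvectors of S_w^{-1} S_m, in line with (21)-(23).

    import numpy as np

    def scatter_reduction(X1, X2, m=2):
        """Scatter matrices and a projection of the samples onto m directions."""
        N1, N2 = len(X1), len(X2)
        P1, P2 = N1 / (N1 + N2), N2 / (N1 + N2)              # a priori probabilities P_i = n_i / N
        M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
        M0 = P1 * M1 + P2 * M2                                # mixture mean, eq. (19)
        Sw = (P1 * np.cov(X1, rowvar=False, bias=True)
              + P2 * np.cov(X2, rowvar=False, bias=True))     # within-class scatter, eq. (17)
        Sb = (P1 * np.outer(M1 - M0, M1 - M0)
              + P2 * np.outer(M2 - M0, M2 - M0))              # between-class scatter, eq. (18)
        Sm = Sw + Sb                                          # mixture scatter, eq. (20)
        # eigenvectors of Sw^{-1} Sm, ordered by decreasing eigenvalue, eqs. (21)-(23)
        eigval, eigvec = np.linalg.eig(np.linalg.solve(Sw, Sm))
        order = np.argsort(eigval.real)[::-1]
        Phi = eigvec[:, order[:m]].real                       # transformation matrix, eq. (24)
        return X1 @ Phi, X2 @ Phi, Phi                        # reduced vectors, eq. (25)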

IV. LINEAR DISCRIMINANT FUNCTION

Discrimination is a separation procedure that tries to find a discriminant function whose numerical values are such that observations from the different classes are separated as much as possible. An allocation procedure that uses a discriminant function as a well-defined rule in order to optimally assign a new observation to one of the labelled classes is called classification. It is evident that only good discrimination leads to good classification [9]. Let us focus on the two-class case and consider the family of discriminant functions that are linear combinations of the components of x = (x_1, . . . , x_n)^T:

h(X) = V^T X + v_0 > 0  =>  X ∈ ω_1
h(X) = V^T X + v_0 < 0  =>  X ∈ ω_2    (26)

This is called a linear discriminant function; V = [v_1, v_2, . . . , v_n]^T is known as the weight vector and v_0 as the threshold. The design task is to find the optimum coefficients of the weight vector and the threshold value for given distributions under various criteria.

A. OPTIMUM DESIGN PROCEDURE

Equation (26) indicates that an n-dimensional vector X is projected onto a vector V, and that the variable y = V^T X in the projected one-dimensional h-space is classified to either ω_1 or ω_2, depending on whether y > -v_0 or y < -v_0. Figure 2 shows an example in which the distributions are projected onto two vectors, V and V'. From Fig. 2, we notice that the error on V' is smaller than that on V. Therefore, the optimum design procedure for a linear classifier is to select the V and v_0 which give the smallest error in the projected h-space. When X is normally distributed, h(X) of equation (26) is also normal. Therefore, the error in the h-space is determined by η_i = E{h(X) | ω_i} and σ_i^2 = Var{h(X) | ω_i}, which are functions of V and v_0 [10]. The expected values and variances of h(X) are

\eta_i = E\{h(X) | \omega_i\} = V^T E\{X | \omega_i\} + v_0 = V^T M_i + v_0    (27)

\sigma_i^2 = Var\{h(X) | \omega_i\} = V^T E\{(X - M_i)(X - M_i)^T | \omega_i\} V = V^T \Sigma_i V

Let f(η_1, η_2, σ_1^2, σ_2^2) be any criterion to be minimized or maximized for determining the optimum V and v_0. The derivatives of f with respect to V and v_0 give two equations, and their solution for V gives the optimum

V = [s \Sigma_1 + (1 - s) \Sigma_2]^{-1} (M_2 - M_1)    (28)

where

s = \frac{\partial f / \partial \sigma_1^2}{\partial f / \partial \sigma_1^2 + \partial f / \partial \sigma_2^2}    (29)

B. FISHER'S LINEAR DISCRIMINANT

Fisher's linear discriminant is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity and the fact that it does not require strict distributional assumptions. For two classes of observations, the Fisher criterion is given by

f = \frac{(\eta_1 - \eta_2)^2}{\sigma_1^2 + \sigma_2^2}    (30)

This criterion measures the difference of the two means normalized by the averaged variance. Substituting the derivatives of f with respect to σ_1^2 and σ_2^2 into equation (29) gives s = 0.5, and from equation (28) the optimum V is

V = [0.5 \Sigma_1 + 0.5 \Sigma_2]^{-1} (M_2 - M_1)    (31)

h(X) with the V of equation (31) is called the Fisher discriminant function, and the resulting classifier the Fisher linear classifier.
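The step from (29) and (30) to s = 0.5 is only stated above; it follows from a direct computation of the partial derivatives of the Fisher criterion:

\frac{\partial f}{\partial \sigma_1^2} = -\frac{(\eta_1 - \eta_2)^2}{(\sigma_1^2 + \sigma_2^2)^2} = \frac{\partial f}{\partial \sigma_2^2}
\quad \Longrightarrow \quad
s = \frac{\partial f / \partial \sigma_1^2}{\partial f / \partial \sigma_1^2 + \partial f / \partial \sigma_2^2} = \frac{1}{2}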

The Fisher criterion does not depend on v_0, because the subtraction of η_2 from η_1 eliminates v_0 from equation (26).

Fig. 2 An example of linear mapping: two class distributions projected onto the vectors V and V'
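A minimal Python sketch of the Fisher classifier (31), not taken from the paper: Y1 and Y2 are assumed to be the reduced-dimension samples of the fault-free and faulty classes (for example, the output of the reduction sketch in Section III), and the threshold is placed halfway between the projected class means, which is one simple choice since the Fisher criterion itself does not fix v_0.

    import numpy as np

    def fisher_classifier(Y1, Y2):
        """Fisher weight vector V (eq. 31) and a midpoint threshold for h(X) = V^T X + v0."""
        M1, M2 = Y1.mean(axis=0), Y2.mean(axis=0)
        S1 = np.cov(Y1, rowvar=False, bias=True)
        S2 = np.cov(Y2, rowvar=False, bias=True)
        V = np.linalg.solve(0.5 * S1 + 0.5 * S2, M2 - M1)     # eq. (31)
        v0 = -0.5 * (V @ M1 + V @ M2)                         # threshold between projected means
        return V, v0

    def classify(x, V, v0):
        # With V from (31) the faulty-class projections are the larger ones,
        # so h(X) > 0 is taken here to indicate the faulty class.
        return "faulty" if V @ x + v0 > 0 else "fault-free"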

V. SIMULATION RESULTS

To show the efficiency of the proposed algorithm, consider the second-order system with

A(z^{-1}) = 1 + 0.2 z^{-1} + 0.85 z^{-2},  B(z^{-1}) = 0.8 z^{-1} + 0.55 z^{-2}

so that the parameter vector is \theta^T = [0.2, 0.85, 0.8, 0.55]. The sequence u(k) is generated as a white sequence with a Gaussian distribution of zero mean and unit variance, while the disturbance n(k) is generated as a white noise sequence with zero mean and variance σ^2 = 0.2. The system parameters were estimated using the recursive least squares algorithm. Fig. 3 shows the estimated parameters when there is no fault. Data from t = 0 to t = 4000 s was used for identification.

Fig. 3 Estimated parameters a_1, a_2, b_1, b_2 when there is no fault

Fig. 4 shows the residual and the threshold in the faulty situation; the fault occurs at t = 2400 s.

Fig. 4 The residual (black) and the threshold (red) when the fault occurs

When the fault occurs, the estimated parameters change; moreover, not all parameters are of equal informative value. Dimension reduction from the initial dimension n = 4 to the reduced dimension m = 2 was therefore performed using the scattering matrices and separability criteria. Fig. 5 shows the current mode, represented by a point in the two-dimensional (2D) space Y1-Y2, for the nominal mode and the faulty mode.

Fig. 5 Dimension reduction using scattering matrices and separability criteria: X1 (fault-free class) and X2 (faulty class) in the Y1-Y2 plane

Fig. 6 shows the separation between the fault-free class and the faulty class obtained using Fisher's linear discriminant.

Fig. 6 Fisher's linear discriminant classifier separating the fault-free class from the faulty class
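The experiment described above can be reproduced in outline with the following Python sketch. The way the fault is injected (a change of the a-parameters at k = 2400) is an assumption made for illustration; the paper does not specify the fault, and the faulty coefficients below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 4000
    u = rng.normal(0.0, 1.0, N)                  # white input, zero mean, unit variance
    n = rng.normal(0.0, np.sqrt(0.2), N)         # white noise, variance 0.2

    a_nom, b_nom = [0.2, 0.85], [0.8, 0.55]      # nominal parameters from the paper
    a_flt, b_flt = [0.3, 0.95], [0.8, 0.55]      # hypothetical faulty parameters

    y = np.zeros(N)
    for k in range(N):
        a, b = (a_nom, b_nom) if k < 2400 else (a_flt, b_flt)   # fault injected at k = 2400
        ar = sum(a[i] * y[k - 1 - i] for i in range(2) if k - 1 - i >= 0)
        xb = sum(b[j] * u[k - 1 - j] for j in range(2) if k - 1 - j >= 0)
        y[k] = -ar + xb + n[k]

    # The RLS sketch (Section II) then yields parameter estimates over time, the fault-free and
    # faulty segments give the two classes of 4-dimensional feature vectors, the scatter-matrix
    # sketch (Section III) reduces them to 2 dimensions, and the Fisher sketch (Section IV)
    # separates the two classes, mirroring Figs. 3-6.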

VI. CONCLUSION

This paper presented an approach for dimension reduction based on scattering matrices and classification using Fisher's linear discriminant. The model parameters are estimated by the recursive least squares method. Dimension reduction based on scattering matrices is derived and used to separate the nominal mode from the faulty mode. The classification between the nominal mode and the faulty mode is then made using Fisher's linear discriminant. Finally, the performance of the proposed method is illustrated by simulation results.

REFERENCES

[1] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[2] B. Bao, Y. Xu, J. Sheng, and R. Ding, "Least Squares Based Iterative Parameter Estimation Algorithm for Multivariable Controlled ARMA System Modelling with Finite Measurement Data," Mathematical and Computer Modelling, vol. 53, pp. 1664-1669, 2011.
[3] N. Orani, "Higher-Order Sliding Mode Techniques for Fault Diagnosis," PhD thesis, Dept. of Electrical and Electronic Engineering, University of Cagliari, March 2010.
[4] S. Bedoui, M. Ltaief, and K. Abderrahim, "A New Recursive Algorithm for Simultaneous Identification of Discrete Time Delay Systems," 16th IFAC Symposium on System Identification, Brussels, Belgium, July 11-13, 2012.
[5] J. Jiang and Y. Zhang, "A Revisit to Block and Recursive Least Squares for Parameter Estimation," Computers and Electrical Engineering, vol. 30, pp. 403-416, May 2004.
[6] B. W. Silverman, "Density Estimation for Statistics and Data Analysis," Chapman and Hall, 1986.
[7] P. Kvam and B. Vidakovic, "Nonparametric Statistics with Applications to Science and Engineering," John Wiley & Sons, 2007.
[8] H. Akaike, "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716-723, 1974.
[9] M. Sever, J. Lajovic, and B. Rajer, "Robustness of the Fisher's Discriminant Function to Skew-Curved Normal Distribution," Metodološki zvezki, 241-242, National and University Library of Slovenia, 2005.
[10] B. W. Silverman, "Density Estimation for Statistics and Data Analysis," Chapman and Hall, 1986.