Using a New Nonlinear Gradient Method for Solving Large Scale Convex Optimization Problems with an Application on Arabic Medical Text
Jaafar Hammoud (a,*), Ali Eisa (b), Natalia Dobrenko (a), Natalia Gusarova (a)

(a) ITMO University, St. Petersburg, Russia
(b) Aleppo University, Aleppo, Syria
* [email protected]

Abstract

Gradient methods have applications in multiple fields, including signal processing, image processing, and dynamic systems. In this paper we present a nonlinear gradient method for minimizing convex supra-quadratic functions; the search direction is improved by hybridizing the two conjugate-gradient coefficients HRM [2] and NHS [1]. Numerical results demonstrate the effectiveness of the presented method on standard test problems, and the method reaches the exact solution when the objective function is quadratic and convex. We also present an application to the named entity recognition problem in Arabic medical text, which confirms the stability of the proposed method and its efficiency in terms of execution time.

Keywords: Convex optimization; Gradient methods; Named entity recognition; Arabic; e-Health.

1. Introduction

Several nonlinear conjugate gradient (CG) methods have been proposed for solving high-dimensional unconstrained optimization problems of the form [3]

\min f(x), \quad x \in \mathbb{R}^n. \tag{1}

To solve problem (1), we use the iterative relation

x_{k+1} = x_k + \alpha_k d_k, \quad k = 0, 1, 2, \dots, \tag{2}

where \alpha_k > 0 is a step size computed from the strong Wolfe-Powell conditions [4]

f(x_k + \alpha_k d_k) \le f(x_k) + \delta \alpha_k g_k^T d_k, \qquad |g(x_k + \alpha_k d_k)^T d_k| \le \sigma |g_k^T d_k|, \tag{3}

with 0 < \delta < \sigma < 1, and d_k is the search direction, computed as [3]

d_0 = -g_0, \qquad d_{k+1} = -g_{k+1} + \beta_k d_k, \quad k \ge 0. \tag{4}

Here g_k = g(x_k) = \nabla f(x_k) is the gradient of f at the point x_k, and \beta_k \in \mathbb{R} is the CG coefficient that characterizes the different CG methods. Some classical choices are:

HS [4]: \beta_k^{HS} = \dfrac{g_k^T (g_k - g_{k-1})}{(g_k - g_{k-1})^T d_{k-1}}

FR [5]: \beta_k^{FR} = \dfrac{g_k^T g_k}{\|g_{k-1}\|^2}

PRP [6, 7]: \beta_k^{PRP} = \dfrac{g_k^T (g_k - g_{k-1})}{\|g_{k-1}\|^2}

HRM [2]: \beta_k^{HRM} = \dfrac{\frac{\|g_k\|}{\|g_{k-1}\|}\, g_k^T (g_k - g_{k-1})}{\tau \|g_{k-1}\|^2 + (1 - \tau) \|d_{k-1}\|^2}, \quad \tau = 0.4

NHS [1]: \beta_k^{NHS} = \dfrac{\|g_k\|^2 - \frac{\|g_k\|}{\|g_{k-1}\|} \max\{0,\, g_k^T g_{k-1}\}}{\max\{\max\{0,\, u g_k^T d_{k-1}\} + \|g_{k-1}\|^2,\; d_k^T y_{k-1}\}}, \quad u = 1.1

The iteration stops at a point x_k for which \|g_k\| \le \varepsilon, where \varepsilon is a very small positive number. Among the most common methods that rely on this strategy are Newton's method [9], quasi-Newton methods [10, 11, 12], trust-region methods [13, 14], and conjugate gradient methods [15, 16].

On the other hand, optimization methods play one of the most important roles in training neural networks (NN), since they are used to reduce the loss by updating the attributes of the network, such as the weights and the learning rate. The effect of choosing one optimization algorithm over another has been studied by many researchers; despite the continuous development in this area, the widespread machine learning and deep learning platforms rely on a specific group of such algorithms, e.g. ADAM [22], SGD with momentum [23], and RMSprop [24], while still allowing a custom optimizer to be defined. In [25], the named entity recognition problem was studied on Arabic medical text taken from three medical volumes issued by the Arab Medical Encyclopedia; the researchers used the BERT model [26] introduced by Google.
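For concreteness, the two coefficients that are later hybridized can be written as short helper functions. The listing below is a minimal NumPy sketch of the HRM and NHS formulas as stated above; the function names and argument order are ours, not part of [1, 2], and the curvature term in the NHS denominator is read here as d_{k-1}^T y_{k-1}, since d_k is not yet available when \beta_k is formed.

```python
import numpy as np

def beta_hrm(g, g_prev, d_prev, tau=0.4):
    """HRM coefficient: scaled PRP-type numerator over a convex combination
    of ||g_{k-1}||^2 and ||d_{k-1}||^2, with tau = 0.4 as in the text."""
    num = (np.linalg.norm(g) / np.linalg.norm(g_prev)) * g.dot(g - g_prev)
    den = tau * np.linalg.norm(g_prev) ** 2 + (1.0 - tau) * np.linalg.norm(d_prev) ** 2
    return num / den

def beta_nhs(g, g_prev, d_prev, y_prev, u=1.1):
    """NHS coefficient with u = 1.1; the max operations in the denominator
    guard against small or negative curvature terms."""
    num = (np.linalg.norm(g) ** 2
           - (np.linalg.norm(g) / np.linalg.norm(g_prev)) * max(0.0, g.dot(g_prev)))
    den = max(max(0.0, u * g.dot(d_prev)) + np.linalg.norm(g_prev) ** 2,
              d_prev.dot(y_prev))  # our reading of the d^T y curvature term
    return num / den
```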
As an application of the presented method, we implement our optimizer on the same dataset and compare the results with those reported previously. In the following sections, a hybrid method for solving problem (1) is presented, its convergence is studied, the numerical results of the method are reported, and finally an application to Arabic medical text processing demonstrates the efficiency of the method.

2. The formula and its convergence

In the following, we present a nonlinear gradient method for solving high-dimensional convex functions by hybridizing two CG formulas [28]. The new coefficient is given by

\beta_k^{WHM} = (1 - \theta_k)\, \beta_k^{NHS} + \theta_k\, \beta_k^{HRM}. \tag{5}

From (5) we distinguish the following cases:

Case 1: if \theta_k = 0, then \beta_k^{WHM} = \beta_k^{NHS}.

Case 2: if 0 < \theta_k < 1, we find the new value of \theta_k by using the search direction condition introduced in [16]:

d_{k+1}^T y_k = -t\, s_k^T g_{k+1}, \quad t > 0, \tag{6}

where y_k = g_{k+1} - g_k and s_k = x_{k+1} - x_k. The new search direction is given by

d_{k+1} = -g_{k+1} + \beta_k^{WHM} d_k. \tag{7}

From (5) and (7) we find

d_{k+1} = -g_{k+1} + \big((1 - \theta_k)\beta_k^{NHS} + \theta_k \beta_k^{HRM}\big) d_k, \tag{8}

and substituting (8) into (6) gives

-g_{k+1}^T y_k + (1 - \theta_k)\beta_k^{NHS} d_k^T y_k + \theta_k \beta_k^{HRM} d_k^T y_k = -t\, s_k^T g_{k+1},

\theta_k (\beta_k^{HRM} - \beta_k^{NHS}) d_k^T y_k = -t\, s_k^T g_{k+1} + g_{k+1}^T y_k - \beta_k^{NHS} d_k^T y_k,

\theta_k^{new} = \dfrac{-t\, s_k^T g_{k+1} + g_{k+1}^T y_k - \beta_k^{NHS} d_k^T y_k}{(\beta_k^{HRM} - \beta_k^{NHS})\, d_k^T y_k}. \tag{9}

Case 3: if \theta_k = 1, then \beta_k^{WHM} = \beta_k^{HRM}.

2.1 Algorithm steps

Input: a starting point x_0 \in \mathbb{R}^n, an objective function f, and \varepsilon > 0.

Step 0: Compute the gradient g_0 = \nabla f(x_0), the initial search direction d_0 = -g_0, and the step size \lambda_0 = 1 / \|g_0\|; set k = 0. If \|g_k\| \le \varepsilon, stop; otherwise go to Step 1.

Step 1: Compute a step size \lambda_k that satisfies the strong Wolfe-Powell conditions
f(x_k + \lambda_k d_k) - f_k \le \delta \lambda_k g_k^T d_k, \qquad |g(x_k + \lambda_k d_k)^T d_k| \le -\sigma g_k^T d_k,
where \delta \in (0, 0.5) and \sigma \in (\delta, 1).

Step 2: Set the new point x_{k+1} = x_k + \lambda_k d_k. If \|g_{k+1}\| \le \varepsilon, stop; otherwise go to Step 3.

Step 3: Compute s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k.

Step 4: Compute \beta_k^{NHS} and \beta_k^{HRM} from

\beta_k^{HRM} = \dfrac{\frac{\|g_k\|}{\|g_{k-1}\|}\, g_k^T (g_k - g_{k-1})}{\tau \|g_{k-1}\|^2 + (1 - \tau) \|d_{k-1}\|^2}, \quad \tau = 0.4,

\beta_k^{NHS} = \dfrac{\|g_k\|^2 - \frac{\|g_k\|}{\|g_{k-1}\|} \max\{0,\, g_k^T g_{k-1}\}}{\max\{\max\{0,\, u g_k^T d_{k-1}\} + \|g_{k-1}\|^2,\; d_k^T y_{k-1}\}}, \quad u = 1.1.

Step 5: Compute \theta_k^{new} from

\theta_k^{new} = \dfrac{-t\, s_k^T g_{k+1} + g_{k+1}^T y_k - \beta_k^{NHS} d_k^T y_k}{(\beta_k^{HRM} - \beta_k^{NHS})\, d_k^T y_k}.

Step 6: If 0 < \theta_k^{new} < 1, compute
\beta_k = \beta_k^{WHM} = (1 - \theta_k)\beta_k^{NHS} + \theta_k \beta_k^{HRM};
if \theta_k^{new} = 0, set \beta_k = \beta_k^{NHS}; if \theta_k^{new} = 1, set \beta_k = \beta_k^{HRM}.

Step 7: Set the trial search direction d^{new} = -g_{k+1} + \beta_k d_k.

Step 8: If |g_{k+1}^T g_k| \ge 0.2 \|g_{k+1}\|^2, set d_{k+1} = -g_{k+1}; otherwise set d_{k+1} = d^{new}. Then set \lambda_{k+1} = \lambda_k \dfrac{\|d_k\|}{\|d_{k+1}\|}.

Step 9: Set k = k + 1 and go to Step 1.
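The steps above can be collected into a compact routine. The sketch below is an illustrative Python implementation under our own assumptions: it reuses the beta_hrm and beta_nhs helpers from the earlier listing, delegates the strong Wolfe-Powell line search to scipy.optimize.line_search (with example values \delta = 10^{-4}, \sigma = 0.1) instead of the authors' own line-search routine, and treats any \theta_k^{new} outside (0, 1) as the corresponding pure case.

```python
import numpy as np
from scipy.optimize import line_search  # strong Wolfe-Powell step-size search

def whm_cg(f, grad, x0, eps=1e-6, t=1.0, max_iter=5000):
    """Sketch of the hybrid (WHM) conjugate gradient method, Steps 0-9.
    Assumes the beta_hrm / beta_nhs helpers from the earlier listing are in scope."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    lam = 1.0 / np.linalg.norm(g)                     # Step 0
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps:                  # stopping test
            break
        # Step 1: strong Wolfe-Powell step size (delta = 1e-4, sigma = 0.1 here)
        alpha = line_search(f, grad, x, d, gfk=g, c1=1e-4, c2=0.1)[0]
        lam = alpha if alpha is not None else lam     # keep previous step if search fails
        x_new = x + lam * d                           # Step 2
        g_new = grad(x_new)
        if np.linalg.norm(g_new) <= eps:
            return x_new
        s, y = x_new - x, g_new - g                   # Step 3
        b_hrm = beta_hrm(g_new, g, d)                 # Step 4
        b_nhs = beta_nhs(g_new, g, d, y)
        # Step 5: theta from relation (9)
        theta = ((-t * s.dot(g_new) + g_new.dot(y) - b_nhs * d.dot(y))
                 / ((b_hrm - b_nhs) * d.dot(y)))
        # Step 6: hybrid coefficient (values outside (0, 1) fall back to a pure formula)
        if 0.0 < theta < 1.0:
            beta = (1.0 - theta) * b_nhs + theta * b_hrm
        elif theta <= 0.0:
            beta = b_nhs
        else:
            beta = b_hrm
        d_new = -g_new + beta * d                     # Step 7
        # Step 8: restart when consecutive gradients are far from orthogonal
        if abs(g_new.dot(g)) >= 0.2 * np.linalg.norm(g_new) ** 2:
            d_new = -g_new
        lam = lam * np.linalg.norm(d) / np.linalg.norm(d_new)
        x, g, d = x_new, g_new, d_new                 # Step 9
    return x
```

For example, on the convex quadratic f(x) = x^T x the call whm_cg(lambda x: x @ x, lambda x: 2 * x, np.ones(100)) converges to the origin within a few iterations.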
2.2 Convergence analysis

The following assumptions are often used in studies of conjugate gradient methods [2, 18].

Assumption A: f(x) is bounded from below on the level set \Omega = \{x \in \mathbb{R}^n \mid f(x) \le f(x_0)\}, where x_0 is the starting point.

Assumption B: In some neighbourhood N of \Omega, the objective function is continuously differentiable and its gradient is Lipschitz continuous; that is, there exists a constant L > 0 such that
\|g(x) - g(y)\| \le L \|x - y\| \quad \forall x, y \in N.

Assumption C: \|g(x)\| \le \Gamma for all x \in \Omega, with \Gamma \ge 0.

Theorem 1: Suppose that the sequences \{g_k\} and \{d_k\} are generated by the presented method. Then the sequence \{d_k\} satisfies the sufficient descent condition

g_k^T d_k \le -c \|g_k\|^2 \quad \forall k \ge 0, \quad c > 0. \tag{10}

Proof: For k = 0, relation (10) is fulfilled, because g_0^T d_0 = -\|g_0\|^2.

For k \ge 1 we have

d_{k+1} = -g_{k+1} + \beta_k^{WHM} d_k
        = -g_{k+1} + \big((1 - \theta_k)\beta_k^{NHS} + \theta_k \beta_k^{HRM}\big) d_k
        = -(\theta_k g_{k+1} + (1 - \theta_k) g_{k+1}) + \big((1 - \theta_k)\beta_k^{NHS} + \theta_k \beta_k^{HRM}\big) d_k
        = \theta_k (-g_{k+1} + \beta_k^{HRM} d_k) + (1 - \theta_k)(-g_{k+1} + \beta_k^{NHS} d_k)
        = \theta_k d_{k+1}^{HRM} + (1 - \theta_k) d_{k+1}^{NHS}.

We distinguish the cases according to the value of \theta_k:

i. If \theta_k = 0, then d_{k+1} = d_{k+1}^{NHS} and
g_{k+1}^T d_{k+1} = g_{k+1}^T d_{k+1}^{NHS} = g_{k+1}^T (-g_{k+1} + \beta_k^{NHS} d_k) \le -c_1 \|g_{k+1}\|^2,
where c_1 = 1 - \dfrac{1}{\mu}, \mu = 1.1.

ii. If \theta_k = 1, then d_{k+1} = d_{k+1}^{HRM} and
g_{k+1}^T d_{k+1} = g_{k+1}^T d_{k+1}^{HRM} = g_{k+1}^T (-g_{k+1} + \beta_k^{HRM} d_k) \le -c_2 \|g_{k+1}\|^2,
where c_2 = 2 - \dfrac{1}{1 - 5\sigma}, \sigma = 0.001.

iii. If 0 < \theta_k < 1, suppose that 0 < m_1 \le \theta_k \le m_2 < 1. Then
g_{k+1}^T d_{k+1} = \theta_k g_{k+1}^T d_{k+1}^{HRM} + (1 - \theta_k) g_{k+1}^T d_{k+1}^{NHS}
\le m_1 g_{k+1}^T d_{k+1}^{HRM} + (1 - m_2) g_{k+1}^T d_{k+1}^{NHS}
\le -m_1 c_2 \|g_{k+1}\|^2 - (1 - m_2) c_1 \|g_{k+1}\|^2
= -c \|g_{k+1}\|^2, \qquad c = m_1 c_2 + (1 - m_2) c_1.

Theorem 2: If the above assumptions are fulfilled, then, since the search direction satisfies the sufficient descent condition, the presented method possesses the global convergence property.
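The sufficient descent property of Theorem 1 can also be inspected numerically. The snippet below is an illustrative check under our own assumptions: it runs the \theta_k = 0 (pure NHS) branch of the hybrid direction on a randomly generated convex quadratic, reusing the beta_nhs helper and scipy.optimize.line_search from the earlier listings, and reports the largest observed value of g_k^T d_k / \|g_k\|^2, which should stay below some -c < 0.

```python
import numpy as np
from scipy.optimize import line_search

# Random convex quadratic f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite.
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(n)
g = grad(x)
d = -g
ratios = []  # g_k^T d_k / ||g_k||^2 along the iterations
for k in range(20):
    alpha = line_search(f, grad, x, d, gfk=g, c2=0.1)[0] or 1e-3
    x_new = x + alpha * d
    g_new = grad(x_new)
    if np.linalg.norm(g_new) <= 1e-10:
        break
    y = g_new - g
    d_new = -g_new + beta_nhs(g_new, g, d, y) * d   # theta_k = 0 branch of the hybrid
    if abs(g_new @ g) >= 0.2 * np.linalg.norm(g_new) ** 2:
        d_new = -g_new                              # restart, as in Step 8
    ratios.append(g_new @ d_new / np.linalg.norm(g_new) ** 2)
    x, g, d = x_new, g_new, d_new

print(max(ratios))  # expected to be negative, i.e. bounded above by some -c < 0
```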