AN EFFICIENT GRADIENT METHOD WITH APPROXIMATELY OPTIMAL STEPSIZES BASED ON REGULARIZATION MODELS FOR UNCONSTRAINED OPTIMIZATION

Abstract. It is widely accepted that the stepsize is of great significance to gradient methods. An efficient gradient method with approximately optimal stepsizes, based mainly on regularization models, is proposed for unconstrained optimization. More specifically, if the objective function is not close to a quadratic function on the line segment between the current and latest iterates, a regularization model is carefully exploited to generate the approximately optimal stepsize. Otherwise, a quadratic approximation model is used. In addition, when the curvature is non-positive, a special regularization model is developed. The convergence of the proposed method is established under some weak conditions. Extensive numerical experiments indicate that the proposed method is very promising. Owing to its surprising efficiency, we believe that gradient methods with approximately optimal stepsizes can become strong candidates for large-scale unconstrained optimization.

It is widely accepted that the stepsize is of great significance to the theory and numerical performance of gradient methods, and the choice of stepsize is the core problem of gradient methods. The classical steepest descent method [10], where the stepsize is given by
\[
\alpha_k^{SD} = \arg\min_{\alpha > 0} f(x_k - \alpha g_k),
\]
is badly affected by ill conditioning and thus converges slowly [1]. In 1988, Barzilai and Borwein [3] proposed a two-point gradient method (BB method), where the famous stepsize (BB stepsize) is given by
\[
\alpha_k^{BB1} = \frac{s_{k-1}^T s_{k-1}}{s_{k-1}^T y_{k-1}}, \qquad s_{k-1} = x_k - x_{k-1}, \quad y_{k-1} = g_k - g_{k-1}.
\]
Due to its simplicity and nice numerical efficiency, the BB method has received extensive attention. The BB method has been shown to be globally [30] and R-linearly [12] convergent for strictly convex quadratic functions of any dimension. In 2021, Li and Sun [23] presented an interesting and improved R-linear convergence result for the BB method. Raydan [31] proposed the global BB method by incorporating the nonmonotone line search (GLL line search) [19]. Dai et al. [13] presented a quite efficient gradient method by adaptively choosing the BB stepsizes. Dai et al. [14] viewed the BB stepsize from a new angle and constructed a quadratic model and a conic model to derive two stepsizes for gradient methods. In 2018, Liu et al. [26] viewed the stepsize $\alpha_k^{BB1}$ from the perspective of approximation models and introduced a new type of stepsize for gradient methods, the approximately optimal stepsize: given an approximation model $\phi_k(\alpha)$ of $f(x_k - \alpha g_k)$, the approximately optimal stepsize is defined (Definition 1.1) by
\[
\bar{\alpha}_k = \arg\min_{\alpha > 0} \phi_k(\alpha). \qquad (1.4)
\]
From (1.4), we can easily obtain the following simple facts. (i) If $\phi_k(\alpha) = f(x_k - \alpha g_k)$, then the resulting approximately optimal stepsize is the Cauchy stepsize; this is the reason that we call the stepsize in (1.4) the approximately optimal stepsize. (ii) If $\phi_k(\alpha)$ is taken to be a simple quadratic model whose minimizer is a given stepsize $\alpha_k$, it is easy to see that the resulting approximately optimal stepsize is exactly $\alpha_k$. As a result, all existing stepsizes for gradient methods can be regarded as approximately optimal stepsizes in this sense.
Some gradient methods with approximately optimal stepsizes [24,25] have been proposed, and the numerical experiments in [24,25] indicated that these methods are very efficient. Gradient methods with approximately optimal stepsizes have thus shown great potential for unconstrained optimization.
In addition, based on a fourth order conic model and some modified secant equations, Biglari and Solimanpur [6] presented some modified BB methods. Recently, motivated by Yuan's stepsize [36], Huang et al. [22] equipped the Barzilai-Borwein method with the two dimensional quadratic termination property and proposed a novel stepsize for gradient methods (HDL, corresponding to Algorithm 3.1 in [22]) for general unconstrained optimization. More modified BB methods can be found in [15,28,29,35].

Contributions. According to Definition 1.1, it is not difficult to see that the effectiveness of the approximately optimal stepsize relies heavily on the approximation model $\phi_k(\alpha)$. To obtain more efficient gradient methods with approximately optimal stepsizes, one should take full advantage of the properties of $f$ at $x_k$ to exploit suitable approximation models, including quadratic and non-quadratic models, for deriving approximately optimal stepsizes. In this paper, we present an efficient gradient method with approximately optimal stepsizes based on regularization models for unconstrained optimization. In the proposed method, if the objective function $f$ is not close to a quadratic function on the line segment between $x_{k-1}$ and $x_k$, then a regularization model is exploited to generate the approximately optimal stepsize. Otherwise, a quadratic approximation model is used to derive the approximately optimal stepsize. In addition, when $s_{k-1}^T y_{k-1} \le 0$, a special regularization model is carefully developed. The global convergence of the proposed method is analyzed. The numerical results indicate that the proposed method is superior to the HDL method [22] and other efficient gradient methods, is competitive with the two famous conjugate gradient software packages CGOPT (1.0) [11] and CG DESCENT (5.0) [20] on the 145 test problems in the CUTEr library [18], and shows significant improvement over CGOPT (1.0) [11] and CG DESCENT (5.0) [20] on the 80 test problems mainly from [2].
The rest of the paper is organized as follows. In Section 2, some approximation models, including regularization models and quadratic models, are exploited to generate approximately optimal stepsizes for the gradient method. In Section 3, an efficient gradient method with the approximately optimal stepsizes is described in detail. The global convergence of the proposed method is analyzed in Section 4. In Section 5, some numerical results are presented. Conclusion and discussion are given in the last section.

Derivation of approximately optimal stepsizes
Based on the properties of $f$ at the current iterate $x_k$, some approximation models, including regularization models and quadratic models, are carefully exploited in this section to derive approximately optimal stepsizes for the gradient method.
As mentioned above, the effectiveness of the approximately optimal stepsize relies heavily on the approximation model $\phi_k(\alpha)$. We therefore carefully design suitable approximation models based mainly on the properties of $f$ at $x_k$. The choices of approximation models stem from the following observation. Define $\mu_{k-1}$ as in [26]. According to [26], $\mu_{k-1}$ is an important criterion for measuring how closely $f$ approximates a quadratic function.
If the condition (2.2), adopted from [14,25], holds, where $0 < c_1 < c_2$, then $f$ might be close to a quadratic function on the line segment between $x_{k-1}$ and $x_k$.
When $f$ is close to a quadratic function on the line segment between $x_{k-1}$ and $x_k$, a quadratic approximation model is certainly preferable. However, if the objective function $f$ possesses high non-linearity, then quadratic models might not work very well [32,33], so some non-quadratic approximation models should be considered in this case. In recent years, regularization algorithms, in which the standard quadratic model is augmented with a regularization term, have been proposed for unconstrained optimization [8]. An adaptive regularization algorithm using cubics (ARC) was proposed by Cartis et al. [8]. The trial step in the ARC algorithm [8] is computed by minimizing a regularization model of the form
\[
m_k(d) = f_k + g_k^T d + \frac{1}{2} d^T B_k d + \frac{1}{3}\sigma_k \|d\|^3,
\]
where $B_k$ is a symmetric approximation to the Hessian matrix and $\sigma_k > 0$ is a regularization parameter. The numerical results in [9] indicated that the ARC algorithm is quite efficient. More advances on regularization algorithms can be found in [4,5,34]. Regularization algorithms have become an alternative to trust region and line search schemes [8]. All of this suggests that when $f$ is not close to a quadratic function around $x_k$, regularization models might generally serve better than quadratic models. Motivated by the above observation, we consider the approximation model (2.3) and derive approximately optimal stepsizes for gradient methods in the following four cases, based on the sign of $s_{k-1}^T y_{k-1}$ and the condition (2.2).
Case I. $s_{k-1}^T y_{k-1} > 0$ holds and the condition (2.2) does not hold.

In this case, the objective function $f$ might not be close to a quadratic function on the line segment between $x_{k-1}$ and $x_k$; we therefore use the regularization model (2.3) with $d = -\alpha g_k$, which yields the approximation model (2.4). Taking account of the computational cost and storage, $B_k$ is generated by imposing the modified Broyden-Fletcher-Goldfarb-Shanno (BFGS) update formula [38] on a scalar matrix, as given in (2.5) and (2.6). To improve the numerical performance, we further restrict this scalar as in (2.7), where the associated constant lies in $(0, 0.1)$.
As for the choice of the regularization parameter $\sigma_k$ in (2.4), we determine it as follows. The regularization parameter is significant to the effectiveness of the regularization model. However, it is universally acknowledged that determining a proper regularization parameter $\sigma_k$ is challenging. Some approaches, including the interpolation condition and the trust-region strategy [8,17], have been developed to determine the regularization parameter $\sigma_k$. Here we use the interpolation condition, which leads to the expression (2.8). To improve the numerical performance and make the parameter positive, we take the truncated form (2.9) of (2.8), where $0 < \sigma_{\min} < \sigma_{\max}$. It is not difficult to obtain the following lemma.
By imposing $\phi_1'(\alpha) = 0$, we obtain an equation that has one positive root and one negative root. According to Definition 1.1, it is not difficult to verify that the positive root is the approximately optimal stepsize $\bar{\alpha}_k^{AOS1}$, where $B_k$ is given by (2.5) with (2.7).
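To make the root structure concrete, here is a worked sketch under the assumption that the model (2.4) takes the standard cubic-regularization form along $d = -\alpha g_k$; the exact scaling constants in (2.4) may differ.
\[
\phi_1(\alpha) = f_k - \alpha\, g_k^T g_k + \frac{1}{2}\alpha^2 g_k^T B_k g_k + \frac{\sigma_k}{3}\alpha^3 \|g_k\|^3,
\qquad
\phi_1'(\alpha) = -g_k^T g_k + \alpha\, g_k^T B_k g_k + \sigma_k \alpha^2 \|g_k\|^3 = 0 .
\]
Since $\sigma_k > 0$ and $g_k^T g_k > 0$, the product of the two roots of this quadratic is $-\,g_k^T g_k/(\sigma_k\|g_k\|^3) < 0$, so exactly one root is positive:
\[
\bar{\alpha} = \frac{-\,g_k^T B_k g_k + \sqrt{(g_k^T B_k g_k)^2 + 4\sigma_k \|g_k\|^3\, g_k^T g_k}}{2\sigma_k \|g_k\|^3}.
\]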

It is observed in numerical experiments that imposing a bound on this stepsize is very preferable. Therefore, if $s_{k-1}^T y_{k-1} > 0$ holds and the condition (2.2) does not hold, we take the corresponding truncated approximately optimal stepsize for the gradient method.
Case II. $s_{k-1}^T y_{k-1} > 0$ holds and the condition (2.2) holds.

In this case, the objective function $f$ might be close to a quadratic function on the line segment between $x_{k-1}$ and $x_k$; we therefore consider the quadratic approximation model (2.13), where $B_k$ is given by (2.5) with (2.7) for simplicity. It follows from Lemma 2.1 that $B_k$ is symmetric and positive definite. By imposing $\frac{\mathrm{d}\phi_2(\alpha)}{\mathrm{d}\alpha} = 0$, we can easily obtain the approximately optimal stepsize $\bar{\alpha}_k^{AOS2}$ (a worked sketch is given below).

When $s_{k-1}^T y_{k-1} \le 0$, the BB stepsizes and the approximately optimal stepsizes described above cannot be used, and it is thus difficult to determine a suitable stepsize for the gradient method. In some modified BB methods [6,14], the stepsize is simply set to $\alpha_k = 10^{30}$ when $s_{k-1}^T y_{k-1} \le 0$. As a result, a large computational cost may be incurred in seeking a suitable stepsize by the line search.
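Before turning to Cases III and IV, the following worked sketch records how the Case II stepsize arises, assuming (2.13) is the standard quadratic model of $f(x_k - \alpha g_k)$ with Hessian approximation $B_k$ (the exact constants in (2.13) may differ):
\[
\phi_2(\alpha) = f_k - \alpha\, g_k^T g_k + \frac{1}{2}\alpha^2 g_k^T B_k g_k,
\qquad
\phi_2'(\alpha) = 0 \;\Longrightarrow\;
\bar{\alpha}_k^{AOS2} = \frac{g_k^T g_k}{g_k^T B_k g_k},
\]
which is well defined and positive since $B_k$ is symmetric and positive definite by Lemma 2.1.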
Case III. $s_{k-1}^T y_{k-1} \le 0$ holds and the condition (2.16) holds.

If the condition (2.16) holds, where the constant in (2.16) lies in $(0,1)$ and is close to 1, then $g_k$ and $g_{k-1}$ tend to be collinear and approximately equal. In this case, we can use $g_{k-1}$ to approximate $g_k$, which is useful for constructing the approximation model, as described below. Supposing for the moment that $f$ is twice continuously differentiable, we consider a regularization model, which yields the corresponding approximation model. As for the choice of the regularization parameter in this regularization model, similarly to Case I, we use the interpolation condition to determine $\sigma_k$. To improve the numerical performance and make it positive, we take a truncated form, where $0 < \sigma_{\min} < \sigma_{\max}$ are the same as those in (2.9).
By imposing $\phi_3'(\alpha) = 0$, we obtain an equation that has one positive root and one negative root. By Definition 1.1, it is not difficult to verify that the positive root is the approximately optimal stepsize $\bar{\alpha}_k^{AOS3}$.

Case IV. $s_{k-1}^T y_{k-1} \le 0$ holds and the condition (2.16) does not hold.

It has also been shown that if $\alpha_k^{BB1}$ is reused in a cyclic fashion, then the convergence rate is accelerated [27]. It appears that $\alpha_{k-1}$ may be helpful for determining the current stepsize $\alpha_k$. Therefore, we take $\lambda_3 \alpha_{k-1}$ as the stepsize, where $\lambda_3 > 0$. In fact, this stepsize can also be regarded as an approximately optimal stepsize: substituting $B_k = \frac{1}{\lambda_3 \alpha_{k-1}} I$ into (2.13) yields an approximation model, and by imposing $\phi_4'(\alpha) = 0$, we obtain the approximately optimal stepsize (2.22).
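A quick check, under the same reading of (2.13) as the standard quadratic model used above and with $\lambda_3$ the positive constant introduced here, shows why this choice reproduces the scaled previous stepsize:
\[
\phi_4(\alpha) = f_k - \alpha\, g_k^T g_k + \frac{\alpha^2}{2\lambda_3 \alpha_{k-1}}\, g_k^T g_k,
\qquad
\phi_4'(\alpha) = 0 \;\Longrightarrow\;
\bar{\alpha}_k^{AOS4} = \lambda_3\, \alpha_{k-1},
\]
so the stepsize $\lambda_3 \alpha_{k-1}$ is exactly the minimizer of the quadratic model obtained with $B_k = \frac{1}{\lambda_3\alpha_{k-1}} I$.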

Gradient method with approximately optimal stepsizes
We describe the gradient method with approximately optimal stepsizes in this section. The famous nonmonotone line search (GLL line search) [19] was first incorporated into the BB method in [31]. Though the GLL line search works well in many cases, it has some drawbacks. For example, its numerical performance depends heavily on the choice of a pre-fixed memory constant $M$. To overcome these drawbacks, another nonmonotone Armijo line search (the Zhang-Hager line search) was proposed by Zhang and Hager [37]; it is defined by (3.1), where $0 < \delta < 1$ and the reference value $C_k$ is updated as in [37]. It is observed that the Zhang-Hager line search [37] is usually preferable for modified BB methods. To improve the numerical performance and obtain nice convergence, we take $\eta_k$ as in (3.3), where $0 < \eta < 1$ and $\mathrm{mod}(k, m)$ denotes the residue of $k$ modulo $m$. As a result, the Zhang-Hager line search with (3.3), together with the strategy (3.4) from [7] (whose safeguard parameter has lower bound 0.1), is used in our method. Here $\bar{\alpha}_k$ is the approximately optimal stepsize described in Section 2, and the alternative trial stepsize in (3.4) is obtained by quadratic interpolation using the information at $x_k$ and $x_k - \alpha g_k$.
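A minimal sketch of the Zhang-Hager nonmonotone scheme in its generic form [37] is given below; the parameter values and the plain backtracking loop (used here in place of the safeguarded interpolation strategy (3.4)) are illustrative choices of ours, not the settings used in GM AOS (Reg = 3).

```python
def zhang_hager_search(f, x, g, alpha0, C, Q, delta=1e-4, eta=0.85,
                       shrink=0.5, max_backtracks=30):
    """One Zhang-Hager style nonmonotone line search along d = -g.

    Accepts the first alpha with f(x - alpha*g) <= C + delta*alpha*g'd,
    where C is the nonmonotone reference value of [37]; returns the new
    iterate, the accepted stepsize, and the updated pair (C, Q).
    """
    d = -g
    gTd = float(g.dot(d))                # = -||g||^2 < 0
    alpha = alpha0
    x_new, f_new = x, f(x)               # fallbacks if every backtrack is rejected
    for _ in range(max_backtracks):
        x_new = x + alpha * d
        f_new = f(x_new)
        if f_new <= C + delta * alpha * gTd:   # nonmonotone Armijo test, cf. (3.1)
            break
        alpha *= shrink                  # plain backtracking stands in for (3.4)
    Q_new = eta * Q + 1.0                # Zhang-Hager reference value updates
    C_new = (eta * Q * C + f_new) / Q_new
    return x_new, alpha, C_new, Q_new
```

In the proposed method, the weight $\eta$ would itself vary with the iteration according to (3.3), whereas the sketch above keeps it fixed for simplicity.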
We now describe the gradient method with approximately optimal stepsizes (GM AOS (Reg = 3)) in detail.
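As a rough schematic of how the pieces of Sections 2 and 3 fit together, the following sketch dispatches on the sign of $s_{k-1}^T y_{k-1}$ and otherwise falls back to a scaled previous stepsize. It is our simplified illustration (scalar $B_k = \gamma I$, a caller-supplied $\sigma$, cubic regularization assumed as in ARC, Cases III and IV merged), not the paper's exact formulas (2.4)-(2.22).

```python
import numpy as np

def aos_stepsize(g, s, y, sigma, alpha_prev, lambda3=1.0):
    """Schematic approximately optimal stepsize selection (Cases I-IV, simplified)."""
    sty = float(s.dot(y))
    gg = float(g.dot(g))
    if sty > 0.0:
        gamma = sty / float(s.dot(s))   # scalar curvature estimate standing in for (2.5)-(2.7)
        gBg = gamma * gg
        if sigma > 0.0:                 # Case I: positive root of sigma*||g||^3*a^2 + gBg*a - gg = 0
            c = sigma * np.linalg.norm(g) ** 3
            return (-gBg + np.sqrt(gBg ** 2 + 4.0 * c * gg)) / (2.0 * c)
        return gg / gBg                 # Case II: quadratic model, a = g'g / (g'Bg)
    return lambda3 * alpha_prev         # Cases III/IV: non-positive curvature fallback
```

With $\sigma = 0$ the Case II formula $g_k^T g_k / (g_k^T B_k g_k)$ is recovered, and with $s_{k-1}^T y_{k-1} \le 0$ the rule reduces to $\lambda_3 \alpha_{k-1}$; the resulting trial stepsize would then be passed to the nonmonotone line search sketched above.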

Convergence analysis
In this section, the global convergence of GM AOS (Reg = 3) is analyzed under the following weak assumptions: (D1) $f$ is continuously differentiable on $\mathbb{R}^n$; (D2) $f$ is bounded below on $\mathbb{R}^n$; (D3) the gradient $g$ is uniformly continuous on $\mathbb{R}^n$.
We first give two lemmas, which are important for the convergence analysis.
where ⌊·⌋ denotes the floor function. By (4.1) and $0 < \eta < 1$, we obtain the desired bound, which completes the proof.
The above lemma implies that the sequence $\{C_k\}$ is convergent.
which, together with Lemma 4.1, implies the required limit. Combining this with (4.9) and applying the mean-value theorem, and then using (4.5), (4.7) and (4.8) together with the uniform continuity of the gradient $g$, the desired conclusion follows, which completes the proof.

Numerical experiments
We compare GM AOS (Reg = 3) with GM AOS (1.2) [24], the BB method, CGOPT (1.0) [11], CG DESCENT (5.0) [20] and the HDL method [22] (corresponding to Algorithm 3.1 in [22]) in this section. It is widely accepted that CGOPT [11] and CG DESCENT [20] are the two most famous conjugate gradient software packages. The BB method, GM AOS (1.2) [24] and GM AOS (Reg = 3) were implemented in C, and the C codes of CG DESCENT (5.0) and CGOPT (1.0) can be downloaded from Hager's homepage (http://users.clas.ufl.edu/hager/papers/Software) and Dai's homepage (http://lsec.cc.ac.cn/~dyh/software.html), respectively. The Matlab code of HDL can also be found on Dai's homepage. Two test sets were used: the 145 test problems in the CUTEr library [18] (CUTEr145 for short) and the 80 test problems mainly from [2] collected by Andrei (Andr80 for short). The two test sets can be found on Hager's homepage (http://users.clas.ufl.edu/hager/papers/CG/results6.0.txt) and Andrei's homepage (http://camo.ici.ro/neculai/AHYBRIDM), respectively. The dimensions of the test problems in CUTEr145 are the default ones, and the dimension of each test problem in Andr80 is set to 10,000. All numerical experiments were carried out under Ubuntu 10.04 LTS running in VMware Workstation 10.0 on Windows 10.
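The figures referenced below are performance profiles; for concreteness, here is a minimal sketch of the standard Dolan-More construction that such profiles follow. The profile definition is standard, but this code is our illustration, not the scripts used to produce the paper's figures.

```python
import numpy as np

def performance_profile(costs, taus):
    """costs: (n_problems, n_solvers) array of a metric (e.g. N_f), np.inf for failures.

    Returns rho of shape (len(taus), n_solvers): the fraction of problems each
    solver solves within a factor tau of the best solver on that problem.
    """
    best = costs.min(axis=1, keepdims=True)          # best metric per problem
    ratios = costs / best                            # performance ratios r_{p,s}
    rho = np.array([(ratios <= t).mean(axis=0) for t in taus])
    return rho
```

For example, the value at $\tau = 1$ is the fraction of problems on which a solver attains the best value of the metric, which is how statements such as "solves about 79% of the test problems with the fewest function evaluations" are read off the profiles.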
The numerical experiments are divided into the following four groups.
In the first group of numerical experiments, we compare the performance of GM AOS (Reg = 3) with that of GM AOS (1.2) [24] and the BB method on the test set CUTEr145. Figures 1-4 present the performance profiles on CUTEr145. As shown in Figures 1-4, GM AOS (Reg = 3) performs better than GM AOS (1.2) and is vastly superior to the BB method, and GM AOS (1.2) outperforms the BB method. The first group of numerical experiments indicates that the approximately optimal stepsizes described in Section 2 are quite efficient.
In the second group of numerical experiments, we compare the numerical performance of GM AOS (Reg = 3) with that of the HDL method [22] on the same 147 test problems from the CUTEst library, which can be found on Dai's homepage. We do not compare running times because the HDL method is implemented in Matlab while GM AOS (Reg = 3) is implemented in C. As shown in Figures 5-7, GM AOS (Reg = 3) is superior to the HDL method in terms of the number of iterations, the number of function evaluations and the number of gradient evaluations, even though the HDL method has been regarded as an important advance in gradient methods.
In the third group of numerical experiments, we compare the performance of GM AOS (Reg = 3) with that of CGOPT (1.0) on the two test sets CUTEr145 and Andr80. Figures 8-11 present the performance profiles on the test set CUTEr145. As shown in Figure 8, GM AOS (Reg = 3) performs much better than CGOPT (1.0) in terms of $N_f$, since GM AOS (Reg = 3) successfully solves about 79% of the test problems with the fewest function evaluations, while the corresponding percentage for CGOPT (1.0) is only about 38%. Figure 9 indicates that GM AOS (Reg = 3) is at a disadvantage compared with CGOPT (1.0) in terms of $N_g$, and Figure 10 shows that GM AOS (Reg = 3) slightly outperforms CGOPT (1.0) in terms of $N_f + 3N_g$ [21]. We can observe from Figure 11 that GM AOS (Reg = 3) is as fast as CGOPT (1.0). Figures 12-15 present the performance profiles on the test set Andr80. As shown in Figures 12-15, GM AOS (Reg = 3) shows a huge advantage over CGOPT (1.0) on Andr80. The third group of numerical experiments indicates that GM AOS (Reg = 3) is competitive with CGOPT (1.0) on CUTEr145 and has a significant advantage over CGOPT (1.0) on Andr80.

In the fourth group of numerical experiments, we compare the performance of GM AOS (Reg = 3) with that of CG DESCENT (5.0) on the two test sets CUTEr145 and Andr80. Figures 16-19 present the performance profiles on the test set CUTEr145. As shown in Figure 16, GM AOS (Reg = 3) performs better than CG DESCENT (5.0) in terms of $N_f$, since GM AOS (Reg = 3) successfully solves about 65% of the test problems with the fewest function evaluations, while the corresponding percentage for CG DESCENT (5.0) is only about 39%. Figure 17 shows that GM AOS (Reg = 3) is at a disadvantage compared with CG DESCENT (5.0) in terms of $N_g$, and Figure 18 indicates that GM AOS (Reg = 3) slightly outperforms CG DESCENT (5.0) in terms of $N_f + 3N_g$ [21]. We can observe from Figure 19 that GM AOS (Reg = 3) is as fast as CG DESCENT (5.0). Figures 20-23 present the performance profiles on the test set Andr80. As shown in Figures 20-22, GM AOS (Reg = 3) is at a slight disadvantage compared with CG DESCENT (5.0) in terms of $N_{iter}$, and has a significant performance boost over CG DESCENT (5.0) in terms of $N_f$ and $N_g$. We can also see that GM AOS (Reg = 3) is much faster than CG DESCENT (5.0). The fourth group of numerical experiments indicates that GM AOS (Reg = 3) is competitive with CG DESCENT (5.0) on the test set CUTEr145, and has a significant advantage over CG DESCENT (5.0) on the test set Andr80.
As for the reason that GM AOS (Reg = 3) shows such a significant improvement over CG DESCENT (5.0) and CGOPT (1.0) on Andr80 while being only competitive with them on CUTEr145, we think it lies mainly in the fact that most test problems in CUTEr145 are relatively difficult to solve compared with the test problems in Andr80. It seems that one can draw the following conclusion: gradient methods with approximately optimal stepsizes are sufficient for test problems that are not very ill-conditioned.
As for the reasons for the surprising numerical performance of GM AOS (Reg = 3), we think they lie in two aspects. (i) The approximately optimal stepsizes are generated from approximation models, including regularization models and quadratic models, at the current iterate $x_k$. Since these approximation models carry rich second or higher order information about the objective function at the current iterate $x_k$, the resulting approximately optimal stepsize properly incorporates this information and is thus very efficient. (ii) Compared with other stepsizes for gradient methods, the approximately optimal stepsize can readily satisfy the Zhang-Hager line search directly in most cases, which implies that it requires far fewer function evaluations and thus saves much computational cost. This can be observed in Figures 2, 8, 13, 16 and 21. Some statistical results are given in Table 1, where $N_{linsear}$ denotes the number of times that the stepsize is updated by (3.4) during all iterations when solving a test problem. $N_{linsear} = 0$ indicates that the initial stepsize (approximately optimal stepsize or BB stepsize) satisfies (3.1) directly at every iteration, so the Zhang-Hager line search is never invoked. As shown in Table 1, for GM AOS (Reg = 3) there are 68 (out of 145) problems for which the Zhang-Hager line search is never invoked during the solution process, while the corresponding number for the BB method is only 41; and there are 90 (out of 145) problems for which the Zhang-Hager line search is invoked at most three times, while the corresponding number for the BB method is only 50. It is observed from Table 1 that the approximately optimal stepsizes described in Section 2 satisfy (3.1) in most cases and thus the proposed method requires far fewer function evaluations.

Conclusion and discussion
In this paper, we present an efficient gradient method with approximately optimal stepsizes for unconstrained optimization. In the proposed method, some approximation models, including regularization models and quadratic models, are carefully exploited to derive approximately optimal stepsizes. The convergence of the proposed method is analyzed. Extensive numerical results indicate that the proposed method GM AOS (Reg = 3) is very promising. Due to the surprising numerical performance, gradient methods with approximately optimal stepsizes can become strong candidates for large-scale unconstrained optimization, and they also have potential in constrained optimization and in fields such as machine learning.
Though gradient methods with approximately optimal stepsizes are surprisingly efficient, some questions remain under investigation. (i) Like the BB method, it is very challenging to explain theoretically why gradient methods with approximately optimal stepsizes converge so fast. Does the gradient method with approximately optimal stepsize based on the quadratic approximation model (2.13) possess Q-linear convergence for convex quadratic minimization? If so, what conditions should be imposed on the distance $\|B_k - A\|$? Here $A$ is the Hessian matrix of the strictly convex quadratic function. (ii) Can this type of gradient method with approximately optimal stepsize achieve local R-linear convergence, or a better convergence rate, when applied to general unconstrained optimization?