Why was your optimization terminating in the first place? The reason for termination can be found in the parameter opt_stop.
Assuming the termination occurs due to the relative gradient becoming small, the right parameter to tune is eps_grad. Really, eps_dx is there as a safeguard to prevent the iterations from becoming too small. This can happen for a variety of reasons, but what I see most often is that numerical error like underflow on poorly scaled problems stops us from finding a better point. That causes the optimization algorithms to spin and, without this safeguard, never terminate.
Now, there’s still the question of whether asking for more reduction in the gradient is going to improve anything. And, unfortunately, the answer to this is tricky. On a purely quadratic problem with m variables, both nonlinear-CG and SR1 should terminate in at most m iterations with the exact solution. For nonlinear-CG, on a quadratic problem, the algorithm reduces to applying CG to a linear system, and after m iterations the Krylov subspace is the entire space, so we have the exact solution. For SR1, every iteration of the algorithm essentially finds one eigenvalue of the Hessian and forces it to 1. Once all the eigenvalues are 1, we’re solving a linear system with the identity and we find the solution immediately.
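If you want to see the quadratic case in action, here's a minimal sketch in plain NumPy (not Optizelle) of linear CG on an m-variable quadratic; up to rounding error, it lands on the exact solution in m iterations:

```python
# Minimal sketch in plain NumPy (not Optizelle): linear CG on an m-variable
# strongly convex quadratic reaches the exact solution in at most m
# iterations, up to rounding error.
import numpy as np

def cg(A, b, iters):
    """Run plain CG on the quadratic 0.5*x'Ax - b'x, starting from zero."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual = negative gradient
    p = r.copy()             # search direction
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

m = 5
rng = np.random.default_rng(0)
M = rng.standard_normal((m, m))
A = M @ M.T + m * np.eye(m)       # symmetric positive definite "Hessian"
b = rng.standard_normal(m)

x = cg(A, b, m)                   # exactly m iterations
print(np.linalg.norm(A @ x - b))  # residual is down at machine-precision level
```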
In practice, this probably won’t happen. In CG, the vectors in the Krylov subspace are supposed to be orthogonal, and the equations that we use to derive linear-CG and nonlinear-CG depend on this. Unfortunately, we quickly lose orthogonality of the vectors unless we explicitly reorthogonalize. For linear problems, that gives the conjugate direction algorithm, but I don’t know of a good analogue for nonlinear-CG. In any case, the bottom line is that we’re going to take more than m iterations even on a quadratic problem when m is large.
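Here's the same CG recursion run on a badly scaled problem, checking how far the residuals, which should be mutually orthogonal in exact arithmetic, drift from orthogonality in floating point:

```python
# Same CG recursion, but on a badly scaled problem.  The residuals should be
# mutually orthogonal in exact arithmetic; in floating point the normalized
# residual Gram matrix drifts far from the identity.
import numpy as np

def cg_residuals(A, b, iters):
    """Run CG and record the residual at every iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    residuals = [r.copy()]
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        residuals.append(r.copy())
    return np.array(residuals)

m = 200
A = np.diag(np.logspace(0, 8, m))   # eigenvalues spread over 8 orders of magnitude
b = np.ones(m)

R = cg_residuals(A, b, 60)
R = R / np.linalg.norm(R, axis=1, keepdims=True)
# Largest off-diagonal entry of the Gram matrix; ~0 in exact arithmetic.
print(np.max(np.abs(R @ R.T - np.eye(len(R)))))
```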
For SR1, there’s a similar situation. If we build up a big enough Hessian approximation along with an approximation of its inverse and then apply one after the other to a vector, it turns out we don’t get back exactly the same vector. For tens of iterations we will, but it eventually breaks down. There are some ways to try and stabilize this, but they all have a cost.
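If you want to poke at that yourself, here's a rough sketch, again in plain NumPy rather than anything Optizelle-specific, of the forward and inverse SR1 updates on a quadratic. In exact arithmetic, once both approximations are built up, applying the inverse approximation after the forward one gives back the original vector; the print shows how far that identity drifts in floating point:

```python
# Rough sketch in plain NumPy of the SR1 updates on a quadratic: B
# approximates the Hessian A and H approximates its inverse.  In exact
# arithmetic, once both are built up, H @ (B @ v) == v; the print shows how
# far that identity drifts in floating point.
import numpy as np

def sr1_pair(A, steps, tol=1e-8):
    """Apply forward and inverse SR1 updates for each step s with y = A s."""
    n = A.shape[0]
    B = np.eye(n)        # Hessian approximation
    H = np.eye(n)        # inverse Hessian approximation
    for s in steps:
        y = A @ s        # exact curvature information on a quadratic
        u = y - B @ s
        if abs(u @ s) > tol * np.linalg.norm(u) * np.linalg.norm(s):
            B = B + np.outer(u, u) / (u @ s)     # forward SR1 update
        w = s - H @ y
        if abs(w @ y) > tol * np.linalg.norm(w) * np.linalg.norm(y):
            H = H + np.outer(w, w) / (w @ y)     # inverse SR1 update
    return B, H

n = 100
rng = np.random.default_rng(1)
A = np.diag(np.logspace(0, 6, n))    # poorly conditioned Hessian
steps = [rng.standard_normal(n) for _ in range(n)]

B, H = sr1_pair(A, steps)
v = rng.standard_normal(n)
print(np.linalg.norm(H @ (B @ v) - v) / np.linalg.norm(v))
```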
Alright, so what do we do? Really, the performance of these algorithms is tied to the eigenvalue distribution of the Hessian near optimality. If the eigenvalues are tightly clustered, these algorithms work well. If they’re not, these algorithms will take forever. Basically, we want the number of eigenvalue clusters to be smaller than the number of iterations the above algorithms can take before they run into trouble. If you want a number pulled out of thin air, fewer than 50 clusters would be good.
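To make the clustering point concrete, here's a quick NumPy illustration: CG converges in roughly as many iterations as there are eigenvalue clusters, more or less independently of the problem size:

```python
# Quick illustration in plain NumPy of the clustering effect: CG converges in
# roughly as many iterations as there are eigenvalue clusters, more or less
# independently of the problem size.
import numpy as np

def cg_iters(A, b, tol=1e-10, max_iters=10000):
    """Number of CG iterations needed to reach a relative residual of tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for k in range(max_iters):
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return k
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return max_iters

m = 1000
rng = np.random.default_rng(2)
b = rng.standard_normal(m)

# Extreme case of clustering: only 5 distinct eigenvalues, each repeated.
clustered = np.repeat([1.0, 10.0, 100.0, 1e3, 1e4], m // 5)
# Same eigenvalue range, but spread out uniformly.
spread = np.linspace(1.0, 1e4, m)

print(cg_iters(np.diag(clustered), b))   # a handful of iterations
print(cg_iters(np.diag(spread), b))      # many more iterations
```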
Even if the eigenvalues of the Hessian are not well clustered and we run into numerical problems, algorithms like SR1 and nonlinear-CG will work, eventually. That’s the nice thing about optimization algorithms: they’ll keep grinding on the problem, looking for that fixed point. However, it may take forever.
Anyway, the way to speed this up is to add a preconditioner to the Hessian that better clusters the eigenvalues. Of course, the best preconditioner would be the inverse, which would give us Newton’s method. Short of that, the goal is to cluster the eigenvalues as best as possible. In Optizelle, we use PH in the bundle of functions to hold the preconditioner to the Hessian and then set the parameter PH_type to UserDefined.
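I can't speak to your exact setup, but just to illustrate what the preconditioner buys you, here's a generic preconditioned-CG sketch in plain NumPy (this is not the Optizelle PH interface, just the idea behind it). The preconditioner approximates the inverse Hessian, so CG effectively sees the preconditioned spectrum, which is hopefully much better clustered:

```python
# Generic preconditioned CG in plain NumPy (not the Optizelle PH interface,
# just the idea behind it).  The preconditioner M approximates the inverse
# Hessian, so CG effectively sees the spectrum of M A, which is hopefully
# much better clustered than that of A alone.
import numpy as np

def pcg_iters(A, b, M_apply, tol=1e-10, max_iters=10000):
    """Preconditioned CG iterations needed to reach a relative residual of tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_apply(r)
    p = z.copy()
    for k in range(max_iters):
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return k
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        z_new = M_apply(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return max_iters

m = 500
rng = np.random.default_rng(3)
d = np.logspace(0, 4, m)            # badly spread eigenvalues
A = np.diag(d)
b = rng.standard_normal(m)

identity = lambda r: r              # no preconditioning
jacobi = lambda r: r / d            # diagonal (Jacobi) preconditioner

print(pcg_iters(A, b, identity))    # many iterations
print(pcg_iters(A, b, jacobi))      # essentially immediate
```

Here the Hessian is diagonal, so the Jacobi preconditioner is its exact inverse and we effectively recover Newton's method; in practice you'd only have an approximation, and the better it clusters the spectrum, the fewer iterations you pay.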
As another idea, if the problem looks pretty quadratic, something like the Barzilai-Borwein two-point Hessian approximation may work well. In order to set this up, set dir to SteepestDescent and kind to either TwoPointA or TwoPointB. Note, this algorithm is kind of weird because it will be nonmonotonic. Basically, the objective will go up and down. Nevertheless, on some problems, it works really well.
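For reference, the two classic Barzilai-Borwein step sizes are (s's)/(s'y) and (s'y)/(y'y), where s is the change in the iterate and y the change in the gradient; I'm assuming those are what TwoPointA and TwoPointB refer to, so double-check the manual for which is which. Here's a bare-bones sketch in plain NumPy (not Optizelle) that also shows the nonmonotone behavior:

```python
# Bare-bones Barzilai-Borwein steepest descent in plain NumPy (not Optizelle),
# run on a small quadratic.  The two classic two-point step sizes come from
# the secant pair s = x_{k+1} - x_k and y = grad_{k+1} - grad_k.
import numpy as np

def bb_descent(grad, x0, iters=300, variant="A"):
    """Gradient descent with a Barzilai-Borwein step size; returns gradient norms."""
    x = x0.copy()
    g = grad(x)
    alpha = 1e-4                     # arbitrary small warm-up step before we have (s, y)
    grad_norms = []
    for _ in range(iters):
        x_new = x - alpha * g
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        # "A" uses (s's)/(s'y), "B" uses (s'y)/(y'y); check the manual for
        # which of TwoPointA/TwoPointB these correspond to.
        alpha = (s @ s) / (s @ y) if variant == "A" else (s @ y) / (y @ y)
        x, g = x_new, g_new
        grad_norms.append(np.linalg.norm(g))
        if grad_norms[-1] < 1e-12:   # stop once the gradient is tiny
            break
    return x, grad_norms

# Quadratic test problem: f(x) = 0.5 * x'Ax with eigenvalues from 1 to 100
A = np.diag(np.linspace(1.0, 100.0, 50))
grad = lambda x: A @ x

x, norms = bb_descent(grad, np.ones(50))
print(norms[:30])    # the history is typically not monotone
print(norms[-1])     # but it does settle down eventually
```

If you plot the gradient norms, you'll typically see them bounce up and down for a while before they head to zero, which is the nonmonotone behavior I mentioned.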
Anyway, let me know if that works or if you need more pointers!
Joe