Why was your optimization terminating in the first place? The reason for termination can be found in the parameter opt_stop.
Assuming the termination occurs due to the relative gradient becoming small, the right parameter to tune is eps_grad. Really, eps_dx is there as a safeguard to prevent the iterations from becoming too small. This can happen for a variety of reasons, but what I see most often is that numerical error like underflow on poorly scaled problems stops us from finding a better point. That causes the optimization algorithms to spin and, without this safeguard, never terminate.
Now, there’s still the question of whether asking for more reduction in the gradient is going to improve anything. And, unfortunately, the answer to this is tricky. On a purely quadratic problem with m variables, both nonlinear-CG and SR1 should terminate in at most m iterations with the exact solution. For nonlinear-CG, on a quadratic problem, the algorithm reduces to applying CG to a linear system, and after m iterations the Krylov subspace is the entire space, so we have the exact solution. For SR1, every iteration of the algorithm essentially finds one eigenvalue of the Hessian and forces it to 1. Once all the eigenvalues are 1, we’re solving a linear system with the identity and we find the solution immediately.
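If you want to see the quadratic case in action, here's a minimal sketch in plain NumPy (not Optizelle) of linear CG on an m-variable quadratic; up to rounding error, it lands on the exact solution in m iterations:

```python
# Minimal sketch in plain NumPy (not Optizelle): linear CG on an m-variable
# strongly convex quadratic reaches the exact solution in at most m
# iterations, up to rounding error.
import numpy as np

def cg(A, b, iters):
    """Run plain CG on the quadratic 0.5*x'Ax - b'x, starting from zero."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual = negative gradient
    p = r.copy()             # search direction
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

m = 5
rng = np.random.default_rng(0)
M = rng.standard_normal((m, m))
A = M @ M.T + m * np.eye(m)       # symmetric positive definite "Hessian"
b = rng.standard_normal(m)

x = cg(A, b, m)                   # exactly m iterations
print(np.linalg.norm(A @ x - b))  # residual is down at machine-precision level
```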
In practice, this probably won’t happen. In CG, the vectors in the Krylov subspace are supposed to be orthogonal, and the equations that we use to derive linear-CG and nonlinear-CG depend on this. Unfortunately, we quickly lose orthogonality of the vectors unless we explicitly reorthogonalize. For linear problems, that gives the conjugate direction algorithm, but I don’t know of a good analogue for nonlinear-CG. In any case, the bottom line is that we’re going to take more than m iterations even on a quadratic problem when m is large.
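Here's the same CG recursion run on a badly scaled problem, checking how far the residuals, which should be mutually orthogonal in exact arithmetic, drift from orthogonality in floating point:

```python
# Same CG recursion, but on a badly scaled problem.  The residuals should be
# mutually orthogonal in exact arithmetic; in floating point the normalized
# residual Gram matrix drifts far from the identity.
import numpy as np

def cg_residuals(A, b, iters):
    """Run CG and record the residual at every iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    residuals = [r.copy()]
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        residuals.append(r.copy())
    return np.array(residuals)

m = 200
A = np.diag(np.logspace(0, 8, m))   # eigenvalues spread over 8 orders of magnitude
b = np.ones(m)

R = cg_residuals(A, b, 60)
R = R / np.linalg.norm(R, axis=1, keepdims=True)
# Largest off-diagonal entry of the Gram matrix; ~0 in exact arithmetic.
print(np.max(np.abs(R @ R.T - np.eye(len(R)))))
```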
For SR1, there’s a similar situation. If we build up a big enough Hessian approximation along with an approximation of its inverse and then apply one after the other to a vector, it turns out we don’t get back exactly the same vector. For tens of iterations we will, but it eventually breaks down. There are some ways to try and stabilize this, but they all have a cost.
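If you want to poke at that yourself, here's a rough sketch, again in plain NumPy rather than anything Optizelle-specific, of the forward and inverse SR1 updates on a quadratic. In exact arithmetic, once both approximations are built up, applying the inverse approximation after the forward one gives back the original vector; the print shows how far that identity drifts in floating point:

```python
# Rough sketch in plain NumPy of the SR1 updates on a quadratic: B
# approximates the Hessian A and H approximates its inverse.  In exact
# arithmetic, once both are built up, H @ (B @ v) == v; the print shows how
# far that identity drifts in floating point.
import numpy as np

def sr1_pair(A, steps, tol=1e-8):
    """Apply forward and inverse SR1 updates for each step s with y = A s."""
    n = A.shape[0]
    B = np.eye(n)        # Hessian approximation
    H = np.eye(n)        # inverse Hessian approximation
    for s in steps:
        y = A @ s        # exact curvature information on a quadratic
        u = y - B @ s
        if abs(u @ s) > tol * np.linalg.norm(u) * np.linalg.norm(s):
            B = B + np.outer(u, u) / (u @ s)     # forward SR1 update
        w = s - H @ y
        if abs(w @ y) > tol * np.linalg.norm(w) * np.linalg.norm(y):
            H = H + np.outer(w, w) / (w @ y)     # inverse SR1 update
    return B, H

n = 100
rng = np.random.default_rng(1)
A = np.diag(np.logspace(0, 6, n))    # poorly conditioned Hessian
steps = [rng.standard_normal(n) for _ in range(n)]

B, H = sr1_pair(A, steps)
v = rng.standard_normal(n)
print(np.linalg.norm(H @ (B @ v) - v) / np.linalg.norm(v))
```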
Alright, so what do we do? Really, the performance of these algorithms is tied to the eigenvalue distribution of the Hessian near optimality. If the eigenvalues are tightly clustered, these algorithms work well. If they’re not, these algorithms will take forever. Basically, we want the number of eigenvalue clusters to be smaller than the number of iterations the above algorithms can take before they run into trouble. If you want a number pulled out of thin air, fewer than 50 clusters would be good.
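To make the clustering point concrete, here's a quick NumPy illustration: CG converges in roughly as many iterations as there are eigenvalue clusters, more or less independently of the problem size:

```python
# Quick illustration in plain NumPy of the clustering effect: CG converges in
# roughly as many iterations as there are eigenvalue clusters, more or less
# independently of the problem size.
import numpy as np

def cg_iters(A, b, tol=1e-10, max_iters=10000):
    """Number of CG iterations needed to reach a relative residual of tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for k in range(max_iters):
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return k
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return max_iters

m = 1000
rng = np.random.default_rng(2)
b = rng.standard_normal(m)

# Extreme case of clustering: only 5 distinct eigenvalues, each repeated.
clustered = np.repeat([1.0, 10.0, 100.0, 1e3, 1e4], m // 5)
# Same eigenvalue range, but spread out uniformly.
spread = np.linspace(1.0, 1e4, m)

print(cg_iters(np.diag(clustered), b))   # a handful of iterations
print(cg_iters(np.diag(spread), b))      # many more iterations
```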
Even if the eigenvalues of the Hessian are not well clustered and we run into numerical problems, algorithms like SR1 and nonlinear-CG will work, eventually. That’s the nice thing about optimization algorithms: they’ll keep grinding on the problem, looking for that fixed point. However, it may take forever.
Anyway, the way to speed this up is to add a preconditioner to the Hessian that better clusters the eigenvalues. Of course, the best preconditioner would be the inverse, which would give us Newton’s method. Short of that, the goal is to cluster the eigenvalues as best as possible. In Optizelle, we use PH in the bundle of functions to hold the preconditioner to the Hessian and then set the parameter PH_type to UserDefined.
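I can't speak to your exact setup, but just to illustrate what the preconditioner buys you, here's a generic preconditioned-CG sketch in plain NumPy (this is not the Optizelle PH interface, just the idea behind it). The preconditioner approximates the inverse Hessian, so CG effectively sees the preconditioned spectrum, which is hopefully much better clustered:

```python
# Generic preconditioned CG in plain NumPy (not the Optizelle PH interface,
# just the idea behind it).  The preconditioner M approximates the inverse
# Hessian, so CG effectively sees the spectrum of M A, which is hopefully
# much better clustered than that of A alone.
import numpy as np

def pcg_iters(A, b, M_apply, tol=1e-10, max_iters=10000):
    """Preconditioned CG iterations needed to reach a relative residual of tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_apply(r)
    p = z.copy()
    for k in range(max_iters):
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return k
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        z_new = M_apply(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return max_iters

m = 500
rng = np.random.default_rng(3)
d = np.logspace(0, 4, m)            # badly spread eigenvalues
A = np.diag(d)
b = rng.standard_normal(m)

identity = lambda r: r              # no preconditioning
jacobi = lambda r: r / d            # diagonal (Jacobi) preconditioner

print(pcg_iters(A, b, identity))    # many iterations
print(pcg_iters(A, b, jacobi))      # essentially immediate
```

Here the Hessian is diagonal, so the Jacobi preconditioner is its exact inverse and we effectively recover Newton's method; in practice you'd only have an approximation, and the better it clusters the spectrum, the fewer iterations you pay.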
As another idea, if the problem looks pretty quadratic, something like the Barzilai-Borwein two-point Hessian approximation may work well. In order to set this up, set dir to SteepestDescent and kind to either TwoPointA or TwoPointB. Note, this algorithm is kind of weird because it will be nonmonotonic. Basically, the objective will go up and down. Nevertheless, on some problems, it works really well.
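For reference, the two classic Barzilai-Borwein step sizes are (s's)/(s'y) and (s'y)/(y'y), where s is the change in the iterate and y the change in the gradient; I'm assuming those are what TwoPointA and TwoPointB refer to, so double-check the manual for which is which. Here's a bare-bones sketch in plain NumPy (not Optizelle) that also shows the nonmonotone behavior:

```python
# Bare-bones Barzilai-Borwein steepest descent in plain NumPy (not Optizelle),
# run on a small quadratic.  The two classic two-point step sizes come from
# the secant pair s = x_{k+1} - x_k and y = grad_{k+1} - grad_k.
import numpy as np

def bb_descent(grad, x0, iters=300, variant="A"):
    """Gradient descent with a Barzilai-Borwein step size; returns gradient norms."""
    x = x0.copy()
    g = grad(x)
    alpha = 1e-4                     # arbitrary small warm-up step before we have (s, y)
    grad_norms = []
    for _ in range(iters):
        x_new = x - alpha * g
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        # "A" uses (s's)/(s'y), "B" uses (s'y)/(y'y); check the manual for
        # which of TwoPointA/TwoPointB these correspond to.
        alpha = (s @ s) / (s @ y) if variant == "A" else (s @ y) / (y @ y)
        x, g = x_new, g_new
        grad_norms.append(np.linalg.norm(g))
        if grad_norms[-1] < 1e-12:   # stop once the gradient is tiny
            break
    return x, grad_norms

# Quadratic test problem: f(x) = 0.5 * x'Ax with eigenvalues from 1 to 100
A = np.diag(np.linspace(1.0, 100.0, 50))
grad = lambda x: A @ x

x, norms = bb_descent(grad, np.ones(50))
print(norms[:30])    # the history is typically not monotone
print(norms[-1])     # but it does settle down eventually
```

If you plot the gradient norms, you'll typically see them bounce up and down for a while before they head to zero, which is the nonmonotone behavior I mentioned.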
Anyway, let me know if that works or if you need more pointers!
Joe