Computing the gradient alongside the function


#1

Hi,

I have to minimise a function where the gradient can be computed almost for free along the function itself, but I don’t know how to take advantage of it. If a call to the gradient is always preceded by a function call in the same point, I can just store it in the class; is this behaviour guaranteed?

Otherwise, I could implement a memoize pattern and keep the last n gradients, but that would obviously be more work.

Also, I would like to compute the gradient in place. Is this a problem? Does any of the algorithms require the knowlege of the gradient in more than one point at the same time?

I am doing convex optimisations in tens of thousands of dimensions, so I am trying to keep the allocations to a minimum.


#2

Hi Davidmh,

This is a pretty common use case in PDE constrained optimization, so I’ve setup the code accordingly to handle this case. Specifically, at every iteration, Optizelle is guaranteed to calculate the gradient first, then the objective. In addition, the gradient calculation is also guaranteed to be called exactly once per optimization iteration. Internally, we cache the gradient, so unless there’s a bug, we really only call the gradient function once per iteration.

In order to take advantage of simultaneous function/gradient solves, you need a small amount of memoization. Basically, everytime you do a gradient solve, cache the objective evaluation and the point in which you evaluated. Then, in the code for an objective solve, check if we’ve evaluated at this point before. If so, just return the calculated point.

Now, do not cache all objective calls. Only cache the objective value during a gradient solve. Basically, if a line-search or trust-region method wants to cut back, we want to have the objective value at the point where we calcualted the gradient readily available.

Basically, the code should look like this

f.eval(x)
    if x=x_cached then
        return f_cached
    else
        return eval_f(x)

f.grad(x)
    x_cached=x
    (f_cached,grad) = eval_f_and_grad(x)
    return grad

Anyway, if you notice the gradient being called more than once per iteration, let me know and I’ll investigate.


#3

Ok, saving only the last evaluation is simple enough, thanks.

If I understand correctly, the else branch f.eval should never be reached, right? I will add some logging machinery there, to make sure.

Once I have a full grasp on it I will add it to the documentation. I will open an issue to keep track of it.


#4

Sorry I missed your followup question, but I did see the issue. The else branch in f.eval may be reached. Basically, imagine a situation where we’re doing a line search. In that instance, we’re going to be evaluating the objective at several different points during a single optimization iteration. Really, what the caching does is save us one objective evaluation per iteration and its the objective evaluation at the point where the gradient is evaluated.

As a longer comment, this is actually a feature that many of my clients requested over time. In truth, it doesn’t really save that much computation over the course of an optimization run. However, since we often run the first iteration of the optimization over and over again for debugging purposes, everyone seems to notice it. Certainly, it never hurts and it will make things faster.


#5

I realised that a while after. Anyway, for my specific aplication, after memoizing the total running time goes down by a few percents, so definitely worth it.

Implementing the memoization as you suggested, using plain Numpy and on large vectors, is extremely fast, so I guess there is not much to gain performance-wise from implementing this under the hood.

Perhaps, the most flexible approach would be to make the f.grad optionally return the function, in which case f.eval is not called. But as I said, the difference is probably going to be under the percentage.


#6

That’s not a bad idea since it seems like no one likes to implement the memoization on their own. Really, I did it this way because it meant that my mathematical valdiation of the code was easier; the semantics of a function with optional arguments is a bit more complicated. Another compromise may be to just provide another interface that derives off the ScalarValuedFunction interface and then does all the memoization for the user. There, they’d implement the function with the optional argument as you suggested. I need to figure out what this means in MATLAB/Octave, since we don’t use inheritance there, but I’ll keep this on my radar.