Your assessment is correct and you're running up against what turned out to be a poor design decision for a historical reason. Originally, the code was designed to work with what are called reduced space formulations for PDE constrained optimization problems. Despite the name PDE constrained, these formulations are unconstrained since the PDE solve is conducted implicitly inside of the objective function. There are a fair number of oddities related to these formulations, but the big one is that the number of variables is very large, often somewhere on the order of hundreds of millions. As a result, we were extremely sensitive to exactly what memory was allocated where. In fact, the really early versions had essentially no memory allocation other than what was originally allocated. I think at some point, I also coded some kind of vector pool so as not to reallocate new vectors.
In retrospect, this was all very silly. Memory is not that valuable at the vector level when compared to the truly horrific computations being done to solve the PDE. As such, allocators were put into the code, which is how we have init. Now, part of the problem here was that the code I was hooked into at the time had soft copies of all of the vectors implemented internally. As such, it was really important to have a hard init function to ensure we didn't accidentally get a soft copy of some vector.
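To make the distinction concrete, here's a minimal sketch of what a hard init buys you, assuming a toy dense vector space over std::vector<double> (the names DenseSpace, init, and copy here are illustrative, not the Optizelle implementation): init always performs a fresh allocation, so a cached vector can never end up as a view of the iterate it was built from.

#include <vector>

// Hypothetical illustration: a "hard" init always allocates new storage, so a
// cached vector can never alias (soft-copy) the iterate it was built from.
struct DenseSpace {
    typedef std::vector <double> Vector;

    // Hard init: fresh allocation, same size as x, contents unspecified
    static Vector init(Vector const & x) {
        return Vector(x.size());
    }

    // Copy: element-wise assignment into already allocated storage
    static void copy(Vector const & x, Vector & y) {
        for(std::size_t i = 0; i < x.size(); ++i)
            y[i] = x[i];
    }
};

int main() {
    DenseSpace::Vector x {1.0, 2.0};
    auto x_cache = DenseSpace::init(x);   // new memory, not a view of x
    DenseSpace::copy(x, x_cache);         // x_cache now holds the values of x
    x[0] = 5.0;                           // does not touch x_cache
}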
Anyway, due to some exotic processing of the solution during the course of the optimization, I added this concept of a StateManipulator that basically allowed an arbitrary function to be applied to the state. This can break an optimization solve pretty badly when done poorly, but it allowed a lot of odd tricks, which we thought necessary at the time. When the code was later broadened to allow for inequalities and equalities, these state manipulators were hijacked to insert the necessary code. On one hand, that allowed rapid progress in the development of the code. On the other, it's a terrible way to design an optimization solver because it's not an easily debuggable design.
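For readers unfamiliar with the idea, here is a rough sketch of the pattern (the names below are hypothetical and this is not the actual Optizelle StateManipulator interface): the solver invokes a user-supplied hook at fixed points in the iteration and hands it the mutable state, so the hook can do arbitrary post-processing.

#include <algorithm>
#include <vector>

// Illustrative sketch of the "state manipulator" pattern; names are
// hypothetical and this is not the Optizelle StateManipulator interface.
struct SolverState {
    std::vector <double> x;   // current iterate
    int iter = 0;
};

struct StateHook {
    // Called by the solver at designated points; free to modify the state
    virtual void eval(SolverState & state) const {}
    virtual ~StateHook() {}
};

// Example hook: clamp the iterate back into a box after every step.  Done
// carelessly, this kind of manipulation can easily break convergence.
struct ClampToBox : public StateHook {
    void eval(SolverState & state) const override {
        for(auto & xi : state.x)
            xi = std::min(std::max(xi, 0.0), 1.0);
    }
};

// Toy driver showing where the hook gets invoked
void solve(SolverState & state, StateHook const & hook) {
    for(state.iter = 0; state.iter < 10; ++state.iter) {
        // ... take an optimization step on state.x ...
        hook.eval(state);   // arbitrary post-processing of the state
    }
}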
Now, this affects your current issue because, clearly, the caching is poor. Some of those issues are related to a less-than-optimally implemented GMRES algorithm that applies the linear operator to 0. Your code can be made robust to this with the following:
#include <optizelle/optizelle.h>
#include <optizelle/vspaces.h>
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <algorithm>
#include <cmath>
#include <memory>
#include <vector>
// Prints a vector's entries; returns "" so the call can be chained in a stream
template<typename V>
const char* print(const char* symbol, V const & vec)
{
std::cout << symbol << " = [" << std::setprecision(16) << vec[0];
for(std::size_t i = 1; i < vec.size(); ++i) std::cout << ", " << std::setprecision(16) << vec[i];
std::cout << "]";
return "";
}
#define LOG_FUNC_CALLS
#define PRN_X "x = [" << x[0] << "," << x[1] << "]"
#define PRN_dX "dx = [" << dx[0] << "," << dx[1] << "]"
#define PRN_dY "dy = [" << dy[0] << "]"
template <typename Real>
Real sq(Real const & x){
return x*x;
}
struct MyObj
: public Optizelle::ScalarValuedFunction <double,Optizelle::Rm>
{
typedef Optizelle::Rm <double> X;
// Location of the last computed iterate
mutable std::unique_ptr <X::Vector> x_cache;
// Cache of the last computed gradient
mutable std::unique_ptr <X::Vector> grad_cache;
// Evaluation
double eval(X::Vector const & x) const {
#ifdef LOG_FUNC_CALLS
std::cout << "MyObj::eval(" << print("x", x) << ")" << std::endl;
#endif
double x0_2 = pow(x[0],2);
double x0_3 = pow(x[0],3);
double x1_2 = pow(x[1],2);
double x1_3 = pow(x[1],3);
return (10*(5824*pow(x[0],4) - 7696*x0_3 + 11648*x0_2*x1_2 - 7696*x0_2*x[1] + 2822*x0_2 - 7696*x[0]*x1_2 - 6313*x[0]*x[1] + 253*x[0] + 5824*pow(x[1],4) - 7696*x1_3 + 2822*x1_2 + 253*x[1] + 7651))/9;
}
// Gradient
void grad(
X::Vector const & x,
X::Vector & grad
) const {
// Use a cached value
if(x_cache && *x_cache == x) {
grad = *grad_cache;
// Compute a new value and cache it
} else {
// Allocate the initial memory for the cached value and gradient
if(!x_cache) {
x_cache = std::make_unique <X::Vector> (X::init(x));
grad_cache = std::make_unique <X::Vector> (X::init(x));
}
// Compute the gradient
#ifdef LOG_FUNC_CALLS
std::cout << "MyObj::grad(" << print("x", x) << ")" << std::endl;
#endif
double x0_2 = pow(x[0],2);
double x0_3 = pow(x[0],3);
double x1_2 = pow(x[1],2);
double x1_3 = pow(x[1],3);
grad[0] = (10*(23296*x0_3 - 23088*x0_2 + 23296*x[0]*x1_2 - 15392*x[0]*x[1] + 5644*x[0] - 7696*x1_2 - 6313*x[1] + 253))/9;
grad[1] = (10*(23296*x0_2*x[1] - 7696*x0_2 - 15392*x[0]*x[1] - 6313*x[0] + 23296*x1_3 - 23088*x1_2 + 5644*x[1] + 253))/9;
// Cache the iterate and gradient
X::copy(x,*x_cache);
X::copy(grad,*grad_cache);
}
}
// Hessian-vector product
void hessvec(
X::Vector const & x,
X::Vector const & dx,
X::Vector & H_dx
) const {
#ifdef LOG_FUNC_CALLS
std::cout << "MyObj::hessvec(" << print("x", x) << " " << print("dx", dx) << ")" << std::endl;
#endif
double g11 = 153920*x[0];
double g9 = 232960*x[0]*x[0] - g11;
double g12 = 153920*x[1];
double g10 = 232960*x[1]*x[1] - g12;
double d2dx2 = (g10 + 3*g9 + 56440)/9;
double d2dy2 = (3*g10 + g9 + 56440)/9;
double d2dxy = (465920*x[1]*x[0] - g12 - g11 - 63130)/9;
H_dx[0] = d2dx2*dx[0] + d2dxy*dx[1];
H_dx[1] = d2dxy*dx[0] + d2dy2*dx[1];
}
};
// c(x,y)=(x - 1)**2 + y**2 - 0.81
struct MyEq
:public Optizelle::VectorValuedFunction<double,Optizelle::Rm,Optizelle::Rm>
{
typedef Optizelle::Rm <double> X;
typedef Optizelle::Rm <double> Y;
// Location of the last computed iterate
mutable std::unique_ptr <X::Vector> x_cache;
// Cache of the last computed derivative
mutable std::unique_ptr <X::Vector> gp_cache;
// y=g(x)
void eval(
X::Vector const & x,
Y::Vector & y
) const {
#ifdef LOG_FUNC_CALLS
std::cout << "MyEq::eval(" << print("x", x) << ")" << std::endl;
#endif
y[0] = sq(x[0] - 1) + sq(x[1]) - 0.81;
}
// g'(x)
void gp(
X::Vector const & x
) const {
// Allocate the initial memory for the cached value and derivative
if(!x_cache) {
x_cache = std::make_unique <X::Vector> (X::init(x));
gp_cache = std::make_unique <std::vector <double>>(2);
}
#ifdef LOG_FUNC_CALLS
std::cout << "MyEq::gp(" << print("x", x) << ")" << std::endl;
#endif
// Compute the derivative
(*gp_cache)[0] = 2*(x[0] - 1);
(*gp_cache)[1] = 2*x[1];
// Save the iterate
X::copy(x,*x_cache);
}
// y=g'(x)dx
void p(
X::Vector const & x,
X::Vector const & dx,
Y::Vector & y
) const {
// Return immediately if given a 0 direction
if(std::all_of(dx.begin(),dx.end(),[](double const & dxi){
return dxi==0;})
){
std::fill(y.begin(),y.end(),0.);
return;
}
// Compute a new value and cache it
if(!x_cache || *x_cache != x) {
gp(x);
}
#ifdef LOG_FUNC_CALLS
std::cout << "MyEq::p(" << print("x", x) << " " << print("dx", dx) << ")" << std::endl;
#endif
// Compute the matrix-vector product
y[0] = (*gp_cache)[0]*dx[0] + (*gp_cache)[1]*dx[1];
}
// xhat=g'(x)*dy
void ps(
X::Vector const & x,
Y::Vector const & dy,
X::Vector & xhat
) const {
// Return immediately if given a 0 direction
if(std::all_of(dy.begin(),dy.end(),[](double const & dyi){
return dyi==0;})
){
std::fill(xhat.begin(),xhat.end(),0.);
return;
}
// Compute a new value and cache it
if(!x_cache || *x_cache != x) {
gp(x);
}
#ifdef LOG_FUNC_CALLS
std::cout << "MyEq::ps(" << print("x", x) << " " << print("dy", dy) << ")" << std::endl;
#endif
// Compute the matrix-vector product
xhat[0] = (*gp_cache)[0] * dy[0];
xhat[1] = (*gp_cache)[1] * dy[0];
}
// xhat=(g''(x)dx)*dy
void pps(
X::Vector const & x,
X::Vector const & dx,
Y::Vector const & dy,
X::Vector & xhat
) const {
#ifdef LOG_FUNC_CALLS
std::cout << "MyEq::pps(" << print("x", x) << " " << print("dx", dx) << " " << print("dy", dy) << ")" << std::endl;
#endif
xhat[0] = 2.0*dx[0]*dy[0];
xhat[1] = 2.0*dx[1]*dy[0];
}
};
// Define a Schur preconditioner for the equality constraints
struct MyPrecon:
public Optizelle::Operator <double,Optizelle::Rm,Optizelle::Rm>
{
public:
typedef Optizelle::Rm <double> X;
typedef X::Vector X_Vector;
typedef Optizelle::Rm <double> Y;
typedef Y::Vector Y_Vector;
private:
X_Vector& x;
public:
MyPrecon(X::Vector& x_) : x(x_) {}
void eval(Y_Vector const & dy,Y_Vector & result) const {
#ifdef LOG_FUNC_CALLS
std::cout << "MyPrecon::eval(" << print("x", x) << ")" << std::endl;
#endif
result[0] = dy[0]/(4*sq(x[0] - 1) + 4*sq(x[1]));
}
};
int main()
{
// Create a type shortcut
using Optizelle::Rm;
auto x = std::vector <double> {1.0, 1.0};
// Allocate memory for dual
auto y = std::vector <double> (1);
Optizelle::EqualityConstrained <double,Rm,Rm>::State::t state(x,y);
state.algorithm_class = Optizelle::AlgorithmClass::LineSearch;
Optizelle::EqualityConstrained <double,Rm,Rm>::Functions::t fns;
fns.f.reset(new MyObj);
fns.g.reset(new MyEq);
fns.PSchur_left.reset(new MyPrecon(state.x));
Optizelle::EqualityConstrained <double,Rm,Rm>::Algorithms::getMin(
Optizelle::Messaging::stdout,fns,state);
std::cout << "The algorithm converged due to: " <<
Optizelle::OptimizationStop::to_string(state.opt_stop) <<
std::endl;
std::cout << std::scientific << std::setprecision(16)
<< "The optimal point is: (" << state.x[0] << ','
<< state.x[1] << ')' << std::endl;
return EXIT_SUCCESS;
}
More than that requires a deeper modification of the code, and it's not clear to me that it's worth the labor. For example, I noticed a redundant computation at lines 8103 and 9422 in optizelle.h. The first relates to computing how well we solved the linear feasibility with the Cauchy step. The second relates to caching linearized feasibility with respect to the trust-region globalization. Could this be cached? Sure. With the current design, it's not particularly easy to do so.
Broadly speaking, it's also a question of how much we should care. That really depends on the application. There are absolutely some applications where this is critical. I'll contend, though, that most of the time it doesn't really affect the overall run time. Generally speaking, the derivative calculations should be far more expensive. In the case of the equality constraints g, for a problem with a sizable number of constraints, the computation of the derivative g'(x) should dominate the cost. Now, even though this is a matrix-free code, realistically, we need a matrix to cache g'(x) because we need a preconditioner for g'(x)g'(x)*, where * here means adjoint. That factorization is likely vastly more costly than computing a matrix-vector product. And, recall, computing g'(x) itself is likely vastly more costly than computing a matrix-vector product.
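To make the cost argument concrete, here is a rough dense sketch under assumed names (JacobianCache, rebuild, apply, and precond are all hypothetical): cache g'(x) once per iterate as an m x n matrix, form and factor g'(x)g'(x)* once, and then every later operator application or Schur preconditioner application is a comparatively cheap matrix-vector product or triangular solve against that cache. A real problem would of course use whatever sparse or structured storage the constraints admit.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense illustration of caching g'(x) and factoring g'(x)g'(x)^T.
// The Jacobian evaluation and the factorization are paid once per iterate;
// everything applied against the cache afterwards is comparatively cheap.
struct JacobianCache {
    std::size_t m = 0, n = 0;
    std::vector <double> G;   // m x n Jacobian g'(x), row-major
    std::vector <double> L;   // m x m Cholesky factor of G*G^T, row-major

    // Expensive, once per iterate: store the Jacobian and factor G*G^T = L*L^T
    void rebuild(std::vector <double> const & G_, std::size_t m_, std::size_t n_) {
        m = m_; n = n_; G = G_;
        std::vector <double> A(m*m, 0.0);
        for(std::size_t i = 0; i < m; ++i)
            for(std::size_t j = 0; j < m; ++j)
                for(std::size_t k = 0; k < n; ++k)
                    A[i*m+j] += G[i*n+k]*G[j*n+k];
        L.assign(m*m, 0.0);
        for(std::size_t i = 0; i < m; ++i)
            for(std::size_t j = 0; j <= i; ++j) {
                double s = A[i*m+j];
                for(std::size_t k = 0; k < j; ++k) s -= L[i*m+k]*L[j*m+k];
                L[i*m+j] = (i == j) ? std::sqrt(s) : s/L[j*m+j];
            }
    }

    // Cheap: y = G*dx
    void apply(std::vector <double> const & dx, std::vector <double> & y) const {
        y.assign(m, 0.0);
        for(std::size_t i = 0; i < m; ++i)
            for(std::size_t k = 0; k < n; ++k)
                y[i] += G[i*n+k]*dx[k];
    }

    // Cheap: solve (G*G^T) z = dy using the cached factor L
    void precond(std::vector <double> const & dy, std::vector <double> & z) const {
        std::vector <double> w(m);
        for(std::size_t i = 0; i < m; ++i) {            // forward solve L*w = dy
            double s = dy[i];
            for(std::size_t k = 0; k < i; ++k) s -= L[i*m+k]*w[k];
            w[i] = s/L[i*m+i];
        }
        z.assign(m, 0.0);
        for(std::size_t i = m; i-- > 0; ) {             // backward solve L^T*z = w
            double s = w[i];
            for(std::size_t k = i+1; k < m; ++k) s -= L[k*m+i]*z[k];
            z[i] = s/L[i*m+i];
        }
    }
};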
Now, does that mean we should be sloppy? Of course not. There are a variety of issues related to redundant computation in the current code that have been removed in our new code, which is not released. Part of what made this possible was a fresh rewrite based on what we learned from the last code.
As to whether or not it would be worthwhile to just cache all of these combinations with integer identifiers, I'll still contend that, in the general case, probably not. The derivative and preconditioner computations should really be far more expensive once we get to more than a handful of variables. Though, certainly, there are cases where it could pay off. Generally speaking, I think there's better value in a clean rewrite, which is what we ultimately pursued internally.
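For completeness, the mechanical part of that idea is not hard; a sketch under assumed names (ProductCache, tag, and apply are all hypothetical) might assign each distinct iterate and direction an integer identifier and memoize operator applications keyed on that pair. The difficulty is threading those identifiers through an existing solver, not the cache itself.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: memoize operator applications by integer identifiers.
// Each distinct iterate/direction gets an id; products are keyed on (x_id, dx_id).
struct ProductCache {
    std::uint64_t next_id = 0;
    std::map <std::pair <std::uint64_t,std::uint64_t>, std::vector <double>> products;

    // Assign an identifier to a newly computed vector
    std::uint64_t tag() { return next_id++; }

    // Evaluate op() only if this (x_id, dx_id) pair has not been seen before
    template <typename Op>
    std::vector <double> const & apply(
        std::uint64_t x_id, std::uint64_t dx_id, Op const & op
    ) {
        auto key = std::make_pair(x_id, dx_id);
        auto it = products.find(key);
        if(it == products.end())
            it = products.emplace(key, op()).first;
        return it->second;
    }
};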