Next week I'll be at #NeurIPS2022 presenting a couple of papers. The first one is on #autodiff through #optimization (aka #unrolling) and its bizarre convergence properties. A thread on the paper (https://arxiv.org/pdf/2209.13271.pdf) (1/9)
How do we differentiate implicit functions, that is, functions that don't admit a closed-form formula but are instead defined as the solution to an optimization problem? (2/9)
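For concreteness, here's the setup in symbols (my notation, not copied from the paper): the implicit function is the minimizer of a parametrized objective, and the implicit function theorem gives its Jacobian whenever the Hessian is invertible.

```latex
% Setup (my notation): the implicit function is the minimizer of f(., theta)
x^\star(\theta) = \operatorname*{arg\,min}_{x} f(x, \theta)
% Implicit function theorem, assuming \nabla^2_{xx} f is invertible:
\partial x^\star(\theta) = -\bigl(\nabla^2_{xx} f(x^\star(\theta), \theta)\bigr)^{-1}\,\nabla^2_{x\theta} f(x^\star(\theta), \theta)
```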
One popular approach is differentiating through optimization (aka unrolled differentiation). This approach assumes one has access to an optimization algorithm to approximate the implicit function and then uses automatic differentiation on this approximation (3/9)
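As a minimal sketch of what that looks like (the toy objective, step size and iteration count below are mine, not from the paper): unroll a few gradient-descent steps and let JAX differentiate through them.

```python
# Minimal unrolling sketch (toy objective and constants are mine, not the paper's).
import jax
import jax.numpy as jnp

def inner_objective(x, theta):
    # Toy parametrized objective f(x, theta); x*(theta) is its minimizer.
    return 0.5 * jnp.sum((x - theta) ** 2) + 0.1 * jnp.sum(x ** 2)

def unrolled_solution(theta, num_steps=100, step_size=0.5):
    # Approximate x*(theta) by unrolling `num_steps` gradient-descent steps.
    x = jnp.zeros_like(theta)
    for _ in range(num_steps):
        x = x - step_size * jax.grad(inner_objective)(x, theta)
    return x

theta = jnp.array([1.0, -2.0, 3.0])
# Autodiff through the unrolled loop gives an approximation of dx*/dtheta.
approx_jacobian = jax.jacobian(unrolled_solution)(theta)
print(approx_jacobian)  # close to (1/1.2) * identity for this toy objective
```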
But are good optimization algorithms also good for unrolled differentiation?
It turns out the answer is a resounding NO: standard optimization algorithms are subject to what we call the curse of unrolling. Let me explain (4/9)
For gradient descent, a large step size (2/(L+mu)) leads to a fast asymptotic convergence rate and a monotonically decreasing objective. But when used for unrolling, the picture changes: the error in approximating the Jacobian is no longer monotonically decreasing. (5/9)
Instead, we see a large initial increase: each iteration gives a poorer approximation than the one before. We call this initial phase the burn-in phase. A similar phenomenon exists for other methods such as the Chebyshev iterative method (a special case of gradient descent with momentum). (6/9)
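Here's a hedged sketch of that behaviour on a toy 1-D quadratic (the objective, constants and helper names below are mine, not the paper's code): compare the derivative of the t-th unrolled iterate against the exact derivative from the implicit function theorem.

```python
# Burn-in sketch on a toy 1-D quadratic (my example, not the paper's code).
import jax

mu, L = 0.1, 1.0            # assumed curvature range for the step-size choice
step = 2.0 / (L + mu)       # the "large" step size discussed above

def objective(x, theta):
    # Curvature theta; minimizer x*(theta) = 1 / theta.
    return 0.5 * theta * x ** 2 - x

def unrolled(theta, num_steps):
    # num_steps gradient-descent steps from x = 0, unrolled for autodiff.
    x = 0.0
    for _ in range(num_steps):
        x = x - step * jax.grad(objective)(x, theta)
    return x

theta = 1.0                         # evaluate where the curvature equals L
exact = -1.0 / theta ** 2           # d x*(theta) / d theta

for t in range(1, 32, 3):
    d_t = jax.grad(unrolled)(theta, t)      # derivative of the t-th iterate
    print(t, float(abs(d_t - exact)))       # rises at first (burn-in), then decays
```

On this toy problem the objective value decreases monotonically, but the derivative error grows for the first handful of iterations before the asymptotic rate kicks in.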
For gradient descent there's a trade-off between the size and length of the burn-in phase and the asymptotic convergence rate: one cannot have both a small burn-in phase and a fast asymptotic rate. This is what we call the curse of unrolling. (7/9)
Finally, we propose a class of methods that _do_ overcome the curse of unrolling. Unlike classical accelerated methods, they are based not on classical orthogonal polynomials but on Sobolev orthogonal polynomials. These methods have the best of both worlds: no burn-in phase and fast asymptotic convergence (8/9)
Link to paper:
The Curse of Unrolling: Rate of Differentiating Through Optimization, https://arxiv.org/pdf/2209.13271.pdf
Joint work with Damien Scieur, Quentin Bertrand and Gauthier Gidel (9/9)