Should be able to take hyper-gradients of Adam and other optimizers where reasonable
Issue #287 | Created by @dsyme | 2021-03-09 18:32:16 UTC | 1.0
The Adam code is not differentiable everywhere; in particular, some hyper-gradients easily become NaN in the Adam optimizer, as described in this comment.
Specifically, if we look at this code:
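(paraphrased here as a sketch of the relevant lines; the actual source also applies bias correction and related details)

    let expAvg   = stateExpAvg.[name].mul(beta1).add(d * (1. - beta1))
    let expAvgSq = stateExpAvgSq.[name].mul(beta2).add(d * d * (1. - beta2))
    let denom    = expAvgSq.sqrt().add(eps)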
then the sqrt operations have NaN derivative if the primal is zero (see the sqrt derivative). So if expAvgSq is zero, then there is no gradient, which in turn happens if d and stateExpAvgSq.[name] were zero, i.e. it can happen at the beginning of optimization AFAICS.
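Concretely, the derivative of sqrt at zero can be checked directly (a minimal check, assuming the dsharp.diff API):

    open DiffSharp

    // d/dx sqrt(x) = 1 / (2 * sqrt(x)), which is not finite at x = 0, so the
    // tangent/adjoint propagated through sqrt blows up there.
    let dSqrtAtZero = dsharp.diff (fun (x: Tensor) -> x.sqrt()) (dsharp.tensor 0.0)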
I believe this is sort of similar to what the "epsilon" parameter is for: to make sure the denom is always non-zero. So I tried sprinkling a couple of additions of eps in a little earlier, and the hyper-gradients returned:

    if stateStep = 0 then
        stateExpAvg <- model.parameters.map(fun (t: Tensor) -> t.zerosLike().add(eps))   // add eps so always non-zero
        stateExpAvgSq <- model.parameters.map(fun (t: Tensor) -> t.zerosLike().add(eps)) // add eps so always non-zero
    stateStep <- stateStep + 1
    let expAvg = stateExpAvg.[name].mul(beta1).add((d * (1. - beta1)).add(eps))          // add eps so always non-zero
    let expAvgSq = stateExpAvgSq.[name].mul(beta2).add((d * d * (1. - beta2)).add(eps))  // add eps so always non-zero

Likely we only need one or two of these.
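Presumably the addition that matters most is the one feeding the sqrt, since d/dx sqrt(x + eps) = 1/(2 sqrt(x + eps)) stays finite at x = 0. For illustration (a variant sketch, not the patch above), keeping eps inside the square root is enough to keep that particular derivative finite:

    let denom = expAvgSq.add(eps).sqrt()   // finite derivative even when expAvgSq = 0

whereas expAvgSq.sqrt().add(eps) still differentiates sqrt at exactly zero.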
Comment by @dsyme | 2021-07-05 12:39:09 UTC
More generally, we need test cases for taking hyper-gradients of each of the optimizers; see https://github.com/DiffSharp/DiffSharp/blob/dev/tests/DiffSharp.Tests/TestDerivatives.Nested.fs
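A minimal sketch of what such a test could look like, with the Adam-style update written out by hand (so nothing here relies on the library's optimizer API) and the hyper-gradient taken with respect to the learning rate via dsharp.grad and dsharp.diff:

    open DiffSharp

    // Hyper-gradient of a short training run with respect to the learning rate.
    // The Adam-style update is hand-inlined so the whole run is one differentiable
    // tensor computation; this is a sketch, not the library's Adam implementation.
    let trainingLoss (lr: Tensor) =
        let beta1, beta2, eps = 0.9, 0.999, 1e-8
        let target = dsharp.tensor [3.0; -1.0]
        let objective (w: Tensor) = ((w - target) * (w - target)).sum()
        let mutable w = dsharp.tensor [1.0; 2.0]
        let mutable m = w.zerosLike()
        let mutable v = w.zerosLike()
        for _ in 1 .. 5 do
            let g = dsharp.grad objective w
            m <- m * beta1 + g * (1.0 - beta1)
            v <- v * beta2 + g * g * (1.0 - beta2)
            w <- w - lr * m / (v.sqrt() + eps)
        objective w

    // A test would assert this hyper-gradient is finite (not NaN) and, ideally,
    // close to a numerical estimate.
    let hyperGrad = dsharp.diff trainingLoss (dsharp.tensor 0.1)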