Should be able to take hyper-gradients of Adam and other optimizers where reasonable
Issue #287 | Created by @dsyme | 2021-03-09 18:32:16 UTC | 1.0
The Adam code is not differentiable everywhere; in particular, some hyper-gradients easily become NaN in the Adam optimizer, as described in this comment.
Specifically, if we look at this code:
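(paraphrased here as a sketch of the relevant lines; the actual source also applies bias correction and related details)

    let expAvg   = stateExpAvg.[name].mul(beta1).add(d * (1. - beta1))
    let expAvgSq = stateExpAvgSq.[name].mul(beta2).add(d * d * (1. - beta2))
    let denom    = expAvgSq.sqrt().add(eps)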
then the sqrt operations have NaN derivative if the primal is zero (see the sqrt derivative). So if expAvgSq is zero, then there is no gradient, which in turn happens if d and stateExpAvgSq.[name] were zero, i.e. it can happen at the beginning of optimization AFAICS.
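Concretely, the derivative of sqrt at zero can be checked directly (a minimal check, assuming the dsharp.diff API):

    open DiffSharp

    // d/dx sqrt(x) = 1 / (2 * sqrt(x)), which is not finite at x = 0, so the
    // tangent/adjoint propagated through sqrt blows up there.
    let dSqrtAtZero = dsharp.diff (fun (x: Tensor) -> x.sqrt()) (dsharp.tensor 0.0)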
I believe this is sort of similar to what the "epsilon" parameter is for: to make sure the denom is always non-zero. So I tried sprinkling a couple of additions of eps in a little earlier, and the hyper-gradients returned:

    if stateStep = 0 then
        stateExpAvg <- model.parameters.map(fun (t: Tensor) -> t.zerosLike().add(eps))   // add eps so always non-zero
        stateExpAvgSq <- model.parameters.map(fun (t: Tensor) -> t.zerosLike().add(eps)) // add eps so always non-zero
    stateStep <- stateStep + 1
    let expAvg = stateExpAvg.[name].mul(beta1).add((d * (1. - beta1)).add(eps))          // add eps so always non-zero
    let expAvgSq = stateExpAvgSq.[name].mul(beta2).add((d * d * (1. - beta2)).add(eps))  // add eps so always non-zero

Likely we only need one or two of these.
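Presumably the addition that matters most is the one feeding the sqrt, since d/dx sqrt(x + eps) = 1/(2 sqrt(x + eps)) stays finite at x = 0. For illustration (a variant sketch, not the patch above), keeping eps inside the square root is enough to keep that particular derivative finite:

    let denom = expAvgSq.add(eps).sqrt()   // finite derivative even when expAvgSq = 0

whereas expAvgSq.sqrt().add(eps) still differentiates sqrt at exactly zero.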
Comment by @dsyme | 2021-07-05 12:39:09 UTC
More generally, we need test cases for taking hyper-gradients of each of the optimizers; see https://github.com/DiffSharp/DiffSharp/blob/dev/tests/DiffSharp.Tests/TestDerivatives.Nested.fs
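A minimal sketch of what such a test could look like, with the Adam-style update written out by hand (so nothing here relies on the library's optimizer API) and the hyper-gradient taken with respect to the learning rate via dsharp.grad and dsharp.diff:

    open DiffSharp

    // Hyper-gradient of a short training run with respect to the learning rate.
    // The Adam-style update is hand-inlined so the whole run is one differentiable
    // tensor computation; this is a sketch, not the library's Adam implementation.
    let trainingLoss (lr: Tensor) =
        let beta1, beta2, eps = 0.9, 0.999, 1e-8
        let target = dsharp.tensor [3.0; -1.0]
        let objective (w: Tensor) = ((w - target) * (w - target)).sum()
        let mutable w = dsharp.tensor [1.0; 2.0]
        let mutable m = w.zerosLike()
        let mutable v = w.zerosLike()
        for _ in 1 .. 5 do
            let g = dsharp.grad objective w
            m <- m * beta1 + g * (1.0 - beta1)
            v <- v * beta2 + g * g * (1.0 - beta2)
            w <- w - lr * m / (v.sqrt() + eps)
        objective w

    // A test would assert this hyper-gradient is finite (not NaN) and, ideally,
    // close to a numerical estimate.
    let hyperGrad = dsharp.diff trainingLoss (dsharp.tensor 0.1)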