Both the series and the single models were trained with a 2-layer feedforward controller (hidden sizes 128 and 256, respectively) and ReLU activations, and both share the following set of hyperparameters (a configuration sketch follows the list):
- RMSProp optimizer with a learning rate of 10⁻⁴ and momentum of 0.9.
- Memory word size of 10, with a single read head.
- Controller weights are initialized from a zero-mean normal distribution, keeping samples within one standard deviation of the mean, with a variance that depends on $N$, the size of the input vector coming into the weight matrix.
- A batch size of 1.
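For concreteness, here is a minimal NumPy sketch of the shared hyperparameters and the weight-initialization scheme described above. The constant names, the function name `init_controller_weights`, and the exact fan-in scaling (a He-style `sqrt(2 / N)` standard deviation) are illustrative assumptions, not values copied from the training scripts.

```python
import numpy as np

# shared hyperparameters from the list above
LEARNING_RATE = 1e-4   # RMSProp learning rate
MOMENTUM = 0.9         # RMSProp momentum
WORD_SIZE = 10         # memory word size
READ_HEADS = 1         # single read head
BATCH_SIZE = 1

def init_controller_weights(input_size, output_size, rng=np.random):
    """Zero-mean normal init, keeping samples within one standard deviation.

    The standard deviation sqrt(2 / input_size) is an assumed fan-in scaling;
    the repository may use a different expression of the fan-in N.
    """
    stddev = np.sqrt(2.0 / input_size)
    w = rng.normal(0.0, stddev, size=(input_size, output_size))
    outside = np.abs(w) > stddev
    while outside.any():                       # re-draw out-of-range samples
        w[outside] = rng.normal(0.0, stddev, size=outside.sum())
        outside = np.abs(w) > stddev
    return w
```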
All output from the DNC is squashed between 0 and 1 using a sigmoid function, and a binary cross-entropy (logistic) loss of the form:

$$\mathcal{L}(y, \hat{y}) = -\frac{1}{B \cdot T \cdot S}\sum_{i=1}^{B}\sum_{t=1}^{T}\sum_{k=1}^{S}\Big[\, y_{itk}\log\hat{y}_{itk} + (1 - y_{itk})\log\big(1 - \hat{y}_{itk}\big) \Big]$$

is used, where $B$ is the batch size, $T$ the number of time steps, and $S$ the output size. That is, the mean of the logistic loss across the batch, time steps, and output size.
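The same loss written as a small NumPy sketch; the function names, the tensor shapes, and the epsilon guard against `log(0)` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def copy_task_loss(logits, targets, eps=1e-8):
    """Binary cross-entropy averaged over batch, time steps, and output size.

    logits, targets: arrays of shape (batch, time_steps, output_size).
    """
    y_hat = sigmoid(logits)
    per_element = -(targets * np.log(y_hat + eps)
                    + (1.0 - targets) * np.log(1.0 - y_hat + eps))
    return per_element.mean()   # mean across all three axes
```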
All gradients are clipped between -10 and 10.
NaNs may still occur during training!
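A minimal sketch of the clipping step described above, plus a NaN guard; the function names are illustrative, and the guard itself is an addition for clarity rather than something the training scripts are known to do.

```python
import numpy as np

def clip_gradients(grads, clip_value=10.0):
    """Element-wise clip each gradient array to [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

def any_nan(grads):
    """Illustrative guard: skip or inspect an update if a gradient went NaN."""
    return any(np.isnan(g).any() for g in grads)
```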
The model was first trained on a length-2 series of random binary vectors of size 6. Then, starting from the learned length-2 model, a length-4 model was trained in a curriculum-learning fashion.
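As a rough illustration of that training data, the sketch below builds a single copy-task example: a sequence of random binary vectors of size 6, followed by an end-of-input marker, with the target being the same sequence during the answer phase. The helper name `copy_example` and the exact input/target layout used by `tasks/copy/train-series.py` (delimiter channel, padding, how series are concatenated) are assumptions here.

```python
import numpy as np

def copy_example(seq_len=2, vector_size=6, rng=np.random):
    """One copy-task example: present `seq_len` random binary vectors, then an
    end marker, then expect the same vectors back during the answer phase."""
    seq = rng.randint(0, 2, size=(seq_len, vector_size)).astype(np.float32)

    total_steps = 2 * seq_len + 1
    inputs = np.zeros((total_steps, vector_size + 1), dtype=np.float32)
    inputs[:seq_len, :vector_size] = seq      # the sequence to memorize
    inputs[seq_len, vector_size] = 1.0        # end-of-input marker channel

    targets = np.zeros((total_steps, vector_size), dtype=np.float32)
    targets[seq_len + 1:, :] = seq            # reproduce it after the marker
    return inputs, targets
```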
The following plots show the learning curves for the length-2 and length-4 models respectively.
Attempting to train a length-4 model directly always resulted in NaNs. The paper mentions using curriculum learning for the graph and Mini-SHRDLU tasks but says nothing about the copy task, so this may not be the most efficient approach.
The length-2 model was trained by running:

```bash
$ python tasks/copy/train-series.py --length=2
```
Then, assuming the trained model from that run is saved under the name `step-100000`, the length-4 model was trained with:
```bash
$ python tasks/copy/train-series.py --length=4 --checkpoint=step-100000 --iterations=20000
```
The single model was trained directly on individual sequences of length 1 to 10, with the length chosen at random for each example, so no curriculum learning was used. The following plot shows the learning curve of the single model.
```bash
$ python tasks/copy/train.py --iterations=50000
```
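For the single model, the only difference on the data side is that a fresh sequence length is drawn for every example. A minimal sketch, reusing the illustrative `copy_example` helper from the sketch above (the function name is again an assumption):

```python
import numpy as np

def random_length_example(min_len=1, max_len=10, vector_size=6, rng=np.random):
    """Draw a new sequence length in [min_len, max_len] for every example."""
    seq_len = rng.randint(min_len, max_len + 1)   # +1: randint's upper bound is exclusive
    return copy_example(seq_len=seq_len, vector_size=vector_size, rng=rng)
```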