• The continual decay of learning rates throughout training.

• The need for a manually selected global learning rate.

$\begin{split}\boldsymbol{s}_t \leftarrow \rho \boldsymbol{s}_{t-1} + (1 - \rho) \boldsymbol{g}_t \odot \boldsymbol{g}_t, \\ \boldsymbol{g}_t' \leftarrow \sqrt{\frac{\Delta\boldsymbol{x}_{t-1} + \epsilon}{\boldsymbol{s}_t + \epsilon}} \odot \boldsymbol{g}_t, \\ \boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}'_t, \\ \Delta\boldsymbol{x}_t \leftarrow \rho \Delta\boldsymbol{x}_{t-1} + (1 - \rho) \boldsymbol{g}'_t \odot \boldsymbol{g}'_t.\end{split}$

$$\rho$$ should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

$$\rho$$ = 0.95 and :math:epsilon=1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech).

In the paper, no learning rate is considered (so learning_rate=1.0). Probably best to keep it at this value. epsilon is important for the very first update (so the numerator does not become 0).

References

4

 __init__([train_vars, lr, epsilon, rho, name]) check_grads(grads) load_states(filename[, verbose]) Load the model states. nodes([method, level, include_self]) Collect all children nodes. register_implicit_nodes(nodes) register_implicit_vars(variables) register_vars([train_vars]) save_states(filename[, variables]) Save the model states. train_vars([method, level, include_self]) The shortcut for retrieving all trainable variables. unique_name([name, type_]) Get the unique name for this object. update(grads) vars([method, level, include_self]) Collect all variables in this node and the children nodes.
 name