• The continual decay of learning rates throughout training.

• The need for a manually selected global learning rate.

$\begin{aligned}\boldsymbol{s}_t &\leftarrow \rho \boldsymbol{s}_{t-1} + (1 - \rho) \boldsymbol{g}_t \odot \boldsymbol{g}_t, \\ \boldsymbol{g}_t' &\leftarrow \sqrt{\frac{\Delta\boldsymbol{x}_{t-1} + \epsilon}{\boldsymbol{s}_t + \epsilon}} \odot \boldsymbol{g}_t, \\ \boldsymbol{x}_t &\leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}'_t, \\ \Delta\boldsymbol{x}_t &\leftarrow \rho \Delta\boldsymbol{x}_{t-1} + (1 - \rho) \boldsymbol{g}'_t \odot \boldsymbol{g}'_t.\end{aligned}$
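The four updates above can be sketched directly in NumPy. This is a minimal, hypothetical standalone implementation for illustration only, not this library's code; the function name `adadelta_step` is invented here.

```python
import numpy as np

def adadelta_step(x, grad, s, delta, rho=0.95, eps=1e-6):
    """One AdaDelta update, following the four equations above."""
    s = rho * s + (1 - rho) * grad * grad                 # EMA of squared gradients
    g_prime = np.sqrt((delta + eps) / (s + eps)) * grad   # rescaled gradient (no global lr)
    x = x - g_prime                                       # parameter update
    delta = rho * delta + (1 - rho) * g_prime * g_prime   # EMA of squared updates
    return x, s, delta

# Toy usage: minimize f(x) = x^2 starting from x = 1.0
x = np.array([1.0])
s = np.zeros_like(x)       # s_0 = 0
delta = np.zeros_like(x)   # Delta x_0 = 0
for _ in range(500):
    grad = 2 * x           # gradient of x^2
    x, s, delta = adadelta_step(x, grad, s, delta)
```

Note that no learning rate appears anywhere: the step size is determined entirely by the ratio of the two moving averages.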

$\rho$ should be between 0 and 1. A value of $\rho$ close to 1 decays the moving average slowly; a value close to 0 decays it quickly.

$\rho = 0.95$ and $\epsilon = 10^{-6}$ are suggested in the paper and reported to work for multiple datasets (MNIST, speech).

In the paper, no learning rate is considered (equivalently, learning_rate=1.0), and it is probably best to keep it at this value. $\epsilon$ is important for the very first update: since $\Delta\boldsymbol{x}_0 = \boldsymbol{0}$, the numerator would otherwise be zero and the first step would vanish.
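The role of $\epsilon$ in the first update can be checked numerically. The sketch below (hypothetical variable names, not library code) evaluates the rescaled gradient at $t = 1$, where $\boldsymbol{s}_0 = \boldsymbol{0}$ and $\Delta\boldsymbol{x}_0 = \boldsymbol{0}$.

```python
import numpy as np

eps = 1e-6
rho = 0.95
g1 = np.array([2.0])            # first gradient
s1 = (1 - rho) * g1 * g1        # s_1, since s_0 = 0
delta0 = np.zeros_like(g1)      # Delta x_0 = 0: no accumulated updates yet

# With eps in the numerator, the first step is small but nonzero:
step_with_eps = np.sqrt((delta0 + eps) / (s1 + eps)) * g1

# Without eps, the numerator is 0 and the first step vanishes entirely:
step_without_eps = np.sqrt(delta0 / s1) * g1
```

Here `step_with_eps` is on the order of $\sqrt{\epsilon / s_1}\,g_1$, while `step_without_eps` is exactly zero, so without $\epsilon$ the optimizer could never move.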

Parameters:

lr (float or Scheduler) – learning rate.

References

Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1204.5701.

__init__(lr=0.01, train_vars=None, weight_decay=None, epsilon=1e-06, rho=0.95, name=None)[source]#

Methods

__init__([lr, train_vars, weight_decay, ...])

check_grads(grads)

cpu() – Move all variables into the CPU device.

cuda() – Move all variables into the GPU device.

load_state_dict(state_dict[, warn, compatible]) – Copy parameters and buffers from state_dict into this module and its descendants.

load_states(filename[, verbose]) – Load the model states.

nodes([method, level, include_self]) – Collect all children nodes.

register_implicit_nodes(*nodes[, node_cls])

register_implicit_vars(*variables[, var_cls])

register_train_vars([train_vars])

register_vars([train_vars])

save_states(filename[, variables]) – Save the model states.

state_dict() – Returns a dictionary containing a whole state of the module.

to(device) – Moves all variables into the given device.

tpu() – Move all variables into the TPU device.

train_vars([method, level, include_self]) – The shortcut for retrieving all trainable variables.

tree_flatten() – Flattens the object as a PyTree.

tree_unflatten(aux, dynamic_values) – Unflatten the data to construct an object of this class.

unique_name([name, type_]) – Get the unique name for this object.

update(grads)

vars([method, level, include_self, ...]) – Collect all variables in this node and the children nodes.

Attributes

name – Name of the model.