# brainpy.optim.Adadelta

class brainpy.optim.Adadelta(lr=0.01, train_vars=None, weight_decay=None, epsilon=1e-06, rho=0.95, name=None)

Optimizer that implements the Adadelta algorithm.

Adadelta [4] is a stochastic gradient descent method that uses a per-dimension adaptive learning rate to address two drawbacks of Adagrad:

• The continual decay of learning rates throughout training.

• The need for a manually selected global learning rate.

Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates instead of accumulating all past gradients. This way, Adadelta continues learning even after many updates. Unlike Adagrad, the original version of Adadelta does not require an initial learning rate.

$$\begin{aligned}
\boldsymbol{s}_t &\leftarrow \rho \boldsymbol{s}_{t-1} + (1 - \rho) \boldsymbol{g}_t \odot \boldsymbol{g}_t, \\
\boldsymbol{g}_t' &\leftarrow \sqrt{\frac{\Delta\boldsymbol{x}_{t-1} + \epsilon}{\boldsymbol{s}_t + \epsilon}} \odot \boldsymbol{g}_t, \\
\boldsymbol{x}_t &\leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}'_t, \\
\Delta\boldsymbol{x}_t &\leftarrow \rho \Delta\boldsymbol{x}_{t-1} + (1 - \rho) \boldsymbol{g}'_t \odot \boldsymbol{g}'_t.
\end{aligned}$$
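The four update equations above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BrainPy's implementation: the helper name `adadelta_step` is hypothetical, both accumulators are assumed to start at zero, and the `lr` and `weight_decay` arguments of the class are not modelled.

```python
import math


def adadelta_step(x, g, s, dx, rho=0.95, epsilon=1e-6):
    """Apply one Adadelta update to lists of floats (hypothetical helper).

    x  : current parameters
    g  : gradients at x
    s  : running average of squared gradients
    dx : running average of squared updates
    """
    new_x, new_s, new_dx = [], [], []
    for x_i, g_i, s_i, dx_i in zip(x, g, s, dx):
        # s_t <- rho * s_{t-1} + (1 - rho) * g_t^2
        s_i = rho * s_i + (1.0 - rho) * g_i * g_i
        # g'_t <- sqrt((dx_{t-1} + eps) / (s_t + eps)) * g_t
        step = math.sqrt((dx_i + epsilon) / (s_i + epsilon)) * g_i
        # x_t <- x_{t-1} - g'_t
        new_x.append(x_i - step)
        new_s.append(s_i)
        # dx_t <- rho * dx_{t-1} + (1 - rho) * g'_t^2
        new_dx.append(rho * dx_i + (1.0 - rho) * step * step)
    return new_x, new_s, new_dx


def loss(x):
    """Toy objective f(x) = sum(x_i^2), whose gradient is 2 * x."""
    return sum(x_i * x_i for x_i in x)


x = [1.0, -2.0]
s = [0.0, 0.0]   # assumed zero initial state
dx = [0.0, 0.0]  # assumed zero initial state
initial_loss = loss(x)

for _ in range(50):
    g = [2.0 * x_i for x_i in x]
    x, s, dx = adadelta_step(x, g, s, dx)

final_loss = loss(x)
```

Both coordinates take steps of the same size at first even though their gradients differ by a factor of two, because each gradient is divided by its own running root-mean-square.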

$\rho$ should be between 0 and 1. A value of $\rho$ close to 1 will decay the moving average slowly, and a value close to 0 will decay it quickly.
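To see why, the first recurrence can be unrolled. Assuming the accumulator starts at $\boldsymbol{s}_0 = \boldsymbol{0}$ (an assumption here, not stated above), it becomes an exponentially weighted sum of past squared gradients:

$$\boldsymbol{s}_t = (1 - \rho) \sum_{k=0}^{t-1} \rho^{k} \, \boldsymbol{g}_{t-k} \odot \boldsymbol{g}_{t-k}$$

A gradient from $k$ steps ago is weighted by $\rho^k$, so its influence halves every $\ln 0.5 / \ln \rho$ steps: roughly 13.5 steps for $\rho = 0.95$, and a single step for $\rho = 0.5$.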

$\rho = 0.95$ and $\epsilon = 10^{-6}$ are suggested in the paper and reported to work for multiple datasets (MNIST, speech).

In the paper, no learning rate is used, which corresponds to lr=1.0, and it is probably best to keep it at that value. Note that the default in the signature above is lr=0.01. epsilon is important for the very first update, where it keeps the numerator from being 0.
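As a worked illustration of the role of epsilon, assume both accumulators start at zero, $\boldsymbol{s}_0 = \Delta\boldsymbol{x}_0 = \boldsymbol{0}$ (an assumption; the initial state is not stated above). For a single coordinate with $g_1 = 1$, $\rho = 0.95$ and $\epsilon = 10^{-6}$:

$$s_1 = (1 - \rho)\, g_1^2 = 0.05, \qquad g_1' = \sqrt{\frac{0 + \epsilon}{s_1 + \epsilon}}\; g_1 = \sqrt{\frac{10^{-6}}{0.050001}} \approx 4.47 \times 10^{-3}$$

With $\epsilon = 0$ in the numerator, $g_1'$ would be exactly 0, $\Delta x_1$ would stay 0, and every later update would be 0 as well.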

Parameters:

lr (float, Scheduler) – learning rate.

References

__init__(lr=0.01, train_vars=None, weight_decay=None, epsilon=1e-06, rho=0.95, name=None)

Methods

| Method | Description |
| --- | --- |
| `__init__([lr, train_vars, weight_decay, ...])` | |
| `check_grads(grads)` | |
| `cpu()` | Move all variables into the CPU device. |
| `cuda()` | Move all variables into the GPU device. |
| `load_state_dict(state_dict[, warn, compatible])` | Copy parameters and buffers from state_dict into this module and its descendants. |
| `load_states(filename[, verbose])` | Load the model states. |
| `nodes([method, level, include_self])` | Collect all children nodes. |
| `register_implicit_nodes(*nodes[, node_cls])` | |
| `register_implicit_vars(*variables[, var_cls])` | |
| `register_train_vars([train_vars])` | |
| `register_vars([train_vars])` | |
| `save_states(filename[, variables])` | Save the model states. |
| `state_dict()` | Returns a dictionary containing a whole state of the module. |
| `to(device)` | Moves all variables into the given device. |
| `tpu()` | Move all variables into the TPU device. |
| `train_vars([method, level, include_self])` | The shortcut for retrieving all trainable variables. |
| `tree_flatten()` | Flattens the object as a PyTree. |
| `tree_unflatten(aux, dynamic_values)` | Unflatten the data to construct an object of this class. |
| `unique_name([name, type_])` | Get the unique name for this object. |
| `update(grads)` | |
| `vars([method, level, include_self, ...])` | Collect all variables in this node and the children nodes. |

Attributes

| Attribute | Description |
| --- | --- |
| `name` | Name of the model. |