Adam with weight decay regularization [1].

AdamW uses weight decay to regularize learning towards small weights, as this leads to better generalization. In SGD you can also use L2 regularization to implement this as an additive loss term, however L2 regularization does not behave as intended for adaptive gradient algorithms such as Adam.

\begin{align}\begin{aligned}\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{(lr)}, \: \beta_1, \beta_2 \text{(betas)}, \: \theta_0 \text{(params)}, \: f(\theta) \text{(objective)}, \: \epsilon \text{ (epsilon)} \\ &\hspace{13mm} \lambda \text{(weight decay)}, \: \textit{amsgrad}, \: \textit{maximize} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)}, v_0 \leftarrow 0 \text{ ( second moment)}, \: \widehat{v_0}^{max}\leftarrow 0 \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\\end{split}\\\begin{split} &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\ &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\textbf{if} \: amsgrad \\ &\hspace{10mm}\widehat{v_t}^{max} \leftarrow \mathrm{max}(\widehat{v_t}^{max}, \widehat{v_t}) \\ &\hspace{10mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}^{max}} + \epsilon \big) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\end{aligned}\end{align}
Parameters:
• lr (float, Scheduler) – learning rate.

• beta1 (optional, float) – A positive scalar value for beta_1, the exponential decay rate for the first moment estimates. Generally close to 1.

• beta2 (optional, float) – A positive scalar value for beta_2, the exponential decay rate for the second moment estimates. Generally close to 1.

• eps (optional, float) – A positive scalar value for epsilon, a small constant for numerical stability.

• weight_decay (float) – Strength of the weight decay regularization. Note that this weight decay is multiplied with the learning rate.

• amsgrad (bool) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond.

• name (optional, str) – The optimizer name.

References

__init__(lr, train_vars=None, beta1=0.9, beta2=0.999, eps=1e-08, weight_decay=0.01, amsgrad=False, name=None)[source]#

Methods

 __init__(lr[, train_vars, beta1, beta2, ...]) check_grads(grads) cpu() Move all variable into the CPU device. cuda() Move all variables into the GPU device. load_state_dict(state_dict[, warn, compatible]) Copy parameters and buffers from state_dict into this module and its descendants. load_states(filename[, verbose]) Load the model states. nodes([method, level, include_self]) Collect all children nodes. register_implicit_nodes(*nodes[, node_cls]) register_implicit_vars(*variables[, var_cls]) register_train_vars([train_vars]) register_vars([train_vars]) save_states(filename[, variables]) Save the model states. state_dict() Returns a dictionary containing a whole state of the module. to(device) Moves all variables into the given device. tpu() Move all variables into the TPU device. train_vars([method, level, include_self]) The shortcut for retrieving all trainable variables. tree_flatten() Flattens the object as a PyTree. tree_unflatten(aux, dynamic_values) Unflatten the data to construct an object of this class. unique_name([name, type_]) Get the unique name for this object. update(grads) vars([method, level, include_self, ...]) Collect all variables in this node and the children nodes.

Attributes

 name Name of the model.