neural network training optimization gradient descent
The retrieved evidence covers three distinct contributions to gradient-descent-based optimization for neural networks.
1. Warm Restarts for SGD
The first work proposes a warm restart technique for stochastic gradient descent (SGD) to improve its anytime performance when training deep neural networks. The core idea is drawn from gradient-free optimization, where restart strategies are used to deal with multimodal loss landscapes.
The standard SGD parameter update can be written as:
$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal{L}(\theta_t)$$
where $\theta_t$ are the model parameters at step $t$, $\eta_t$ is the learning rate at step $t$ (given by the schedule), and $\nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss at $\theta_t$. Instead of decaying $\eta_t$ monotonically, the warm-restart approach periodically resets it to a large value, which can help the optimizer move out of poor local minima.
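As an illustration of a periodically reset learning rate, here is a minimal sketch of a cosine-annealed schedule with warm restarts plugged into the update rule above. The function names, the default values (`eta_min`, `eta_max`, `period`, `period_mult`), and the scalar parameter are assumptions made for this example, not details taken from the retrieved evidence.

```python
import math


def warm_restart_lr(step, eta_min=0.0, eta_max=0.1, period=10, period_mult=2):
    """Learning rate at `step` under cosine annealing with warm restarts.

    The first cycle lasts `period` steps and each later cycle is
    `period_mult` times longer. All defaults are illustrative assumptions.
    """
    t_cur, t_i = step, period
    # Skip over completed cycles to find the position in the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= period_mult
    # Cosine decay from eta_max toward eta_min within the current cycle.
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)


def sgd_with_restarts(grad_fn, theta, steps):
    """Apply theta <- theta - eta_t * grad(theta) for a scalar theta."""
    for t in range(steps):
        theta = theta - warm_restart_lr(t) * grad_fn(theta)
    return theta


# Example: minimize L(theta) = theta^2, whose gradient is 2 * theta.
theta_final = sgd_with_restarts(lambda th: 2.0 * th, theta=1.0, steps=30)
```

With these defaults the rate returns to `eta_max` at steps 0, 10, and 30, and decays along a cosine curve in between.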
2. Distributed Asynchronous SGD for Large-Scale RNN Acoustic Modeling
The second work introduced the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent on a large cluster of machines. In asynchronous SGD, multiple workers compute gradients independently and apply them to a shared parameter server without locking, so training can scale across many nodes.
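The lock-free update pattern can be sketched on a single machine with threads. This is a toy stand-in only: it fits a one-parameter linear model rather than an LSTM, a Python dictionary plays the role of the parameter server, and every name and hyperparameter (`async_sgd`, `num_workers`, `steps_per_worker`, `lr`) is an assumption made for the example.

```python
import random
import threading


def async_sgd(data, num_workers=4, steps_per_worker=2000, lr=0.01, seed=0):
    """Fit y = w * x with several workers updating one shared parameter."""
    params = {"w": 0.0}  # stands in for the shared parameter server

    def worker(worker_id):
        rng = random.Random(seed + worker_id)
        for _ in range(steps_per_worker):
            x, y = rng.choice(data)
            w = params["w"]                # read, possibly stale
            grad = 2.0 * (w * x - y) * x   # gradient of (w * x - y) ** 2
            # No lock is taken: another worker may have written in between,
            # which is the staleness that asynchronous SGD tolerates.
            params["w"] -= lr * grad

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params["w"]


# Noise-free data generated from y = 3 * x.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 11)]
w_fit = async_sgd(data)
```

Because workers interleave nondeterministically, the exact value of `w_fit` varies between runs, but on this noise-free problem it should end up close to 3.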
3. SGD for Optimizing Multiresolution Hash Feature Vectors
Müller et al. (2022) apply SGD in the context of neural graphics primitives, augmenting a small neural network with a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, enabling a simpler, GPU-parallelizable architecture. The combined system achieves training of high-quality neural graphics primitives in seconds and rendering in tens of milliseconds at 1920×1080 resolution.
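To make the idea of SGD-trained hash-table features concrete, here is a heavily simplified one-dimensional sketch. It is not the authors' implementation: the input is a scalar in [0, 1), each table entry is a single number rather than a feature vector, the hash is a toy modular one, and the small neural network is replaced by a plain sum over levels. All names and constants are assumptions for illustration.

```python
import math
import random


def make_tables(num_levels=4, table_size=16, seed=0):
    """One small table of trainable scalars per resolution level."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1e-4, 1e-4) for _ in range(table_size)]
        for _ in range(num_levels)
    ]


def lookup(x, level, table_size, base_res=4):
    """Hashed slots of the two grid vertices around x, plus the blend weight."""
    res = base_res * 2 ** level          # grid resolution doubles per level
    pos = x * res
    i0 = int(math.floor(pos))
    frac = pos - i0
    # Toy hash: finer levels have more vertices than slots, so they collide.
    h0 = (i0 * 2654435761) % table_size
    h1 = ((i0 + 1) * 2654435761) % table_size
    return h0, h1, frac


def predict(tables, x):
    """Sum of linearly interpolated features over all levels."""
    out = 0.0
    for level, table in enumerate(tables):
        h0, h1, frac = lookup(x, level, len(table))
        out += (1 - frac) * table[h0] + frac * table[h1]
    return out


def sgd_step(tables, x, y, lr=0.1):
    """One SGD step on the squared error; only the touched slots change."""
    err = predict(tables, x) - y
    for level, table in enumerate(tables):
        h0, h1, frac = lookup(x, level, len(table))
        table[h0] -= lr * 2 * err * (1 - frac)
        table[h1] -= lr * 2 * err * frac


def train(tables, target, steps=5000, seed=1):
    rng = random.Random(seed)
    for _ in range(steps):
        x = rng.random()
        sgd_step(tables, x, target(x))


def mse(tables, target, n=100):
    return sum((predict(tables, k / n) - target(k / n)) ** 2 for k in range(n)) / n
```

A possible use is to fit `lambda x: math.sin(2 * math.pi * x)` by calling `train` and comparing `mse` before and after. The coarse levels have no collisions in this setup, while the finer ones share slots between vertices.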
Summary of Contributions
| Contribution | Technique | Source |
|---|---|---|
| Warm restarts for learning rate scheduling | Cosine annealing SGD restarts | |
| Distributed async SGD for LSTMs | Asynchronous multi-machine SGD | |
| Hash-table feature optimization | SGD on trainable feature vectors | Müller et al. (2022) |
Scope note: The evidence partially answers the broader question of gradient descent optimization. It does not cover topics such as Adam, momentum, second-order methods, or batch normalization, as those are not present in the retrieved evidence blocks.