


How to Increase Sparsity in Neural Networks

1 Introduction

Deep neural networks attain state-of-the-art performance in a variety of domains including image classification (He et al., 2016), machine translation (Vaswani et al., 2017), and text-to-speech (van den Oord et al., 2016; Kalchbrenner et al., 2018). While model quality has been shown to scale with model and dataset size (Hestness et al., 2017), the resources required to train and deploy large neural networks can be prohibitive. State-of-the-art models for tasks like image classification and machine translation commonly have tens of millions of parameters, and require billions of floating-point operations to make a prediction for a single input sample.

Sparsity has emerged as a leading approach to address these challenges. By sparsity, we refer to the property that a subset of the model parameters have a value of exactly zero. (The term sparsity is also commonly used to refer to the proportion of a neural network's weights that are zero valued. Higher sparsity corresponds to fewer weights, and smaller computational and storage requirements. We use the term in this way throughout this paper.) With zero valued weights, any multiplications (which dominate neural network computation) can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. It has been shown empirically that deep neural networks can tolerate high levels of sparsity (Han et al., 2015; Narang et al., 2017; Ullrich et al., 2017), and this property has been leveraged to significantly reduce the cost associated with the deployment of deep neural networks, and to enable the deployment of state-of-the-art models in severely resource constrained environments (Theis et al., 2018; Kalchbrenner et al., 2018; Valin & Skoglund, 2018).
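As a rough illustration of the storage argument (a sketch of ours, not from the paper), a weight matrix with 90% of its entries zeroed can be held in compressed sparse row (CSR) format at a fraction of its dense size, assuming NumPy and SciPy are available:

    # Minimal sketch: storage savings from a sparse matrix format.
    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    dense = rng.standard_normal((1024, 1024)).astype(np.float32)

    # Zero out 90% of the weights, keeping the largest-magnitude 10%.
    threshold = np.quantile(np.abs(dense), 0.90)
    dense[np.abs(dense) < threshold] = 0.0

    csr = sparse.csr_matrix(dense)  # compressed sparse row format
    dense_bytes = dense.nbytes
    sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    print(f"dense: {dense_bytes} bytes, CSR: {sparse_bytes} bytes")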

Over the past few years, numerous techniques for inducing sparsity have been proposed and the set of models and datasets used as benchmarks has grown too large to reasonably expect new approaches to explore them all. In addition to the lack of standardization in modeling tasks, the distribution of benchmarks tends to skew heavily towards convolutional architectures and computer vision tasks, and the tasks used to evaluate new techniques are often not representative of the scale and complexity of real-world tasks where model compression is most useful. These characteristics make it difficult to come away from the sparsity literature with a clear understanding of the relative merits of different approaches.

In addition to practical concerns around comparing techniques, multiple independent studies have recently proposed that the value of sparsification in neural networks has been misunderstood (Frankle & Carbin, 2018; Liu et al., 2018). While both papers suggest that sparsification can be viewed as a form of neural architecture search, they disagree on what is necessary to achieve this. Specifically, Liu et al. (2018) re-train learned sparse topologies with a random weight initialization, whereas Frankle & Carbin (2018) posit that the exact random weight initialization used when the sparse architecture was learned is needed to match the test set performance of the model sparsified during optimization.

In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), $\ell_0$ regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and $\ell_0$ regularization to models of this scale. While variational dropout and $\ell_0$ regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification (https://bit.ly/2ExE8Yj).

2 Sparsity in Neural Networks

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks.

Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Ström, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017).

Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b, a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.

Some of the earliest techniques for sparsifying neural networks make use of second-order approximations of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).

Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).

A key feature of a sparsity inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3 Evaluating Sparsification Techniques at Scale

As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and $\ell_0$ regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing the lower bound of the compression-accuracy trade-off any method should be expected to reach.

Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A.

3.1 Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model_pruning library (https://bit.ly/2T8hBGn). This technique allows for masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).
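The following is a minimal NumPy sketch of the two ingredients described above, a gradual sparsity schedule and sort-based magnitude thresholding; it is our illustration under simplifying assumptions, not the TensorFlow model_pruning implementation:

    # Minimal sketch (assumptions, not the library code): gradual magnitude
    # pruning with a cubic sparsity ramp and sort-based thresholding.
    import numpy as np

    def target_sparsity(step, s_final, begin_step, end_step):
        """Cubic ramp from 0 to s_final between begin_step and end_step."""
        t = np.clip((step - begin_step) / float(end_step - begin_step), 0.0, 1.0)
        return s_final * (1.0 - (1.0 - t) ** 3)

    def magnitude_mask(weights, sparsity):
        """Keep the largest-magnitude (1 - sparsity) fraction of weights."""
        k = int(round(sparsity * weights.size))
        if k == 0:
            return np.ones_like(weights, dtype=bool)
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        return np.abs(weights) > threshold

    # Usage: at each pruning step, the mask is re-derived from the current dense
    # weights, so previously masked weights can re-enter if their magnitude grows.
    w = np.random.randn(256, 256)
    for step in range(0, 10000, 1000):
        s = target_sparsity(step, s_final=0.9, begin_step=0, end_step=9000)
        mask = magnitude_mask(w, s)
        w_masked = w * mask  # used in the forward pass; the dense w keeps training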

3.2 Variational Dropout

Variational dropout was originally proposed as a re-interpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by training a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).

Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model.

3.3 $\ell_0$ Regularization

$\ell_0$ regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the $\ell_0$-norm is both non-convex and non-differentiable. To address the non-differentiability of the $\ell_0$-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected $\ell_0$-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.

Table 1: Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

3.4 Random Pruning Baseline

For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned at each step at random rather than based on magnitude, and does not allow pruned weights to reactivate. This technique is intended to represent a lower bound of the accuracy-sparsity trade-off curve.
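A minimal sketch of how the random baseline's mask update differs from the magnitude-based one above (our assumption about the exact bookkeeping, not the modified library code):

    # Minimal sketch: the random baseline removes surviving weights uniformly at
    # random to hit the scheduled sparsity, and never reactivates pruned weights.
    import numpy as np

    def random_mask(prev_mask, sparsity, rng):
        """prev_mask: boolean array; True = weight still present."""
        mask = prev_mask.copy()
        n_to_keep = int(round((1.0 - sparsity) * mask.size))
        alive = np.flatnonzero(mask)
        if alive.size > n_to_keep:
            drop = rng.choice(alive, size=alive.size - n_to_keep, replace=False)
            mask.ravel()[drop] = False
        return mask  # once pruned, weights stay pruned

    rng = np.random.default_rng(0)
    mask = np.ones((64, 64), dtype=bool)
    for s in (0.25, 0.5, 0.75):      # sparsity targets from the schedule
        mask = random_mask(mask, s, rng)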

3.5 Experimental Framework

For magnitude pruning, we used the TensorFlow model pruning library. We implemented variational dropout and $\ell_0$ regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our $\ell_0$ regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and $\ell_0$ regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model pruning library to randomly select weights as opposed to sorting them based on magnitude.

For each model, we kept the number of training steps constant across all techniques and performed extensive hyperparameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and $\ell_0$ regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments.

Figure 1: Sparsity-BLEU trade-off curves for the Transformer. Top: Pareto frontiers for each of the four sparsification techniques applied to the Transformer. Bottom: All experimental results with each technique. Despite the diversity of approaches, the relative performance of all three techniques is remarkably consistent. Magnitude pruning notably outperforms more complex techniques for high levels of sparsity.

4 Sparse Neural Machine Translation

We adapted the Transformer (Vaswani et al., 2017) model for neural machine translation to use these four sparsification techniques, and trained the model on the WMT 2014 English-German dataset. We sparsified all fully-connected layers and embeddings, which make up 99.87% of all of the parameters in the model (the other parameters coming from biases and layer normalization). The constant hyperparameters used for all experiments are listed in Table 1. We followed the standard training procedure used by Vaswani et al. (2017), but did not perform checkpoint averaging. This setup yielded a baseline BLEU score of 27.29 averaged across five runs.

We extensively tuned the remaining hyperparameters for each technique. Details on what hyperparameters we explored, and the results of what settings produced the best models, can be found in Appendix D.

4.1 Sparse Transformer Results & Analysis

All results for the Transformer are plotted in Figure 1. Despite the vast differences in these approaches, the relative performance of all three techniques is remarkably consistent. While $\ell_0$ regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models. While all techniques were able to outperform the random pruning technique, randomly removing weights produces surprisingly reasonable results, which is perhaps indicative of the model's ability to recover from damage during optimization.

Figure 2: Average sparsity in Transformer layers. Distributions calculated on the top performing model at 90% sparsity for each technique. $\ell_0$ regularization and variational dropout are able to learn non-uniform distributions of sparsity, while magnitude pruning induces user-specified sparsity distributions (in this case, uniform).

What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights from each layer. This is in stark contrast to variational dropout and $\ell_0$ regularization, where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a non-uniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and $\ell_0$ regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.

Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both $\ell_0$ regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules, and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, $\ell_0$ regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.

It is also important to note that these results maintain a constant number of training steps across all techniques, and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than $\ell_0$ regularization and variational dropout respectively. While the standard Transformer training scheme produces excellent results for machine translation, it has been shown that training the model for longer can improve its performance by as much as 2 BLEU (Ott et al., 2018). Thus, when compared for a fixed training cost, magnitude pruning has a distinct advantage over these more complicated techniques.

5 Sparse Image Classification

To benchmark these four sparsity techniques on a large-scale computer vision task, we integrated each method into ResNet-50 and trained the model on the ImageNet large-scale image classification dataset. We sparsified all convolutional and fully-connected layers, which make up 99.79% of all of the parameters in the model (the other parameters coming from biases and batch normalization).

Table 2: Constant hyperparameters for all ResNet-50 experiments.

The hyperparameters we used for all experiments are listed in Table 2. Each model was trained for 128000 iterations with a batch size of 1024 images, stochastic gradient descent with momentum, and the standard learning rate schedule (see Appendix E.1). This setup yielded a baseline top-1 accuracy of 76.69% averaged across three runs. We trained each model with 8-way data parallelism across 8 accelerators. Due to the extra parameters and operations required for variational dropout, the model was unable to fit into device memory in this configuration. For all variational dropout experiments, we used a per-device batch size of 32 images and scaled the model over 32 accelerators.

Figure 3: Sparsity-accuracy trade-off curves for ResNet-50. Top: Pareto frontiers for variational dropout, magnitude pruning, and random pruning applied to ResNet-50. Bottom: All experimental results with each technique. We observe large variation in performance for variational dropout and $\ell_0$ regularization between Transformer and ResNet-50. Magnitude pruning and variational dropout achieve comparable performance for most sparsity levels, with variational dropout achieving the best results for high sparsity levels.

5.1 ResNet-50 Results & Analysis

Figure 3 shows results for magnitude pruning, variational dropout, and random pruning applied to ResNet-50. Surprisingly, we were unable to produce sparse ResNet-50 models with $\ell_0$ regularization that did not significantly damage model quality. Across hundreds of experiments, our models were either able to achieve full test set performance with no sparsification, or sparsification with test set performance akin to random guessing. Details on all hyperparameter settings explored are included in Appendix E.

This result is especially surprising given the success of $\ell_0$ regularization on Transformer. One nuance of the $\ell_0$ regularization technique of Louizos et al. (2017b) is that the model can have varying sparsity levels between the training and test-time versions of the model. At training time, a parameter with a dropout rate of 10% will be zero 10% of the time when sampled from the hard-concrete distribution. However, under the test-time parameter estimator, this weight will be non-zero. (The fraction of time a parameter is set to zero during training depends on other factors, e.g. the $\beta$ parameter of the hard-concrete distribution. The general point stands, however: the training and test-time sparsities are not necessarily equivalent, and there exists some dropout rate threshold below which a weight that is sometimes zero during training will be non-zero at test-time.) Louizos et al. (2017b) reported results applying $\ell_0$ regularization to a wide residual network (WRN) (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset, and noted that they observed small accuracy loss at as low as an 8% reduction in the number of parameters during training. Applying our weight-level $\ell_0$ regularization implementation to WRN produces a model with comparable training-time sparsity, but with no sparsity in the test-time parameters. For models that achieve test-time sparsity, we observe significant accuracy degradation on CIFAR-10. This result is consistent with our observation for $\ell_0$ regularization applied to ResNet-50 on ImageNet.
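The following small numeric illustration (ours, not from the paper) makes the training- versus test-time distinction concrete, assuming the commonly used hard-concrete parameters gamma = -0.1, zeta = 1.1, and beta = 2/3:

    # Sketch: training-time vs. test-time sparsity of a single hard-concrete gate.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    gamma, zeta, beta = -0.1, 1.1, 2.0 / 3.0   # assumed default shape parameters
    log_alpha = -1.0                            # one example gate parameter

    # Probability the sampled gate is exactly zero during training.
    p_zero_train = 1.0 - sigmoid(log_alpha - beta * np.log(-gamma / zeta))

    # Deterministic test-time gate: non-zero unless log_alpha is very negative.
    z_test = np.clip(sigmoid(log_alpha) * (zeta - gamma) + gamma, 0.0, 1.0)

    print(f"P(z == 0) during training: {p_zero_train:.3f}")  # ~0.36 here
    print(f"test-time gate value:      {z_test:.3f}")        # > 0, so kept at test time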

The variation in performance for variational dropout and $\ell_0$ regularization between Transformer and ResNet-50 is striking. While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind $\ell_0$ regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces models on-par with or better than magnitude pruning, and that $\ell_0$ regularization is not able to produce sparse models at all. Variational dropout achieved particularly notable results in the high sparsity range, maintaining a top-1 accuracy over 70% with less than 4% of the parameters of a standard ResNet-50.

The distribution of sparsity across different layer types in the best variational dropout and magnitude pruning models at 95% sparsity is plotted in Figure 4. While we kept sparsity constant across all layers for magnitude and random pruning, variational dropout significantly reduces the amount of sparsity induced in the first and last layers of the model.

Figure 4: Average sparsity in ResNet-50 layers. Distributions calculated on the top performing model at 95% sparsity for each technique. Variational dropout is able to learn non-uniform distributions of sparsity, decreasing sparsity in the input and output layers that are known to be disproportionately important to model quality.

It has been observed that the first and last layers are often disproportionately important to model quality (Han et al., 2015; Bellec et al., 2017). In the case of ResNet-50, the first convolution comprises only .037% of all the parameters in the model. At 98% sparsity the first layer has only 188 non-zero parameters, for an average of less than three parameters per output feature map. With magnitude pruning uniformly sparsifying each layer, it is surprising that it is able to achieve any test set performance at all with so few parameters in the input convolution.

While variational dropout is able to learn to distribute sparsity non-uniformly across the layers, it comes at a significant increase in resource requirements. For ResNet-50 trained with variational dropout we observed a greater than 2x increase in memory consumption. When scaled across 32 accelerators, ResNet-50 trained with variational dropout completed training in 9.75 hours, compared to ResNet-50 with magnitude pruning finishing in 12.50 hours on only 8 accelerators. Scaled to a 4096 batch size and 32 accelerators, ResNet-50 with magnitude pruning can complete the same number of epochs in just 3.15 hours.

5.2 Pushing the Limits of Magnitude Pruning

Given that a uniform distribution of sparsity is suboptimal, and the significantly smaller resource requirements for applying magnitude pruning to ResNet-50, it is natural to wonder how well magnitude pruning could perform if we were to distribute the non-zero weights more carefully and increase training time.

To understand the limits of the magnitude pruning heuristic, we modify our ResNet-50 training setup to leave the first convolutional layer fully dense, and only prune the final fully-connected layer to 80% sparsity. This heuristic is reasonable for ResNet-50, as the first layer makes up a small fraction of the total parameters in the model and the final layer makes up only .03% of the total FLOPs. While tuning the magnitude pruning ResNet-50 models, we observed that the best models always started and ended pruning during the third learning rate phase, before the second learning rate drop. To take advantage of this, we increase the number of training steps by 1.5x by extending this learning rate region. Results for ResNet-50 trained with this scheme are plotted in Figure 5.
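A sketch of how this per-layer allocation might be expressed in code; the layer names are hypothetical placeholders, not the actual variable names in our implementation:

    # Sketch (hypothetical layer names): first convolution left dense, final
    # fully-connected layer capped at 80%, all other layers at the global target.
    def layer_sparsity_targets(layer_names, global_sparsity):
        targets = {}
        for name in layer_names:
            if name == "conv1":            # first convolution: keep dense
                targets[name] = 0.0
            elif name == "final_dense":    # last fully-connected layer: at most 80%
                targets[name] = min(global_sparsity, 0.80)
            else:
                targets[name] = global_sparsity
        return targets

    print(layer_sparsity_targets(["conv1", "block1_conv", "final_dense"], 0.95))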

Figure 5: Sparsity-accuracy trade-off curves for ResNet-50 with the modified sparsification scheme. Altering the distribution of sparsity across the layers and increasing training time yield significant improvement for magnitude pruning.

With these modifications, magnitude pruning outperforms variational dropout at all but the highest sparsity levels while still using fewer resources. However, variational dropout's performance in the high sparsity range is particularly notable. With very low amounts of non-zero weights, we find it likely that the model's performance on the test set is closely tied to the precise allocation of weights across the different layers, and that variational dropout's ability to learn this distribution enables it to better maintain accuracy at high sparsity levels. This result indicates that efficient sparsification techniques that are able to learn the distribution of sparsity across layers are a promising direction for future work.

It's also worth noting that these changes produced models at 80% sparsity with a top-1 accuracy of 76.52%, only .17% off our baseline ResNet-50 accuracy and .41% better than the results reported by He et al. (2018), without the extra complexity and computational requirements of their reinforcement learning approach. This represents a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 trained on ImageNet.

6 Sparsification as Architecture Search

While sparsity is traditionally thought of as a model compression technique, two independent studies have recently suggested that the value of sparsification in neural networks is misunderstood, and that once a sparse topology is learned it can be trained from scratch to the full performance achieved when sparsification was performed jointly with optimization.

Frankle & Carbin (2018) posited that over-parameterized neural networks contain small, trainable subsets of weights, deemed "winning lottery tickets". They suggest that sparsity inducing techniques are methods for finding these sparse topologies, and that once found the sparse architectures can be trained from scratch with the same weight initialization that was used when the sparse architecture was learned. They demonstrated that this property holds across different convolutional neural networks and multi-layer perceptrons trained on the MNIST and CIFAR-10 datasets.

Liu et al. (2018) similarly demonstrated this phenomenon for a number of activation sparsity techniques on convolutional neural networks, as well as for weight level sparsity learned with magnitude pruning. However, they demonstrate this result using a random initialization during re-training.

The implications of being able to train sparse architectures from scratch once they are learned are large: once a sparse topology is learned, it can be saved and shared as with any other neural network architecture. Re-training can then be done fully sparse, taking advantage of sparse linear algebra to greatly accelerate time-to-solution. However, the combination of these two studies does not clearly establish how this potential is to be realized.

Beyond the question of whether or not the original random weight initialization is needed, both studies only explore convolutional neural networks (and small multi-layer perceptrons in the case of Frankle & Carbin (2018)). The majority of experiments in both studies also limited their analyses to the MNIST, CIFAR-10, and CIFAR-100 datasets. While these are standard benchmarks for deep learning models, they are not indicative of the complexity of real-world tasks where model compression is most useful. Liu et al. (2018) do explore convolutional architectures on the ImageNet dataset, but only at two relatively low sparsity levels (30% and 60%). They also note that weight level sparsity on ImageNet is the only case where they are unable to reproduce the full accuracy of the pruned model.

To analyze the questions surrounding the idea of sparsification as a form of neural architecture search, we repeat the experiments of Frankle & Carbin (2018) and Liu et al. (2018) on ResNet-50 and Transformer. For each model, we explore the full range of sparsity levels (50% - 98%) and compare to our well-tuned models from the previous sections.

Figure 6: Scratch and lottery ticket experiments with magnitude pruning. Top: Results with Transformer. Bottom: Results with ResNet-50. Across all experiments, training from scratch using a learned sparse architecture is unable to reproduce the performance of models trained with sparsification as part of the optimization process.

6.1 Experimental Framework

The experiments of Liu et al. (2018) involve taking the final learned weight mask from a magnitude pruning model, randomly re-initializing the weights, and training the model with the normal training procedure (i.e., learning rate, number of iterations, etc.). To account for the presence of sparsity at the start of training, they scale the variance of the initial weight distribution by the number of non-zeros in the matrix. They additionally train a variant where they increase the number of training steps (up to a factor of 2x) such that the re-trained model uses approximately the same number of FLOPs during training as a model trained with sparsification as part of the optimization process. They refer to these two experiments as "scratch-e" and "scratch-b" respectively.
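A minimal sketch of the sparse re-initialization described above; the exact initializer is an assumption (a He-style Gaussian computed over the surviving fan-in), not necessarily the one used by Liu et al. (2018):

    # Sketch: re-initialize only the unmasked weights, scaling the initializer
    # variance by the number of non-zeros rather than the dense fan-in.
    import numpy as np

    def sparse_reinit(mask, rng):
        """mask: boolean array of shape (fan_in, fan_out); True = weight kept."""
        # Effective fan-in per output unit, based on surviving connections.
        nonzero_fan_in = np.maximum(mask.sum(axis=0), 1)
        std = np.sqrt(2.0 / nonzero_fan_in)          # He-style init over non-zeros (assumption)
        w = rng.standard_normal(mask.shape) * std    # std broadcasts per output unit
        return w * mask                              # pruned positions stay zero

    rng = np.random.default_rng(0)
    mask = rng.random((512, 512)) > 0.9              # a 90%-sparse mask
    w0 = sparse_reinit(mask, rng)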

Frankle & Carbin (2018) follow a similar procedure, but use the same weight initialization that was used when the sparse weight mask was learned and do not perform the longer training time variant.

For our experiments, we repeat the scratch-e, scratch-b and lottery ticket experiments with magnitude pruning on Transformer and ResNet-50. For scratch-e and scratch-b, we also train variants that do not alter the initial weight distribution. For the Transformer, we re-trained five replicas of the best magnitude pruning hyperparameter settings at each sparsity level and saved the weight initialization and final sparse weight mask. For each of the five learned weight masks, we train five identical replicas for the scratch-e, scratch-b, scratch-e with augmented initialization, scratch-b with augmented initialization, and the lottery ticket experiments. For ResNet-50, we followed the same procedure with three re-trained models and three replicas at each sparsity level for each of the five experiments. Figure 6 plots the averages and min/max of all experiments at each sparsity level. (Two of the 175 Transformer experiments failed to train from scratch at all and produced BLEU scores less than 1.0. We omit these outliers in Figure 6.)

6.2 Scratch and Lottery Ticket Results & Analysis

Across all of our experiments, we observed that training from scratch using a learned sparse architecture is not able to match the performance of the same model trained with sparsification as part of the optimization process.

Across both models, we observed that doubling the number of training steps did improve the quality of the results for the scratch experiments, but was not sufficient to match the test set performance of the magnitude pruning baseline. As sparsity increased, we observed that the difference between the models trained with magnitude pruning and those trained from scratch increased. For both models, we did not observe a benefit from using the augmented weight initialization for the scratch experiments.

For ResNet-50, we experimented with four different learning rate schemes for the scratch-b experiments. We found that scaling each learning rate region to double the number of epochs produced the best results by a wide margin. These results are plotted in Figure 6. Results for the ResNet-50 scratch-b experiments with the other learning rate variants are included with our release of hyperparameter tuning results.

For the lottery ticket experiments, we were not able to replicate the phenomenon observed by Frankle & Carbin (2018). The key difference between our experiments is the complexity of the tasks and scale of the models, and it seems likely that this is the main factor contributing to our inability to train these architectures from scratch.

For the scratch experiments, our results are consistent with the negative result observed by Liu et al. (2018) for ImageNet and ResNet-50 with unstructured weight pruning. By replicating the scratch experiments at the full range of sparsity levels, we observe that the quality of the models degrades relative to the magnitude pruning baseline as sparsity increases. For unstructured weight sparsity, it seems likely that the phenomenon observed by Liu et al. (2018) was produced by a combination of low sparsity levels and small-to-medium sized tasks. We'd like to emphasize that this result is only for unstructured weight sparsity, and that prior work (Liu et al., 2018) provides strong evidence that activation pruning behaves differently.

7 Limitations of This Study

Hyperparameter exploration. For all techniques and models, we carefully hand-tuned hyperparameters and performed extensive sweeps encompassing thousands of experiments over manually identified ranges of values. However, the number of possible settings vastly outnumbers the set of values that can be practically explored, and we cannot eliminate the possibility that some techniques significantly outperform others under settings we did not try.

Neural architectures and datasets. Transformer and ResNet-50 were chosen as benchmark tasks to represent a cross section of large-scale deep learning tasks with diverse architectures. We can't exclude the possibility that some techniques achieve consistently high performance across other architectures. More models and tasks should be thoroughly explored in future work.

8 Conclusion

In this work, we performed an extensive evaluation of three state-of-the-art sparsification techniques on two large-scale learning tasks. Notwithstanding the limitations discussed in section 7, we demonstrated that complex techniques shown to yield state-of-the-art compression on small datasets perform inconsistently, and that simple heuristics can achieve comparable or better results on a reduced computational budget. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 with only magnitude pruning and highlight promising directions for research in sparsity inducing techniques.

Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression. As such, we open-source our code, checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work.

Acknowledgements

We would like to thank Benjamin Caine, Jonathan Frankle, Raphael Gontijo Lopes, Sam Greydanus, and Keren Gu for helpful discussions and feedback on drafts of this paper.

References

  • Bellec et al. (2017) Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. Deep Rewiring: Training Very Sparse Deep Networks. CoRR, abs/1711.05136, 2017.
  • Collins & Kohli (2014) Collins, M. D. and Kohli, P. Memory Bounded Deep Convolutional Networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442.
  • Dai et al. (2018) Dai, B., Zhu, C., and Wipf, D. P. Compressing Neural Networks using the Variational Information Bottleneck. CoRR, abs/1802.10399, 2018.
  • Frankle & Carbin (2018) Frankle, J. and Carbin, M. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. CoRR, abs/1803.03635, 2018. URL http://arxiv.org/abs/1803.03635.
  • Gray et al. (2017) Gray, S., Radford, A., and Kingma, D. P. Block-sparse GPU kernels. https://blog.openai.com/block-sparse-gpu-kernels/, 2017.
  • Guo et al. (2016) Guo, Y., Yao, A., and Chen, Y. Dynamic Network Surgery for Efficient DNNs. In NIPS, 2016.
  • Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both Weights and Connections for Efficient Neural Network. In NIPS, pp. 1135–1143, 2015.
  • Hassibi & Stork (1992) Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, pp. 164–171. Morgan Kaufmann, 1992.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016.
  • He et al. (2018) He, Y., Lin, J., Liu, Z., Wang, H., Li, L., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815–832, 2018.
  • Hestness et al. (2017) Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017.
  • Kalchbrenner et al. (2018) Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2415–2424, 2018.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
  • Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015.
  • LeCun et al. (1989) LeCun, Y., Denker, J. S., and Solla, S. A. Optimal Brain Damage. In NIPS, pp. 598–605. Morgan Kaufmann, 1989.
  • Lin et al. (2017) Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In NIPS, pp. 2178–2188, 2017.
  • Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2755–2763, 2017.
  • Liu et al. (2018) Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the Value of Network Pruning. CoRR, abs/1810.05270, 2018.
  • Louizos et al. (2017a) Louizos, C., Ullrich, K., and Welling, M. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 3290–3300, 2017a.
  • Louizos et al. (2017b) Louizos, C., Welling, M., and Kingma, D. P. Learning Sparse Neural Networks through L0 Regularization. CoRR, abs/1712.01312, 2017b.
  • Luo et al. (2017) Luo, J., Wu, J., and Lin, W. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5068–5076, 2017.
  • Mitchell & Beauchamp (1988) Mitchell, T. J. and Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
  • Mocanu et al. (2018) Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science. Nature Communications, 2018.
  • Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2498–2507, 2017.
  • Molchanov et al. (2016) Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR, abs/1611.06440, 2016.
  • Narang et al. (2017) Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. Exploring Sparsity in Recurrent Neural Networks. CoRR, abs/1704.05119, 2017.
  • Ott et al. (2018) Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 1–9, 2018.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pp. 1278–1286. JMLR.org, 2014.
  • Ström (1997) Ström, N. Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks. In EUROSPEECH, 1997.
  • Theis et al. (2018) Theis, L., Korshunova, I., Tejani, A., and Huszár, F. Faster gaze prediction with dense networks and Fisher pruning. CoRR, abs/1801.05787, 2018. URL http://arxiv.org/abs/1801.05787.
  • Ullrich et al. (2017) Ullrich, K., Meeds, E., and Welling, M. Soft Weight-Sharing for Neural Network Compression. CoRR, abs/1702.04008, 2017.
  • Valin & Skoglund (2018) Valin, J. and Skoglund, J. LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR, abs/1810.11846, 2018. URL http://arxiv.org/abs/1810.11846.
  • van den Oord et al. (2016) van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, pp. 125, 2016.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010, 2017.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
  • Zhu & Gupta (2017) Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017. URL http://arxiv.org/abs/1710.01878.

Appendix A Overview of Sparsity Inducing Techniques

Here we provide a more detailed review of the three sparsity techniques we benchmarked.

A.1 Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. Many variants have been proposed (Collins & Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017), with the key differences lying in when weights are removed, whether weights should be sorted to remove a precise proportion or thresholded based on a fixed or decaying value, and whether or not weights that have been pruned still receive gradient updates and have the potential to return after being pruned.

Han et al. (2015) use iterative magnitude pruning and re-training to progressively sparsify a model. The target model is first trained to convergence, after which a portion of weights are removed and the model is re-trained with these weights fixed to zero. This process is repeated until the target sparsity is achieved. Guo et al. (2016) improve on this approach by allowing masked weights to still receive gradient updates, enabling the network to recover from incorrect pruning decisions during optimization. They achieve higher compression rates and interleave pruning steps with gradient update steps to avoid expensive re-training. Zhu & Gupta (2017) similarly allow gradient updates to masked weights, and make use of a gradual sparsification schedule with sorting-based weight thresholding to maintain accuracy while achieving a user specified level of sparsification.
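For reference, the gradual sparsification schedule of Zhu & Gupta (2017), in their notation, ramps the sparsity $s_t$ from an initial value $s_i$ (typically 0) to the final value $s_f$ over $n$ pruning steps applied every $\Delta t$ steps starting at step $t_0$:

$$ s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\,\Delta t}\right)^{3}, \qquad t \in \{t_0,\ t_0 + \Delta t,\ \ldots,\ t_0 + n\,\Delta t\}. $$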

It's worth noting that magnitude pruning can easily be adapted to induce block or activation level sparsity by removing groups of weights based on their p-norm, average, max, or other statistics. Variants have also been proposed that maintain a constant level of sparsity during optimization to enable accelerated training (Mocanu et al., 2018).

A.2 Variational Dropout

Consider the setting of a dataset $\mathcal{D}$ of $N$ i.i.d. samples $(\mathbf{x}, \mathbf{y})$ and a standard classification problem where the goal is to learn the parameters $\mathbf{w}$ of the conditional probability $p(\mathbf{y}|\mathbf{x}, \mathbf{w})$. Bayesian inference combines some initial belief over the parameters $\mathbf{w}$ in the form of a prior distribution $p(\mathbf{w})$ with observed data $\mathcal{D}$ into an updated belief over the parameters in the form of the posterior distribution $p(\mathbf{w}|\mathcal{D})$. In practice, computing the true posterior using Bayes' rule is computationally intractable and good approximations are needed. In variational inference, we optimize the parameters $\phi$ of some parameterized model $q_{\phi}(\mathbf{w})$ such that $q_{\phi}(\mathbf{w})$ is a close approximation to the true posterior distribution $p(\mathbf{w}|\mathcal{D})$ as measured by the Kullback-Leibler divergence between the two distributions. The divergence of our approximate posterior from the true posterior is minimized in practice by maximizing the variational lower-bound

$$ \mathcal{L}(\phi) = -D_{KL}\left(q_{\phi}(\mathbf{w}) \,\|\, p(\mathbf{w})\right) + L_{\mathcal{D}}(\phi), \qquad \text{where} \qquad L_{\mathcal{D}}(\phi) = \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \mathbb{E}_{q_{\phi}(\mathbf{w})}\left[\log p(\mathbf{y}|\mathbf{x}, \mathbf{w})\right]. $$

Using the Stochastic Gradient Variational Bayes (SGVB) (Kingma et al., 2015) algorithm to optimize this bound, $L_{\mathcal{D}}(\phi)$ reduces to the standard cross-entropy loss, and the KL divergence between our approximate posterior and prior over the parameters serves as a regularizer that enforces our initial belief about the parameters $\mathbf{w}$.

In the standard formulation of variational dropout, we assume the weights are drawn from a fully-factorized Gaussian approximate posterior

$$ w_{ij} \sim q_{\phi}(w_{ij}) = \mathcal{N}\left(\theta_{ij},\ \alpha_{ij}\theta_{ij}^2\right), $$

where $\theta$ and $\alpha$ are neural network parameters. For each training step, we sample weights from this distribution and use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) to differentiate the loss w.r.t. the parameters through the sampling operation. Given the weights are normally distributed, the distribution of the activations $B$ after a linear operation like matrix multiplication or convolution is also Gaussian and can be calculated in closed form (we ignore correlation in the activations, as is done by Molchanov et al. (2017)):

$$ B_{mj} \sim \mathcal{N}\left(\gamma_{mj},\ \delta_{mj}\right), \qquad \gamma_{mj} = \sum_{i} a_{mi}\theta_{ij}, \qquad \delta_{mj} = \sum_{i} a_{mi}^2\alpha_{ij}\theta_{ij}^2, $$

where $a_{mi} \in A$ are the inputs to the layer. Thus, rather than sample weights, we can directly sample the activations at each layer. This step is known as the local reparameterization trick, and was shown by Kingma et al. (2015) to reduce the variance of the gradients relative to the standard formulation, in which a single set of sampled weights must be shared for all samples in the input batch for efficiency. Molchanov et al. (2017) showed that the variance of the gradients could be further reduced by using an additive noise reparameterization, where we define a new parameter

$$ \sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2. $$

Under this parameterization, we directly optimize the mean and variance of the neural network parameters.

Under the assumption of a log-uniform prior on the weights $w$, the KL divergence component of our objective function can be accurately approximated as a function of $\alpha$ (Molchanov et al., 2017).

After training a model with variational dropout, the weights with the highest $\alpha$ values can be removed. For all their experiments, Molchanov et al. (2017) removed weights with $\log \alpha$ larger than 3.0, which corresponds to a dropout rate greater than 95%. Although they demonstrated good results, it is likely that the optimal threshold varies across different models and even different hyperparameter settings of the same model. We address this question in our experiments.
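As an illustration (ours, not the authors' TensorFlow code), the forward pass and pruning rule above can be sketched in NumPy as follows; the shapes and example parameter values are assumptions, while the local reparameterization and the log-alpha threshold of 3.0 follow the description above:

    # Minimal sketch: a variational-dropout dense layer with the local
    # reparameterization trick, plus the log-alpha pruning rule after training.
    import numpy as np

    def vd_dense_forward(x, theta, log_sigma2, rng, training=True):
        """x: (batch, d_in); theta, log_sigma2: (d_in, d_out)."""
        if not training:
            return x @ theta
        # Local reparameterization: sample the Gaussian pre-activations directly.
        mean = x @ theta
        var = (x ** 2) @ np.exp(log_sigma2)
        return mean + np.sqrt(var + 1e-8) * rng.standard_normal(mean.shape)

    def prune_by_log_alpha(theta, log_sigma2, threshold=3.0):
        """Zero weights with high learned dropout: log alpha = log sigma^2 - log theta^2."""
        log_alpha = log_sigma2 - np.log(theta ** 2 + 1e-8)
        return theta * (log_alpha < threshold)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 32))
    theta = rng.standard_normal((32, 16)) * 0.1
    log_sigma2 = np.full((32, 16), -5.0)     # sigma^2 = alpha * theta^2 (additive noise form)
    y = vd_dense_forward(x, theta, log_sigma2, rng)
    theta_sparse = prune_by_log_alpha(theta, log_sigma2)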

A.3 $\ell_0$ Regularization

To optimize the $\ell_0$-norm, we reparameterize the model weights $\theta$ as the product of a weight and a random gate variable drawn from the hard-concrete distribution:

$$ \theta_j = \tilde{\theta}_j z_j, \qquad z_j = \min\left(1, \max(0, \bar{s})\right), \qquad \bar{s} = s(\zeta - \gamma) + \gamma, \qquad s = \mathrm{sigmoid}\left(\frac{\log u - \log(1 - u) + \log \alpha_j}{\beta}\right), \qquad u \sim \mathcal{U}(0, 1). $$

In this formulation, the $\alpha_j$ parameter that controls the position of the hard-concrete distribution (and thus the probability that $z_j$ is zero) is optimized with gradient descent. $\beta$, $\gamma$, and $\zeta$ are fixed parameters that control the shape of the hard-concrete distribution. $\beta$ controls the curvature or temperature of the hard-concrete probability density function, and $\gamma$ and $\zeta$ stretch the distribution s.t. $z_j$ takes value 0 or 1 with non-zero probability.

On each training iteration, $z_j$ is sampled from this distribution and multiplied with the standard neural network weights. The expected $\ell_0$-norm can then be calculated using the cumulative distribution function of the hard-concrete distribution and optimized directly with stochastic gradient descent:

$$ \mathcal{L}_{C} = \sum_{j=1}^{|\theta|}\left(1 - Q(s_j \leq 0 \mid \phi)\right) = \sum_{j=1}^{|\theta|} \mathrm{sigmoid}\left(\log \alpha_j - \beta \log \frac{-\gamma}{\zeta}\right). $$

At test-time, Louizos et al. (2017b) use the following estimate for the model parameters:

$$ \theta^{*} = \tilde{\theta}^{*} \odot \hat{z}, \qquad \hat{z} = \min\left(1, \max\left(0, \mathrm{sigmoid}(\log \alpha)(\zeta - \gamma) + \gamma\right)\right). $$

Interestingly, Louizos et al. (2017b) showed that their objective function under the $\ell_0$ penalty is a special case of a variational lower-bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior.
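A minimal NumPy sketch (ours, not the authors' code) of the hard-concrete gate, the expected $\ell_0$ penalty, and the test-time estimator, assuming the default shape parameters $\gamma = -0.1$, $\zeta = 1.1$, and $\beta = 2/3$:

    # Minimal sketch of the hard-concrete gate machinery described above.
    import numpy as np

    GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0   # assumed default shape parameters

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_gates(log_alpha, rng):
        """Sample stochastic gates z in [0, 1] for the training-time forward pass."""
        u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
        s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
        return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

    def expected_l0(log_alpha):
        """Expected number of non-zero gates; the differentiable sparsity penalty."""
        return np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)))

    def test_time_gates(log_alpha):
        """Deterministic gate estimate used at evaluation time."""
        return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

    rng = np.random.default_rng(0)
    log_alpha = rng.standard_normal((32, 16))
    theta = rng.standard_normal((32, 16)) * 0.1
    w_train = theta * sample_gates(log_alpha, rng)   # stochastic sparse weights
    penalty = expected_l0(log_alpha)                 # added to the loss, scaled by a coefficient
    w_test = theta * test_time_gates(log_alpha)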

Appendix B Variational Dropout Implementation Verification

To verify our implementation of variational dropout, we applied it to LeNet-300-100 and LeNet-5-Caffe on MNIST and compared our results to the original paper (Molchanov et al., 2017). We matched our hyperparameters to those used in the code released with the paper (https://github.com/ars-ashuha/variational-dropout-sparsifies-dnn). All results are listed in Table 3.

Table 3: Variational Dropout MNIST Reproduction Results.

Our baseline LeNet-300-100 model achieved a test set accuracy of 98.42%, slightly higher than the baseline of 98.36% reported in (Molchanov et al., 2017). Applying our variational dropout implementation to LeNet-300-100 with these hyperparameters produced a model with 97.52% global sparsity and 98.42% test accuracy. The original paper produced a model with 98.57% global sparsity and 98.08% test accuracy. While our model achieves .34% higher test accuracy with 1% lower sparsity, we believe the discrepancy is mainly due to differences in our software packages: the authors of (Molchanov et al., 2017) used Theano and Lasagne for their experiments, while we use TensorFlow.

Given that our model achieves higher accuracy, we can decrease the pruning threshold to trade accuracy for more sparsity. With a threshold of 2.0, our model achieves 98.5% global sparsity with a test set accuracy of 98.40%. With a threshold of 0.1, our model achieves 99.1% global sparsity with 98.13% test set accuracy, exceeding the sparsity and accuracy of the originally published results.

On LeNet-5-Caffe, our implementation achieved a global sparsity of 99.29% with a test set accuracy of 99.26%, versus the originally published results of 99.6% sparsity with 99.25% accuracy. Lowering the threshold to 2.0, our model achieves 99.5% sparsity with 99.25% test accuracy.

Appendix C $\ell_0$ Regularization Implementation Verification

The original $\ell_0$ regularization paper uses a modified version of the proposed technique for inducing group sparsity in models, so our weight-level implementation is not directly comparable. However, to verify our implementation we trained a Wide ResNet (WRN) (Zagoruyko & Komodakis, 2016) on CIFAR-10 and compared results to those reported in the original publication for group sparsity.

Figure 7: Forward pass FLOPs for WRN-28-10 trained with $\ell_0$ regularization. Our implementation achieves FLOPs reductions comparable to those reported in Louizos et al. (2017b).

As done by Louizos et al. (2017b), we apply $\ell_0$ regularization to the first convolutional layer in the residual blocks (i.e., where dropout would normally be used). We use the weight decay formulation for the re-parameterized weights, and scale the weight decay coefficient to maintain the same initial length scale of the parameters. We use the same batch size of 128 samples and the same initial log $\alpha$, and train our model on a single GPU.

Our baseline WRN-28-10 implementation trained on CIFAR-10 achieved a test set accuracy of 95.45%. Using our $\ell_0$ regularization implementation and an $\ell_0$-norm weight of .0003, we trained a model that achieved 95.34% accuracy on the test set while achieving a consistent training-time FLOPs reduction comparable to that reported by Louizos et al. (2017b). The floating-point operations (FLOPs) required to compute the forward pass over the course of training WRN-28-10 are plotted in Figure 7.

During our re-implementation of the WRN experiments from Louizos et al. (2017b), we identified errors in the original publication's FLOP calculations that caused the number of floating-point operations in WRN-28-10 to be miscalculated. We've contacted the authors, and hope to resolve this issue to clarify their performance results.

Appendix D Sparse Transformer Experiments

D.1 Magnitude Pruning Details

For our magnitude pruning experiments, we tuned four key hyperparameters: the starting iteration of the sparsification process, the ending iteration of the sparsification process, the frequency of pruning steps, and the combination of other regularizers (dropout and label smoothing) used during training. We trained models with seven different target sparsities: 50%, 60%, 70%, 80%, 90%, 95%, and 98%. At each of these sparsity levels, we tried pruning frequencies of 1000 and 10000 steps. During preliminary experiments we identified that the best settings for the training step at which to end pruning were typically closer to the end of training. Based on this insight, we explored every possible combination of start and end points for the sparsity schedule in increments of 100000 steps with an ending step of 300000 or greater.

By default, the Transformer uses dropout with a dropout rate of 10% on the input to the encoder, decoder, and before each layer, and performs label smoothing with a smoothing parameter of .1. We found that decreasing these other regularizers produced higher quality models in the mid to high sparsity range. For each hyperparameter combination, we tried three different regularization settings: standard label smoothing and dropout, label smoothing only, and no regularization.

D.2 Variational Dropout Details

For the Transformer trained with variational dropout, we extensively tuned the coefficient for the KL divergence component of the objective function to find models that achieved high accuracy with sparsity levels in the target range. We found that KL divergence weights scaled inversely with the number of samples in the training set produced models in our target sparsity range.

Molchanov et al. (2017) noted difficulty training some models from scratch with variational dropout, as large portions of the model adopt high dropout rates early in training before the model can learn a useful representation from the data. To address this issue, they use a gradual ramp-up of the KL divergence weight, linearly increasing the regularizer coefficient until it reaches the desired value.

For our experiments, we explored using a constant regularizer weight, linearly increasing the regularizer weight, and also increasing the regularizer weight following the cubic sparsity function used with magnitude pruning. For the linear and cubic weight schedules, we tried each combination of possible start and end points in increments of 100000 steps. For each hyperparameter combination, we also tried the three different combinations of dropout and label smoothing as with magnitude pruning. For each trained model, we evaluated the model at eleven different log α thresholds. For all experiments, we initialized all log σ² parameters to the same constant value.
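
A sketch of this evaluation step, assuming each layer stores weight means theta and log σ² values: weights whose log α exceeds the chosen threshold are zeroed and the resulting sparsity is measured. The shapes and threshold values below are illustrative.

```python
import numpy as np

def sparsify(theta, log_sigma2, log_alpha_threshold):
    """Zero out weights with high dropout rates and report the induced sparsity."""
    log_alpha = log_sigma2 - np.log(np.square(theta) + 1e-8)
    mask = (log_alpha < log_alpha_threshold).astype(theta.dtype)
    return theta * mask, 1.0 - mask.mean()

rng = np.random.default_rng(0)
theta = rng.normal(size=(512, 512)).astype(np.float32)
log_sigma2 = rng.normal(loc=-8.0, scale=2.0, size=(512, 512)).astype(np.float32)
for threshold in (0.0, 0.5, 1.0, 2.0, 3.0):
    _, sparsity = sparsify(theta, log_sigma2, threshold)
    print(threshold, round(float(sparsity), 3))
```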

D.3 L0 Regularization Details

For Transformers trained with L0 regularization, we similarly tuned the coefficient for the L0-norm in the objective function. We observed that much higher magnitude regularization coefficients were needed to produce models with the same sparsity levels relative to variational dropout. We found that L0-norm weights in this higher range produced models in our target sparsity range.

For all experiments, we used the default settings for the parameters of the hard-concrete distribution: β = 2/3, γ = -0.1, and ζ = 1.1. We initialized the log α parameters to the value corresponding to a 10% dropout rate.
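
For concreteness, one common convention (an assumption here, since the text does not spell it out) maps a dropout rate p to an initial log α of log((1 - p)/p); under the defaults above, the deterministic test-time gate then starts close to fully on.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1    # hard-concrete defaults listed above

def init_log_alpha(dropout_rate):
    # Assumed convention: larger log_alpha makes the gate more likely to be active.
    return np.log((1.0 - dropout_rate) / dropout_rate)

def test_time_gate(log_alpha):
    # Deterministic gate used at evaluation time.
    return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

la = init_log_alpha(0.10)
print(la, test_time_gate(la))   # ~2.20 and ~0.98, i.e., nearly open gates at initialization
```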

For each hyperparameter setting, we explored the three regularizer coefficient schedules used with variational dropout and each of the three combinations of dropout and label smoothing.

D.4 Random Pruning Details

We identified in preliminary experiments that random pruning typically produces the best results by starting and ending pruning early and allowing the model to finish the remainder of the training steps with the final sparse weight mask. For our experiments, we explored all hyperparameter combinations that we explored with magnitude pruning, and also included start/end pruning step combinations with an end step of less than 300000.
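
A minimal sketch of the random pruning variant: the scheduled sparsity is reached by zeroing positions chosen uniformly at random rather than by magnitude. For simplicity the mask is re-drawn at each update here, whereas a monotonically growing mask is another reasonable reading; the shapes and sparsity value are illustrative.

```python
import numpy as np

def random_prune_mask(weights, sparsity, rng):
    """Return a 0/1 mask that zeroes a random `sparsity` fraction of the weights."""
    num_zero = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=weights.dtype)
    mask[rng.choice(weights.size, size=num_zero, replace=False)] = 0.0
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)
w = w * random_prune_mask(w, 0.8, rng)   # 80% of positions zeroed at random
```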

Appendix E Sparse ResNet-50

E.1 Learning Rate

For all experiments, we used the learning rate scheme from the official TensorFlow ResNet-50 implementation (https://bit.ly/2Wd2Lk0). With our batch size of 1024, this includes a linear ramp-up for 5 epochs to a learning rate of .4, followed by learning rate drops by a factor of 0.1 at epochs 30, 60, and 80.
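
The schedule described above can be written down directly; the peak learning rate and epoch boundaries come from the text, while the per-epoch (rather than per-step) warm-up granularity is a simplification.

```python
def resnet50_learning_rate(epoch, base_lr=0.4, warmup_epochs=5):
    """Linear warm-up to base_lr over 5 epochs, then x0.1 drops at epochs 30, 60, and 80."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    num_drops = sum(1 for boundary in (30, 60, 80) if epoch >= boundary)
    return base_lr * (0.1 ** num_drops)

# e.g., epoch 0 -> 0.08, epoch 10 -> 0.4, epoch 35 -> 0.04, epoch 85 -> 0.0004
```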

E.2 Magnitude Pruning Details

For magnitude pruning on ResNet-50, we trained models with target sparsities of 50%, 70%, 80%, 90%, 95%, and 98%. At each sparsity level, we tried starting pruning at steps 8k, 20k, and 40k. For each potential starting point, we tried ending pruning at steps 68k, 76k, and 100k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k steps and explored training with and without label smoothing. During preliminary experiments, we observed that removing weight decay from the model consistently caused significant decreases in test accuracy. Thus, for all hyperparameter combinations, we left weight decay on with the standard coefficient.
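
A sketch of the per-layer magnitude pruning update applied at each pruning step, with an option to leave the first layer dense (relevant to the 98% sparsity instability discussed next); the target sparsity at each step would come from the gradual schedule sketched in Appendix D.1. Function and argument names are illustrative.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of a layer's weights."""
    k = int(round(sparsity * weights.size))
    if k <= 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def prune_model(layer_weights, sparsity, skip_first_layer=False):
    """layer_weights: list of weight arrays; returns one 0/1 mask per layer."""
    masks = []
    for i, w in enumerate(layer_weights):
        if skip_first_layer and i == 0:
            masks.append(np.ones_like(w))   # leave the first convolution dense
        else:
            masks.append(magnitude_prune_mask(w, sparsity))
    return masks
```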

For a target sparsity of 98%, we observed that very few hyperparameter combinations were able to complete training without failing due to numerical issues. Out of all the hyperparameter configurations we tried, only a single model was able to complete training without erroring from the presence of NaNs. As explained in the main text, at high sparsity levels the first layer of the model has very few non-zero parameters, leading to instability during training and low test set performance. Pruned ResNet-50 models with the first layer left dense did not exhibit these issues.

E.3 Variational Dropout Details

For variational dropout applied to ResNet-50, we explored the same combinations of start and end points for the KL divergence weight ramp-up as we did for the start and end points of magnitude pruning. Across all Transformer experiments, we did not observe a significant gain from using a cubic KL divergence weight ramp-up schedule, and thus only explored the linear ramp-up for ResNet-50. For each combination of start and end points for the KL divergence weight, we explored nine different coefficients for the KL divergence loss term: .01 / N, .03 / N, .05 / N, .1 / N, .3 / N, .5 / N, 1 / N, 10 / N, and 100 / N.

Contrary to our experience with Transformer, we found ResNet-50 with variational dropout to be highly sensitive to the initialization of the log σ² parameters. With the standard setting of -10, we couldn't match the baseline accuracy, and with an initialization of -20 our models achieved good test performance but no sparsity. After some experimentation, we were able to produce good results with an initialization of -15.

While with Transformer we saw a reasonable amount of variance in test set performance and sparsity with the same model evaluated at different log α thresholds, we did not observe the same phenomenon for ResNet-50. Across a range of log α threshold values, we saw consistent accuracy and nearly identical sparsity levels. For all of the results reported in the main text, we used a log α threshold of 0.5, which we found to produce slightly better results than the standard threshold of 3.0.

E.4 L0 Regularization Details

For L0 regularization, we explored four different initial log α values corresponding to dropout rates of 1%, 5%, 10%, and 30%. For each dropout rate, we extensively tuned the L0-norm weight to produce models in the desired sparsity range. After identifying the proper range of L0-norm coefficients, we ran experiments with 20 different coefficients in that range. For each combination of these hyperparameters, we tried all four combinations of other regularizers: standard weight decay and label smoothing, only weight decay, only label smoothing, and no regularization. For weight decay, we used the formulation for the re-parameterized weights provided in the original paper, and followed their approach of scaling the weight decay coefficient based on the initial dropout rate to maintain a constant length-scale between the L0-regularized model and the standard model.

Across all of these experiments, we were unable to produce ResNet models that achieved test set performance better than random guessing. For all experiments, we observed that training proceeded reasonably normally until the L0-norm loss began to drop, at which point the model incurred severe accuracy loss. We include the results of all hyperparameter combinations in our data release.

Additionally, we tried a number of tweaks to the learning process to improve the results, to no avail. We explored training the model for twice the number of epochs, training with much higher initial dropout rates, modifying the parameters of the hard-concrete distribution, and a modified test-time parameter estimator.

E.5 Random Pruning Details

For random pruning on ResNet-50, we shifted the set of possible start and end points for pruning earlier in training relative to those we explored for magnitude pruning. At each of the sparsity levels tried with magnitude pruning, we tried starting pruning at steps 0, 8k, and 20k. For each potential starting point, we tried ending pruning at steps 40k, 68k, and 76k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k and explored training with and without label smoothing.

E.6 Scratch-B Learning Rate Variants

For the scratch-b (Liu et al., 2022) experiments with ResNet-50, we explored four different learning rate schemes for the extended training time (2x the default number of epochs).

The first learning rate scheme we explored was uniformly scaling each of the five learning rate regions to last for double the number of epochs. This setup produced the best results by a wide margin. We report these results in the main text.
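
Under this first scheme, every region of the standard schedule from Appendix E.1 simply doubles in length; a sketch, assuming the same 0.4 peak learning rate and a default run length of roughly 90 epochs:

```python
def scratch_b_scaled_lr(epoch, base_lr=0.4, scale=2):
    """Uniformly stretched schedule: 10 warm-up epochs, then x0.1 drops at 60, 120, and 160."""
    warmup_epochs = 5 * scale
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    num_drops = sum(1 for boundary in (30 * scale, 60 * scale, 80 * scale) if epoch >= boundary)
    return base_lr * (0.1 ** num_drops)

# With a default run of ~90 epochs, this scheme trains for ~180 epochs in total.
```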

The second learning rate scheme was to keep the standard learning rate schedule and maintain the final learning rate for the extra training steps, as is common when fine-tuning deep neural networks. The third learning rate scheme was to maintain the standard learning rate schedule and continually drop the learning rate by a factor of 0.1 every 30 epochs. The last scheme we explored was to skip the learning rate warm-up and drop the learning rate by 0.1 every 30 epochs. This learning rate scheme is closest to the one used by Liu et al. (2018). We found that this scheme underperformed relative to the scaled learning rate scheme with our training setup.

Results for all learning rate schemes are included with the released hyperparameter tuning data.
