SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
MLSys
ML
Abstract
Obtaining state-of-the-art accuracy requires training large neural networks with hundreds of millions of parameters to maintain enough capacity to memorize the growing data volumes, which makes training computationally expensive.
Specialized hardware for model training is expensive and hard to generalize to a wide range of tasks, and algorithmic progress has so far failed to demonstrate a direct advantage over powerful hardware such as the NVIDIA V100 GPU.
This paper proposes SLIDE (Sub-Linear Deep learning Engine), which uniquely blends smart randomized algorithms with multi-core parallelism and workload optimization.
Performance
- On a 44-core CPU, SLIDE trains the same architecture more than 3.5x faster (in wall-clock time) than TensorFlow running on a Tesla V100 GPU.
- On the same CPU hardware, SLIDE is over 10x faster than TensorFlow.
1 Introduction
SLIDE builds on the idea of adaptive sparsity or adaptive dropout: for every input, only a small, input-dependent subset of neurons needs to be activated and trained.
The challenge is to design a system that can effectively leverage this computational advantage while compensating for the hash-table overheads using limited (only a few cores) parallelism.
Our Contributions
This is possible because the parallelism in SLIDE is naturally asynchronous by design. The code and benchmark scripts are released for reproducibility.
- The LSH-based sparsification is designed so that its computational overhead is reduced to only a few memory lookups (truly O(1)).
- The implementation further takes advantage of the sparse gradient updates to achieve negligible update conflicts, which creates an ideal setting for asynchronous (HOGWILD-style) SGD.
- SLIDE is a memory-bound application. With careful workload and cache optimizations (e.g., Transparent Hugepages) and data access patterns that enable SIMD instructions, we further speed up SLIDE by roughly 1.3x.
2 LOCALITY SENSITIVE HASHING
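As a rough illustration of the LSH machinery, here is a minimal sketch of SimHash (signed random projections), one of the LSH families discussed in the paper: K random hyperplanes yield a K-bit code, and vectors with high cosine similarity are likely to land in the same bucket. The class and parameter names are illustrative, not the paper's implementation.

```python
import numpy as np

class SimHash:
    """Signed-random-projection LSH: K random hyperplanes give a K-bit code."""

    def __init__(self, dim, k, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane (Gaussian vector) per bit of the code.
        self.planes = rng.standard_normal((k, dim))

    def hash(self, x):
        # Bit i is 1 iff x falls on the positive side of hyperplane i, so
        # vectors with high cosine similarity tend to share the same code.
        bits = (self.planes @ x) >= 0
        code = 0
        for b in bits:
            code = (code << 1) | int(b)
        return code
```

In the classical LSH setup, L independent tables (each with its own K hash functions) are kept, so that a neuron whose weights have a large inner product with the query is retrieved from at least one table with high probability.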
3 PROPOSED SYSTEM: SLIDE
Initialization: every layer stores its neurons together with a set of LSH hash functions and hash tables; each neuron's weight vector is hashed, and the neuron id is inserted into the corresponding bucket of every table.
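A sketch of how a layer and its hash tables could be initialized, reusing the illustrative SimHash class above; LSHTableLayer, its fields, and the bucket layout are assumptions for exposition, not SLIDE's actual data structures.

```python
import numpy as np
from collections import defaultdict

class LSHTableLayer:
    """One fully connected layer plus L hash tables over its neurons (sketch)."""

    def __init__(self, in_dim, n_neurons, k=6, l=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_neurons, in_dim))  # neuron weight vectors
        self.b = np.zeros(n_neurons)
        # L independent tables, each with its own K-bit SimHash function.
        self.hashers = [SimHash(in_dim, k, seed=seed + t) for t in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]
        # Initialization: hash every neuron's weight vector and store its id
        # in the matching bucket of each table.
        for nid in range(n_neurons):
            for hasher, table in zip(self.hashers, self.tables):
                table[hasher.hash(self.W[nid])].append(nid)
```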
Sparse Feed-Forward Pass with Hash Table Sampling:
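For this step, a sketch of the sampled forward pass, continuing the illustrative LSHTableLayer above: the layer input is hashed with the same functions, the matching buckets are probed, and activations are computed only for the union of retrieved (active) neurons.

```python
def sparse_forward(layer, x):
    """Hash-table-sampled forward pass for one LSHTableLayer (sketch)."""
    active = set()
    for hasher, table in zip(layer.hashers, layer.tables):
        active.update(table.get(hasher.hash(x), ()))   # probe the matching bucket
    active = sorted(active)
    # Activations are computed only for the sampled (active) neurons;
    # every other neuron is treated as inactive (activation 0).
    z = layer.W[active] @ x + layer.b[active]
    return active, np.maximum(z, 0.0)                  # ReLU on the active set
```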
Sparse Backpropagation or Gradient Update:
Here we use the classical message-passing style of backpropagation rather than a vector-multiplication-based implementation, so that the backward pass takes full advantage of sparsity: gradients are propagated only between the active neurons of adjacent layers.
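A simplified sketch of this message-passing backward step for the same illustrative layer: only the sampled neurons update their weights and pass a gradient message to the previous layer. Treating the layer input as dense and using plain SGD are simplifications; in SLIDE the previous layer's activations are also sparse, so only active-input columns would be touched.

```python
def sparse_backward(layer, x, active, act, grad_act, lr=1e-3):
    """Message-passing backprop over the active neurons only (sketch).

    x        : layer input (dense here for simplicity)
    active   : ids of the neurons sampled in the forward pass
    act      : their post-ReLU activations
    grad_act : gradient of the loss w.r.t. those activations
    """
    # ReLU gradient: pass the message only where the neuron actually fired.
    grad_z = grad_act * (act > 0)
    # Message to the previous layer, computed before the weights change.
    grad_x = layer.W[active].T @ grad_z
    # Each active neuron updates only its own row of W and its own bias, so
    # updates from different training examples rarely touch the same entries.
    for gid, nid in enumerate(active):
        layer.W[nid] -= lr * grad_z[gid] * x
        layer.b[nid] -= lr * grad_z[gid]
    return grad_x
```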
Update Hash Tables after Weight Updates:
- Updating neurons typically involves deletion from the old bucket followed by an addition to the new bucket, which can be expensive.
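A sketch of that delete-then-add re-hashing for a single neuron, continuing the illustrative structures above (old_codes is assumed to cache the neuron's previous bucket codes):

```python
def rehash_neuron(layer, nid, old_codes):
    """Move neuron `nid` to its new buckets after a weight update (sketch)."""
    new_codes = []
    for t, (hasher, table) in enumerate(zip(layer.hashers, layer.tables)):
        new = hasher.hash(layer.W[nid])
        old = old_codes[t]
        if new != old:
            table[old].remove(nid)     # deletion from the old bucket
            table[new].append(nid)     # addition to the new bucket
        new_codes.append(new)
    return new_codes
```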
OpenMP Parallelization across a Batch:
- Each data instance in the batch runs in a separate thread and its gradients are computed in parallel.
The extreme sparsity and randomness in gradient updates allow us to asynchronously parallelize the accumulation step of the gradient across different training data without leading to a considerable amount of overlapping updates.
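SLIDE implements this with OpenMP threads in C++; the Python sketch below only illustrates the HOGWILD-style pattern described here: each training example runs in its own thread and writes its sparse gradient updates into the shared weights without locks, relying on the near-disjoint active sets to keep conflicts negligible. ThreadPoolExecutor and the helper names are illustrative, and Python's GIL means this sketch will not show the same parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def train_batch_async(layer, batch, n_threads=8, lr=1e-3):
    """HOGWILD-style asynchronous updates across one batch (illustrative sketch).

    batch: iterable of (x, grad_fn) pairs, where grad_fn(active, act) returns
           the loss gradient w.r.t. the active neurons' activations.
    """

    def process_example(example):
        x, grad_fn = example
        active, act = sparse_forward(layer, x)
        grad_act = grad_fn(active, act)
        # No locking: the sparse, mostly non-overlapping updates make
        # conflicting writes rare, as in asynchronous (HOGWILD) SGD.
        sparse_backward(layer, x, active, act, grad_act, lr)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(process_example, batch))
```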
4 REDUCING OVERHEAD
4.2 Updating Overhead
Dynamically change the update frequency of the hash tables to reduce the overhead:
- Gradient updates are larger in the initial stage of training than in the later stages, so the hash tables need frequent rebuilding early on and much less frequent rebuilding later.
- The gap between consecutive hash-table updates therefore grows exponentially as $N_0 e^{\lambda i}$, where $N_0$ is the initial update interval and $i$ is the number of updates performed so far.
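A small sketch of the schedule implied by the $N_0 e^{\lambda i}$ expression, under the assumption that the gap between the i-th and (i+1)-th hash-table rebuild is $N_0 e^{\lambda i}$ iterations (the parameter names follow the note, not the released code):

```python
import math

def rebuild_iterations(n0=50, lam=0.1, total_iters=100_000):
    """Iteration indices at which the hash tables are rebuilt (sketch)."""
    schedule, t, i = [], 0.0, 0
    while t < total_iters:
        t += n0 * math.exp(lam * i)   # gap between rebuild i and i+1 is N0 * e^(lambda * i)
        i += 1
        schedule.append(int(t))
    return schedule
```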
- Reservoir sampling is used as the insertion policy for the fixed-size hash-table buckets, so each bucket keeps a uniform random sample of the neurons hashed to it at O(1) insertion cost.
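A sketch of reservoir sampling (Algorithm R) used as a fixed-capacity bucket insertion policy: the n-th id ever hashed to a bucket is kept with probability capacity/n, overwriting a random slot, so insertion stays O(1) and the bucket holds a uniform sample. The Bucket class is an illustrative stand-in for SLIDE's bucket structure.

```python
import random

class Bucket:
    """Fixed-capacity hash-table bucket filled by reservoir sampling (sketch)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.ids = []        # neuron ids currently stored
        self.count = 0       # how many ids were ever hashed to this bucket

    def insert(self, neuron_id):
        self.count += 1
        if len(self.ids) < self.capacity:
            self.ids.append(neuron_id)
        else:
            # Keep the new id with probability capacity / count,
            # overwriting a uniformly random existing slot.
            j = random.randrange(self.count)
            if j < self.capacity:
                self.ids[j] = neuron_id
```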
6 CONCLUSION AND FUTURE WORK
Our system SLIDE is a combination of carefully tailored randomized hashing algorithms with the right data structures that allow asynchronous parallelism.
Our next steps are to extend SLIDE to include convolutional layers.
SLIDE has unique benefits when it comes to random memory accesses and parallelism. We anticipate that a distributed implementation of SLIDE would be very appealing because the communication costs are minimal due to sparse gradients.