Persistent RNNs: Stashing Recurrent Weights On-Chip

Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, Sanjeev Satheesh
Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2024-2033, 2016.

Abstract

This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
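For illustration only, the CUDA sketch below shows the core idea in miniature: launch one long-lived ("persistent") kernel, stage the recurrent weight matrix on-chip once, and reuse it across every timestep instead of re-reading it from off-chip memory each step. This is not the authors' implementation: the paper partitions the weights across the register files of all SMs and synchronizes the whole GPU between timesteps, whereas this sketch uses a single thread block, shared memory, a plain RNN cell, and a batch of one. The kernel name and the sizes H and T are hypothetical, chosen so the weights fit in one block's shared memory.

// Minimal single-block sketch of a "persistent" RNN kernel: the recurrent
// weight matrix is loaded into on-chip shared memory once and reused across
// all timesteps within a single kernel launch. (The paper's implementation
// instead stashes weights in the register file across all SMs and uses a
// GPU-wide barrier between timesteps; sizes here are illustrative.)

#include <cstdio>
#include <cuda_runtime.h>

constexpr int H = 96;   // hidden size (small enough for weights to fit in shared memory)
constexpr int T = 64;   // number of timesteps processed inside one kernel launch

__global__ void persistent_rnn(const float* __restrict__ W,   // H x H recurrent weights
                               const float* __restrict__ x,   // T x H pre-projected inputs
                               float* __restrict__ h)         // H hidden state (in/out)
{
    __shared__ float Ws[H][H];   // weights stay on-chip for the whole kernel
    __shared__ float hs[H];      // current hidden state

    // Stage weights and initial state on-chip once.
    for (int i = threadIdx.x; i < H * H; i += blockDim.x)
        Ws[i / H][i % H] = W[i];
    for (int i = threadIdx.x; i < H; i += blockDim.x)
        hs[i] = h[i];
    __syncthreads();

    // Iterate over timesteps inside the kernel, reusing Ws every step.
    const int row = threadIdx.x;          // one thread per output neuron
    for (int t = 0; t < T; ++t) {
        float acc = 0.f;
        for (int k = 0; k < H; ++k)
            acc += Ws[row][k] * hs[k];    // recurrent matrix-vector product
        acc = tanhf(acc + x[t * H + row]); // simple RNN nonlinearity
        __syncthreads();                   // all reads of the old state are done
        hs[row] = acc;
        __syncthreads();                   // new state visible to all threads
    }
    h[row] = hs[row];
}

int main() {
    float *W, *x, *h;
    cudaMallocManaged(&W, H * H * sizeof(float));
    cudaMallocManaged(&x, T * H * sizeof(float));
    cudaMallocManaged(&h, H * sizeof(float));
    for (int i = 0; i < H * H; ++i) W[i] = 0.01f;
    for (int i = 0; i < T * H; ++i) x[i] = 0.1f;
    for (int i = 0; i < H; ++i)     h[i] = 0.f;

    persistent_rnn<<<1, H>>>(W, x, h);   // one persistent block; weights loaded once
    cudaDeviceSynchronize();
    printf("h[0] after %d timesteps: %f\n", T, h[0]);
    return 0;
}

The loop structure is what makes small mini-batches efficient: the dominant cost of a conventional implementation, streaming the weights from off-chip memory once per timestep, is paid only once per kernel launch.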

Cite this Paper
BibTeX
@InProceedings{pmlr-v48-diamos16,
  title     = {Persistent RNNs: Stashing Recurrent Weights On-Chip},
  author    = {Diamos, Greg and Sengupta, Shubho and Catanzaro, Bryan and Chrzanowski, Mike and Coates, Adam and Elsen, Erich and Engel, Jesse and Hannun, Awni and Satheesh, Sanjeev},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages     = {2024--2033},
  year      = {2016},
  editor    = {Balcan, Maria Florina and Weinberger, Kilian Q.},
  volume    = {48},
  series    = {Proceedings of Machine Learning Research},
  address   = {New York, New York, USA},
  month     = {20--22 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v48/diamos16.pdf},
  url       = {https://proceedings.mlr.press/v48/diamos16.html},
  abstract  = {This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.}
}
Endnote
%0 Conference Paper
%T Persistent RNNs: Stashing Recurrent Weights On-Chip
%A Greg Diamos
%A Shubho Sengupta
%A Bryan Catanzaro
%A Mike Chrzanowski
%A Adam Coates
%A Erich Elsen
%A Jesse Engel
%A Awni Hannun
%A Sanjeev Satheesh
%B Proceedings of The 33rd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Maria Florina Balcan
%E Kilian Q. Weinberger
%F pmlr-v48-diamos16
%I PMLR
%P 2024--2033
%U https://proceedings.mlr.press/v48/diamos16.html
%V 48
%X This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
RIS
TY - CPAPER
TI - Persistent RNNs: Stashing Recurrent Weights On-Chip
AU - Greg Diamos
AU - Shubho Sengupta
AU - Bryan Catanzaro
AU - Mike Chrzanowski
AU - Adam Coates
AU - Erich Elsen
AU - Jesse Engel
AU - Awni Hannun
AU - Sanjeev Satheesh
BT - Proceedings of The 33rd International Conference on Machine Learning
DA - 2016/06/11
ED - Maria Florina Balcan
ED - Kilian Q. Weinberger
ID - pmlr-v48-diamos16
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 48
SP - 2024
EP - 2033
L1 - http://proceedings.mlr.press/v48/diamos16.pdf
UR - https://proceedings.mlr.press/v48/diamos16.html
AB - This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
ER -
APA
Diamos, G., Sengupta, S., Catanzaro, B., Chrzanowski, M., Coates, A., Elsen, E., Engel, J., Hannun, A. & Satheesh, S. (2016). Persistent RNNs: Stashing Recurrent Weights On-Chip. Proceedings of The 33rd International Conference on Machine Learning, in Proceedings of Machine Learning Research 48:2024-2033. Available from https://proceedings.mlr.press/v48/diamos16.html.