Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2048-2057, 2015.

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention-based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
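
The deterministic ("soft") variant mentioned above computes the context vector as an expectation over annotation vectors, which keeps the whole model differentiable. A minimal NumPy sketch of that step follows; the shapes, the random weights, and the linear scoring function are illustrative assumptions, not the authors' implementation (the paper's attention function is an MLP conditioned on the previous hidden state).

import numpy as np

# Minimal soft-attention sketch. Assumed setup: L annotation vectors
# a_1..a_L of size D from a CNN feature map, and a decoder hidden
# state h_{t-1} of size H. All weights are random placeholders.
rng = np.random.default_rng(0)
L, D, H = 196, 512, 256            # e.g. a 14x14 feature map, as in the paper

a = rng.standard_normal((L, D))    # annotation vectors
h = rng.standard_normal(H)         # previous LSTM hidden state

W_a = 0.01 * rng.standard_normal((D, 1))   # illustrative linear scorer
W_h = 0.01 * rng.standard_normal((H, 1))

# Unnormalized relevance score e_i for each image location i
# (the paper uses an MLP here; a linear map keeps the sketch short).
e = (a @ W_a + h @ W_h).squeeze(-1)        # shape (L,)

# Softmax over locations gives attention weights alpha_i summing to 1.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Soft attention: context vector z = sum_i alpha_i * a_i. Because this
# is a smooth function of the inputs, standard backpropagation applies.
z = alpha @ a                               # shape (D,)
print(z.shape, float(alpha.sum()))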
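
The stochastic ("hard") variant instead samples a single attention location per step, and the variational lower bound the abstract refers to can be sketched as follows; notation assumes annotation vectors \mathbf{a}, caption \mathbf{y}, and latent attention locations s, following the paper:

\[
L_s \;=\; \sum_{s} p(s \mid \mathbf{a}) \,\log p(\mathbf{y} \mid s, \mathbf{a})
\;\le\; \log \sum_{s} p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a})
\;=\; \log p(\mathbf{y} \mid \mathbf{a}),
\]

where the inequality is Jensen's. Maximizing L_s therefore pushes up the marginal log-likelihood, and its gradient can be estimated with Monte Carlo samples of s (a REINFORCE-style estimator), which is the stochastic training regime described above.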

Cite this Paper

BibTeX
@InProceedings{pmlr-v37-xuc15,
  title     = {Show, Attend and Tell: Neural Image Caption Generation with Visual Attention},
  author    = {Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron and Salakhutdinov, Ruslan and Zemel, Rich and Bengio, Yoshua},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
  pages     = {2048--2057},
  year      = {2015},
  editor    = {Bach, Francis and Blei, David},
  volume    = {37},
  series    = {Proceedings of Machine Learning Research},
  address   = {Lille, France},
  month     = {07--09 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v37/xuc15.pdf},
  url       = {https://proceedings.mlr.press/v37/xuc15.html},
  abstract  = {Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.}
}
Endnote
%0 Conference Paper
%T Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
%A Kelvin Xu
%A Jimmy Ba
%A Ryan Kiros
%A Kyunghyun Cho
%A Aaron Courville
%A Ruslan Salakhutdinov
%A Rich Zemel
%A Yoshua Bengio
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-xuc15
%I PMLR
%P 2048--2057
%U https://proceedings.mlr.press/v37/xuc15.html
%V 37
%X Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
RIS
TY - CPAPER
TI - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
AU - Kelvin Xu
AU - Jimmy Ba
AU - Ryan Kiros
AU - Kyunghyun Cho
AU - Aaron Courville
AU - Ruslan Salakhutdinov
AU - Rich Zemel
AU - Yoshua Bengio
BT - Proceedings of the 32nd International Conference on Machine Learning
DA - 2015/06/01
ED - Francis Bach
ED - David Blei
ID - pmlr-v37-xuc15
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 37
SP - 2048
EP - 2057
L1 - http://proceedings.mlr.press/v37/xuc15.pdf
UR - https://proceedings.mlr.press/v37/xuc15.html
AB - Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
ER -
APA
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:2048-2057. Available from https://proceedings.mlr.press/v37/xuc15.html.
