Deliberate Attention Networks for Image Captioning
In daily life, deliberation is a common behavior for human to improve or refine their work (e.g., writing, reading and drawing). To date, encoder-decoder framework with attention mechanisms has achieved great progress for image captioning. However, such framework is in essential an one-pass forward process while encoding to hidden states and attending to visual features, but lacks of the deliberation action. The learned hidden states and visual attention are directly used to predict the final captions without further polishing. In this paper, we present a novel Deliberate Residual Attention Network, namely DA, for image captioning. The first-pass residual-based attention layer prepares the hidden states and visual attention for generating a preliminary version of the captions, while the second-pass deliberate residual-based attention layer refines them. Since the second-pass is based on the rough global features captured by the hidden layer and visual attention in the first-pass, our DA has the potential to generate better sentences. We further equip our DA with discriminative loss and reinforcement learning to disambiguate image/caption pairs and reduce exposure bias. Our model improves the state-of-the-arts on the MSCOCO dataset and reaches 37.5% BELU-4, 28.5% METEOR and 125.6% CIDEr. It also outperforms the-state-ofthe-arts from 25.1% BLEU-4, 20.4% METEOR and 53.1% CIDEr to 29.4% BLEU-4, 23.0% METEOR and 66.6% on the Flickr30K dataset.