Dynamic Capsule Attention for Visual Question Answering
In visual question answering (VQA), recent advances have widely advocated the use of attention mechanisms to precisely link the question to the candidate answer regions. As question difficulty increases, more VQA models adopt multiple attention layers to capture deeper visual-linguistic correlations. A negative consequence, however, is an explosion of parameters, which makes the model vulnerable to over-fitting, especially when limited training examples are available. In this paper, we propose an extremely compact alternative to this static multi-layer architecture for accurate yet efficient attention modeling, termed Dynamic Capsule Attention (CapsAtt). Inspired by the recent work on Capsule Networks, CapsAtt treats visual features as capsules and obtains the attention output via dynamic routing, which updates the attention weights by computing coupling coefficients between the underlying and output capsules. Meanwhile, CapsAtt also discards redundant projection matrices to make the model much more compact. We evaluate CapsAtt on three benchmark VQA datasets, i.e., COCO-QA, VQA1.0 and VQA2.0. Compared to the traditional multi-layer attention model, CapsAtt achieves significant improvements of up to 4.1%, 5.2% and 2.2% on the three datasets, respectively. Moreover, with far fewer parameters, our approach also yields competitive results compared to the latest VQA models. To further verify the generalization ability of CapsAtt, we also deploy it on another challenging multi-modal task, image captioning, where state-of-the-art performance is achieved with a simple network structure.
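To make the routing-based attention concrete, the following is a minimal NumPy sketch of the idea described above: visual region features act as input capsules, and attention weights emerge as coupling coefficients refined over a few routing iterations. The multiplicative question fusion, the `squash` nonlinearity (borrowed from the original Capsule Network), and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Capsule squash nonlinearity: rescales the vector so its norm lies in [0, 1)."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def capsule_attention(visual_feats, question_vec, num_iters=3):
    """Attend over N visual capsules via dynamic routing (illustrative sketch).

    visual_feats: (N, d) array of region features, treated as input capsules.
    question_vec: (d,) question embedding, fused here by an elementwise
                  product with each region (a hypothetical fusion choice).
    Returns the attended output capsule (d,) and the coupling
    coefficients (N,), which play the role of attention weights.
    """
    u = visual_feats * question_vec      # (N, d): question-conditioned capsules
    b = np.zeros(u.shape[0])             # routing logits, initialized to zero
    for _ in range(num_iters):
        c = np.exp(b) / np.sum(np.exp(b))  # coupling coefficients via softmax
        s = c @ u                           # weighted sum of input capsules
        v = squash(s)                       # output capsule
        b = b + u @ v                       # increase logits where input agrees with output
    return v, c
```

Note that no learned projection matrices appear inside the routing loop, which reflects the parameter compactness the abstract emphasizes: the attention weights are updated purely by agreement between input and output capsules.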